SQL in QGIS Database Manager

Spatial SQL with Spatialite and QGIS

I’ve recently given a few presentations on the Ocean State Spatial Database, which is a basic geodatabase for Rhode Island that we’ve created in our lab. The database was designed so that new and experienced users alike could easily access a curated collection of foundational layers and data tables for thematic mapping and geospatial analysis. The database is available for download on GitHub, and there is documentation that describes the layers and tables that are included. The database comes in two formats: SQLite/ Spatialite that’s great for QGIS, and a File Geoadatabase version for ArcGIS Pro users.

One of the big advantages of using the Spatialite database in QGIS is that you can take advantage of the Database Manager, and write SQL and spatial SQL queries for selecting records and doing spatial analysis. Instead of using a series of point and click tools that create a bunch of new files, you can write a single block of code to perform an entire operation, and you can save that code to document your work. Access the Database Manager above the toolbars at the top of the QGIS interface. Once you’re in, you can select the Spatialite option, right click and then browse your file system to point to the database to establish a connection. At the top of the DB Manager is a button (piece of paper with wrench) to open a SQL query window.

Database Manager in QGIS with SQL Window Open

The following commands are basic SQL: SELECT some columns FROM some tables WHERE some criteria is met. This returns all rows and columns from the public libraries layer in the database:

SELECT *
FROM d_public_libraries;

This returns just some of the columns for all rows:

SELECT libid, libname, city, cnty
FROM d_public_libraries;

While this returns some of the columns and rows that meet specific criteria, in this case where libraries are located in Providence County, RI:

SELECT libid, libname, city, cnty, geom
FROM d_public_libraries
WHERE cnty='PROVIDENCE'
ORDER BY city;

Traditional database column types include strings (aka text), integers, and decimal numbers, which limit the values that can be stored in the column, and allow specific functions that can operate on values of that type (math on numeric columns, string operations on text columns). Beyond the basic data types, many databases have special ones, such as date types that allow you to store and manipulate dates and times as distinct objects.

Spatial databases incorporate special columns for storing the geometry of features as strings of coordinates, and provide functions that can operate on that geometry. In the example above, the values stored in the geometry column were returned in a binary format. But we can apply a spatial function called ST_AsText to display the geometry as readable text:

SELECT libid, libname, city, cnty, ST_AsText(geom) AS geom
FROM d_public_libraries
WHERE cnty='PROVIDENCE'
ORDER BY city;

We can see that this is point geometry (as opposed to lines or polygons), and we have an X and Y coordinate for each point. The layers in this database are in the Rhode Island State Plane System, so the coordinates that are returned are in that system. We can convert these to longitude and latitude using the ST_Transform function:

SELECT libid, libname, city, cnty, ST_AsText(ST_Transform(geom,4269)) AS geom
FROM d_public_libraries
WHERE cnty='PROVIDENCE'
ORDER BY city;

This illustrates that the functions can be nested, first we transform the geometry and then display the result of that function as text. The number in the transform function is the unique identifier of the spatial reference system that we wish to transform the geometry to. In the open source world these are EPSG codes, and 4269 is the identifier for NAD 83, the basic long / lat system for North America (alternatively, we could use 4326 for WGS 84, the standard global long / lat system). The geometry column in a spatial table is connected to a series of internal tables that store all the definitions of the spatial reference systems. You can view the spatial reference system table:

SELECT * from spatial_ref_sys;

You can also get a read out of all the spatial tables in the database which include their type of geometry and the spatial reference system (3438 is the EPSG code for the RI State Plane zone, geometry of type 6 is a multipolygon, while type 1 is a point):

SELECT * from geometry_columns;

With a spatial database, you perform operations within and between tables by running functions against the geometry columns. For example, to return all public libraries and schools that are within a mile of a library while measuring the distance:

SELECT pl.libid, pl.libname, s.name, s.grade_span, ST_Distance(pl.geom, s.geom) AS dist
FROM d_public_libraries pl, d_schools_pk12 s
WHERE PtDistWithin(pl.geom, s.geom, 5280)
ORDER BY dist;

The ST_Distance function returns the actual distance in a new column, while the PtDistWithin function only returns libraries that have a school within one mile (5,280 feet – we have to express the measurement in the units used by the spatial reference system of both layers). In the FROM statement we provide aliases after each table name, so we can use those as shorthand (if our statement includes multiple tables, we need to indicate which table each column comes from).

You can also do summaries, like you would in standard SQL using GROUP BY. To count the number of schools that are within a mile of every library:

SELECT pl.libid, pl.libname, CAST(COUNT (s.name) AS integer) AS school_count, pl.geom
FROM d_public_libraries pl, d_schools_pk12 s
WHERE PtDistWithin(pl.geom, s.geom, 5280)
GROUP BY pl.libid, pl.libname, pl.geom
ORDER BY school_count DESC;

The rule for GROUP BY is that every column in the select statement must be used as a grouping variable, or has an aggregate function applied to it (COUNT, SUM, MEAN, etc). In this example we added the CAST function, which defines the data type for new columns that you create. Unless we explicitly declare it as an integer or real (decimal), values are returned as strings.

You can save your statements as views, by adding CREATE VIEW [view name] AS followed by the statement. Views are saved statements that appear as objects in the database; by opening a view, the statement is rerun and the result is returned. This approach works if you want to save a non-spatial view, i.e. a table without geometry. To save a spatial one with geometry, omit the VIEW statement and hit the Create a view button below the SQL window (each record must have a unique identifier and the geometry column in order for this to work). That registers the geometry column of the view in the database. Then, you can return to the main QGIS window, add the view and symbolize it. Alternatively, there is a Load as new layer button at the bottom of the screen, which allows you to see a temporary result without saving anything (while you can see features and records returned, you won’t be able to symbolize or manipulate the layer).

Count schools within 1 mile of libraries, and save as a spatial view
Symbolize the spatial query out in the main QGIS window

One of the primary reasons to use a database is to join related data stored in separate tables. This statement has two joins: a tabular join between the census tracts and an ACS data table, and a spatial join between the geometry of public libraries and tracts:

SELECT pl.libid, pl.libname, a.geoidshort, a.name, c.hshd01_e, c.hshd01_m
FROM d_public_libraries pl, a_census_tracts a
INNER JOIN c_tracts_acs2021_socecon c
ON a.geoidlong=c.geoidlong
WHERE ST_Intersects(pl.geom, a.geom);

This returns all public libraries and their intersecting tracts based on the relationship between their two geometries (could also have done ST_Within in this case to get the same result). Spatialite supports most of the spatial relationship functions defined by the OGC. The estimated number of households for these tracts are returned based on the shared unique census identifier between the two census tract tables.

You can visit the following references for a full list of SQLite functions and Spatialite functions. As it’s designed to be “Lite”, SQLite contains a smaller subset of the SQL standard. Spatialite contains a pretty full range of OGC spatial SQL functions, but there are instances where it deviates from the standard. PostgreSQL / PostGIS provides a greater range of functions that adhere more closely to the standard; it also provides you with greater storage, efficiency, and processing power. As a file-based database, SQLite / Spatialite’s strengths are that it’s compact and transportable, and gives you the option to write SQL rather than relying solely on the point and click tools of a desktop GIS package.

In addition to the QGIS DB Manager, you could also use the Spatialite command line tools provided by the developer, and the Spatialite GUI (graphic user interface) that gives you a standard, stand-alone database interface. Downloading it is a bit confusing; Windows users can grab one of the binaries at the bottom of this page. If you’re a Linux person, search for it in your package manager. Mac users can get it via Homebrew.

US Stamp of Antarctic Treaty Map

Maps in Miniature: Geography on Postage Stamps

I carted several boxes of old stuff from my mom’s basement to mine about a year ago, since I finally had a basement of my own for storing boxes of old stuff. This stuff included a bin with my old stamp collection, one of many childhood hobbies. Leafing through it for the first time in decades, my interest was rekindled and I thought this would make for a relaxing, non-screen-based hobby that I could work on for an hour or so in the evenings. I’ve been transferring and re-organizing this collection over the past year.

Similar to other leisurely pursuits that I’ve written about (usually around this time each year), such as video games and hunting for USGS survey markers while hiking, this hobby has a strong connection to geography. Stamps express the geography, culture, and history of the world’s countries in miniature, and collecting them familiarizes you with different places, languages, and currencies. Postage stamps were introduced in the mid 19th century, so any collection will recount modern history, documenting the: aggregation of states into nations and empires, collapse of empires and states into smaller countries, emergence of colonies as independent states, shifting boundaries as countries fought and occupied one another, and coalescence of nations into larger alliances and supra-national bodies.

This natural connection between stamps and geography becomes even more literal when countries depict themselves on stamps through maps, landscapes, and places. In this post I’ll share examples from my collection that illustrate these themes. I recently finished teaching and consulting for S4’s two week GIS Institute; my favorite lecture is the one I give on cartography, a topic that I generally don’t get to cover in classes where I guest lecture. For that talk, I use a gallery of maps to illustrate different aspects of cartographic design, good and bad. I’ll take a similar approach here. While I’ve endeavored to select stamps from a cross-section of the world’s nations, my selection is a bit skewed by luck of the draw, in terms of stamps I happen to have that fit the theme, and as my collection is largely frozen in time. I stopped around 1992, right when the world’s map changed quite a bit at the abrupt end of the cold war.

Maps on Stamps

Reference maps are the most basic of maps, designed to show you where places are located. Many countries have issued stamps depicting their location, such as this 1980s stamp from New Zealand. In this case, latitude and longitude coordinates are used to help you identify precisely where the Kiwis are; a southerly spot at 42 degrees south latitude and 174 degrees east longitude.

A broader frame of reference can be used for putting a place in context, in order to locate it. This 1930s stamp of Argentina depicts its location in the southern cone of South America, with the Atlantic and Pacific Ocean labeled. We can clearly see that Argentina is the focus of the map based on the shading and title, and figure ground relationship (the distinction between a foreground and background on a flat surface) is established between the country, continent, and ocean so that Argentina is in the foreground (although the bold frame gives us the impression of looking at a painting). There is something else going on though…

Which is more apparent in this subsequent map from the 1950s. The Falklands Islands (a territory of the UK) are claimed by Argentina and shaded as part of the country in both maps, and in this later stamp so is a big chunk of Antarctica. Nations use maps, and stamps, to assert their authority and control over space, in a message that is affixed to envelopes and sent around the world. A counterpoint to this example is the stamp displayed in the header of this post, a 1971 US stamp commemorating the Antarctic Treaty (which essentially states that no nation can claim or own Antarctica).

This detailed reference map of Angola was issued as part of a series of stamps in the 1950s, when Angola was a Portuguese colony. The issuance of stamps and depiction of territories was one method for empires to assert authority over their colonies. Visually this is a busy map, as they squeezed in as many cities, roads, railroads, and rivers as they could (emphasizing the development of the colony). The white on grey contrast with a blue halo brings the country to the foreground, and if you look closely you’ll see latitude and longitude coordinates around the edges. Ultimately, too much material squeezed into this little map makes it hard to read.

As colonies gained independence throughout the post World War II period, the depiction of nations switched from being one of colonial authority and control to independence and national pride. It was common for many western states at this time to issue series of stamps that depicted a head of state, using the same design but with bold primary colors for different denominations. India put a unique spin on this practice by depicting their country instead, on a large series issued in the late 1950s (Gandhi had received a multicolored depiction in 1948, one year after independence). The map shows the physical geography of India, framed with a motif that evokes Indian design and culture.

Rwanda also celebrated their independence in a series of multicolored stamps in different denominations. Beyond national pride, these stamps also assert the authority of the new government in the new state (the new president stands in the foreground). Rwanda’s location in Africa is clearly illuminated, emphasized by a halo of white around a dark fill. The geometric frame evokes a distinct African aesthetic.

You can emphasize specific locations by modifying the extent and scale of a map. This 1960s stamp from Hungary emphasizes the location of its capital city Budapest as being in the center of Europe. The railroad traffic light and prominent label draw your eye right to the location, while simultaneously blotting out surrounding areas (of lesser importance). Clearly there is no better place for hosting the… Esperanto Congress.

This 1960s Chilean stamp celebrates the Alliance for Progress, a ten year plan launched by US President Kennedy to strengthen economic ties between the US and Latin America and to promote democracy (which meant, stop communism). The choice of a globe rather than a flat map better emphasizes the scope and reach of an initiative than spans vast distances and ties nations together. Not very well as it turned out – the initiative was considered a big failure.

Reference maps show us where things are, while thematic maps show us what’s going on there. As the name suggests, they illustrate a specific theme. This fun 1960s stamp from Poland illustrates the mix of architecture, settlements, and industry across the nation, for tourists (Mapa Turystyczna). Figure ground relationship is clearly established with a white foreground for the state and a red background (solid on land and hashed on water).

It’s one thing to make map stamps of your own country, but the implications are quite different when your neighbor is making map stamps of you. This map of Poland was part of a series of stamps the USSR issued of its Warsaw Pact neighbors, featuring their friendly and productive comrades, and all the wonderful resources their nations have to share… Also a way of broadcasting to the rest of the world the alliances between nations.

Maps are a visual means of communicating messages about places, sometimes through data, sometimes with symbols or images. These messages can be pretty overt, as in this 1980s stamp from Iran that expresses solidarity with the Afghan resistance to Soviet occupation. Angry, red clenched fists and bayonets – no friendly comrades here. The bayonets come from the direction in which the invaders came, and their downward thrust draws your eyes to the raised fists.

Messages can be more subtle, as in this East German stamp that proclaims the Baltic as the “Friendly Sea” (a better translation is the “Sea of Peace”). The muted blues evoke a nautical theme, and are also subdued and non-threatening, while the halo effect along the coast and variation in tone distinguishes land from water. The DDR began constructing the Berlin Wall one year before this stamp was issued, and it was not a Friendly Wall.

Maps for navigation are a distinct type of reference map, designed to help us get from point A to point B. This 1970s East German stamp was part of a series that depicted lighthouses over nautical charts, which display varying levels of ocean depth (by using shading and labeling depths at spot locations) and a selection of prominent features on land that can be spotted from ships. Useful for navigating the Friendly Sea no doubt.

This 1960s map from the US depicts the Mississippi River as the “River Road”. The broad white buffer around the river plants it within the foreground, while the orange arrow imparts a dynamic sense of north / south movement. The tributaries feed into this central trunk, giving us a sense of the breadth of the network. The extent of the map omits the east and west coasts, so we don’t see the overall context for where the river is situated. But it’s a good trade-off, as it focuses our attention squarely on the river system, leaving out the empty spaces that the network doesn’t reach. Still, it would have made sense to indicate that this is the Mississippi River somewhere on the stamp.

This map of Columbia depicts the Ferrocarril del Atlántico, a railroad line that connects the Atlantic coast to the capital of Bogota and whose construction was quite an achievement. The line is in red, and presumably the brown lines are connecting railroads. The overlapping labels (there are too many of them) make it difficult to read, and the thickness of the lines for railroads, rivers, and country boundaries are the same, making them hard to distinguish. There is no visual hierarchy for the labels to distinguish importance, as the font sizes are all the same, and the labels for the oceans and neighboring countries use the same color, a cartographic no-no. But there is a nice compass rose.

Air travel was often celebrated on mid 20th century stamps, sometimes in relation to air mail services. This compact map of Bolivia from 1945 depicts the seemingly comprehensive system of the national airline within the country, with a vector graph of points and lines, and with labels for the major nodes. The labels and LAB logo fill in and hide areas that don’t have much service, particularly the eastern part of the country. The traditional Mesoamerican motif in the frame is an interesting contrast to the modern subject matter of the map.

Landscapes and Places

Beyond maps, geography is also communicated on stamps through the depiction of landscapes. The gallery below includes a sample of stamps that illustrate both natural and built environments. The depictions of landscape can be literal, such as the photograph of Wakanoura Bay on the Japanese stamp, or artistic representations like the painting of Rural America and drawing of a Pakistani river valley, or more abstract views such as the stylized image of Swiss (Helvetia) farmland. Stamps can depict specific places, such as the Himalayas in India or the skyline of Singapore, or can be general representations of a landscape, such as an idyllic lakeside in Finland.

A birds-eye view offers a different perspective. The gliders are the focus of the Luxembourg stamp below, but we get to share the pilot’s view of the villages and countryside. Canada issued a series of stamps in the 1970s that depicted its diverse terrain, including this oblique aerial photo of farmland on the Canadian prairie. The US issued a series entitled Earthscapes in 2012, that celebrated both its landscapes and the technology used for capturing them as orthophotos and satellite images.

The depiction of places and administrative subdivisions on stamps is a common theme, particularly in nations that are federated states. Canada issued a series of stamps in 1981 that illustrate the evolution of the nation from individual settlements and colonies into provinces and territories that formed the Canadian Confederation. Each stamp represents a specific point in time.

The constituent states and territories of the United States are popular subjects on American stamps. These can be singular, commemorative stamps that celebrate the founding or statehood of a particular state, such as this 1991 stamp marking Vermont’s 200th anniversary of statehood. Or, they can be large series issued as sets that include one stamp for each state. State maps, landscapes, flags, and even official state birds and flowers have served as subjects over the years.

The French postal service published a large series that showcased its historic provinces, releasing stamps individually and in small sets from the 1940s to the 1960s. These stamps depicted the coat of arms for each place, connecting heraldry from the medieval past to the modern French Republic.

Wrap-up

I hope you enjoyed this little, and by no means exhaustive, tour of cartography and geography on postage stamps! There are countless other avenues we could have strolled down in our travels, such as the depiction of explorers and exploration, climate and weather, environmentalism, how map projections are employed, and views from space. My parting example is a reminder that maps are 2D representations of our spherical 3D world. And that “E ” is for “Earth”.

(In the 1980s and 90s the US Postal Service issued non-denominational stamps for domestic 1st class mail during transition periods when postal rates increased. They featured letters instead of currency values, and you could use them before and after a rate changed. A through D featured a stylized eagle from the postal service logo, but by the time they got to E they followed Sesame Street’s lead and depicted objects that began with each letter. The USPS dropped the letter convention in the 2000s, and in 2011 dropped denominations altogether in favor of Forever stamps.)

End of Year Reflections

I’ve missed my once-a-month goal for writing posts several times this year. This is partially for good reasons, as I’ve been busy supporting students and faculty with coursework and projects, and have been supervising the excellent work of my own students in the lab. We’ve made great progress, releasing a spatial database for Rhode Island mapping projects, writing new tutorials, inventorying thousands of USGS topo maps, and supporting hundreds of students and faculty with their geospatial and demographic research.

But in order to effectively support the work of others, academic librarians need to have a research agenda of their own; to keep up with evolving technology and scholarship to remain effective, and to sustain your own intellectual interests as a professional. Which brings us to the bad reasons behind my posting inactivity. My professional development has come to a screeching halt since I began my new position three years ago. My employer is adverse to supporting scholarly activities for professional librarians (although they gladly share credit if you do the work on evenings, weekends, and vacation time), and a heavy workload makes it impossible to find time for professional development. There are many reasons behind this for which I can’t go into detail – I’ll generally say that bad management and an over-sized library managerial caste are the primary culprits.

Unfortunately this is all too common in academic librarianship. Some high-profile articles have discussed this recently, surveys show that morale is low, and there’s a small but budding branch of scholarship that focuses on library dysfunction. It’s a shame, because both traditional “core-research” librarians and data services-oriented librarians play vital roles within higher ed, and there is no shortage of students and professors who remind me of this on a regular basis. In my opinion, while many students and professors understand and value the work of librarians, many library administrators do not. They dismiss traditional subject librarians as legacy service providers, and they completely do not understand the work of data librarians.

I’ve heard several depressing stories from colleagues at other schools who have been undermined, shuffled around, and in some cases put out of business by incompetent leadership within their library. Within GIS and data librarianship I know several folks who have given up, leaving higher ed for the private sector or independent consulting.

Towards the end of the semester, as I was finishing an hour-long GIS consultation with a grateful undergrad, he asked me what research projects I was currently working on, and what kind of research I do. I was embarrassed to admit that I haven’t been working on anything of my own. After having written a book and publishing several well-received reports, I’m doing nothing more than the intellectual equivalent of shoveling snow. I can’t help but think that I’ve taken a wrong turn, and as the new year begins it’s time to consider the options: focus more sharply on the positive aspects of my position while minimizing the negatives? And somehow, carve out time to do work that I’m interested in? Or, consider moving on, being mindful to avoid exchanging one set of bad circumstances with another? For the latter, this may mean leaving academic librarianship behind.

I am most fortunate in that I don’t have to return to work until the second week of January, and it’s good to have this time to recuperate and reflect. Best wishes to you in the coming new year – Frank

Comparing ACS Estimates Over Time: Are They Really Different?

I often get questions about comparing American Community Survey (ACS) estimates from the US Census Bureau over time. This process is more complicated than you’d think, as the ACS wasn’t designed as a time series dataset. The Census Bureau does publish comparative profile tables that compare two period estimates (in data.census.gov), but for a limited number of geographies (states, counties, metro areas).

For me, this question often takes the form of comparing change at the census tract-level for mapping and GIS projects. In this post, we’ll look at the primary considerations for comparing estimates over time, and I will walk through an example with spreadsheet formulas for calculating: change and percent change (estimates and margins of error), coefficients of variation, and tests for statistical difference. We’ll conclude with examples of mapping this data.

Primary considerations

  1. The ACS is published in 1-year and 5-year period estimates. 1-year estimates are only available for areas that have at least 65,000 people, which means if you’re looking at small geographies (census tracts, ZCTAs) or rural areas that have small populations (most counties, county subdivisions, places) you will need to use the 5-year series. When comparing 5-year estimates, you should only compare non-overlapping time periods. For example, you would not compare the 2021 ACS (2017-2021) with the 2020 ACS (2016-2020) as these estimates have four years of sample data in common. In contrast, 2021 and 2016 (2012-2016) could be compared as they do not overlap…
  2. …but, census geography changes over time. All statistical areas (block groups, tracts, ZCTAs, PUMAs, census designated-places, etc.) are updated every ten years with each decennial census. Areas can be re-numbered, aggregated, subdivided, or modified as populations change. This complicates comparisons; 2021 data uses geography created in 2020, while 2016 data uses geography from 2010. The only non-overlapping ACS periods with identical geographic areas would be 2014 (2010-2014) and 2019 (2015-2019). The only other alternative would be to use normalized census data, which involves additional work. While most legal areas (states, counties) can change at any time, they are generally more stable and you can make comparisons over a longer-period with modest adjustments.
  3. All ACS estimates are fuzzy, representing a midpoint within a possible range of values (indicated with a margin of error) at a 90% confidence level. Because of sampling variability, any difference that you see between one time period and the next could be noise and not actual change. If you’re working with small geographies or small population groups, you’ll encounter large margins of error and it will be difficult to measure actual change. In addition, it’s often difficult to detect change in any area that isn’t experiencing either substantive growth or decline.

ACS Formulas

Let’s look at an example where we’ll use formulas to: calculate change over time, measure the reliability of a difference estimate, and determine whether two estimates are significantly different. I downloaded table B25064 Median Gross Rent (dollars) from the 5-year 2014 (2010-2014) and 2019 (2015-2019) ACS for all census tracts in Providence County, RI, and stitched them together into one spreadsheet. In this post I’ve replaced the cell references with an abbreviated label that indicates what should be referenced (i.e. Est1_MOE is the margin of error for the first estimate). You can download a copy of the spreadsheet with these examples.

  1. To calculate the change / difference for an estimate, subtract one from the other.
  2. To calculate the margin of error for this difference, take the square root of the sum of the squares for each estimate’s margin of error (MOE):
=ROUND(SQRT((Est1_MOE^2)+(Est2_MOE^2)),0)
Spreadsheet with ACS formula to compute margin of error for change / difference
  1. To calculate percent change, divide the difference by the earliest estimate (Est1), and multiply by 100.
  2. To calculate the margin of error for the percent change, use the ACS formula for computing a ratio:
=ROUND(((SQRT(Est2_MOE^2+((Est2/Est1)^2*Est1_MOE^2)))/Est1)100,1)

Divide the 2nd estimate by the 1st and square it, multiply that by the square of the 1st estimate’s MOE, add that to the square of the 2nd estimate’s MOE. Take the square root of that result, then divide by the 1st estimate and multiply by 100. Note that this is formula for percent change is different from the one used for calculating a percent total (the latter uses the formula for a proportion; switch the plus symbol under the square root to a minus for percent totals).

Spreadsheet with ACS formula to compute margin of error for percent change / difference
  1. To characterize the overall accuracy of the new difference estimate, calculate its coefficient of variation (CV):
=ROUND(ABS((Est_MOE/1.645)/Est)*100,0)

Divide the MOE for the difference by 1.645, which is the Z-value for a 90% confidence interval. Divide that by the difference itself, and multiply by 100. Since we can have positive or negative change, we take the absolute value of the result.

Spreadsheet with ACS formula to compute coefficient of variation
  1. To convert the CV into the generally recognized reliability categories:
=IF(CV<=12,"high",IF(CV>=35,"low","medium"))

If the CV value is between 0 to 12, then it’s considered to be highly reliable, else if the CV value is greater than or equal to 35 it’s considered to be of low reliability, else it is considered to be of medium reliability (between 13 and 34). Note: this is a conservative range; search around and you’ll find more liberal examples that use 0-15, 16-40, 41+.

  1. To measure whether two estimates are significantly different from each other, use the statistical difference formula:
=ROUND(ABS((Est2-Est1)/(SQRT((Est1_MOE/1.645)^2+(Est2_MOE/1.645)^2))),3)

Divide the MOE for both the 1st and 2nd estimate by 1.645 (Z value for 90% confidence), take the sum of their squares, and then square root. Subtract the 1st estimate from the 2nd, and then divide. Again in this case, since we could have a positive or negative value we take the absolute value.

Spreadsheet with ACS formula to compute significant difference
  1. To create a boolean significant or not value:
=IF(SigDif>1.645,1,0)

If the significant difference value is greater than 1.645, then the two estimates are significantly different from each other (TRUE 1), implying that some actual change occurred. Otherwise, the estimates are not significantly different (FALSE 0), which means any difference is likely the result of variability in the sample, or any true difference is hidden by this variability.

ALWAYS CHECK YOUR WORK! It’s easy to put parentheses in the wrong place or transpose a cell reference. Take one or two examples and plug them into Cornell PAD’s ACS Calculator, or into Fairfax County VA’s ACS Tools (spreadsheets with formulas – bottom of page). The Census Bureau also provides a spreadsheet that lets you test multiple values for significant difference. Caveat: for the Cornell calculator use the ratio option instead of change when testing. For some reason its change formula never matches my results, but the Fairfax spreadsheets do. I’ve also checked my formulas against the Census Bureau’s ACS Handbooks, and they clearly say to use the ratio formula for percent change.

Interpreting Results

Let’s take a look at a few of the records to understand the results. In Census Tract 1.01, median gross rent increased from $958 (+/- 125) in 2014 to $1113 (+/- 73) in 2019, a change of $155 (+/- 145) and a percent change of 16.2% (+/- 17%). The CV for the change estimate was 57, indicating that this estimate has low reliability; the margin of error is almost equal to the estimate, and the change could have been as little as $10 or as great as $300! The rent estimates for 2014 and 2019 are statistically different but not by much (1.761, higher than 1.645). The margins of error for the two estimates do overlap slightly (with $1,083 being the highest possible value in 2014 and $1,040 the lowest possible value in 2019).

Spreadsheet comparing values for different census tracts

In Census Tract 4, rent increased from $863 (+/- 122) to $1003 (+/- 126), a change of $140 (+/- 175) and percent change of 16.2% (+/- 22%). The CV for the change estimate was 76, indicating very low reliability; indeed the MOE exceeds the value of the estimate. With a score of 1.313 the two estimates for 2014 / 2019 are not significantly different from each other, so any difference here is clouded by sample noise.

In Census Tract 9, rent increased from $875 (+/- 56) to $1083 (+/- 62), a change of $208 (+/- 84) or 23.8% (+/- 10.6%). Compared to the previous examples, these MOEs are much lower than the estimates, and the CV value for the difference is 25, indicating medium reliability. With a score of 4.095, these two estimates are significantly different from each other, indicating substantive change in rent in this tract. The highest possible value in 2014 was $931, and the lowest possible value in 2019 was $1021, so there is no overlap in the value ranges over time.

Mapping Significant Difference and CVs

I grabbed the Census Cartographic Boundary File for tracts for Rhode Island in 2019, and selected out just the tracts for Providence County. I made a copy of my worksheet where I saved the data as text and values in a separate sheet (removing the formulas and encoding the actual outputs), and joined this sheet to the shapefile using the AFFGEOID. The City of Providence and surrounding cities and suburban areas appear in the southeast corner of the county.

The map on the left displays simple percent change over time. In the map on the right, I applied a filter to select just tracts where change was significantly different (the non-significant tracts are symbolized with hash marks). In the screenshots, the count of the number of tracts in each class appears in brackets; I used natural breaks, then modified to place all negative values in the same class. Of the 141 tracts, only 49 had statistically different values. The first map is a gross misrepresentation, as change for most of the tracts can’t be distinguished from sampling variability.

Map of difference on left, significant difference on right
Percent Change in Median Gross Rent 2010-14 to 2015-19: Change on Left, Change Where Both Rent Estimates were Significantly Different on Right

A refined version of the map on the right appears below. In this one, I converted the tracts from polygons to points in a new layer, applied a filter to select significantly different tracts, and symbolized the points by their CV category. Of the 49 statistically different tracts, the actual estimate of change was of low reliability for 32 and medium reliability for the rest. So even if the difference is significant, the precision of most of these estimates is poor.

Providence County, Significant Difference in Median Rent Map
Percent Change in Median Gross Rent 2010-14 to 2015-19 with CV Values, for Tracts with Significantly Different Estimates, Providence County RI

Conclusion

Comparing change over time for ACS estimates is complex, time consuming, and yields many dubious results. What can you do? The size of the MOE relative to the estimate tends to decline as you look at either larger or more populous areas, or larger and fewer subcategories (i.e. 4 income brackets instead of 8). You could also look at two period estimates that are further apart, making it more likely that you’ll see changes; say 2005-2009 compared to 2016-2020. But – you’ll have to cope with normalizing the data. Places that are rapidly changing will exhibit more difference than places that aren’t. If you are studying basic demographics (age / sex / race / tenure) and not socio-economic indicators, use the decennial census instead, as that’s a count and not a sample survey. Ultimately, it’s important to address these issues, and be honest. There’s a lot of bad research where people ignore these considerations, and thus make faulty claims.

For more information, visit the Census Bureau’s page on Comparing ACS Data. Chapter 6 of my book Exploring the US Census covers the American Community Survey and has additional examples of these formulas. As luck would have it, it’s freely accessible as a preview chapter from my publisher, SAGE.

Final caveat: dollar values in the ACS are based on the release year of the period estimate, so 2010-2014 rent is in 2014 dollars, and 2015-2019 is in 2019 dollars. When comparing dollar values over time you should adjust for inflation; I skipped that here to keep the examples a bit simpler. Inflation in the 2010s was rather modest compared to the 2020s, but still could push tracts that had small changes in rent to none when accounted for.

2020 Census Demographic Profile

2020 Census Data Wrap-up

Right before the semester began, I updated the Rhode Island maps on my census research guide so that they link to the recently released Demographic Profile tables from the 2020 Census. I feel like the release of the 2020 census has flown lower on the radar compared to 2010 – it hasn’t made it into the news or social media feeds to the same degree. It has been released much later than usual for a variety of reasons, including the COVID pandemic and political upheaval and shenanigans. At this point in Sept 2023, most of what we can expect has been released, and is available via data.census.gov and the census APIs.

Here are the different series, and what they include.

  • Apportionment data. Released in Apr 2021. Just the total population counts for each state, used to reapportion seats in Congress.
  • Redistricting data. Released in Aug 2021. Also known as PL 91-171 (for the law that requires it), this data is intended for redrawing congressional and legislative districts. It includes just six tables, available for several geographies down to the block level. This was our first detailed glimpse of the count. The dataset contains population counts by race, Hispanic and Latino ethnicity, the 18 and over population, group quarters, and housing unit occupancy. Here are the six US-level tables.
  • Demographic and Housing Characteristics File. Released in May 2023. In the past, this series was called Summary File 1. It is the “primary” decennial census dataset that most people will use, and contains the full range of summary data tables for the 2020 census for practically all census geographies. There are fewer tables overall relative to the 2010 census, and fewer that provide a geographically granular level of detail (ostensibly due to privacy and cost concerns). The Data Table Guide is an Excel spreadsheet that lists every table and the variables they include.
  • Demographic Profile. Released in May 2023. This is a single table, DP1, that provides a broad cross-section of the variables included in the 2020 census. If you want a summary overview, this is the table you’ll consult. It’s an easily accessible option for folks who don’t want or need to compile data from several tables in the DHC. Here is the state-level table for all 50 states plus.
  • Detailed Demographic and Housing Characteristics File A. Released in Sept 2023. In the past, this series was called Summary File 2. It is a subset of the data collected in the DHC that includes more detailed cross-tabulations for race and ethnicity categories, down to the census tract level. It is primarily used by researchers who are specifically studying race, and the multiracial population.
  • Detailed Demographic and Housing Characteristics File B. Not released yet. This will be a subset of the data collected in the DHC that includes more detailed cross-tabulations on household relationships and tenure, down to the census tract level. Primarily of interest to researchers studying these characteristics.

There are a few aspects of the 2020 census data that vary from the past – I’ll link to some NPR stories that provide a good overview. Respondents were able to identify their race or ethnicity at a more granular level. In addition to checking the standard OMB race category boxes, respondents could write in additional details, which the Census Bureau standardized against a list of races, ethnicities, and national origins. This is particularly noteworthy for the Black and White populations, for whom this had not been an option in the recent past. It’s now easier to identify subgroups within these groups, such as Africans and Afro-Caribbeans within the Black population, and Middle Eastern and North Africans (MENA) within the White population. Another major change is that same-sex marriages and partnerships are now explicitly tabulated. In the past, same-sex marriages were all counted as unmarried partners, and instead of having clearly identifiable variables for same-sex partners, researchers had to impute this population from other variables.

Another major change was the implementation of the differential privacy mechanism, which is a complex statistical process to inject noise into the summary data to prevent someone from reverse engineering it to reveal information about individual people (in violation of laws to protect census respondent’s privacy). The social science community has been critical of the application of this procedure, and IPUMS has published research to study possible impacts. One big takeaway is that published block-level population data is less reliable than in the past (housing unit data on the other hand is not impacted, as it is not subjected to the mechanism).

When would you use decennial census data versus other census data? A few considerations – when you:

  • Want or need to work with actual counts rather than estimates
  • Only need basic demographic and housing characteristics
  • Need data that provides detailed cross-tabulations of race, which is not available elsewhere
  • Need a detailed breakdown of the group quarters population, which is not available elsewhere
  • Are explicitly working with voting and redistricting
  • Are making historical comparisons relative to previous 10-year censuses

In contrast, if you’re looking for detailed socio-economic characteristics of the population, you would need to look elsewhere as the decennial census does not collect this information. The annual American Community Survey or monthly Current Population Survey would be likely alternatives. If you need basic, annual population estimates or are studying the components of population change, the Population and Housing Unit Estimates Program is your best bet.

USGS Topographic Vector Layers

USGS Topo Map Vector Layers for GIS

I was working with a graduate student last month who was looking for contour lines for specific towns within the US, for large-scale (small area) mapping and analysis. They were specifically interested in elevation for landfills, and some of the contour data they found didn’t map these as they aren’t natural features. We looked at current USGS topographic maps, and they do indeed map contours for landfills. But the topo maps are raster images, and they wanted vectors. Is it possible to access the underlying GIS data that was used to create the topo maps?

Indeed, it is! Option 1 is to use the National Map Download app. Search for a place name to zoom into your area of interest. Use the Show Map Index dropdown menu to draw the quad boundaries for the topo scale you’re interested in on the map; the 7.5 minute / 1:24,000 series is the USGS topo scale that most people are familiar with. Adjust the zoom so your area of interest fits within the map window; that way when you search in the Datasets tab on the left, the default search looks within this map extent.

Next, choose the specific data product you’re interested in. Here’s a list and description of all the National Map Datasets. For example, if you just wanted contour lines, you can select that under Small-scale Datasets. Note that raster imagery and data that’s used to derive the vectors is also available for download. If you want all the vector features that appear on a particular topo map, check the Topo Map Data and Topo Stylesheet option. Once you check a product, you can choose a file format for the data. Given the size of these datasets, the FileGDB option is probably best.

USGS TNM Download
The National Map Download Interface, Showing the Datasets Tab for Selecting and Searching

Then, click the blue Search Products button. That flips you to the Products tab, and displays data available within the extent of the map view. If you chose Topo Map Data and Topo Stylesheet, the results will be maps of individual quads. You can add a bunch of maps to your shopping cart by clicking on the little cart icon, or download one immediately by clicking the Download Link (ZIP).

USGS Download Topo Map Vector Data
On the Product Tab, click Download Link (ZIP) to get data for a specific map

Option 2 for downloading data: skip the map interface and use the Stage Products Directory. This no frills option is good if you know exactly which products you’re looking for. For example, you can drill down through TopoMapVector, then by state, and then data format to get to the same files you would have downloaded via option 1. You would need to know the name of the quad that encompasses the area you want; consult an index to figure it out.

Once you download and unzip the file, you can launch your desktop GIS package to connect to the database and view the contents. In ArcGIS Pro, use the Catalog Pane, select the Databases option, right click, and Add Database. Browse to the location where you unzipped it, and select it. Then hit the dropdown for the newly added database and browse the contents, which are divided into schemas or groups. Foundation and Hydrography contain most of the features. GazVector has place name labels not captured in other features, and Cells contains outlines of the quad grid cells. Drag them into the Map Pane to view them.

USGS Topo Vector Data in ArcGIS Pro
USGS Topo Map Vector Data in ArcGIS Pro

QGIS users can use the Data Source Manager. With the Vector option selected, change the Source Type from File to Directory, and in the Type dropdown choose OpenFileGDB. Then hit the dots button to browse your file system and select the database folder. Click Add, and you’ll be prompted to choose layers and tables to add to a project. You’ll see the same schema organization described previously, and you can use the CTRL and / or Shift keys to select what you want. Add the Layers, hit OK, and close the Manager.

Adding File Geodatabase Features to QGIS
Adding File Geodatabase Features in the QGIS Data Source Manager

From there, it takes some artful manipulation of the overlays, color schemes, and labels to clearly symbolize the features. Both ArcGIS and QGIS have default symbol styles for topographic features that you can choose from. Apparently there’s a stylesheet packaged with the data, but I haven’t dug in enough yet to find and apply it. The attributes for the features seem fairly rich; the table includes columns that indicate the original data source for each feature, dates when records were added or updated, and a number of identifiers, labels, and categories. Some of the features, like bodies of water and county boundaries, extend beyond the quad cell for the map, as the USGS opted to keep whole features rather than clipping them. If the area you’re interested in happens to fall across two maps, you can download the topo map vector data for both quads, and use the Merge tool to combine them. The default CRS is un-projected NAD83 (EPSG 4269). You’ll probably want to reproject to a state plane or UTM zone that’s appropriate for your area. These post that describe styling and labeling contour lines in QGIS and ArcGIS Pro are helpful. Happy mapping!

USGS Topo Vector Data in QGIS
USGS Topo Map Vector Data in QGIS