census geography

atcoordinates YouTube Channel

Video Tutorials for Finding US Census Data

I have recently created an atcoordinates YouTube channel that features a series of how-to videos on finding and accessing US census data using a variety of websites and tools. I explain basic census concepts while demonstrating how to access data. At this point there are four videos:

  1. Exploring US Census Data: Basic Concepts. This is a narrated slide show where I cover the essential choices you need to make and concepts you need to understand in order to access census data, regardless of the tool or platform: data set, time period, subjects or topics, and geography. I discuss the decennial census, American Community Survey, and population estimates. This video is intended as a prerequisite for viewing the others, so I don’t have to explain the same concepts each time and can focus on demonstrating each particular application.
  2. American Community Survey Census Profiles with MCDC Apps. This screencast illustrates how you can quickly and easily access census profiles for any place in the US using the Missouri Census Data Center’s profile applications. It’s also a good introduction to census data in general, if you’re unfamiliar with the scope of data that’s available.
  3. Search Strategies for data.census.gov. I demonstrate how to use the Census Bureau’s primary application for accessing current census data, using the advanced search tool and filters.
  4. Using TIGERweb to Explore US Census Geography. I show you how to use this web map application for viewing census geography, while explaining what some of the small-area census geographies are.

I plan on adding additional videos every month or so. The pandemic lock down and uncertainty over whether classes will be back in session this fall inspired me to do this. While I prefer written tutorials, I find that I’ve been watching YouTube more often for learning how to do certain tasks with particular software, so I thought this would be useful for others. The videos average about 10 to 15 minutes in length, although the introductory one is a bit longer. The length is intentional; I wanted to explain the concepts and describe why you’re making certain choices, instead of simply pointing and clicking without any explanation.

Feel free to spread the news, share and embed the videos in research guides or web pages, and use them in classes or workshops. Of course, for a more in-depth look at US census data, check out my book: Exploring the US Census: Your Guide to America’s Data published by SAGE.

USPS mailbox

The Trouble with ZIP Codes: Solutions for Data Analysis and Mapping

Since the COVID-19 pandemic began, I’ve received several questions about finding census data and boundary files for ZIP Codes (aka US postal codes), as many states are publishing ZIP Code-level data for cases and deaths. ZIP Codes are commonly used for summarizing address data, as it’s easy to do and most Americans are familiar with them. However, there are a number of challenges associated with using ZIP Codes as a unit of analysis that most people are unaware of (until they start using them). In this post I’ll summarize these challenges and provide some solutions.

The short story is: you can get boundary files and census data from the decennial census and 5-year American Community Survey (ACS) for ZIP Code Tabulation Areas (ZCTAs, pronounced zicktas) which are approximations of ZIP Codes that have delivery areas. Use any census data provider to get ZCTA data: data.census.gov, Census Reporter, Missouri Census Data Center, NHGIS, or proprietary library databases like PolicyMap or the Social Explorer. The longer story: if you’re trying to associate ZIP Code-level data with census ZCTA boundary files or demographic data, there are caveats. I’ll cover the following issues in detail:

  1. ZIP Codes are actually not areas with defined boundaries, and there are no official USPS ZIP Code maps. Areas must be derived using address files. The Census Bureau has done this in creating ZIP Code Tabulation Areas (ZCTAs).
  2. The Census Bureau publishes population data by ZCTA and boundary files for them. But ZCTAs are not strictly analogous with ZIP Codes; there isn’t a ZCTA for every ZIP Code, and if you try to associate ZIP data with them some of your records won’t match. You need to crosswalk your ZIP Code data to the ZCTA-level to prevent this.
  3. ZCTAs do not nest or fit within any other census geographies, and the postal city name associated with a ZIP Code does not correlate with actual legal or municipal areas. This can make selecting and downloading ZIP Code data for a given area difficult.
  4. ZIP Codes were designed for delivering mail, not for studying populations. They vary tremendously in size, shape, and population.
  5. Analyzing data at either the ZIP Code or ZCTA level over time is difficult to impossible.
  6. ZIP Code and ZCTA numbers must be saved as text in data files, and not as numbers. Otherwise codes that have leading zeros get truncated, and the code becomes incorrect.

ZIP Codes versus ZCTAs and Boundaries

Contrary to popular belief, ZIP Codes are not areas and the US Postal Service does not delineate boundaries for them. They are simply numbers assigned to ranges of addresses along street segments, and the codes are associated with a specific post office. When we see ZIP Code boundaries (on Google Maps for example), these have been derived by creating areas where most addresses share the same ZIP Code.

The US Census Bureau creates areal approximations for ZIP Codes called ZIP Code Tabulation Areas or ZCTAs. The Bureau assigns census blocks to a ZIP number based on the ZIP that’s used by a majority of the addresses within each block, and aggregates blocks that share the same ZIP to form a ZCTA. After this initial assignment, they make some modifications to aggregate or eliminate orphaned blocks that share the same ZIP number but are not contiguous. ZCTAs are delineated once every ten years in conjunction with the decennial census, and data from the decennial census and the 5-year American Community Survey (ACS) are published at the ZCTA-level. You can download ZCTA boundaries from the TIGER / Line Shapefiles page, and there is also a generalized cartographic boundary file for them.

Crosswalking ZIP Code Data to ZCTAs

There isn’t a ZCTA for every ZIP Code. Some ZIP Codes represent large clusters of Post Office boxes or are assigned to large organizations that process lots of mail. As census blocks are aggregated into ZCTAs based on the predominate ZIP Code for addresses within the block, these non-areal ZIPs fall out of the equation and we’re left with ZCTAs that approximate ZIP Codes for delivery areas.

As a result, if you’re trying to match either your own summarized address data or sources that use ZIP Codes as the summary level (such as the Census Bureau’s Business Patterns and Economic Census datasets), some ZIP Codes will not have a matching ZCTA and will fall out of your dataset.

To prevent this from happening, you can aggregate your ZIP Code data to ZCTAs prior to joining it to boundary files or other datasets. The UDS Mapper project publishes a ZIP Code to ZCTA Crosswalk file that lists every ZIP Code and the ZCTA it is associated with. For the ZIP Codes that don’t have a corresponding area (the PO Box clusters and large organizations), these essentially represent points that fall within ZCTA polygons. Join your ZIP-level data to the ZIP Code ID in the crosswalk file, and then group or summarize the data using the ZCTA number in the crosswalk. Then you can match this ZCTA-summarized data to boundaries or census demographic data at the ZCTA-level.

ZIP Code to ZCTA Crosswalk

UDS ZIP Code to ZCTA Crosswalk. ZIP Code 99501 is an areal ZIP Code with a corresponding ZCTA number, 99501. ZIP Code 99520 is a post office or large volume customer that falls inside ZCTA 99501, and thus is assigned to that ZCTA.

Identifying ZIPs and ZCTAs within Other Areas

ZCTAs are built from census blocks and nest within the United States; they do not fit within any other geographies like cities and towns, counties, or even states. The boundaries of a ZCTA will often cross these other boundaries, so for example a ZCTA may fall within two or three different counties. This makes it challenging to select and download census data for all ZCTAs in a given area.

You can get lists of ZIP Codes for places, for example by using the MCDC’s ZIP Code Lookup. The problem is, the postal city that appears in addresses and is affiliated with a ZIP Code does not correspond with cities as actual legal entities, so you can’t count on the name to select all ZIPs within a specific place. For example, my hometown of Claymont, Delaware has its own ZIP Code, even though Claymont is not an incorporated city with formal, legal boundaries. Most of the ZIP Codes around Claymont are affiliated with Wilmington as a place, even though they largely cover suburbs outside the City of Wilmington; the four ZIP Codes that do cover the city cross the city boundary and include outside areas. In short, if you select all the ZIP Codes that have Wilmington, DE as their place name, they actually cover an area that’s much larger than the City of Wilmington. The Census Bureau does not associate ZCTAs with place names.

ZCTAs and Places in northern Delaware

Lack of correspondence between postal city names and actual city boundaries. Most ZCTAs with the prefix 198 are assigned to Wilmington as a place name, even though many are partially or fully outside the city.

So how can you determine which ZIP Codes fall within a certain area? Or how they do (or don’t) intersect with other areas? You can overlay and eyeball the areas in TIGERweb to get a quick idea. For something more detailed, here are three options:

  1. The Missouri Census Data Center’s Geocorr application lets you calculate overlap between a source geography and a target geography using either total population or land area for any census geographies. So in a given state, if you select ZCTAs as a source, and counties as the target, you’ll get a list that displays every ZCTA that falls wholly or partially within each county. An allocation factor indicates the percentage of the ZCTA (population or land) that’s inside and outside a county, and you can make decisions as to whether to include a given ZCTA in your study area or not. If a ZCTA falls wholly inside one county, there will be only one record with an allocation factor of 1. If it intersects more than one county, there will be a record with an allocation factor for each county.
  2. The US Department of Housing and Urban Development (HUD) publishes a series of ZIP Code crosswalk files that associates ZIP Codes with census tracts, counties, CBSAs (metropolitan areas), and congressional districts. They create these files by geocoding all addresses and calculating the ratio of residential, business, and other addresses that fall within each of these areas and that share the same ZIP Code. The files are updated quarterly. You can use them to select, assign, or apportion ZIP Codes to a given area. There’s a journal article that describes this resource in detail.
  3. Some websites allow you to select all ZCTAs that fall within a given geography when downloading data, essentially by selecting all ZCTAs that are fully or partially within the area. The Census Reporter allows you to do this: search for a profile for an area, click on a table of interest, and then subdivide the areas by smaller areas. You can even look at a map to see what’s been selected. data.census.gov currently does not provide this option; you have to select ZCTAs one by one (or if you’re using the census API, you’ll need to create a list of ZCTAs to retrieve).

MCDC Geocorr

Sample output from MCDC Geocorr. ZCTAs 08251 and 08260 fall completely within Cape May County, NJ. ZCTA 08270’s population is split between Cape May (92.4%) and Atlantic (7.6%) counties. The ZCTA names are actually postal place names; these ZCTAs cover areas that are larger than these places.

Do You Really Need to Use ZIP Codes?

ZIP Codes were an excellent mid-20th century solution for efficiently processing and delivering mail that continues to be useful for that purpose. They are less ideal for studying populations or other forms of human activity. They vary tremendously in size, shape, and population which makes them inconsistent as a unit of analysis. They have no legal or administrative meaning or function, other than delivering mail. While all American’s are familiar with them, they do not have any relevant social meaning. They don’t represent neighborhoods, and when you ask someone where they’re from, they won’t say “19703”.

So what are your other options?

  1. If you don’t have to use ZIP Code or ZCTA data for your project, don’t. For the United States as a whole, consider using counties, PUMAs, or metropolitan areas. Within states: counties, PUMAs, and county subdivisions. For smaller areas: municipalities, census tracts, or aggregates of census tracts.
  2. If you have the raw, address-based data, consider geocoding it. Once you geocode an address, you can use GIS to assign it to any type of geography that you have a boundary file for (spatial join), and then you can aggregate it to that geography. Some geocoders even provide geographies like counties or tracts in the match result. If your data is sensitive, strip all the attributes out except for the address and a serial integer to use as an ID, and after geocoding you can associate the results back to your original data using that ID. The Census Geocoder is free, requires no log in, allows you to do batches of 1,000 addresses at a time, and forces you to use these safety precautions. For bigger jobs, there’s an API.
  3. Sometimes you’ll have no choice and must use ZIP Code / ZCTA data, if what you’re interested in studying is only provided in that summary form, or if there are privacy concerns around geocoding the raw address data. You may want to modify the ZCTA geography for your area to aggregate smaller ZCTAs into larger ones surrounding them, for both visual display and statistical analysis. For example, in New York City there are several ZCTAs that cover only one city and census block, as they’re occupied by one large office building that processes a lot of mail (and thus have their own ZIP number). Also, unlike most census geographies, ZCTAs have large holes in them. Any area that does not have streets and thus no addresses isn’t included in a ZCTA. In urban areas, this means large parks and cemeteries. In rural areas, vast tracts of unpopulated forest, desert, or mountain terrain. And large bodies of water in every place.

Midtown ZCTAs

One-block ZCTAs in Midtown Manhattan, NYC that have either low or zero population.

Analyzing ZIP Code Data Over Time…

In short – forget it. The Census Bureau introduced ZCTAs in the year 2000, and in 2010 they modified their process for creating them. For a variety of reasons, they’re not strictly compatible. ACS data for ZCTAs wasn’t published until 2013. Even the economic datasets don’t go that far back; the ZIP Code Business Patterns didn’t appear until the early 1990s. Use areas that have more longevity and are relatively stable: counties, census tracts.

Why Do my ZIP Codes Look Wrong in Excel?

Regardless of whether you’re using a spreadsheet, database, or scripting language, always make sure to define ZIP / ZCTA columns as strings or text, and not as numeric types. ZIP Codes and ZCTAs begin with zeros in several states. Columns that contain ZIP / ZCTA codes must be saved as text to preserve the 5-digit code. If they’re saved as numbers, the leading zeros are dropped and the numbers are rendered incorrectly. This often happens if you’re working with data in a CSV file and you click on it to open it in Excel. In parsing the CSV, Excel assumes the ZIP / ZCTA field is a number and saves it as a number, which drops the zero and truncates the code. To prevent this from happening: open Excel to a blank project, go to the Data ribbon, click the button to import text data, choose delimited text on the import screen, choose the delimiter (comma or tab, etc), and when prompted you can select the ZIP / ZCTA column and designate it as text so that it imports properly.

Importing text files in Excel

To import CSV files in Excel, go to the Data ribbon and under Get External Data select From Text.

Conclusion

That’s all you ever (or maybe never) wanted to know about ZIP Codes and ZCTAs! For more information see the Census Bureau’s page about ZCTAs, a thorough write up by the Missouri Census Data Center, and these informative and fun blog posts from PolicyMap (complete with photos of Mr. ZIP). I wrote an article a few years back that demonstrates how to use some of these resources (the UDS mapper file, Geocorr) to process ZIP data with SQL and python. And of course, check out my book, Exploring the U.S. Census: Your Guide to America’s Data, to explore these concepts and resources in greater detail with hands-on exercises.

Percentage of Children in Households Without the Internet

Kids with No Internet at Home: Data Processing for US Census Mapping

In this post I’ll demonstrate some essential data processing steps prior to joining census American Community Survey (ACS) tables downloaded from data.census.gov to TIGER shapefiles, in order to create thematic maps. I thought this would be helpful for students in my university who are now doing GIS-related courses from home, due to COVID-19. I’ll illustrate the following with Excel and QGIS: choosing an appropriate boundary file for making your map, manipulating geographic id codes (GEOIDs) to insure you can match data file to shapefile, prepping your spreadsheet to insure that the join will work, and calculating new summaries and percent totals with ACS formulas. Much of this info is drawn from the chapters in my book that cover census geography (chapter 3), ACS data (chapter 6), and GIS (chapter 10). I’m assuming that you already have some basic spreadsheet, GIS, and US census knowledge.

For readers who are not interested in the technical details, you still may be interested in the map we’ll create in this example: how many children under 18 lack access to a computer with internet access at home? With COVID-19 there’s a sudden expectation that all school children will take classes remotely from home. There are 73.3 million children living in households in the US, and approximately 9.3 million (12.7%) either have no computer at home, or have a computer but no internet access. The remaining children have a computer with either broadband or dial-up at home. Click on the map below to explore the county distribution of the under 18 population who lack internet access at home, or follow this link: https://arcg.is/0TrGTy.

arcgis_webmap

Click on the Map to View Full Screen and Interact

Preliminaries

First, we need to get some ACS data. Read this earlier post to learn how to use data.census.gov (or for a shortcut download the files we’re using here). I downloaded ACS table B28005 Age by Presence of a Computer and Types of Internet Subscription in Household at the county-level. This is one of the detailed tables from the latest 5-year ACS from 2014-2018. Since many counties in the US have less than 65,000 people, we need to use the 5-year series (as opposed to the 1-year) to get data for all of them. The universe for this table is the population living in households; it does not include people living in group quarters (dormitories, barracks, penitentiaries, etc.).

Second, we need a boundary file of counties. You could go to the TIGER Line Shapefiles, which provides precise boundaries of every geographic area. Since we’re using this data to make a thematic map, I suggest using the Cartographic Boundary Files (CBF) instead, which are generalized versions of TIGER. Coastal water has been removed and boundaries have been smoothed to make the file smaller and less detailed. We don’t need all the detail if we’re making a national-scale map of the US that’s going on a small screen or an 8 1/2 by 11 piece of paper. I’m using the medium (5m) generalized county file for 2018. Download the files, put them together in a new folder on your computer, and unzip them.

TIGER Line shapefile

TIGER Line shapefile

CBF shapefile

CBF shapefile

GEOIDs

Downloads from data.census.gov include three csv files per table that contain: the actual data (data_with_overlays), metadata (list of variable ids and names), and a description of the table (table_title). There are some caveats when opening csv files with Excel, but they don’t apply to this example (see addendum to this post for details). Open your csv file in Excel, and save it as an Excel workbook (don’t keep it in a csv format).

The first column contains the GEOID, which is a code that uniquely identifies each piece of geography in the US. In my file, 0500000US28151 is the first record. The part before ‘US’ indicates the summary level of the data, i.e. what the geography is and where it falls in the census hierarchy. The 050 indicates this is a county. The part after the ‘US’ is the specific identifier for the geography, known as an ANSI / FIPS code: 28 is the state code for Mississippi, and 151 is the county code for Washington County, MS. You will need to use this code when joining this data to your shapefile, assuming that the shapefile has the same code. Will it?

That depends. There are two conventions for storing these codes; the full code 0500000US28151 can be used, or just the ANSI / FIPS portion, 28151. If your shapefile uses just the latter (find out by adding the shapefile in GIS and opening its attribute table), you won’t have anything to base the join on. The regular 2018 TIGER file uses just the ANSI / FIPS, but the 2018 CBF has both the full GEOID and the ANSI FIPS. So in this case we’re fine, but for the sake of argument if you needed to create the shorter code it’s easy to do using Excel’s RIGHT formula:

Excel formula: RIGHT

The formulas RIGHT, LEFT, and MID are used to return sub-strings of text

The formulas reads X characters from the right side of the value in the cell you reference and returns the result. You just have to count the number of characters up to the “S’ in the “US”. Copy and paste the formula all the way down the column. Then, select the entire column, right click and chose copy, select it again, right click and choose Paste Special and Values (in Excel, the little clipboard image with numbers on top of it). This overwrites all the formulas in the column with the actual result of the formula. You need to do this, as GIS can’t interpret your formulas. Put some labels in the two header spaces, like GEO_ID2 and id2.

Excel: Paste Special

Copy a column, and use Paste Special – Values on top of that column to overwrite formulas with values

Subsets and Headers

It’s common that you’ll download census tables that have more variables than you need for your intended purpose. In this example we’re interested in children (people under 18) living in households. We’re not going to use the other estimates for the population 18 to 64 and 65 and over. Delete all the columns you don’t need (if you ever needed them, you’ve got them saved in your csv as a backup).

Notice there are two header rows: one has a variable ID and the other has a label. In ACS tables the variables always come in pairs, where the first is the estimate and the second is the margin of error (MOE). For example, in Washington County, Mississippi there are 46,545 people living in households +/- 169. Columns are arranged and named to reflect how values nest: Estimate!!Total is the total number of people in households, Estimate!!Total!!Under 18 years is the number people under 18 living in households, which is a subset of the total estimate.

The rub here is that we’re not allowed to have two header rows when we join this table to our shapefile – we can only have one. We can’t keep the labels because they’re too long – once joined, the labels will be truncated to 10 characters and will be indistinguishable from each other. We’ll have to delete that row, leaving us with the cryptic variable IDs. We can choose to keep those IDs – remember we have a separate metadata csv file where we can look up the labels – or we can rename them. The latter is feasible if we don’t have too many. If you do rename them, you have to keep them short, no more than 10 characters or they’ll be truncated. You can’t use spaces (underscores are ok), any punctuation, and can’t begin variables names with a number. In this example I’m going to keep the variable IDs.

Two odd gotchas: first, find the District of Columbia in your worksheet and look at the MOE for total persons in households (variable 001M). There is a footnote for this value, five asterisks *****. Replace it with a zero. Keep an eye out for footnotes, as they wreak havoc. If you ever notice that a numeric column gets saved as text in GIS, it’s probably because there’s a footnote somewhere. Second, change the label for the county name from NAME to GEO_NAME (our shapefile already has a column called NAME, and it will cause problems if we have duplicates). If you save your workbook now, it’s ready to go if you want to map the data in it. But in this example we have some more work to do.

Create New ACS Values

We want to map the percentage of children that do not have access to either a computer or the internet at home. In this table these estimates are distinct for children with a computer and no internet (variable 006), and without a computer (variable 007). We’ll need to aggregate these two. For most thematic maps it doesn’t make sense to map whole counts or estimates; naturally places that have more people are going to have more computers. We need to normalize the data by calculating a percent total. We could do this work in the GIS package, but I think it’s easier to use the spreadsheet.

To calculate a new estimate for children with no internet access at home, we simply add the two values together (006_E and 007_E). To calculate a new margin of error, we take the square root of the sum of the squares for the MOEs that we’re combining (006_M and 007_M). We also use the ROUND formula so our result is a whole number. Pretty straightforward:

Excel Sum of Squares

When summing ACS estimates, take the square root of the sum of the squares for each MOE to calculate a MOE for the new estimate.

To calculate a percent total, divide our new estimate by the number of people under 18 in households (002_E). The formula for calculating a MOE for a percent total is tougher: square the percent total and the MOE for the under 18 population (002_M), multiply them, subtract that result from the MOE for the under 18 population with no internet, take the square root of that result and divide it by the under 18 population (002_E):

MOE for percentage

The formula for calculating the MOE for a proportion includes: the percentage, MOE for the subset population (numerator), and the estimate and MOE for the total population (denominator)

In Washington County, MS there are 3,626 +/- 724 children that have no internet access at home. This represents 29.4% +/- 5.9% of all children in the county who live in a household. It’s always a good idea to check your math: visit the ACS Calculator at Cornell’s Program for Applied Demographics and punch in some values to insure that your spreadsheet formulas are correct.

You should scan the results for errors. In this example, there is just one division by zero error for Kalawao County in Hawaii. In this case, replace the formula with 0 for both percentage values. In some cases it’s also possible that the MOE proportion formula will fail for certain values. Not a problem in our example, but if it does the solution is to modify the formula for the failed cases to calculate a ratio instead. Replace the percentage in the formula with the ratio (the total population divided by the subset population) AND change the minus sign under the square root to a plus sign.

Some of these MOE’s look quite high relative to the estimate. If you’d like to quantify this, you can calculate a coefficient of variation for the estimate (not the percentage). This formula is straightforward: divide the MOE by 1.645, divide that result by the estimate, and multiply by 100:

Calculate coefficient of variation

A CV can be used to gauge the reliability of an estimate

Generally speaking, a CV value between 0-15 indicates that as estimate is highly reliable, 12-34 is of medium reliability, and 35 and above is low reliability.

That’s it!. Make sure to copy the columns that have the formulas we created, and do a paste-special values over top of them to replace the formulas with the actual values. Some of the CV values have errors because of division by zero. Select the CV column and do a find and replace, to find #DIV/0! and replace it with nothing. Then save and close the workbook.

For more guidance on working with ACS formulas, take a look at this Census Bureau guidebook, or review Chapter 6 in my book.

Add Data to QGIS and Join

In QGIS, we select the Data Source Manager buttonQGIS Data Source Manager, and in the vector menu add the CBF shapefile. All census shapefiles are in the basic NAD83 system by default, which is not great for making a thematic map.  Go to the Vector Menu – Data Management Tools – Reproject Layer. Hit the little globe beside Target CRS. In the search box type ‘US National’, select the US National Atlas Equal Area option in the results, and hit OK. Lastly, we press the little ellipses button beside the Reprojected box, Save to File, and save the file in a good spot. Hit Run to create the file.

In the layers menu, we remove the original counties file, then select the new one (listed as Reprojected), right click, Set CRS, Set Project CRS From Layer. That resets our window to match the map projection of this layer. Now we have a projected counties layer that looks better for a thematic map. If we right click the layer and open its attribute table, we can see that there are two columns we could use for joining: AFFGEOID is the full census code, and GEOID is the shorter ANSI / FIPS.

Hit the Data Source Manager button again, stay under the vector menu, and browse to add the Excel spreadsheet. If our workbook had multiple sheets we’d be prompted to choose which one. Close the menu and we’ll see the table in the layers panel. Open it up to insure it looks ok.

To do a join, select the counties layer, right click, and choose properties. Go to the Joins tab. Hit the green plus symbol at the bottom. Choose the spreadsheet as the join layer, GEO_ID as the join field in the spreadsheet, and AFFGEOID as the target field in the counties file. Go down and check Custom Field Name, and delete what’s in the box. Hit OK, and OK again in the Join properties. Open the attribute table for the shapefile, scroll over and we should see the fields from the spreadsheet at the end (if you don’t, check and verify that you chose the correct IDs in the join menu).

QGIS Join Menu

QGIS Map

We’re ready to map. Right click the counties and go to the properties. Go to the Symbology tab and flip the dropdown from Single symbol to Graduated. This lets us choose a Column (percentage of children in households with no internet access) and create a thematic map. I’ve chosen Natural Breaks as the Mode and changed the colors to blues. You can artfully manipulate the legend to show the percentages as whole numbers by typing *100 in the Column box beside the column name, and adding a % at the end of the Legend format string. I also prefer to alter the default settings for boundary thickness: click the Change button beside Symbol, select Simple fill, and reduce the width of the boundaries from .26 to .06, and hit OK.

QGIS Symbology Menu

There we have a map! If you right click on the counties in the layers panel and check the Show Feature Count box, you’ll see how many counties fall in each category. Of course, to make a nice finished map with title, legend, and inset maps for AK, HI, and PR, you’d go into the Print Layout Manager. To incorporate information about uncertainty, you can add the county layer to your map a second time, and style it differently – maybe apply crosshatching for all counties that have a CV over 34. Don’t forget to save your project.

QGIS Map

Percentage of Children in Households without Internet Access by County 2014-2018

How About that Web Map?

I used my free ArcGIS Online account to create the web map at the top of the page. I followed all the steps I outlined here, and at the end exported the shapefile that had my data table joined to it out as a new shapefile; in doing so the data became fused to the new shapefile. I uploaded the shapefile to ArcGIS online, chose a base map, and re-applied the styling and classification for the county layer. The free account includes a legend editor and expression builder that allowed me to show my percentages as fractions of 100 and to modify the text of the entries. The free account does not allow you to do joins, so you have to do this prep work in desktop GIS. ArcGIS Online is pretty easy to learn if you’re already familiar with GIS. For a brief run through check out the tutorial Ryan and I wrote as part of my lab’s tutorial series.

Addendum – Excel and CSVs

While csv files can be opened in Excel with one click, csv files are NOT Excel files. Excel interprets the csv data (plain text values separated by commas, with records separated by line breaks) and parses it into rows and columns for us. Excel also makes assumptions about whether values represents text or numbers. In the case of ID codes like GEOIDs or ZIP Codes, Excel guesses wrong and stores these codes as numbers. If the IDs have leading zeros, the zeros are dropped and the codes become incorrect. If they’re incorrect, when you join them to a shapefile the join will fail. Since data.census.gov uses the longer GEOID this doesn’t happen, as the letters ‘US’ are embedded in the code, which forces Excel to recognize it as text. But if you ever deal with files that use the shorter ANSI / FIPS you’ll run into trouble.

Instead of clicking on csvs to open them in Excel: launch Excel to a blank workbook, go to the data ribbon and choose import text files, select your csv file from your folder system, indicate that it’s a delimited text file, and select your ID column and specify that it’s text. This will import the csv and save it correctly in Excel.

Census Book

Exploring the US Census Book Published!

My book, Exploring the US Census: Your Guide to America’s Data, has been published! You can purchase it directly from SAGE Publishing or from Barnes and Nobles, Amazon, or your bookstore of choice (it’s currently listed for pre-order on Amazon but its availability there is imminent). It’s $45 for the paperback, $36 for the ebook. Data for the exercises and supplemental material is available on the publisher’s website, and I’ve created a landing page for the book on this site.

Exploring the US Census is the definitive researcher’s guide to working with census data. I place the census within the context of: US society, the open data movement, and the big data universe, provide a crash course on using the new data.census.gov, and introduce the fundamental concepts of census geography and subject categories (aka universes). One chapter is devoted to each of the primary datasets: decennial census (with details about the 2020 census that’s just over the horizon), American Community Survey, Population Estimates Program, and business data from the Business Patterns, Economic Census, and BLS. Subsequent chapters demonstrate how to: integrate census data into writing and research, map census data in GIS, create derivative measures, and work with historic data and microdata with a focus on the Current Population Survey.

I wrote the book as a hybrid between a techie guidebook and an academic text. I provide hands-on exercises so that you learn by doing (techie) while supplying sufficient context so you can understand and evaluate why you’re doing it (academic). I demonstrate how to find and download data from several different sources, and how to work with the data using free and open source software: spreadsheets (LibreOffice Calc), SQL databases (DB Browser for SQLite), and GIS (QGIS). I point out the major caveats and pitfalls of working with the census, along with many helpful tools and resources.

The US census data ecosystem provides us with excellent statistics for describing, studying, and understanding our communities and our nation. It is a free and public domain resource that’s a vital piece of the country’s social, political, and economic infrastructure and a foundational element of American democracy. This book is your indispensable road map for navigating the census. Have a good trip!

See the series – census book tag for posts about the content of the book, additional material that expands on that content (but didn’t make it between the covers), and the writing process.

datacensusgov

Navigating the New data.census.gov

June 2019 is the final month that the Census Bureau will post new data in the American Factfinder (AFF). From this point forward, all new datasets will be published via the new data dissemination platform data.census.gov. The second chapter of my book (now available for pre-order!) is devoted to navigating this new interface. In this post I’ll provide a preview / brief tutorial of the advanced search functions.

The new interface is search-driven, so you can type the names of topics and geographies or table ID numbers to find and explore data tables. There are spiffy data profiles for several geographies, and you have the ability to make basic thematic maps. The search interface makes it much easier to casually browse and discover data, so go ahead and explore.

I’d still recommend having a search strategy to find precisely what you need. Keyword searching alone isn’t going to cut it, because you’re searching across tens of thousands of tables in dozens of datasets. The good news is that the same strategy I’ve used for the AFF can be applied to data.census.gov: use the advanced search to filter by survey, year, geography, and topic to narrow down the list of possible tables to a manageable number, and then search or browse through those results to find what you need.

Let’s say we want to download the most recent data on home values for all the counties in Pennsylvania (or a state of your choosing). On data.census.gov click on the advanced search link under the search box. On the advanced search page scroll to the bottom to the filters. We’ll address them one by one:

Surveys. These represent all the different census datasets. Select ACS 5-Year Estimates Detailed Tables. Detailed socio-economic characteristics of the population are primarily published in the ACS. The 1-Year estimates are published for all geographies that have at least 65k people. Since most states have rural counties that have less than this threshold, we’ll have to use the 5-year estimates to get all the counties. The detailed tables are narrow, focusing on estimates for a single variable. The other options include profiles (lots of different data for one place) and subject tables (narrower in scope than profiles, but broader than the detailed tables).

filter by survey

Years. At the moment 2017 is the latest year for the ACS, so let’s select that. This quickly eliminates a lot of tables that we’re not interested in.

Geography. Choose 050 – County, then scroll down and choose Pennsylvania in the County (State) list, then All counties in Pennsylvania in the final list.

filter by geography

Topics. For this example choose Housing, then Financial Characteristics, then Housing Value and Purchase Price. Of all the filter options, this one is the most opened-ended and may require some experimentation based on what you’re looking for.

filter by topic

Codes. We don’t need to filter by codes in this example, but if we were searching for labor or business-related data we’d use this filter to limit results to specific sectors or industries by NAICS codes.

Underneath the filter menu, click the View All Results button. This brings us to the first results page, which provides a list of tables, maps, and pages related to our search. Click the button to View All Tables under the tables section.

This brings us to the table results page; the list of tables is displayed on the left, and the currently selected table is displayed on the right; in this case Value of owner-occupied housing units is shown, with counts of units by value brackets. At this stage, we can scroll through the list and browse to find tables with data that we’re interested in. We can also access the filters at the top of the list, if we want to modify our search parameters.

table results

A little further down the results list is a table for Median Value. Selecting that table will preview it on the right. Hit the Customize Table button. This opens the table in its own dedicated view. Hit the blue drop down arrow to the right of the table name, and you can modify the geography, year, or time-period on the left. On the right is a Download option. Hit download and you’ll be prompted to download a CSV file. In the download you’ll get three text files that contain metadata, the data, and descriptive information about the download. Click Download and you can save it.

customize table

Back on the customize table page, you can navigate back to the table results by clicking on “Tables” in the breadcrumb links that appear in the top left-hand corner. Then you can browse and choose additional tables.

That’s it! Not bad, right? Well, there are always caveats. At the moment, the biggest one is that you can’t easily download most geographies that are contained within other geographies. With one click we can filter to select all counties within a state, or all states within the nation. But if we wanted all census tracts in a county or all county subdivisions in a state, there aren’t any “All geographies in…” options for these geographies. We’d have to select each and every tract within a county, one at a time…

While data.census.gov is now relatively stable, it’s still under development and additional features like this should (hopefully) be implemented as time passes between now and the 2020 census. This is one reason why the American Factfinder will survive for another year, as we’ll still need to lean on it to accomplish certain tasks. Of course, there are other options within the Census Bureau (the API, the FTP site) and without (NHGIS, MCDC, Census Reporter) for accessing data.

The new platform currently provides access to several datasets from the present back to the year 2010: the decennial census, the ACS, population estimates, and several of the business datasets. The first new datasets that will be published in data.census.gov (and NOT in the AFF) include the 2017 Economic Census this summer and the 1-year 2018 ACS in September.

View the Release Notes and FAQs for more details about the platform: general documentation, recent developments, bugs, and planned enhancements. The Census Bureau also has an archived webinar with slides that discuss the transition.

Calculate margin of error for ratio (mean income)

Calculating Mean Income for Groups of Geographies with Census ACS Data

When aggregating small census geographies to larger ones (census tracts to neighborhoods for example) when you’re working with American Community Survey (ACS) data, you need to sum estimates and calculate new margins of error. This is straightforward for most estimates; you simply sum them, and take the square root of the sum of squares for the margins of error (MOEs) for each estimate that you’re aggregating. But what if you need to group and summarize derived estimates like means or medians? In this post, I’ll demonstrate how to calculate mean household income by aggregating ZCTAs to United Hospital Fund neighborhoods (UHF), which is a type of public health area in NYC created by aggregating ZIP Codes.

I’m occasionally asked how to summarize median household income from tracts to neighborhood-like areas. You can’t simply add up the medians and divide them, the result would be completely erroneous. Calculating a new median requires us to sort individual household-level records and choose the middle-value, which we cannot do as those records are confidential and not public. There are a few statistical interpolation methods that we can use with interval data (number of households summarized by income brackets) to estimate a new median and MOE, but the calculations are rather complex. The State Data Center in California provides an excellent tutorial that demonstrates the process, and in my new book I’ll walk through these steps in the supplemental material.

While a mean isn’t as desirable as a median (as it can be skewed by outliers), it’s much easier to calculate. The ACS includes tables on aggregate income, including the sum of all income earned by households and other population group (like families or total population). If we sum aggregate household income and number of households for our small geographic areas, we can divide the total income by total households to get mean income for the larger area, and can use the ACS formula for computing the MOE for ratios to generate a new MOE for the mean value. The Census Bureau publishes all the ACS formulas in a detailed guidebook for data users, and I’ll cover many of them in the ACS chapter of my book (to be published by the end of 2019).

Calculating a Derived Mean in Excel

Let’s illustrate this with a simple example. I’ve gathered 5-year 2017 ACS data on number of households (table B11001) and aggregate household income (table B19025) by ZCTA, and constructed a sheet to correlate individual ZCTAs to the UHF neighborhoods they belong to. UHF 101 Kingsbridge-Riverdale in the Bronx is composed of just two ZCTAs, 10463 and 10471. We sum the households and aggregate income to get totals for the neighborhood. To calculate a new MOE, we take the square root of the sum of squares for each of the estimate’s MOEs:

Calculate margin of error for new sum

Calculate margin of error for new sum

To calculate mean income, we simply divide the total aggregate household income by total households. Calculating the MOE is more involved. We use the ACS formula for derived ratios, where aggregate income is the numerator of the ratio and households is the denominator. We multiply the square of the ratio (mean income) by the square of the MOE of the denominator (households MOE), add that product to the square of the MOE of the numerator (aggregate income MOE), take the square root, and divide the result by the denominator (households):

=(SQRT((moe_ratio_numerator^2)+(ratio^2*moe_ratio_denominator^2))/ratio_denominator)

Calculate margin of error for ratio (mean income)

Calculate margin of error for ratio (mean income)

The 2013-2017 mean household income for UHF 101 is $88,040, +/- $4,223. I always check my math using the Cornell Program on Applied Demographic’s ACS Calculator to make sure I didn’t make a mistake.

This is how it works in principle, but life is more complicated. When I downloaded this data I had number of households by ZCTA and aggregate household income by ZCTA in two different sheets, and the relationship between ZCTAs and UHFs in a third sheet. There are 42 UHF neighborhoods and 211 ZCTAs in the city, of which 182 are actually assigned to UHFs; the others have no household population. I won’t go into the difference between ZIP Codes and ZCTAs here, as it isn’t a problem in this particular example.

Tying them all together would require using the ZCTA in the third sheet in a VLOOKUP formula to carry over the data from the other two sheets. Then I’d have to aggregate the data to UHF using a pivot table. That would easily give me sum of households and aggregate income by UHF, but getting the MOEs would be trickier. I’d have to square them all first, take the sum of these squares when pivoting, and take the square root after the pivot to get the MOEs. Then I could go about calculating the means one neighborhood at a time.

Spreadsheet-wise there might be a better way of doing this, but I figured why do that when I can simply use a database? PostgreSQL to the rescue!

Calculating a Derived Mean in PostgreSQL

In PostgreSQL I created three empty tables for: households, aggregate income, and the ZCTA to UHF relational table, and used pgAdmin to import ZCTA-level data from CSVs into those tables (alternatively you could use SQLite instead of PostgreSQL, but you would need to have the optional math module installed as SQLite doesn’t have the capability to do square roots).

Portion of households table. A separate aggregate household income table is structured the same way, with income stored as bigint type.

Portion of households table. A separate aggregate household income table is structured the same way, with income stored as bigint type.

Portion of the ZCTA to UHF relational table.

Portion of the ZCTA to UHF relational table.

In my first run through I simply tried to join the tables together using the 5-digit ZCTA to get the sum of households and aggregate incomes. I SUM the values for both and use GROUP BY to do the aggregation to UHF. In PostgreSQL pipe-forward slash: |/ is the operator for square root. I sum the squares for each ZCTA MOE and take the root of the total to get the UHF MOEs. I omit ZCTAs that have zero households so they’re not factored into the formulas:

SELECT z.uhf42_code, z.uhf42_name, z.borough,
    SUM(h.households) AS hholds,
    ROUND(|/(SUM(h.households_me^2))) AS hholds_me,
    SUM(a.agg_hhold_income) AS agghholds_inc,
    ROUND(|/(SUM(a.agg_hhold_income_me^2))) AS agghholds_inc_me
FROM zcta_uhf42 z, hsholds h, agg_income a
WHERE z.zcta=h.gid2 AND z.zcta=a.gid2 AND h.households !=0
GROUP BY z.uhf42_code, z.uhf42_name, z.borough
ORDER BY uhf42_code;

Portion of query result, households and income aggregated from ZCTA to UHF district.

Portion of query result, households and income aggregated from ZCTA to UHF district.

Once that was working, I modified the statement to calculate mean income. Calculating the MOE for the mean looks pretty rough, but it’s simply because we have to repeat the calculation for the ratio over again within the formula. This could be avoided if we turned the above query into a temporary table, and then added two columns and populated them with the formulas in an UPDATE – SET statement. Instead I decided to do everything in one go, and just spent time fiddling around to make sure I got all the parentheses in the right place. Once I managed that, I added the ROUND function to each calculation:

SELECT z.uhf42_code, z.uhf42_name, z.borough,
    SUM(h.households) AS hholds,
    ROUND(|/(SUM(h.households_me^2))) AS hholds_me,
    SUM(a.agg_hhold_income) AS agghholds_inc,
    ROUND(|/(SUM(a.agg_hhold_income_me^2))) AS agghholds_inc_me,
    ROUND(SUM(a.agg_hhold_income) / SUM(h.households)) AS hhold_mean_income,
    ROUND((|/ (SUM(a.agg_hhold_income_me^2) + ((SUM(a.agg_hhold_income)/SUM(h.households))^2 * SUM(h.households_me^2)))) / SUM(h.households)) AS hhold_meaninc_me
FROM zcta_uhf42 z, hsholds h, agg_income a
WHERE z.zcta=h.gid2 AND z.zcta=a.gid2 AND h.households !=0
GROUP BY z.uhf42_code, z.uhf42_name, z.borough
ORDER BY uhf42_code;

Query in pgAdmin and portion of result for calculating mean household income

Query in pgAdmin and portion of result for calculating mean household income

I chose a couple examples where a UHF had only one ZCTA, and another that had two, and tested them in the Cornell ACS calculator to insure the results were correct. Once I got it right, I added:

CREATE VIEW household_sums AS

To the top of the statement and executed again to save it as a view. Mission accomplished! To make doubly sure that the values were correct, I connected my db to QGIS and joined this view to a UHF shapefile to visually verify that the results made sense (could also have imported the shapefile into the DB as a spatial table and incorporated it into the query).

Mean household income by UHF neighborhood in QGIS

Mean household income by UHF neighborhood in QGIS

Conclusion

While it would be preferable to have a median, calculating a new mean for an aggregated area is a fair alternative, if you simply need some summary value for the variable and don’t have the time to spend doing statistical interpolation. Besides income, the Census Bureau also publishes aggregate tables for other variables like: travel time to work, hours worked, number of vehicles, rooms, rent, home value, and various subsets of income (earnings, wages or salary, interest and dividends, social security, public assistance, etc) that makes it possible to calculate new means for aggregated areas. Just make sure you use the appropriate denominator, whether it’s total population, households, owner or renter occupied housing units, etc.