
Comparing ACS Estimates Over Time: Are They Really Different?

I often get questions about comparing American Community Survey (ACS) estimates from the US Census Bureau over time. This process is more complicated than you’d think, as the ACS wasn’t designed as a time series dataset. The Census Bureau does publish comparative profile tables that compare two period estimates (in data.census.gov), but for a limited number of geographies (states, counties, metro areas).

For me, this question often takes the form of comparing change at the census tract level for mapping and GIS projects. In this post, we'll look at the primary considerations for comparing estimates over time, and I will walk through an example with spreadsheet formulas for calculating: change and percent change (estimates and margins of error), coefficients of variation, and tests for statistical difference. We'll conclude with examples of mapping this data.

Primary considerations

  1. The ACS is published in 1-year and 5-year period estimates. 1-year estimates are only available for areas that have at least 65,000 people, which means if you're looking at small geographies (census tracts, ZCTAs) or rural areas that have small populations (most counties, county subdivisions, places) you will need to use the 5-year series. When comparing 5-year estimates, you should only compare non-overlapping time periods (a quick programmatic check is sketched after this list). For example, you would not compare the 2021 ACS (2017-2021) with the 2020 ACS (2016-2020), as these estimates have four years of sample data in common. In contrast, 2021 and 2016 (2012-2016) could be compared as they do not overlap…
  2. …but, census geography changes over time. All statistical areas (block groups, tracts, ZCTAs, PUMAs, census designated places, etc.) are updated every ten years with each decennial census. Areas can be renumbered, aggregated, subdivided, or modified as populations change. This complicates comparisons; 2021 data uses geography created in 2020, while 2016 data uses geography from 2010. The only non-overlapping ACS periods with identical geographic areas would be 2014 (2010-2014) and 2019 (2015-2019). The alternative would be to use normalized census data, which involves additional work. While most legal areas (states, counties) can change at any time, they are generally more stable, and you can make comparisons over a longer period with modest adjustments.
  3. All ACS estimates are fuzzy, representing a midpoint within a possible range of values (indicated with a margin of error) at a 90% confidence level. Because of sampling variability, any difference that you see between one time period and the next could be noise and not actual change. If you’re working with small geographies or small population groups, you’ll encounter large margins of error and it will be difficult to measure actual change. In addition, it’s often difficult to detect change in any area that isn’t experiencing either substantive growth or decline.
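Here is the quick overlap check mentioned in the first consideration, as a short Python sketch; 5-year periods are identified by their end year:

def periods_overlap(end_year_a, end_year_b, span=5):
    # Two span-year period estimates share sample years if their
    # end years are fewer than span years apart
    return abs(end_year_a - end_year_b) < span

print(periods_overlap(2021, 2020))  # True: 2017-2021 and 2016-2020 share four years
print(periods_overlap(2021, 2016))  # False: 2012-2016 ends before 2017-2021 begins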

ACS Formulas

Let's look at an example where we'll use formulas to: calculate change over time, measure the reliability of a difference estimate, and determine whether two estimates are significantly different. I downloaded table B25064 Median Gross Rent (dollars) from the 5-year 2014 (2010-2014) and 2019 (2015-2019) ACS for all census tracts in Providence County, RI, and stitched them together into one spreadsheet. In this post I've replaced the cell references with an abbreviated label that indicates what should be referenced (e.g. Est1_MOE is the margin of error for the first estimate). You can download a copy of the spreadsheet with these examples.

  1. To calculate the change / difference for an estimate, subtract one from the other.
  2. To calculate the margin of error for this difference, take the square root of the sum of the squares for each estimate’s margin of error (MOE):
=ROUND(SQRT((Est1_MOE^2)+(Est2_MOE^2)),0)
Spreadsheet with ACS formula to compute margin of error for change / difference
  3. To calculate percent change, divide the difference by the earliest estimate (Est1), and multiply by 100.
  4. To calculate the margin of error for the percent change, use the ACS formula for computing a ratio:
=ROUND(((SQRT(Est2_MOE^2+((Est2/Est1)^2*Est1_MOE^2)))/Est1)*100,1)

Divide the 2nd estimate by the 1st and square it, multiply that by the square of the 1st estimate's MOE, and add that to the square of the 2nd estimate's MOE. Take the square root of that result, then divide by the 1st estimate and multiply by 100. Note that this formula for percent change is different from the one used for calculating a percent of a total (the latter uses the formula for a proportion; switch the plus symbol under the square root to a minus for percent totals).

Spreadsheet with ACS formula to compute margin of error for percent change / difference
  5. To characterize the overall accuracy of the new difference estimate, calculate its coefficient of variation (CV):
=ROUND(ABS((Est_MOE/1.645)/Est)*100,0)

Divide the MOE for the difference by 1.645, which is the Z-value for a 90% confidence interval. Divide that by the difference itself, and multiply by 100. Since we can have positive or negative change, we take the absolute value of the result.

Spreadsheet with ACS formula to compute coefficient of variation
  6. To convert the CV into the generally recognized reliability categories:
=IF(CV<=12,"high",IF(CV>=35,"low","medium"))

If the CV value is between 0 and 12, the estimate is considered highly reliable; if the CV is 35 or greater, it's considered to be of low reliability; otherwise (13 to 34) it's of medium reliability. Note: this is a conservative range; search around and you'll find more liberal examples that use 0-15, 16-40, 41+.

  7. To measure whether two estimates are significantly different from each other, use the statistical difference formula:
=ROUND(ABS((Est2-Est1)/(SQRT((Est1_MOE/1.645)^2+(Est2_MOE/1.645)^2))),3)

Divide the MOEs for the 1st and 2nd estimates by 1.645 (the Z-value for 90% confidence), sum their squares, and take the square root. Subtract the 1st estimate from the 2nd, and divide that difference by the result. Again, since we could have a positive or negative value, we take the absolute value.

Spreadsheet with ACS formula to compute significant difference
  8. To create a boolean value indicating whether the difference is significant:
=IF(SigDif>1.645,1,0)

If the significant difference value is greater than 1.645, then the two estimates are significantly different from each other (TRUE 1), implying that some actual change occurred. Otherwise, the estimates are not significantly different (FALSE 0), which means any difference is likely the result of variability in the sample, or any true difference is hidden by this variability.
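If you'd rather script these calculations than wrangle spreadsheet formulas, all of the above can be expressed as a few short Python functions. This is a minimal sketch of the same formulas; the test values are Census Tract 1.01's rents from the next section, so you can verify the outputs against the discussion there.

import math

Z = 1.645  # Z-value for the ACS's 90% confidence level

def moe_of_difference(moe1, moe2):
    # MOE for a sum or difference: square root of the sum of squared MOEs
    return math.sqrt(moe1**2 + moe2**2)

def pct_change(est1, est2):
    return (est2 - est1) / est1 * 100

def moe_of_pct_change(est1, moe1, est2, moe2):
    # ACS ratio formula, where the ratio is est2 / est1
    return math.sqrt(moe2**2 + (est2 / est1)**2 * moe1**2) / est1 * 100

def cv(est, moe):
    # Coefficient of variation: the standard error as a percentage of the estimate
    return abs((moe / Z) / est) * 100

def reliability(cv_value):
    # The conservative reliability categories described above
    if cv_value <= 12:
        return "high"
    if cv_value >= 35:
        return "low"
    return "medium"

def sig_difference(est1, moe1, est2, moe2):
    # Z statistic for testing whether two estimates differ at 90% confidence
    return abs((est2 - est1) / math.sqrt((moe1 / Z)**2 + (moe2 / Z)**2))

# Census Tract 1.01: median gross rent, 2014 vs 2019
est1, moe1, est2, moe2 = 958, 125, 1113, 73
change = est2 - est1                                        # 155
change_moe = round(moe_of_difference(moe1, moe2))           # 145
print(round(pct_change(est1, est2), 1))                     # 16.2
print(round(moe_of_pct_change(est1, moe1, est2, moe2), 1))  # 17.0
print(round(cv(change, change_moe)), reliability(cv(change, change_moe)))  # 57 low
print(round(sig_difference(est1, moe1, est2, moe2), 3))     # 1.761 > 1.645: significant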

ALWAYS CHECK YOUR WORK! It’s easy to put parentheses in the wrong place or transpose a cell reference. Take one or two examples and plug them into Cornell PAD’s ACS Calculator, or into Fairfax County VA’s ACS Tools (spreadsheets with formulas – bottom of page). The Census Bureau also provides a spreadsheet that lets you test multiple values for significant difference. Caveat: for the Cornell calculator use the ratio option instead of change when testing. For some reason its change formula never matches my results, but the Fairfax spreadsheets do. I’ve also checked my formulas against the Census Bureau’s ACS Handbooks, and they clearly say to use the ratio formula for percent change.

Interpreting Results

Let's take a look at a few of the records to understand the results. In Census Tract 1.01, median gross rent increased from $958 (+/- 125) in 2014 to $1,113 (+/- 73) in 2019, a change of $155 (+/- 145) and a percent change of 16.2% (+/- 17%). The CV for the change estimate was 57, indicating that this estimate has low reliability; the margin of error is almost equal to the estimate, and the change could have been as little as $10 or as great as $300! The rent estimates for 2014 and 2019 are statistically different, but not by much (1.761, just above 1.645). The margins of error for the two estimates do overlap slightly (with $1,083 being the highest possible value in 2014 and $1,040 the lowest possible value in 2019).

Spreadsheet comparing values for different census tracts

In Census Tract 4, rent increased from $863 (+/- 122) to $1,003 (+/- 126), a change of $140 (+/- 175) and percent change of 16.2% (+/- 22%). The CV for the change estimate was 76, indicating very low reliability; indeed, the MOE exceeds the value of the estimate. With a score of 1.313, the two estimates for 2014 / 2019 are not significantly different from each other, so any difference here is clouded by sample noise.

In Census Tract 9, rent increased from $875 (+/- 56) to $1,083 (+/- 62), a change of $208 (+/- 84) or 23.8% (+/- 10.6%). Compared to the previous examples, these MOEs are much lower than the estimates, and the CV value for the difference is 25, indicating medium reliability. With a score of 4.095, these two estimates are significantly different from each other, indicating substantive change in rent in this tract. The highest possible value in 2014 was $931, and the lowest possible value in 2019 was $1,021, so there is no overlap in the value ranges over time.

Mapping Significant Difference and CVs

I grabbed the Census Cartographic Boundary File for tracts for Rhode Island in 2019, and selected out just the tracts for Providence County. I made a copy of my worksheet and saved the data as plain text and values in a separate sheet (removing the formulas and keeping the actual outputs), then joined this sheet to the shapefile using the AFFGEOID. The City of Providence and surrounding cities and suburban areas appear in the southeast corner of the county.
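If you prefer to script the join rather than point-and-click in a GIS package, a geopandas sketch might look like this; the file names are placeholders for the boundary file and the exported worksheet, and SigDif is the boolean column created with the formulas above.

import geopandas as gpd
import pandas as pd

# Cartographic boundary file for Rhode Island tracts (placeholder file name)
tracts = gpd.read_file("cb_2019_44_tract_500k.shp")
tracts = tracts[tracts["COUNTYFP"] == "007"]  # Providence County

# Worksheet exported as plain values, keyed on the AFFGEOID
rents = pd.read_csv("rent_change_values.csv", dtype={"AFFGEOID": str})
joined = tracts.merge(rents, on="AFFGEOID")

# Tracts where the two rent estimates are significantly different
significant = joined[joined["SigDif"] == 1]
print(len(joined), len(significant))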

The map on the left displays simple percent change over time. In the map on the right, I applied a filter to select just the tracts where change was significantly different (the non-significant tracts are symbolized with hash marks). In the screenshots, the count of the number of tracts in each class appears in brackets; I used natural breaks, then modified the classes to place all negative values in the same class. Of the 141 tracts, only 49 had statistically different values. The first map is a gross misrepresentation, as change for most of the tracts can't be distinguished from sampling variability.

Map of difference on left, significant difference on right
Percent Change in Median Gross Rent 2010-14 to 2015-19: Change on Left, Change Where Both Rent Estimates were Significantly Different on Right

A refined version of the map on the right appears below. In this one, I converted the tracts from polygons to points in a new layer, applied a filter to select significantly different tracts, and symbolized the points by their CV category. Of the 49 statistically different tracts, the actual estimate of change was of low reliability for 32 and medium reliability for the rest. So even if the difference is significant, the precision of most of these estimates is poor.

Providence County, Significant Difference in Median Rent Map
Percent Change in Median Gross Rent 2010-14 to 2015-19 with CV Values, for Tracts with Significantly Different Estimates, Providence County RI

Conclusion

Comparing change over time for ACS estimates is complex, time consuming, and yields many dubious results. What can you do? The size of the MOE relative to the estimate tends to decline as you look at larger or more populous areas, or at larger and fewer subcategories (e.g. 4 income brackets instead of 8). You could also look at two period estimates that are further apart, making it more likely that you'll see changes; say 2005-2009 compared to 2016-2020. But then you'll have to cope with normalizing the data. Places that are rapidly changing will exhibit more difference than places that aren't. If you are studying basic demographics (age / sex / race / tenure) and not socio-economic indicators, use the decennial census instead, as that's a count and not a sample survey. Ultimately, it's important to address these issues and be honest. There's a lot of bad research where people ignore these considerations, and thus make faulty claims.

For more information, visit the Census Bureau’s page on Comparing ACS Data. Chapter 6 of my book Exploring the US Census covers the American Community Survey and has additional examples of these formulas. As luck would have it, it’s freely accessible as a preview chapter from my publisher, SAGE.

Final caveat: dollar values in the ACS are based on the release year of the period estimate, so 2010-2014 rent is in 2014 dollars, and 2015-2019 is in 2019 dollars. When comparing dollar values over time you should adjust for inflation; I skipped that here to keep the examples a bit simpler. Inflation in the 2010s was rather modest compared to the 2020s, but adjusting for it could still reduce small rent changes in some tracts to no change at all.
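For reference, the adjustment itself is simple: multiply the earlier dollar values by the ratio of the CPI in the later year to the CPI in the earlier year. A minimal sketch, using approximate CPI-U annual averages (an assumption; verify the figures with the BLS):

# Approximate CPI-U annual averages; assumed values, check against BLS data
CPI_2014 = 236.7
CPI_2019 = 255.7

def to_2019_dollars(value_2014):
    # Scale 2014 dollars by the ratio of the two price levels
    return value_2014 * (CPI_2019 / CPI_2014)

# Tract 1.01's 2014 median rent of $958 is roughly $1,035 in 2019 dollars,
# so about half of its $155 nominal increase reflects inflation rather than real change
print(round(to_2019_dollars(958)))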

2020 Census Demographic Profile

2020 Census Data Wrap-up

Right before the semester began, I updated the Rhode Island maps on my census research guide so that they link to the recently released Demographic Profile tables from the 2020 Census. I feel like the release of the 2020 census has flown lower on the radar compared to 2010 – it hasn’t made it into the news or social media feeds to the same degree. It has been released much later than usual for a variety of reasons, including the COVID pandemic and political upheaval and shenanigans. At this point in Sept 2023, most of what we can expect has been released, and is available via data.census.gov and the census APIs.

Here are the different series, and what they include.

  • Apportionment data. Released in Apr 2021. Just the total population counts for each state, used to reapportion seats in Congress.
  • Redistricting data. Released in Aug 2021. Also known as P.L. 94-171 (for the law that requires it), this data is intended for redrawing congressional and legislative districts. It includes just six tables, available for several geographies down to the block level. This was our first detailed glimpse of the count. The dataset contains population counts by race, Hispanic and Latino ethnicity, the 18 and over population, group quarters, and housing unit occupancy. Here are the six US-level tables.
  • Demographic and Housing Characteristics File. Released in May 2023. In the past, this series was called Summary File 1. It is the “primary” decennial census dataset that most people will use, and contains the full range of summary data tables for the 2020 census for practically all census geographies. There are fewer tables overall relative to the 2010 census, and fewer that provide a geographically granular level of detail (ostensibly due to privacy and cost concerns). The Data Table Guide is an Excel spreadsheet that lists every table and the variables they include.
  • Demographic Profile. Released in May 2023. This is a single table, DP1, that provides a broad cross-section of the variables included in the 2020 census. If you want a summary overview, this is the table you'll consult. It's an easily accessible option for folks who don't want or need to compile data from several tables in the DHC. Here is the state-level table for all 50 states plus DC.
  • Detailed Demographic and Housing Characteristics File A. Released in Sept 2023. In the past, this series was called Summary File 2. It is a subset of the data collected in the DHC that includes more detailed cross-tabulations for race and ethnicity categories, down to the census tract level. It is primarily used by researchers who are specifically studying race and the multiracial population.
  • Detailed Demographic and Housing Characteristics File B. Not released yet. This will be a subset of the data collected in the DHC that includes more detailed cross-tabulations on household relationships and tenure, down to the census tract level. Primarily of interest to researchers studying these characteristics.

There are a few aspects of the 2020 census data that vary from the past – I'll link to some NPR stories that provide a good overview. Respondents were able to identify their race or ethnicity at a more granular level. In addition to checking the standard OMB race category boxes, respondents could write in additional details, which the Census Bureau standardized against a list of races, ethnicities, and national origins. This is particularly noteworthy for the Black and White populations, for whom this had not been an option in the recent past. It's now easier to identify subgroups within these groups, such as Africans and Afro-Caribbeans within the Black population, and people of Middle Eastern and North African (MENA) origin within the White population. Another major change is that same-sex marriages and partnerships are now explicitly tabulated. In the past, same-sex marriages were all counted as unmarried partners, and instead of having clearly identifiable variables for same-sex partners, researchers had to impute this population from other variables.

Another major change was the implementation of the differential privacy mechanism, which is a complex statistical process to inject noise into the summary data to prevent someone from reverse engineering it to reveal information about individual people (in violation of laws that protect census respondents' privacy). The social science community has been critical of the application of this procedure, and IPUMS has published research to study possible impacts. One big takeaway is that published block-level population data is less reliable than in the past (housing unit data, on the other hand, is not impacted, as it is not subjected to the mechanism).
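The Bureau's actual TopDown algorithm is far more elaborate (noise is injected across the geographic hierarchy and then post-processed so published tables remain consistent), but the core idea of noise injection can be illustrated with a toy Laplace mechanism; this is purely conceptual, not the Bureau's implementation:

import numpy as np

rng = np.random.default_rng(42)

def noisy_count(true_count, epsilon=0.5, sensitivity=1):
    # Laplace mechanism: noise scale = sensitivity / epsilon;
    # a smaller epsilon (privacy budget) means more privacy and more noise
    return true_count + rng.laplace(scale=sensitivity / epsilon)

true_block_pop = 37  # a hypothetical true block-level count
print(round(noisy_count(true_block_pop)))  # the published value differs from the truth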

When would you use decennial census data versus other census data? A few considerations – when you:

  • Want or need to work with actual counts rather than estimates
  • Only need basic demographic and housing characteristics
  • Need data that provides detailed cross-tabulations of race, which is not available elsewhere
  • Need a detailed breakdown of the group quarters population, which is not available elsewhere
  • Are explicitly working with voting and redistricting
  • Are making historical comparisons relative to previous 10-year censuses

In contrast, if you’re looking for detailed socio-economic characteristics of the population, you would need to look elsewhere as the decennial census does not collect this information. The annual American Community Survey or monthly Current Population Survey would be likely alternatives. If you need basic, annual population estimates or are studying the components of population change, the Population and Housing Unit Estimates Program is your best bet.

IPUMS CPS Table

Creating Geographic Estimates with the IPUMS CPS Online Data Analysis System

Introduction

In this post I’ll demonstrate how to use the IPUMS CPS Online Data Analysis System to generate summary data from the US Census Bureau’s Current Population Survey (CPS). The tool employs the Survey Documentation and Analysis system (SDA) created at UC Berkeley.

The CPS is a monthly stratified sample survey of 60k households. It includes a wide array of statistics, some captured routinely each month, others at various intervals (such as voter registration and participation, captured every November in even-numbered years). The same households are interviewed for four consecutive months, rotated out for eight months, then rotated back in for a final four months. Given its consistency, breadth, high response rate and accuracy (interviews are conducted in-person and over the phone), researchers use the CPS microdata (individual responses to surveys that have been de-identified) to study demographic and socio-economic trends among and between different population groups. It captures many of the same variables as the American Community Survey, but includes a fair number that the ACS does not.

I think the CPS is used less often by geographers, as the sample size is too small to produce reliable estimates below the state or metropolitan area levels. I find that students and researchers who are only familiar with working with summary data often don’t use it; generating your own estimates from microdata records can be time consuming.

The IPUMS project provides an online analyzer that lets you generate summary estimates from the CPS without having to handle the individual sample records. I’ve used it in undergraduate courses where students want to generate extracts of data at the regional or state level, and who are interested in variables not collected in the ACS, such as generational households for immigrants. The online analyzer doesn’t include the full CPS, but only the data that’s collected in March as part of the core CPS series, and the Annual Social & Economic Supplement (ASEC). It includes data from 1962 to the present.

To access any of the IPUMS tools, you must register and create an account, but it’s free and non-commercial. They provide an ample amount of documentation; I’ll give you the highlights for generating a basic geographic-based extract in this post.

Creating a Basic Geographic Summary Table

Once you launch the tool, the first thing you need to do is select some variables. You can use the drill-down folder menus at the bottom left, but I find it's easier to hit the Codebook button and peruse the alphabetical list. Let's say we want to generate state-level estimates for nativity for a recent year. If we go into the codebook and look up nativity, we see it captures foreign birthplace or parentage. Also in the list is a variable called statefip, the two-digit FIPS code that uniquely identifies every state.

Codebook for Nativity – Foreign Birthplace or Parentage

Back on the main page for the Analyzer in the tables tab, we provide the following inputs:

  1. Row represents our records or observations. We want a record for every state, so we enter the variable: statefip.
  2. Column represents our attributes or variables. In this example, it’s: nativity.
  3. Selection filter is used to specify that we want to generate estimates from a subset of all the responses. To generate estimates for the most recent year, we enter year as the variable and specify the filter value in parentheses: year(2020). If we didn’t specify a year, the program would use all the responses back to 1962 to generate the estimates.
  4. Weight is the value that’s used to weight the samples to generate the estimates. The supplemental person weight sdawt is what we’ll use, as nativity is measured for individual persons. There is a separate weight for household-level variables.
  5. Under the output option dropdown, we change the Percentaging option from column to row. Row calculates the percentage of the population in each nativity category within the state (row). The column option would provide the percentage of the population in each nativity category between the states. In this example, the row option makes more sense.
  6. For the confidence interval, check the box to display it, as this will help us gauge the precision of the estimate. The default level is 95%; I often change this to 90% as that’s what the American Community Survey uses.
  7. At the bottom of the screen, run the table to see the result.
CPS Online Analyzer – Generate Basic Extract for Nativity by State for a Single Year

In the result, the summary of your parameters appears at the top, with the table underneath. At the top of the table, the Cells contain legend lists what appears in each of the cells in order. In this example, the row percent is first, in bold. For the first cell in Alabama: 91.0% of the total population have parents who were both born in the US, the confidence interval is 90.2% to 91.8% (and we’re 90% confident that the true value falls within this range), and the large number at the bottom is the estimated number of cases that fall in this category. Since we filtered our observations for one year, this represents the total population in that category for that year (if we check the totals in the last column against published census data, they are roughly equivalent to the total population of each state).

Output Table – Nativity by State in 2020

Glancing through the table, we see that Alabama and Alaska have more cases where both parents are born in the US (91.0% and 85.1%) relative to Arizona (68.7%). Arizona has a higher percentage of cases where both parents are foreign born, or the persons themselves are foreign-born. The color coding indicates the Z value (see the legend at the bottom of the table), which measures how far a value deviates from the mean, with dark red being higher than the expected mean and dark blue being lower than expected. Not surprisingly, states with fewer immigrants have higher than average values for both parents native born (Alabama, Alaska, Arkansas), while this value is lower than average for more diverse states (Arizona, California).

To capture the table, you could highlight / copy and paste the screen from the website into a spreadsheet. Or, if you return to the previous tab, at the bottom of the screen before running the table, you can choose the option to export to CSV.
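For context, the analyzer is essentially doing weighted aggregation of the microdata for you. If you ever download an IPUMS CPS extract instead, a rough equivalent of the row-percentage table can be sketched with pandas. The file name is a placeholder, and the column names (YEAR, STATEFIP, NATIVITY, and ASECWT for the ASEC person weight) follow IPUMS conventions but should be verified against your extract's codebook.

import pandas as pd

# Hypothetical IPUMS CPS ASEC extract
df = pd.read_csv("cps_asec_extract.csv")
df = df[df["YEAR"] == 2020]  # the equivalent of the year(2020) filter

# Weighted person counts for each state x nativity cell
cells = (df.groupby(["STATEFIP", "NATIVITY"])["ASECWT"]
           .sum()
           .unstack(fill_value=0))

# Row percentages: each nativity category's share within its state
row_pct = cells.div(cells.sum(axis=1), axis=0) * 100
print(row_pct.round(1))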

Variations for Creating Detailed Crosstabs

To generate a table to show nativity for all races:

Input Parameters – Generate Tables for Nativity by State for each Race

In the control box, type the variable race. The control box will generate separate tables in the results for each category in the control variable. In this case, one nativity table per racial group.

To generate a table for nativity specifically for Asians:

Input Parameters – Generate Table for Nativity by State for Asians

Remove race from the control box, and add it in the filter box after the year. In parentheses, enter the race code for Asians; you can find this in the codebook for race: year(2020), race(651).

Now that we’re drilling down to smaller populations, the reliability of the estimates is declining. For example, in Arkansas the estimate for Asians for both parents foreign born is 32.4%, but the value could be as low as 22.2% or as high as 44.5%. The confidence interval for California is much narrower, as the Asian population is much larger there. How can we get a better estimate?

Output Table – Nativity by State for Asians in 2020

To generate a table for nativity for Asians over a five-year period:

Input Parameters – Generate Table for Nativity by State for Asians for 5-year Period

Add more years in the year filter, separated by commas. In this version, our confidence intervals are much narrower; notice that the estimate for Asians with both parents foreign born in Arkansas is now 19.2%, with a range of 14.1% to 25.6%. By increasing the sample pool with more years of data, we've increased the precision of the estimate at the cost of accepting a broader time period. One big caveat here: the Weighted N represents the total number of estimated cases, but since we are looking at five years of data instead of one, it no longer represents a total population value for a single year. If we wanted an average annual estimate for this 5-year time period (similar to what the ACS produces), we'd have to divide each of the estimates by five for a rough approximation. You can download the table and do these calculations in a spreadsheet or stats program.

Output Table – Nativity by State for Asians between 2016-2020 (weighted N = estimate of total cases over 5 years, not an average annual value)

You can also add control variables to a crosstab. For example, if you added sex as a control variable, you would generate separate female and male tables for nativity by state for the Asian population over a given time period.

Example of a Profile Table

If we want to generate a profile for a given place as opposed to a comparison table, we can swap the variables we have in the rows and columns. For example, to see nativity for all Hispanic subgroups within California for a single year:

Input Parameters – Generate a Profile Table for California of Nativity by Hispanic Groups in 2020

In this case, you could opt to calculate percentages by column instead of row, if you wanted to see percent totals across groups for the categories. You could show both in the same chart, but I find it’s difficult to read. In this last example, note the large confidence intervals and thus lower precision for smaller Hispanic groups in California (Cuban, Dominican) versus larger groups (Mexican, Salvadoran).

Output Table – Nativity by Hispanic Groups in California 2020 (confidence interval is much larger for smaller groups)

In short – this is a handy tool if you want to generate some quick estimates and crosstabs from the CPS without having to download and weight microdata records yourself. You can create geographic data for regions, divisions, states, and metro areas. Just be mindful of confidence intervals as you drill down to smaller subgroups; you can aggregate by year, geography, or category / group to get better precision. What I've demonstrated is the tip of the iceberg; read the documentation and experiment with creating charts, statistical summaries, and more.

BEA Population Change Map

Population and Economic Time Series Data from the BEA

In this post, I’ll demonstrate how to access and download multiple decades of annual population data for US states, counties, and metropolitan areas in a single table. Last semester, I was helping a student in a GIS course find data on tuberculosis cases by state and metro area that stretched back several decades. In order to make meaningful comparisons, we needed to calculate rates, which meant finding an annual time series for total population. Using data directly from the Census Bureau is tough going, as they don’t focus on time series and you’d have to stitch together several decades of population estimates. Metropolitan areas add more complexity, as they are modified at least a few times each decade.

In searching for alternatives, I landed at the Bureau of Economic Analysis (BEA). Since part of their charge is studying the economy over time, they gather data from the Census Bureau, Bureau of Labor Statistics, and others to build time series, and they update them as geography and industrial classification schemes change. They publish national, state, metropolitan area, and county-level GDP, employment, income, wage, and population tables that span multiple decades. Their economic profile table for metros and counties covers 1969 to present, while the state profile table goes back to 1958. Metropolitan areas are aggregates of counties. As metro boundaries change, the BEA normalizes the data, adjusting the series by taking older county-level data and molding it to fit the most recent metro definitions.

Finding the population data was a bit tricky, as it is embedded as one variable in the Economic Profile table (identified as CAINC30) that includes multiple indicators. Here’s the path to get it:

  • From the BEA website, choose Tools – Interactive Data.
  • From the options on the next page, choose Regional from the National, Industry, International or Regional data options. There’s also a link to a video that illustrates how to use the BEA interactive data tool.
  • From the Regional Data page, click “Begin using the data” (but note you can alternatively “Begin mapping the data” to make some basic web maps too, like the one in the header of this post).
  • On the next page are categories, and under each category are data tables for specific series. In this case, Personal Income and Employment by County and Metropolitan Area was what I wanted, and under that the Economic Profile CAINC30 table (states appear under a different heading, but there’s a comparable Economic Profile table for them, SAINC30).
  • On the multi-screen table builder, you choose a type of geography (county or different metro area options), and on the next tab you can choose individual places, hold down CTRL and select several, or grab them all with the option at the top of the dropdown. Likewise for the Statistic, choose Population (persons), or grab a selection, or take all the stats. Under the Unit of Measure dropdown, Levels gives you the actual statistic, but you can also choose percent change, index values, and more. On the next tab, choose your years.
  • On the final page, if your selection is small enough you can preview the result and then download. In this case it’s too large, so you’re prompted to grab an Excel or CSV file to download.

And there you have it! One table with 50+ years of annual population data, using current metro boundaries. The footnotes at the bottom of the file indicate that the latest years of population data are based on the most recent vintage estimates from the Census Bureau’s population estimates. Once the final intercensal estimates for the 2010s are released, the data for that decade will be replaced a final time, and the estimates from the 2020s will be updated annually as each new vintage is released until we pass the 2030 census. Their documentation is pretty thorough.

BEA 5-decade Time Series of Population Data by Metro Area

The Interactive Data table approach allows you to assemble your series step by step. If you want to skip all the clicking, you can grab everything in one download (all years for all places for all stats in a given table). Of course, that means some filtering and cleaning post-download to isolate what you need. There's also an API, and several other data series and access options to explore. The steps for creating the map that appears at the top of this post were similar to the steps for building the table.
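As a starting point for the API route, a GetData request for the CAINC30 population series might look like the sketch below. The parameter names follow the BEA API's Regional dataset; treat the population LineCode (100 here) and the "MSA" GeoFips shortcut as assumptions to verify with the API's GetParameterValues method and user guide.

import requests

API_KEY = "YOUR_BEA_KEY"  # free key, available by registering on the BEA website

params = {
    "UserID": API_KEY,
    "method": "GetData",
    "datasetname": "Regional",
    "TableName": "CAINC30",  # the county / metro Economic Profile table
    "LineCode": 100,         # assumed line code for Population (persons)
    "GeoFips": "MSA",        # assumed shortcut for all metropolitan areas
    "Year": "ALL",
    "ResultFormat": "JSON",
}

resp = requests.get("https://apps.bea.gov/api/data/", params=params)
rows = resp.json()["BEAAPI"]["Results"]["Data"]
for row in rows[:5]:
    print(row["GeoName"], row["TimePeriod"], row["DataValue"])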

US Census Data ALA Tech Report

ALA Tech Report on Using Census Data for Research

I have written a new report that's just been released: US Census Data: Concepts and Applications for Supporting Research, published as the May / June 2022 issue of the American Library Association's Library Technology Reports. It's available for purchase digitally or in hard copy from the ALA from now through next year. It will also be available via EBSCOhost as full text sometime this month. One year from now, the online version will transition to become a free and open publication available via the tech report archives.

The report was designed to be a concise primer (about 30 pages) for librarians who want to be adept at assisting researchers and students with finding, accessing, and using public summary census data, or who want to apply it to their own work as administrators or LIS researchers. But I also wrote it in such a way that it's relevant for anyone who is interested in learning more about the census. In some respects it's a good distillation of my "greatest hits", drawing on work from my book, technical census-related blog posts, and earlier research that used census data to study the distribution of public libraries in the United States.

Chapter Outline

  1. Introduction
  2. Roles of the Census: in American society, the open data landscape, and library settings
  3. Census Concepts: geography, subject categories, tables and universes
  4. Datasets: decennial census, American Community Survey, Population Estimates, Business Establishments
  5. Accessing Data: data.census.gov, API with python, reports and data summaries
  6. GIS, historical research, and microdata: covers these topics plus the Current Population Survey
  7. The Census in Library Applications: overview of the LIS literature on site selection analysis and studying library access and user populations

I’m pleased with how it turned out, and in particular I hope that it will be used by MLIS students in data services and government information courses.

Although… I must express my displeasure with the ALA. The editorial team for the Library Technology Reports was solid. But once I finished the final reviews of the copy edits, I was put on the spot to write a short article for the American Libraries magazine, primarily to promote the report. This was not part of the contract, and I was given little direction and a month at a busy time of the school year to turn it around. I submitted a draft and never heard about it again – until I saw it in the magazine last week. They cut and revised it to focus on a narrow aspect of the census that was not the original premise, and they introduced errors to boot! As a writer I have never had an experience where I haven’t been given the opportunity to review revisions. It’s thoroughly unprofessional, and makes it difficult to defend the traditional editorial process as somehow being more accurate or thorough compared to the web posting and tweeting masses. They were apologetic, and are posting corrections. I was reluctant to contribute to the magazine to begin with, as I have a low opinion of it and think it’s deteriorated in recent years, but that’s a topic for a different discussion.

Stepping off the soapbox… I’ll be attending the ALA annual conference in DC later this month, to participate on a panel that will discuss the 2020 census, and to reconnect with some old colleagues. So if you want to talk about the census, you can buy me some coffee (or beer) and check out the report.

A final research and publication related note – the map that appears at the top of my post on the distribution of US public libraries from several years back has also made it into print. It appears on page 173 of The Argument Toolbox by K.J. Peters, published by Broadview Press. It was selected as an example of using visuals for communicating research findings, making compelling arguments in academic writing, and citing underlying sources to establish credibility. I’m browsing through the complimentary copy I received and it looks excellent. If you’re an academic librarian or a writing center professional and are looking for core research method guides, I would recommend checking it out.

Census ACS 2020 and Pop Estimates 2021

Last week, the Census Bureau released the latest 5-year estimates for the American Community Survey for 2016-2020. This latest dataset uses the new 2020 census geography, which means if you’re focused on using the latest data, you can finally move away from the 2010-based geography which had been used for the ACS from 2010 to 2019 (with some caveats: 2020 ZCTAs won’t be utilized until the 2021 ACS, and 2020 PUMAs until 2022). As always, mappers have a choice between the TIGER Line files that depict the precise boundaries, or the generalized cartographic boundary files with smoothed lines and large sections of coastal water bodies removed to depict land areas. The 2016-2020 ACS data is available via data.census.gov and the ACS API.

This release is over 3 months late (compared to normal), and there was some speculation as to whether it would be released at all. The pandemic (chief among several other disruptive events) hampered 2020 decennial census and ACS operations. The 1-year 2020 ACS numbers were released over 2 months later than usual, in late November 2021, and were labeled as an experimental release. Instead of the usual 1,500 plus tables in 40 subject areas for all geographic areas with over 65,000 people, only 54 tables were released for the 50 states plus DC. This release is only available from the experimental tables page and is not being published via data.census.gov.

What happened? The details were published in a working paper, but in summary, fewer addresses were sampled and the normal mail-out and follow-up procedures were disrupted (pg 8). The overall sample size fell from 3.5 to 2.9 million addresses due to reduced mailing between April and June 2020 (pg 18), and total interviews fell from 2 million to 1.4 million, with most of the reductions occurring in spring and summer (pg 18). The overall housing unit response rate for 2020 was 71%, down from 86% in 2019 and 92% in 2018 (pg 20). The response rate for the group quarters population fell from 91% in 2019 to 47% in 2020 (pg 21). Nonresponse was differential, varying by time period (with the lowest response rates during the peak pandemic months) and geography. Of the 818 counties that meet the 65k threshold, response rates in some were below 50% (pg 21). The data contained a large degree of non-response bias, where people who did respond to the survey had significantly different social, economic, and housing characteristics from those who didn't. As a consequence of all of this, margins of error for the data increased by 20 to 30% over normal (pg 18).

Thus, 2020 will represent a hole in the ACS estimates series. The Bureau made adjustments to weighting mechanisms to produce the experimental 1-year estimates, but is generally advising policy makers and researchers who normally use this series to choose alternatives: either the 1-year 2019 ACS, or the 5-year 2016-2020 ACS. The Bureau was able to make adjustments to produce satisfactory 5-year estimates to reduce non-response bias, and the 5-year pool of samples is balanced somewhat by having at least 4 years of good data.

The Population Estimates Program has also released its latest series of vintage 2021 estimates for counties and metropolitan areas. This dataset gives us a pretty sharp view of how the pandemic affected the nation's population. Approximately 73% of all counties experienced natural decrease in 2021 (between July 1, 2020 and July 1, 2021), where the number of deaths outnumbered births. In contrast, 56% of counties had natural decrease in 2020 and 46% in 2019. Declining birth rates and increasing death rates are long-term trends, but COVID-19 magnified them, given the large number of excess deaths on one hand and families postponing child birth due to the virus on the other. Net foreign migration continued its years-long decline, but net domestic migration increased in a number of places, reflecting pandemic moves. Medium to small counties benefited most, as did large counties in the Sunbelt and Mountain West. The biggest losers in overall population were counties in California (Los Angeles, San Francisco, and Alameda), Cook County (Chicago), and the counties that constitute the boroughs of NYC.

Census Bureau 2021 Population Estimates Map