US Census data isn’t “big data” in the technical sense, as it’s not being captured and updated in real time and it isn’t fine-grained enough to pinpoint specific coordinates. But it’s big in the conventional sense: it consists of many different datasets that record a variety of aspects about the entire population at many scales, and it’s relational and flexible in nature (tables can be joined, new data can be added or modified). And, there’s a LOT of data!
So which census dataset do you choose for a particular application? In this post I provide a summary overview of what I consider to be the big five: what they are, how they’re constructed, and what’s available. I’ll be describing summary data here, which is data that’s aggregated and published by geographic area and population groups. I won’t be addressing sample microdata (individual responses to census questions) which are available for the decennial census, American Community Survey, and Current Population Survey.
ALL of these datasets are available through the American Factfinder, and via the Census Bureau’s APIs. The smaller datasets can also be downloaded directly from the individual program pages (I note this when it’s available). Outside of he Census Bureau, the Census Reporter is a nice tool for exploring data, while the Missouri Census Data Center and the NHGIS are good alternatives for generating summaries and downloading data in bulk.
The Decennial Census (DEC)
When people think of “the census” the decennial census (DEC) is the dataset that typically comes to mind. It’s the 100% count of the population that’s conducted every ten years on April 1st in years ending with a zero. Required by the Constitution and taken since 1790, its primary purpose is to provide detailed population counts that are used to re-apportion seats in the US House of Representatives. It’s also used to study population distribution and change at the smallest geographic levels, and serves as baseline data for many of the other Census Bureau statistical programs.
The modern census (from the year 2010 forward) collects just basic demographic variables about the population and housing units: gender, age, race, household relationships, group quarters, occupied and vacant units, owner and renter units. This data is published in a series of collections; the primary one is summary file 1 (SF1), but there’s also a summary file 2 (SF2) that contains more detailed cross-tabs. The Redistricting Data file (PL 94-171) is always the first to be released, and contains just the basics.
From the year 2000 back, the DEC had additional summary files that included detailed socio-economic characteristics of the population that were captured on a longer sample form sent to one in six households. The on-going American Community Survey has since replaced it, so if you are looking for anything beyond the basics you need to look at the ACS. For older DEC data, you can find the 2000 census in the American Factfinder but if you want to go back further in time use the NHGIS.
For a sample of what’s included in the DEC, look at the demographic profile table (DP-1), which contains a good cross-section of variables.
Use the DEC when:
- You need 100% counts of the population
- You need to use the smallest geographies available (census blocks, block groups, tracts)
- You don’t need anything more than basic demographic variables
- You’re studying very small population groups in a given area
- You’re making historical comparisons with earlier DEC data
The American Community Survey (ACS)
The American Community Survey (ACS) was launched in 2005 to provide more timely data about the US population on an on-going basis. In addition to the basic demographic variables captured in the DEC, the ACS also captures all the detailed socio-economic statistics that the older census used to capture, such as: employment, marital status, educational enrollment and attainment, veteran status, income, poverty, place of origin, housing value and rent, housing characteristics, and much more.
The ACS is a rolling sample survey that’s conducted each month, and is sent to 3.5 million households annually. The data is published as 1-year averages for any geographical area (state, county, place, etc) that has more than 65k people. 5-year averages are published for all geographies down to the census tract level (some block group level data is available, but it’s highly unreliable). The 5-year average is updated each year by adding a new year of data and dropping the oldest year.
ACS estimates are published at a 90% confidence interval with a margin of error that indicates the possible range of the estimate. For example, if the population for an area is 20,000 people plus or minus 1,000, that means we’re 90% confident that the population is between 19,000 and 21,000 people, and there’s a 10% chance the true population falls outside this range. The timeliness, geographic depth, and variety of variables make the ACS an essential dataset. However, it’s more complicated to work with compared to the simple counts in the DEC, and as a researcher you must pay close attention to the margins of error; estimates for small areas and small population groups can be highly unreliable. To manage this, you can aggregate the data into larger geographies or into fewer population groups.
The 1-year averages are available for all states and metropolitan areas, and for statistical areas called PUMAS that are designed to have 100k people. 1-year averages are available for large counties or places (cities and towns), but since many of these areas have less than 65k people coverage will not be complete. Use the 5-year averages if you need complete coverage of all counties or places in an area, or if you need small areas like census tracts and ZCTAs. When making historical comparisons, it’s only appropriate to compare five year periods that do not overlap. For example, comparing 2007-2011 to 2012-2016 would be appropriate.
For a thorough sample of what’s included in the ACS, look at the demographic profile tables for social (DP02), economic (DP03), housing (DP04), and demographic (DP05) variables.
Use the ACS when:
- You need detailed socio-economic indicators about the population
- You need the most recent data for these indicators
- You’re not working with data below the census tract level
- You can live with the margins of error associated with the estimates
- Use the 1-year averages when you are looking at just large places with more than 65k people and large population groups
- Use 5-year averages to study all areas of a given type, small areas and population groups, and to reduce the size of the margin of error for larger areas and groups
Population Estimates Program (PEP)
The Population Estimates Program (PEP) is used to create basic, annual estimates of the US population for large areas. Using the latest DEC as a starting point, the Bureau takes data on births, deaths, and domestic and international migration to estimate what the population is the following year, and then creates a new estimate each year on July 1st. The estimates are created at the county level, and are rolled up to states and metropolitan areas and disaggregated down to census places (cities and towns). Once a new DEC is taken, the Bureau will go back to the previous decade and issue a set of revised estimates to approximate what actually happened.
Besides the total population, estimates are created for age, gender, race, and housing units. The PEP is also a source for the components of population change for each place (births, deaths, migration), which the Census Bureau compiles from other sources. Since this is a much smaller dataset compared to the DEC or ACS, PEP data can be downloaded in pre-compiled spreadsheets directly from the PEP website, in addition to the American Factfinder and APIs.
Use the PEP when:
- You just need basic demographic variables for large geographic areas
- You’re interested in annual population change
- You want a simple dataset to work with
- You’re interested in the components of population change
Current Population Survey (CPS)
The Current Population Survey (CPS) is a monthly survey of 60,000 households that’s sponsored by the Census Bureau and the Bureau of Labor Statistics (BLS). It’s designed to provide national estimates for a variety of demographic and labor force indicators on a regular basis. The same household is: interviewed for 4 consecutive months, not interviewed again for 8 months, interviewed again for 4 consecutive months, and then is removed from the survey.
Some questions like employment and unemployment are asked repeatedly each month, other questions are asked only during certain months at regular intervals (for example, every two years in November a question is asked about voter registration and participation), and other questions are special topics that are asked on a one-time or limited basis.
Some of the most important indicators are tabulated and published as annual estimates directly on the CPS website, while many of the labor force statistics are published monthly or annually on the BLS website. Given the small sample size of the CPS relative to the ACS, it’s more common that researchers will manipulate the raw CPS data (the individual responses) to create their own estimates and cross-tabulations. The CPS website has some tools for doing this, and IPUMS USA is a popular tool as well.
Given the size of the sample, CPS estimates tabulated by the Census or BLS are published at a national or regional level, and in limited cases at the state level. Aggregated, summarized data is published with margins of error at a 90% confidence interval.
Use the CPS when:
- You need monthly or annual, detailed demographic or labor force data for the entire country or for large regions
- You’re looking for special topic data that’s not published in any other dataset
- You’re comfortable working with microdata if the data you’re looking for has not been aggregated or summarized
Business Patterns and Economic Census
The previous four sources provide data on population and housing units. If you’re looking for data on businesses, here are the two most common options.
The County and ZIP Code Business Patterns provides annual counts of the number of businesses for states, metropolitan areas, counties, and ZIP Codes that includes the number of employees, establishments, and wages. It also provides counts of establishments classified by the North American Industrial Classification System (NAICS). You can look at broad (i.e. manufacturing, retail, finance & insurance) or narrow (auto parts manufacturers, department stores, commercial banks) NAICS summaries. If a particular area has fewer than 3 businesses of a specific type the data isn’t disclosed for confidentiality reasons. The data is generated from the Business Register, which is a government master file of businesses that’s updated on an on-going basis from several sources.
The Economic Census is conducted every five years in years that end in two or seven. It is an actual count of businesses that captures all of the fields that are published in the Business Patterns, but it also: captures sales as well as wages, provides place-level data (cities and towns), and is published in a variety of topical as well as geographic summaries. Because there is quite a time-lag between the collection and publication of this data (several years), the Economic Census is better suited for studying the economy in retrospect. The USDA publishes a Census of Agriculture which covers farming in more detail.
For business statistics:
- Use the Business Patterns for basic counts of the latest data
- Use the Economic Census for more detailed information that’s a bit older
- Use the USDA’s Census of Agriculture to study farming