ZBP Data in a Notebook

Examples of using the Census Bureau’s API with Python

At the end of my book I briefly illustrate how the Census Bureau’s API works using Python. I’ll expand on that in this post; we’ll pull data from the Population Estimates Program, transform it, and create a chart using Python with Pandas in a Notebook. I’ll conclude with an additional example using the ZIP Code Business Patterns.

The Census Bureau has dedicated API pages for each dataset (decennial census, ACS, population estimates, and more), and you need to familiarize yourself with the geographies and variables that are available for each. The API is a basic REST API, where you insert parameters into a base url and retrieve data based on the link you submit. Python has several modules you can use for interacting with APIs – the requests module is a popular choice.

The following pop estimates example is on GitHub (but if GitHub flops, see the nbviewer example instead).

The top of the script contains basic stuff: import the modules you need, read in your key, and define the variables that you want to pull. You don’t have to use an API key, but if you don’t you’re limited to 500 queries a day. Requesting a key is simple and free. A best practice is to store your key (a long alphanumeric string) in a file that you read in, so you’re not exposing it in the script. Most of the census APIs require that you pass in a year and a dataset (dsource). Larger datasets may be divided into subsets (dname); for example, the population estimates are divided into estimates, components of change, and characteristics (age, sex, race, etc.). Save the columns and geographies that you want as comma-separated strings. You have to consult the documentation and variable lists that are available for each dataset to build these, and the geographies require ANSI / FIPS codes.

%matplotlib inline
import requests,pandas as pd

with open('census_key.txt') as key:
    api_key=key.read().strip()

year='2018'
dsource='pep'
dname='components'
cols='GEONAME,NATURALINC,DOMESTICMIG,INTERNATIONALMIG'
state='42'
county='017,029,045,091,101'

Next, you can create the url. I’ve been doing this in two parts. The first part:

base_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}'

includes the base https://api.census.gov/data/ followed by parameters that you fill in. The year, data source, and dataset name are the standard pieces. The output looks like this:

'https://api.census.gov/data/2018/pep/components'

Then you take that base_url and add additional parameters that are going to vary within the script, in this case the columns and the geography, which all appear in the ‘get’ portion of the url. The ‘for’ and ‘in’ options allow you to select the type of geography within another geography, in this case counties within states, and you pass in the appropriate ANSI FIPS codes from the string you’ve created. The key appears at the end of the url, but if you opt not to use it you can omit that part. Once the link is fully constructed you use the requests module to fetch the data using that url. You can print the result out as text (assuming it’s not too long).

data_url = f'{base_url}?get={cols}&for=county:{county}&in=state:{state}&key={api_key}'
response=requests.get(data_url)
print(response.text)

The result looks like a nested list, but is actually a string that’s structured in a non-standard JSON format:

[["GEONAME","NATURALINC","DOMESTICMIG","INTERNATIONALMIG","state","county"],
["Bucks County, Pennsylvania","-178","-605","862","42","017"],
["Chester County, Pennsylvania","1829","-887","1374","42","029"],
["Delaware County, Pennsylvania","1374","-2513","1579","42","045"],
["Montgomery County, Pennsylvania","1230","-1987","2315","42","091"],
["Philadelphia County, Pennsylvania","8617","-11796","8904","42","101"]]

To do anything with it, convert it to JSON with response.json(). Then you can convert it into a list, dictionary, or in this example a Pandas dataframe. Here, I build the dataframe with everything from row one forward [1:]; row zero contains the column headers[0]. I rename some of the columns, build a unique ID by concatenating the state and county FIPS codes and set that as the new index, and drop the individual county and state FIPS columns. By default every object that’s returned is a string, so I convert the numeric columns to integers:

data=response.json()
df=pd.DataFrame(data[1:], columns=data[0]).\
    rename(columns={"NATURALINC": "Natural Increase", "DOMESTICMIG": "Net Domestic Mig", "INTERNATIONALMIG":"Net Foreign Mig"})
df['fips']=df.state+df.county
df.set_index('fips',inplace=True)
df.drop(columns=['state','county'],inplace=True)
df=df.astype(dtype={'Natural Increase':'int64','Net Domestic Mig':'int64','Net Foreign Mig':'int64'})
df

Then I can see the result:

pep dataframe

Once the data is in good shape, you can begin to analyze and visualize it. Here’s the components of population change for Philadelphia and the surrounding suburban counties in Pennsylvania from 2017 to 2018 – natural increase is the difference between births and deaths, and there’s net migration within the US (domestic) and between the US and other countries (foreign):

labels=df['GEONAME'].str.split(' ',expand=True)[0]
ax=df.plot.bar(rot=0, title='Components of Population Change 2017-18')
ax.set_xticklabels(labels)
ax.set_xlabel('')

Components of Population Change Plot

Each request is going to vary based on your specific needs and the construction of the particular dataset. Here’s another example where I pull data on business establishments, employees, and wages (in $1,000s) from the ZIP Code Business Patterns (ZBP). This dataset is smaller, so it doesn’t have a dataset name, just a data source. To get all the ZIP Codes in Delaware I use the asterisk * as a wildcard. Because ZIP Codes do not nest within states, I can’t use the ‘in’ option; it’s simply not available. Instead, the state code is stored in a special field called ST, and I can use it as a general limiter with equals in the query:

year='2016'
dsource='zbp'
cols='ESTAB,EMP,PAYQTR1,PAYANN'
state='10'

base_url = f'https://api.census.gov/data/{year}/{dsource}'

data_url = f'{base_url}?get={cols}&for=zipcode:*&ST={state}&key={api_key}'
response=requests.get(data_url)
print(response.text)
[["ESTAB","EMP","PAYQTR1","PAYANN","ST","zipcode"],
["982","26841","448380","1629024","10","19713"],
["22","628","3828","15848","10","19716"],
["8","15","371","2030","10","19732"],
["7","0","0","0","10","19718"],
["738","9824","83844","353310","10","19709"]...
data=response.json()
zbp_data=pd.DataFrame(data[1:], columns=data[0]).set_index('zipcode')
zbp_data.drop(columns=['ST'],inplace=True)
for field in cols.split(','):
    zbp_data=zbp_data.astype(dtype={field:'int64'})
zbp_data.head()

ZBP Data for Delaware

One of the issues with the ZBP is that many values are not disclosed due to privacy regulations; instead of returning nulls, a zero is returned, but in this dataset those are not true zeros. Once you retrieve the data and set the types you can replace the zeros with NaNs, which are numpy / pandas nulls. There’s a quirk, though: dataframe columns declared as integers cannot contain null values. Instead you can use a float, or the nullable integer type that’s been implemented in newer pandas versions (for my specific use case this data will be inserted into a database, so I’ll use SQL to accomplish the zero to null conversion). ZBP data is also injected with noise to protect privacy, and you can retrieve special columns that contain noise flags.
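
If you wanted to handle the conversion in pandas instead, a minimal sketch would look something like this (not from the original notebook; it assumes the zbp_data dataframe created above and uses pandas’ nullable Int64 type so the counts stay whole numbers after nulls are introduced):

import numpy as np

# The undisclosed values come through as zeros, so treat them as missing;
# replacing with NaN casts the integer columns to float
zbp_masked = zbp_data.replace(0, np.nan)
# Optionally convert to pandas' nullable integer type to keep whole numbers
zbp_masked = zbp_masked.astype('Int64')
zbp_masked.head()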

The API is convenient for automating the data acquisition process, and allows you to cherry pick the variables you want. To avoid hitting the API over and over again as you build your scripts (which is prohibitive when requesting lots of data), you can pickle the data right after you retrieve it – pickling serializes a Python object to a file on disk, and pandas has special functions for writing and reading pickles. Once you pull your data and pickle it, you can comment out (or in a notebook, simply not rerun) the requests block, and subsequently pull the data from the pickle as you tweak your code (see the caveat in the postscript – it may be best to use json instead of a pickle).

#Write to a pickle
zbp_data.to_pickle('insert path here.pickle')
#Read from a pickle to dataframe
zbp_new=pd.read_pickle('insert path here.pickle')

Take a look at the Census Data API User Guide to learn more. The guide focuses just on the REST API, and is not specific to a scripting language. Of course, you also need to familiarize yourself with the datasets and how they’re created and organized, and with census geography (which is why I wrote this book).

Postscript

Since I’ve finished this post I’ve created a notebook that pulls ZBP data from the API (alt nbviewer here) and have some additional thoughts I’d like to share:

  1. I decided to dump the data I retrieved from the API to a json file and then pull data from that file instead of using a pickle (see the short sketch after this list). Pickles come with serious security issues. If you don’t intend to share your code with anyone pickles are fine, otherwise consider an alternative.
  2. My method for parsing the retrieved data into a dataframe worked fine because the census API uses non-standard JSON; essentially the string that’s returned resembles a nested Python list. If this was true JSON, we may need to employ a different method to account for the fact that the number of elements per record may vary.
  3. Wildcards are not always available to build urls for certain data; for example to download the number of establishments classified by industry I wasn’t able to grab everything for one state using the method I illustrated in this post. Instead I had to loop through a list of ZIP and NAICS codes to retrieve what I wanted one at a time.
  4. In the case of retrieving establishments classified by industry there were many cases when there was no data for a particular ZIP Code (i.e. no farms and mines in midtown Manhattan). Since I needed records that showed zero establishments, I had to insert them myself if the API returned no result. Even if you didn’t need records with zeros, it’s important to consider the potential impact of getting nothing back from the API on your subsequent code.
  5. In my experience thus far these APIs have been pretty reliable, in that I haven’t had issues with timeouts or partially returned data. If that were not the case and you had lots of data to retrieve, you would need to build in some try / except blocks to handle exceptions, save data as you go along, and pick up where you left off if something breaks. Read about this geocoding script I wrote a few years back for examples.
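
Here’s a minimal sketch of the json-instead-of-pickle approach from point 1 (the file name is illustrative, not from the original notebook):

import json

# Save the raw API response once...
data = response.json()
with open('zbp_raw.json', 'w') as f:
    json.dump(data, f)

# ...then on later runs read from the file instead of hitting the API
with open('zbp_raw.json') as f:
    data = json.load(f)
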
FRED Chart - Personal Savings Rate

Finding Economic Data with FRED

I attended ALA’s annual conference in DC last month, where I met FRED. Not a person, but a database. I can’t believe I hadn’t met FRED before – it is an amazingly valuable resource for national, time-series economic data.

FRED was created by the Economic Research unit of the Federal Reserve Bank of St. Louis. It was designed to aggregate economic data from many government sources into a centralized database, with a straightforward interface for creating charts and tables. At present, it contains 567,000 US and international time series datasets from 87 sources.

Categories of data include banking and finance (interest and exchange rates, lending, monetary data), labor markets (basic demographics, employment and unemployment, job openings, taxes, real estate), national accounts (national income, debt, trade), production and business (business cycles, production, retail trade, sector-level information about industries), prices (commodities, consumer price indexes) and a lot more. Sources include the Federal Reserve, the Bureau of Labor Statistics, the Census Bureau, the Bureau of Economic Analysis, the Treasury Department, and a mix of other government and corporate sources from the US and around the world.

On their home page at https://fred.stlouisfed.org/ you can search for indicators or choose one of several options for browsing. The default dashboard shows you some of the most popular series and newest releases at a glance. Click on Civilian Unemployment Rate, and you retrieve a chart with monthly stats that stretch from the late 1940s to the present. Most of FRED’s plots highlight periods of recession since these have a clear impact on economic trends. You can modify the chart’s date range, change the frequency (monthly, quarterly, annually – varies by indicator), download the chart or the underlying data in a number of formats, and share a link to it. There are also a number of advanced customization features, such as adding other series to the chart. Directly below the chart are notes that provide a clear definition of the indicator and its source (in this case, the Bureau of Labor Statistics) and links to related tables and resources.

FRED - Chart of Civilian Unemployment Rate

The unemployment rate is certainly something that you’d expect to see, but once you browse around a bit you’ll be surprised by the mix of statistics and the level of detail. I happened to stumble across a monthly Condo Price Index for the New York City Metro Area.

Relative to other sources or portals, FRED is great for viewing and retrieving national (US and other countries) economic and fiscal data and charts gathered from many sources. It’s well suited for time-series data; there are lots of indexes and you can opt for seasonally adjusted or unadjusted values. Many of the series include data for large regions of the US, states, metro areas, and counties. The simplest way to find sub-national data is to do a search, and once you do you can apply filters for concepts, frequencies, geographies, and sources. FRED is not the place to go if you need data for small geographies below the county level. If you opt to create a FRED account (purely optional) you’ll be able to save and track indicators that you’re interested in and build your own dashboards.

If you’re interested in maps, visit FRED’s brother GeoFRED at https://geofred.stlouisfed.org/. The homepage has a series of sample thematic maps for US counties and states and globally for countries. Choose any map, and once it opens you can change the geography and indicator to something else. You can modify the frequency, units, and time periods for many of the indicators, and you have basic options for customizing the map (colors, labels, legend, etc.). The maps are interactive, so you can zoom in and out and click on a place to see its data value. Most of the county-level data comes from the Census Bureau, but as you move up to states or metro areas the number of indicators and sources increases. For example, the map below shows individual income taxes collected per capita by state in 2018.

GeoFRED - State Income Tax

There’s a basic search function for finding specific indicators. Just like the charts, maps can be downloaded as static images, shared and embedded in websites, and you can download the data behind the map (it’s simpler to download the same indicator for multiple geographies using GeoFRED compared to FRED).

Take a few minutes and check it out. For insights and analyses of data published via FRED, visit FRED’s blog at https://fredblog.stlouisfed.org/.

data.census.gov

Navigating the New data.census.gov

June 2019 is the final month that the Census Bureau will post new data in the American Factfinder (AFF). From this point forward, all new datasets will be published via the new data dissemination platform data.census.gov. The second chapter of my book (now available for pre-order!) is devoted to navigating this new interface. In this post I’ll provide a preview / brief tutorial of the advanced search functions.

The new interface is search-driven, so you can type the names of topics and geographies or table ID numbers to find and explore data tables. There are spiffy data profiles for several geographies, and you have the ability to make basic thematic maps. The search interface makes it much easier to casually browse and discover data, so go ahead and explore.

I’d still recommend having a search strategy to find precisely what you need. Keyword searching alone isn’t going to cut it, because you’re searching across tens of thousands of tables in dozens of datasets. The good news is that the same strategy I’ve used for the AFF can be applied to data.census.gov: use the advanced search to filter by survey, year, geography, and topic to narrow down the list of possible tables to a manageable number, and then search or browse through those results to find what you need.

Let’s say we want to download the most recent data on home values for all the counties in Pennsylvania (or a state of your choosing). On data.census.gov click on the advanced search link under the search box. On the advanced search page scroll to the bottom to the filters. We’ll address them one by one:

Surveys. These represent all the different census datasets. Select ACS 5-Year Estimates Detailed Tables. Detailed socio-economic characteristics of the population are primarily published in the ACS. The 1-year estimates are only published for geographies that have at least 65k people, and since most states have rural counties that fall below this threshold, we’ll have to use the 5-year estimates to get all the counties. The detailed tables are narrow, focusing on estimates for a single variable. The other options include profiles (lots of different data for one place) and subject tables (narrower in scope than profiles, but broader than the detailed tables).

filter by survey

Years. At the moment 2017 is the latest year for the ACS, so let’s select that. This quickly eliminates a lot of tables that we’re not interested in.

Geography. Choose 050 – County, then scroll down and choose Pennsylvania in the County (State) list, then All counties in Pennsylvania in the final list.

filter by geography

Topics. For this example choose Housing, then Financial Characteristics, then Housing Value and Purchase Price. Of all the filter options, this one is the most open-ended and may require some experimentation based on what you’re looking for.

filter by topic

Codes. We don’t need to filter by codes in this example, but if we were searching for labor or business-related data we’d use this filter to limit results to specific sectors or industries by NAICS codes.

Underneath the filter menu, click the View All Results button. This brings us to the first results page, which provides a list of tables, maps, and pages related to our search. Click the button to View All Tables under the tables section.

This brings us to the table results page; the list of tables is displayed on the left, and the currently selected table is displayed on the right; in this case Value of owner-occupied housing units is shown, with counts of units by value brackets. At this stage, we can scroll through the list and browse to find tables with data that we’re interested in. We can also access the filters at the top of the list, if we want to modify our search parameters.

table results

A little further down the results list is a table for Median Value. Selecting that table will preview it on the right. Hit the Customize Table button. This opens the table in its own dedicated view. Hit the blue drop down arrow to the right of the table name, and you can modify the geography, year, or time-period on the left. On the right is a Download option. Hit Download and you’ll be prompted to save a CSV download, which contains three text files: the data itself, the metadata, and descriptive information about the download.

customize table

Back on the customize table page, you can navigate back to the table results by clicking on “Tables” in the breadcrumb links that appear in the top left-hand corner. Then you can browse and choose additional tables.

That’s it! Not bad, right? Well, there are always caveats. At the moment, the biggest one is that you can’t easily download most geographies that are contained within other geographies. With one click we can filter to select all counties within a state, or all states within the nation. But if we wanted all census tracts in a county or all county subdivisions in a state, there aren’t any “All geographies in…” options for these geographies. We’d have to select each and every tract within a county, one at a time…

While data.census.gov is now relatively stable, it’s still under development and additional features like this should (hopefully) be implemented as time passes between now and the 2020 census. This is one reason why the American Factfinder will survive for another year, as we’ll still need to lean on it to accomplish certain tasks. Of course, there are other options within the Census Bureau (the API, the FTP site) and without (NHGIS, MCDC, Census Reporter) for accessing data.

The new platform currently provides access to several datasets from the present back to the year 2010: the decennial census, the ACS, population estimates, and several of the business datasets. The first new datasets that will be published in data.census.gov (and NOT in the AFF) include the 2017 Economic Census this summer and the 1-year 2018 ACS in September.

View the Release Notes and FAQs for more details about the platform: general documentation, recent developments, bugs, and planned enhancements. The Census Bureau also has an archived webinar with slides that discuss the transition.

Calculate margin of error for ratio (mean income)

Calculating Mean Income for Groups of Geographies with Census ACS Data

When you’re working with American Community Survey (ACS) data and aggregating small census geographies to larger ones (census tracts to neighborhoods, for example), you need to sum the estimates and calculate new margins of error. This is straightforward for most estimates: you simply sum them, and take the square root of the sum of squares of the margins of error (MOEs) for each estimate that you’re aggregating. But what if you need to group and summarize derived estimates like means or medians? In this post, I’ll demonstrate how to calculate mean household income by aggregating ZCTAs to United Hospital Fund neighborhoods (UHF), a type of public health area in NYC created by aggregating ZIP Codes.

I’m occasionally asked how to summarize median household income from tracts to neighborhood-like areas. You can’t simply add up the medians and divide them; the result would be completely erroneous. Calculating a new median requires us to sort the individual household-level records and choose the middle value, which we cannot do because those records are confidential and not public. There are a few statistical interpolation methods that we can use with interval data (the number of households summarized by income brackets) to estimate a new median and MOE, but the calculations are rather complex. The State Data Center in California provides an excellent tutorial that demonstrates the process, and in my new book I’ll walk through these steps in the supplemental material.

While a mean isn’t as desirable as a median (it can be skewed by outliers), it’s much easier to calculate. The ACS includes tables on aggregate income, including the sum of all income earned by households and by other population groups (like families or the total population). If we sum aggregate household income and the number of households for our small geographic areas, we can divide total income by total households to get mean income for the larger area, and we can use the ACS formula for computing the MOE of a ratio to generate a new MOE for the mean. The Census Bureau publishes all the ACS formulas in a detailed guidebook for data users, and I’ll cover many of them in the ACS chapter of my book (to be published by the end of 2019).
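
For reference, here’s a quick Python sketch (not part of the original post) of the two formulas just described; the same math is applied in Excel and SQL below:

from math import sqrt

def moe_sum(*moes):
    # MOE for a sum of estimates: square root of the sum of the squared MOEs
    return sqrt(sum(m ** 2 for m in moes))

def moe_ratio(numerator, denominator, num_moe, den_moe):
    # MOE for a derived ratio, e.g. aggregate income / households = mean income
    ratio = numerator / denominator
    return sqrt(num_moe ** 2 + ratio ** 2 * den_moe ** 2) / denominator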

Calculating a Derived Mean in Excel

Let’s illustrate this with a simple example. I’ve gathered 5-year 2017 ACS data on number of households (table B11001) and aggregate household income (table B19025) by ZCTA, and constructed a sheet to correlate individual ZCTAs to the UHF neighborhoods they belong to. UHF 101 Kingsbridge-Riverdale in the Bronx is composed of just two ZCTAs, 10463 and 10471. We sum the households and aggregate income to get totals for the neighborhood. To calculate a new MOE, we take the square root of the sum of squares for each of the estimate’s MOEs:

Calculate margin of error for new sum

To calculate mean income, we simply divide the total aggregate household income by total households. Calculating the MOE is more involved. We use the ACS formula for derived ratios, where aggregate income is the numerator of the ratio and households is the denominator. We multiply the square of the ratio (mean income) by the square of the MOE of the denominator (households MOE), add that product to the square of the MOE of the numerator (aggregate income MOE), take the square root, and divide the result by the denominator (households):

=(SQRT((moe_ratio_numerator^2)+(ratio^2*moe_ratio_denominator^2))/ratio_denominator)
Calculate margin of error for ratio (mean income)

The 2013-2017 mean household income for UHF 101 is $88,040, +/- $4,223. I always check my math using the Cornell Program on Applied Demographics’ ACS Calculator to make sure I didn’t make a mistake.

This is how it works in principle, but life is more complicated. When I downloaded this data I had number of households by ZCTA and aggregate household income by ZCTA in two different sheets, and the relationship between ZCTAs and UHFs in a third sheet. There are 42 UHF neighborhoods and 211 ZCTAs in the city, of which 182 are actually assigned to UHFs; the others have no household population. I won’t go into the difference between ZIP Codes and ZCTAs here, as it isn’t a problem in this particular example.

Tying them all together would require using the ZCTA in the third sheet in a VLOOKUP formula to carry over the data from the other two sheets. Then I’d have to aggregate the data to UHF using a pivot table. That would easily give me sum of households and aggregate income by UHF, but getting the MOEs would be trickier. I’d have to square them all first, take the sum of these squares when pivoting, and take the square root after the pivot to get the MOEs. Then I could go about calculating the means one neighborhood at a time.

Spreadsheet-wise there might be a better way of doing this, but I figured why do that when I can simply use a database? PostgreSQL to the rescue!

Calculating a Derived Mean in PostgreSQL

In PostgreSQL I created three empty tables: households, aggregate income, and a ZCTA-to-UHF relational table, and used pgAdmin to import the ZCTA-level data from CSVs into those tables (alternatively you could use SQLite instead of PostgreSQL, but you would need the optional math extension installed, as SQLite can’t do square roots out of the box).

Portion of households table. A separate aggregate household income table is structured the same way, with income stored as bigint type.

Portion of the ZCTA to UHF relational table.

In my first run through I simply joined the tables together using the 5-digit ZCTA to get the sum of households and aggregate income. I SUM the values for both and use GROUP BY to do the aggregation to UHF. In PostgreSQL, pipe-forward-slash |/ is the square root operator; I sum the squares of each ZCTA’s MOE and take the root of the total to get the UHF MOEs. I omit ZCTAs that have zero households so they’re not factored into the formulas:

SELECT z.uhf42_code, z.uhf42_name, z.borough,
    SUM(h.households) AS hholds,
    ROUND(|/(SUM(h.households_me^2))) AS hholds_me,
    SUM(a.agg_hhold_income) AS agghholds_inc,
    ROUND(|/(SUM(a.agg_hhold_income_me^2))) AS agghholds_inc_me
FROM zcta_uhf42 z, hsholds h, agg_income a
WHERE z.zcta=h.gid2 AND z.zcta=a.gid2 AND h.households !=0
GROUP BY z.uhf42_code, z.uhf42_name, z.borough
ORDER BY uhf42_code;
Portion of query result, households and income aggregated from ZCTA to UHF district.

Once that was working, I modified the statement to calculate mean income. Calculating the MOE for the mean looks pretty rough, but it’s simply because we have to repeat the calculation for the ratio over again within the formula. This could be avoided if we turned the above query into a temporary table, and then added two columns and populated them with the formulas in an UPDATE – SET statement. Instead I decided to do everything in one go, and just spent time fiddling around to make sure I got all the parentheses in the right place. Once I managed that, I added the ROUND function to each calculation:

SELECT z.uhf42_code, z.uhf42_name, z.borough,
    SUM(h.households) AS hholds,
    ROUND(|/(SUM(h.households_me^2))) AS hholds_me,
    SUM(a.agg_hhold_income) AS agghholds_inc,
    ROUND(|/(SUM(a.agg_hhold_income_me^2))) AS agghholds_inc_me,
    ROUND(SUM(a.agg_hhold_income) / SUM(h.households)) AS hhold_mean_income,
    ROUND((|/ (SUM(a.agg_hhold_income_me^2) + ((SUM(a.agg_hhold_income)/SUM(h.households))^2 * SUM(h.households_me^2)))) / SUM(h.households)) AS hhold_meaninc_me
FROM zcta_uhf42 z, hsholds h, agg_income a
WHERE z.zcta=h.gid2 AND z.zcta=a.gid2 AND h.households !=0
GROUP BY z.uhf42_code, z.uhf42_name, z.borough
ORDER BY uhf42_code;
Query in pgAdmin and portion of result for calculating mean household income

I chose a couple of examples, one where a UHF had only one ZCTA and another that had two, and tested them in the Cornell ACS calculator to ensure the results were correct. Once I got it right, I added:

CREATE VIEW household_sums AS

to the top of the statement and executed it again to save it as a view. Mission accomplished! To make doubly sure that the values were correct, I connected my database to QGIS and joined this view to a UHF shapefile to visually verify that the results made sense (I could also have imported the shapefile into the database as a spatial table and incorporated it into the query).

Mean household income by UHF neighborhood in QGIS

Conclusion

While it would be preferable to have a median, calculating a new mean for an aggregated area is a fair alternative if you simply need a summary value for the variable and don’t have the time to spend on statistical interpolation. Besides income, the Census Bureau also publishes aggregate tables for other variables like travel time to work, hours worked, number of vehicles, rooms, rent, home value, and various subsets of income (earnings, wages or salary, interest and dividends, social security, public assistance, etc.), which makes it possible to calculate new means for aggregated areas. Just make sure you use the appropriate denominator, whether it’s total population, households, owner- or renter-occupied housing units, etc.

Census Workshop Recap

I’ve been swamped these past few months, revising my census book, teaching a spatial database course, and keeping the GIS Lab running. Thus, this will be a shorter post!

Last week I taught a workshop on understanding, finding, and accessing US Census Data at the Metropolitan Library Council of New York. If you couldn’t make it, here are the presentation slides and the group exercise questions.

Most of the participants were librarians who were interested in learning how to help patrons find and understand census data, but there were also some data analysts in the crowd. We began with an overview of how the census is structured by dataset, geography, and subject categories. I always cover the differences between the decennial census and the ACS, with a focus on how to interpret ACS estimates and gauge their reliability.

For workshops I think it’s best to start with searching for profiles (lots of different data for one place). This gives new users a good overview of the breadth and depth of the types of variables that are available in the census. Since this was a New York City-centric crowd we looked at the City’s excellent NYC Population Factfinder first. The participants formed small groups and searched through the application to answer a series of fact-finding questions that I typically receive. Beyond familiarizing themselves with the applications and data, the exercises also helped to spark additional questions about how the census is structured and organized.

Then we switched over to the Missouri Census Data Center’s profile and trends applications (listed on the right hand side of their homepage) to look up data for other parts of the country, and in doing so we were able to discuss the different census geographies that are available for different places. Everyone appreciated the simple and easy to use interface and the accessible tables and graphics. The MCDC doesn’t have a map-based search, so I did a brief demo of TIGERweb for viewing census geography across the country.

Once everyone had this basic exposure, we hopped into the American Factfinder to search for comparison tables (a few pieces of data for many places). We discussed how census data is structured in tables and what the difference between the profile, summary, and detailed tables are. We used the advanced search and I introduced my tried and true method of filtering by dataset, geography, and topic to find what we need. I mentioned the Census Reporter as good place to go for ACS documentation, and as an alternate source of data. Part of my theme was that there are many tools that are suitable for different needs and skill levels, and you can pick your favorite or determine what’s suitable for a particular purpose.

We took a follow-the-leader approach for the AFF, where I stepped through the website and the process for downloading two tables and importing them into a spreadsheet, highlighting gotchas along the way. We did some basic formulas for aggregating ACS estimates to create new margins of error, and a VLOOKUP for tying data from two tables together.

We wrapped up the morning with a foreshadowing of what’s to come with the new data.census.gov (which will replace the AFF) and the 2020 census. While there’s still much uncertainty around the citizenship question and fears of an undercount, the structure of the dataset won’t be too different from 2010 and the timeline for release should be similar.

LISA map of Broad Band Subscription by Household

Mapping US Census Data on Internet Access

ACS Data on Computers and the Internet

The Census Bureau recently released the latest five-year period estimates from the American Community Survey (ACS), with averages covering the years from 2013 to 2017.

Back in 2013 the Bureau added new questions to the ACS on computer and internet use: does a household have a computer or not, and if yes what type (desktop or laptop, smartphone, tablet, or other), and does a household have an internet subscription or not, and if so what kind (dial-up, broadband, and type of broadband). 1-year averages for geographies with 65,000 people or more have been published since 2013, but now that five years have passed there is enough data to publish reliable 5-year averages for all geographies down to the census tract level. So with this 2013-2017 release we have complete coverage for computer and internet variables for all counties, ZCTAs, places (cities and towns), and census tracts for the first time.

Summaries of this data are published in table S2801, Types of Computers and Internet Subscriptions. Detailed tables are numbered B28001 through B28010 and are cross-tabulated with each other (presence of computer and type of internet subscription) and by age, educational attainment, labor force status, and race. You can access them all via the American Factfinder or the Census API, or from third-party sites like the Census Reporter. The basic non-cross-tabbed variables have also been incorporated into the Census Bureau’s Social Data Profile table DP02, and in the MCDC Social profile.

The Census Bureau issued a press-release that discusses trends for median income, poverty rates, and computer and internet use (addressed separately) and created maps of broadband subscription rates by county (I’ve inserted one below). According to their analysis, counties that were mostly urban had higher average rates of access to broadband internet (75% of all households) relative to mostly rural counties (65%) and completely rural counties (63%). Approximately 88% of all counties that had subscription rates below 60 percent were mostly or completely rural.

Figure 1. Percentage of Households With Subscription to Any Broadband Service: 2013-2017 [Source: U.S. Census Bureau]

Not surprisingly, counties with lower median incomes were also associated with lower rates of subscription. Urban counties with median incomes above $50,000 had an average subscription rate of 80% compared to 71% for completely rural counties. Mostly urban counties with median incomes below $50k had average subscription rates of 70% while completely rural counties had an average rate of 62%. In short, wealthier rural counties have rates similar to less wealthy urban counties, while less wealthy rural areas have the lowest rates of all. There also appear to be some regional clusters of high and low broadband subscriptions. Counties within major metro areas stand out as clusters with higher rates of subscription, while large swaths of the South have low rates of subscription.

Using GeoDa to Identify Broadband Clusters

I was helping a student recently with making LISA maps in GeoDa, so I quickly ran the data (percentage of households with subscription to any broadband service) through to see if there were statistically significant clusters. It’s been a couple years since I’ve used GeoDa and this version (1.12) is significantly more robust than the one I remember. It focuses on spatial statistics but has several additional applications to support basic data mapping and stats. The interface is more polished and the software can import and export a number of different vector and tabular file formats.

The Univariate Local Moran’s I analysis, also known as LISA for local indicators of spatial auto-correlation, identifies statistically significant geographic clusters of a particular variable. Once you have a polygon shapefile or geopackage with the attribute you want to study, you add it to GeoDa and then create a weights file (Tools menu) using the unique identifier for the shapes. The weights file indicates how individual polygons neighbor each other: queens contiguity classifies features as neighbors as long as they share a single node, while rooks contiguity classifies them as neighbors if they share an edge (at least two points that can form a line).

Once you’ve created and saved a weights file you can run the analysis (Shapes menu). You select the variable that you want to map, and can choose to create a cluster map, scatter plot, and significance map. The analysis generates 999 random permutations of your data and compares it to the actual distribution to evaluate whether clusters are likely the result of random chance, or if they are distinct and significant. Once the map is generated you can right click on it to change the number of permutations, or you can filter by significance level. By default a 95% confidence level is used.
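
The post uses GeoDa’s point-and-click interface, but for anyone who prefers to stay in Python, a rough equivalent using the libpysal and esda packages might look like the sketch below (the shapefile and column names are placeholders, not files from this analysis):

import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran_Local

# County polygons with the ACS broadband attribute (hypothetical file and column names)
counties = gpd.read_file('county_broadband.shp')
w = Queen.from_dataframe(counties)   # queen contiguity weights, as in GeoDa
lisa = Moran_Local(counties['broadband_pct'], w, permutations=999)

# lisa.q holds the cluster quadrant (1=High-High, 2=Low-High, 3=Low-Low, 4=High-Low)
# and lisa.p_sim the pseudo p-values; keep clusters significant at the 95% level
counties['cluster'] = lisa.q
counties['significant'] = lisa.p_sim < 0.05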

The result for the broadband access data is below. The High-High polygons in red are statistically significant clusters of counties that have high percentages of broadband use: the Northeast corridor, much of California, the coastal Pacific Northwest, the Central Rocky Mountains, and certain large metro areas like Atlanta, Chicago, Minneapolis, big cities in Texas, and a few others. There is a relatively equal number of Low-Low counties that are statistically significant clusters of low broadband service. This includes much of the deep South, south Texas, and New Mexico. There are also a small number of outliers. Low-High counties represent statistically significant low values surrounded by higher values. Examples include highly urban counties like Philadelphia, Baltimore City, and Wayne County (Detroit) as well as some rural counties located along the fringe of metro areas. High-Low counties represent significant higher values surrounded by lower values. Examples include urban counties in New Mexico like Santa Fe, Sandoval (Albuquerque), and Otero (Alamogordo), and a number in the deep south. A few counties cannot be evaluated as they are islands (mostly in Hawaii) and thus have no neighbors.

LISA Map of % of Households that have Access to Broadband Internet by County (2013-2017 ACS). 999 permutations, 95% conf interval, queens contiguity

All ACS data is published at a 90% confidence level and margins of error are published for each estimate. Margins of error are typically higher for less populated areas, and for any population group that is small within a given area. I calculated the coefficient of variation for this variable at the county level to measure how precise the estimates are, and used GeoDa to create a quick histogram. The overwhelming majority had CV values below 15, which is regarded as being highly reliable. Only 16 counties had values that ranged from 16 to 24, which puts them in the medium reliability category. If we were dealing with a smaller population (for example, dial-up subscribers) or smaller geographies like ZCTAs or tracts, we would need to be more cautious in analyzing the results, and might have to aggregate smaller populations or areas into larger ones to increase reliability.
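
As a rough sketch of that check (again not from the original post, and assuming the counties dataframe from the previous sketch also carries the MOE column), the CV converts the published 90% MOE to a standard error and expresses it as a percentage of the estimate:

# Divide the MOE by 1.645 to get the standard error, then divide by the estimate
counties['broadband_cv'] = (counties['broadband_moe'] / 1.645) / counties['broadband_pct'] * 100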

Wrap Up

The issue of the digital divide has gained more coverage in the news lately with the exploration of the geography of the “new economy”, and how technology-intensive industries are concentrating in certain major metros while bypassing smaller metros and rural areas. Lack of access to broadband internet and reliable wifi in rural areas and within older inner cities is one of the impediments to future economic growth in these areas.

You can download a shapefile with the data and results of the analysis described in this post.