
[Image: ZBP Data in a Notebook]

Examples of using the Census Bureau’s API with Python

At the end of my book I briefly illustrate how the Census Bureau’s API works using Python. I’ll expand on that in this post; we’ll pull data from the Population Estimates Program, transform it, and create a chart using Python with Pandas in a Notebook. I’ll conclude with an additional example using the ZIP Code Business Patterns.

The Census Bureau has dedicated API pages for each dataset (the decennial census, the ACS, the population estimates, and more), and you need to familiarize yourself with the geographies and variables that are available for each. The API is a basic REST API: you insert parameters into a base URL and retrieve data based on the link you submit. Python has several modules you can use for interacting with APIs; the requests module is a popular choice.

The following population estimates example is on GitHub (but if GitHub fails to render the notebook, see the nbviewer version instead).

The top of the script contains the basic stuff: import the modules you need, read in your key, and define the variables that you want to pull. You don't have to use an API key, but without one you're limited to 500 queries a day. Requesting a key is simple and free. A best practice is to store your key (a long alphanumeric string) in a file that you read in, so you're not exposing it in the script. Most of the census APIs require that you pass in a year and a data source (dsource). Larger datasets may be divided into subsets (dname); for example, the population estimates are divided into estimates, components of change, and characteristics (age, sex, race, etc.). Save the columns and geographies that you want to get in comma-separated strings. You have to consult the documentation and variable lists that are available for each dataset to build these, and the geography requires ANSI / FIPS codes.

%matplotlib inline
import requests
import pandas as pd

# read the API key in from a local file so it isn't exposed in the script
with open('census_key.txt') as key:
    api_key = key.read().strip()

year = '2018'
dsource = 'pep'        # data source: population estimates program
dname = 'components'   # dataset: components of population change
cols = 'GEONAME,NATURALINC,DOMESTICMIG,INTERNATIONALMIG'
state = '42'                      # Pennsylvania
county = '017,029,045,091,101'    # Bucks, Chester, Delaware, Montgomery, Philadelphia

Next, you can create the URL. I've been doing this in two parts. The first part:

base_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}'

includes the base https://api.census.gov/data/ followed by the parameters that you fill in. The year, data source, and dataset name are the standard pieces. The output looks like this:

'https://api.census.gov/data/2018/pep/components'

Then you take that base_url and add the parameters that will vary within the script, in this case the columns and the geography, which all appear in the 'get' portion of the URL. The 'for' and 'in' options allow you to select one type of geography within another, in this case counties within states, and you pass in the appropriate ANSI / FIPS codes from the strings you've created. The key appears at the end of the URL; if you opt not to use one you can omit that part. Once the link is fully constructed you use the requests module to fetch the data from that URL. You can print the result out as text (assuming it's not too long).

# construct the full request, fetch the data, and print the raw response
data_url = f'{base_url}?get={cols}&for=county:{county}&in=state:{state}&key={api_key}'
response = requests.get(data_url)
print(response.text)

The result looks like a nested list, but it's actually a string of JSON that's structured as an array of arrays (a header row followed by data rows) rather than the more typical array of objects:

[["GEONAME","NATURALINC","DOMESTICMIG","INTERNATIONALMIG","state","county"],
["Bucks County, Pennsylvania","-178","-605","862","42","017"],
["Chester County, Pennsylvania","1829","-887","1374","42","029"],
["Delaware County, Pennsylvania","1374","-2513","1579","42","045"],
["Montgomery County, Pennsylvania","1230","-1987","2315","42","091"],
["Philadelphia County, Pennsylvania","8617","-11796","8904","42","101"]]

To do anything with it, convert it to JSON with response.json(). Then you can convert it into a list, a dictionary, or, in this example, a Pandas dataframe. Here, I build the dataframe with everything from row one forward [1:]; row zero [0] contains the column headers. I rename some of the columns, build a unique ID by concatenating the state and county FIPS codes and set that as the new index, and drop the individual county and state FIPS columns. By default every value that's returned is a string, so I convert the numeric columns to integers:

data = response.json()
# row zero holds the headers; rows one onward hold the data
df = pd.DataFrame(data[1:], columns=data[0]).rename(
    columns={'NATURALINC': 'Natural Increase',
             'DOMESTICMIG': 'Net Domestic Mig',
             'INTERNATIONALMIG': 'Net Foreign Mig'})
df['fips'] = df.state + df.county    # unique ID: state + county FIPS codes
df.set_index('fips', inplace=True)
df.drop(columns=['state', 'county'], inplace=True)
# astype has no inplace option, so assign the result back to df
df = df.astype({'Natural Increase': 'int64',
                'Net Domestic Mig': 'int64',
                'Net Foreign Mig': 'int64'})
df

Then I can see the result:

[Image: the PEP dataframe]

Once the data is in good shape, you can begin to analyze and visualize it. Here are the components of population change for Philadelphia and the surrounding suburban counties in Pennsylvania from 2017 to 2018: natural increase is the difference between births and deaths, and net migration is measured both within the US (domestic) and between the US and other countries (foreign):

# use the first word of each county name as a short axis label
labels = df['GEONAME'].str.split(' ', expand=True)[0]
ax = df.plot.bar(rot=0, title='Components of Population Change 2017-18')
ax.set_xticklabels(labels)
ax.set_xlabel('')

[Image: components of population change plot]

Each request is going to vary based on your specific needs and the construction of the particular dataset. Here's another example where I pull data on business establishments, employees, and wages (in thousands of dollars) from the ZIP Code Business Patterns (ZBP). This dataset is smaller, so it doesn't have a dataset name, just a data source. To get all the ZIP Codes in Delaware I use the asterisk * as a wildcard. Because ZIP Codes do not nest within states, I can't use the 'in' option; it's simply not available here. Instead, the state code is stored in a special field called ST, which I can use as a general limiter with an equals sign in the query:

year = '2016'
dsource = 'zbp'    # ZBP has no dataset name, just a data source
cols = 'ESTAB,EMP,PAYQTR1,PAYANN'
state = '10'       # Delaware

base_url = f'https://api.census.gov/data/{year}/{dsource}'

# the * wildcard requests every ZIP Code; ST={state} limits results to Delaware
data_url = f'{base_url}?get={cols}&for=zipcode:*&ST={state}&key={api_key}'
response = requests.get(data_url)
print(response.text)

[["ESTAB","EMP","PAYQTR1","PAYANN","ST","zipcode"],
["982","26841","448380","1629024","10","19713"],
["22","628","3828","15848","10","19716"],
["8","15","371","2030","10","19732"],
["7","0","0","0","10","19718"],
["738","9824","83844","353310","10","19709"]...

data = response.json()
zbp_data = pd.DataFrame(data[1:], columns=data[0]).set_index('zipcode')
zbp_data.drop(columns=['ST'], inplace=True)
# convert each retrieved column from string to integer
# (astype has no inplace option, so assign the result back)
for field in cols.split(','):
    zbp_data = zbp_data.astype({field: 'int64'})
zbp_data.head()

[Image: ZBP data for Delaware]

One of the issues with the ZBP is that many values are not disclosed due to privacy regulations; instead of returning nulls, the API returns zeros, and in this dataset they are not true zeros. Once you retrieve the data and set the types you can replace the zeros with NaNs, which are NumPy / pandas nulls. There's a quirk, though: dataframe columns declared as integers cannot contain null values. You can use floats instead, or the nullable integer type (Int64) that's been implemented in newer versions of Pandas (for my specific use case this data will be inserted into a database, so I'll use SQL to accomplish the zero-to-null conversion). ZBP data is also injected with noise to protect privacy, and you can retrieve special columns that contain noise flags.
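Here's a minimal sketch of that conversion using the nullable Int64 type. It naively treats every zero in every column as a suppressed value, which you'd want to verify against the noise flags before relying on it, and it requires a recent version of Pandas (pd.NA arrived in 1.0):

# convert to the nullable Int64 dtype (capital I), which permits nulls
# in integer columns, then swap the zeros for proper null values
masked = zbp_data.astype('Int64').replace(0, pd.NA)
masked.head()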

The API is convenient for automating the data acquisition process, and it allows you to cherry-pick just the variables you want. To avoid hitting the API over and over as you build your scripts (which is prohibitive when requesting lots of data), you can pickle the data right after you retrieve it. A pickle is a serialized Python object stored locally, and Pandas has special functions for creating and reading them. Once you pull your data and pickle it, you can comment out (or in a notebook, simply not rerun) the requests block, and subsequently pull the data from the pickle as you tweak your code (see the caveat in the postscript; it's perhaps best to use JSON instead of a pickle).

# Write the dataframe to a pickle
zbp_data.to_pickle('insert path here.pickle')
# Read from a pickle back into a dataframe
zbp_new = pd.read_pickle('insert path here.pickle')

Take a look at the Census Data API User Guide to learn more. The guide focuses just on the REST API and is not specific to any scripting language. Of course, you also need to familiarize yourself with the datasets, with how they're created and organized, and with census geography (which is why I wrote this book).

Postscript

Since finishing this post I've created a notebook that pulls ZBP data from the API (alt nbviewer here), and I have some additional thoughts I'd like to share:

  1. I decided to dump the data I retrieved from the API to a JSON file and then pull data from that file instead of using a pickle (see the sketch after this list). Pickles come with serious security issues: unpickling a file can execute arbitrary code, so you should never load a pickle you don't trust. If you don't intend to share your code with anyone pickles are fine; otherwise consider an alternative.
  2. My method for parsing the retrieved data into a dataframe worked fine because the census API returns non-standard JSON: essentially, the string that's returned resembles a nested Python list, so every record has the same number of elements. If the API returned a more typical array of JSON objects, we'd need a different method to account for the fact that the number of elements per record may vary (a small illustration follows this list).
  3. Wildcards are not always available to build URLs for certain data; for example, to download the number of establishments classified by industry I wasn't able to grab everything for one state using the method I illustrated in this post. Instead I had to loop through a list of ZIP and NAICS codes and retrieve what I wanted one request at a time (see the looping sketch after this list).
  4. When retrieving establishments classified by industry there were many cases where there was no data for a particular ZIP Code (i.e. no farms or mines in midtown Manhattan). Since I needed records that showed zero establishments, I had to insert them myself when the API returned no result. Even if you don't need records with zeros, it's important to consider how getting nothing back from the API will affect your subsequent code.
  5. In my experience thus far these APIs have been pretty reliable: I haven't had issues with timeouts or partially returned data. If that were not the case and you had lots of data to retrieve, you would need to build in some try / except statements to handle exceptions, save data as you go along, and pick up where you left off if something breaks. Read about this geocoding script I wrote a few years back for examples.
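On the first point, here's a minimal sketch of the dump-and-reload approach; the file name zbp_data.json is a stand-in for whatever path you prefer:

import json

# save the raw API response (a nested list) to a JSON file...
with open('zbp_data.json', 'w') as f:
    json.dump(data, f)

# ...and on later runs, skip the API and read from the local file instead
with open('zbp_data.json') as f:
    data = json.load(f)

zbp_data = pd.DataFrame(data[1:], columns=data[0]).set_index('zipcode')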
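On the second point, if the response were an array of objects, Pandas can actually absorb records with varying keys directly, filling the gaps with nulls; here's a tiny illustration with made-up records:

# records as a list of dictionaries; the second one is missing EMP,
# and pd.DataFrame fills that gap with NaN automatically
records = [{'zipcode': '19713', 'ESTAB': 982, 'EMP': 26841},
           {'zipcode': '19716', 'ESTAB': 22}]
pd.DataFrame(records).set_index('zipcode')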
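And on points three through five, a sketch of the one-request-at-a-time pattern. The zips and naics lists and the NAICS2012 variable are hypothetical stand-ins (consult the ZBP variable list for the actual names); the loop inserts a zero record when nothing comes back and wraps each request in try / except so a single failure doesn't sink the whole run:

import time

zips = ['19713', '19716']    # hypothetical lists; substitute the real
naics = ['23', '44-45']      # ZIP and NAICS codes you need

records = []
for z in zips:
    for n in naics:
        url = f'{base_url}?get=ESTAB&for=zipcode:{z}&NAICS2012={n}&key={api_key}'
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            rows = r.json()
            records.append(rows[1])    # first row after the header
        except Exception:
            # request failed or returned no data: insert a zero record ourselves
            records.append(['0', n, z])
        time.sleep(1)    # pause between requests to be polite to the API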

Navigating the New data.census.gov

June 2019 is the final month that the Census Bureau will post new data in the American Factfinder (AFF). From this point forward, all new datasets will be published via the new data dissemination platform data.census.gov. The second chapter of my book (now available for pre-order!) is devoted to navigating this new interface. In this post I’ll provide a preview / brief tutorial of the advanced search functions.

The new interface is search-driven, so you can type the names of topics and geographies or table ID numbers to find and explore data tables. There are spiffy data profiles for several geographies, and you have the ability to make basic thematic maps. The search interface makes it much easier to casually browse and discover data, so go ahead and explore.

I’d still recommend having a search strategy to find precisely what you need. Keyword searching alone isn’t going to cut it, because you’re searching across tens of thousands of tables in dozens of datasets. The good news is that the same strategy I’ve used for the AFF can be applied to data.census.gov: use the advanced search to filter by survey, year, geography, and topic to narrow down the list of possible tables to a manageable number, and then search or browse through those results to find what you need.

Let's say we want to download the most recent data on home values for all the counties in Pennsylvania (or a state of your choosing). On data.census.gov, click on the advanced search link under the search box. On the advanced search page, scroll down to the filters. We'll address them one by one:

Surveys. These represent all the different census datasets. Select ACS 5-Year Estimates Detailed Tables. Detailed socio-economic characteristics of the population are primarily published in the ACS. The 1-year estimates are published only for geographies that have at least 65k people; since most states have rural counties that fall below this threshold, we'll have to use the 5-year estimates to get all the counties. The detailed tables are narrow, focusing on estimates for a single variable. The other options include profiles (lots of different data for one place) and subject tables (narrower in scope than profiles, but broader than the detailed tables).

[Image: filter by survey]

Years. At the moment 2017 is the latest year for the ACS, so let’s select that. This quickly eliminates a lot of tables that we’re not interested in.

Geography. Choose 050 – County, then scroll down and choose Pennsylvania in the County (State) list, then All counties in Pennsylvania in the final list.

[Image: filter by geography]

Topics. For this example choose Housing, then Financial Characteristics, then Housing Value and Purchase Price. Of all the filter options, this one is the most open-ended and may require some experimentation based on what you're looking for.

[Image: filter by topic]

Codes. We don’t need to filter by codes in this example, but if we were searching for labor or business-related data we’d use this filter to limit results to specific sectors or industries by NAICS codes.

Underneath the filter menu, click the View All Results button. This brings us to the first results page, which provides a list of tables, maps, and pages related to our search. Click the button to View All Tables under the tables section.

This brings us to the table results page; the list of tables is displayed on the left, and the currently selected table is displayed on the right; in this case Value of owner-occupied housing units is shown, with counts of units by value brackets. At this stage, we can scroll through the list and browse to find tables with data that we’re interested in. We can also access the filters at the top of the list, if we want to modify our search parameters.

[Image: table results]

A little further down the results list is a table for Median Value. Selecting that table will preview it on the right. Hit the Customize Table button; this opens the table in its own dedicated view. Hit the blue drop-down arrow to the right of the table name, and you can modify the geography, year, or time period on the left. On the right is a Download option. Hit Download and you'll be prompted to save a CSV file; the download contains three text files with the data itself, the metadata, and descriptive information about the download.

[Image: customize table]

Back on the customize table page, you can navigate back to the table results by clicking on “Tables” in the breadcrumb links that appear in the top left-hand corner. Then you can browse and choose additional tables.

That’s it! Not bad, right? Well, there are always caveats. At the moment, the biggest one is that you can’t easily download most geographies that are contained within other geographies. With one click we can filter to select all counties within a state, or all states within the nation. But if we wanted all census tracts in a county or all county subdivisions in a state, there aren’t any “All geographies in…” options for these geographies. We’d have to select each and every tract within a county, one at a time…

While data.census.gov is now relatively stable, it’s still under development and additional features like this should (hopefully) be implemented as time passes between now and the 2020 census. This is one reason why the American Factfinder will survive for another year, as we’ll still need to lean on it to accomplish certain tasks. Of course, there are other options within the Census Bureau (the API, the FTP site) and without (NHGIS, MCDC, Census Reporter) for accessing data.

The new platform currently provides access to several datasets from the present back to the year 2010: the decennial census, the ACS, population estimates, and several of the business datasets. The first new datasets that will be published in data.census.gov (and NOT in the AFF) include the 2017 Economic Census this summer and the 1-year 2018 ACS in September.

View the Release Notes and FAQs for more details about the platform: general documentation, recent developments, bugs, and planned enhancements. The Census Bureau also has an archived webinar with slides that discuss the transition.

Updated QGIS Tutorial for 3.4

I recently released an updated version of the manual and data I use for my day-long GIS Practicum, Introduction to GIS Using Open Source Software (Using QGIS). The manual has five chapters: a summary overview of GIS, basics of using the QGIS interface, GIS analysis that includes several geoprocessing and analysis functions, thematic mapping and map layout, and a summary of where to find data and resources for learning more. Chapters 2, 3, and 4 are broken down into sections with clear steps, followed by commentary that explains what we did and why. We cover much of the material in a single day, although you can space the lessons out into two days if desired.

I updated this version to move us from QGIS 2.18 Las Palmas to 3.4 Madeira, the former and current long term releases. The move from 2.x to 3.x involved a major rewrite of the code base (see the change log for details), but most of the basics remain the same. Veteran users can easily navigate the differences, but they can be a stumbling block for new users who are trying to learn a new version from an old tutorial with screens and tools that are slightly different. So it was time for an update!

My goal for this edition was to keep my examples in place but revise the steps based on changes in the interface. Most of the screenshots are new, and the substantive changes include: using the Data Source Manager for adding layers rather than the toolbar with tons of buttons; better support for xlsx and ods files, which allowed me to de-emphasize xls and dbf files for attribute table joins; the addition of geopackages to the vector data mix; the loss of the OpenLayers plugin and my revision of the web mapping section to use OSM XYZ tiles; the disappearance of the setting that allowed you to disable on-the-fly projection; and the discontinuation of the stand-alone QGIS Browser. There were also changes to some tools (the fixed distance and variable buffer tools are now united under one tool) and to the names of menus (the Style menu has once again become the Symbology menu).

It's hard to believe that this is the ninth edition of this tutorial. I try to update it once a year to keep in sync with the latest long term release, but fell a bit behind this year. QGIS 2.18 also survived for longer than other releases, as the earlier 3.x versions went through lots of testing before ending up at 3.4. When it comes time for the tenth edition I may change the thematic mapping example in chapter 4 to something that's global instead of US national, and in doing so streamline the content. We'll see if I have some time this summer.

Since I’m in update mode, I also fixed several links on the Resources page to cure creeping link rot.