
Map of Ukraine Wheat Exports in 2021

Import and Export Data for Countries: Grain from Ukraine

I’ve been receiving more questions about geospatial data sources as the semester draws to a close. In the next couple of posts I’ll describe some sources that I haven’t used extensively before, beginning with data on bilateral trade: imports and exports between countries. We’ll look at the IMF’s Direction of Trade Statistics (DOTS) and the UN’s COMTRADE database. Both sources provide web-based portals, APIs, and bulk downloading. I’ll focus on the portals.

IMF Direction of Trade Statistics

IMF DOTS provides monthly, quarterly, and annual import and export data, represented as total dollar values for all goods exchanged. The annual data goes back to 1947, while the monthly / quarterly data goes back to 1960. All countries that are part of the IMF are included, plus a few others. Data for exports are published on a Free On Board (FOB) price basis, while imports are published on a Cost, Insurance, Freight (CIF) price basis. Here are definitions for each term, quoted directly from the OECD’s Statistical Glossary:

The f.o.b. price (free on board price) of exports and imports of goods is the market value of the goods at the point of uniform valuation, (the customs frontier of the economy from which they are exported). It is equal to the c.i.f. price less the costs of transportation and insurance charges, between the customs frontier of the exporting (importing) country and that of the importing (exporting) country.

The c.i.f. price (i.e. cost, insurance and freight price) is the price of a good delivered at the frontier of the importing country, including any insurance and freight charges incurred to that point, or the price of a service delivered to a resident, before the payment of any import duties or other taxes on imports or trade and transport margins within the country.

OECD Glossary of Statistical Terms

There are a few different ways to browse and search for data. Start with the Data Tables tab at the top, and Exports and Imports by Areas and Countries. The default table displays monthly exports by region and country for the entire world (you could switch to imports by selecting the Imports CIF tab beside the Exports FOB tab). Hitting the Calendar dropdown allows you to change the date range and frequency. Hitting the Country dropdown lets you select a specific region or country. In the example below, I’ve changed the calendar from months to years, and the country to Ukraine. By doing so, the table now depicts the total US dollar value of exports and imports between Ukraine and all other countries. The Export button at the top allows you to save the report in a number of formats, Excel being the most data-friendly option.

IMF DOTS Basic Report – Total Value of Exports from Ukraine, Last Five Years

While this is the quickest option, it comes with some downsides; the biggest one is that there are no unique identifiers for the countries, which matters if you want to join this table to a GIS vector file for mapping, or to another country-level table in a database.

A better approach is to return to the home page and use the Query tab, which allows you to get a unique identifier and filter out countries and regions that are not of interest.

DOTS Query Tab
  1. Under Columns, select the time frame and interval. For example, check Years for Frequency at the top, and change the dropdowns at the bottom from Months to Years. From -5 to 0 would give you the last five years in ascending order.
  2. Rows allows you to filter out countries or regions that you don’t want to see in the results. You can also change the attribute that is displayed. Once the menu is open, right click in an empty area and choose Attribute. Here you can choose a variant country name, or an ISO country code. ISO codes are commonly used for uniquely identifying countries.
  3. Indicator lets you choose Exports (FOB), Imports (CIF or FOB), or Trade Balance, all in US dollars.
  4. Counterpart country is the country or region that you want to show trade for, such as Ukraine in our previous example.
  5. The tabs along the top allow you to produce graphs instead of a table (View – Table), to pivot the table (Adjust), and calculate summaries like sums or averages (Advanced).
  6. Export to produce an Excel file. By choosing the ISO codes you’ll lose the country names, but you can join the result to another country data table or shapefile and grab the names from there.
Modify Time
Modify Rows – Country – Change Attribute
DOTS Modified Table to Export: Total Value of Exports from Ukraine Last Five Years

UN COMTRADE

If you want data on the exchange of specific goods and services, quantities in addition to dollar values, and exchanges beyond simple imports and exports, then the UN’s COMTRADE database will be your source. You need to register to download data, but you can generate previews without having to log in. There is an extensive wiki that describes how to use the different database tools, and summaries of technical terms that you need to know for extracting and interpreting the data. You’ll need some understanding of the different systems for classifying commodities and goods. Your options (the links that follow lead to documentation and code lists) are: the Harmonized System (HS), the Standard International Trade Classification (SITC), and the Broad Economic Categories (BEC). What’s the difference? Here are some summaries, quoted directly from a UN report on the BEC:

The HS classification is maintained by the World Customs Organization. Its main purpose is to classify goods crossing the border for import tariffs or for application of some non-tariff measures for safety or health reasons. The HS classification is revised on a five-year cycle (p. 18)

The original SITC was designed in the 1950s as a tool for collection and dissemination of international merchandise trade statistics that would help in establishing internationally comparable trade statistics. By its introduction in 1988, the HS took over as collection and dissemination tool, and SITC was thereon used mostly as an analytical tool. (p. 19)

The Classification by Broad Economic Categories (BEC) is an international product classification. Its main purpose is to provide a set of broad product categories for the analysis of trade statistics. Since its adoption in 1971, statistical offices around the world have used BEC to report trade statistics in a concise and meaningful way (p. iii). The broad economic categories of BEC include all subheadings of the HS classification. Therefore, the total trade in terms of HS equals the total trade of the goods side of BEC. (p. 18)

Classification of Broad Economic Categories Rev 5 (2018)

In short, go with the BEC if you’re interested in high-level groupings, or the HS if you need detailed subdivisions. The SITC would be useful if you need to go further back in time, or if it facilitates looking at certain subdivisions or groupings that the other systems don’t capture.

From COMTRADE’s homepage, I suggest leaving the defaults in place and doing a basic, preliminary search for all global exports for the most recent year, so you can see basic output on the next screen. Then you can apply filters for a narrower search.

For example, let’s look at annual exports of wheat from Ukraine to other countries. Under the HS filter, remove the TOTAL code. Start typing wheat, and you’ll see various product categories: 6-digit codes are the most specific, while 4-digit codes are broader groups that encapsulate the 6-digit categories. We’ll choose wheat and meslin 1001. We’ll select Ukraine as the Reporter (the country that supplied the statistics and represents the origin point), and for the 1st partner we’ll choose All to get a list of all countries that Ukraine exported wheat to. The 2nd partner country we’ll leave as World (alternatively, you would add specific countries here if you wanted to know if there were intermediary nations between the origin and destination).

UN COMTRADE Refine Search with Filters

Hit Preview to see the results. You can click on a heading to sort by dollar value, weight, or country name. Like IMF DOTS, UN COMTRADE measures dollar amounts of exports as FOB and imports as CIF. At this point, you would need to log in to download the data as a CSV (creating an account is free). You would also need to be logged in if you generated an extract that has more than 500 records, otherwise the results will be truncated. You could always copy and paste data for shorter extracts directly from the screen to a spreadsheet, but you wouldn’t get any of the extra metadata fields that come with the download, like ISO country codes and the classification codes for goods and merchandise.

COMTRADE Filtered Results – Exports of Wheat and Meslin from Ukraine 2021
Data Exported from COMTRADE to CSV with Identifiers

Mapping

For data from either source, if you want to map it you’ll need a data table with one row for each country, columns of attributes, and one column holding the ISO country code to serve as a unique identifier. Save the data table in an Excel file or as a table in a database. Download a country shapefile from Natural Earth. Add the shapefile and data table to a project and join them using the ISO code. Natural Earth shapefiles have several different ISO code columns that represent nations, sovereigns, and parent-child relationships; be sure you select the right one. Data table records that represent regions or groupings of countries (the EU, ASEAN, sums of smaller countries per continent not enumerated individually, etc.) will fall out of the dataset, as they won’t have a matching feature in the country shapefile. The map at the top of this post was created in QGIS, using COMTRADE and Natural Earth.
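If you’d rather script the join than do it entirely in a desktop GIS, here’s a minimal sketch using GeoPandas. The file names and the trade table’s column names (wheat_exports.csv, iso3, trade_value) are hypothetical placeholders, and while ISO_A3 is one of the ISO columns in the Natural Earth admin 0 countries layer, verify which code column actually matches your data before joining.

import pandas as pd
import geopandas as gpd

# Hypothetical export from COMTRADE or DOTS: one row per country,
# with an ISO alpha-3 code and a trade value column
trade = pd.read_csv('wheat_exports.csv', dtype={'iso3': str})

# Natural Earth admin 0 countries shapefile
countries = gpd.read_file('ne_110m_admin_0_countries.shp')

# Join on the ISO code; regional aggregates (EU, ASEAN, etc.) in the trade
# table won't match a feature and simply drop out of the map
joined = countries.merge(trade, how='left', left_on='ISO_A3', right_on='iso3')

# Quick choropleth to check the join before styling it properly in QGIS
joined.plot(column='trade_value', legend=True, missing_kwds={'color': 'lightgrey'})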

BEA Population Change Map

Population and Economic Time Series Data from the BEA

In this post, I’ll demonstrate how to access and download multiple decades of annual population data for US states, counties, and metropolitan areas in a single table. Last semester, I was helping a student in a GIS course find data on tuberculosis cases by state and metro area that stretched back several decades. In order to make meaningful comparisons, we needed to calculate rates, which meant finding an annual time series for total population. Using data directly from the Census Bureau is tough going, as they don’t focus on time series and you’d have to stitch together several decades of population estimates. Metropolitan areas add more complexity, as they are modified at least a few times each decade.

In searching for alternatives, I landed at the Bureau of Economic Analysis (BEA). Because part of their charge is studying the economy over time, they gather data from the Census Bureau, Bureau of Labor Statistics, and others to build time series, and they update them as geography and industrial classification schemes change. They publish national, state, metropolitan area, and county-level GDP, employment, income, wage, and population tables that span multiple decades. Their economic profile table for metros and counties covers 1969 to present, while the state profile table goes back to 1958. Metropolitan areas are aggregates of counties. As metro boundaries change, the BEA normalizes the data, adjusting the series by taking older county-level data and molding it to fit the most recent metro definitions.

Finding the population data was a bit tricky, as it is embedded as one variable in the Economic Profile table (identified as CAINC30) that includes multiple indicators. Here’s the path to get it:

  • From the BEA website, choose Tools – Interactive Data.
  • From the options on the next page, choose Regional from the National, Industry, International or Regional data options. There’s also a link to a video that illustrates how to use the BEA interactive data tool.
  • From the Regional Data page, click “Begin using the data” (but note you can alternatively “Begin mapping the data” to make some basic web maps too, like the one in the header of this post).
  • On the next page are categories, and under each category are data tables for specific series. In this case, Personal Income and Employment by County and Metropolitan Area was what I wanted, and under that the Economic Profile CAINC30 table (states appear under a different heading, but there’s a comparable Economic Profile table for them, SAINC30).
  • On the multi-screen table builder, you choose a type of geography (county or different metro area options), and on the next tab you can choose individual places, hold down CTRL and select several, or grab them all with the option at the top of the dropdown. Likewise for the Statistic, choose Population (persons), or grab a selection, or take all the stats. Under the Unit of Measure dropdown, Levels gives you the actual statistic, but you can also choose percent change, index values, and more. On the next tab, choose your years.
  • On the final page, if your selection is small enough you can preview the result and then download. In this case it’s too large, so you’re prompted to grab an Excel or CSV file to download.

And there you have it! One table with 50+ years of annual population data, using current metro boundaries. The footnotes at the bottom of the file indicate that the latest years of population data are based on the Census Bureau’s most recent vintage of population estimates. Once the final intercensal estimates for the 2010s are released, the data for that decade will be replaced a final time, and the estimates for the 2020s will be updated annually as each new vintage is released until we pass the 2030 census. Their documentation is pretty thorough.

BEA 5-decade Time Series of Population Data by Metro Area

The Interactive Data table approach allows you to assemble your series step by step. If you want to skip all the clicking, you can grab everything in one download (all years for all places for all stats in a given table). Of course, that means some filtering and cleaning post-download to isolate what you need. There’s also an API, and several other data series and access options to explore. The steps for creating the map that appears at the top of this post were similar to the steps for building the table.
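If you go the bulk-download route, the post-download filtering is a short pandas job. The sketch below is only illustrative: the file name is a placeholder, and the column names (GeoFIPS, GeoName, Description, plus one column per year) are my assumptions about the CAINC30 layout, so check them against the header of the file you actually download.

import pandas as pd

# Placeholder file name for a bulk CAINC30 download
cainc30 = pd.read_csv('CAINC30_ALL_AREAS.csv', dtype=str, encoding='latin-1')

# Keep just the population rows; footnote lines at the bottom of the file
# have no Description value and drop out here as well
pop = cainc30[cainc30['Description'].fillna('').str.contains('Population')]

# Reshape from one column per year to one row per area per year
year_cols = [c for c in pop.columns if c.strip().isdigit()]
pop_long = pop.melt(id_vars=['GeoFIPS', 'GeoName'], value_vars=year_cols,
                    var_name='year', value_name='population')
pop_long['population'] = pd.to_numeric(pop_long['population'], errors='coerce')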

NYC and NYMA Pop Change Graph 2000 to 2019

New York’s Population and Migration Trends in the 2010s

The Weissman Center for International Business at Baruch College just published my paper, “New York’s Population and Migration Trends in the 2010s”, as part of their Occasional Paper Series. In the paper I study population trends over the last ten years for both New York City (NYC) and the greater New York Metropolitan Area (NYMA) using annual population estimates from the Census Bureau (vintage 2019), county to county migration data (2011-2018) from the IRS SOI, and the American Community Survey (2014-2018). I compare NYC to the nine counties that are home to the largest cities in the US (cities with population greater than 1 million) and the NYMA to the 13 largest metro areas (population over 4 million) to provide some context. I conclude with a brief discussion of the potential impact of COVID-19 on both the 2020 census count and future population growth. Most of the analysis was conducted using Python and Pandas in Jupyter Notebooks available on my GitHub. I discussed my method for creating rank change grids, which appear in the paper’s appendix and illustrate how the sources and destinations for migrants change each year, in my previous post.

Terminology

  • Natural increase: the difference between births and deaths
  • Domestic migration: moves between two points within the United States
  • Foreign migration: moves between the United States and a US territory or foreign country
  • Net migration: the difference between in-migration and out-migration (measured separately for domestic and foreign)
  • NYC: the five counties / boroughs that comprise New York City
  • NYMA: the New York Metropolitan Area as defined by the Office of Management and Budget in Sept 2018, which consists of 10 counties in NY State (including the 5 NYC counties), 12 in New Jersey, and one in Pennsylvania
Map of the New York Metropolitan Area
The New York-Newark-Jersey City, NY-NJ-PA Metropolitan Area

Highlights

  • Population growth in both NYC and the NYMA was driven by positive net foreign migration and natural increase, which offset negative net domestic migration.
  • Population growth for both NYC and the NYMA was strong over the first half of the decade, but growth slowed as domestic out-migration increased from 2011 to 2017.
  • NYC and the NYMA began experiencing population loss from 2017 forward, as both foreign migration and natural increase began to decelerate. Declines in foreign migration are part of a national trend; between 2016 and 2019 net foreign migration for the US fell by 43% (from 1.05 million to 595 thousand).
  • The city and metro’s experience fits within national trends. Most of the top counties in the US that are home to the largest cities, and many of the largest metropolitan areas, experienced slower population growth over the decade. In addition to NYC, three counties experienced actual population loss toward the decade’s end: Cook (Chicago), Los Angeles, and Santa Clara (San Jose). The New York, Los Angeles, and Chicago metro areas also had declining populations by the latter half of the decade.
  • Most of NYC’s domestic out-migrants moved to suburban counties within the NYMA (representing 38% of outflows and 44% of net out-migration), and to Los Angeles County, Philadelphia County, and counties in Florida. Out-migrants from the NYMA moved to other large metros across the country, as well as smaller, neighboring metros like Poughkeepsie NY, Fairfield CT, and Trenton NJ. Metro Miami and Philadelphia were the largest sources of both in-migrants and out-migrants.
  • NYC and the NYMA lack any significant relationships with other counties and metro areas where they are net receivers of domestic migrants, receiving more migrants from those places than they send to those places.
  • NYC and the NYMA are similar to the cities and metros of Los Angeles and Chicago, in that they rely on high levels of foreign migration and natural increase to offset high levels of negative domestic migration, and have few substantive relationships where they are net receivers of domestic migrants. Academic research suggests that the absolute largest cities and metros behave this way, attracting both low- and high-skilled foreign migrants while redistributing middle- and working-class domestic migrants to suburban areas and smaller metros. This pattern of positive foreign migration offsetting negative domestic migration has characterized population trends in NYC for many decades.
  • During the 2010s, most of the City and Metro’s foreign migrants came from Latin America and Asia. Compared to the US as a whole, NYC and the NYMA have slightly higher levels of Latin American and European migrants and slightly lower levels of Asian and African migrants.
  • Given the Census Bureau’s usual residency concept and the overlap of the onset of the COVID-19 pandemic lockdown with the 2020 Census, in theory the pandemic should not alter how most New Yorkers identify their usual residence as of April 1, 2020. In practice, the pandemic has been highly disruptive to the census-taking process, which raises the risk of an undercount.
  • The impact of COVID-19 on future domestic migration is difficult to gauge. Many of the pandemic destinations cited in recent cell phone (NYT and WSJ) and mail forwarding (NYT) studies mirror the destinations that New Yorkers have moved to between 2011 and 2018. Foreign migration will undoubtedly decline in the immediate future given pandemic disruptions, border closures, and restrictive immigration policies. The number of COVID-19 deaths will certainly push down natural increase for 2020.

zbp_table

County and ZIP Code Business Patterns 2017 and the Census API

The U.S. Census Bureau’s County and ZIP Code Business Patterns (CBP and ZBP) datasets are generated annually from the Business Register, a large administrative database updated by several federal agencies that contains every business establishment in the U.S. with paid employees. Business establishments are defined as single physical locations where business is conducted or where services or industrial operations are performed. Establishments are assigned to industries, which are groups of businesses that produce similar products or provide similar services, using the North American Industry Classification System (NAICS). The ZBP contains tables with total establishments, employment, and wages by ZIP and counts of business establishments by NAICS and ZIP. The CBP has these tables plus a few others for counties.

The 2017 Business Patterns was recently released, and there are a few important changes to the dataset over previous iterations. I’ll summarize what they are and how they impact data retrieval using the Census Bureau’s ZBP API. I unwittingly discovered these issues this week as I was trying to use a Python / Pandas notebook I’d written for extracting ZBP data and aggregating the USPS ZIP codes to Zip Code Tabulation Areas (ZCTAs), which are used for publishing decennial and ACS census data. Everything went smoothly when I tested the scripts against the 2016 ZBP, but a few things went awry with 2017 and I was forced to make some revisions.

If you’re not familiar with the API, take a look at this earlier post for a basic introduction. The notebooks I’ll refer to are available on my GitHub; zbp_to_zcta.ipynb works for the 2017 ZBP release, and I kept the earlier version that worked for 2016.

2017 NAICS Codes

NAICS codes are revised every five years in tandem with the Economic Census (conducted in years ending in 2 and 7), to effectively capture the changing nature of the economy. The CBP and ZBP employ the latest NAICS series in the year that it’s released, so beginning with 2012 the 2012 NAICS were used for categorizing establishments into industries. The 2012 definitions were used up through 2016, but now that we’re in 2017 we have a new NAICS 2017 series, and this was employed for the 2017 CBP and ZBP and will be used through 2021.

How different are the categories? If you’re working at the broad two-digit sector level nothing has changed. The more detailed the categories are (3 to 6 digit), the more likely it is that you’ll encounter changes: industries that were created, or removed (aggregated into a broader miscellaneous category), or modified. You can use the concordance tables to see how definitions have changed, and in some cases crosswalk data from one category to another.

If you’re using the API, you’ll need to modify your url to access the 2017 NAICS variables (&NAICS2017=) as opposed to the 2012 series (&NAICS2012= ).
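In a script that builds the url, it’s just a matter of swapping the parameter name. A quick sketch, where the year, column list, NAICS sector (72), and ZIP Code are placeholder values and api_key is your own key:

# Sketch: switch the NAICS predicate based on the reference year
api_key = 'YOUR_CENSUS_API_KEY'   # placeholder
year = '2017'
dsource = 'zbp'
cols = 'ESTAB,EMP'                # placeholder column list
naics_field = 'NAICS2017' if int(year) >= 2017 else 'NAICS2012'

base_url = f'https://api.census.gov/data/{year}/{dsource}'
data_url = f'{base_url}?get={cols}&for=zipcode:10001&{naics_field}=72&key={api_key}'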

New Privacy Regulations

For confidentiality purposes, the Census Bureau has always employed various methods to ensure that the summary data produced for the CBP and ZBP can’t be used to identify characteristics of an individual business. If a geographic area or industrial category had fewer than 3 establishments in it, or if one establishment in an area or category constituted an overwhelming majority of the employment or wages, then those values were not disclosed or published. The only characteristic that was always published was the number of establishments.

Not any more – beginning with the 2017 CBP and ZBP, the following applies:

> Prior to reference year 2017, the number of establishments in a particular tabulation cell was not considered sensitive; therefore, counts of establishments were released without any disclosure avoidance methods applied. Beginning with reference year 2017, cells with fewer than 3 establishments have been omitted from the release.

So what does this mean? First, for any county or ZIP Code that has fewer than 3 business establishments in total, records for that county or ZIP Code will not appear in the dataset at all (although establishments in these areas will be counted in summaries of larger areas, like states or metro areas). In my script, about 30 ZIP Codes for NYC fell out of my results compared to last year; these were primarily non-residential ZIPs that represented a single business that processes lots of mail, and post office box ZIPs.

Second, for a given geographic area, if a given NAICS category has fewer than three business establishments, the number of establishments won’t be reported for that category, but they will be included in the sum total. Once again, in my case I’m working with two-digit sector codes. There is a 00 code that captures the sum of all establishments. When I was summing the values of all of the two-digit codes together, I discovered that these sums rarely matched the 00 total, as they did in the past, because of the new non-disclosure policy. To account for this, and to calculate percent totals correctly, I had to create a category that takes the difference between the total 00 category and the sum of all the others, to count how many businesses were not disclosed (see pic below). I could then treat that category like the others, and the sum of the parts would equal the whole again.

summary_naics

These data frames show counts of establishments by two-digit NAICS sectors. In the top df, the totals column N00 does not equal the sum of the other columns. A column was added to the bottom df to get the difference between the two.

Subsequently, I replaced the zeros with NULLs for any ZIP Code that had undisclosed businesses, as I can’t know for certain whether those values are truly zero. The most likely categories (at the two-digit level for ZIPs) where data was not disclosed were: 11 (agriculture), 21 (mining), 22 (utilities), and 99 (unclassified businesses).
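Here’s a minimal sketch of both adjustments, assuming a dataframe (called zbp_wide here, a hypothetical name) with one establishment-count column per two-digit sector plus the N00 total column shown in the screenshot above:

import numpy as np

# zbp_wide: hypothetical dataframe of establishment counts by ZIP and sector
# All of the two-digit sector columns, excluding the N00 grand total
sector_cols = [c for c in zbp_wide.columns if c.startswith('N') and c != 'N00']

# Establishments that were suppressed: the published total minus the sum
# of the disclosed sector counts
zbp_wide['N_undisclosed'] = zbp_wide['N00'] - zbp_wide[sector_cols].sum(axis=1)

# Where something was suppressed, a sector value of 0 may really mean
# "not disclosed", so replace those zeros with nulls
suppressed = zbp_wide['N_undisclosed'] > 0
zbp_wide.loc[suppressed, sector_cols] = zbp_wide.loc[suppressed, sector_cols].replace(0, np.nan)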

Looping Through and Retrieving Geographies

The API allows you to select all geographies within another geography using the ‘in’ clause (visit the ZBP API to see a list of variables and examples). For example, you can select all the counties in a particular state – in the example below, values would be passed into the variables in braces, and you would pass ANSI FIPS codes into the geography variables:

base_url = f'https://api.census.gov/data/{year}/{dsource}'
edata_url=f'{base_url}?get={ecols}&for={county}:*&in=state:{state}&key={api_key}'

This option is only available for geographies that nest, according to the Census Bureau’s geographic hierarchy. ZIP Codes are not a census geography and don’t nest within anything, so we can’t use the ‘in’ clause. For the 2016 and prior versions of the ZBP API, there was a trick for getting around this; there was a state variable called ST, which you could use in a similar fashion to get all the ZIP Codes in a state in a ‘for’ clause:

edata_url = f'{base_url}?get={ecols}&for=zipcode:*&ST={state}&key={api_key}'

Not any more – the ST variable disappeared in the 2017 API for the ZBP. So what can you do instead? Option one is to loop through a list of ZIP Codes, passing them to the API one by one. This is fine if you just need a few, but pretty slow if you need the 260-something that I needed. Option two is to pass several ZIP Codes into the URL at once, but there’s a catch: you’re only allowed to pass in 50 values at a time for any variable. To do this, you need to divide your list of ZIPs into chunks of no more than 50, loop through the sub-lists to insert them into the url, and append the results to a big list as you go along.

A function for breaking a list of ZIP Codes (or any list of variables) into chunks:

def chunks(l, n):
    # yield successive n-sized chunks from list l
    for i in range(0, len(l), n):
        yield l[i:i+n]

Call the function to generate a list of lists with an equal number of values (in my case, my ZIP Codes are an index in a dataframe):

reqzips=list(chunks(zip2zcta.index.tolist(),48))

Then run the following to iterate through the list of ZIP Code lists. I use enumerate so I can grab both the indices and values in the list. The ZIP Code values (v) have to be strung together and separated by commas before passing them into the url. The ecols variable is a list of columns I want to retrieve, which is also a single string with columns separated by commas. Once I receive the first chunk I append everything to a list (emp_data), but for every subsequent chunk I start reading from the second value [1:] and skip the first [0], because I only want to append the column headers once.

import requests
from IPython.display import clear_output

emp_data=[]
for i, v in enumerate(reqzips):
    batchzips=','.join(v)  # string the ZIP Codes together, separated by commas
    edata_url = f'{base_url}?get={ecols}&for=zipcode:{batchzips}&key={api_key}'
    response=requests.get(edata_url)
    if response.status_code==200:
        clear_output(wait=True)
        data=response.json()
        if i == 0:  # keep the header row from the first chunk only
            for record in data:
                emp_data.append(record)
        else:
            for record in data[1:]:
                emp_data.append(record)
        print('Retrieved data for chunk',i)
    else:
        print('***Problem with retrieval***, response code',response.status_code)
        break

The key here is to get the looping right, to ensure that you end up with a list of lists where each list represents a row of data, in this case a ZIP Code record with establishment data. I employed something similar (but a bit more complicated) with an ACS script that I wrote, but in that case I was looping through lists of columns / attributes instead of geographies.
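For what it’s worth, here’s a rough sketch of that column-oriented variant, reusing the same chunks function: batch the variable list (the census APIs cap each request at 50 variables), request each batch, and merge the resulting dataframes on the geography columns. The endpoint, variable names, FIPS code, and key below are illustrative placeholders, not the actual ACS script.

import requests
import pandas as pd

api_key = 'YOUR_CENSUS_API_KEY'                            # placeholder
base_url = 'https://api.census.gov/data/2018/acs/acs5'     # example ACS endpoint
state = '42'                                               # example state FIPS code
acs_cols = ['B01001_001E', 'B01001_002E', 'B19013_001E']   # illustrative variable list
col_batches = list(chunks(acs_cols, 48))                   # stay under the 50-variable cap

frames = []
for batch in col_batches:
    batch_url = f'{base_url}?get=NAME,{",".join(batch)}&for=county:*&in=state:{state}&key={api_key}'
    response = requests.get(batch_url)
    if response.status_code == 200:
        data = response.json()
        frames.append(pd.DataFrame(data[1:], columns=data[0]))
    else:
        print('Problem with retrieval, response code', response.status_code)
        break

# Merge the batches back together on the geography identifiers
acs_data = frames[0]
for frame in frames[1:]:
    acs_data = acs_data.merge(frame, on=['NAME', 'state', 'county'])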

If you’d like to learn more about the census business datasets and understand how to navigate NAICS, check out chapter 8 in my book. I don’t cover the APIs, but I do demonstrate how to use the new data.census.gov and I delve into the concepts behind these datasets in good detail.

Census Book

Exploring the US Census Book Published!

My book, Exploring the US Census: Your Guide to America’s Data, has been published! You can purchase it directly from SAGE Publishing or from Barnes & Noble, Amazon, or your bookstore of choice (it’s currently listed for pre-order on Amazon but its availability there is imminent). It’s $45 for the paperback, $36 for the ebook. Data for the exercises and supplemental material is available on the publisher’s website, and I’ve created a landing page for the book on this site.

Exploring the US Census is the definitive researcher’s guide to working with census data. I place the census within the context of US society, the open data movement, and the big data universe; provide a crash course on using the new data.census.gov; and introduce the fundamental concepts of census geography and subject categories (aka universes). One chapter is devoted to each of the primary datasets: the decennial census (with details about the 2020 census that’s just over the horizon), the American Community Survey, the Population Estimates Program, and business data from the Business Patterns, Economic Census, and BLS. Subsequent chapters demonstrate how to: integrate census data into writing and research, map census data in GIS, create derivative measures, and work with historic data and microdata with a focus on the Current Population Survey.

I wrote the book as a hybrid between a techie guidebook and an academic text. I provide hands-on exercises so that you learn by doing (techie) while supplying sufficient context so you can understand and evaluate why you’re doing it (academic). I demonstrate how to find and download data from several different sources, and how to work with the data using free and open source software: spreadsheets (LibreOffice Calc), SQL databases (DB Browser for SQLite), and GIS (QGIS). I point out the major caveats and pitfalls of working with the census, along with many helpful tools and resources.

The US census data ecosystem provides us with excellent statistics for describing, studying, and understanding our communities and our nation. It is a free and public domain resource that’s a vital piece of the country’s social, political, and economic infrastructure and a foundational element of American democracy. This book is your indispensable road map for navigating the census. Have a good trip!

See the series – census book tag for posts about the content of the book, additional material that expands on that content (but didn’t make it between the covers), and the writing process.

ZBP Data in a Notebook

Examples of using the Census Bureau’s API with Python

At the end of my book I briefly illustrate how the Census Bureau’s API works using Python. I’ll expand on that in this post; we’ll pull data from the Population Estimates Program, transform it, and create a chart using Python with Pandas in a Notebook. I’ll conclude with an additional example using the ZIP Code Business Patterns.

The Census Bureau has dedicated API pages for each dataset (decennial, ACS, population estimates, and more), and you need to familiarize yourself with the geographies and variables that are available for each. The API is a basic REST API, where you insert parameters into a base url and retrieve data based on the link you submit. Python has several modules you can use for interacting with APIs – the requests module is a popular choice.

The following pop estimates example is on GitHub (but if GitHub fails to render the notebook, see the nbviewer example instead).

The top of the script contains basic stuff – import the modules you need, read in your key, and define the variables that you want to pull. You don’t have to use an API key, but if you don’t you’re limited to pulling in 500 records a day. Requesting a key is simple and free. A best practice is to store your key (a long string of characters) in a file that you read in, so you’re not exposing it in the script. Most of the census APIs require that you pass in a year and a dataset (dsource). Larger datasets may be divided into subsets (dname); for example, the population estimates are divided into estimates, components of change, and characteristics (age, sex, race, etc.). Save the columns and geographies that you want to get in comma-separated strings. You have to consult the documentation and variable lists that are available for each dataset to build these, and the geography requires ANSI / FIPS codes.

%matplotlib inline
import requests,pandas as pd

with open('census_key.txt') as key:
    api_key=key.read().strip()

year='2018'
dsource='pep'
dname='components'
cols='GEONAME,NATURALINC,DOMESTICMIG,INTERNATIONALMIG'
state='42'
county='017,029,045,091,101'

Next, you can create the url. I’ve been doing this in two parts. The first part:

base_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}'

It includes the base https://api.census.gov/data/ followed by parameters that you fill in. The year, data source, and dataset name are the standard pieces. The output looks like this:

'https://api.census.gov/data/2018/pep/components'

Then you take that base_url and add additional parameters that are going to vary within the script, in this case the columns and the geography, which all appear in the ‘get’ portion of the url. The ‘for’ and ‘in’ options allow you to select the type of geography within another geography, in this case counties within states, and you pass in the appropriate ANSI FIPS codes from the string you’ve created. The key appears at the end of the url, but if you opt not to use it you can omit that part. Once the link is fully constructed you use the requests module to fetch the data using that url. You can print the result out as text (assuming it’s not too long).

data_url = f'{base_url}?get={cols}&for=county:{county}&in=state:{state}&key={api_key}'
response=requests.get(data_url)
print(response.text)

The result looks like a nested list, but it’s actually a string structured as a JSON array of arrays (rather than the more typical array of key / value objects):

[["GEONAME","NATURALINC","DOMESTICMIG","INTERNATIONALMIG","state","county"],
["Bucks County, Pennsylvania","-178","-605","862","42","017"],
["Chester County, Pennsylvania","1829","-887","1374","42","029"],
["Delaware County, Pennsylvania","1374","-2513","1579","42","045"],
["Montgomery County, Pennsylvania","1230","-1987","2315","42","091"],
["Philadelphia County, Pennsylvania","8617","-11796","8904","42","101"]]

To do anything with it, convert it to JSON with response.json(). Then you can convert it into a list, dictionary, or in this example a Pandas dataframe. Here, I build the dataframe with everything from row one forward [1:]; row zero [0] contains the column headers. I rename some of the columns, build a unique ID by concatenating the state and county FIPS codes and set that as the new index, and drop the individual county and state FIPS columns. By default every value that’s returned is a string, so I convert the numeric columns to integers:

data=response.json()
df=pd.DataFrame(data[1:], columns=data[0]).\
    rename(columns={"NATURALINC": "Natural Increase", "DOMESTICMIG": "Net Domestic Mig", "INTERNATIONALMIG":"Net Foreign Mig"})
df['fips']=df.state+df.county
df.set_index('fips',inplace=True)
df.drop(columns=['state','county'],inplace=True)
df=df.astype({'Natural Increase':'int64','Net Domestic Mig':'int64','Net Foreign Mig':'int64'})
df

Then I can see the result:

pep dataframe

Once the data is in good shape, you can begin to analyze and visualize it. Here’s the components of population change for Philadelphia and the surrounding suburban counties in Pennsylvania from 2017 to 2018 – natural increase is the difference between births and deaths, and there’s net migration within the US (domestic) and between the US and other countries (foreign):

labels=df['GEONAME'].str.split(' ',expand=True)[0]
ax=df.plot.bar(rot=0, title='Components of Population Change 2017-18')
ax.set_xticklabels(labels)
ax.set_xlabel('')

Components of Population Change Plot

Each request is going to vary based on your specific needs and the construction of the particular dataset. Here’s another example where I pull data on business establishments, employees, and wages (in thousands of dollars) from the ZIP Code Business Patterns (ZBP). This dataset is smaller, so it doesn’t have a dataset name, just a data source. To get all the ZIP Codes in Delaware I use the asterisk * wildcard. Because ZIP Codes do not nest within states I can’t use the ‘in’ option; it’s simply not available. A state code is stored in a special field called ST, and I can use it as a general limiter with equals in the query:

year='2016'
dsource='zbp'
cols='ESTAB,EMP,PAYQTR1,PAYANN'
state='10'

base_url = f'https://api.census.gov/data/{year}/{dsource}'

data_url = f'{base_url}?get={cols}&for=zipcode:*&ST={state}&key={api_key}'
response=requests.get(data_url)
print(response.text)
[["ESTAB","EMP","PAYQTR1","PAYANN","ST","zipcode"],
["982","26841","448380","1629024","10","19713"],
["22","628","3828","15848","10","19716"],
["8","15","371","2030","10","19732"],
["7","0","0","0","10","19718"],
["738","9824","83844","353310","10","19709"]...
data=response.json()
zbp_data=pd.DataFrame(data[1:], columns=data[0]).set_index('zipcode')
zbp_data.drop(columns=['ST'],inplace=True)
for field in cols.split(','):
    zbp_data=zbp_data.astype({field:'int64'})
zbp_data.head()

ZBP Data for Delaware

One of the issues with the ZBP is that many variables are not disclosed due to privacy regulations; instead of returning nulls a zero is returned, but in this dataset they are not true zeros. Once you retrieve the data and set the types you can replace zeros with NaNs, which are numpy / Pandas nulls – although there’s a quirk in that dataframe columns declared as integers cannot contain null values. Instead you can use a float, or a workaround that’s been implemented for newer Pandas versions (for my specific use case this data will be inserted into a database, so I’ll use SQL to accomplish the zero to null conversion). ZBP data is also injected with noise to protect privacy, and you can retrieve special columns that contain noise flags.
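If you do want to handle it in pandas, here’s a short sketch using the nullable integer type (‘Int64’, with a capital I) that newer pandas versions provide; on older versions, leaving the columns as floats works just as well:

import numpy as np

# Swap the not-really-zero values for nulls, then use pandas' nullable
# integer dtype so the columns remain integers despite containing nulls
zbp_nulls = zbp_data.replace(0, np.nan).astype('Int64')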

The API is convenient for automating the data acquisition process, and allows you to cherry pick the variables you want. To avoid accessing the API over and over again as you build your scripts (which is prohibitive when requesting lots of data) you can pickle the data right after you retrieve it – a pickle is a Python data object that efficiently stores data locally, and pandas has special functions for creating and accessing them. Once you pull your data and pickle it, you can comment out (or in a notebook, don’t rerun) the requests block, and subsequently pull the data from the pickle as you tweak your code (see caveat in the postscript – perhaps best to use JSON instead of pickle).

#Write to a pickle
zbp_data.to_pickle('insert path here.pickle')
#Read from a pickle to dataframe
zbp_new=pd.read_pickle('insert path here.pickle')

Take a look at the Census Data API User Guide to learn more. The guide focuses just on the REST API, and is not specific to a scripting language. Of course, you also need to familiarize yourself with the datasets and how they’re created and organized, and with census geography (which is why I wrote this book).

Postscript

Since finishing this post, I’ve created a notebook that pulls ZBP data from the API (alt nbviewer here) and have some additional thoughts I’d like to share:

  1. I decided to dump the data I retrieved from the API to a JSON file and then pull data from it, instead of using a pickle. Pickles come with serious security issues. If you don’t intend to share your code with anyone, pickles are fine; otherwise, consider an alternative (see the sketch after this list).
  2. My method for parsing the retrieved data into a dataframe worked fine because the census API returns a simple JSON array of arrays, which is essentially a nested Python list once parsed. If it returned the more typical JSON of key / value objects, we’d need a different method to account for the fact that the number of elements per record may vary.
  3. Wildcards are not always available to build urls for certain data; for example to download the number of establishments classified by industry I wasn’t able to grab everything for one state using the method I illustrated in this post. Instead I had to loop through a list of ZIP and NAICS codes to retrieve what I wanted one at a time.
  4. In the case of retrieving establishments classified by industry there were many cases when there was no data for a particular ZIP Code (i.e. no farms and mines in midtown Manhattan). Since I needed records that showed zero establishments, I had to insert them myself if the API returned no result. Even if you didn’t need records with zeros, it’s important to consider the potential impact of getting nothing back from the API on your subsequent code.
  5. Given my experience thus far these APIs were pretty reliable, in that I haven’t had issues with time outs and partially returned data. If this was not the case and you had lots of data to retrieve, you would need to build in some try – except statements to handle exceptions, save data as you go along, and pick up where you left off if something breaks. Read about this geocoding script I wrote a few years back for examples.