analysis

Stata splash screen

Introduction to Stata Tutorial

This month’s post will be brief, but helpful for anyone who wants to learn Stata. I wrote a short tutorial called First Steps with Stata for an introductory social science data course I visited earlier this month. It’s intended for folks who have never used a statistical package or a command-driven interface, and represents initial steps prior to doing any statistical analysis:

  1. Loading and viewing Stata dta files
  2. Describing and summarizing data
  3. Modifying and recoding data
  4. Batch processing with Do files / scripts
  5. Importing data from other formats

I chose the data that I used in my examples to illustrate the difference between census microdata and summary data, using sample data from the Current Population Survey to illustrate the former, and a table from the American Community Survey to represent the latter.

I’m not a statistician by training; I know the basics but rely on Python, databases, Excel, or GIS packages instead of stats packages. I learned a bit of Stata on my own in order to maintain some datasets I’m responsible for hosting, but to prepare more comprehensively to write this tutorial I relied on Using Stata for Quantitative Analysis, which I highly recommend. There’s also an excellent collection of Stata learning modules created by UCLA’s Advance Research Computing Center. Stata’s official user documentation is second to none for clearly introducing and explaining the individual commands and syntax.

In my years working in higher ed, the social science and public policy faculty I’ve met have all sworn by Stata over the alternatives. A study of citations in the health sciences, where stats packages used for the research were referenced in the texts, illustrates that SPSS is employed most often, but that Stata and R have increased in importance / usage over the last twenty years, while SAS has declined. I’ve had some students tell me that Stata commands remind them of R. In searching through the numerous shallow reviews and comparisons on the web, I found this post from a data science company that compares R, Python, SPSS, SAS, and Stata to be comprehensive and even-handed in summarizing the strengths and weaknesses of each. In short, I thought Stata was fairly intuitive, and the ability to batch script commands in Do files and to capture all input / output in logs makes it particularity appealing for creating reproducible research. It’s also more affordable than the other proprietary stats packages and runs on all operating systems.

Example – print the first five records of the active dataset to the screen for specific variables:

list statefip age sex race cinethp in f/5
     +-------------------------------------------+
     | statefip   age      sex    race   cinethp |
     |-------------------------------------------|
  1. |  alabama    76   female   white       yes |
  2. |  alabama    68   female   black       yes |
  3. |  alabama    24     male   black        no |
  4. |  alabama    56   female   black        no |
  5. |  alabama    80   female   white       yes |
     |-------------------------------------------|
Noise Complaint Kernels and Contours

Kernel Density and Contours in QGIS: Noisy NYC

In spatial analysis, kernel density estimation (colloquially referred to as a type of “hot spot analysis”) is used to explore the intensity or clustering of point-based events. Crimes, parking tickets, traffic accidents, bird sightings, forest fires, incidents of infections disease, anything that you can plot as a point at a specific period in time can be studied using KDE. Instead of looking at these features as a distribution of discrete points, you generate a raster that represents a continuous surface of values. You can either measure the density of the incidents themselves, or the concentration of a specific attribute that is tied to those incidents (like the dollar amount of parking tickets or the number of injuries in traffic accidents).

In this post I’ll demonstrate how to do a KDE analysis in QGIS, but you can easily implement KDE in other software like ArcGIS Pro or R. Understanding the inputs you have to provide to produce a meaningful result is more important than the specific tool. This YouTube video produced by the SEER Lab at the University of Florida helped me understand what these inputs are. They used the SAGA kernel tool within QGIS, but I’ll discuss the regular QGIS tool and will cover some basic data preparation steps when working with coordinate data. The video illustrates a KDE based on a weight, where there were single points that had a count-based attribute they wanted to interpolate (number of flies in a trap). In this post I’ll cover simple density based on the number of incidents (individual noise complaints), and will conclude by demonstrating how to generate contour lines from the KDE raster.

For a summary of how KDE works, take a look at the entry for “Kernel” in the Encyclopedia of Geographic Information Science (2007) p 247-248. For a fuller treatment, I always recommend Christopher Lloyd’s Spatial Data Analysis: An Introduction to GIS Users (2010) p 93-97 by Oxford Press. There’s also an explanation in the ArcGIS Pro documentation.

Data Preparation

I visited the NYC Open Data page and pulled up the entry for 311 Service Requests. When previewing the data I used the filter option to narrow the records down to a small subset; I chose complaints that were created between June 1st and 30th 2022, where the complaint type began with “Noise”, which gave me about 75,000 records (it’s a noisy town). Then I hit the Export button and chose one of the CSV formats. CSV is a common export option from open data portals; as long as you have columns that contain latitude and longitude coordinates, you will be able to plot the records. The NYC portal allows you to filter up front; other data portals like the ones in Philly and DC package data into sets of CSV files for each year, so if you wanted to apply filters you’d use the GIS or stats package to do that post-download. If shapefiles or geoJSON are provided, that will save you the step of having to plot coordinates from a CSV.

NYC Open Data 311 Service Requests

With the CSV, I launched QGIS, went to the Data Source Manager, and selected Delimited Text. Browsed for the file I downloaded, gave the layer a common sense name, and under geometry specified Point coordinates, and confirmed that the X field was my longitude column and the Y field was latitude. Ran the tool, and the points were plotted in the basic WGS 84 longitude / latitude system in degrees, which is the system the coordinates in the data file were in (generally a safe bet for modern coordinate data, but not always the case).

QGIS Add Delimited Text and Plot Coordinates

The next step was to save these plotted points in a file format that stores geometry and allows us to do spatial analysis. In doing that step, I recommend taking two additional ones. First, verify that all of the plotted data have coordinates – if there are any records where lat and long are missing, those records will be carried along into the spatial file but there will be no geometry for them, which will cause problems. I used the Select Features by Expression tool, and in the expression window typed “Latitude” is not null to select all the features that have coordinates.

QGIS Select by Expression

Second, transform the coordinate reference system (CRS) of the layer to a projected system that uses meters or feet. When we run the kernel tool, it will ask us to specify a radius for defining the density, as well as the size of the pixels for the output raster. Using degrees doesn’t make sense, as it’s hard for us to conceptualize distances in degrees, and they are not a constant unit of measurement. If you’ve googled around and read Stack Exchange posts or watched videos where a person says “You just have to experiment and adjust these numbers until your map looks Ok”, they were working with units in fractions of degrees. This is not smart. Transform the system of your layers!

I selected the layer, right clicked, Export, Save Selected Features As. The default output is a geopackage, which is fine. Otherwise you could select ESRI shapefile, both are vector formats that store geometry. For file name I browse … and save the file in a specific folder. Beside CRS I hit the globe button, and in the CRS Selector window typed NAD83 Long Island in the filter at the top, and at the bottom I selected the NAD83 / New York Long Island (ftUS) EPSG 2263 option system in the list. Every state in the US has one or more state plane zones that you can select for making optimal maps for that area, in feet or meters. Throughout the world, you could choose an appropriate UTM zone that covers your area in meters. For countries or continents, look for an equidistant projection (meters again).

QGIS Export – Save As

Clicked a series of Oks to create the new file. To reset my map window to match CRS of the new file, I selected that file, right clicked, Layer CRS, Set Project CRS from Layer. Removed my original CSV to avoid confusion, and saved my project.

QGIS Noise Complaints in Projected CRS

Kernel Density Estimation

Now our data is ready. Under the Processing menu I opened the toolbox and searched for kernel to find Heatmap (Kernel Density Estimation) under the Interpolation tools. The tool asks for an input point layer, and then a radius. The radius is used to define an area for calculating a local density estimate around each point. We can use a formula to determine an ideal radius; the hopt method seems to be commonly employed for this purpose.

To use the hopt formula, we need to know the standard distance for our layer, which measures the degree to which features are dispersed around the spatial mean or center of the distribution. A nice 3rd party plugin was created for calculating this. I went to the the plugins menu, searched for the Standard Distance plugin, and added it. Searched for it in the Processing toolbox and launched it. I provided my point layer for input, and specified an output file. The other fields are optional (if we were measuring an attribute of the points instead of the density of the points, we could specify the attribute as a weight column). The output layer consists of a circle where the center is the mean center of the distribution, and the circle represents the standard deviation. The attribute table contains one record, with the standard distance attribute of 36,046.18 feet (if no feature was created, the likely problem is you have records in the point file that don’t have geometry – delete them and try again).

Output from the Standard Distance Plugin

Knowing this, I used the hopt formula:

=((2/(3N))^0.25)SD

Where N is the number of features and SD is the standard distance. I used Excel to plug in these values and do the calculation.

((2/(374526))^0.25)36046.18 = 1971.33

Finally, I launched the heatmap kernel tool, specified my noise points as input, and the radius as 1,971 feet. The output raster size does take some experimentation. The larger the pixel size, the coarser or more general the resolution will be. You want to choose something that makes sense based on the size of the area, the number of points, and / or some other contextual information. Just like the radius, the units are based on the map units of your layer. If I type in 100 feet for Pixel X, I see I’ll have a raster with 1,545 rows and 1,565 columns. Change it to 200 feet, and I get 773 by 783. I’ll go with 200 feet (the distance between a “standard” numbered street block in midtown Manhattan). I kept the defaults for the other options.

QGIS Heatmap Kernel Density Estimation Window

The resulting raster was initially displayed in black and white. I opened the properties and symbology menu and changed the render type from Singleband gray to Singleband pseudocolor, and kept the default yellow to red scheme. Voila!

Kernel Density Estimate of NYC Noise Complaints June 2022

In June 2022 there were high clusters of noise complaints in north central Brooklyn, northern Manhattan, and the southwest portion of the Bronx. There’s a giant red hot spot in the north central Bronx that looks like the storm on planet Jupiter. What on earth is going on there? I flipped back to the noise point layer and selected points in that area, and discovered a single address where over 2,700 noise complaints about a loud party were filed on June 18 and 19! There’s also an address on the adjacent block that registered over 900 complaints. And yet the records do not appear to be duplicates, as they have different time stamps and closing dates. A mistake in coding this address, multiple times? A vengeful person spamming the 311 system? Or just one helluva loud party? It’s hard to say, but beware of garbage in, garbage out. Beyond this demo, I would spend more time investigating, would try omitting these complaints as outliers and run the heatmap tool again, and compare this output to different months. It’s also worth experimenting with the color classification scheme, and some different pixel sizes.

Kernel Results Zoomed In

Contour Lines

Another interesting way to visualize this data would be to generate contour lines based on the kernel output. I did a search for contour in the processing toolbox, and in the contour tool I provided the kernel noise raster as the input. For intervals between contour lines I tried 20 feet, and changed the attribute name to reflect what the contour represents: COMPLAINT instead of ELEV. Generated the new file, overlaid on top of the kernel, and now you can see how it represents the “elevation” of complaints.

Noise Complaint Kernel Density with Contour Lines

Switch the kernel off, symbolize the contours and add some labels, and throw the OpenStreetMap underneath, and now you can explore New York’s hills and valleys of noise. Or more precisely, the hills and valleys of noise complainers! In looking at these contours, it’s important to remember that they’re generated from the kernel raster’s grid cells and not from the original point layer. The raster is a generalization of the point layer, so it’s possible that if you look within the center of some of the denser circles you may not find, say, 340 or 420 actual point complaints. To generate a more precise set of contours, you would need to decrease the pixel size in the kernel tool (from say 200 feet to 100).

Noise Complaint Contours in Lower Manhattan, Northwest Brooklyn, and Long Island City

It’s interesting what you can create with just one set of points as input. Happy mapping!

STATA records

Creating STATA Variable Lists in Excel and Do Files With Python

In this post I demonstrate how export a list of variables from a STATA dta file to an Excel spreadsheet, and how to create a STATA do file by using Python to read in a list of variables from a spreadsheet; the do file will generate an extract of attributes and observations from a larger dta file. Gallup Analytics microdata serves as the example.

Gallup Analytics Microdata

Many academic libraries subscribe to an online database called Gallup Analytics, which lets users explore and download summary statistics from a number of on-going polls and surveys conducted by the Gallup Organization, such as the US Daily Tracker poll, World Poll, and SPSS polling series. As part of the package, subscribing institutions also receive microdata files for some of the surveys, in STATA and SPSS formats. These files contain the anonymized, individual responses to the surveys. The microdata is valuable to social science researchers who use the responses to conduct statistical analyses.

STATA
Microdata in STATA

Naturally, the microdata is copyrighted and licensed for non-commercial research purposes to members of the university or institution who are covered by the license agreement, and cannot be shared outside the institution. Another stipulation is that the files cannot be shared in their entirety, even for members of the licensed institution; researchers must request individual extracts of variables and observations to answer a specific research question. This poses a challenge for the data librarian, who somehow has to communicate to the researcher what’s available in the files and mediate the request. Option 1 is to share the codebooks (which are also copyrighted and can’t be publicly distributed) with the researcher and haggle back and forth via email to iron out the details of the request. Option 2 is to have a stand-alone computer set up in the library, where a researcher can come and generate their own extract from files stored on a secure, internal network. In both cases, the manual creation of the extract and the researcher’s lack of familiarity with the contents of the data makes for a tedious process.

My solution was to create spreadsheets that list all of the variables in each dataset, and have the researcher check the ones they want. I created a resource guide that advertises and describes the datasets, and provides secure links to the Gallup codebooks and these spreadsheets, which are stored on a Google Drive and are protected via university authentication. The researcher can then fill out a Google form (also linked to from that page), where they describe the nature of the request, select the specific dataset of interest, specify filters on observations (rows), and upload the spreadsheet of requested variables (columns). Then, I can read the spreadsheet variables into Python and generate a STATA do file (STATA scripts stored in plain text format), to create the desired extract which I can share with the researcher.

Create List of STATA Variables in Excel Spreadsheet

First, I created a standard set of STATA do files to output lists of all variables to a spreadsheet for the different data files. An example for the US Daily Tracker poll from pre-2018 is below. I was completely unfamiliar with STATA, but the online docs and forums taught me what I needed to pull this together.

Some commands are the same across all the do files. I use describe and then translate to create a simple text file that saves a summary from the screen that counts rows and columns. Describe gives a description of the data stored in memory, while replace is used to swap out existing variables with a new subset. Then, generate select_vars gives me codebook information about the dataset (select_vars is a variable name I created), which I sort using the name column. The export excel command is followed by the specific summary fields I wish to output; the position of the variable, data type, variable label, and the variable name itself.

* Create variable list for Gallup US Tracker Survey 2008-2017

local y = YEAR in 1

describe,short
summarize YEAR
translate @Results gallup_tracker_`y'_summary.txt, replace

describe, replace
generate select_vars = ""
sort name

export excel position name type varlab select_vars using gallup_tracker_`y'_vars.xlsx, firstrow(variables) replace

The variation for this particular US Daily Tracker dataset is that the files are packaged as one file per year. I load the first file for 2008, and the do file saves the YEAR attribute as a local variable, which allows me to include the year in the summary and excel output file names. I had to run this do file for each subsequent year up to 2017. This is not a big deal as I’ll never have to repeat the process on the old files, as new data will be released in separate, new files. Other datasets imposed different requirements; the GPSS survey is packaged in eleven separate files for different surveys, and the updates are cumulative (each file contains older data plus any updates – Gallup sends us updated files a few times each year). For the GPSS, I prompt the user for input to specify the survey file name, and overwrite the previous Excel file.

With the do file in hand, you open STATA and the data file you want to process, change the working directory from the default user folder to a better location for storing the output, open the do file, and it runs and creates the variable list spreadsheet.

Excel spreadsheet of variables generated from STATA
List of variables in Excel generated from STATA file. Users check the variables they want in an extract in the select_vars column

Create a STATA Do File with Python and Excel

Once a researcher submits their Google form and their selected variable spreadsheet (placing an X in a dedicated column to indicate that they want to include a variable), I run the Python script below. I use the openpyxl module to read the Excel file. I have to modify the paths, spreadsheet file name, and an integer for the particular survey each time I run it. I use the os module to navigate up and down through folders to store outputs in specific places. If the researcher specifies in the Google form that they want to filter observations, for example records for specific states or age ranges, I have to add those manually but I commented out a few examples that I can copy and modify. One caveat is that you must filter using the coded variable and not its label (i.e. if a month value is coded as 2 and its label is February, I must reference the code and not the label). Reading in the requested columns is straightforward; the script identifies cells in the selection column (E) that have an X, then grabs the variable name from the adjacent column.

# -*- coding: utf-8 -*-
"""
Pull selected gallup variables from spreadsheet to create STATA Do File
Frank Donnelly / GIS and Data Librarian / Brown University
"""

import openpyxl as xl, os
from datetime import date

thedate=date.today().strftime("%m%d%Y")
surveys={1:'gallup_covid',2:'gallup_gpss',3:'gallup_tracker',4:'gallup_world'}

rpath=os.path.join('requests','test') # MODIFY BASED ON INPUT
select_file=os.path.join(rpath,'gallup_tracker_2017_vars_TEST.xlsx') #MODIFY BASED ON INPUT
survey_file=surveys[3] #MODIFY BASED ON INPUT

dofile=os.path.join(rpath,'{}_vars_{}.do'.format(survey_file,thedate))
dtafile=os.path.join(os.path.abspath(os.getcwd()),rpath,'{}_extract_{}.dta'.format(survey_file,thedate))


#MODIFY to filter by observations - DO NOT ERASE EXAMPLES - copy, then modify
obsfilter=None
# obsfilter=None
# obsfilter='keep if inlist(STATE_NAME,"CT","MA","ME","NH","RI","VT")'
# obsfilter='keep if inrange(WP1220,18,64)'
# obsfilter='keep if SC7==2 & MONTH > 6'
# obsfilter='keep if FIPS_CODE=="44007" | FIPS_CODE=="25025"'

workbook = xl.load_workbook(select_file)
ws = workbook['Sheet1']

# May need to modify ws col and cell values based on user input
vlist=[]
for cell in ws['E']:
    if cell.value in ('x','X'): 
        vlist.append((ws.cell(row=cell.row, column=2).value))
outfile = open(dofile, "w")
outfile.writelines('keep ')
outfile.writelines(" ".join(vlist)+"\n")
if obsfilter==None:
    pass
else:
    outfile.writelines(obsfilter+"\n")
outfile.writelines('save '+dtafile+"\n")
outfile.close()
print('Created',dofile) 

The plain text do file begins with the command keep followed by the columns, and if requested, an additional keep statement to filter by records. The final save command will direct the output to a specific location.

keep CENREG D17A D23 D24 D5 FIPS_CODE HISPANIC INT_DATE MONTH MOTHERLODE_ID PE_WEIGHT RACE SC7 STATE_NAME WP10202 WP10208 WP10209 WP10215 WP10216 WP10229 WP10230 WP1220 WP1223 YEAR ZIPGALLUPREGION ZIPSTATE
save S:\gallup\processing\scripts\reques\test\gallup_tracker_extract_02202022.dta

All that remains is to open the requested data file in STATA, open the do file, and an extract is created. Visit my GitHub for the do files, Python script, and sample output. The original source data and the variable spreadsheets are NOT included due to licensing issues; if you have the original data files you can generate what I’ve described here. Sorry, I can’t share the Gallup data with you (so please don’t ask). You’ll need to contact your own university or institution to determine if you have access.

NYC and NYMA Pop Change Graph 2000 to 2019

New York’s Population and Migration Trends in the 2010s

The Weissman Center for International Business at Baruch College just published my paper, “New York’s Population and Migration Trends in the 2010s“, as part of their Occasional Paper Series. In the paper I study population trends over the last ten years for both New York City (NYC) and the greater New York Metropolitan Area (NYMA) using annual population estimates from the Census Bureau (vintage 2019), county to county migration data (2011-2018) from the IRS SOI, and the American Community Survey (2014-2018). I compare NYC to the nine counties that are home to the largest cities in the US (cities with population greater than 1 million) and the NYMA to the 13 largest metro areas (population over 4 million) to provide some context. I conclude with a brief discussion of the potential impact of COVID-19 on both the 2020 census count and future population growth. Most of the analysis was conducted using Python and Pandas in Jupyter Notebooks available on my GitHub. I discussed my method for creating rank change grids, which appear in the paper’s appendix and illustrate how the sources and destinations for migrants change each year, in my previous post.

Terminology

  • Natural increase: the difference between births and deaths
  • Domestic migration: moves between two points within the United States
  • Foreign migration: moves between the United States and a US territory or foreign country
  • Net migration: the difference between in-migration and out-migration (measured separately for domestic and foreign)
  • NYC: the five counties / boroughs that comprise New York City
  • NYMA: the New York Metropolitan Area as defined by the Office of Management and Budget in Sept 2018, consists of 10 counties in NY State (including the 5 NYC counties), 12 in New Jersey, and one in Pennsylvania
Map of the New York Metropolitan Area
The New York-Newark-Jersey City, NY-NJ-PA Metropolitan Area

Highlights

  • Population growth in both NYC and the NYMA was driven by positive net foreign migration and natural increase, which offset negative net domestic migration.
  • Population growth for both NYC and the NYMA was strong over the first half of the decade, but population growth slowed as domestic out-migration increased from 2011 to 2017.
  • NYC and the NYMA began experiencing population loss from 2017 forward, as both foreign migration and natural increase began to decelerate. Declines in foreign migration are part of a national trend; between 2016 and 2019 net foreign migration for the US fell by 43% (from 1.05 million to 595 thousand).
  • The city and metro’s experience fit within national trends. Most of the top counties in the US that are home to the largest cities and many of the largest metropolitan areas experienced slower population growth over the decade. In addition to NYC, three counties: Cook (Chicago), Los Angeles, and Santa Clara (San Jose) experienced actual population loss towards the decade’s end. The New York, Los Angeles, and Chicago metro areas also had declining populations by the latter half of the decade.
  • Most of NYC’s domestic out-migrants moved to suburban counties within the NYMA (representing 38% of outflows and 44% of net out-migration), and to Los Angeles County, Philadelphia County, and counties in Florida. Out-migrants from the NYMA moved to other large metros across the country, as well as smaller, neighboring metros like Poughkeepsie NY, Fairfield CT, and Trenton NJ. Metro Miami and Philadelphia were the largest sources of both in-migrants and out-migrants.
  • NYC and the NYMA lack any significant relationships with other counties and metro areas where they are net receivers of domestic migrants, receiving more migrants from those places than they send to those places.
  • NYC and the NYMA are similar to the cities and metros of Los Angeles and Chicago, in that they rely on high levels foreign migration and natural increase to offset high levels of negative domestic migration, and have few substantive relationships where they are net receivers of domestic migrants. Academic research suggests that the absolute largest cities and metros behave this way; attracting both low and high skilled foreign migrants while redistributing middle and working class domestic migrants to suburban areas and smaller metros. This pattern of positive foreign migration offsetting negative domestic migration has characterized population trends in NYC for many decades.
  • During the 2010s, most of the City and Metro’s foreign migrants came from Latin America and Asia. Compared to the US as a whole, NYC and the NYMA have slightly higher levels of Latin American and European migrants and slightly lower levels of Asian and African migrants.
  • Given the Census Bureau’s usual residency concept and the overlap in the onset the of COVID-19 pandemic lock down with the 2020 Census, in theory the pandemic should not alter how most New Yorkers identify their usual residence as of April 1, 2020. In practice, the pandemic has been highly disruptive to the census-taking process, which raises the risk of an under count.
  • The impact of COVID-19 on future domestic migration is difficult to gauge. Many of the pandemic destinations cited in recent cell phone (NYT and WSJ) and mail forwarding (NYT) studies mirror the destinations that New Yorkers have moved to between 2011 and 2018. Foreign migration will undoubtedly decline in the immediate future given pandemic disruptions, border closures, and restrictive immigration policies. The number of COVID-19 deaths will certainly push down natural increase for 2020.

Rank Change Grid

Creating Heatmaps to Show Change in Rank Over Time with Python

In this post I’ll demonstrate how I created annotated heatmaps (or what I’m calling a rank change grid) showing change in rank over time using Python and Matplotlib’s imshow plots. I was writing a report on population trends and internal migration using the IRS county to county migration dataset, and wanted to depict the top origins and destinations of migrants for New York City and the New York Metropolitan Area and how they changed from year to year.

I hit upon this idea based on an example in the Matplotlib documentation using the imshow plot. Imshow was designed for manipulating and creating images, but since images are composed of rows and columns of pixels you can use this function to create grids (for GIS folks, think of a raster). The rows can indicate rank from 1 to N, while the columns could represent time, which in my case is years. I could label each grid cell with the name of a place (i.e. origin or destination), and if a place changes ranks over time I could assign the cell a color indicating increase or decrease; otherwise I’d assign a neutral color indicating no change. The idea is that you could look at place at a given rank in year 1 and follow it across the chart by looking at the label. If a new place appears in a given position, the color change clues you in, and you can quickly scan to see whether a given place went up or down.

The image below shows change in rank for the top metro area destinations for migrants leaving the NYC metro from 2011 to 2018. You can see that metro Miami was the top destination for several years, up until 2016-17 when it flips positions with metro Philadelphia, which had been the number 2 destination. The sudden switch from a neutral color indicates that the place occupying this rank is new. You can also follow how 3rd ranked Bridgeport falls to 4th place in the 2nd year (displaced by Los Angeles), remains in 4th place for a few years, and then falls to 5th place (again bumped by Los Angeles, which falls from 3rd to 4th as it’s bumped by Poughkeepsie).

NYC Metro Outflow Grid
Annual Change in Ranks for Top Destinations for NYC Metro Migrants (Metro Outflows)

I opted for this over a more traditional approach called a bump chart (also referred to a slope chart or graph), with time on the x-axis and ranks on the y-axis, and observations labeled at either the first or last point in time. Each observation is assigned a specific color or symbol, and lines connect each observation to its changing position in rank so you can follow it along the chart. Interpreting these charts can be challenging; if there are frequent changes in rank the whole thing begins to look like spaghetti, and the more observations you have the tougher it gets to interpret. Most of the examples I found depicted a small and finite number of observations. I have hundreds of observations and only want to see the top ten, and if observations fall in and out of the top N ranks you get several discontinuous lines which look odd. Lastly, neither Matplotlib or Pandas have a default function for creating bump charts, although I found a few examples where you could create your own.

Creating the rank change grids was a three-part process that required: taking the existing data and transforming it into an array of the top or bottom N values that you want to show, using that array to generate an array that shows change in ranks over time, and generating a plot using both arrays, one for the value and the other for the labels. I’ll tackle each piece in this post. I’ve embedded the functions at the end of each explanation; you can also look at my GitHub repo that has the Jupyter Notebook I used for the analysis for the paper (to be published in Sept 2020).

Create the Initial Arrays

In the paper I was studying flows between NYC and other counties, and the NYC metro area and other metropolitan statisical areas. I’ll refer just to the metro areas as my example in this post, but my functions were written to handle both types of places, stored in separate dataframes. I began with a large dataframe with every metro that exchanged migrants with the NYC metro. There is a row for each metro where the index is the Census Bureau’s unique FIPS code, and columns that show inflows, outflows, and net flows year by year (see image below). There are some rows that represent aggregates, such as flows to all non-metro areas and the sum of individual metro flows that could not be disclosed due to privacy regulations.

Initial Dataframe
Initial Dataframe

The first step is to create an array that has just the top or bottom N places that I want to depict, just for one flow variable (in, out, or net). Why an array? Arrays are pretty solid structures that allow you to select specific rows and columns, and they mesh nicely with imshow charts as each location in the matrix can correspond with the same location in the chart. Most of the examples I looked at used arrays. It’s possible to use other structures but it’s more tedious; nested Python lists don’t have explicit rows and columns so a lot of looping and slicing is required, and with dataframes there always seems to be some catch with data types, messing with the index versus the values, or something else. I went with NumPy’s array type.

I wrote a function where I pass in the dataframe, the type of variable (in, out, or net flow), the number of places I want, whether they are counties or metro areas, and whether I want the top or bottom N records (true or false). Two arrays are returned: the first shows the FIPS unique ID numbers of each place, while the second returns the labels. You don’t have to do anything to calculate actual ranks, because once the data is sorted the ranks become implicit; each row represents ranks 1 through 10, each column represents a year, and the ID or label for a place that occupies each position indicates its rank for that year.

In my dataframe, the names of the columns are prefixed based on the type of variable (inflow, outflow, or net flow), followed by the year, i.e. inflows_2011_12. In the function, I subset the dataframe by selecting columns that start with the variable I want. I have to deal with different issues based on whether I’m looking at counties or metro areas, and I need to get rid of any IDs that are for summary values like the non-metro areas; these IDS are stored in a list called suppressed, and the ~df.indexisin(suppressed) is pandaesque for taking anything that’s not in this list (the tilde acts as not). Then, I select the top or bottom values for each year, and append them to lists in a nested list (each sub-list represents the top / bottom N places in order for a given year). Next, I get the labels I want by creating a dictionary that relates all ID codes to label names, pull out the labels for the actual N values that I have, and format them before appending them to lists in a nested list. For example, the metro labels are really long and won’t fit in the chart, so I split them and grab just the first piece: Albany-Schenectady-Troy, NY becomes Albany (split using the dash) while Akron, OH becomes Akron (if no dash is present, split at comma). At the end, I use np.array to turn the nested lists into arrays, and transpose (T) them so rows become ranks and years become values. The result is below:

ID Array
Function and Result for Creating Array of IDs Top N Places
# Create array of top N geographies by flow type, with rows as ranks and columns as years
# Returns 2 arrays with values for geographies (id codes) and place names
# Must specify: number of places to rank, counties or metros, or sort by largest or smallest (True or False)
def create_arrays(df,flowtype,nsize,gtype,largest):
    geogs=[]
    cols=[c for c in df if c.startswith(flowtype)]
    for c in cols:
        if gtype=='counties':
            row=df.loc[~df.index.isin(suppressed),[c]]
        elif gtype=='metros':
            row=df.loc[~df.index.isin(msuppressed),[c]]
        if largest is True:
            row=row[c].nlargest(nsize)
        elif largest is False:
            row=row[c].nsmallest(nsize)
        idxs=list(row.index)
        geogs.append(idxs)

    if gtype=='counties':
        fips=df.to_dict()['co_name']
    elif gtype=='metros':
        fips=df.to_dict()['mname']
    labels=[]
    for row in geogs:
        line=[]
        for uid in row:
            if gtype=='counties':
                if fips[uid]=='District of Columbia, DC':
                    line.append('Washington\n DC')
                else:
                    line.append(fips[uid].replace('County, ','\n')) #creates short labels
            elif gtype=='metros':
                if '-' in fips[uid]:
                    line.append(fips[uid].split('-')[0]) #creates short labels
                else:
                    line.append(fips[uid].split(',')[0])
        labels.append(line)

    a_geogs=np.array(geogs).T
    a_labels=np.array(labels).T

    return a_geogs, a_labels

Change in Rank Array

Using the array of geographic ID codes, I can feed this into function number two to create a new array that indicates change in rank over time. It’s better to use the ID code array as we guarantee that the IDs are unique; labels (place names) may not be unique and pose all kinds of formatting issues. All places are assigned a value of 0 for the first year, as there is no previous year to compare them to. Then, for each subsequent year, we look at each value (ID code) and compare it to the value in the same position (rank) in the previous column (year). If the value is the same, that place holds the same rank and is assigned a 0. Otherwise, if it’s different we look at the new value and see what position it was in in the previous year. If it was in a higher position last year, then it has declined and we assign -1. If it was in a lower position last year or was not in the array in that column (i.e. below the top 10 in that year) it has increased and we assign it a value of 1. This result is shown below:

Rank Change Array
Function and Result for Creating Change in Rank Array
# Create array showing how top N geographies have changed ranks over time, with rows as rank changes and
# columns as years. Returns 1 array with values: 0 (no change), 1 (increased rank), and -1 (descreased rank)
def rank_change(geoarray):
    rowcount=geoarray.shape[0]
    colcount=geoarray.shape[1]

    # Create a number of blank lists
    changelist = [[] for _ in range(rowcount)]

    for i in range(colcount):
        if i==0:
            # Rank change for 1st year is 0, as there is no previous year
            for j in range(rowcount):
                changelist[j].append(0)
        else:
            col=geoarray[:,i] #Get all values in this col
            prevcol=geoarray[:,i-1] #Get all values in previous col
            for v in col:
                array_pos=np.where(col == v) #returns array
                current_pos=int(array_pos[0]) #get first array value
                array_pos2=np.where(prevcol == v) #returns array
                if len(array_pos2[0])==0: #if array is empty, because place was not in previous year
                    previous_pos=current_pos+1
                else:
                    previous_pos=int(array_pos2[0]) #get first array value
                if current_pos==previous_pos:
                    changelist[current_pos].append(0)
                    #No change in rank
                elif current_posprevious_pos: #Larger value = smaller rank
                    changelist[current_pos].append(-1)
                    #Rank has decreased
                else:
                    pass

    rankchange=np.array(changelist)
    return rankchange 

Create the Plot

Now we can create the actual chart! The rank change array is what will actually be charted, but we will use the labels array to display the names of each place. The values that occupy the positions in each array pertain to the same place. The chart function takes the names of both these arrays as input. I do some fiddling around at the beginning to get the labels for the x and y axis the way I want them. Matplotlib allows you to modify every iota of your plot, which is in equal measures flexible and overwhelming. I wanted to make sure that I showed all the tick labels, and changed the default grid lines to make them thicker and lighter. It took a great deal of fiddling to get these details right, but there were plenty of examples to look at (Matplotlib docs, cookbook, Stack Overflow, and this example in particular). For the legend, shrinking the colorbar was a nice option so it’s not ridiculously huge, and I assign -1, 0, and 1 to specific colors denoting decrease, no change, and increase. I loop over the data values to get their corresponding labels, and depending on the color that’s assigned I can modify whether the text is dark or light (so you can see it against the background of the cell). The result is what you saw at the beginning of this post for outflows (top destinations for migrants leaving the NY metro). The function call is below:

Function for Creating Rank Change Grid
Function for Creating Rank Change Grid
# Create grid plot based on an array that shows change in ranks and an array of cell labels
def rank_grid(rank_change,labels):
    alabels=labels
    xlabels=[yr.replace('_','-') for yr in years]
    ranklabels=['1st','2nd','3rd','4th','5th','6th','7th','8th','9th','10th',
               '11th','12th','13th','14th','15th','16th','17th','18th','19th','20th']
    nsize=rank_change.shape[0]
    ylabels=ranklabels[:nsize]

    mycolors = colors.ListedColormap(['#de425b','#f7f7f7','#67a9cf'])
    fig, ax = plt.subplots(figsize=(10,10))
    im = ax.imshow(rank_change, cmap=mycolors)

    # Show all ticks...
    ax.set_xticks(np.arange(len(xlabels)))
    ax.set_yticks(np.arange(len(ylabels)))
    # ... and label them with the respective list entries
    ax.set_xticklabels(xlabels)
    ax.set_yticklabels(ylabels)

    # Create white grid.
    ax.set_xticks(np.arange(rank_change.shape[1]+1)-.5, minor=True)
    ax.set_yticks(np.arange(rank_change.shape[0]+1)-.5, minor=True)
    ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
    ax.grid(which="major",visible=False)

    cbar = ax.figure.colorbar(im, ax=ax, ticks=[1,0,-1], shrink=0.5)
    cbar.ax.set_yticklabels(['Increased','No Change','Decreased'])

    # Loop over data dimensions and create text annotations.
    for i in range(len(ylabels)):
        for j in range(len(xlabels)):
            if rank_change[i,j] < 0:
                text = ax.text(j, i, alabels[i, j],
                           ha="center", va="center", color="w", fontsize=10)
            else:
                text = ax.text(j, i, alabels[i, j],
                           ha="center", va="center", color="k", fontsize=10)

    #ax.set_title("Change in Rank Over Time")
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    fig.tight_layout()
    plt.show()
    return ax 

Conclusions and Alternatives

I found that this approach worked well for my particular circumstances, where I had a limited number of data points to show and the ranks didn’t fluctuate much from year to year. The charts for ten observations displayed over seven points in time fit easily onto standard letter-sized paper; I could even get away with adding two additional observations and an eighth point in time if I modified the size and placement of the legend. However, beyond that you can begin to run into trouble. I generated charts for the top twenty places so I could see the results for my own analysis, but it was much too large to create a publishable graphic (at least in print). If you decrease the dimensions for the chart or reduce the size of the grid cells, the labels start to become unreadable (print that’s too small or overlapping labels).

There are a number of possibilities for circumventing this. One would be to use shorter labels; if we were working with states or provinces we can use the two-letter postal codes, or ISO country codes in the case of countries. Not an option in my example. Alternatively, we could move the place names to the y-axis (sorted alphabetically or by first or final year rank) and then use the rank as the annotation label. This would be a fundamentally different chart; you could see how one place changes in rank over time, but it would be tougher to discern which places were the most important source / destination for the area you’re studying (you’d have to skim through the whole chart). Or, you could keep ranks on the y-axis and assign each place a unique color in the legend, shade the grid cells using that color, and thus follow the changing colors with your eye. But this flops is you have too many places / colors.

A different caveat is this approach doesn’t work so well if there is a lot of fluctuation in ranks from year to year. In this example, the top inflows and outflows were relatively stable from year to year. There were enough places that held the same rank that you could follow the places that changed positions. We saw the example above for outflows, below is an example for inflows (i.e. the top origins or sources of migrants moving to the NY metro):

NYC Metro Inflow Grid
Annual Change in Ranks for Top Origins for NYC Metro Migrants (Metro Inflows)

In contrast, the ranks for net flows were highly variable. There was so much change that the chart appears as a solid block of colors with few neutral (unchanged) values, making it difficult to see what’s going on. An example of this is below, representing net flows for the NYC metro area. This is the difference between inflows and outflows, and the chart represents metros that receive more migrants from New York than they send (i.e. net receivers of NY migrants). While I didn’t use the net flow charts in my paper, it was still worth generating as it made it clear to me that net flow ranks fluctuate quite a bit, which was a fact I could state in the text.

NYC Metro Net Flow Grid
Annual Change in Ranks for Net Receivers of NYC Metro Migrants (Metro Net Flows)

There are also a few alternatives to using imshow. Matplotlib’s pcolor plot can produce similar effects but with rectangles instead of square grid cells. That could allow for more observations and longer labels. I thought it was less visually pleasing than the equal grid, and early on I found that implementing it was clunkier so I went no further. My other idea was to create a table instead of a chart. Pandas has functions for formatting dataframes in a Jupyter Notebook, and there are options for exporting the results out to HTML. Formatting is the downside – if you create a plot as an image, you export it out and can then embed it into any document format you like. When you’re exporting tables out of a notebook, you’re only exporting the content and not the format. With a table, the content and formatting is separate, and the latter is often tightly bound to the publication format (Word, LaTeX, HTML, etc.) You can design with this in mind if you’re self-publishing a blog post or report, but this is not feasible when you’re submitting something for publication where an editor or designer will be doing the layout.

I really wanted to produce something that I could code and run automatically in many different iterations, and was happy with this solution. It was an interesting experiment, as I grappled with taking something that seemed intuitive to do the old-fashioned way (see below) and reproducing it in a digital, repeatable format.

Copybook Chart
zbp_table

County and ZIP Code Business Patterns 2017 and the Census API

The U.S. Census Bureau’s County and ZIP Code Business Patterns (CBP and ZBP) datasets are generated annually from the Business Register, a large administrative database updated by several federal agencies which contains every business establishment in the U.S. with paid employees. Business establishments are defined as single physical locations where business is conducted or where services or industrial operations are performed. Establishments are assigned to industries, which are groups of businesses that produce similar products or provide similar services, using the North American Industrial Classification System (NAICS). The ZBP contains tables with total establishments, employment, and wages by ZIP and counts of business establishments by NAICS and ZIP. The CBP has these tables plus a few others for counties.

The 2017 Business Patterns was recently released, and there are a few important changes to the dataset over previous iterations. I’ll summarize what they are and how they impact data retrieval using the Census Bureau’s ZBP API. I unwittingly discovered these issues this week as I was trying to use a Python / Pandas notebook I’d written for extracting ZBP data and aggregating the USPS ZIP codes to Zip Code Tabulation Areas (ZCTAs), which are used for publishing decennial and ACS census data. Everything went smoothly when I tested the scripts against the 2016 ZBP, but a few things went awry with 2017 and I was forced to make some revisions.

If you’re not familiar with the API, take a look at this earlier post for a basic introduction. The notebooks I’ll refer to are available on my github; zbp_to_zcta.ipynb works for the 2017 ZBP release, and I kept the earlier version that worked for 2016.

2017 NAICS Codes

NAICS codes are revised every five years in tandem with the Economic Census (conducted in years ending in 2 and 7), to effectively capture the changing nature of the economy. The CBP and ZBP employ the latest NAICS series in the year that it’s released, so beginning with 2012 the 2012 NAICS were used for categorizing establishments into industries. The 2012 definitions were used up through 2016, but now that we’re in 2017 we have a new NAICS 2017 series, and this was employed for the 2017 CBP and ZBP and will be used through 2021.

How different are the categories? If you’re working at the broad two-digit sector level nothing has changed. The more detailed the categories are (3 to 6 digit), the more likely it is that you’ll encounter changes: industries that were created, or removed (aggregated into a broader miscellaneous category), or modified. You can use the concordance tables to see how definitions have changed, and in some cases crosswalk data from one category to another.

If you’re using the API, you’ll need to modify your url to access the 2017 NAICS variables (&NAICS2017=) as opposed to the 2012 series (&NAICS2012= ).

New Privacy Regulations

For confidentiality purposes, the Census Bureau has always employed various methods to insure that the summary data produced for the CBP and ZBP can’t be used to identify characteristics of an individual business. If a geographic area or industrial category had fewer than 3 establishments in it, or if one establishment in an area or category constituted an overwhelming majority of the employment or wages, then those values were not disclosed or published. The only characteristic that was always published was the number of establishments.

Not any more – beginning with the 2017 CBP and ZBP, the following applies:

> Prior to reference year 2017, the number of establishments in a particular tabulation cell was not considered sensitive; therefore, counts of establishments were released without any disclosure avoidance methods applied. Beginning with reference year 2017, cells with fewer than 3 establishments have been omitted from the release.

So what does this mean? First, for any county or ZIP Code that has fewer than 3 business establishments in total, records for that county or ZIP Code will not appear in the dataset at all (although establishments in these areas will be counted in summaries of larger areas, like states or metro areas). In my script, about 30 ZIP Codes for NYC fell out of my results compared to last year; these were primarily non-residential ZIPs that represented a single business that processes lots of mail, and post office box ZIPs.

Second, for a given geographic area, if a given NAICS category has less than three business establishments, the number of establishments won’t be reported for that category, but they will be included in the sum total. Once again, in my case I’m working with two-digit sector codes. There is a 00 code that captures the sum of all establishments. When I was summing the values of all of the two-digit codes together, I discovered that these sums rarely matched the 00 total, like they did in the past, because of the new non-disclosure policy. To account for this, and to calculate percent totals correctly, I had to create a category that takes the difference between the total 00 category and the sum of all the others, to count how many businesses were not disclosed (see pic below). I could then treat that category like the others, and the sum of the parts would equal the whole again.

summary_naics

These data frames show counts of establishments by two digit NAICS sectors. In the top df, the totals column N00 does not equal the sum of the others columns. A column was added to the bottom df to get the difference between the two.

Subsequently, I replaced the zeros for any ZIP code that had businesses that weren’t disclosed with NULLs, as I can’t know for certain if the values are truly zero. The most likely categories (at the two digit level for ZIPs) where data was not disclosed were: 11 (agriculture), 21 (mining), 22 (utilities), and 99 (unclassified businesses).

Looping Through and Retrieving Geographies

The API allows you to select all geographies within another geography using the ‘in’ clause (visit the ZBP API to see a list of variables and examples). For example, you can select all the counties in a particular state – in the example below, values would be passed into the variables in braces, and you would pass ANSI FIPS codes into the geography variables:

base_url = f'https://api.census.gov/data/{year}/{dsource}'
edata_url=f'{base_url}?get={ecols}&for={county}:*&in=state:{state}&key={api_key}'

This option is only available for geographies that nest, according to the Census Bureau’s geographic hierarchy. ZIP Codes are not a census geography and don’t nest within anything, so we can’t use the ‘in’ clause. For the 2016 and prior versions of the ZBP API, there was a trick for getting around this; there was a state variable called ST, which you could use in a similar fashion to get all the ZIP Codes in a state in a ‘for’ clause:

edata_url = f'{base_url}?get={ecols}&for=zipcode:*&ST={state}&key={api_key}'

Not any more – the ST variable disappeared in the 2017 API for the ZBP. So what can you do instead? Option one is to loop through a list of ZIP codes, passing them to the API one by one. This is fine if you just need a few, but pretty slow if you need the 260 something that I needed. Option two is to pass in several ZIP codes into the URL at once, but there’s a catch: you’re only allowed to pass in 50 values at a time to any variable. To do this, you need to divide your list of ZIPs into chunks of no more than 50, loop through the sub-lists to insert them into the url, and append the results to a big list as you go along.

A function for breaking a list of ZIP Codes (or any list of variables) into chunks:

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

Call the function to generate a list of lists with an equal number of values (in my case, my ZIP Codes are an index in a dataframe):

reqzips=list(chunks(zip2zcta.index.tolist(),48))

Then run the following to iterate through the list of ZIP code lists. I use enumerate so I can grab both the indices and values in the list. The ZIP codes values (v) have to be strung together and separated by commas before passing them into the url. The ecols variable is a list of columns I want to retrieve, which is also a single string with columns separated by commas. Once I receive the first chunk I append everything to a list (emp_data), but for every subsequent chunk I start reading from the second value [1:] and skip the first [0] because I only want to append the column headers once.

emp_data=[]
for i, v in enumerate (reqzips):
    batchzips=','.join(v)
    edata_url = f'{base_url}?get={ecols}&for=zipcode:{batchzips}&key={api_key}'
    response=requests.get(edata_url)
    if response.status_code==200:
        clear_output(wait=True)
        data=response.json()
        if i == 0:
            for record in data:
                emp_data.append(record)
        else:
            for record in data[1:]:
                emp_data.append(record)
        print('Retrieved data for chunk',i)
    else:
        print('***Problem with retrieval***, response code',response.status_code)
        break

The key here is to get the looping right, to insure that you end up with a list of lists where each list represents a row of data, in this case a ZIP code record with establishment data. I employed something similar (but a bit more complicated) with an ACS script that I wrote, but in that case I was looping through lists of columns / attributes instead of geographies.

If you’d like to learn more about the census business datasets and understand how to navigate NAICS, check out chapter 8 in my book. I don’t cover the APIs, but I do demonstrate how to use the new data.census.gov and I delve into the concepts behind these datasets in good detail.