research

US Census Data ALA Tech Report

ALA Tech Report on Using Census Data for Research

I have written a new report that’s just been released: US Census Data: Concepts and Applications for Supporting Research, was published as the May / June 2022 issue of the American Library Association’s Library and Technology Reports. It’s available for purchase digitally or in hard copy from the ALA from now through next year. It will also be available via EBSCOhost as full text, sometime this month. One year from now, the online version will transition to become a free and open publication available via the tech report archives.

The report was designed to be a concise primer (about 30 pages) for librarians who want to be knowledgeable with assisting researchers and students with finding, accessing, and using public summary census data, or who want to apply it to their own work as administrators or LIS researchers. But I also wrote it in such a way that it’s relevant for anyone who is interested in learning more about the census. In some respects it’s a good distillation of my “greatest hits”, drawing on work from my book, technical census-related blog posts, and earlier research that used census data to study the distribution of public libraries in the United States.

Chapter Outline

  1. Introduction
  2. Roles of the Census: in American society, the open data landscape, and library settings
  3. Census Concepts: geography, subject categories, tables and universes
  4. Datasets: decennial census, American Community Survey, Population Estimates, Business Establishments
  5. Accessing Data: data.census.gov, API with python, reports and data summaries
  6. GIS, historical research, and microdata: covers these topics plus the Current Population Survey
  7. The Census in Library Applications: overview of the LIS literature on site selection analysis and studying library access and user populations

I’m pleased with how it turned out, and in particular I hope that it will be used by MLIS students in data services and government information courses.

Although… I must express my displeasure with the ALA. The editorial team for the Library Technology Reports was solid. But once I finished the final reviews of the copy edits, I was put on the spot to write a short article for the American Libraries magazine, primarily to promote the report. This was not part of the contract, and I was given little direction and a month at a busy time of the school year to turn it around. I submitted a draft and never heard about it again – until I saw it in the magazine last week. They cut and revised it to focus on a narrow aspect of the census that was not the original premise, and they introduced errors to boot! As a writer I have never had an experience where I haven’t been given the opportunity to review revisions. It’s thoroughly unprofessional, and makes it difficult to defend the traditional editorial process as somehow being more accurate or thorough compared to the web posting and tweeting masses. They were apologetic, and are posting corrections. I was reluctant to contribute to the magazine to begin with, as I have a low opinion of it and think it’s deteriorated in recent years, but that’s a topic for a different discussion.

Stepping off the soapbox… I’ll be attending the ALA annual conference in DC later this month, to participate on a panel that will discuss the 2020 census, and to reconnect with some old colleagues. So if you want to talk about the census, you can buy me some coffee (or beer) and check out the report.

A final research and publication related note – the map that appears at the top of my post on the distribution of US public libraries from several years back has also made it into print. It appears on page 173 of The Argument Toolbox by K.J. Peters, published by Broadview Press. It was selected as an example of using visuals for communicating research findings, making compelling arguments in academic writing, and citing underlying sources to establish credibility. I’m browsing through the complimentary copy I received and it looks excellent. If you’re an academic librarian or a writing center professional and are looking for core research method guides, I would recommend checking it out.

STATA records

Creating STATA Variable Lists in Excel and Do Files With Python

In this post I demonstrate how export a list of variables from a STATA dta file to an Excel spreadsheet, and how to create a STATA do file by using Python to read in a list of variables from a spreadsheet; the do file will generate an extract of attributes and observations from a larger dta file. Gallup Analytics microdata serves as the example.

Gallup Analytics Microdata

Many academic libraries subscribe to an online database called Gallup Analytics, which lets users explore and download summary statistics from a number of on-going polls and surveys conducted by the Gallup Organization, such as the US Daily Tracker poll, World Poll, and SPSS polling series. As part of the package, subscribing institutions also receive microdata files for some of the surveys, in STATA and SPSS formats. These files contain the anonymized, individual responses to the surveys. The microdata is valuable to social science researchers who use the responses to conduct statistical analyses.

STATA
Microdata in STATA

Naturally, the microdata is copyrighted and licensed for non-commercial research purposes to members of the university or institution who are covered by the license agreement, and cannot be shared outside the institution. Another stipulation is that the files cannot be shared in their entirety, even for members of the licensed institution; researchers must request individual extracts of variables and observations to answer a specific research question. This poses a challenge for the data librarian, who somehow has to communicate to the researcher what’s available in the files and mediate the request. Option 1 is to share the codebooks (which are also copyrighted and can’t be publicly distributed) with the researcher and haggle back and forth via email to iron out the details of the request. Option 2 is to have a stand-alone computer set up in the library, where a researcher can come and generate their own extract from files stored on a secure, internal network. In both cases, the manual creation of the extract and the researcher’s lack of familiarity with the contents of the data makes for a tedious process.

My solution was to create spreadsheets that list all of the variables in each dataset, and have the researcher check the ones they want. I created a resource guide that advertises and describes the datasets, and provides secure links to the Gallup codebooks and these spreadsheets, which are stored on a Google Drive and are protected via university authentication. The researcher can then fill out a Google form (also linked to from that page), where they describe the nature of the request, select the specific dataset of interest, specify filters on observations (rows), and upload the spreadsheet of requested variables (columns). Then, I can read the spreadsheet variables into Python and generate a STATA do file (STATA scripts stored in plain text format), to create the desired extract which I can share with the researcher.

Create List of STATA Variables in Excel Spreadsheet

First, I created a standard set of STATA do files to output lists of all variables to a spreadsheet for the different data files. An example for the US Daily Tracker poll from pre-2018 is below. I was completely unfamiliar with STATA, but the online docs and forums taught me what I needed to pull this together.

Some commands are the same across all the do files. I use describe and then translate to create a simple text file that saves a summary from the screen that counts rows and columns. Describe gives a description of the data stored in memory, while replace is used to swap out existing variables with a new subset. Then, generate select_vars gives me codebook information about the dataset (select_vars is a variable name I created), which I sort using the name column. The export excel command is followed by the specific summary fields I wish to output; the position of the variable, data type, variable label, and the variable name itself.

* Create variable list for Gallup US Tracker Survey 2008-2017

local y = YEAR in 1

describe,short
summarize YEAR
translate @Results gallup_tracker_`y'_summary.txt, replace

describe, replace
generate select_vars = ""
sort name

export excel position name type varlab select_vars using gallup_tracker_`y'_vars.xlsx, firstrow(variables) replace

The variation for this particular US Daily Tracker dataset is that the files are packaged as one file per year. I load the first file for 2008, and the do file saves the YEAR attribute as a local variable, which allows me to include the year in the summary and excel output file names. I had to run this do file for each subsequent year up to 2017. This is not a big deal as I’ll never have to repeat the process on the old files, as new data will be released in separate, new files. Other datasets imposed different requirements; the GPSS survey is packaged in eleven separate files for different surveys, and the updates are cumulative (each file contains older data plus any updates – Gallup sends us updated files a few times each year). For the GPSS, I prompt the user for input to specify the survey file name, and overwrite the previous Excel file.

With the do file in hand, you open STATA and the data file you want to process, change the working directory from the default user folder to a better location for storing the output, open the do file, and it runs and creates the variable list spreadsheet.

Excel spreadsheet of variables generated from STATA
List of variables in Excel generated from STATA file. Users check the variables they want in an extract in the select_vars column

Create a STATA Do File with Python and Excel

Once a researcher submits their Google form and their selected variable spreadsheet (placing an X in a dedicated column to indicate that they want to include a variable), I run the Python script below. I use the openpyxl module to read the Excel file. I have to modify the paths, spreadsheet file name, and an integer for the particular survey each time I run it. I use the os module to navigate up and down through folders to store outputs in specific places. If the researcher specifies in the Google form that they want to filter observations, for example records for specific states or age ranges, I have to add those manually but I commented out a few examples that I can copy and modify. One caveat is that you must filter using the coded variable and not its label (i.e. if a month value is coded as 2 and its label is February, I must reference the code and not the label). Reading in the requested columns is straightforward; the script identifies cells in the selection column (E) that have an X, then grabs the variable name from the adjacent column.

# -*- coding: utf-8 -*-
"""
Pull selected gallup variables from spreadsheet to create STATA Do File
Frank Donnelly / GIS and Data Librarian / Brown University
"""

import openpyxl as xl, os
from datetime import date

thedate=date.today().strftime("%m%d%Y")
surveys={1:'gallup_covid',2:'gallup_gpss',3:'gallup_tracker',4:'gallup_world'}

rpath=os.path.join('requests','test') # MODIFY BASED ON INPUT
select_file=os.path.join(rpath,'gallup_tracker_2017_vars_TEST.xlsx') #MODIFY BASED ON INPUT
survey_file=surveys[3] #MODIFY BASED ON INPUT

dofile=os.path.join(rpath,'{}_vars_{}.do'.format(survey_file,thedate))
dtafile=os.path.join(os.path.abspath(os.getcwd()),rpath,'{}_extract_{}.dta'.format(survey_file,thedate))


#MODIFY to filter by observations - DO NOT ERASE EXAMPLES - copy, then modify
obsfilter=None
# obsfilter=None
# obsfilter='keep if inlist(STATE_NAME,"CT","MA","ME","NH","RI","VT")'
# obsfilter='keep if inrange(WP1220,18,64)'
# obsfilter='keep if SC7==2 & MONTH > 6'
# obsfilter='keep if FIPS_CODE=="44007" | FIPS_CODE=="25025"'

workbook = xl.load_workbook(select_file)
ws = workbook['Sheet1']

# May need to modify ws col and cell values based on user input
vlist=[]
for cell in ws['E']:
    if cell.value in ('x','X'): 
        vlist.append((ws.cell(row=cell.row, column=2).value))
outfile = open(dofile, "w")
outfile.writelines('keep ')
outfile.writelines(" ".join(vlist)+"\n")
if obsfilter==None:
    pass
else:
    outfile.writelines(obsfilter+"\n")
outfile.writelines('save '+dtafile+"\n")
outfile.close()
print('Created',dofile) 

The plain text do file begins with the command keep followed by the columns, and if requested, an additional keep statement to filter by records. The final save command will direct the output to a specific location.

keep CENREG D17A D23 D24 D5 FIPS_CODE HISPANIC INT_DATE MONTH MOTHERLODE_ID PE_WEIGHT RACE SC7 STATE_NAME WP10202 WP10208 WP10209 WP10215 WP10216 WP10229 WP10230 WP1220 WP1223 YEAR ZIPGALLUPREGION ZIPSTATE
save S:\gallup\processing\scripts\reques\test\gallup_tracker_extract_02202022.dta

All that remains is to open the requested data file in STATA, open the do file, and an extract is created. Visit my GitHub for the do files, Python script, and sample output. The original source data and the variable spreadsheets are NOT included due to licensing issues; if you have the original data files you can generate what I’ve described here. Sorry, I can’t share the Gallup data with you (so please don’t ask). You’ll need to contact your own university or institution to determine if you have access.

Census Tracts

Call for Proposals: Celebrating the Census in the Journal of Maps

I’m serving as a co-editor for a special issue for the Journal of Maps entitled “Celebrating the Census“. The Journal of Maps is an open access, peer reviewed journal published by the Taylor & Francis Group. The journal is distinct in that all articles feature maps and spatial diagrams as the focal point for studying geographic phenomena from both a physical / environmental and social science perspective.

Here’s the official synopsis for this census-themed special issue:

We invite contributions to a special issue of the Journal of Maps focused upon the evolving character and cartographic opportunities offered by traditional census statistics and the impact of transitioning from these sources of population data at a range of spatial scales into a new era of big data assembly. In so doing, the special issue marks two important events taking place in the UK during 2021 in the history of British Censuses and seeks contributions that reflect the past transition of population data cartography through the digital era of the last 50 years and anticipates its transformation into the big data era of the foreseeable future.

While the issue marks the 100th anniversary of the UK census, submissions concerning census mapping from around the world are welcome and encouraged in these topic areas, including but not limited to:

  • Spatial and statistical consistency over time
  • People on the move
  • Mapping people through space and time
  • Mapping morbidity and mortality
  • Politics and population data
  • International comparison of demographic mapping
  • Before and after population mapping using censuses and administrative sources
  • Population data and mapping human-environmental interaction
  • Transition and evolution in population mapping

Visit the special issue announcement for full details. Deadlines:

  • April 30, 2021: a short draft (500-word limit) outlining themes and scope of the paper, preferably with a sample map
  • June 14, 2021: abstracts will be selected by the editorial team by this date
  • Sept 5, 2021: completed paper (4000-word limit) is due

The issue will be published sometime in 2022.

CEC North America LULC

Dataset Roundup: A Summary of Specialized Open Data Sources

I list the top free GIS data sources that I consistently use on my Resources page; these are general, foundational sources that can be used for many applications. In this post I’m going to summarize an eclectic mix of more specialized resources that I’ve used or that have been recommended to me over this past year. I’ve categorized these into GIS datasets, sub-national population data for countries (tabular data that can be joined to GIS vector layers), and historic socio-economic data for countries.

Geospatial Data

North American Land Change Monitoring System

Published by the Commission for Environmental Cooperation, these land use and land cover rasters (see photo at the top of this post) are derived from MODIS imagery at 250 meter resolution for earlier years and either Landsat-7 or RapidEye imagery at 30 meter resolution for later years for Canada, the United States, and Mexico in 2005, 2010, and 2015. There are layers for both land cover and land cover change over a 5-year period. Land cover is classified into 19 categories based on UN FAO standards. It’s easy to download as the layer is unified (no individual tiles to mess with and stitch together) and for the 2015 series you can choose a national file or one for the entire continent.

PRISM Climate Data

Published by the Northwest Alliance for Computational Science & Engineering at Oregon State University, the PRISM Climate Group publishes climate data for the United States. You can generate daily, monthly, or 30-year normal rasters for temperature (min, max, mean), precipitation, dew point, and a few other measures for the continental US. There are also some prepackaged files that were created for special projects that cover Alaska, Hawaii, and some of the US territories. The site is very easy to use (certainly compared to other sites that provide climate data) and beyond its research applications the data is good for teaching purposes, as files are straightforward to create, download, and interpret.

PRISM Mean Temp Map Oct 2020

Marineregions.org Marine Boundaries

I usually help people find vector boundaries for terrestrial features, and the oceans are an afterthought that appear as the absence of land. But what if you specifically needed features that represent oceans and seas? Marineregions.org, maintained by the Flanders Marine Institute, provides many sets of water-based boundaries that include maritime regions (legal sea zones around countries) as well as polygons that represent the boundaries of the oceans and largest seas (IHO Sea Areas, defined by the International Hydrographic Association). See the screenshot of this layer in QGIS below.

IHO Seas Layer in QGIS

GNSS Time Series

Produced by NASA JPL, this dataset can be used for measuring vertical land movement (VLM) and subsistence, primarily due to movement of the earth’s tectonic plates. The dataset contains over 2,000 GPS observation points or stations; the majority are in the US but there are a scattering of points throughout the world. The data file for geodetic positions and velocities contains two records for every station: the POS (position) record provides data for the latitude (N), longitude (E), and elevation (V) in mm. The VEL (velocity) indicates the rate of movement over the time period by direction (N / E) and elevation. The last three columns for both sets of records are margins of error for each value. The data file is in a fixed-width text format. To use it in GIS you need to parse the data into a tabular format and drop the header information. When plotting the coordinates, the CRS for the geodetic file is IGS14 (EPSG code 9019). If your CRS library doesn’t include this system, it is roughly equivalent to ITRF2014 (EPSG code 7789).

Subnational Population Data

IPUMS Terra

Are you looking for population or socio-economic data for the first-level administrative divisions (states, provinces, departments, districts, etc) for many different countries? IPUMS Terra is part of the IPUMS series at the Minnesota Population Center, Univ of Minnesota. The data has been gathered from census and statistical agencies of individual countries, or in some cases from estimates generated by the project. Choose the "Create Your Custom Dataset" option, then on the next screen choose "Start Extract Area Level Output". On the Extract Builder (see pic below) choose variables on the left, like Demographic and Total Population. Then under Datasets on the right you can choose countries and filter by year. Once you move on to the next screen, you can choose to harmonize the output or choose specific years, and choose your administrative level: national, ADM-1, or smallest available. You must register to use the IPUMS data series, but registration is free for educational and non-commercial use (as long as you cite IPUMS as the source).

IPUMS Terra Interface

Subnational Human Development Index

An alternative for first-level admin data is the Subnational Human Development Index published by the GlobalDataLab at the Institute for Management Research at Radboud University. There are far fewer variables and less customization compared to IPUMS Terra, but as such the site is smaller and easier to use. There are several different indices for measuring human development, but you can also access the following indicators: life expectancy, GNI per capita, expected and mean years of schooling, and population size in millions.

Historic Global Population and Economic Data

Maddison Project

Yes, that’s Maddison with two "ds". This project from the Groningen Growth and Development Centre at the University of Groningen generates comparative economic growth, income, and population data for countries over a long historical time span; back to the year AD 1 in a few cases, but for the most part from AD 1500 forward. They provide detailed documentation that explains how the dataset was created, and it’s easy to download in either an Excel or STATA format.

The World Countries Urban Population

This dataset consists of two spreadsheet files – one for the total urban population and another for the urban ratio of the population for countries going back to the year 1500. The dataset was created by Jonathan Fink-Jensen at Utrecht University and is held in the International Institute of Social History’s data repository. The repository contains a variety of other historic socio-economic datasets for many different countries.

NYC and NYMA Pop Change Graph 2000 to 2019

New York’s Population and Migration Trends in the 2010s

The Weissman Center for International Business at Baruch College just published my paper, “New York’s Population and Migration Trends in the 2010s“, as part of their Occasional Paper Series. In the paper I study population trends over the last ten years for both New York City (NYC) and the greater New York Metropolitan Area (NYMA) using annual population estimates from the Census Bureau (vintage 2019), county to county migration data (2011-2018) from the IRS SOI, and the American Community Survey (2014-2018). I compare NYC to the nine counties that are home to the largest cities in the US (cities with population greater than 1 million) and the NYMA to the 13 largest metro areas (population over 4 million) to provide some context. I conclude with a brief discussion of the potential impact of COVID-19 on both the 2020 census count and future population growth. Most of the analysis was conducted using Python and Pandas in Jupyter Notebooks available on my GitHub. I discussed my method for creating rank change grids, which appear in the paper’s appendix and illustrate how the sources and destinations for migrants change each year, in my previous post.

Terminology

  • Natural increase: the difference between births and deaths
  • Domestic migration: moves between two points within the United States
  • Foreign migration: moves between the United States and a US territory or foreign country
  • Net migration: the difference between in-migration and out-migration (measured separately for domestic and foreign)
  • NYC: the five counties / boroughs that comprise New York City
  • NYMA: the New York Metropolitan Area as defined by the Office of Management and Budget in Sept 2018, consists of 10 counties in NY State (including the 5 NYC counties), 12 in New Jersey, and one in Pennsylvania
Map of the New York Metropolitan Area
The New York-Newark-Jersey City, NY-NJ-PA Metropolitan Area

Highlights

  • Population growth in both NYC and the NYMA was driven by positive net foreign migration and natural increase, which offset negative net domestic migration.
  • Population growth for both NYC and the NYMA was strong over the first half of the decade, but population growth slowed as domestic out-migration increased from 2011 to 2017.
  • NYC and the NYMA began experiencing population loss from 2017 forward, as both foreign migration and natural increase began to decelerate. Declines in foreign migration are part of a national trend; between 2016 and 2019 net foreign migration for the US fell by 43% (from 1.05 million to 595 thousand).
  • The city and metro’s experience fit within national trends. Most of the top counties in the US that are home to the largest cities and many of the largest metropolitan areas experienced slower population growth over the decade. In addition to NYC, three counties: Cook (Chicago), Los Angeles, and Santa Clara (San Jose) experienced actual population loss towards the decade’s end. The New York, Los Angeles, and Chicago metro areas also had declining populations by the latter half of the decade.
  • Most of NYC’s domestic out-migrants moved to suburban counties within the NYMA (representing 38% of outflows and 44% of net out-migration), and to Los Angeles County, Philadelphia County, and counties in Florida. Out-migrants from the NYMA moved to other large metros across the country, as well as smaller, neighboring metros like Poughkeepsie NY, Fairfield CT, and Trenton NJ. Metro Miami and Philadelphia were the largest sources of both in-migrants and out-migrants.
  • NYC and the NYMA lack any significant relationships with other counties and metro areas where they are net receivers of domestic migrants, receiving more migrants from those places than they send to those places.
  • NYC and the NYMA are similar to the cities and metros of Los Angeles and Chicago, in that they rely on high levels foreign migration and natural increase to offset high levels of negative domestic migration, and have few substantive relationships where they are net receivers of domestic migrants. Academic research suggests that the absolute largest cities and metros behave this way; attracting both low and high skilled foreign migrants while redistributing middle and working class domestic migrants to suburban areas and smaller metros. This pattern of positive foreign migration offsetting negative domestic migration has characterized population trends in NYC for many decades.
  • During the 2010s, most of the City and Metro’s foreign migrants came from Latin America and Asia. Compared to the US as a whole, NYC and the NYMA have slightly higher levels of Latin American and European migrants and slightly lower levels of Asian and African migrants.
  • Given the Census Bureau’s usual residency concept and the overlap in the onset the of COVID-19 pandemic lock down with the 2020 Census, in theory the pandemic should not alter how most New Yorkers identify their usual residence as of April 1, 2020. In practice, the pandemic has been highly disruptive to the census-taking process, which raises the risk of an under count.
  • The impact of COVID-19 on future domestic migration is difficult to gauge. Many of the pandemic destinations cited in recent cell phone (NYT and WSJ) and mail forwarding (NYT) studies mirror the destinations that New Yorkers have moved to between 2011 and 2018. Foreign migration will undoubtedly decline in the immediate future given pandemic disruptions, border closures, and restrictive immigration policies. The number of COVID-19 deaths will certainly push down natural increase for 2020.