Author: Frank

GIS & Data Librarian at Brown University

US Census Data ALA Tech Report

ALA Tech Report on Using Census Data for Research

I have written a new report that has just been released: US Census Data: Concepts and Applications for Supporting Research, published as the May / June 2022 issue of the American Library Association’s Library Technology Reports. It’s available for purchase digitally or in hard copy from the ALA from now through next year. It will also be available as full text via EBSCOhost sometime this month. One year from now, the online version will transition to a free and open publication available via the tech report archives.

The report was designed to be a concise primer (about 30 pages) for librarians who want to be well prepared to assist researchers and students with finding, accessing, and using public summary census data, or who want to apply it to their own work as administrators or LIS researchers. But I also wrote it in such a way that it’s relevant for anyone who is interested in learning more about the census. In some respects it’s a good distillation of my “greatest hits”, drawing on work from my book, technical census-related blog posts, and earlier research that used census data to study the distribution of public libraries in the United States.

Chapter Outline

  1. Introduction
  2. Roles of the Census: in American society, the open data landscape, and library settings
  3. Census Concepts: geography, subject categories, tables and universes
  4. Datasets: decennial census, American Community Survey, Population Estimates, Business Establishments
  5. Accessing Data: data.census.gov, API with python, reports and data summaries
  6. GIS, historical research, and microdata: covers these topics plus the Current Population Survey
  7. The Census in Library Applications: overview of the LIS literature on site selection analysis and studying library access and user populations

I’m pleased with how it turned out, and in particular I hope that it will be used by MLIS students in data services and government information courses.

Although… I must express my displeasure with the ALA. The editorial team for the Library Technology Reports was solid. But once I finished the final reviews of the copy edits, I was put on the spot to write a short article for the American Libraries magazine, primarily to promote the report. This was not part of the contract, and I was given little direction and a month at a busy time of the school year to turn it around. I submitted a draft and never heard about it again – until I saw it in the magazine last week. They cut and revised it to focus on a narrow aspect of the census that was not the original premise, and they introduced errors to boot! As a writer I have never had an experience where I haven’t been given the opportunity to review revisions. It’s thoroughly unprofessional, and makes it difficult to defend the traditional editorial process as somehow being more accurate or thorough compared to the web posting and tweeting masses. They were apologetic, and are posting corrections. I was reluctant to contribute to the magazine to begin with, as I have a low opinion of it and think it’s deteriorated in recent years, but that’s a topic for a different discussion.

Stepping off the soapbox… I’ll be attending the ALA annual conference in DC later this month, to participate on a panel that will discuss the 2020 census, and to reconnect with some old colleagues. So if you want to talk about the census, you can buy me some coffee (or beer) and check out the report.

A final research and publication related note – the map that appears at the top of my post on the distribution of US public libraries from several years back has also made it into print. It appears on page 173 of The Argument Toolbox by K.J. Peters, published by Broadview Press. It was selected as an example of using visuals for communicating research findings, making compelling arguments in academic writing, and citing underlying sources to establish credibility. I’m browsing through the complimentary copy I received and it looks excellent. If you’re an academic librarian or a writing center professional and are looking for core research method guides, I would recommend checking it out.

UN ICSC Retail Price Index Map

UN Retail Price Index Time Series

We recently launched our fledgling geodata portal on GitHub for the open datasets we’ll create in our new lab. In the spring we carved out a space on the 11th floor of the Sciences Library at Brown which we’ve christened GeoData@SciLi, a GIS and data consultation and work space. We’ll be doing renovations on both the webspace and workspace over the summer.

Our inaugural dataset was created by Ethan McIntosh, a senior (now graduate) who began working with me this spring. The dataset is the United Nations International Civil Service Commission’s (UN ICSC) Retail Price Indices with Details (RPID). The index measures the cost of living based on several categories of goods and services in duty stations around the world. It’s used to adjust the salaries of the UN’s international staff relative to UN headquarters in New York City (index value of 100 = cost of living in New York). The data is updated six times a year and published in an Excel spreadsheet with a macro that lets you look up the value for each duty station via a dropdown menu. The UN ICSC makes the data public by request; you register and are granted access to download the data in PDF and Excel format, in files packaged one month / year at a time.

We were working with a PhD student in economics who wanted to construct a time series of this data. Ethan wrote a Python script to aggregate all of the files from 2004 to present into a single CSV; the actual values for each country / duty station are stored in hidden cells that the macro reads from, and his script extracts them directly. He parsed the data into logical divisions, and added the standard 3-letter ISO 3166 country code to each duty station so that each record now has a unique place identifier. His script generates three outputs: a basic CSV of the data in separate month / year files, a “long” (aka flat) time series file where each record represents a specific duty station and retail index category or weight for a given month and year, and a “wide” time series file where the category / weight has been pivoted to columns, so each record represents all values for a duty station for a given month / year. He’s written the program to process and incorporate additional files as they’re published.
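To give a sense of the general approach (this is a minimal sketch, not Ethan’s actual script; the workbook name, sheet name, cell layout, and column names are all hypothetical), the core operations are reading values out of the workbook with openpyxl and reshaping a long table into a wide one with pandas:

# Hypothetical sketch: read values from an RPID workbook and reshape them.
# openpyxl returns cell values whether or not the cells are hidden in Excel.
import openpyxl
import pandas as pd

wb = openpyxl.load_workbook("rpid_2022_02.xlsx", data_only=True)  # values, not formulas
ws = wb["lookup"]  # hypothetical sheet holding the hidden cells the macro reads from

records = []
for station, category, value in ws.iter_rows(min_row=2, max_col=3, values_only=True):
    if station is not None:
        records.append({"duty_station": station, "category": category,
                        "value": value, "year": 2022, "month": 2})

# "long" format: one row per duty station / category / month / year
long_df = pd.DataFrame(records)
# (an ISO 3166 alpha-3 code would be merged in here from a lookup table)

# "wide" format: categories pivoted to columns, one row per duty station / month / year
wide_df = (long_df.pivot_table(index=["duty_station", "year", "month"],
                               columns="category", values="value")
                  .reset_index())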

While the primary intention was to study this data as a time series in a statistical analysis, it can also be used for geospatial analysis and mapping. Using the wide file, I created the map in the header of this post, which depicts the total retail index for February 2022 for each country, where the value represents the duty station within the country (usually the capital city). I grabbed some boundaries from Natural Earth and joined the data to it using the ISO code. I classified the data using natural breaks, but manually adjusted the top level category to include all countries with a value greater than or equal to the base value of 100.
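In geopandas terms, that join-and-classify workflow looks roughly like the sketch below. The file names and the iso_code / total_index columns are placeholders, natural breaks classification requires the mapclassify package, and in my actual map I also manually adjusted the top class to start at the base value of 100:

# Rough sketch: join the wide RPID table to Natural Earth boundaries and map it.
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

countries = gpd.read_file("ne_110m_admin_0_countries.shp")   # Natural Earth boundaries
rpi = pd.read_csv("rpid_wide_2022_02.csv")                   # hypothetical wide extract

# join on the 3-letter ISO code; ISO_A3 is the Natural Earth attribute
joined = countries.merge(rpi, left_on="ISO_A3", right_on="iso_code", how="left")

ax = joined.plot(column="total_index", scheme="NaturalBreaks", k=5,
                 legend=True, missing_kwds={"color": "lightgrey"})
ax.set_title("UN ICSC Retail Price Index, February 2022 (New York = 100)")
plt.show()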

There were only five duty stations that were more expensive than New York, with values between 102 and 124: Tokyo, Ashkhabad (Turkmenistan), Singapore, Beirut, and Hong Kong. Beijing and Geneva were equivalent in price at 100. The least expensive stations with values between 52 and 69 were: Caracas (Venezuela), Tripoli, Damascus, Ankara (Turkey), Bucharest (Romania), Mbabane (Eswatini – formerly Swaziland), and Sofia (Bulgaria). There appears to be regional clustering of like values, although I didn’t run any tests. The station in the US that’s measured relative to NYC is Washington DC (index value of 89).

The final datasets and code used to generate them are available on GitHub, and we’ll update it at least once, if not a couple times, a year. We are not providing the original month / year macro spreadsheets; if you want those you should register with the UN ICSC and access them there. If you’re using our data files, you should still register with them, as they would like to be aware of how their data is being used.

We will post additional projects, datasets, and code in individual repos as we create them, linked to from our main page. I’m working on creating a basic metadata profile for our lab, so we’ll provide structured metadata for each of our datasets in the near future.

Hurricanes 2021

GIS Data for US Coastal Storms and Floods

Over the course of this academic year I’ve helped many students find GIS data related to coastal storms and flooding in the US. There’s a ton of data available, particularly from NOAA, but there are so many projects and initiatives that it can be tough to find what you’re looking for. So I’ll share a few key resources here.

NOAA’s DigitalCoast is a good place to start; it’s a catalog of federal, state, and US territory projects and websites that provide both spatial and non-spatial datasets related to coastal storms and flooding. You can filter by place and data type; there are even a few global sources. Most of the projects I mention below are cataloged there.

Given the size of many of these datasets, the ArcGIS File Geodatabase is often used for packaging and distribution. Once you’ve downloaded and unzipped one, it looks like a folder with lots of subfolders and files. If you’re an ArcGIS user, use the Catalog pane to browse your file system and add a connection to the database / folder to access its contents. If you’re a QGIS user, open the Data Source Manager, and on the Vector tab change the source type from File to Directory; in the Type dropdown choose OpenFileGDB, then browse to and select the database, which appears as a folder. Once you hit the Add button, you’ll be prompted to choose the feature classes in the database that you wish to add to the project.
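If you’d rather script this, a File Geodatabase can also be read directly with Python, assuming your GDAL build includes the OpenFileGDB driver (it ships with recent QGIS / GDAL installs). The path and layer name below are placeholders:

# Hypothetical sketch: list and read feature classes from a File Geodatabase.
import fiona
import geopandas as gpd

gdb_path = "NFHL_Seamless.gdb"              # the unzipped geodatabase is just a folder
print(fiona.listlayers(gdb_path))           # list the feature classes it contains
zones = gpd.read_file(gdb_path, layer="flood_hazard_areas")   # placeholder layer name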

Adding a File Geodatabase in QGIS

FEMA Flood Hazards and Disasters

The FEMA flood maps are usually the first thing that comes to mind when folks set out to find data on flooding, but good luck finding their GIS data. I’ve searched through their main program site for the National Flood Hazard Layer and followed every link, but can’t for the life of me find the connection to the page that has actual GIS data; there are map viewer tools, scanned paper maps, web mapping services, and everything else under the sun.

If you want FEMA flood data in a GIS format: GO HERE! This is the record in data.gov for the National Flood Hazard Layer. The links at the bottom include this one: Download Seamless Nationwide NFHL GIS data. The data is packaged in an ArcGIS File Geodatabase, with one polygon feature class for flood zones. They’re categorized into 100 and 500 year zones, open water bodies, areas outside of flood zones, and areas outside flood zones protected by levees. The pic below illustrates 100 and 500 year zones overlaid on the OpenTopoMap.

FEMA Flood Hazard Layer, 100 year zones in dark blue, 500 year zones in light blue

FEMA also has a GIS data feed for current and historical emergencies and disasters, available in a variety of spatial and non-spatial formats. These are county-level layers that indicate where disaster areas were declared and what kind of funding or assistance is / was available.

NOAA Sea Level Rise

The FEMA maps assess both past events and current conditions to model the likelihood of flooding from a major storm event over a 100 or 500 year period. A different way of looking at flooding is to consider sea level rise due to climate change, measured in set increments. Instead of the impact of a one-shot event, this illustrates potential long term change. NOAA’s Sea Level Rise (SLR) viewer allows you to easily visualize the impact of sea level rise in 1 foot increments, between 1 and 10 feet. You can download the data by US state or territory for coastal areas. There are separate downloads for sea level rise, rise depth, the confidence intervals for the models, as well as DEMs and flood frequency. The sea level rise data is packaged in an ArcGIS file geodatabase, with two sets of files (a low estimate and high estimate) in one foot increments. An example of 6 feet of sea level rise is shown below.

NOAA Sea Level Rise 6ft Layer
NOAA Sea Level Rise. Areas in pink illustrate sea level 6 feet higher than present

NOAA National Hurricane Center

Beyond showing the general impact of flooding or sea level rise, you can also look at the track of individual hurricanes and tropical storms. The National Hurricane Center’s GIS data page provides historical forecasts – the projected path and cone of storms, wind speeds, storm surges, etc. You choose a year, then a storm, and then a particular day. You can use this data to see how the forecasts evolved as the storm moved. During hurricane season, you can also track the circumstances of new storms day by day.

If you want to see what actually happened (as opposed to a forecast), you can dig through the data page and browse the different options. There’s the Tropical Cyclone Report (TCR) which provides “information on each tropical cyclone, including synoptic history, meteorological statistics, casualties and damages, and the post-analysis best track (six-hourly positions and intensities). Tropical cyclones include depressions, storms and hurricanes.” The default page shows you the Atlantic, but you can swap to Eastern or Central Pacific using the link at the top. Storms are listed alphabetically (and thus by date) and your format options are shapefile or KML. There’s a map at the bottom that depicts and labels all the storms for that season. You actually get four shapefiles in a download: a point file that contains a number of measurements, a line file for the storm track, a polygon file for the radius of the storm, and another polygon with the wind swath. The layers for 2021’s Tropical Storm Henri are illustrated below.

NOAA Tropical Cyclone Report Layers
Layers from NOAA’s NHC Tropical Cyclone Report, Tropical Storm Henri 2021

GIS data for the storms begins in 2010 with KMZ files (which you’ll need to convert in ArcGIS or QGIS to make them useful beyond display purposes), and shapefiles appear in 2015. Further back in time are just PDF reports and map scans.

If you really want to go back in time and get all the tracks at once, there’s the HURDAT2 database: one for the Atlantic (1851 to present) and another for the Pacific (1949 to present). It’s a CSV file that contains coordinates for the track of every storm, which you can process to create a geospatial file using a points to line tool. Or – you can grab a version where that’s already been created! The International Best Track Archive for Climate Stewardship (IBTrACS) keeps a running CSV and shapefile of all global storms. Scroll down and choose shapefile (CSV is another option). The download page is just a list of files – you can choose points or lines, storms by ocean (East Pacific, North Atlantic, North Indian, South Atlantic, South Indian, South Pacific, West Pacific), or grab everything in lists that are: active, everything (ALL), last 3 years, or since 1980. Below is an example of all storms in the North Atlantic – there are quite a lot! You get storm speed and direction, wind speed and direction, coordinates, and identifiers associated with the storm as points and lines. A subset of this data for the 2021 season is displayed in the feature image at the top of this post.

IBTrACS Historical Hurricane Tracks
Historical hurricane / storm tracks from 1851 to 2021 in the North Atlantic from IBTrACS
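If you do take the HURDAT2 route, the “points to line” step is straightforward to script. The sketch below assumes you’ve already parsed the raw HURDAT2 text into a CSV with one row per six-hourly observation (the file and column names are hypothetical):

# Hypothetical sketch: build storm track lines from per-observation points.
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString

obs = pd.read_csv("hurdat2_atlantic_parsed.csv")    # storm_id, datetime, lat, lon, ...
obs = obs.sort_values(["storm_id", "datetime"])

lines = []
for storm_id, grp in obs.groupby("storm_id"):
    if len(grp) > 1:                                # a line needs at least two points
        lines.append({"storm_id": storm_id,
                      "geometry": LineString(list(zip(grp["lon"], grp["lat"])))})

tracks = gpd.GeoDataFrame(lines, geometry="geometry", crs="EPSG:4326")
tracks.to_file("hurdat2_tracks.shp")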

How About the Weather?

There are many places you can go for this, and the best source depends on the use case. More often than not, I end up using NOAA’s Local Climatological Data (LCD). Choose a geographic type, then a specific area, and you’ll see all the weather stations in that area. Add them to the cart, and then view the cart once you have all the stations you want. On the next screen choose an output format (CSV or fixed-width TXT) and a date range. You submit an order, wait a bit for it to be compiled, and are notified by email when it’s ready for download. Mixed into this CSV are monthly, daily, and hourly records, so after downloading you’ll want to extract just the period you’re interested in. Data includes temperature, precipitation, dew point, wind speed and direction, humidity, barometric pressure, and cloud cover.

NOAA Local Climatological Data Map Tool
Map Tool search interface for NOAA Local Climatological Data

Some processing is required to make these files GIS ready. Each record represents an observation at a station at a given point in time, so if you plot them “as is” you’re presumably making an illustrated time series of some sort, since you’ll have tons of observations stacked on a handful of spots (where the stations are). If this isn’t what you want, you’ll filter the records to create extracts for a given point in time, or perhaps separate features for each time period. For monthly summaries you can pivot time to columns, creating a column for each month and indicator. This would be impractical for daily or hourly summaries, unless you’re focusing on a single month for the former or a day / week for the latter (otherwise you’ll have a bazillion columns).
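As a rough illustration of that filter-and-pivot step in pandas (the REPORT_TYPE code, the DATE and STATION columns, and the indicator name below are assumptions; check the header of your own extract):

# Hypothetical sketch: split out the monthly records and pivot months to columns.
import pandas as pd

lcd = pd.read_csv("lcd_extract.csv", low_memory=False)

# keep only the monthly summary records (the report-type code is an assumption)
monthly = lcd[lcd["REPORT_TYPE"].str.strip() == "SOM"].copy()
monthly["month"] = pd.to_datetime(monthly["DATE"]).dt.to_period("M")

# one row per station, one column per month for a chosen indicator
pivoted = monthly.pivot_table(index="STATION", columns="month",
                              values="MonthlyMeanTemperature", aggfunc="first")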

Annoyingly, the CSV option doesn’t include any of the station information in the download (like the standard WBAN ID, name, longitude, latitude, and elevation) except for one unique identifier. I know that this information was all included in the past, and am not sure why it was dropped. The TXT version includes the station info, but fixed-width files are a pain to work with. If you are working with a small number of stations, you can pull the station info individually by previewing the station on the download screen (click on the station title or little eye symbol). The five digit WBAN number is included as the last 5 digits of the identifier in the CSV, so you can identify and relate each one. If you don’t want to mess with copying and pasting, you can generate a second extract for all the stations for just a single day and download that in the TXT format, and then parse just the station columns and associate them with your main table.
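For example, a small pandas sketch of that workaround, assuming you’ve parsed the station columns from a one-day TXT extract into their own CSV (file and column names are placeholders):

# Hypothetical sketch: relate CSV records to station info via the WBAN suffix.
import pandas as pd

lcd = pd.read_csv("lcd_extract.csv", low_memory=False, dtype={"STATION": str})
stations = pd.read_csv("lcd_stations_parsed.csv", dtype={"WBAN": str})

lcd["WBAN"] = lcd["STATION"].str[-5:]                 # WBAN = last five digits of the id
merged = lcd.merge(stations, on="WBAN", how="left")   # adds name, lat / lon, elevation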

There are multiple ways that you can create extracts for this data beyond the example I just provided, available from the main data tools page. For a more refined search you can select the summary period (yearly, monthly, daily, hourly) and targeted variables in advance. There are also FTP options for bulk downloads.

One thing that surprises folks who are new to working with this data is that there aren’t many weather stations. For the LCD, my home state of Delaware only has three, one in each county. The entire City of New York only has three as well: one at each of the airports and one in Central Park. If you’re not interested in points and want areas, then you would need to gather a significant number of stations and do interpolation. Or – use data that’s already modeled. I mentioned PRISM at Oregon State in a previous post as a nice source for national US rasters of temperature and precipitation that you can generate for dailies, monthlies, and normals.

Census ACS 2020 and Pop Estimates 2021

Last week, the Census Bureau released the latest 5-year estimates for the American Community Survey for 2016-2020. This latest dataset uses the new 2020 census geography, which means if you’re focused on using the latest data, you can finally move away from the 2010-based geography which had been used for the ACS from 2010 to 2019 (with some caveats: 2020 ZCTAs won’t be utilized until the 2021 ACS, and 2020 PUMAs until 2022). As always, mappers have a choice between the TIGER Line files that depict the precise boundaries, or the generalized cartographic boundary files with smoothed lines and large sections of coastal water bodies removed to depict land areas. The 2016-2020 ACS data is available via data.census.gov and the ACS API.
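If you’re pulling the new estimates programmatically, the request pattern for the ACS 5-year API is simple; the snippet below is just an illustration (B01001_001E is total population, and larger or repeated requests should include a free API key passed as an additional key parameter):

# Illustrative request to the 2016-2020 ACS 5-year API: total population by county in RI.
import requests
import pandas as pd

url = "https://api.census.gov/data/2020/acs/acs5"
params = {"get": "NAME,B01001_001E", "for": "county:*", "in": "state:44"}

rows = requests.get(url, params=params).json()   # first row is the header
acs = pd.DataFrame(rows[1:], columns=rows[0])
print(acs.head())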

This release is over 3 months late (compared to normal), and there was some speculation as to whether it would be released at all. The pandemic (chief among several other disruptive events) hampered 2020 decennial census and ACS operations. The 1-year 2020 ACS numbers were released over 2 months later than usual, in late November 2021, and were labeled as an experimental release. Instead of the usual 1,500 plus tables in 40 subject areas for all geographic areas with over 65,000 people, only 54 tables were released for the 50 states plus DC. This release is only available from the experimental tables page and is not being published via data.census.gov.

What happened? The details were published in a working paper, but in summary fewer addresses were sampled and the normal mail out and follow-up procedures were disrupted (pg 8). The overall sample size fell from 3.5 to 2.9 million addresses due to reduced mailing between April and June 2020 (pg 18), and total interviews fell from 2 million to 1.4 million with most of the reductions occurring in spring and summer (pg 18). The overall housing unit response rate for 2020 was 71%, down from 86% in 2019 and 92% in 2018 (pg 20). The response rate for the group quarters population fell from 91% in 2019 to 47% in 2020 (pg 21). Responses were differential, varying by time period (with the lowest rates during the peak pandemic months) and geography. Of the 818 counties that meet the 65k threshold, response rates in some were below 50% (pg 21). The data contained a large degree of non-response bias, where people who did respond to the survey had significantly different social, economic and housing characteristics from those who didn’t. As a consequence of all of this, margins of error for the data increased by 20 to 30% over normal (pg 18).

Thus, 2020 will represent a hole in the ACS estimates series. The Bureau made adjustments to weighting mechanisms to produce the experimental 1-year estimates, but is generally advising policy makers and researchers who normally use this series to choose alternatives: either the 1-year 2019 ACS, or the 5-year 2016-2020 ACS. The Bureau was able to make adjustments to produce satisfactory 5-year estimates to reduce non-response bias, and the 5-year pool of samples is balanced somewhat by having at least 4 years of good data.

The Population Estimates Program has also released its latest series of vintage 2021 estimates for counties and metropolitan areas. This dataset gives us a pretty sharp view of how the pandemic affected the nation’s population. Approximately 73% of all counties experienced natural decrease in 2021 (between July 1, 2020 and July 1, 2021), meaning deaths outnumbered births. In contrast, 56% of counties had natural decrease in 2020 and 46% in 2019. Declining birth rates and increasing death rates are long term trends, but COVID-19 magnified them, given the large number of excess deaths on one hand and families postponing childbirth due to the virus on the other. Net foreign migration continued its years-long decline, but net domestic migration increased in a number of places, reflecting pandemic moves. Medium to small counties benefited most, as did large counties in the Sunbelt and Mountain West. The biggest losers in overall population were counties in California (Los Angeles, San Francisco, and Alameda), Cook County (Chicago), and the counties that constitute the boroughs of NYC.

Census Bureau 2021 Population Estimates Map
STATA records

Creating STATA Variable Lists in Excel and Do Files With Python

In this post I demonstrate how to export a list of variables from a STATA dta file to an Excel spreadsheet, and how to create a STATA do file by using Python to read a list of variables from a spreadsheet; the do file will generate an extract of attributes and observations from a larger dta file. Gallup Analytics microdata serves as the example.

Gallup Analytics Microdata

Many academic libraries subscribe to an online database called Gallup Analytics, which lets users explore and download summary statistics from a number of on-going polls and surveys conducted by the Gallup Organization, such as the US Daily Tracker poll, the World Poll, and the Gallup Poll Social Series (GPSS). As part of the package, subscribing institutions also receive microdata files for some of the surveys, in STATA and SPSS formats. These files contain the anonymized, individual responses to the surveys. The microdata is valuable to social science researchers who use the responses to conduct statistical analyses.

STATA
Microdata in STATA

Naturally, the microdata is copyrighted and licensed for non-commercial research purposes to members of the university or institution who are covered by the license agreement, and cannot be shared outside the institution. Another stipulation is that the files cannot be shared in their entirety, even with members of the licensed institution; researchers must request individual extracts of variables and observations to answer a specific research question. This poses a challenge for the data librarian, who somehow has to communicate to the researcher what’s available in the files and mediate the request. Option 1 is to share the codebooks (which are also copyrighted and can’t be publicly distributed) with the researcher and haggle back and forth via email to iron out the details of the request. Option 2 is to have a stand-alone computer set up in the library, where a researcher can come and generate their own extract from files stored on a secure, internal network. In both cases, the manual creation of the extract and the researcher’s lack of familiarity with the contents of the data make for a tedious process.

My solution was to create spreadsheets that list all of the variables in each dataset, and have the researcher check the ones they want. I created a resource guide that advertises and describes the datasets, and provides secure links to the Gallup codebooks and these spreadsheets, which are stored on a Google Drive and are protected via university authentication. The researcher can then fill out a Google form (also linked to from that page), where they describe the nature of the request, select the specific dataset of interest, specify filters on observations (rows), and upload the spreadsheet of requested variables (columns). Then, I can read the spreadsheet variables into Python and generate a STATA do file (STATA scripts stored in plain text format), to create the desired extract which I can share with the researcher.

Create List of STATA Variables in Excel Spreadsheet

First, I created a standard set of STATA do files to output lists of all variables to a spreadsheet for the different data files. An example for the US Daily Tracker poll from pre-2018 is below. I was completely unfamiliar with STATA, but the online docs and forums taught me what I needed to pull this together.

Some commands are the same across all the do files. I use describe and then translate to create a simple text file that saves the on-screen summary of the number of rows and columns. describe gives a description of the data stored in memory, and adding the replace option swaps out the data in memory for a new dataset that describes the variables themselves. Then generate select_vars adds an empty column to that codebook-like dataset (select_vars is a variable name I created) where researchers can mark their selections, and I sort the records using the name column. The export excel command is followed by the specific summary fields I wish to output: the position of the variable, its name, data type, variable label, and the empty selection column.

* Create variable list for Gallup US Tracker Survey 2008-2017

* store the survey year (from observation 1) for use in the output file names
local y = YEAR in 1

* print a brief summary of the dataset and save the Results window to a text file
describe,short
summarize YEAR
translate @Results gallup_tracker_`y'_summary.txt, replace

* replace the data in memory with a dataset that describes the variables,
* add an empty selection column, and sort by variable name
describe, replace
generate select_vars = ""
sort name

* write the variable list (position, name, type, label, selection column) to Excel
export excel position name type varlab select_vars using gallup_tracker_`y'_vars.xlsx, firstrow(variables) replace

The variation for this particular US Daily Tracker dataset is that the files are packaged as one file per year. I load the first file for 2008, and the do file saves the YEAR attribute as a local variable, which allows me to include the year in the summary and excel output file names. I had to run this do file for each subsequent year up to 2017. This is not a big deal as I’ll never have to repeat the process on the old files, as new data will be released in separate, new files. Other datasets imposed different requirements; the GPSS survey is packaged in eleven separate files for different surveys, and the updates are cumulative (each file contains older data plus any updates – Gallup sends us updated files a few times each year). For the GPSS, I prompt the user for input to specify the survey file name, and overwrite the previous Excel file.

With the do file in hand, you open STATA and the data file you want to process, change the working directory from the default user folder to a better location for storing the output, open the do file, and it runs and creates the variable list spreadsheet.

Excel spreadsheet of variables generated from STATA
List of variables in Excel generated from STATA file. Users check the variables they want in an extract in the select_vars column

Create a STATA Do File with Python and Excel

Once a researcher submits their Google form and their selected variable spreadsheet (placing an X in a dedicated column to indicate that they want to include a variable), I run the Python script below. I use the openpyxl module to read the Excel file. I have to modify the paths, spreadsheet file name, and an integer for the particular survey each time I run it. I use the os module to navigate up and down through folders to store outputs in specific places. If the researcher specifies in the Google form that they want to filter observations, for example records for specific states or age ranges, I have to add those manually but I commented out a few examples that I can copy and modify. One caveat is that you must filter using the coded variable and not its label (i.e. if a month value is coded as 2 and its label is February, I must reference the code and not the label). Reading in the requested columns is straightforward; the script identifies cells in the selection column (E) that have an X, then grabs the variable name from the adjacent column.

# -*- coding: utf-8 -*-
"""
Pull selected gallup variables from spreadsheet to create STATA Do File
Frank Donnelly / GIS and Data Librarian / Brown University
"""

import openpyxl as xl, os
from datetime import date

thedate=date.today().strftime("%m%d%Y")
surveys={1:'gallup_covid',2:'gallup_gpss',3:'gallup_tracker',4:'gallup_world'}

rpath=os.path.join('requests','test') # MODIFY BASED ON INPUT
select_file=os.path.join(rpath,'gallup_tracker_2017_vars_TEST.xlsx') #MODIFY BASED ON INPUT
survey_file=surveys[3] #MODIFY BASED ON INPUT

dofile=os.path.join(rpath,'{}_vars_{}.do'.format(survey_file,thedate))
dtafile=os.path.join(os.path.abspath(os.getcwd()),rpath,'{}_extract_{}.dta'.format(survey_file,thedate))


#MODIFY to filter by observations - DO NOT ERASE EXAMPLES - copy, then modify
obsfilter=None
# obsfilter=None
# obsfilter='keep if inlist(STATE_NAME,"CT","MA","ME","NH","RI","VT")'
# obsfilter='keep if inrange(WP1220,18,64)'
# obsfilter='keep if SC7==2 & MONTH > 6'
# obsfilter='keep if FIPS_CODE=="44007" | FIPS_CODE=="25025"'

workbook = xl.load_workbook(select_file)
ws = workbook['Sheet1']

# May need to modify ws col and cell values based on user input
vlist=[]
for cell in ws['E']:
    if cell.value in ('x','X'): 
        vlist.append((ws.cell(row=cell.row, column=2).value))
with open(dofile, "w") as outfile:
    outfile.write('keep ')
    outfile.write(" ".join(vlist)+"\n")
    if obsfilter is not None:
        outfile.write(obsfilter+"\n")
    outfile.write('save '+dtafile+"\n")
print('Created',dofile)

The plain text do file begins with the command keep followed by the columns, and if requested, an additional keep statement to filter by records. The final save command will direct the output to a specific location.

keep CENREG D17A D23 D24 D5 FIPS_CODE HISPANIC INT_DATE MONTH MOTHERLODE_ID PE_WEIGHT RACE SC7 STATE_NAME WP10202 WP10208 WP10209 WP10215 WP10216 WP10229 WP10230 WP1220 WP1223 YEAR ZIPGALLUPREGION ZIPSTATE
save S:\gallup\processing\scripts\reques\test\gallup_tracker_extract_02202022.dta

All that remains is to open the requested data file in STATA, open the do file, and an extract is created. Visit my GitHub for the do files, Python script, and sample output. The original source data and the variable spreadsheets are NOT included due to licensing issues; if you have the original data files you can generate what I’ve described here. Sorry, I can’t share the Gallup data with you (so please don’t ask). You’ll need to contact your own university or institution to determine if you have access.

Signals

The Role of Geography in Video Games

I’ve suffered from a bit of the blues in this new year, so two weekends ago I logged into Steam and bought a new game to relax a little. It’s been a few years since I’ve bought a new game, and in playing this one and perusing my existing collection I realized they all had one thing in common: geography plays a central role in all of them. So in this first post of the year, I’ll explore the role of geography in video games using examples from my favorites.

Terrain and Exploration in 4X Games

My latest purchase is Signals by MKDB Studios, an indie husband and wife developer team in the UK. It can be classified as a “casual” game: it’s easy to learn and play yet possesses subtle complexity, it’s relatively relaxing in that it doesn’t require constant action and mental exertion, and you can finish a scenario or round in an hour or so. In Signals, you are part of a team that has discovered an alien signaling device on a planet. The device isn’t working; fixing it requires setting up a fully functional research station adjacent to the device, so you can unlock its secrets. The researchers need access to certain key elements that you must discover and mine. The planet you’re on doesn’t have these resources, so you’ll need to explore neighboring systems in the sector and beyond to find and bring them back.

Space travel is expensive, and the further you journey the more credits (money) you’ll need. To fund travel to neighboring systems in search of the research components (lithium, silicon, and titanium) you can harvest and sell a variety of other resources including copper, iron, aluminum, salt, gold, gems, diamonds, oil, and plutonium. As you flood the market with resources their value declines, yielding diminishing returns. You must be strategic in hopping from planet to planet and deciding what to harvest and sell. Mining resources incurs initial fixed costs for building harvesters (one per resource patch), a solar array for powering them (which can only cover a small area), and a trade post for moving resources to the market (one per planet).

Signals Terrain
Terrain view on Signals. Harvesters extracting resources in the center, other resource patches to the right

The game has two distinct views: one displays the terrain of the planet you’re on, while the other is a map of the sector(s) with different solar systems, so you can explore the planets and their resources and make travel plans. Different terrain provides different resources and imposes limits on game play. You are precluded from constructing buildings on water and mountains, and must clear forests if you wish to build on forested spaces. Terrain varies by type of planet: habitable Earth-like, red Mars-like, ice worlds, and desert worlds. The type of world influences what you will find in terms of resources; salt is only found on habitable worlds, while iron is present in higher quantities on Mars-like worlds. Video games, like maps, are abstractions of reality. The planet view shows you just a small slice of terrain that stands in for the entire world, so that the game can emphasize the planet hopping concept that’s central to its design. The other view – the sector map – is used for navigation and reference, keeping track of where you are, where you’ve been, and where you should go next.

The use of terrain and the role of physical geography are key aspects in simulators (like SimCity, an old favorite) and the so-called 4X games which focus on exploration, mining, trading, and fighting, although not all games employ all four aspects (Signals has no fighting or conquest component). Another example of a 4X game is Factorio by Wube Software, which I’ve written about previously. Like Signals, exploring terrain and mining resources are central to the game play. Similarities end there, as Factorio is anything but a casual game. It requires a significant amount of research and experimentation to learn how to play, which means consulting wikis, tutorials, and YouTube. It also takes a long time – 30 to 40 hours to complete one game!

The action in Factorio occurs on a single planet, where you’re looking for resources to mine to build higher order goods in factories, which you turn into even higher order products as you unlock more technologies by conducting research, with the ultimate goal of constructing a rocket to get off the planet. There are also two map views: the primary terrain view that you navigate as the player, and an overview map displaying the extent of the planet that you’ve explored. You begin with good knowledge of what lies around you, as you captured this info before your spaceship crashed and marooned you here. Beyond that is simply unknown darkness. To reveal it, you physically have to go out and explore, or build radar devices that gradually expand your knowledge. The terrain imposes limits on building and movement; water can’t be traversed or built upon, canyons block your path, and forests slow movement and prevent construction unless you chop them down (or build elsewhere). The world generated in Factorio is endless, and as you use up resources you have to push outward to find more. You can build vehicles to travel more quickly, while conveyor belts and trains transport resources and products to and around your factory; this growing logistical puzzle forms a large part of the game.

Factorio
Mining drills in Factorio extracting stone. Belts transport resources short distances, while trains cover longer distances.

The role of terrain and exploration has long been a mainstay in these kinds of games. Thanks to DOSbox (an emulator that lets you run DOS and DOS programs on any OS), I was recently playing the original Sid Meier’s Civilization from 1991 by Microprose. This game served as a model for many that followed. Your small band of settlers sets out in 4000 BC to found the first city for your particular civilization. You can see the terrain immediately around you, but the rest of the world is shrouded in darkness. Moving about slowly reveals more of the world, and as you meet other civs you can exchange maps to expand your knowledge. The terrain – river basins, grasslands, deserts, hills, mountains, and tundra – influences how your cities will grow and develop based on the varying amount of food and resources it produces. Terrain also influences movement; it is tougher and takes longer to move over mountains versus plains, and constructing roads increases speed and trade. The terrain also influences attack and defense capabilities for military units…

Original Civilization
The unknown, unexplored world is shrouded in darkness in Civilization

Terrain and Exploration in Strategic Wargames

Strategic war games are another genre where physical geography matters. One weekend shortly after the pandemic lock-down began in 2020, I dug my old Nintendo out of the closet and replayed some of my old favorites. Desert Commander by Kemco was a particular favorite, and one of the few war strategy games on the NES. You command one of two opposing armies in the North African desert during World War II, and your objective is simple: eliminate the enemy’s headquarters unit, or their entire army. You have a mix of tanks, infantry, armored cars, artillery, anti-aircraft guns, supply trucks, fighters, and bombers at your command. Each unit varies in terms of range of movement and offensive and defensive strength, in total and relative to other units. Tanks are powerful attackers, but weak defenders against artillery and hopeless against bombers. Armored cars cruise quickly across the sands compared to slowly trudging infantry.

Terrain also influences movement, offense, and defense. You can speed along a road, but if you’re attacked by planes or enemy tanks you’ll be a sitting duck. Desert, grassland, wilds (mountains), and ocean make up the rest of the terrain, but scattered about are individual features like pillboxes and oases which provide extra defense. A novel aspect of the game was its reliance on supply: units run low on fuel and ammo, and after combat their numbers are depleted. You can supply fuel and ammo to ground units with trucks, but eventually these also run out of gas. Scattered about the map are towns, which are used for resupply and reinforcement. There are a few airfields scattered about, which perform the same functions for aircraft. Far from being a simple battle game, you have to constantly gauge your distance and access to these supply bases, and consider the terrain that you’re fighting on. The later scenarios leave you hopelessly outnumbered against the enemy, which makes the only winning strategy a defensive one where you position your headquarters and the bulk of your forces at the best strong point, while simultaneously sending out a smaller strike force to get the enemy HQ.

Desert Commander
Units and terrain in NES Desert Commander. Towns and airfields are used for resupply.

There have been countless iterations and updates on this type of game. One that I have in my Steam library is Unity of Command by 2×2 Games, which pits the German and Soviet armies against each other in the campaigns around Stalingrad during World War II. Like most modern turn-based games, it replaces the grid structure (used in Desert Commander) with a hex structure, reflecting greater adjacency between areas. Again, there is a mix of different units with different strengths and capabilities. The landscape on the Russian steppe is flat, so much of the terrain challenge lies in securing bridgeheads or flanking rivers when possible, as attacking across them is suicidal. The primary goal is to capture key objectives like towns and bridgeheads in a given period of turns. A unit’s attack and movement phase are not strictly separated, so you can attack with a unit, move out of the way, and move another in to attack again until you defeat a given enemy. This opens up a hole, allowing you to pour more units through a gap to capture territory and move towards the objectives. The supply concept is even more crucial in UOC; as units move beyond their base, which radiates from either a single point or from a roadway, they will eventually run out of supplies and will be unable to fight. By pushing through gaps in defense and outflanking the enemy, you can capture terrain that cuts off this supply, which is more effective than trying to attack and destroy everything.

Unity of Command
Unity of Command Stalingrad Campaign. Push units through gaps to capture objectives and cut off enemy supplies.

The most novel take on this type of game that I’ve seen is Radio General by Foolish Mortals. This WWII game is a mix of strategy and history lesson, as you command and learn about the Canadian army’s role in the war (it incorporates an extensive amount of real archival footage). As the general commanding the army, you can’t see the terrain, or even where any of the units are – including your own. You’re sitting behind the lines in a tent, looking at a topographic map and communicating with your army – by voice! – on the radio. You check in and confirm where they are, so you can issue orders (“Charlie company go to grid cell Echo 8”), and then slide their little unit icons on the map to their last reported position. They radio in updates, including the position of enemy units. The map doesn’t give you complete and absolute knowledge in this game; instead it’s a tool that you use to record and understand what’s going on. Another welcome addition is the importance of elevation, which aids or detracts in movement, attack, defense, and observation. A unit sitting on top of a hill can relay more information to you about battlefield circumstances compared to one hunkered down in a valley.

Radio General
In Radio General, the topo map is the battlefield. You rely on the radio to figure out where your units, and the enemy units, are.

Strategy Games with the Map as Focal Point

While terrain takes center stage in many games, in others all the action takes place on a map. Think of board games like Risk, where the map is the board and players capture set geographic areas like countries, which may be grouped into hierarchies (like continents) that yield more resources. Unlike the terrain-based games where the world is randomly generated, most map-based games are relatively fixed. Back to the Nintendo, the company Koei was a forerunner of historical strategy games on consoles, my favorite being Genghis Khan. The basic game view was the map, which displayed the “countries” of the world in the late 12th and early 13th centuries. Each country produced specialized resources based on its location, had military units unique to its culture, and produced gold, food, and resources based on its infrastructure. Your goal was to unite Mongolia and then conquer the world (of course). Once you captured other countries, they remained as distinct areas within your empire, and you would appoint governors to manage them. When invading, the view switched from an administrative map mode to a battlefield terrain mode, similar to ones discussed previously.

Genghis Khan
Genghis Khan on the original NES. The map is the focal point of the game.

Fast forward to now, and Paradox has become a leading developer of historical strategy games. Crusader Kings 2 is one of their titles that I have, where the goal is to rule a dynasty from the beginning to the end of the medieval period in the old world. Conquering the entire world is unlikely; you aim to rule some portion of it, with the intention of earning power and prestige for your dynasty through a variety of means, warlike and peaceful. These are complex games, which require diving into wikis and videos to understand all the medieval mechanics that you need to keep the dynasty going. Should you use gavelkind, primogeniture, seniority, or feudal elective succession? Choose wisely, otherwise your kingdom could fracture into pieces upon your demise, or worse your nefarious uncle could take the throne.

CK2 takes human geographical complexity to a new level with its intricate hierarchy of places. The fundamental administrative unit on the map is a county. Within the county you have sub-divisions, which in GIS-speak are like “points in polygons”: towns ruled by a non-noble mayor, parishes ruled by a bishop, and maybe a barony ruled by a noble baron. Ostensibly, these would be vassals to the count, while the count in turn is a vassal to a duke. Several counties form a duchy, which in turn make up a kingdom, and in some cases several kingdoms are part of empires (i.e. Holy Roman and Byzantine). In every instance, rulers at the top of the chain will hold titles to smaller areas. A king holds a title to the kingdom, plus one or more duchies, one or more counties (the king’s demesne), and maybe a barony or two. If he / she doesn’t hold these titles directly, they are granted to a vassal. A big part of the game is thwarting the power of vassals to lay claim to your titles, or if you are a vassal, getting claims to become the rightful heir. Tracking and shaping family relations and how they are tied to places is a key to success, more so than simply invading places.

Crusader Kings 2
Geographic hierarchy in Crusader Kings 2. Middlesex is a county with a town, barony, and parish. It’s one of four counties in the Duchy of Essex, in the Kingdom of England.

Simply invading, in CK2, is hard to do. Unlike other war games, you can’t invade whomever you want (unless they are members of a rival religion). The only way to go to war is if you have a legitimate claim to territory. You gain claims through marriage and inheritance, through your vassals and their claims, by fabricating claims, or by claiming an area that’s a de jure part of your territory, i.e. that is historically and culturally part of your lands. While the boundaries of the geographic units remain stable, their claims and de jure status change over time depending on how long they’re held, which makes for a dynamic map. In another break from the norm, the map in CK2 performs all functions; it’s the main screen for game play, both administration and combat. You can modify the map to show terrain, political boundaries, and a variety of other themes.

Conclusion

I hope you enjoyed this little tour, which merely scratches the surface of the relation between geography and video games, based on a small selection of games I’ve played and enjoyed over the years. There’s a tight relationship between terrain and exploration, and how topography influences resource availability and development, the construction of buildings, movement, offense and defense. In some cases maps provide the larger context for tracking and explaining what happens at the terrain level, as well as navigating between different terrain spaces. In other cases the map is the central game space, and the terrain element is peripheral. Different strategies have been employed for equating the player’s knowledge with the map; the player can be all-knowing and see the entire layout, or they must explore to reveal what lies beyond their immediate surroundings.

There are also a host of geographically-themed games that make little use of maps or terrain. For example, Mini Metro by Dinosaur Polo Club is a puzzle game where you connect constantly emerging stations to form train lines to move passengers, using a schematic resembling the London tube map. In this game, the connectivity between nodes in a network is what’s important, and you essentially create the map as you go. Or 3909 LLC’s Papers, Please, a dystopian 1980s “document simulator” where you are a border control guard in an authoritarian country in a time of revolution, checking the documentation of travelers against ever-changing rules and regulations (do travelers from Kolechia need a work permit? Is Skal a city in Orbistan or is this passport a forgery…). Of course, we can’t end this discussion without mentioning Where in the World is Carmen Sandiego, Broderbund’s 1985 travel mystery that introduced geography and video games to many Gen Xers like myself. Without it, I may have never learned that perfume is the chief export of Comoros, or that Peru is slightly smaller than Alaska!

Where in the World is Carmen Sandiego? In hi-tech CGA resolution!