
Sample of Geolocated Tweets Nov 1, 2022

Parsing the Internet Archive’s Twitter Stream Grab with Python

In this post I’ll share a process for getting geo-located tweets from Twitter, using large files of tweets archived by the Internet Archive. These are tweets where the user opted to have their phone or device record the longitude and latitude coordinates for their location at the time of the tweet. I’ve created some straightforward scripts in Python, without any third-party modules, for processing a daily file of tweets. Given all the turmoil at Twitter in early 2023, most of the tried and true solutions for scraping tweets or using their APIs no longer function. What I’m presenting here is one simple solution.

Social media data is not my forte, as I specialize in working with official government datasets. When such questions turn up from students, I’ve always turned to the great Web Scraping Toolkit developed by our library’s Center for Digital Scholarship. But the graduate student I was helping last week and I discovered that both the Twint and TAGS tools no longer function due to changes in Twitter’s developer policies. Surely there must be another solution – there are millions of posts on the internet that show how easy it is to grab tweets via R or Python! Alas, we tried several alternatives to no avail. Many of these projects rely on third-party modules that are deprecated or dodgy (or both), and even if you can escape from dependency hell and get everything working, the changed policies render them moot.

You can register under Twitter’s new API policy and get access to a paltry number of records. But I thought – surely, someone else has scraped tons of tweets for academic research purposes and has archived them somewhere – could we just access those? Indeed, the folks at Harvard have. They have an archive of geolocated tweets in their dataverse repository, and another one for political tweets. They are also affiliated with a much larger project called DocNow, along with other schools that maintain different tweet archives. But alas, there are rules to follow, and to comply with Twitter’s license agreement Harvard and these institutions can’t directly share the raw tweets with anyone outside their institutions. You can search and retrieve IDs for the tweets, which you can then feed to their Hydrator application to get the actual tweets. But then in small print:

“Twitter’s changes to their API which greatly reduce the amount of read-only access means that the Hydrator is no longer a useful application. The application keys, which functioned for the last 7 years, have been rescinded by Twitter.”

Fortunately, there is the Internet Archive, which has been working to preserve pieces of the internet for posterity for several decades. Their Twitter Stream Grab consists of monthly collections of daily files, covering 2016 to 2022. This project is no longer active, but there’s a newer one called the Twitter Archiving Project which has data from 2017 to now. I didn’t investigate this latter one, because I wasn’t sure if it provided the actual tweets or just metadata about them, while the older project definitely did. The IA describes the Stream Grab as the “spritzer” version of Twitter grabs (as opposed to a sprinkler or garden hose). Thanks to the internet, it’s easy to find statistics but hard to find reliable ones – one credible-looking source (the GDELT Project) suggests that there are between 400 and 500 million tweets a day in recent years. The file I downloaded from IA for one day had over 4 million tweets, so that’s about 1% of all tweets.

I went into the November 2022 collection and downloaded the file for Nov 1st. It’s a TAR file that’s about 3 GB. Extracting it gives you a folder named for that date, containing hundreds of gzipped (.gz) files. Unzip those, and you have tons of JSON Lines (JSONL) files. These are JSON files where each JSON record has been collapsed onto a single line.

Internet Archive Twitter Stream Grab

Python to the rescue. See GitHub for the full scripts – I’ll just add some snippets here for illustration. I wrote two scripts: the first reads in and aggregates all the tweets from the JSONL files, parses them into a Python dictionary, and writes out the geo-located records into regular JSON. The second reads in that file, selects the elements and values that we want into a list format, and writes those out to a CSV. The rationale was to separate importing and parsing from making these selections, as we’re not going to want to repeat the time-consuming first part while we’re tweaking and modifying the second part.

In the sample data I used for 11/01/2022, unzipping the downloaded TAR file gave me a date folder, and in that date folder were hundreds of gzipped files. Unzipping those revealed the JSONL files. I wrote the script to look in that date folder, one level below the folder that holds the scripts, and read in anything that ended with .json. Not all of the Internet Archive’s streams are structured this way; if your downloads are structured differently, you can simply move all the unzipped JSON files to one directory below the script to read them. Or, you can modify the script to iterate through sub-directories.
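If you’d rather not unzip hundreds of .gz files by hand, Python’s standard library can handle the extraction step too. Here’s a minimal sketch; the TAR filename and the date folder name are assumptions, so adjust them to match your download:

import tarfile, gzip, shutil, os

tar_path = 'twitter-stream-2022-11-01.tar' # assumed filename - use whatever you downloaded
date_dir = '20221101' # assumed name of the date folder the TAR extracts to

with tarfile.open(tar_path) as tar:
    tar.extractall() # creates the date folder full of .json.gz files

for root, dirs, files in os.walk(date_dir):
    for f in files:
        if f.endswith('.gz'):
            gz_path = os.path.join(root, f)
            with gzip.open(gz_path, 'rb') as zipped, open(gz_path[:-3], 'wb') as unzipped:
                shutil.copyfileobj(zipped, unzipped) # writes the .json file alongside the .gz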

Because the data was stored as JSONL, I wasn’t able to read it in as regular JSON. I read each line in as a string and appended it to a list, iterated through that list to convert each string into a dictionary, pulled out the records that had geo-located elements, and added those records to a larger dictionary, using an identifier in the record as the key and the full set of elements and values for the tweet as the value. This gets written out as regular JSON at the end. Reading the data in didn’t take long; parsing the strings into dictionaries was the time-consuming part. Originally, I wanted to parse and save all 4 million records, but the process stalled at around 750k records when I ran out of memory. Since so few records are geo-located, selecting just those circumvented this problem. If you wanted to modify this part to get other kinds of records, you would need to apply a different filter, or implement a more efficient process than what I’m using.

import os
import json

json_dir = '20221101' # example name - the date folder with the unzipped JSONL files, one level below this script

json_list=[] # flat list of strings, each string = 1 line = 1 tweet

for f in os.listdir(json_dir):
    if f.endswith('.json'):
        json_file=os.path.join(json_dir,f)
        with open(json_file,'r',encoding='utf-8') as jf:
            jfile_list = list(jf) # read every line in the file into a list of strings
            json_list.extend(jfile_list)
            print('Processed file',f,'...')

geo_dict={} # dictionary of dicts: tweet id -> all keys / values for that tweet
i=0
for json_str in json_list:
    result = json.loads(json_str) # convert line / string to dict
    if result.get('geo') is not None: # only keep records that were geo-located
        geo_dict[result['id']]=result
    i=i+1
    if i%100000==0:
        print('Processed',i,'records...')
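The script ends by writing the dictionary of geo-located tweets out as regular JSON, which isn’t shown in the snippet above. That final step might look something like this (the output filename is just an example):

with open('tweets_geo.json', 'w', encoding='utf-8') as outfile:
    json.dump(geo_dict, outfile, indent=4, ensure_ascii=False) # keep non-Latin characters readable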

The second script reads the JSON output from the first, iterates through the dictionary, and chooses the elements and values I want, assigning them to variables. Some of these are straightforward, such as grabbing the timestamp and tweet. Others required additional work. The source element provides HTML code with a source link and name, so I split and strip this value to get them separately. The coordinates are stored as a list, so to get longitude and latitude as separate values I indicate the list position. In cases where I’m delving into a sub-dictionary to get a value (like the coordinates), I added if statements that set values to None if they don’t exist in the JSON; otherwise you get an error. Once I finish iterating, I append all these variables to a list, and add this list to the main one that captures every record. I create a matching header row list, and both are written out as a CSV.

import json

input_json = 'tweets_geo.json' # example name - the JSON output from the first script
with open(input_json, encoding='utf-8') as json_file:
    twit_data = json.load(json_file)

twit_list=[]

# In this block, select just the keys / values to save
for k,v in twit_data.items():
    tweet_id=k
    timestamp=v.get('created_at')
    tweet=v.get('text')
    # Source is in HTML with anchors. Separate the link and source name
    source=v.get('source') # This is in HTML
    source_url=source.split('"')[1] # This gets the url
    source_name=source.strip('</a>').split('>')[-1] # This gets the name
    lang=v.get('lang')
    # Value for long / lat is stored in a list, must specify position
    if v.get('geo') is not None:
        longitude=v.get('geo').get('coordinates')[1]
        latitude=v.get('geo').get('coordinates')[0]
    else:
        longitude=None
        latitude=None
...
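The tail end of the loop and the CSV write are omitted above. A rough sketch of how they might look follows, keeping just the variables from the snippet (the column names and output filename are mine, not necessarily what’s in the repo):

import csv # place this with the other imports at the top of the script

    # Still inside the for loop: append the values for this tweet to the master list
    twit_list.append([tweet_id, timestamp, tweet, source_url, source_name,
                      lang, longitude, latitude])

# After the loop: create a matching header row and write everything out
header = ['tweet_id', 'timestamp', 'tweet', 'source_url', 'source_name',
          'lang', 'longitude', 'latitude']

with open('tweets_geo.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(twit_list)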

My code could use improvement – much of this could be abstracted into a function to avoid repetition. We were in a hurry, and I’m also working with folks who need data but aren’t necessarily familiar with Python, so something that’s inefficient but understandable is okay (although I will polish this up in the future).
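As one example of that abstraction, a small helper (my own sketch, not code from the repo) could replace the repeated if / else checks for nested values:

def get_nested(record, *keys):
    # Walk down a chain of dictionary keys, returning None if any level is missing
    value = record
    for key in keys:
        if not isinstance(value, dict):
            return None
        value = value.get(key)
        if value is None:
            return None
    return value

# e.g. coords = get_nested(v, 'geo', 'coordinates') instead of testing v['geo'] first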

I provide the output on GitHub; examples of the final CSV appear below. Every language in the world is captured in these tweets, so Windows users need to import the CSV into Excel (Data – From Text/CSV) and choose UTF-8 encoding. Double-clicking the CSV to open it in Excel on Windows will render most of the text as junk, since it falls back to the default Windows-1252 encoding.

Tweets extracted from Internet Archive with timestamp, tweets, and source information
Tweets extracted from Internet Archive, showing geo-located information

So, is this data actually useful? That’s an open question. Of the 4 million tweets in this file, just 1,158 were geo-located! I checked, and this is not a mistake. The metadata record for the Harvard geolocated tweets mentions that only 1% to 2% of all tweets are geo-located. So of the 400 million daily tweets, only about 4 million would be geo-located. And out of our daily 4 million sample from IA, just 1,158 (about 0.03%). What we ended up with does give you a sense of variety and global coverage (see map at the top of the post, showing a sample of tweets by language for Nov 1, 2022). In this sample, the top five countries represented were: US (35%), Japan (17%), Brazil (4%), UK (4%), and Mexico and Turkey (tied at 3%). For languages, the top five were: English (51%), Japanese (17%), Spanish (9%), Portuguese (5%), and Turkish (3%).

In many cases, I think you’d need a larger sample than a single day, assuming you’re interested in just geo-located records. Perhaps 4 million is large enough for certain non-spatial research? Again, not my area of expertise, but you would want to be aware of events that happened on a certain date that would influence what was tweeted. My graduate student wanted to see differences in certain kinds of tweets in the LA metro area versus the rest of the US, but this sample includes fewer than 20 tweets from LA. To do anything meaningful, she’d have to download and process a whole month of tweets (at least). Even then, there are certain tweeters that show up repeatedly in given areas. In NYC, most of the tweets on this date were from the 511 service, warning people where that day’s potholes were.

Beyond the location of the tweet, there is a lot of information about the user, including their self-reported location. This data is available in all tweets (not just the geo-located ones). But there are a lot of problems with this attribute: the user isn’t necessarily tweeting from that location, as it represents their “static” home. This location is not geocoded, and it’s self-reported and uncontrolled. In this example, some users dutifully reported their home as ‘Cleveland, OH’ or ‘New York City’. Other folks listed ‘NYC – LA – ATL – MIA’, ‘CIUDAD DE LAS BAJAS PASIONES’, ‘H E L L’, and ‘Earth. For now’.

Even for research that incorporated geo-located tweets from other, larger data sources that were previously accessible, how representative are all those studies when the data represents only 1% of the total tweet volume? I am skeptical. Also consider the information from the good folks at the Pew Research Center, which tells us that only one in five US adults use Twitter, and that a minority of Twitter users generate the vast majority of tweets: “The top 25% of US users by tweet volume produce 97% of all tweets, while the bottom 75% of users produce just 3%” (10 Facts About Americans and Twitter, May 5, 2022).

For what’s it worth, if you need access to Twitter data for academic, non-commercial research purposes and the old methods aren’t working, perhaps the Internet Archive’s data and the solution posed here will fit the bill. You can see the geo-located output (JSON and CSV) from this example in the GitHub repo’s output folder. There is also a samples folder, which contains JSON and CSV for about 77k records that include both geo-located and non-geolocated examples. Looking at the examples can help you decide how to modify the scripts, to pull out different elements and values of interest.

Raster Temperature Jan 1, 2020 Southern NE

Summarizing Raster Data for Areas and Assigning Values to Points

It’s been a busy few months, but I have a few days to catch my breath now that it’s spring break and most people (except me) have gone away! One question that’s come up quite a bit this semester is how to associate raster data with coinciding vector data. I’ll summarize some approaches in this post using ArcGIS Pro and QGIS, to summarize raster values for polygons (zonal statistics) and to assign raster values to points (aka raster sampling).

Zonal Statistics: Summarize Rasters by Area

Imagine that you have quantitative values such as temperature or a vegetation index in a raster grid, and you want to use this data to calculate an average for counties or metro areas. The goal is to have a new attribute column in the vector layer that contains the summarized raster value, perhaps because you want to make thematic maps of that value, or you want to use it in conjunction with other variables to run spatial statistics, or you just want a plain and simple summary for given places.

The term zonal statistics is used to define any operation that calculates statistics on cell values of a raster within an area or zone defined by another dataset, either a raster or a vector. The ArcGIS Pro toolbox has a Zonal Statistics tool where the output is a new raster file with cells that are summarized by the input zones. That’s not desirable for the use case I’m presenting here; the better choice is the Zonal Statistics as Table tool. The output is a table containing the unique identifiers of the raster and vector, the summary stats you’ve generated (average, sum, min, max, etc), and a count of the number of cells used to generate the summary. You can join this resulting table back to the vector file using their common unique identifier in a table join.

In the example below, I’m using counties from the census TIGER files for southern New England as my Input Feature Zone, the AFFGEOID (Census ANSI / FIPS code) to identify the Zone Field, and a temperature grid for January 1, 2020 from PRISM as the Input Value Raster. I’m calculating the mean temperature for the counties on that day.

ArcGIS Zonal Statistics as Table Tool
ArcGIS Pro Zonal Statistics as Table; Temperature Grid and Southern New England Counties

The output table consists of one record for each zone / county, with the count of the cells used to create the average, and the mean temperature (in degrees Celsius). This table can be joined back to the original vector feature (select the county feature in the Contents, right click, Joins and Relates – Join) to thematically map the average temp.

ArcGIS Zonal Statistics Result
ArcGIS Pro Zonal Statistics; Table Output and Join to Show Average Temperature per County

In QGIS, this tool is simply called Zonal Statistics; search for it in the Processing toolbox. The vector with the zones is the Input layer, and the Raster layer is the grid with the values. By default the summary stats are the count, sum, and mean, but you can check the Statistics to calculate box to select others. Unlike ArcGIS, QGIS allows you to write output as a table or a new shapefile / geopackage, which carries along the feature geometry from the Input zones and adds the summaries, allowing you to skip the step of having to do a table join (if you opted to create a table, you could join it to the zones using the Joins tab under the Properties menu for the vector features).

QGIS Zonal Stats
QGIS Zonal Statistics
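If you’d rather script the zonal statistics step than run it through either GUI, the rasterstats package pairs nicely with geopandas. A minimal sketch, assuming the counties and the PRISM grid are saved locally under these made-up filenames and share the same coordinate system:

import geopandas as gpd
from rasterstats import zonal_stats

counties = gpd.read_file('sne_counties.shp') # hypothetical zone layer
stats = zonal_stats(counties, 'prism_temp_20200101.tif', stats=['count', 'mean'])

# zonal_stats returns one dict per zone, in the same order as the input features
counties['mean_temp'] = [s['mean'] for s in stats]
print(counties[['AFFGEOID', 'mean_temp']].head())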

Extract Raster Values for Point Features

Zonal stats allows you to summarize raster data within a polygon. But what if you had point features, and wanted to assign each point the value of the raster cell that it falls within? In ArcGIS Pro, search the toolbox for the Extract Values to Points tool. You select your input points and raster, and specify a new output point feature class that will include the raster values. The default is to take the value for the cell that the point falls within, but there is an Interpolate option that will calculate the value from adjacent cells. The output point feature contains a new column called RASTERVALU. I created some phony point data and used it to generate the output below.

ArcGIS Extract Values to Points
ArcGIS Pro Extract Values to Points (assign raster cell values to points)

In QGIS the name of this tool is Sample raster values, which you can find in the Processing toolbox. Input the points, choose a raster layer, and write the output to a new vector point file. Unlike ArcGIS, there isn’t an option for interpolation from surrounding cells; you simply get the value for the cell that the point falls within. If you needed to interpolate, you can go to the Plugins menu, enable the SAGA plugin, and in the Processing toolbox try the SAGA tool Raster Values to Points instead.

QGIS Sample Raster Values
QGIS Sample Raster Values (assign raster cell values to points)
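The scripted equivalent of raster sampling is also short; rasterio can pull the cell value underneath each point directly. Another sketch with assumed filenames, and assuming the points are already in the same coordinate system as the raster:

import geopandas as gpd
import rasterio

points = gpd.read_file('sample_points.shp') # hypothetical point layer
coords = [(pt.x, pt.y) for pt in points.geometry]

with rasterio.open('prism_temp_20200101.tif') as src:
    # sample() yields one array per point (one value per band); grab band 1
    points['rastervalu'] = [val[0] for val in src.sample(coords)]

points.to_file('sample_points_temp.shp')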

A variation on this theme would be to create and assign an average value around each point at a given distance, such as the average temperature within five miles. One way to achieve this would be to use the buffer tools in either ArcGIS or QGIS to create distinct buffers around each point at the specified distance. The buffer will automatically carry over all the attributes from the point features, including unique identifiers. Then you can run the zonal statistics tools against the buffer polygons and raster to compute the average, and if need be do a table join between the output table and the original point layer using their common identifier.
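Scripting this variation just means buffering before running the zonal statistics. A sketch that builds on the two examples above (the five-mile buffer assumes the point layer uses a projected coordinate system measured in meters; reproject first if it doesn’t):

from rasterstats import zonal_stats

buffers = points.copy()
buffers['geometry'] = buffers.geometry.buffer(8047) # ~5 miles in meters

stats = zonal_stats(buffers, 'prism_temp_20200101.tif', stats=['mean'])
points['avg_temp_5mi'] = [s['mean'] for s in stats]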

Wrap-up

In using any of these tools, it’s important to consider the resolution of the raster (i.e. the size of the grid cell):

1. Relative to the size of the zonal areas or number of points, and

2. In relation to the phenomena that you’re studying.

When larger grid cells or zonal areas are used for measurement, any phenomenon becomes more generalized, and variations within these large areas become masked. The temperature grid cells in this example had a resolution of 2.5 miles, which was suitable for creating county summaries. Summarizing data for census tracts at this resolution would be less ideal, as the tracts are much smaller than the cells, with one cell value characterizing a much larger area. This might be okay in the case of temperature, which tends not to vary considerably over a distance of a few miles. In contrast, averaging temperature data for states is not worthwhile, as states vary considerably in size and most are large enough that they contain multiple ecosystems and elevation levels.

The solutions I’ve described here are the desktop GIS solutions. You could also use either spatial SQL in a geodatabase or a spatial extension in a scripting language like Python or R to perform similar operations. In both cases a basic overlay and intersection statement is used, in conjunction with some grouping function for calculating summaries. I’ve been doing a lot more spatial Python work with geopandas these past few months – perhaps a topic for a subsequent post…

Stata splash screen

Introduction to Stata Tutorial

This month’s post will be brief, but helpful for anyone who wants to learn Stata. I wrote a short tutorial called First Steps with Stata for an introductory social science data course I visited earlier this month. It’s intended for folks who have never used a statistical package or a command-driven interface, and represents initial steps prior to doing any statistical analysis:

  1. Loading and viewing Stata dta files
  2. Describing and summarizing data
  3. Modifying and recoding data
  4. Batch processing with Do files / scripts
  5. Importing data from other formats

I chose the data for my examples to illustrate the difference between census microdata and summary data, using sample records from the Current Population Survey for the former and a table from the American Community Survey for the latter.

I’m not a statistician by training; I know the basics but rely on Python, databases, Excel, or GIS packages instead of stats packages. I learned a bit of Stata on my own in order to maintain some datasets I’m responsible for hosting, but to prepare more comprehensively to write this tutorial I relied on Using Stata for Quantitative Analysis, which I highly recommend. There’s also an excellent collection of Stata learning modules created by UCLA’s Advance Research Computing Center. Stata’s official user documentation is second to none for clearly introducing and explaining the individual commands and syntax.

In my years working in higher ed, the social science and public policy faculty I’ve met have all sworn by Stata over the alternatives. A study of citations in the health sciences, where stats packages used for the research were referenced in the texts, illustrates that SPSS is employed most often, but that Stata and R have increased in importance / usage over the last twenty years, while SAS has declined. I’ve had some students tell me that Stata commands remind them of R. In searching through the numerous shallow reviews and comparisons on the web, I found this post from a data science company that compares R, Python, SPSS, SAS, and Stata to be comprehensive and even-handed in summarizing the strengths and weaknesses of each. In short, I thought Stata was fairly intuitive, and the ability to batch script commands in Do files and to capture all input / output in logs makes it particularly appealing for creating reproducible research. It’s also more affordable than the other proprietary stats packages and runs on all operating systems.

Example – print the first five records of the active dataset to the screen for specific variables:

list statefip age sex race cinethp in f/5
     +-------------------------------------------+
     | statefip   age      sex    race   cinethp |
     |-------------------------------------------|
  1. |  alabama    76   female   white       yes |
  2. |  alabama    68   female   black       yes |
  3. |  alabama    24     male   black        no |
  4. |  alabama    56   female   black        no |
  5. |  alabama    80   female   white       yes |
     +-------------------------------------------+
Dingo Paths from ZoaTrack

Wildlife Tracking GIS Data Sources

I’ve also received a number of questions this semester about animal observation and tracking data. Since I usually study people and not animals, I was a bit out of my element and had some homework to do. If you’ve ever watched nature shows, you’ve seen scientists tagging animals with collars or bands to track them by radio or satellite, or setting up cameras to record them. Many scientists upload their GPS coordinate data into publicly accessible repositories for others to download and use.

I’ve written a short, three-part document that I’ve posted on our tutorials page: GIS Data Sources for Wildlife Tutorial. In the first part, I provide summaries, links, and guidance on using large portals like Movebank and Zoatrack that include many species from all over the world (wild and domestic), as well a government repositories including NOAA’s National Center for Environment Information Geoportal and the National Park Service’s Data Store. The second part focuses on search strategies, crawling the web and combing through academic literature in library databases to find additional data. Since these datasets are highly diffuse, it’s worth going beyond the portals to see what else you can discover.

I describe how you can add and visualize this data in QGIS and ArcGIS Pro in the third and final part. Wildlife data comes packaged in a number of formats; in some cases you’ll find shapefiles or geodatabases that you can readily add and visualize, but more often than not the data is packaged in a plain CSV / TXT format. This requires you to plot the coordinates (X for longitude, Y for latitude) to create a dot map of the observations. Data files will often contain a number of individual animals, which can be uniquely identified with a tag ID, allowing you to symbolize the points by category so you have a different color or symbol for each individual. Alternatively, there might be separate data files for each individual, which you could add and symbolize differently. The files will contain either a sequential integer or a timestamp that indicates the order of the observations. With one field that indicates the order and another that identifies each individual, you can use a Points to Line or Points to Path tool to generate lines (tracks or trajectories) from the points (observations or detections).
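If you prefer scripting to the GUI tools, the same points-to-paths workflow can be done with geopandas and shapely. A sketch with hypothetical file and column names (longitude, latitude, timestamp, tag_id) that you’d match to your actual download:

import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString

obs = pd.read_csv('animal_observations.csv')
points = gpd.GeoDataFrame(
    obs,
    geometry=gpd.points_from_xy(obs['longitude'], obs['latitude']),
    crs='EPSG:4326') # assumes the coordinates are WGS 84 longitude / latitude

# Sort by time, then build one line per animal (each animal needs at least two points)
lines = (points.sort_values('timestamp')
               .groupby('tag_id')['geometry']
               .apply(lambda pts: LineString(pts.tolist()))
               .reset_index())
tracks = gpd.GeoDataFrame(lines, geometry='geometry', crs='EPSG:4326')

points.to_file('observations.gpkg', driver='GPKG')
tracks.to_file('tracks.gpkg', driver='GPKG')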

You can see where dingos in Queensland, Australia are going in the screenshot below, which displays individual observation points, and the screenshot in the header of this post where the points were connected to form paths. I obtained the data from ZoaTrack and used QGIS for mapping. Check out the tutorial for details on how to find and map your favorite animals.

Dingo observations from ZoaTrack plotted in QGIS
IPUMS CPS Table

Creating Geographic Estimates with the IPUMS CPS Online Data Analysis System

Introduction

In this post I’ll demonstrate how to use the IPUMS CPS Online Data Analysis System to generate summary data from the US Census Bureau’s Current Population Survey (CPS). The tool employs the Survey Documentation and Analysis system (SDA) created at UC Berkeley.

The CPS is a monthly stratified sample survey of 60k households. It includes a wide array of statistics, some captured routinely each month, others at various intervals (such as voter registration and participation, captured every November in even-numbered years). The same households are interviewed over a four-month period, rotated out of the sample for eight months, then rotated back in for a final four months. Given its consistency, breadth, high response rate, and accuracy (interviews are conducted in person and over the phone), researchers use the CPS microdata (individual responses to surveys that have been de-identified) to study demographic and socio-economic trends among and between different population groups. It captures many of the same variables as the American Community Survey, but also includes a fair number that the ACS does not.

I think the CPS is used less often by geographers, as the sample size is too small to produce reliable estimates below the state or metropolitan area levels. I find that students and researchers who are only familiar with working with summary data often don’t use it, as generating your own estimates from microdata records can be time-consuming.

The IPUMS project provides an online analyzer that lets you generate summary estimates from the CPS without having to handle the individual sample records. I’ve used it in undergraduate courses where students want to generate extracts of data at the regional or state level, and who are interested in variables not collected in the ACS, such as generational households for immigrants. The online analyzer doesn’t include the full CPS, but only the data that’s collected in March as part of the core CPS series, and the Annual Social & Economic Supplement (ASEC). It includes data from 1962 to the present.

To access any of the IPUMS tools, you must register and create an account, but it’s free for non-commercial use. They provide an ample amount of documentation; I’ll give you the highlights for generating a basic geographic extract in this post.

Creating a Basic Geographic Summary Table

Once you launch the tool, the first thing you need to do is select some variables. You can use the drill-down folder menus at the bottom left, but I find it’s easier to hit the Codebook button and peruse the alphabetical list. Let’s say we want to generate state-level estimates for nativity for a recent year. If we go into the codebook and look up nativity, we see it captures foreign birthplace or parentage. Also in the list is a variable called statefip, which holds the two-digit FIPS codes that uniquely identify every state.

Codebook for Nativity – Foreign Birthplace or Parentage

Back on the main page for the Analyzer in the tables tab, we provide the following inputs:

  1. Row represents our records or observations. We want a record for every state, so we enter the variable: statefip.
  2. Column represents our attributes or variables. In this example, it’s: nativity.
  3. Selection filter is used to specify that we want to generate estimates from a subset of all the responses. To generate estimates for the most recent year, we enter year as the variable and specify the filter value in parentheses: year(2020). If we didn’t specify a year, the program would use all the responses back to 1962 to generate the estimates.
  4. Weight is the value that’s used to weight the samples to generate the estimates. The supplemental person weight sdawt is what we’ll use, as nativity is measured for individual persons. There is a separate weight for household-level variables.
  5. Under the output option dropdown, we change the Percentaging option from column to row. Row calculates the percentage of the population in each nativity category within the state (row). The column option would provide the percentage of the population in each nativity category between the states. In this example, the row option makes more sense.
  6. For the confidence interval, check the box to display it, as this will help us gauge the precision of the estimate. The default level is 95%; I often change this to 90% as that’s what the American Community Survey uses.
  7. At the bottom of the screen, run the table to see the result.
CPS Online Analyzer – Generate Basic Extract for Nativity by State for a Single Year

In the result, the summary of your parameters appears at the top, with the table underneath. At the top of the table, the “Cells contain” legend lists what appears in each of the cells, in order. In this example, the row percent is first, in bold. For the first cell in Alabama: 91.0% of the total population have parents who were both born in the US; the confidence interval is 90.2% to 91.8% (and we’re 90% confident that the true value falls within this range); and the large number at the bottom is the estimated number of cases that fall in this category. Since we filtered our observations for one year, this represents the total population in that category for that year (if we check the totals in the last column against published census data, they are roughly equivalent to the total population of each state).

Output Table – Nativity by State in 2020

Glancing through the table, we see that Alabama and Alaska have more cases where both parents are born in the US (91.0% and 85.1%) relative to Arizona (68.7%). Arizona has a higher percentage of cases where both parents are foreign born, or the persons themselves are foreign-born. The color coding represents the Z value (see the bottom of the table for the legend), which indicates how far a value deviates from the mean, with dark red being higher than the expected mean and dark blue being lower than expected. Not surprisingly, states with fewer immigrants have higher than average values for both parents native born (Alabama, Alaska, Arkansas), while this value is lower than average for more diverse states (Arizona, California).

To capture the table, you could highlight / copy and paste the screen from the website into a spreadsheet. Or, if you return to the previous tab, at the bottom of the screen before running the table, you can choose the option to export to CSV.

Variations for Creating Detailed Crosstabs

To generate a table to show nativity for all races:

Input Parameters – Generate Tables for Nativity by State for each Race

In the control box, type the variable race. The control box will generate separate tables in the results for each category in the control variable. In this case, one nativity table per racial group.

To generate a table for nativity specifically for Asians:

Input Parameters – Generate Table for Nativity by State for Asians

Remove race from the control box, and add it in the filter box after the year. In parentheses, enter the race code for Asians; you can find this in the codebook for race: year(2020), race(651).

Now that we’re drilling down to smaller populations, the reliability of the estimates is declining. For example, in Arkansas the estimate for Asians for both parents foreign born is 32.4%, but the value could be as low as 22.2% or as high as 44.5%. The confidence interval for California is much narrower, as the Asian population is much larger there. How can we get a better estimate?

Output Table – Nativity by State for Asians in 2020

Generate a table for nativity for Asians over a five-year period:

Input Parameters – Generate Table for Nativity by State for Asians for 5-year Period

Add more years in the year filter, separated by commas. In this version, our confidence intervals are much narrower; notice that the estimate for Asians with both parents foreign born in Arkansas is now 19.2%, with a range of 14.1% to 25.6%. By increasing the sample pool with more years of data, we’ve increased the precision of the estimate at the cost of accepting a broader time period. One big caveat here: the Weighted N represents the total number of estimated cases, but since we are looking at five years of data instead of one, it no longer represents a total population value for a single year. If we wanted to get an average annual estimate for this 5-year time period (similar to what the ACS produces), we’d have to divide each of the estimates by five for a rough approximation. You can download the table and do these calculations in a spreadsheet or stats program.

Output Table – Nativity by State for Asians between 2016-2020 (weighted N = estimate of total cases over 5 years, not an average annual value)

You can also add control variables to a crosstab. For example, if you added sex as a control variable, you would generate separate female and male tables of nativity by state for the Asian population over a given time period.

Example of a Profile Table

If we want to generate a profile for a given place as opposed to a comparison table, we can swap the variables we have in the rows and columns. For example, to see nativity for all Hispanic subgroups within California for a single year:

Input Parameters – Generate a Profile Table for California of Nativity by Hispanic Groups in 2020

In this case, you could opt to calculate percentages by column instead of row, if you wanted to see percent totals across groups for the categories. You could show both in the same chart, but I find it’s difficult to read. In this last example, note the large confidence intervals and thus lower precision for smaller Hispanic groups in California (Cuban, Dominican) versus larger groups (Mexican, Salvadoran).

Output Table – Nativity by Hispanic Groups in California 2020 (confidence interval is much larger for smaller groups)

In short – this is a handy tool if you want to generate some quick estimates and crosstabs from the CPS without having to download and weight microdata records yourself. You can create geographic data for regions, divisions, states, and metro areas. Just be mindful of confidence intervals as you drill down to smaller subgroups; you can aggregate by year, geography, or category / group to get better precision. What I’ve demonstrated is the tip of the iceberg; read the documentation and experiment with creating charts, statistical summaries, and more.

BEA Population Change Map

Population and Economic Time Series Data from the BEA

In this post, I’ll demonstrate how to access and download multiple decades of annual population data for US states, counties, and metropolitan areas in a single table. Last semester, I was helping a student in a GIS course find data on tuberculosis cases by state and metro area that stretched back several decades. In order to make meaningful comparisons, we needed to calculate rates, which meant finding an annual time series for total population. Using data directly from the Census Bureau is tough going, as they don’t focus on time series and you’d have to stitch together several decades of population estimates. Metropolitan areas add more complexity, as they are modified at least a few times each decade.

In searching for alternatives, I landed at the Bureau of Economic Analysis (BEA). Since part of their charge is studying the economy over time, they gather data from the Census Bureau, Bureau of Labor Statistics, and others to build time series, and they update them as geography and industrial classification schemes change. They publish national, state, metropolitan area, and county-level GDP, employment, income, wage, and population tables that span multiple decades. Their economic profile table for metros and counties covers 1969 to the present, while the state profile table goes back to 1958. Metropolitan areas are aggregates of counties. As metro boundaries change, the BEA normalizes the data, adjusting the series by taking older county-level data and molding it to fit the most recent metro definitions.

Finding the population data was a bit tricky, as it is embedded as one variable in the Economic Profile table (identified as CAINC30) that includes multiple indicators. Here’s the path to get it:

  • From the BEA website, choose Tools – Interactive Data.
  • From the options on the next page, choose Regional from the National, Industry, International or Regional data options. There’s also a link to a video that illustrates how to use the BEA interactive data tool.
  • From the Regional Data page, click “Begin using the data” (but note you can alternatively “Begin mapping the data” to make some basic web maps too, like the one in the header of this post).
  • On the next page are categories, and under each category are data tables for specific series. In this case, Personal Income and Employment by County and Metropolitan Area was what I wanted, and under that the Economic Profile CAINC30 table (states appear under a different heading, but there’s a comparable Economic Profile table for them, SAINC30).
  • On the multi-screen table builder, you choose a type of geography (county or different metro area options), and on the next tab you can choose individual places, hold down CTRL and select several, or grab them all with the option at the top of the dropdown. Likewise for the Statistic, choose Population (persons), or grab a selection, or take all the stats. Under the Unit of Measure dropdown, Levels gives you the actual statistic, but you can also choose percent change, index values, and more. On the next tab, choose your years.
  • On the final page, if your selection is small enough you can preview the result and then download. In this case it’s too large, so you’re prompted to grab an Excel or CSV file to download.

And there you have it! One table with 50+ years of annual population data, using current metro boundaries. The footnotes at the bottom of the file indicate that the latest years of population data are based on the most recent vintage estimates from the Census Bureau’s population estimates. Once the final intercensal estimates for the 2010s are released, the data for that decade will be replaced a final time, and the estimates from the 2020s will be updated annually as each new vintage is released until we pass the 2030 census. Their documentation is pretty thorough.

BEA 5-decade Time Series of Population Data by Metro Area

The Interactive Data table approach allows you to assemble your series step by step. If you want to skip all the clicking, you can grab everything in one download (all years for all places for all stats in a given table). Of course, that means some filtering and cleaning post-download to isolate what you need. There’s also an API, and several other data series and access options to explore. The steps for creating the map that appears at the top of this post were similar to the steps for building the table.
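To give a flavor of the API route: the Regional dataset can be queried with a free API key and a handful of parameters. The sketch below reflects my reading of the BEA API documentation rather than a tested recipe; in particular, verify the correct LineCode for population in CAINC30 (and the exact response structure) with the API’s GetParameterValues method before relying on it:

import requests
import pandas as pd

params = {
    'UserID': 'YOUR_BEA_API_KEY', # free key from the BEA website
    'method': 'GetData',
    'datasetname': 'Regional',
    'TableName': 'CAINC30', # economic profile table for counties / metros
    'LineCode': 100, # assumed to be the population line - verify with GetParameterValues
    'GeoFips': 'MSA', # all metropolitan areas
    'Year': 'ALL',
    'ResultFormat': 'JSON',
}
response = requests.get('https://apps.bea.gov/api/data/', params=params)
records = response.json()['BEAAPI']['Results']['Data'] # adjust if the response structure differs
df = pd.DataFrame(records)
print(df.head())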