open data

Cost Surfaces and Least Cost Paths in QGIS and GRASS

In this post I’ll demonstrate how to create least cost paths using QGIS and GRASS GIS, and in doing so will describe how a cost surface is constructed. In a surface analysis, you model movement across a grid whose values represent friction encountered in moving across it. In computing a least cost path, you’re seeking an optimal route from an origin to the closest destination, where ‘close’ incorporates distance and ease of movement across that surface. These kinds of analyses are often conducted in the environmental sciences, in modeling the movement of water across terrain, and in zoology in predicting migration paths for land-based animals.

In this example the idea was to chart the origin of settlements and possible trade routes in ancient history. In applications where we’re studying human activity, network analysis is typically used instead. Networks use geometry, where a node is a place or person, and connections between nodes are indicated with lines. Lines typically have a value associated with them that identify either the strength of a connection, or conversely friction associated with moving between nodes. The idea for this project was to identify how networks formed, so the surface analysis served as a proto-network analysis. While there were roads and maritime routes in pre-modern times, these networks were weaker and less dense. Charting movement over a surface representing terrain could provide a decent approximation of routes (but if you’re interested in ancient Roman network routing, check out the ORBIS project at Stanford).

This example stems from a project I was helping a PhD student with; I don’t want to replicate his specific study, so I’ve modified the data sources and area of focus to model movement between large settlements and stone quarries in the ancient Roman world. My goal is to demonstrate the methods with a plausible example; if we were doing this as part of an actual study, we would need to be more discriminating in selecting and processing our data.

Preliminary Work

The Pleiades project will serve as our source for destinations; it’s an academic gazetteer that includes locations and place names for the ancient and early medieval world, stretching from Europe and North Africa through the Middle East to India. It’s published in many forms, and I’ve downloaded the Pleiades Data for GIS in a CSV format. Using QGIS, I used the Add Delimited Text tool to plot the places.csv to get all of the locations, and joined that file to the places_place_type.csv file which contains different categories of places. I used Select by Attributes to get locations classified as quarries, and exported the selection out to a geopackage.

The Pleiades data includes a category for settlements, but there are about ten thousand of these and there isn’t an easy way to create a subset of the largest places. So I opted to use Hanson’s dataset of the largest settlements in the ancient Roman world to serve as our source for origins (about 1,400 places). This data was packaged in an Excel file; I plotted the coordinates using the Create Points Layer from Table tool in QGIS and converted the result to a geopackage. For testing purposes, I selected a subset of ten major cities and saved them in a separate layer: Athenae, Alexandria (Aegyptus), Antiochia (Syria), Byzantium, Carthago, Ephesus, Lugdunum, Ostia, Pergamum, Roma.

For the friction grid, I downloaded a geoTIFF of the Human Mobility Index by Ozak. The description from the project:

“The Human Mobility Index (HMI) estimates the potential minimum travel time across the globe (measured in hours) accounting for human biological constraints, as well as geographical and technological factors that determined travel time before the widespread use of steam power.”

There are three separate grids that vary in extent based on the availability of seafaring technology. I chose the grid that incorporates seafaring prior to the advent of ocean-going ships, which is appropriate for the Mediterranean world during the classical era. The HMI is a global grid at 925 meter resolution. To minimize processing time, I clipped it to a bounding box that encompasses the area of study. The grid is in the World Cylindrical Equal Area system; I reprojected it to WGS 84 to match the rest of the layers. As long as we’re not measuring actual distances, we don’t need to worry about the system we’re using (but if we were, we’d use an equidistant system). Since the range of values is small and it’s hard to see differences in cell values, I symbolized the grid as single-band psuedo color and used a quantiles classification scheme with 12 categories.

Lastly, I grabbed some modern country boundaries from Natural Earth to serve as a general frame of reference. A screenshot of the workspace is below:

Least Cost Path in QGIS

QGIS has a third-party plug-in for doing a least cost path analysis, which works fine as long as you don’t have too many origin points. Go to Plugins > Manage and Install Plugins > Least Cost Path to turn it on. Then open the Processing toolbox and it will be listed at the bottom. See the screenshot below for the tool’s menu. The Cost raster layer is the friction surface, so the human mobility index in this example. The start points are the ten major cities and the end points are the quarries. The start-point layer dialog only accepts a single point; if you have multiple points, hit the green circular arrow button to iterate across all of them. There’s a checkbox for connecting the start point to just the nearest end point (as opposed to all of them). Save the output to a geopackage.

It took about five minutes to run the analysis and iterate across all ten points. Each path is saved in a separate file, but since they have an identical structure I subsequently used Vector > Data Management Tools > Merge Vector Layers to combine them into one file. The attribute table records the end point ID (for the quarry) and the accumulated cost, but does not include the origin ID; this ID is the number 1 repeated each time, as the tool was iterating over the origin points. We can see the result below; for Athens and Ephesus in the south, land routes were shortest, whereas for Pergamum and Byzantium in the north it was easier (distance and friction-wise) to move across the sea.

While this worked fine for ten cities, it would take a considerable amount of time to compute paths for all 1,400. The problem here is that the plugin was designed for one point at a time. Let’s outline the process so we can understand how alternatives would work.

Cost Surface Analysis

To calculate a least cost path, the first step is to create a cost surface, where we take our friction grid and the destinations and calculate the total cost of movement across all cells to the nearest destination. First, the destinations are placed on the grid, and they become the grid sources. Then, the accumulated cost of moving from each source to its adjacent cells is calculated. For horizontal and vertical movement, it’s the sum of the friction values divided by two, and for diagonal movement it’s the sum of the friction values divided by two then multiplied by 1.4142. Once those calculations are performed, those adjacent cells are assigned to each source. Next, the lowest accumulated cost cell in the grid is identified, the cost for moving to its unassigned neighbors is calculated, and these cells are assigned to the same source. This process is repeated by cycling through the lowest accumulated value until all calculations for the grid are finished. Illustrated in the example below, which I derived from Lloyd’s Spatial Data Analysis (2010) pp. 165-168.

For each cell, three items are recorded, and are saved either as separate bands in one raster, or in three separate raster files:

Accumulated cost of moving to that cell from the nearest source
Assignment or allocation of the cell to its source (the nearest one to which it “belongs”)
A vector that indicates direction from that source

With these cost surfaces, we can take the second step of calculating the least cost path. We place a number of starting points onto this surface, and each point is assigned to the closest destination based on where its grid cell was allocated. The direction to that destination is traced backward using the direction grid, and the total cost of movement is taken from the accumulated cost surface.

Knowing how this process works, there are two practical conclusions we can draw. First, when computing the cost surface, you use your destinations (not the origins) as the source for the cost surface. You use the origins as the start points for the least cost path. Second, there’s no need to recalculate the cost surface for every origin point; you only need to do this once. That’s why the QGIS plugin took so long; it was recomputing the cost surface each time. Knowing this, we can use GRASS GIS to compute the paths, as it’s designed to compute the surface just once (and it’s data structure will also boost performance a bit).

Cost Surface Analysis in GRASS

GRASS GIS comes bundled with QGIS. While it’s possible to run a number of GRASS tools directly within QGIS, it’s a bit undesirable as you’re not able to access the full range of parameters or options for each GRASS command. I opted to create the GRASS environment in QGIS, and loaded all the necessary data into the GRASS format. Then, I flipped over to the GRASS GUI to do the analysis.

GRASS uses a distinct database structure and file format, and we need to create a GRASS workspace and load our data into that database in order to use the cost surface tools. I followed the steps in the QGIS manual for creating a GRASS environment and loading data into a GRASS database. Once you create the database and mapset, you use the QGIS Browser to browse to the grassdata folder and designate your new mapset as your working mapset (mapsets have the little green grass icon beside them). With the GRASS tools open, I used v.in.ogr.qgis to load my my cities and quarries layers into this mapset, and r.in.gdal.qgis to load the mobility index (if these layers weren’t already in your QGIS project, you’d use the tools that don’t have the qgis suffix, i.e. v.in.ogr).

After exiting QGIS and launching GRASS, you select the mapset under the grassdata database at the top, right click and choose Switch mapset, and choose the mapset you want to work with (if you don’t see it, hit the database icon to browse and connect to the grassdata folder). You can display the layers in the GRASS window to visualize them, but it’s not necessary for running the tools. In the tool menu on the right, search for the Cost surface tool, r.cost, and choose the following options:

Required: input raster map with grid cell cost is the human mobility index, and output raster is cost_surface
Optional outputs: output raster with nearest start point as allocation_surface, output raster with movement as direction_surface
Start: vector starting points for the cost surface are the destinations, the quarries
Optional: check verbose output (to get more details on errors)

Running this operation on all 1,400 cities took a matter of seconds, and all three rasters described in the previous section were generated: cost, allocation, direction (shown below).

Using these outputs, we can run the Least cost route or flow tool, which is called r.drain (as it’s often used in earth sciences to chart the path that water will drain based on elevation).

Required: Name of input elevation or cost surface raster is cost_surface, Name of output raster path: is path_raster
Cost surface: check the Input raster map is a cost surface box, Name of input movement direction raster is direction_surface
Start: Name of starting points vector map: are the origins (cities)
Path settings: choose ONE option that you’d like to record (or none)
Optional: check Verbose mode, Name for output drain vector is path_vector

This also took mere seconds to complete (!) and generated the paths from each origin (city) to the closet destination (quarry) over the surface as both raster cells and vector lines. The output in GRASS is shown below.

At this stage, we can hop back into QGIS, and load these output paths into our original project to symbolize and study what’s going on. Notice the settlements in northeastern Italy and along the Dalmatian coast; for many of them the least cost path is to a quarry across the sea rather than through rugged mountainous terrain. Even though some quarries in the mountains may be closer in actual distance, it’s a tougher path to travel.

Conclusion

The benefit of using GRASS is that we can run these processes fairly quickly for large datasets. The GRASS commands can also be compiled into a batch script, so you can create a documented and automated process instead of having to drill through multiple menus.

A big downside of the GRASS tools for this analysis is that the resulting vector paths contain no information about the origin or destination points, and only the raster path output carries along values. You might be able to generate this information through some extra steps; using the QGIS field calculator, you can get the coordinates for the start point and end point of each path and add them explicitly to the attribute table. Then take those coordinates, and for the start point of the line select the closest city and get its attributes, and for the end point select the closet quarry and get its attributes. I say “closest” because the vector paths don’t snap perfectly to the start and end points. Modifying the resolution of the human mobility index to make it coarser (fewer cells) might help to resolve this, or converting the origin and destination points to a raster of the same resolution as the index. Alternatively, if you incorporate the GRASS commands into a Python script, you could iterate over the origins in the least cost path analysis and record the origin IDs as you step through.

I haven’t worked all the pieces, but hopefully this will be useful for those of you who are interested in conducting a basic cost surface analysis in open source. The student I was helping was interested in measuring the density of the paths across a grid, so this process worked for him as he didn’t need to associate the paths with origins and destinations. Beyond FOSS GIS, ArcGIS Pro has a full suite of tools for cost surface analysis, and the underlying methods and logic are the same.

Historic County Climate Data for the US

I recently had a question about finding historic climate data in the United States at the county-level. In this post I’ll show you how to access it, and how to parse fixed-width text files in Excel. Weather data is captured and reported by point-based weather stations, and then is often interpolated and modeled over gridded surfaces (rasters). The National Centers for Environmental Information at NOAA have used their models to create zonal statistics for counties, which they publish via the Climate at a Glance County Mapping program (I described what zonal statistics are in an earlier post).

The basic application lets you map the continental US or an individual state (includes AK but not HI). You choose a parameter (Avg / Min / Max temperature, precipitation, cooling / heating days, drought indexes), year (1895 to present), month, and time scale (1 month to 5 years). This creates a map that you can modify to depict that value, or to display ranks or anomalies. You can download the map as an image, or the underlying data as CSV or JSON.

A separate app allows you to create a time series profile for a particular county, with a table, chart, and data that you can download.

These apps are great for the basics, but bulk downloading the underlying data for all counties and years is a bit trickier. You crash land in a file directory and have to choose from an array of zipped files. Fortunately there is good documentation. In that folder, these are the county-level files for precipitation, min temp, max temp, and avg temp:

climdiv-pcpncy-vx.y.z-YYYYMMDD
climdiv-tmaxcy-vx.y.z-YYYYMMDD
climdiv-tmincy-vx.y.z-YYYYMMDD
climdiv-tmpccy-vx.y.z-YYYYMMDD

Where v is for “version”, the xyz is a version number, and the final portion is the date. The archive is updated monthly. The other files in the directory are for climate divisions, states and regions, and data that pertains to the drought indexes. There are also files that have climate normals for each of these areas. If you’re interested in these, you can go up to the parent-level directory and view the relevant documentation.

The county files are fixed-width text files, which means you have to parse them to separate the values. If you treat them as delimited files (using spaces), then all of the fields at the beginning of the file will be lumped together, which is not useful. Spreadsheets and stats packages have tools for importing delimited text, or you could script something in Python or R. Modern versions of Excel will allow you to parse fixed-width data by supplying a list of endpoints for each column; older versions of Excel and other spreadsheets have you “eyeball” the columns and manually insert breaks in an import screen.

If you’re using a modern version of Excel: open a blank workbook and on the Data ribbon click the From Text/CSV button. Browse and select the county text file you’ve downloaded. In the import screen change the Delimiter drop down to Fixed Width.

In the box underneath, begin with zero and type the end points for each position (with the exception of the final endpoint, 95) as a comma separated list. You’ll find these in the README file, but I’ve also tacked on the most salient bits to the end of this post. For your convenience:

0,5,7,11,18,25,32,39,46,53,60,67,74,81,88

If you click on the preview grid, it will parse the columns.

In this example, I’m not parsing the state and county code separately, but am keeping them together to create a single unique identifier. Once everything is parsed, hit the Transform Data button. For column 1, hit the small 123 button, and change the option to Text, and choose Replace data.

This will preserve the leading zero in the state/county code. It’s important to do this, so the codes in this table with match the codes in other county data table or spatial data files that you may wish to join this table to. Do the same for the element code in column 2. The remaining Year and Month columns can be left alone, as they’re already appropriately saved as integers and decimals respectively.

Hit the Close and Load button in the upper left hand corner, and Excel will parse and load the data. It formats the columns and applies a filter option. To get rid of the styling and filter dropdowns, I’d copy the entire table, and do a Paste-Special-Values in a new worksheet. Then replace the generic column labels with these:

CNTYCODE,ELEMENT,YEAR,JAN,FEB,MAR,APR,MAY,JUNE,JULY,AUG,SEPT,OCT,NOV,DEC

Save the file, and now you have something to work with. Each record represents the monthly temperature or precipitation for a particular county for a particular year. To create a unique record ID, you can concatenate the state/county code, element code, and year values. For GIS applications, you would need to pivot the data to a wide form, so that the year becomes a column to give you month-year as a column, and each row represents each county with no repeats. With over 120 years of monthly data, that would give you over 1500 columns – so filter out what you don’t need. The state / county code can be used to join the table to the Census Bureau’s Cartographic Boundary Files, using the CBF’s GEOID field.

When would you use this data? If you’re creating data profiles or are running a statistical analysis and are using counties as your geographic unit, and temperature or precipitation is one variable among many that you need. Or, you’re making a series of county-level maps, and this is one of your variables. This dataset is clearly pretty convenient for doing time series analyses, as compiling data for a times series is usually time consuming. The counties in this dataset represent present day boundaries, so normalizing geography over time isn’t necessary.

When not to use it? Counties vary in size and can encompass a great deal of internal variety in terms of elevation, land use and land cover, and proximity to / presence of water bodies, all of which impact the climate. So the weather in one part of a county could be quite different from another part. To capture these internal differences, it would be better to use gridded data, such as the 4×4 km rasters that PRISM produces for daily, monthly, annual, and normal summaries.

Gridded climate data and zonal stats derived from grids are estimates based on models; if you wanted or needed the actual measurements as they were recorded, you would need to go back and get point-based weather station data, from the Local Climatological Database for instance. There are a limited number of stations, and not one for every county. The closest station to a given place could be used to represent or approximate the weather for that place.

Codebook for county data files (extracted from README):

Element Record
Name Position Element Description

STATE-CODE 1-2 as indicated in State Code Table as described in FILE 1. Range of values is 01-48.

DIVISION-NUMBER 3-5 COUNTY FIPS - Range of values 001-999.

ELEMENT CODE 6-7

01 = Precipitation
02 = Average Temperature
25 = Heating Degree Days
26 = Cooling Degree Days
27 = Maximum Temperature
28 = Minimum Temperature

YEAR 8-11 This is the year of record. Range is 1895 to current year processed.

Monthly Divisional Temperature format (f7.2) Range of values -50.00 to 140.00 degrees Fahrenheit. Decimals retain a position in the 7-character field.  Missing values in the latest year are indicated by -99.99.

Monthly Divisional Precipitation format (f7.2) Range of values 00.00 to 99.99.  Decimal point retains a position in the 7-character field. Missing values in the latest year are indicated by -9.99.

JAN-VALUE 12-18

FEB-VALUE 19-25

MAR-VALUE 26-32

APR-VALUE 33-39

MAY-VALUE 40-46

JUNE-VALUE 47-53

JULY-VALUE 54-60

AUG-VALUE 61-67

SEPT-VALUE 68-74

OCT-VALUE 75-81

NOV-VALUE 82-88

DEC-VALUE 89-95

2020 Census Data Wrap-up

Right before the semester began, I updated the Rhode Island maps on my census research guide so that they link to the recently released Demographic Profile tables from the 2020 Census. I feel like the release of the 2020 census has flown lower on the radar compared to 2010 – it hasn’t made it into the news or social media feeds to the same degree. It has been released much later than usual for a variety of reasons, including the COVID pandemic and political upheaval and shenanigans. At this point in Sept 2023, most of what we can expect has been released, and is available via data.census.gov and the census APIs.

Here are the different series, and what they include.

Apportionment data. Released in Apr 2021. Just the total population counts for each state, used to reapportion seats in Congress.
Redistricting data. Released in Aug 2021. Also known as PL 91-171 (for the law that requires it), this data is intended for redrawing congressional and legislative districts. It includes just six tables, available for several geographies down to the block level. This was our first detailed glimpse of the count. The dataset contains population counts by race, Hispanic and Latino ethnicity, the 18 and over population, group quarters, and housing unit occupancy. Here are the six US-level tables.
Demographic and Housing Characteristics File. Released in May 2023. In the past, this series was called Summary File 1. It is the “primary” decennial census dataset that most people will use, and contains the full range of summary data tables for the 2020 census for practically all census geographies. There are fewer tables overall relative to the 2010 census, and fewer that provide a geographically granular level of detail (ostensibly due to privacy and cost concerns). The Data Table Guide is an Excel spreadsheet that lists every table and the variables they include.
Demographic Profile. Released in May 2023. This is a single table, DP1, that provides a broad cross-section of the variables included in the 2020 census. If you want a summary overview, this is the table you’ll consult. It’s an easily accessible option for folks who don’t want or need to compile data from several tables in the DHC. Here is the state-level table for all 50 states plus.
Detailed Demographic and Housing Characteristics File A. Released in Sept 2023. In the past, this series was called Summary File 2. It is a subset of the data collected in the DHC that includes more detailed cross-tabulations for race and ethnicity categories, down to the census tract level. It is primarily used by researchers who are specifically studying race, and the multiracial population.
Detailed Demographic and Housing Characteristics File B. Not released yet. This will be a subset of the data collected in the DHC that includes more detailed cross-tabulations on household relationships and tenure, down to the census tract level. Primarily of interest to researchers studying these characteristics.

There are a few aspects of the 2020 census data that vary from the past – I’ll link to some NPR stories that provide a good overview. Respondents were able to identify their race or ethnicity at a more granular level. In addition to checking the standard OMB race category boxes, respondents could write in additional details, which the Census Bureau standardized against a list of races, ethnicities, and national origins. This is particularly noteworthy for the Black and White populations, for whom this had not been an option in the recent past. It’s now easier to identify subgroups within these groups, such as Africans and Afro-Caribbeans within the Black population, and Middle Eastern and North Africans (MENA) within the White population. Another major change is that same-sex marriages and partnerships are now explicitly tabulated. In the past, same-sex marriages were all counted as unmarried partners, and instead of having clearly identifiable variables for same-sex partners, researchers had to impute this population from other variables.

Another major change was the implementation of the differential privacy mechanism, which is a complex statistical process to inject noise into the summary data to prevent someone from reverse engineering it to reveal information about individual people (in violation of laws to protect census respondent’s privacy). The social science community has been critical of the application of this procedure, and IPUMS has published research to study possible impacts. One big takeaway is that published block-level population data is less reliable than in the past (housing unit data on the other hand is not impacted, as it is not subjected to the mechanism).

When would you use decennial census data versus other census data? A few considerations – when you:

Want or need to work with actual counts rather than estimates
Only need basic demographic and housing characteristics
Need data that provides detailed cross-tabulations of race, which is not available elsewhere
Need a detailed breakdown of the group quarters population, which is not available elsewhere
Are explicitly working with voting and redistricting
Are making historical comparisons relative to previous 10-year censuses

In contrast, if you’re looking for detailed socio-economic characteristics of the population, you would need to look elsewhere as the decennial census does not collect this information. The annual American Community Survey or monthly Current Population Survey would be likely alternatives. If you need basic, annual population estimates or are studying the components of population change, the Population and Housing Unit Estimates Program is your best bet.

USGS Topo Map Vector Layers for GIS

I was working with a graduate student last month who was looking for contour lines for specific towns within the US, for large-scale (small area) mapping and analysis. They were specifically interested in elevation for landfills, and some of the contour data they found didn’t map these as they aren’t natural features. We looked at current USGS topographic maps, and they do indeed map contours for landfills. But the topo maps are raster images, and they wanted vectors. Is it possible to access the underlying GIS data that was used to create the topo maps?

Indeed, it is! Option 1 is to use the National Map Download app. Search for a place name to zoom into your area of interest. Use the Show Map Index dropdown menu to draw the quad boundaries for the topo scale you’re interested in on the map; the 7.5 minute / 1:24,000 series is the USGS topo scale that most people are familiar with. Adjust the zoom so your area of interest fits within the map window; that way when you search in the Datasets tab on the left, the default search looks within this map extent.

Next, choose the specific data product you’re interested in. Here’s a list and description of all the National Map Datasets. For example, if you just wanted contour lines, you can select that under Small-scale Datasets. Note that raster imagery and data that’s used to derive the vectors is also available for download. If you want all the vector features that appear on a particular topo map, check the Topo Map Data and Topo Stylesheet option. Once you check a product, you can choose a file format for the data. Given the size of these datasets, the FileGDB option is probably best.

USGS TNM Download — The National Map Download Interface, Showing the Datasets Tab for Selecting and Searching

Then, click the blue Search Products button. That flips you to the Products tab, and displays data available within the extent of the map view. If you chose Topo Map Data and Topo Stylesheet, the results will be maps of individual quads. You can add a bunch of maps to your shopping cart by clicking on the little cart icon, or download one immediately by clicking the Download Link (ZIP).

USGS Download Topo Map Vector Data — On the Product Tab, click Download Link (ZIP) to get data for a specific map

Option 2 for downloading data: skip the map interface and use the Stage Products Directory. This no frills option is good if you know exactly which products you’re looking for. For example, you can drill down through TopoMapVector, then by state, and then data format to get to the same files you would have downloaded via option 1. You would need to know the name of the quad that encompasses the area you want; consult an index to figure it out.

Once you download and unzip the file, you can launch your desktop GIS package to connect to the database and view the contents. In ArcGIS Pro, use the Catalog Pane, select the Databases option, right click, and Add Database. Browse to the location where you unzipped it, and select it. Then hit the dropdown for the newly added database and browse the contents, which are divided into schemas or groups. Foundation and Hydrography contain most of the features. GazVector has place name labels not captured in other features, and Cells contains outlines of the quad grid cells. Drag them into the Map Pane to view them.

USGS Topo Vector Data in ArcGIS Pro — USGS Topo Map Vector Data in ArcGIS Pro

QGIS users can use the Data Source Manager. With the Vector option selected, change the Source Type from File to Directory, and in the Type dropdown choose OpenFileGDB. Then hit the dots button to browse your file system and select the database folder. Click Add, and you’ll be prompted to choose layers and tables to add to a project. You’ll see the same schema organization described previously, and you can use the CTRL and / or Shift keys to select what you want. Add the Layers, hit OK, and close the Manager.

Adding File Geodatabase Features to QGIS — Adding File Geodatabase Features in the QGIS Data Source Manager

From there, it takes some artful manipulation of the overlays, color schemes, and labels to clearly symbolize the features. Both ArcGIS and QGIS have default symbol styles for topographic features that you can choose from. Apparently there’s a stylesheet packaged with the data, but I haven’t dug in enough yet to find and apply it. The attributes for the features seem fairly rich; the table includes columns that indicate the original data source for each feature, dates when records were added or updated, and a number of identifiers, labels, and categories. Some of the features, like bodies of water and county boundaries, extend beyond the quad cell for the map, as the USGS opted to keep whole features rather than clipping them. If the area you’re interested in happens to fall across two maps, you can download the topo map vector data for both quads, and use the Merge tool to combine them. The default CRS is un-projected NAD83 (EPSG 4269). You’ll probably want to reproject to a state plane or UTM zone that’s appropriate for your area. These post that describe styling and labeling contour lines in QGIS and ArcGIS Pro are helpful. Happy mapping!

USGS Topo Vector Data in QGIS — USGS Topo Map Vector Data in QGIS

Parsing the Internet Archive’s Twitter Stream Grab with Python

In this post I’ll share a process for getting geo-located tweets from Twitter, using large files of tweets archived by the Internet Archive. These are tweets where the user opted to have their phone or device record the longitude and latitude coordinates for their location, at the time of the tweet. I’ve created some straightforward scripts in Python without any 3rd party modules for processing a daily file of tweets. Given all the turmoil at Twitter in early 2023, most of the tried and true solutions for scraping tweets or using their APIs no longer function. What I’m presenting here is one, simple solution.

Social media data is not my forte, as I specialize in working with official government datasets. When such questions turn up from students, I’ve always turned to the great Web Scraping Toolkit developed by our library’s Center for Digital Scholarship. But the graduate student I was helping last week and I discovered that both the Twint and TAGS tools no longer function due to changes in Twitter’s developer policies. Surely there must be another solution – there are millions of posts on the internet that show how easy it is to grab tweets via R or Python! Alas, we tried several alternatives to no avail. Many of these projects rely on third party modules that are deprecated or dodgy (or both), and even if you can escape from dependency hell and get everything working, the changed policies rendered them moot.

You can register under Twitter’s new API policy and get access to a paltry number of records. But I thought – surely, someone else has scraped tons of tweets for academic research purposes and has archived them somewhere – could we just access those? Indeed, the folks at Harvard have. They have an archive of geolocated tweets in their dataverse repository, and another one for political tweets. They are also affiliated with a much larger project called DocNow with other schools that have different tweet archives. But alas, there are rules to follow, and to comply with Twitter’s license agreement Harvard and these institutions can’t directly share the raw tweets with anyone outside their institutions. You can search and get IDs to the tweets, using their Hydrator application, which you can use in turn to get the actual tweets. But then in small print:

“Twitter’s changes to their API which greatly reduce the amount of read-only access means that the Hydrator is no longer a useful application. The application keys, which functioned for the last 7 years, have been rescinded by Twitter.”

Fortunately, there is the Internet Archive, which has been working to preserve pieces of the internet for posterity for several decades. Their Twitter Stream Grab consists of monthly collections of daily files for the past few years, from 2016 to 2022. This project is no longer active, but there’s a newer one called the Twitter Archiving Project which has data from 2017 to now. I didn’t investigate this latter one, because I wasn’t sure if it provided the actual tweets or just metadata about them, while the older project definitely did. The IA describes the Stream Grab as the “spritzer” version of Twitter grabs (as opposed to a sprinkler or garden hose). Thanks to the internet, it’s easy to find statistics but hard to find reliable ones – this one, credible looking source (the GDELT Project) suggests that there are between 400 and 500 million tweets a day in recent years. The file I downloaded from IA for one day had over 4 million tweets, so that’s about 1% of all tweets.

I went into the November 2022 collection and downloaded the file for Nov 1st. It’s a TAR file that’s about 3 GB. Unzipping it gives you a folder for that data named for the date, with hundreds of gz ZIP files. Unzip those, and you have tons of JSON Line files. These are JSON files where each JSON record has been collapsed into one line.

Python to the rescue. See GitHub for the full scripts – I’ll just add some snippets here for illustration. I wrote two scripts: the first reads in and aggregates all the tweets from the JSONL files, parses them into a Python dictionary, and writes out the geo-located records into regular JSON. The second reads in that file, selects the elements and values that we want into a list format, and writes those out to a CSV. The rationale was to separate importing and parsing from making these selections, as we’re not going to want to repeat the time-consuming first part while we’re tweaking and modifying the second part.

In the sample data I used for 11/01/2022, unzipping the downloaded TAR file gave me a date folder, and in that date folder were hundreds of gz ZIP files. Unzipping those revealed the JSONL files. I wrote the script to look in that date folder, one level below the folder that holds the scripts, and read in anything that ended with .json. Not all of the Internet Archive’s stream’s are structured this way; if your downloads are structured differently, you can simply move all the unzipped json files to one directory below the script to read them. Or, you can modify the script to iterate through sub-directories.

Because the data was stored as JSONL, I wasn’t able to read it in as regular JSON. I read each line as a string that I appended to a list, iterated through that list to convert it into a dictionary, pulled out the records that had geo-located elements, and added those records to a larger dictionary where I used an identifier in the record as a key and the value as a dictionary with all elements and values for a tweet. This gets written out as regular JSON at the end. Reading the data in didn’t take long; parsing the strings into dictionaries was the time consuming part. Originally, I wanted to parse and save all 4 million records, but the process stalled around 750k as I ran out of memory. Since so few records are geo-located, just selecting these circumvented this problem. If you wanted to modify this part to get other kinds of records, you would need to apply some filter, or implement a more efficient process than what I’m using.

json_list=[] # list of lists, each sublist has 1 string element = 1 line

for f in os.listdir(json_dir):
    if f.endswith('.json'):
        json_file=os.path.join(json_dir,f)
        with open(json_file,'r',encoding='utf-8') as jf:
            jfile_list = list(jf) # create list with one element, a line saved as a string 
            json_list.extend(jfile_list)
            print('Processed file',f,'...')

geo_dict={} # dictionary of dicts, each dict has line parsed into keys / values
i=0   
for json_str in json_list:
    result = json.loads(json_str) # convert line / string to dict
    if result.get('geo')!=None: # only take records that were geocoded
        geo_dict[result['id']]=result 
    i=i+1
    if i%100000==0:
        print('Processed',i,'records...')

The second script reads the JSON output from the first, and simply iterates through the dictionary and chooses the elements and values I want and assigns them to variables. Some of these are straightforward, such as grabbing the timestamp and tweet. Others required additional work. The source element provides HTML code with a source link and name, so I split and strip this value to get them separately. The coordinates are stored as a list, so to get longitude and latitude as separate values I indicate the list position. In cases where I’m delving into a sub-dictionary to get a value (like the coordinates), I added if statements to set values to None if they don’t exist in the JSON, otherwise you get an error. Once I finish iterating, I append all these variables to a list, and add this list to the main one that captures every record. I create a matching header row list, and both are written out as a CSV.

with open(input_json) as json_file:
    twit_data = json.load(json_file)

twit_list=[]

# In this block, select just the keys / values to save
for k,v in twit_data.items():
    tweet_id=k
    timestamp=v.get('created_at')
    tweet=v.get('text')
    # Source is in HTML with anchors. Separate the link and source name
    source=v.get('source') # This is in HTML
    source_url=source.split('"')[1] # This gets the url
    source_name=source.strip('</a>').split('>')[-1] # This gets the name
    lang=v.get('lang')
    # Value for long / lat is stored in a list, must specify position
    if v['geo'] !=None:
        longitude=v.get('geo').get('coordinates')[1]
        latitude=v.get('geo').get('coordinates')[0]
    else:
        longitude=None
        latitude=None
...

My code could use improvement – much of this could be abstracted into a function to avoid repetition. We were in a hurry, and I’m also working with folks who need data but aren’t necessarily familiar with Python, so something that’s inefficient but understandable is okay (although I will polish this up in the future).

I provide the output in GitHub, examples of the final CSV appear below. Every language in the world is captured in these tweets, so Windows users need to import the CSV into Excel (Data – From Text/CSV) and choose UTF-8 encoding. Double-clicking the CSV to open it in Excel in Windows will render most of the text as junk, in the default Windows-1252 encoding.

Tweets extracted from Internet Archive with timestamp, tweets, and source information

Geolocated Twitter Data 1 — Tweets extracted from Internet Archive, showing geo-located information

So, is this data actually useful? That’s an open question. Of the 4 million tweets in this file, just over 1,158 were geo-located! I checked and this is not a mistake. The metadata record for the Harvard geolocated tweets mentions that only 1% to 2% of all tweets are geo-located. So of the 400 million daily tweets, only 4 million. And out of our daily 4 million sample from IA, just 1,158 (less than 1%). What we ended up with does give you a sense of variety and global coverage (see map at the top of the post, showing sample of tweets by language Nov 1, 2022). In this sample, the top five countries represented were: US (35%), Japan (17%), Brazil (4%), UK (4%), Mexico and Turkey (tied 3%). For languages, the top five: English (51%), Japanese (17%), Spanish (9%), Portuguese (5%), and Turkish (3%).

In many cases, I think you’d need a larger sample than a single day, assuming you’re interested in just geo-located records. Perhaps 4 million is large enough for certain non-spatial research? Again, not my area of expertise, but you would want to be aware of events that happened on a certain date that would influence what was tweeted. My graduate student wanted to see differences in certain kinds of tweets in the LA metro area versus the rest of the US, but this sample includes less than 20 tweets from LA. To do anything meaningful, she’d have to download and process a whole month of tweets (at least). Even then, there are certain tweeters that show up repeatedly in given areas. In NYC, most of the tweets on this date were from the 511 service, warning people where that day’s potholes were.

Beyond the location of the tweet, there is a lot of information about the user, including their self-reported location. This data is available in all tweets (not just the geo-located ones). But there are a lot problems with this attribute: the user isn’t necessarily tweeting from that location, as it represents their “static” home. This location is not geocoded, and it’s self reported and uncontrolled. In this example, some users dutifully reported their home as ‘Cleveland, OH’ or ‘New York City’. Other folks listed ‘NYC – LA – ATL – MIA’, ‘CIUDAD DE LAS BAJAS PASIONES’, ‘H E L L’, and ‘Earth. For now’.

Even for research that incorporated geo-located tweets from other, larger data sources that were previously accessible, how representative are all those studies when the data represents only 1% of the total tweet volume? I am skeptical. Also consider the information from the good folks at the Pew Research Center, that tells us that only one in five US adults use Twitter, and that the minority of Twitter users generate the vast majority of tweets: “The top 25% of US users by tweet volume produce 97% of all tweets, while the bottom 75% of users produce just 3%” (10 Facts About Americans and Twitter May 5, 2022).

For what’s it worth, if you need access to Twitter data for academic, non-commercial research purposes and the old methods aren’t working, perhaps the Internet Archive’s data and the solution posed here will fit the bill. You can see the geo-located output (JSON and CSV) from this example in the GitHub repo’s output folder. There is also a samples folder, which contains JSON and CSV for about 77k records that include both geo-located and non-geolocated examples. Looking at the examples can help you decide how to modify the scripts, to pull out different elements and values of interest.

Snazzy Thematic Maps with Project Linework

When I’m making global thematic maps, I usually turn to Natural Earth. They provide country polygons and boundary lines, as well as features like cities and rivers, at several different scales. I always reference it in workshops that I teach, including the 2-week GIS Institute that I participated in earlier this month. It’s a solid, free data source and a good example for illustrating how scale and generalization work in cartography. It’s a “natural choice”, as they provide boundaries that depict the way the world actually looks.

I also discussed aesthetics and map design during the Institute. What if you don’t necessarily care about representing the boundaries exactly the way they are? If you rely on the map reader’s knowledge of the relative shape of the countries and their position on the globe, and you employ good labeling, you can choose boundaries that are more artistic and fun (provided that your only goal is making a basic thematic map and it’s not being published in a stodgy journal).

Project Linework is part of Something About Maps, an excellent blog by Daniel Huffman. The project consists of different series of public domain boundary files that have been generalized to provide interesting and visually attractive alternatives to standard features. The gallery contains a sample image and brief description of each series, including details on geographic coverage. Most of the series cover just North America or select portions of the world.

The three I’ll mention below are global in coverage. They come in shapefile and geojson formats, are projected in World Gall Stereographic (ESRI 54016), and include line and polygon coverages. The attribute tables have fields for ISO country codes, which are standard unique identifiers that allow for table joins for thematic mapping. I took my map of Wheat and Meslin Exports from Ukraine from an earlier post to create the following examples.

With the Wargames series, the world has been rendered using the little hexagon grids familiar to many war board gamers, and plenty of non-war gamers for that matter (think Settlers of Catan). Hexes are a an alternative to grids for determining adjacency.

Project Lineworks Map - Wargames — Project Linework: Wargames

Moriarty Hand is a more whimsical interpretation. It was drawn by hand by tracing line work from Natural Earth. The end result is more organic compared to Wargames. It comes in two scales, small and large (with an example of the latter below):

Project Lineworks Map - Moriarty Hand — Project Linework: Moriarty Hand

My personal favorite is 1981. It’s inspired by the basic polygon shapes that you would have seen in early computer graphics. When I was little I remember loading a DOS-based atlas program from a floppy disk, and slowly panning across a CGA monochrome screen as the machine chunked away to render countries that looked like these. Good if you’re looking for a retro vibe.

Project Lineworks Map - 1981 — Project Linework: 1981

Happy mapping! Also from Something About Maps, check out this excellent poster and related post about families of map projections.

At These Coordinates

Dispatches from the Geospatial Data World