
USAID raster of vaccinated children in Senegal

GIS Data Sources Spring Roundup

It’s been a while since I’ve written a post that showcases different GIS datasets. So in this one, I’ll provide an overview of some free and open data sources that I’ve learned about and worked with this past spring semester. The topics in this roundup include: global land use and land cover, US heat and temperature, detailed population data for India, and public health in low and middle income countries.

GLAD Global Land Analysis and Discovery

The GLAD lab at the Department of Geographical Sciences at the University of Maryland produces over a dozen GIS datasets related to global land use, land cover, and change in land surface over time. Last semester I had folks who were interested in looking at recent global change in cropland and forest. GLAD publishes rasters that include point-in-time coverage, period averages, and net change and loss over the period 2000 to 2020. Much of the data is generated from Landsat, and resolution varies from 30m to 3km. Other series include tropical forest cover and change, tree canopies, forest loss due to fires, a few non-global datasets that focus on specific regions, and Landsat imagery that’s been processed so it’s ready for LULC analysis.

Most of the sets have been divided up into tiles and segmented based on what they’re depicting (change in crops, forest, etc). The download process is basic point and click, and for larger sets they provide a list of tifs in a text file so you can automate downloading by writing a basic script. Alternatively, they also publish datasets via Google Earth Engine.
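For instance, here’s a minimal sketch of such a script, assuming you’ve saved one of their tile lists locally (the file and folder names below are placeholders, not actual GLAD file names):

import requests, os

# Hypothetical text file containing one tif URL per line,
# saved from one of GLAD's download pages
with open('cropland_tiles.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

os.makedirs('glad_tiles', exist_ok=True)

for url in urls:
    fname = url.split('/')[-1]  # the last part of the URL is the file name
    response = requests.get(url)
    response.raise_for_status()
    with open(os.path.join('glad_tiles', fname), 'wb') as outfile:
        outfile.write(response.content)
    print('Downloaded', fname)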

GLAD cropland in Rhode Island
GLAD Cropland Extent in 2019 in QGIS, Zoomed in to Optimal Resolution in SE Rhode Island

US Heat Severity Index

For the past few years, the Trust for Public Land has published an annual heat severity index. This layer represents the relative heat severity for 30m pixels for every city in the United States, depicting where areas of cities are hotter than the average temperature for that same city as a whole (i.e. the surface temperature for each pixel relative to the general air temperature reading for the entire city). Severity is measured on a scale of 1 to 5, with 1 being relatively mild and 5 being severe heat. The index is generated from a Heat Anomalies raster which they also provide; it contains the relative degrees Fahrenheit difference between any given pixel and the mean heat value for the city in which the pixel is located. Both datasets are generated from 30-meter Landsat 8 imagery, band 10 (ground-level thermal sensor) from summertime images.

The dataset is published as an ArcGIS image service. The easiest way to access it is to add it from the Living Atlas to ArcGIS Pro (or Online), and then export the service from there as a raster dataset (while doing so, you can also clip the layer to a smaller area of interest). It’s possible that you can also connect to it as an ArcGIS REST Server in QGIS, but I haven’t tried. While there are files that go back to 2019, the methodology has changed over time, so studying this as a national, annual time series is not appropriate. The coverage area expanded from just large, incorporated cities in earlier years to the entire US in recent years.

US Heat Index in Rhode Island
US Heat Severity Index 2023 in ArcGIS Pro, Providence and Adjacent Areas with Census Blocks

SHRUG – India

Created and hosted by the Development Data Lab (a collaborative project created by academic researchers from several universities), the Socioeconomic High-resolution Rural-Urban Geographic Platform for India (SHRUG) is an open access repository consisting of datasets for India’s medium to small geographies (districts, subdistricts, constituencies, towns, and villages), linked together with a set of common geographic IDs. Getting geographically detailed census data for India is challenging, as you have to purchase it through third-party vendors, and comparing data across time is tough given the complex sets of administrative subdivisions and constant revisions to geographic identifiers. SHRUG makes this easy and open, providing boundaries from the 2011 census and a unique ID that links geographies together and across time, back to 1991. In addition to the census, there are also environmental and election datasets.

Polygon boundaries can be downloaded as shapefiles or geopackages, and tabular data is available in CSV and DTA (STATA) formats. Researchers can also contribute data created from their own research to the repository.
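As a rough sketch of how you might join a SHRUG table to its boundaries with geopandas (the file names and the ID column below are placeholders; check the SHRUG codebooks for the actual identifiers):

import pandas as pd, geopandas as gpd

# Hypothetical file and column names; consult the SHRUG documentation for the real ones
districts = gpd.read_file('shrug_districts.gpkg')
popdata = pd.read_csv('shrug_pop_2011.csv')

# Join the census table to the polygons on the shared geographic ID
joined = districts.merge(popdata, on='district_id', how='left')
joined.to_file('districts_pop2011.gpkg', driver='GPKG')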

SHRUG India District Data
SHRUG India Districts Total Population Data from 2011 Census in QGIS

USAID Spatial Data and Population Indicators

USAID published the detailed Demographic and Health Surveys (DHS), dating back to the mid-1980s, for many of the world’s low and middle income countries. The surveys captured information about fertility, family planning, maternal and child health, gender, HIV/AIDS, literacy, malaria, nutrition, and sanitation. A selection of different countries was surveyed each year, and for most countries data was captured at two or three different points in time over a 40-year period. While researchers had to submit proposals and request access to the microdata (individual person and household level responses), the agency generated population-level estimates for countries and country subdivisions that were readily downloadable. They also generated rasters that interpolated certain variables across the surface of a country (the header image for this post is a raster of Senegal in 2023, illustrating the percentage of children aged 12-36 months who are vaccinated for eight fundamental diseases, including measles and polio). The rasters, boundary files, and a selection of survey indicators pre-joined to country and subdivision boundaries were published in their Spatial Data Repository. You could access the full range of population indicators as tables from a point and click website, or alternatively via API.

I’m writing in the past tense, as USAID has been decimated and de-funded by DOGE. There is currently no way to request access to the microdata. The summary data is still available on the USAID website (via links in the previous paragraph), but who knows for how long. As part of the Data Rescue Project, I captured both the Spatial Data Repository and the Indicators data, and posted them on DataLumos, an archive for rescued copies of federal government datasets. You can download these datasets in bulk from DataLumos, from the links under the title for this section. Unfortunately this series is now an archive of data that will be frozen in time, with no updates expected. The loss of these surveys is not only detrimental to researchers and policymakers, but to millions of the world’s most vulnerable people, whose health and well-being were secured and improved thanks to the information this data provided.

USAID Spatial Indicators Data
USAID Country Subdivisions in QGIS where Recent Data is Available on % Children who are Vaccinated

Python Screen Scraping Code

Screen Scraping Government Data with Python

In my previous post, I summarized several efforts to rescue and preserve US federal government datasets that are being removed from the internet. In this post, I’ll provide a basic primer on screen scraping with Python, which is what I’ve used to capture datasets while participating in the Data Rescue Project. Screen scraping can be employed to many ends, such as capturing text on web pages so it can be analyzed, or taking statistics embedded in HTML tables and saving them in machine readable formats. In the context of this post, screen scraping is an approach for downloading data and documentation files stored on websites.

There are several benefits to using a scripting approach for this work. It saves you from the tedious task of clicking and downloading files one by one. The script serves as documentation for what you did, and allows you to easily repeat the process in the future, if the datasets continue to exist and are updated. A scripted, screen-scraping approach may not be best or necessary if the website and datasets are relatively small and simple, or conversely if the site is complicated and difficult to scrape given the technology it employs. In both cases, manual downloading may be quicker, especially with a team of volunteers. Furthermore, if it seems clear that the dataset or website are not going to be updated, or are going to vanish, then the benefit of repeating the process in the future is moot.

In this example, we’ll assume that screen scraping is the way to go, and we’ll use Python to do it. I’ll address a few alternatives to this approach at the end, the primary one being using an API if and when it’s available, and will share links to working code that colleagues and I have written to save datasets.

You should only apply these approaches to public, open data. Capturing restricted or proprietary information violates licenses, terms of service, and in some cases privacy constraints, and is not condoned by any of the rescue projects. Even if the data is public, bear in mind that scraping can put undue pressure on web servers. For large websites, plan accordingly by building pauses into the process, breaking up the work into segments, or running programs at non-peak times (overnight). When writing and testing scripts, don’t repeat the process over and over again on the entire website; run your tests on samples until you get everything working.

Screen Scraping Basics

The first step is to explore the website where the data is hosted, to identify the best pages to use as a source and determine the feasibility of the approach. Many websites will have feature-rich, user-friendly pages that make it easy to view extracts of data and visualize it, such as the NOAA climate website below.

While easy to use, these pages can be complex and tedious to scrape. Always look for an option for bulk downloading datasets. Such options may lead you to a page sitting behind the scenes of the fancy website, such as the NOAA file directory below. Saving data from a page like this is fairly straightforward.

For the benefit of those of you who are not 1990s era people like myself and may not be familiar with working with HTML, the example below illustrates a simple webpage. With any browser, you can right click on a page and View the Source, to see the HTML code and stylesheets behind the page, which the browser processes and renders to display the site. HTML is a markup language where text is enclosed in tags that tell us something about the content within the tags, and which can be used for displaying the content in different ways. HTML is also hierarchical, so that content can be nested. For example, there is a head section that contains preliminary content about the page, and a body section that encloses the main content. Within the body there can be divisions, and anchor tags that represent links. In this example, one of these anchors is a link to a data file that we want to download.

<html>
<head>
    <title>Example Webpage</title>
</head>
<body>
    <div class='content'>
        <p>Paragraph with text.</p>
        <a href='https://www.page.gov/data.zip'>data.zip</a>
        <a href='https://www.page.gov/page.html'>another page</a>
    </div>
</body>
</html>

We can use Python to parse these tags and pull out desired content. There are four core modules I always use: Requests for downloading content, os for creating folders and working with paths, Beautiful Soup for screen scraping, and datetime for creating time stamps. In the code below, we begin by importing the modules and saving the url of the page we wish to scrape as a variable.

In most Python environments (unless you’ve modified some settings) it’s assumed that your current working directory is the folder where your Python script is stored. When you download files, they will automatically be stored in that folder. To keep things tidy, I always create a subfolder named with the date; I use the date function from datetime to retrieve today’s date, append that date to the word 'downloaded-', and use the os module to create a subfolder with that name. If we run the program at a later date it will save everything in a new folder, rather than overwriting existing files.

import requests, os
from bs4 import BeautifulSoup as soup
from datetime import date

url='https://www.page.gov'

today = str(date.today())
outfolder='downloaded-'+today
if not os.path.exists(outfolder):
    os.makedirs(outfolder)
    
webpage=requests.get(url).content
soup_page=soup(webpage,'html.parser')
page_title = soup_page.title.text
container=soup_page.find('div',{'class':'content'})
links=container.findAll('a')

The final block in this example captures data from the website. We use requests to get the content stored at the url (the webpage), and then we pass this to Beautiful Soup, which parses all of the HTML based on its tags. Once parsed, we can retrieve specific objects. For example, we can save the page title (the text that appears in the heading of your browser for a particular site) as a variable. We also grab the section of the page that contains the links we want to capture by looking for a div with a specific class or id. This isn’t strictly necessary for simple pages like this one, but speeds up processing for larger, more complex pages. Lastly, we can search through that specific container to find all the anchor tags, or links.

Once we have the links, we loop through and save the ones we want. My preference is to store them in a dictionary as key / value pairs, where the key is the name of the file, and the value is the file’s URL. We iterate through the links we saved, and with the soup we determine if the link has an ‘href’ attribute. If it does, we see if it ends with .zip, which is the data file. This skips any link that’s not a file we want, including links that go to other webpages as opposed to files. In practice, I provide a list of several file types here such as .zip, .csv, .txt, .xlsx, .pdf, etc to capture anything that could be data or documentation. If we find the zip, we split the link’s href value from one string of text into a list of strings separated by the forward slash, and grab the last element, which is the name of the file. Lastly, we add this to our datalinks dictionary; in this example, we’d have: {'data.zip':'https://www.page.gov/data.zip'}.

datalinks={}

for lnk in links:
    if 'href' in lnk.attrs:
        if lnk.attrs['href'].endswith(('.zip')):
            fname=lnk.attrs['href'].split('/')[-1]
            datalinks[fname]=lnk.attrs['href']

Time to download! We loop through each key (file name) and value (url) in our dictionary. We use the requests module to try and get the url (v), but if there’s a problem with the website or the link is invalid we bail out. If successful, we use the os module to build a path to our output folder, supplying the name of the file from the website (k) as the name of the file that we want to store on our computer. The ‘wb’ parameter specifies that we’re writing bytes to a file. I always like to keep count of the number of files I’ve done with an iterator (i) so I can print messages to a screen or a log file.

i = 0 
for k,v in datalinks.items():
    try:
        response=requests.get(v)
        response.raise_for_status()
        dfile=open(os.path.join(outfolder,k),'wb')
        dfile.write(response.content)
        dfile.close()
        i=i+1
        print('Downloaded',k)
    except requests.exceptions.RequestException as e:
        print('Could not get',k,'because of',e)
print('Downloaded',i,'files from',page_title)

It’s important to save documentation too, so people can understand how the data was created and structured. In addition to saving pdf and text files, you can also save a vanilla copy of the website; I use a generic name with a date stamp. This saves the basic HTML text of the page, but not any images, documents, or styling, which is usually good enough for documentation purposes.

wfile = '_WEBPAGE-{}.html'.format(today)
writefile=open(os.path.join(outfolder,wfile),'wb')
writefile.write(webpage)
writefile.close()

As mentioned previously, you don’t want to place undue burden on the webserver. With the time module, you can use the sleep function and add a pause to your script for a fixed amount of time, usually at the end of a loop, or after your iterator has recorded a certain number of files. The random module allows you to supply a random time value within a range, if you want to vary the length of the pause.

import time
from random import randint

# Pause fixed amount
time.sleep(5)

# Pause random amount within a range
time.sleep(randint(10,20))

Screen Scraping Caveats

Those are the basics! Now here are the primary exceptions. The first problem is that links to files may not be absolute links that contain the entire path to a file. Sometimes they’re relative, containing a reference to just the subfolder and file. The requests module won’t be able to find these, so we have to take the extra step of building the full path, as in the example below. You can do this by identifying what the relative paths start with (unless they’re all relative and follow the same pattern), and then creating the absolute path by concatenating the root url with the relative one contained in the soup.

    <div class='content'>
        <p>Paragraph with text.</p>
        <a href='/us/data.zip'>data.zip</a>
    </div>

url='https://www.page.gov'
datalinks={}
for lnk in links:
    if lnk.attrs['href'].endswith(('.zip')):
        if lnk.attrs['href'].startswith('/us/'):
            fname=lnk.attrs['href'].split('/')[-1]
            datalinks[fname]=url+lnk.attrs['href']
            ...

In other cases, a link to a data file may not lead directly to the file, but to another web page where that file is stored. We can embed another scraping block into a loop: retrieve and start scraping the main page, then once you find a link go to that page, and repeat the retrieval and scraping. In these cases, it’s best to wrap these steps in a function, so you can call the function multiple times instead of repeating the same code; a rough sketch of such a function follows the snippet below.

<div class='content'>
    <p>Paragraph with text.</p>
    <a href='https://www.page.gov/us/'>more data</a>
</div>
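Here’s a minimal sketch of that pattern, reusing the imports and variables from the earlier example (the div class, the file extensions, and the test for which links lead to sub-pages are placeholders you’d adjust for the actual site):

def scrape_page(page_url):
    """Fetch a page, parse it, and return a dict of file names and URLs."""
    webpage=requests.get(page_url).content
    soup_page=soup(webpage,'html.parser')
    container=soup_page.find('div',{'class':'content'}) # placeholder class
    found={}
    for lnk in container.findAll('a'):
        if 'href' in lnk.attrs:
            href=lnk.attrs['href']
            if href.endswith(('.zip','.csv','.pdf')):
                found[href.split('/')[-1]]=href
    return found

# Scrape the main page, then follow each link that leads to another page
datalinks={}
for lnk in links:
    if 'href' in lnk.attrs and lnk.attrs['href'].endswith('/'):
        datalinks.update(scrape_page(lnk.attrs['href']))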

Some websites will have dedicated pages where they embed a parameter in the url, such as codes for countries or states. If you know what these are, you can define them in a list, and iterate through that list by formatting the url to insert the code, and then scrape that page. If a page uses a unique integer as an ID and you know what the upper limit is, you can use for i in range(1,n) to step through each page (but make sure you handle exceptions, in case an integer isn’t used or is missing).

codes=['us','ca','mx']
url='https://www.page.gov/{}'
for c in codes:
    webpage=requests.get(url.format(c)).content
    soup_page=soup(webpage,'html.parser')
    ... 

For complicated sites with several pages, you might not want to dump all the files into the same folder. Instead, as you iterate through pages, you can create a dedicated folder for that iteration. Using the example above, if there is a page for each country code, you can create a folder for that code and, when writing files, use os.path.join to store the files in the folder for that iteration.

codes=['us','ca','mx']
for c in codes:
    ...
    cfolder=os.path.join(outfolder,c)
    if not os.path.exists(cfolder):
        os.makedirs(cfolder)
    ...
    response=requests.get(v)
    response.raise_for_status()
    dfile=open(os.path.join(cfolder,k),'wb')
    dfile.write(response.content)
    dfile.close()

For websites with lots of files, or with a few big files, you may run out of memory during the download process and your script will go kaput. To avoid this, you can stream a file in chunks instead of trying to download it in one go. Use the requests module’s iter_content function, and supply a reasonable chunk size in bytes (10000000 bytes is 10 MB).

...
try:
    with requests.get(v,stream=True) as response:
        response.raise_for_status()
        fpath=os.path.join(outfolder,k)
        with open(fpath,'wb') as writefile:
            for chunk in response.iter_content(chunk_size=10000000):
                writefile.write(chunk)
    i=i+1
    print('Downloaded',k)
except requests.exceptions.RequestException as e:
    print('Could not get',k,'because of',e)

If you view the page source for a website and don’t actually see the anchor links and file names in the HTML, you’re probably dealing with a page that employs JavaScript, which is a show stopper if you’re using Beautiful Soup. Sometimes there’s a dropdown menu or option you have to choose first in order to render the actual page (and you may be able to use the page parameters trick above, if the url varies from page to page). But you may be stuck: instead of links, there may be download buttons you have to press or menu options you have to choose in order to retrieve each file.

One option would be to use a Python module called Selenium, which allows you to automate the process of using a web browser: opening a page, finding a button, and clicking it. I’ve tried Selenium with some success, but find that it’s complex and clunky for screen scraping. It’s browser dependent (you’re automating the use of a browser, and they’re all different), and you’re forced to incorporate lots of pauses: waiting for a page to load before attempting to parse it, dealing with pop-up menus in the browser as you attempt to download multiple files, and so on.
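To give a sense of the general shape, here’s a bare-bones sketch rather than a working recipe; the URL and the button id are hypothetical, and in practice you’d also need the appropriate browser driver installed and download preferences configured:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()             # browser dependent; could be Chrome, Edge, etc.
driver.get('https://www.page.gov/data')  # hypothetical JavaScript-driven page
time.sleep(10)                           # wait for the page to render

# Find a download button by its id (a placeholder here) and click it
button = driver.find_element(By.ID, 'download-button')
button.click()

time.sleep(30)                           # wait for the download to finish before quitting
driver.quit()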

Another option that I’m not familiar with, and thus haven’t tried, would be to use JavaScript, since that’s what the page uses. Most browsers have a web developer console that allows you to execute snippets of JavaScript code in order to do something on a page. So some automation may be possible.

Using an API

You may be able to avoid scraping altogether if the data is made available via an API. With a REST API, you pass parameters into a base link to make a specific request. Using requests, you go to that URL, and instead of getting a web page you get the data that you’ve asked for, usually packaged in a JSON type object within your program (Python or another scripting language). Some APIs retrieve documents or dataset files that you can stream and download as described previously. But most APIs for statistical data retrieve individual data records, which you would store in a nested list or dictionary and then write out to a CSV. The example below grabs the total population for four large cities in Rhode Island from the 2020 decennial census public redistricting dataset.

import requests,csv

year='2020'
dsource='dec' # survey
dseries='pl' # dataset
cols='NAME,P1_001N' # variables
state='44' # geocodes for states
place='19180,54640,59000,74300' # geocodes for places
outfile='census_pop2020.csv'
keyfile='census_key.txt'

with open(keyfile) as key:
    api_key=key.read().strip()

base_url = f'https://api.census.gov/data/{year}/{dsource}/{dseries}'

# for sub-geography within larger geography - geographies must nest
data_url = f'{base_url}?get={cols}&for=place:{place}&in=state:{state}&key={api_key}'

response=requests.get(data_url)

popdata=response.json()
for record in popdata:
    print(record)
    
with open(outfile, 'w', newline='') as writefile:
    writer=csv.writer(writefile, quoting=csv.QUOTE_MINIMAL, delimiter=',')
    writer.writerows(popdata)

The benefit of an API is that it’s designed to retrieve machine readable data, and might be easier than scraping pages that have complex interfaces. The major downside is that, if you’re forced to download individual records as opposed to entire files, the process can take a long time, to the point where it may be infeasible for very large datasets. It’s always worth checking to see if there is a bulk download option, as that could be easier and more efficient (for example, the Census Bureau has an FTP site for downloading datasets in their entirety). Using an API also requires you to invest time in studying how it works, so you can build the appropriate links and ensure that you’re capturing everything.

Conclusion

Screen scraping will vary from website to website, but once you have enough examples it becomes easy to reuse and adapt your code. You’ll always need to modify the Beautiful Soup step based on the structure of the individual pages, but the requests downloading step is more rote and may not require much modification. While I use Python, you can use other languages like R to achieve similar results.

Visit my library’s US Federal Government Data Backup GitHub for working examples of code that I and colleagues have used to capture datasets. In my programs I’ve added extra components, like writing a basic metadata file and error logs, which I haven’t covered in this post. The NOAA County at a Glance, IRS-SOI, and IMLS scripts are basic examples, and the IMLS ones include some of the caveats I’ve described. The NOAA lake and sea level rise scripts are far more complex, and include cycling through many pages, creating multiple folders, streaming downloads, and encapsulating processes into functions. The USAID DHS Indicators scripts used APIs that retrieved files, while the USAID DHS SDR script used Selenium to step through a series of JavaScript pages.

You’ll find scripts but no datasets in the GitHub repo due to file size limitations. If you’re a member of an institution that has access to GLOBUS, you can access the data files by following the instructions at the top of the page. Otherwise, we’ve contributed all of our datasets to DataLumos (except for the sea level rise data; I’m working with another university to host that).

USAID DHS No Data Available

Rescuing US Government Data

There’s been a lot of turmoil emanating from Washington DC lately. One development that’s been more under the radar than others has been the modification or removal of US federal government datasets from the internet (for some news, see these articles in the New Yorker, Salon, Forbes, and CEN). In some cases, this is the intentional scrubbing or deletion of datasets that focus on topics the current administration doesn’t particularly like, such as climate and public health. In other cases, the dismemberment of agencies and bureaus makes data unavailable, as there’s no one left to maintain or administer it. While most government data is still available via functioning portals, most of the faculty and researchers I work with can identify at least a few series they rely on that have disappeared.

Librarians, archivists, researchers, professors, and non-profits across the country (and even in other parts of the world), have established rescue projects, where they are actively downloading and saving data in repositories. I’ve been participating in these efforts since January, and will outline some of the initiatives in this post.

The Internet Archive

The place of last resort for finding deleted web content is the Internet Archive. This large, non-profit project has been around nearly as long as the web has existed, with the goal of creating a historic archive of the internet. It uses web crawlers or spiders to crawl across the web and make copies of websites. With the Wayback Machine, you can enter a URL and find previous copies of web pages, including sites that no longer exist. You’re presented with a calendar page where you can scroll by year and month to select a date when a page was captured, which opens up a copy.

A Wayback Machine search for https://tools.niehs.nih.gov/cchhl/index.cfm. Blue circles on the calendar indicate when the page was captured.

This allows you to see the content, navigate through the old website, and in many cases download files that were stored on those pages. It’s a great resource, but it can’t capture everything; given the variety and complexity of web pages and evolving web technologies, some websites can’t be saved in working order (either partially or entirely). Content that was generated and presented dynamically with JavaScript, or was pulled from a database, is often not preserved, nor are restricted pages that require log-ins.

An archived copy of the NIEHS page (the actual website was deleted in mid February 2025)

The Internet Archive also hosts a number of special collections where folks have saved documents, images, sound and video, and software. For example, you can find many research articles that are available in PubMed from the PubMed Central collection, a ton of documents from the USDA’s National Agricultural Library, and about 100 GB of data someone captured from the CDC in January 2025. A large project called the End of Term Archive was launched in 2008 to capture what federal government websites looked like at the end of each presidential term. The pages are saved in a special collection in the IA.

Data Rescue Project

Dozens of new data archiving projects were launched at the end of 2024 and beginning of 2025 with the intention of saving federal datasets. The Data Rescue Project is one of the larger efforts, driven by data librarians and archivists with non-profit partners. Professional groups including IASSIST, ICPSR, RDAP, the Data Curation Network, and the Safeguarding Research & Culture project have been active organizers and participants. While this will be an oversimplification, I’ll summarize the project as having two goals.

The first goal is to keep track of what the other archiving projects are, and what they have saved. To this end, they created the Data Rescue Tracker, which has two modules. The Downloads List is an archive of datasets that have been saved, with details about where the data came from and locations of archived copies. The Maintainers List is a catalog of all the different preservation projects, with links to their home pages. There is also a narrative page with a comprehensive list of links to the various rescue efforts, data repositories, alternate sources for government data, and tools and resources you can use to save and archive data.

The Data Rescue Tracker Downloads List

The second goal is to contribute to the effort of saving and archiving data. The team maintains an online spreadsheet with tabs for agencies that contain lists of datasets and URLs that are currently prioritized for saving. Volunteers sign up for a dataset, and then go out and get it. Some folks are manually downloading and saving files (pointing and clicking), while others write short screen scraping scripts to automate the process. The Data Rescue Project has partnered with ICPSR, a preeminent social science research center and repository in the US, at the University of Michigan. They created a repository called DataLumos, which was launched specifically for hosting extracts of US federal government data. Once data is captured, volunteers organize it and generate metadata records prior to submitting it to DataLumos (provided that the datasets are not too big).

DataLumos archive for federal government datasets, maintained by ICPSR

Most of the datasets that DRP is focused on are related to the social sciences and public policy. The Data Rescue Project coordinates with the Environmental and Government Data Initiative and the Public Environmental Data Partners (which I believe are driven by non-profit and academic partners), who are saving data related to the environment and health. They have their own workflows and internal tracking spreadsheets, and are archiving datasets in various places depending on how large they are. Data may be submitted to the Internet Archive, the Harvard Dataverse, GitHub, SciOp, and Zenodo (you can find out where in the Data Rescue Tracker Downloads List).

Mega Projects

There are different approaches for tackling these data preservation efforts. For the Data Rescue Project and related efforts, it’s like attacking the problem with millions of ants. Individual people are coordinating with one another in thousands of manual and semi-automated download efforts. A different approach would be to attack the problem with a small herd of elephants, who can employ larger resources and an automated approach.

For example, the Harvard Law School Library Innovation Lab launched the Archive of data.gov, a large project to crawl and download everything that’s in data.gov, the US federal government’s centralized data repository. It mirrors all the data files stored there and is updated regularly. The benefit of this approach is that it captures a comprehensive amount of data in one go, and can be readily updated. The primary limitation is that there are many cases where a dataset is not actually stored in data.gov, but is referenced in a catalog record with a link that goes out to a specific agency’s website. These datasets are not captured with this approach.

If trying to find back-ups is a bit bewildering, there’s a tool that can help. Boston University’s School of Public Health and Center for Health Data Science have created a find lost* data search engine, which crawls across the Harvard Project, DataLumos, the Data Rescue Project, and others.

Beyond the immediate data preservation projects that have sprung up recently, there are a number of large, on-going projects that serve as repositories for current and historical datasets. Some, like IPUMS at the University of Minnesota and the Election Lab at MIT focus on specific datasets (census data for the former, election results data for the latter). There are also more heterogeneous repositories like ICPSR (including OpenICPSR which doesn’t require a subscription), and university-based repositories like the Harvard Dataverse (which includes some special collections of federal data extracts, like CAFE). There are also private-sector partners that have an equal stake in preserving and providing access to government data, including PolicyMap and the Social Explorer.

Wrap-up

I’ve been practicing my Python screen scraping skills these past few months, and will share some tips in a subsequent post. I’ve been busy contributing data to these projects and coordinating a response on my campus. We’ve created a short list of data archives and alternative sources, which captures many of the sources I’ve mentioned here plus a few others. My library colleagues in the health and medical sciences have created a list of alternatives to government medical databases including PubMed and ClinicalTrials.gov.

Having access to a public and robust federal statistical system is a non-partisan issue that we should all be concerned about. Our Constitution provides justification (in several sections) for such a system, and we have a large body of federal laws that require it. Like many other public goods, the federal statistical system contributes to providing a solid foundation on which our society and economy rest, and helps drive innovation in business, policy, science, and medicine. It’s up to us to protect and preserve it.

Map Vacant Housing Units

Census Time Series Tables from NHGIS

I’m often asked about what the best approaches are for comparing US census data over time, to account for changes in census geography and to limit the amount of data processing you have to do in stitching data from different census years together. Census geography changes significantly each decade, and by and large the Census Bureau does not compile and publish historical comparison tables.

My primary suggestion is to use the National Historical Geographic Information System or NHGIS (I’ll mention some additional suggestions at the end of this post). Maintained by IPUMS at the University of Minnesota, NHGIS is the repository for all historic US census summary data from 1790 to present. While most of the data in the archive is published nominally (the format and structure in which the data was originally published), they do publish a set of Time Series Tables that compile multiple years of census data in one table. These tables come in two formats:

  • Nominal tables: the data is published “as is”, based on the boundaries that existed at each point in time. If a geography was added or dropped over the course of the years, it falls in or out of the table in the given year that the change occurred. With a few exceptions, the earliest nominal tables begin with the 1970 census and are published for eight geographies: nation, regions, divisions, states, counties, census tracts, county subdivisions, and places.
  • Standardized tables: the data has been normalized, where a geography for a single time period serves as the basis for all data in the table. The NHGIS is currently using 2010 as the basis, so that data prior and subsequent to 2010 has been modified to fit within the 2010 boundaries. This is achieved by aggregating block or block group data from each period to fit within the 2010 boundaries, and apportioning the data in cases where a block or group is split by a boundary. The earliest standardized tables begin with the 1990 census, and cover the basic 100% count data. Data is published for ten geographies: states, counties, census tracts, block groups, county subdivisions, places, congressional districts (as defined for the 110th-112th Congresses, 2007-2013), core based statistical areas (using 2009 metro area definitions), urban areas, and ZIP Code Tabulation Areas (ZCTAs).

Included in the documentation is a full list of time series tables, and whether they are available in nominal or standardized format. The availability of specific time periods and geographies varies. As of late 2024, the availability of standardized tables that include the 2020 census is limited to what was published in the early Public Redistricting Files. This will likely change in the near future to include additional 2020 data, and it’s possible that the standardized geography will eventually switch from 2010 to 2020 geography.

To access the Time Series Tables, you can browse the NHGIS without an account but you’ll need to create one in order to download anything. Once you launch NHGIS, click on the Topics filter. In the list of topics, any topic under the Population or Housing category that has a “TS” flag next to it has at least one time series table. In the example below, I’ve used the filters to select census tracts for Geographic Level, 2010 and 2020 for Years, and Housing Occupancy and Vacancy Status as my Topic.

NHGIS Time Series flags in topics filter

In the results at the bottom, the original Source Tables from each census are shown in the first tab. The Time Series Tables can be viewed by selecting the adjacent tab. The first two tables in this example are Housing Units by Occupancy Status. Clicking on the name of the tables reveals the variables that are included, and the source for the statistics. The first table is a nominal one that stretches from 1970 to the most recent ACS. The second table is a standardized one that covers 1990 to 2020. I’ve checked both boxes to add these to my cart.

NHGIS Select Time Series Tables

The third tab in the results is GIS Files. If we want to map standardized data, we would choose just the boundaries for the standardized year, as all of the data in the table has been modified to fit these boundaries. If we were mapping nominal data, we would need to download boundary files for each time period and map them separately (unless they were stable geographies like states that haven’t changed since 1970).

We hit the Continue button in the Cart panel when we’re ready to download. By default the extract will only include the years and geographies we have filtered for; additional years or geographies can be added on this next screen. I’ve modified my list to get all available decennial years for each table. Note that if you’re going to select 5-year ACS data for nominal tables, choose only a few non-overlapping periods. In most cases you can’t filter geographies (i.e. select tracts within a state), you have to take them all. On the final screen you choose your structure; CSV is usually best, as is the “Time varies by column” layout. Once you submit your request you’ll be prompted to log in if you haven’t already done so. Wait a bit for the extract to compile, then you can download the table and codebook.

NHGIS Select Years for Time Series

A portion of the nominal table is depicted below. This table includes identifiers and labels for each of the census years. The variables follow, ordered by variable and then by year. In this example, occupied housing units from 1970 to 2020 appear in the first block, and vacant units in the second. All the 1970 census tract values for Autauga County, Alabama are blank (as many rural counties in 1970 were un-tracted). We can see that values for census tract 205 run only from 1980 to 2010, with no value for 2020. The tract was split into three parts in 2020, and we see values for tracts 205.01, .02, and .03 appear in 2020. So in the nominal tables, geographies appear and disappear as they are created or destroyed. However, if geographic boundaries change but the name and designation for the geography do not, that geography persists throughout the time series in spite of the change.

Nominal tract table

A portion of the standardized table is below. This table only includes identifiers and labels for the 2010 census, as all data was modified to fit the tract geography of that year. The values for each census year except 2010 are published in triplicate: an estimate, and a lower and upper bound for the estimate. If the values in these three columns differ, it indicates that a block (or block group) was split and reapportioned to fit within the tract boundary for 2010 (you may also see decimals, indicating a split occurred). You’d use the estimate in your work, while the bounds provide some indication of the estimate’s accuracy. Note in this table, tract 205 in Autauga County persists from 1990 to 2020, as it existed in 2010. Data from the three 2020 tracts was aggregated to fit the 2010 boundary.

Standardized tract table

The crosswalk tables that IPUMS used to create the standardized data are available, if you wanted or needed to generate your own normalized data. The best approach is to proceed from the bottom up, aggregating blocks to reformulate the data to the geography you wish to use. Some decennial census data, and all data from the ACS, is not available at the block level, which necessitates using block groups instead.
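To illustrate the general idea with pandas (a rough sketch; the file and column names below are placeholders, not the actual NHGIS crosswalk fields):

import pandas as pd

# Hypothetical file and column names; check the NHGIS crosswalk documentation for the real ones
blocks = pd.read_csv('blocks_2020_pop.csv')           # source data at the block level
xwalk = pd.read_csv('crosswalk_blk2020_trt2010.csv')  # block-to-tract crosswalk with weights

# Attach each 2020 block's target 2010 tract and its apportionment weight
merged = blocks.merge(xwalk, on='block_2020_id')

# Apportion values where blocks are split, then sum within the 2010 tract boundaries
merged['pop_apportioned'] = merged['pop'] * merged['weight']
tracts_2010 = merged.groupby('tract_2010_id', as_index=False)['pop_apportioned'].sum()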

There are some alternatives for obtaining or creating time series census data, which could fit the bill depending on your use case (especially if you are looking at larger geographies). There’s also reference material that can help you make sense of changes.

  1. The Longitudinal Tract Database at Brown University provides tract-level crosswalks from 1970 to 2020. They also provide some pre-compiled data tables generated from the crosswalk.
  2. For short term comparisons, the ACS includes Comparison Profile Tables for states, counties, places, and metro areas that compare two non-overlapping time periods. For example, here is the 5-year ACS Comparative Demographic Estimates profile for Providence RI in 2022 (compares 2018-2022 with 2013-2017).
  3. Use an interactive mapping tool like the Social Explorer to make side by side comparison maps from two time periods. They also incorporate some of the NHGIS standardized data into their database. (SE is a subscription-based product; if you’re at a university see if your library subscribes).
  4. The Population and Housing Unit Estimates program publishes annual estimates for states, counties, and metro areas in decade by decade spreadsheets. The MCDC has created some easy to use tools for summarizing and charting this data to show annual population change and changes in demographic characteristics.
  5. I had previously written about pulling population and economic time series tables for states, counties, and metro areas from the Bureau of Economic Analysis data portal.
  6. Counties change more often than you think. The Census Bureau has a running list of changes to counties from 1970 to present. Metro areas change frequently too, but since they are built from counties you can aggregate older county data to fit modern metro boundaries. The census provides delineation files that assign counties to metros.

Route from SciLi to Libraries on OSM

Plotting Routes with OpenRouteService and Python

I made my first foray into network routing recently, and drafted a Python script and notebook that plots routes using the OpenRouteService (ORS) API. ORS is based on underlying data from OpenStreetMap (OSM), and was created by the Heidelberg Institute for Geoinformation Technology at Heidelberg University in Germany. They publish several routing APIs that include directions, isochrones, distance matrices, geocoding, and route optimization. You can access them via a basic REST API, but they also have a dedicated Python wrapper and an R package which makes things a bit easier. For non-programmers, there is a plugin for QGIS.

Regardless of which tool you use, you need to register for an API key first. The standard plan is free for small projects; for example you can make 2,000 direction requests per day with a limit of 40 per minute. If you’re affiliated with higher ed, government, or a non-profit and are doing non-commercial research, you can upgrade to a collaborative plan that ups the limits. It’s also possible to install ORS locally on your own server for large jobs.

I opted for Python and used the openrouteservice Python module, in conjunction with other geospatial modules including geopandas and shapely. In my script / notebook I read in two CSV files, one with origins and the other with destinations. At minimum both files must contain a header row, and attributes for a unique identifier, place label, longitude, and latitude in the WGS 84 spatial reference system. The script plots a route between each origin and every destination, and outputs three shapefiles that include the origin points, destination points, and routes. Each line in the route file includes the ID and names of each origin and destination, as well as distance and travel time. The script and notebook are identical, except that the script plots the end result (points and lines) using geopandas’ plot function, while the Jupyter Notebook plots the results on a Folium map (Folium is a Python implementation of the popular Leaflet JavaScript library).

Visit the GitHub repo to access the scripts; a basic explanation with code snippets follows.

After importing the modules, you define several variables that determine the output, including a general label for naming the output file (routename), and several parameters for the API including the mode of travel (driving, walking, cycling, etc), distance units (meters, kilometers, miles), and route preference (fastest or shortest). Next, you provide the positions or “column” locations of attributes in the origin and destination CSV files for the id, name, longitude, and latitude. Lastly, you specify the location of those input files and the file that contains your API key. The location and names of output files are generated automatically based on the input; all will contain today’s date stamp, and the route file name includes route mode and preference. I always use the os module’s path function to ensure the scripts are cross-platform.

import openrouteservice, os, csv, pandas as pd, geopandas as gpd
from shapely.geometry import shape
from openrouteservice.directions import directions
from openrouteservice import convert
from datetime import date
from time import sleep

# VARIABLES
# general description, used in output file
routename='scili_to_libs'
# transit modes: ["driving-car", "driving-hgv", "foot-walking", "foot-hiking", "cycling-regular", "cycling-road", "cycling-mountain", "cycling-electric"]
tmode='driving-car'
# distance units: ["m", "km", "mi"]
dunits='mi'
# route preference: ["fastest", "shortest", "recommended"]
rpref='fastest'

# Column positions in csv files that contain: unique ID, name, longitude, latitude
# Origin file
ogn_id=0
ogn_name=1
ogn_long=2
ogn_lat=3
# Destination file
d_id=0
d_name=1
d_long=2
d_lat=3

# INPUTS and OUTPUTS
today=str(date.today()).replace('-','_')

keyfile='ors_key.txt'
origin_file=os.path.join('input','origins.csv') #CSV must have header row
dest_file=os.path.join('input','destinations.csv') #CSV must have header row
route_file=routename+'_'+tmode+'_'+rpref+'_'+today+'.shp'
out_file=os.path.join('output',route_file)
out_origin=os.path.join('output',os.path.basename(origin_file).split('.')[0]+'_'+today+'.shp')
out_dest=os.path.join('output',os.path.basename(dest_file).split('.')[0]+'_'+today+'.shp')

I define some general functions for reading the origin and destination files into nested lists, and for taking those lists and generating shapefiles out of them (by converting them to geopandas GeoDataFrames). We read the origin and destination data in, grab the API key, set up a list to hold the routes, and create a header for the eventual output.

# For reading origin and dest files
def file_reader(infile,outlist):
    with open(infile,'r') as f:
        reader = csv.reader(f)    
        for row in reader:
            rec = [i.strip() for i in row]
            outlist.append(rec)
            
# For converting origins and destinations to geodataframes            
def coords_to_gdf(data_list,long,lat,export):
    """Provide: list of places that includes a header row,
    positions in list that have longitude and latitude, and
    path for output file.
    """
    df = pd.DataFrame(data_list[1:], columns=data_list[0])
    longcol=data_list[0][long]
    latcol=data_list[0][lat]
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df[longcol], df[latcol]), crs='EPSG:4326')
    gdf.to_file(export,index=True)
    print('Wrote shapefile',export,'\n')
    return(gdf)
      
origins=[]
dest=[]
file_reader(origin_file,origins)
file_reader(dest_file,dest)

# Read api key in from file
with open(keyfile) as key:
    api_key=key.read().strip()

route_count=0
route_list=[]
# Column header for route output file:
header=['ogn_id','ogn_name','dest_id','dest_name','distance','travtime','route']

Here are some nested lists from my sample origin and destination CSV files:

[['origin_id', 'name', 'long', 'lat'], ['0', 'SciLi', '-71.4', '41.8269']]
[['dest_id', 'name', 'long', 'lat'],
 ['1', 'Rock', '-71.405089', '41.825725'],
 ['2', 'Hay', '-71.404947', '41.826433'],
 ['3', 'Orwig', '-71.396609', '41.824581'],
 ['4', 'Champlin', '-71.408194', '41.818912']]

Then the API call begins. For every record in the origin list, we iterate through each record in the destination list (in both cases starting at index 1, skipping the header row) and calculate a route. We create a tuple with each pair of origin and destination coordinates (coords), which we supply to the ORS directions API. We pass in the parameters supplied earlier, and specify instructions as False (instructions are the actual turn by turn directions returned as text).

The result is returned as a JSON object, which we can manipulate like a nested Python dictionary. At the first level in the dictionary, we have three keys and values: a bounding box for the route area with a list value, metadata with a dictionary value, and routes with a list value. Dive into routes, and the list contains a single dictionary; inside that dictionary are more dictionaries that contain the values we want!

1st level, dictionary with three keys, the routes key has a single list value
2nd level, the routes list has a single element, another dictionary
3rd level, inside the dictionary in that list element, four keys with route data
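Abridged, the returned object looks something like this (values shortened, and keys we don’t use omitted for illustration):

{'bbox': [...],
 'metadata': {...},
 'routes': [{'summary': {'distance': 1.229, 'duration': 232.2},
             'geometry': '<encoded polyline string>',
             ...}]}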

The next step is to extract the values that we need from this container by specifying their location. For example, the distance value is inside the first list element of routes, inside summary, and inside distance. Travel time is in a similar spot, and we take an extra step of dividing by 60 to get minutes instead of seconds. The geometry is trickier, as it’s stored as an encoded polyline. We use ORS’s decoding function to turn it into plain text, and shapely to convert it into WKT; we’ll need WKT in order to get the geometry into a geodataframe, and eventually output it as a shapefile. Once we have the bits we need, we string them together as a list for that origin / destination pair, and append this to our route list.

# API CALL
for ogn in origins[1:]:
    for d in dest[1:]:
        try:
            coords=((ogn[ogn_long],ogn[ogn_lat]),(d[d_long],d[d_lat]))
            client = openrouteservice.Client(key=api_key) 
            # Take the returned object, save into nested dicts:
            results = directions(client, coords, 
                                profile=tmode,instructions=False, preference=rpref,units=dunits)
            dist = results['routes'][0]['summary']['distance']
            travtime=results['routes'][0]['summary']['duration']/60 # Get minutes
            encoded_geom = results['routes'][0]['geometry']
            decoded_geom = convert.decode_polyline(encoded_geom) #convert from binary to txt
            wkt_geom=shape(decoded_geom).wkt #convert from json polyline to wkt
            route=[ogn[ogn_id],ogn[ogn_name],d[d_id],d[d_name],dist,travtime,wkt_geom]
            route_list.append(route)
            route_count=route_count+1
            if route_count%40==0: # API limit is 40 requests per minute
                print('Pausing 1 minute, processed',route_count,'records...')
                sleep(60)
        except Exception as e:
            print(str(e))
            
api_key=''
print('Plotted',route_count,'routes...' )

Here is some sample output for the final origin / destination pair, which contains the IDs and labels for the origin and destination, distance in miles, time in minutes, and a string of coordinates that represents the route:

['0', 'SciLi', '4', 'Champlin', 1.229, 3.8699999999999997,
 'LINESTRING (-71.39989 41.82704, -71.39993 41.82724, -71.39959 41.82727, -71.39961 41.82737, -71.39932 41.8274, -71.39926 41.82704, -71.39924000000001 41.82692, -71.39906000000001 41.82564, -71.39901999999999 41.82534, -71.39896 41.82489, -71.39885 41.82403, -71.39870000000001 41.82308, -71.39863 41.82269, -71.39861999999999 41.82265, -71.39858 41.82248, -71.39855 41.82216, -71.39851 41.8218, -71.39843 41.82114, -71.39838 41.82056, -71.39832 41.82, -71.39825999999999 41.8195, -71.39906000000001 41.81945, -71.39941 41.81939, -71.39964999999999 41.81932, -71.39969000000001 41.81931, -71.39978000000001 41.81931, -71.40055 41.81915, -71.40098999999999 41.81903, -71.40115 41.81899, -71.40186 41.81876, -71.40212 41.81866, -71.40243 41.81852, -71.40266 41.81844, -71.40276 41.81838, -71.40452000000001 41.81765, -71.405 41.81749, -71.40551000000001 41.81726, -71.40639 41.81694, -71.40647 41.81688, -71.40664 41.81712, -71.40705 41.81769, -71.40725 41.81796, -71.40748000000001 41.81827, -71.40792 41.81891, -71.40794 41.81895)']

Finally, we can write the output. We convert the nested route list to a pandas dataframe and use the header row for column names, and convert that dataframe to a geodataframe, building the geometry from the WKT string, and write that out. In contrast, the origins and destinations have simple coordinates (not in WKT), and we create XY geometry from those coordinates. Writing the geodataframe out to a shapefile is straightforward, but for debugging purposes it’s helpful to see the result without having to launch GIS. We can use geopandas’s plot function to draw the resulting geometry. I’m using the Spyder IDE, which displays plots in a dedicated window (in my example the coordinate labels for the X axis look strange, as the distances I’m plotting are small).

# Create shapefiles for routes
df = pd.DataFrame(route_list, columns=header)
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkt(df["route"]),crs = 'EPSG:4326')
gdf.drop(['route'],axis=1,inplace=True) # drop the wkt text
gdf.to_file(out_file,index=True)
print('Wrote route shapefile to:',out_file,'\n')

# Create shapefiles for origins and destinations
ogdf=coords_to_gdf(origins,ogn_long,ogn_lat,out_origin)
dgdf=coords_to_gdf(dest,d_long,d_lat,out_dest)

# Plot
base=gdf.plot(column="dest_id", kind='geo',cmap="Set1")
ogdf.plot(ax=base, marker='o',color='black')
dgdf.plot(ax=base, marker='x', color='red');

In a notebook environment we can employ something like Folium more readily, which gives us a basemap and some basic interactivity for zooming around and clicking on features to see attributes. Implementing this was more complex than I thought it would be, and took me longer to figure out compared to the routing process. I’ll return to those details in a subsequent post…

In my sample data (output rendered below in QGIS) I was plotting the fastest driving routes from the Brown University Sciences Library to the other libraries in our system. Compared to Google or Apple Maps the result made sense, although the origin coordinates I used for the SciLi had an impact on the outcome (they assumed you left from the loading dock behind the building rather than from the front door as Google did, which produces different routes in this area of one-way streets). My real application was plotting distances of hundreds of miles across South America, which went well and was useful for generating different outcomes using the fastest or shortest route options.

Take a look at the full script on GitHub, or if programming is not your thing check out the QGIS plugin instead (activate it in the Plugins menu and search for ORS Tools). Remember to get your API key first.

Least Cost Paths in QGIS and GRASS

Cost Surfaces and Least Cost Paths in QGIS and GRASS

In this post I’ll demonstrate how to create least cost paths using QGIS and GRASS GIS, and in doing so will describe how a cost surface is constructed. In a surface analysis, you model movement across a grid whose values represent friction encountered in moving across it. In computing a least cost path, you’re seeking an optimal route from an origin to the closest destination, where ‘close’ incorporates distance and ease of movement across that surface. These kinds of analyses are often conducted in the environmental sciences, in modeling the movement of water across terrain, and in zoology in predicting migration paths for land-based animals.

In this example the idea was to chart the origin of settlements and possible trade routes in ancient history. In applications where we’re studying human activity, network analysis is typically used instead. Networks are built from nodes and connections: a node is a place or person, and connections between nodes are represented with lines. Lines typically have a value associated with them that indicates either the strength of a connection or, conversely, the friction associated with moving between nodes. The idea for this project was to identify how networks formed, so the surface analysis served as a proto-network analysis. While there were roads and maritime routes in pre-modern times, these networks were weaker and less dense. Charting movement over a surface representing terrain could provide a decent approximation of routes (but if you’re interested in ancient Roman network routing, check out the ORBIS project at Stanford).

This example stems from a project I was helping a PhD student with; I don’t want to replicate his specific study, so I’ve modified the data sources and area of focus to model movement between large settlements and stone quarries in the ancient Roman world. My goal is to demonstrate the methods with a plausible example; if we were doing this as part of an actual study, we would need to be more discriminating in selecting and processing our data.

Preliminary Work

The Pleiades project will serve as our source for destinations; it’s an academic gazetteer that includes locations and place names for the ancient and early medieval world, stretching from Europe and North Africa through the Middle East to India. It’s published in many forms, and I’ve downloaded the Pleiades Data for GIS in CSV format. In QGIS, I used the Add Delimited Text tool to plot places.csv and get all of the locations, then joined that file to places_place_type.csv, which contains the different categories of places. I used Select by Attributes to select the locations classified as quarries, and exported the selection to a geopackage.
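For reference, here’s a rough equivalent of those point-and-click steps in pandas and geopandas. The column names used here (id, place_id, place_type, representative_longitude / representative_latitude) are assumptions, so check them against the actual CSV headers before running:

import pandas as pd
import geopandas as gpd

# join the locations to their place types and keep only the quarries
places = pd.read_csv('places.csv')
types = pd.read_csv('places_place_type.csv')
joined = places.merge(types, left_on='id', right_on='place_id')
quarries = joined[joined['place_type'] == 'quarry']

# build point geometry from the coordinate columns and save to a geopackage
quarries_gdf = gpd.GeoDataFrame(
    quarries,
    geometry=gpd.points_from_xy(quarries['representative_longitude'],
                                quarries['representative_latitude']),
    crs='EPSG:4326')
quarries_gdf.to_file('quarries.gpkg', layer='quarries', driver='GPKG')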

The Pleiades data includes a category for settlements, but there are about ten thousand of these and there isn’t an easy way to create a subset of the largest places. So I opted to use Hanson’s dataset of the largest settlements in the ancient Roman world to serve as our source for origins (about 1,400 places). This data was packaged in an Excel file; I plotted the coordinates using the Create Points Layer from Table tool in QGIS and converted the result to a geopackage. For testing purposes, I selected a subset of ten major cities and saved them in a separate layer: Athenae, Alexandria (Aegyptus), Antiochia (Syria), Byzantium, Carthago, Ephesus, Lugdunum, Ostia, Pergamum, Roma.
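The same table-to-points step can also be scripted; in this sketch the spreadsheet name and the longitude / latitude column names are placeholders for whatever the Hanson file actually uses:

import pandas as pd
import geopandas as gpd

# read the spreadsheet and create point geometry from its coordinate columns
cities = pd.read_excel('hanson_cities.xlsx')
cities_gdf = gpd.GeoDataFrame(
    cities,
    geometry=gpd.points_from_xy(cities['longitude'], cities['latitude']),
    crs='EPSG:4326')
cities_gdf.to_file('cities.gpkg', layer='cities', driver='GPKG')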

For the friction grid, I downloaded a GeoTIFF of the Human Mobility Index by Ozak. The description from the project:

“The Human Mobility Index (HMI) estimates the potential minimum travel time across the globe (measured in hours) accounting for human biological constraints, as well as geographical and technological factors that determined travel time before the widespread use of steam power.”

There are three separate grids that vary in extent based on the availability of seafaring technology. I chose the grid that incorporates seafaring prior to the advent of ocean-going ships, which is appropriate for the Mediterranean world during the classical era. The HMI is a global grid at 925 meter resolution. To minimize processing time, I clipped it to a bounding box that encompasses the area of study. The grid is in the World Cylindrical Equal Area system; I reprojected it to WGS 84 to match the rest of the layers. As long as we’re not measuring actual distances, we don’t need to worry about the system we’re using (but if we were, we’d use an equidistant system). Since the range of values is small and it’s hard to see differences in cell values, I symbolized the grid as single-band pseudocolor and used a quantiles classification scheme with 12 categories.
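The clip and reproject steps can be done in QGIS or scripted. Here’s a sketch using the GDAL Python bindings, where the file names and the bounding box (given in lon/lat for a Mediterranean-sized study area) are placeholders:

from osgeo import gdal

gdal.UseExceptions()
gdal.Warp(
    'hmi_clip_wgs84.tif',                    # clipped, reprojected output
    'hmi_seafaring.tif',                     # input grid in World Cylindrical Equal Area
    dstSRS='EPSG:4326',                      # reproject to WGS 84
    outputBounds=(-10.0, 25.0, 45.0, 48.0),  # xmin, ymin, xmax, ymax of the study area
    outputBoundsSRS='EPSG:4326')             # the bounds above are in lon/lat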

Lastly, I grabbed some modern country boundaries from Natural Earth to serve as a general frame of reference. A screenshot of the workspace is below:

Least Cost Path in QGIS

QGIS has a third-party plug-in for doing a least cost path analysis, which works fine as long as you don’t have too many origin points. Go to Plugins > Manage and Install Plugins, search for Least Cost Path, and install it. Then open the Processing Toolbox and it will be listed at the bottom. See the screenshot below for the tool’s menu. The Cost raster layer is the friction surface, so the human mobility index in this example. The start points are the ten major cities and the end points are the quarries. The start-point layer dialog only accepts a single point at a time; if you have multiple points, hit the green circular arrow button to iterate across all of them. There’s a checkbox for connecting the start point to just the nearest end point (as opposed to all of them). Save the output to a geopackage.

It took about five minutes to run the analysis and iterate across all ten points. Each path is saved in a separate file, but since they have an identical structure I subsequently used Vector > Data Management Tools > Merge Vector Layers to combine them into one file. The attribute table records the end point ID (for the quarry) and the accumulated cost, but no meaningful origin ID; because the tool iterated over the origin points one at a time, the origin ID is simply the number 1 repeated in each file. We can see the result below; for Athens and Ephesus in the south, land routes were shortest, whereas for Pergamum and Byzantium in the north it was easier (distance- and friction-wise) to move across the sea.

While this worked fine for ten cities, it would take a considerable amount of time to compute paths for all 1,400. The problem here is that the plugin was designed for one point at a time. Let’s outline the process so we can understand how alternatives would work.

Cost Surface Analysis

To calculate a least cost path, the first step is to create a cost surface, where we take our friction grid and the destinations and calculate the total cost of movement across all cells to the nearest destination. First, the destinations are placed on the grid, and they become the grid sources. Then, the accumulated cost of moving from each source to its adjacent cells is calculated: for horizontal and vertical movement, it’s the sum of the two friction values divided by two, and for diagonal movement it’s the sum of the friction values divided by two and then multiplied by 1.4142. Once those calculations are performed, those adjacent cells are assigned to each source. Next, the lowest accumulated cost cell in the grid is identified, the cost of moving to its unassigned neighbors is calculated, and these cells are assigned to the same source. This process repeats, always cycling through the lowest accumulated value, until all calculations for the grid are finished. The process is illustrated in the example below, which I derived from Lloyd’s Spatial Data Analysis (2010), pp. 165-168.

For each cell, three items are recorded, and are saved either as separate bands in one raster, or in three separate raster files:

  1. Accumulated cost of moving to that cell from the nearest source
  2. Assignment or allocation of the cell to its source (the nearest one to which it “belongs”)
  3. A vector that indicates direction from that source
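To make the mechanics concrete, here’s a small toy sketch in Python of the cost-spreading step; it computes the first two of these outputs (accumulated cost and allocation) on a made-up 4×4 friction grid with two sources, using a priority queue so the lowest accumulated cost cell is always processed next (the direction grid is omitted to keep it short):

import heapq
import numpy as np

# hypothetical friction grid and two destination (source) cells
friction = np.array([[1, 1, 2, 4],
                     [1, 2, 3, 4],
                     [2, 2, 1, 1],
                     [4, 3, 1, 1]], dtype=float)
sources = {(0, 0): 'A', (3, 3): 'B'}

rows, cols = friction.shape
cost = np.full(friction.shape, np.inf)               # accumulated cost surface
alloc = np.full(friction.shape, None, dtype=object)  # allocation surface

# seed the queue with the sources at zero accumulated cost
heap = [(0.0, rc, label) for rc, label in sources.items()]
heapq.heapify(heap)

while heap:
    acc, (r, c), label = heapq.heappop(heap)
    if acc >= cost[r, c]:
        continue                                     # cell already reached more cheaply
    cost[r, c], alloc[r, c] = acc, label
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                step = (friction[r, c] + friction[nr, nc]) / 2
                if dr != 0 and dc != 0:              # diagonal moves cost more
                    step *= 1.4142
                heapq.heappush(heap, (acc + step, (nr, nc), label))

print(np.round(cost, 2))
print(alloc)

This is only a toy illustration; r.cost in GRASS (used below) applies the same idea far more efficiently over the full friction grid.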

With these cost surfaces, we can take the second step of calculating the least cost path. We place a number of starting points onto this surface, and each point is assigned to the closest destination based on where its grid cell was allocated. The direction to that destination is traced backward using the direction grid, and the total cost of movement is taken from the accumulated cost surface.

Knowing how this process works, there are two practical conclusions we can draw. First, when computing the cost surface, you use your destinations (not the origins) as the source for the cost surface. You use the origins as the start points for the least cost path. Second, there’s no need to recalculate the cost surface for every origin point; you only need to do it once. That’s why the QGIS plugin took so long: it was recomputing the cost surface each time. Knowing this, we can use GRASS GIS to compute the paths, as it’s designed to compute the surface just once (and its data structure will also boost performance a bit).

Cost Surface Analysis in GRASS

GRASS GIS comes bundled with QGIS. While it’s possible to run a number of GRASS tools directly within QGIS, this is less than ideal because you can’t access the full range of parameters and options for each GRASS command. I opted to create the GRASS environment in QGIS and load all of the necessary data into the GRASS format, then flipped over to the GRASS GUI to do the analysis.

GRASS uses a distinct database structure and file format, and we need to create a GRASS workspace and load our data into that database in order to use the cost surface tools. I followed the steps in the QGIS manual for creating a GRASS environment and loading data into a GRASS database. Once you create the database and mapset, you use the QGIS Browser to browse to the grassdata folder and designate your new mapset as your working mapset (mapsets have the little green grass icon beside them). With the GRASS tools open, I used v.in.ogr.qgis to load my cities and quarries layers into this mapset, and r.in.gdal.qgis to load the mobility index (if these layers weren’t already in your QGIS project, you’d use the tools that don’t have the qgis suffix, i.e. v.in.ogr).

After exiting QGIS and launching GRASS, select the mapset you want to work with under the grassdata database at the top, right click, and choose Switch mapset (if you don’t see it, hit the database icon to browse to and connect to the grassdata folder). You can display the layers in the GRASS window to visualize them, but it’s not necessary for running the tools. In the tool menu on the right, search for the Cost surface tool, r.cost, and choose the following options:

  • Required: input raster map with grid cell cost is the human mobility index, and output raster is cost_surface
  • Optional outputs: output raster with nearest start point as allocation_surface, output raster with movement as direction_surface
  • Start: vector starting points for the cost surface are the destinations, the quarries
  • Optional: check verbose output (to get more details on errors)

GRASS GIS GUI and r.cost
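If you’d rather script this than click through the GUI, the same call can be made from Python with grass.script. This is a sketch that assumes an active GRASS session pointed at the mapset above; the raster and vector names (human_mobility_index, quarries, and so on) are placeholders for whatever you imported:

import grass.script as gs

# r.cost with the same options chosen in the GUI above
gs.run_command('r.cost',
               input='human_mobility_index',  # friction grid
               output='cost_surface',         # accumulated cost surface
               nearest='allocation_surface',  # nearest start point (allocation)
               outdir='direction_surface',    # movement direction surface
               start_points='quarries',       # the destinations seed the surface
               verbose=True)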

Running this operation took a matter of seconds, and all three rasters described in the previous section were generated: cost, allocation, and direction (shown below). Because the surface only has to be computed once, it serves as the basis for paths from all 1,400 cities.

Using these outputs, we can run the Least cost route or flow tool, which is called r.drain (as it’s often used in earth sciences to chart the path that water will drain based on elevation).

  • Required: Name of input elevation or cost surface raster is cost_surface; Name of output raster path is path_raster
  • Cost surface: check the Input raster map is a cost surface box; Name of input movement direction raster is direction_surface
  • Start: Name of starting points vector map is the origins (the cities)
  • Path settings: choose ONE option that you’d like to record (or none)
  • Optional: check Verbose mode; Name for output drain vector is path_vector

This also took mere seconds to complete (!) and generated the paths from each origin (city) to the closest destination (quarry) over the surface, as both raster cells and vector lines. The output in GRASS is shown below.

Least cost path output in GRASS GIS

At this stage, we can hop back into QGIS, and load these output paths into our original project to symbolize and study what’s going on. Notice the settlements in northeastern Italy and along the Dalmatian coast; for many of them the least cost path is to a quarry across the sea rather than through rugged mountainous terrain. Even though some quarries in the mountains may be closer in actual distance, it’s a tougher path to travel.

Conclusion

The benefit of using GRASS is that we can run these processes fairly quickly for large datasets. The GRASS commands can also be compiled into a batch script, so you can create a documented and automated process instead of having to drill through multiple menus.

A big downside of the GRASS tools for this analysis is that the resulting vector paths contain no information about the origin or destination points; only the raster path output carries any values along. You might be able to generate this information through some extra steps: using the QGIS field calculator, you can get the coordinates for the start point and end point of each path and add them explicitly to the attribute table. Then take those coordinates, and for the start point of each line select the closest city and get its attributes, and for the end point select the closest quarry and get its attributes. I say “closest” because the vector paths don’t snap perfectly to the start and end points. Making the human mobility index coarser (fewer cells) might help to resolve this, as might converting the origin and destination points to a raster at the same resolution as the index. Alternatively, if you incorporate the GRASS commands into a Python script, you could iterate over the origins in the least cost path analysis and record the origin IDs as you step through; a sketch of that approach follows.
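Here’s a rough sketch of what that iteration could look like with grass.script, again assuming an active GRASS session, the cost and direction surfaces generated above, and a cities vector with an id attribute holding simple alphanumeric codes (the layer and column names are assumptions):

import grass.script as gs

# dump each city as "x,y,cat,id" so we can feed coordinates to r.drain one at a time
points = gs.read_command('v.out.ascii', input='cities',
                         columns='id', separator=',').strip().splitlines()

for line in points:
    x, y, cat, city_id = line.split(',')
    gs.run_command('r.drain',
                   flags='d',                       # the input raster is a cost surface
                   input='cost_surface',
                   direction='direction_surface',
                   output=f'path_raster_{city_id}',
                   drain=f'path_vector_{city_id}',  # vector path named after its origin
                   start_coordinates=(x, y),
                   overwrite=True)

The per-origin vectors could then be combined with v.patch, or read back into QGIS or geopandas, where the IDs baked into the layer names can be turned into attributes.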

I haven’t worked out all the pieces, but hopefully this will be useful for those of you who are interested in conducting a basic cost surface analysis with open source software. The student I was helping was interested in measuring the density of the paths across a grid, so this process worked for him as he didn’t need to associate the paths with origins and destinations. Beyond FOSS GIS, ArcGIS Pro has a full suite of tools for cost surface analysis, and the underlying methods and logic are the same.