gis

Anything about GIS software

Points to Areas with Convex Hulls in GIS

Let’s say you have different sets of points, and each set represents a distinct category of features. Maybe villages where residents speak different languages, or historical events that occurred during different epochs. Beyond plotting and symbolizing the points, perhaps you would like to create areas for each set that represent generalized territory, and you’d like to see how these areas correspond. I’ll demonstrate a few approaches for achieving this, using convex hulls, attribute table calculations, and geoprocessing tools like intersection and union. A convex hull is a minimum bounding polygon, where an area is drawn around all points in a set, where the outermost points serve as vertices for creating boundaries.

I’ll use QGIS for this example, but will mention the corresponding tools in ArcGIS Pro at the end. In QGIS we’ll use the tools that are located within the Processing Toolbox (gear icon on the toolbar). Unlike the shortcut tools under the Vector menu, these tools provide more options and allow us to process multiple files at once.

Steps in QGIS

First, we need either distinct point files for each set of features, or a single file with a categorical variable that distinctly identifies different sets of features. For this example I’ll use three distinct files that I’ve generated using phony sample data. The points are in a projected coordinate system (important!) that’s appropriate for the area I’m mapping.

In the QGIS Processing Toolbox, we select the Minimum Bounding Geometry (MBG) tool, and under the Geometry Type specify that we want to create a convex hull. I ran this tool for each file, creating three convex hull files (alternatively, if you had one file with distinct categories, you could use the Field option to generate separate hulls for each category). I’ve symbolized the output below, making the fill hollow and assigning an outline that matches the color of the points. This gives you a good sense for the coverage areas for the points, and how they overlap.

Before running additional tools to explicitly measure overlap, we need to modify the attribute tables of the convex hulls, so we’ll have useful attributes to carry over. The MBG tool creates a new layer with an ID number, area, and perimeter. The ID is set to zero for each hull file, but we should change it to distinctly represent the file / category. With the attribute table open, we can go into an edit mode and type in a new integer value; in this case I’m assigning 1, 2, and 3 to each of the test layers. Alternatively, you could add a new field and assign it a meaningful category value.
The units for area and perimeter match the units used by the map projection of the layer, which is why we want to use a projected coordinate system that uses meters or feet, and not a geographic one (like WGS 84 or NAD 83) that uses degrees. I’m using a state plane system, so the area is in square feet. To convert this to square miles, within the attribute table view I use the Field Calculator to add a new decimal field, and divide the value of the area by 27,878,400 (the number of sq feet in a sq mile; for metric units in meters, we’d divide by 1,000,000 to get sq km). We calculate the area directly from the polygon geometry:

area( @geometry) / 27878400

To generate the area of intersection, we go into the Processing tool box and run the Intersection (multiple) tool. The first convex hull is the input layer, while the overlay layers are the other two hull files (in the dialog box, we check the layers we want, and then use the arrow to navigate back to the tool to run it). The output is a new file with polygon(s) that cover the area where all three layers intersect. Its attribute table contains an ID, area, and perimeter field, and we can calculate a new area field in sq miles and see how it compares to the total areas. In my example, the area where all three territories intersect covers about 112 sq miles, while the areas for the individual territories are 512, 563, and 256 sq miles respectively.

To identify distinct areas of overlap between the territories, we return to the Processing toolbox and run the Union (multiple) tool. The dialog is similar to the intersection tool, where the first hull is the union layer and the additional hulls are overlay layers. The output of this tool is a layer with distinct polygons where the hulls coincide. The attribute table for the union layer carries over the attributes from each of the three layers, with columns suffixed with underscores and sequential integers. So if a polygon consists of area covered by hulls 1 and 2, those attributes will be filled in, while the attributes of 3 will be null. As before, we can calculate an area in sq miles for the new polygons. In this case, we’d see that the area covered by hull 1 without any overlapping hulls is 240 sq miles, the largest of all territories.

To explicitly categorize these areas, we can add a new field in the attribute table. This will be a text field, where we take the ID numbers, convert them to strings, and concatenate them. In the example above, IDs 1 and 2 would be concatenated to 12, and since the value for 3 is null, no text is appended. (Variation – if you created distinct text-based category fields instead of using the integer IDs, you could concatenate them directly without having to convert them to strings). Using the symbology tool, we can classify the data using these new categories, and can modify the color scheme to something appropriate for displaying the contributions from each area. So a polygon with category 1 includes areas covered by the first convex hull and no others, while category 12 includes areas where hulls 1 and 2 overlapped.

concat(
to_string( "id" ),
to_string( "id_2" ),
to_string( "id_3" )
)

Additional Considerations:

With the areas of the individual union pieces, we can compute the percentages of each territory that fall inside and outside various overlapping zones with the field calculator. For example, we can calculate the total area of the union file (which is NOT the sum of each hull, as there’s overlap between them), and then divide each feature by that total to get its percent total. The expression for doing this is below; the numerator has the name of the field that contains the area of each polygon in sq miles, while the denominator includes the calculation for the sum of all parts (alternatively you could use the QGIS Statistics tool to compute this, and hard code the total into the formula):

area_part / (sum(area( geometry(@feature)))/27878400) *100

If the idea is to create areas of territory that the points exert influence on, you may want to add a buffer to each hull, to account for the fact that the outer points that form the boundaries will exert influence on both sides of the boundary. Use the Processing – Buffer tool. For the buffer distance, you can use an arbitrary value that makes sense for the circumstances. Or you can generate a relative value that represents a fraction of each convex hull’s area. The output of the buffer tool would then serve as the input to the intersection and union tools.
These examples focus on area. If the number of points that falls within the areas is important, you can use the Points in Polygon tool on each of the hulls to count points, and then do the same for the output of the intersection and union tools to get the different points counts for each set of polygons.

ArcGIS Pro Corollaries

Following the same steps above for QGIS, but with ArcGIS Pro:

In the red toolbox, the Minimum Boundary Geometry tool is used to create convex halls. It’s quite similar to the one in QGIS: specify the geometry type, and there is an option to Group (if you have one file with categories). If you leave the Add geometry characteristics box unchecked, it will still compute basic area and perimeter; the checkbox adds a bunch of additional fields.
Unlike QGIS, ArcGIS will not allow you to modify its OBJECTID field. To create a unique value for each hull, you will have to open the attribute table and use the Calculate tool to create a category field (integer or text). To ensure that you can carry it over, in ArcGIS you need to give this column a different name in each hull: cat1, cat2, cat3. Set the value at the bottom in the expression box.
You can use the calculate tool in the attribute table to generate an area column in sqft or sqkm, or use the Calculate Geometry Tool in the toolbox instead. The latter is actually simpler: create a new column, and choose Area and the output units.
The Intersect tool will create the intersection, and functions similarly to QGIS.
The Union tool creates the union, and also functions similarly.
Creating the category field in the union file is a bit more complicated, as ArcGIS assigns values of 0 instead of NULL for non-overlapping polygons. In the Calculate window, with the input file as Union
and the field as category, change the Expression type to Arcade (ESRI’s scripting language). First, run an expression to concatenate the categories and convert integers to strings (if necessary). Then, replace that expression with a second one that replace the zeros with nothing.

Concatenate(TEXT($feature.cat1)+
TEXT($feature.cat2)+
TEXT($feature.cat3)
)
Replace($feature.category,'0','')

Conclusion

This is a basic approach, appropriate for certain use cases where you want to generate areas from points; particularly when different point sets have a well defined category, so there’s no question of how to group them. Also appropriate where you don’t have – or don’t want – hard boundaries between sets of points and want to see areas of overlap. More sophisticated methods exist for separating points into clusters based on density, distance, and similar attributes, such as K-Means and DB Scan. You can generate non-overlapping territories for individual points using Thiessen / Voronoi polygons, and for points with a sufficiently high density, you can generate rasters with kernel tools.

Folium Maps with Geopandas and Random Color Schemes

In this post I’ll demonstrate one method for assigning random colors to features on a Folium map, with specific consideration for plotting features from a geopandas geodataframe. I’m picking up where I left off in my previous post, where we plotted routes from a file of origins and destinations using the OpenRouteService. I wrote a regular Python script where I made a simple plot of the points and lines using geopanda’s plot, but in a Jupyter Notebook version I went a step further and used Folium to plot the features on the OpenStreetMap.

Folium is a Python port of Leaflet, a popular javascript library for making basic interactive web maps. I’m not going to cover the basics in this post, so for starters you can view Folium’s documentation: there’s a basic getting started tutorial, and a more detailed user guide. Geopanda’s documentation includes Folium examples that illustrate how you can work with the two packages in tandem.

One important concept to grasp, is that many of the basic examples assume you are working with latitude and longitude coordinates that are either hard-coded as variables, stored in lists, or stored in dedicated columns in a dataframe as attributes. But if you are working with actual geometry stored in a geodataframe, to display features you need to use the folium.GeoJson function instead. For example, this tutorial illustrates how to render points stored in a dedicated geometry column, and not as coordinates stored as numeric attributes in separate columns.

In my example, I have the following geodataframe, where each record is a route with linear geometry, with a sequential integer as the index value:

All I wanted to do was assign random colors to each line to depict them differently on the map – and I could not find a straightforward way to do that! In odd contrast, making a choropleth map (i.e. values shaded by quantity) seems easy. I found a couple posts on stack exchange, here and here, that demonstrated how to create and assign colors by randomly creating hexadecimal color strings. Another post illustrated how to assign specific non-random colors, and this thorough blog post walked through assigning colors to categorical data (where each category gets a distinct color).

I pooled aspects of the last two solutions together to come up with the following:

# Get colors for lines
gdfcount=len(gdf) # number of routes
colors=['red','green','blue','gray','purple','brown']
clist=[] # list of colors, one per route
c=0
for i in range(gdfcount):
    clist.append(colors[c])
    c=c+1
    if c>len(colors)-1:
        c=0 # if we run out of colors, start over
color_series = pd.Series(clist,name='color') # create series in order to...
gdf_c=pd.merge(gdf, color_series, left_index=True,right_index=True) # join to routes on seq index #
gdf_c

I get the number of features in my geodataframe, and create a list with a set number of colors. I iterate through my colors list, generating a new list with one color for each record in my frame. If I cycle through all the colors, I reset a counter and cycle through again from the beginning. When finished, I turn the list into a pandas series, which generates a sequential integer index paired with the color. Then I merge the series with my geodataframe (called gdf) using the sequential index, so the color is now part of each record:

Then I can create the map. When creating a Folium map you need to provide a location on which the map will be centered. I used Geopanda’s total_bounds method to get the bounding box of my features, which returns a list of coordinates: min X, min Y, max X, max Y. I sum the appropriate coordinates for X and Y and divide each by two, which gives me the coordinate pair for the center of the bounding box.

bnds=gdf.total_bounds 
clong=(bnds[0]+bnds[2])/2
clat=(bnds[1]+bnds[3])/2

Next, I generate the map using those coordinates (the name of the map object in the example is “m”). I create a popup menu first, specifying which fields from the geodataframe to display when you click on a line. Then I add the actual geodataframe to the map, applying a style_function parameter to apply the colors and popup to each feature. The lambda business is necessary here, apparently when you’re applying styles that differ for each feature you have to iterate over the features. I guess this is the norm when you’re styling a GeoJSON file (but is at odds with how you would normally operate in a dataframe or GIS environment).

m = folium.Map(location=[clat,clong], tiles="OpenStreetMap")
popup = folium.GeoJsonPopup(
    fields=["ogn_name", "dest_name","distance","travtime"],
    localize=True,
    labels=True)
folium.GeoJson(gdf_c,style_function=lambda x: {'color':x['properties']['color']},popup=popup).add_to(m)

Once the lines are added to the map, we can continue to add additional features. I use folium.GeoJson twice more to plot my origin and destination points. It took me quite a bit of searching around to get the syntax right, so I could change the icon and color. This tutorial helped me identify the possibilities, but again the basic iteration examples don’t work if you’re using GeoJSON or geometry in a geodataframe. If you’re assigning several different colors you have to apply lambda, as in this example in the Folium docs. In my case I just wanted to change the default icon and color and apply them uniformly to all symbols, so I didn’t need to iterate. Once you’ve added everything to the map, you can finally display it!

folium.GeoJson(ogdf,marker=folium.Marker(icon=folium.Icon(icon='home',color='black'))).add_to(m)
folium.GeoJson(dgdf,marker=folium.Marker(icon=folium.Icon(icon='star',color='lightgray'))).add_to(m)
m.fit_bounds(m.get_bounds()) # zoom to bounding box
m

A static screenshot of the result follows, as I can’t embed Folium maps into my WordPress site. Everything works perfectly when using the notebook locally, but GitHub won’t display the map either, even if the Notebook is saved as a trusted source. If you open the notebook in the nbviewer, it renders just fine (scroll to the bottom of the notebook to see the map and interact).

Undoubtedly there are other options for achieving this same result. I saw examples where people had embedded all the styles they needed within a dedicated dataframe column as code and applied them, or wrote a function for applying the styles. I have rarely worked with javascript or coded my own web maps, so I expect there may be aspects of doings things in that world that were ported over to Folium that aren’t intuitive to me. But not being able to readily apply a random color scheme is bizarre, and is a big shortcoming of the module.

I’d certainly benefit from a more formal introduction to Folium / Leaflet, as scattered blog posts and stack exchange solutions can only take you so far. But here’s hoping that this post has added some useful scraps to your knowledge!

Plotting Routes with OpenRouteService and Python

I made my first foray into network routing recently, and drafted a python script and notebook that plots routes using the OpenRouteService (ORS) API. ORS is based on underlying data from the OpenStreetMap (OSM), and was created by the Heidelberg Institute for Geoinformation Technology, at Heidelberg University in Germany. They publish several routing APIs that include directions, isochrones, distance matricies, geocoding, and route optimization. You can access them via a basic REST API, but they also have a dedicated Python wrapper and an R package which makes things a bit easier. For non-programmers, there is a plugin for QGIS.

Regardless of which tool you use, you need to register for an API key first. The standard plan is free for small projects; for example you can make 2,000 direction requests per day with a limit of 40 per minute. If you’re affiliated with higher ed, government, or a non-profit and are doing non-commercial research, you can upgrade to a collaborative plan that ups the limits. It’s also possible to install OSR locally on your own server for large jobs.

I opted for Python and used the openrouteservice Python module, in conjunction with other geospatial modules including geopandas and shapely. In my script / notebook I read in two CSV files, one with origins and the other with destinations. At minimum both files must contain a header row, and attributes for unique identifier, place label, longitude, and latitude in the WGS 84 spatial reference system. The script plots a route between each origin and every destination, and outputs three shapefiles that include the origin points, destination points, and routes. Each line in the route file includes the ID and names of each origin and destination, as well as distance and travel time. The script and notebook are identical, except that the script plots the end result (points and lines) using geopanda’s plot function, while the Jupyter Notebook plots the results on a Folium map (Folium is a Python implementation of the popular Leaflet JS langauge).

Visit the GitHub repo to access the scripts; a basic explanation with code snippets follows.

After importing the modules, you define several variables that determine the output, including a general label for naming the output file (routename), and several parameters for the API including the mode of travel (driving, walking, cycling, etc), distance units (meters, kilometers, miles), and route preference (fastest or shortest). Next, you provide the positions or “column” locations of attributes in the origin and destination CSV files for the id, name, longitude, and latitude. Lastly, you specify the location of those input files and the file that contains your API key. The location and names of output files are generated automatically based on the input; all will contain today’s date stamp, and the route file name includes route mode and preference. I always use the os module’s path function to ensure the scripts are cross-platform.

import openrouteservice, os, csv, pandas as pd, geopandas as gpd
from shapely.geometry import shape
from openrouteservice.directions import directions
from openrouteservice import convert
from datetime import date
from time import sleep

# VARIABLES
# general description, used in output file
routename='scili_to_libs'
# transit modes: [“driving-car”, “driving-hgv”, “foot-walking”, “foot-hiking”, “cycling-regular”, “cycling-road”,”cycling-mountain”, “cycling-electric”,]
tmode='driving-car'
# distance units: [“m”, “km”, “mi”]
dunits='mi'
# route preference: [“fastest, “shortest”, “recommended”]
rpref='fastest'

# Column positions in csv files that contain: unique ID, name, longitude, latitude
# Origin file
ogn_id=0
ogn_name=1
ogn_long=2
ogn_lat=3
# Destination file
d_id=0
d_name=1
d_long=2
d_lat=3

# INPUTS and OUTPUTS
today=str(date.today()).replace('-','_')

keyfile='ors_key.txt'
origin_file=os.path.join('input','origins.csv') #CSV must have header row
dest_file=os.path.join('input','destinations.csv') #CSV must have header row
route_file=routename+'_'+tmode+'_'+rpref+'_'+today+'.shp'
out_file=os.path.join('output',route_file)
out_origin=os.path.join('output',os.path.basename(origin_file).split('.')[0]+'_'+today+'.shp')
out_dest=os.path.join('output',os.path.basename(dest_file).split('.')[0]+'_'+today+'.shp')

I define some general functions for reading the origin and destination files into nested lists, and for taking those lists and generating shapefiles out of them (by converting them to geopanda’s geodataframes). We read the origin and destination data in, grab the API key, set up a list to hold the routes, and create a header for the eventual output.

# For reading origin and dest files
def file_reader(infile,outlist):
    with open(infile,'r') as f:
        reader = csv.reader(f)    
        for row in reader:
            rec = [i.strip() for i in row]
            outlist.append(rec)
            
# For converting origins and destinations to geodataframes            
def coords_to_gdf(data_list,long,lat,export):
    """Provide: list of places that includes a header row,
    positions in list that have longitude and latitude, and
    path for output file.
    """
    df = pd.DataFrame(data_list[1:], columns=data_list[0])
    longcol=data_list[0][long]
    latcol=data_list[0][lat]
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df[longcol], df[latcol]), crs='EPSG:4326')
    gdf.to_file(export,index=True)
    print('Wrote shapefile',export,'\n')
    return(gdf)
      
origins=[]
dest=[]
file_reader(origin_file,origins)
file_reader(dest_file,dest)

# Read api key in from file
with open(keyfile) as key:
    api_key=key.read().strip()

route_count=0
route_list=[]
# Column header for route output file:
header=['ogn_id','ogn_name','dest_id','dest_name','distance','travtime','route']

Here are some nested lists from my sample origin and destination CSV files:

[['origin_id', 'name', 'long', 'lat'], ['0', 'SciLi', '-71.4', '41.8269']]

[['dest_id', 'name', 'long', 'lat'],
 ['1', 'Rock', '-71.405089', '41.825725'],
 ['2', 'Hay', '-71.404947', '41.826433'],
 ['3', 'Orwig', '-71.396609', '41.824581'],
 ['4', 'Champlin', '-71.408194', '41.818912']]

Then the API call begins. For every record in the origin list, we iterate through each record in the destination list (in both cases starting at index 1, skipping the header row) and calculate a route. We create a tuple with each pair of origin and destination coordinates (coords), which we supply to the OSR directions API. We pass in the parameters supplied earlier, and specify instructions as False (instructions are the actual turn by turn directions returned as text).

The result is returned as a JSON object, which we can manipulate like a nested Python dictionary. At the first level in the dictionary, we have three keys and values: a bounding box for the route area with a list value, metadata with a dictionary value, and routes with a list value. Dive into route, and the list contains a single dictionary, and inside that dictionary – more dictionaries that contain the values we want!

1st level, dictionary with three keys, the routes key has a single list value

2nd level, the routes list has a single element, another dictionary

3rd level, inside the dictionary in that list element, four keys with route data

The next step is we extract the values that we need from this container by specifying their location. For example, the distance value is inside the first list of routes, inside summary and inside distance. Travel time is in a similar spot, and we take an extra step of dividing by 60 to get minutes instead of seconds. The geometry is trickier as its encoded in some binary line format. We use OSR’s decoding function to turn it into plain text, and shapely to convert it into WKT text; we’ll need WKT in order to get the geometry into a geodataframe, and eventually output as a shapefile. Once we have the bits we need, we string them together as a list for that origin / destination pair, and append this to our route list.

# API CALL
for ogn in origins[1:]:
    for d in dest[1:]:
        try:
            coords=((ogn[ogn_long],ogn[ogn_lat]),(d[d_long],d[d_lat]))
            client = openrouteservice.Client(key=api_key) 
            # Take the returned object, save into nested dicts:
            results = directions(client, coords, 
                                profile=tmode,instructions=False, preference=rpref,units=dunits)
            dist = results['routes'][0]['summary']['distance']
            travtime=results['routes'][0]['summary']['duration']/60 # Get minutes
            encoded_geom = results['routes'][0]['geometry']
            decoded_geom = convert.decode_polyline(encoded_geom) #convert from binary to txt
            wkt_geom=shape(decoded_geom).wkt #convert from json polyline to wkt
            route=[ogn[ogn_id],ogn[ogn_name],d[d_id],d[d_name],dist,travtime,wkt_geom]
            route_list.append(route)
            route_count=route_count+1
            if route_count%40==0: # API limit is 40 requests per minute
                print('Pausing 1 minute, processed',route_count,'records...')
                sleep(60)
        except Exception as e:
            print(str(e))
            
api_key=''
print('Plotted',route_count,'routes...' )

Here is some sample output for the final origin / destination pair, which contains the IDs and labels for the origin and destination, distance in miles, time in minutes, and a string of coordinates that represents the route:

['0', 'SciLi', '4', 'Champlin', 1.229, 3.8699999999999997,
 'LINESTRING (-71.39989 41.82704, -71.39993 41.82724, -71.39959 41.82727, -71.39961 41.82737, -71.39932 41.8274, -71.39926 41.82704, -71.39924000000001 41.82692, -71.39906000000001 41.82564, -71.39901999999999 41.82534, -71.39896 41.82489, -71.39885 41.82403, -71.39870000000001 41.82308, -71.39863 41.82269, -71.39861999999999 41.82265, -71.39858 41.82248, -71.39855 41.82216, -71.39851 41.8218, -71.39843 41.82114, -71.39838 41.82056, -71.39832 41.82, -71.39825999999999 41.8195, -71.39906000000001 41.81945, -71.39941 41.81939, -71.39964999999999 41.81932, -71.39969000000001 41.81931, -71.39978000000001 41.81931, -71.40055 41.81915, -71.40098999999999 41.81903, -71.40115 41.81899, -71.40186 41.81876, -71.40212 41.81866, -71.40243 41.81852, -71.40266 41.81844, -71.40276 41.81838, -71.40452000000001 41.81765, -71.405 41.81749, -71.40551000000001 41.81726, -71.40639 41.81694, -71.40647 41.81688, -71.40664 41.81712, -71.40705 41.81769, -71.40725 41.81796, -71.40748000000001 41.81827, -71.40792 41.81891, -71.40794 41.81895)']

Finally, we can write the output. We convert the nested route list to a pandas dataframe and use the header row for column names, and convert that dataframe to a geodataframe, building the geometry from the WKT string, and write that out. In contrast, the origins and destinations have simple coordinates (not in WKT), and we create XY geometry from those coordinates. Writing the geodataframe out to a shapefile is straightforward, but for debugging purposes it’s helpful to see the result without having to launch GIS. We can use geopandas’s plot function to draw the resulting geometry. I’m using the Spyder IDE, which displays plots in a dedicated window (in my example the coordinate labels for the X axis look strange, as the distances I’m plotting are small).

# Create shapefiles for routes
df = pd.DataFrame(route_list, columns=header)
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkt(df["route"]),crs = 'EPSG:4326')
gdf.drop(['route'],axis=1,inplace=True) # drop the wkt text
gdf.to_file(out_file,index=True)
print('Wrote route shapefile to:',out_file,'\n')

# Create shapefiles for origins and destinations
ogdf=coords_to_gdf(origins,ogn_long,ogn_lat,out_origin)
dgdf=coords_to_gdf(dest,d_long,d_lat,out_dest)

# Plot
base=gdf.plot(column="dest_id", kind='geo',cmap="Set1")
ogdf.plot(ax=base, marker='o',color='black')
dgdf.plot(ax=base, marker='x', color='red');

In a notebook environment we can employ something like Folium more readily, which gives us a basemap and some basic interactivity for zooming around and clicking on features to see attributes. Implementing this was a more complex than I thought it would be, and took me longer to figure out compared to the routing process. I’ll return to those details in a subsequent post…

In my sample data (output rendered below in QGIS) I was plotting fastest driving distance from the Brown University Sciences Library to the other libraries in our system. Compared to Google or Apple Maps the result made sense, although the origin coordinates I used for the SciLi had an impact on the outcome (assumed you left from the loading dock behind the building as opposed to the front door as Google did, which produces different routes in this area of one-way streets). My real application was plotting distances of hundreds of miles across South America, which went well and was useful for generating different outcomes using fastest or shortest route.

Take a look at the full script in GitHub, or if programming is not your thing check out the QGIS plugin instead (activate in the Plugins menu, search for OSR). Remember to get your API key first.

Cost Surfaces and Least Cost Paths in QGIS and GRASS

In this post I’ll demonstrate how to create least cost paths using QGIS and GRASS GIS, and in doing so will describe how a cost surface is constructed. In a surface analysis, you model movement across a grid whose values represent friction encountered in moving across it. In computing a least cost path, you’re seeking an optimal route from an origin to the closest destination, where ‘close’ incorporates distance and ease of movement across that surface. These kinds of analyses are often conducted in the environmental sciences, in modeling the movement of water across terrain, and in zoology in predicting migration paths for land-based animals.

In this example the idea was to chart the origin of settlements and possible trade routes in ancient history. In applications where we’re studying human activity, network analysis is typically used instead. Networks use geometry, where a node is a place or person, and connections between nodes are indicated with lines. Lines typically have a value associated with them that identify either the strength of a connection, or conversely friction associated with moving between nodes. The idea for this project was to identify how networks formed, so the surface analysis served as a proto-network analysis. While there were roads and maritime routes in pre-modern times, these networks were weaker and less dense. Charting movement over a surface representing terrain could provide a decent approximation of routes (but if you’re interested in ancient Roman network routing, check out the ORBIS project at Stanford).

This example stems from a project I was helping a PhD student with; I don’t want to replicate his specific study, so I’ve modified the data sources and area of focus to model movement between large settlements and stone quarries in the ancient Roman world. My goal is to demonstrate the methods with a plausible example; if we were doing this as part of an actual study, we would need to be more discriminating in selecting and processing our data.

Preliminary Work

The Pleiades project will serve as our source for destinations; it’s an academic gazetteer that includes locations and place names for the ancient and early medieval world, stretching from Europe and North Africa through the Middle East to India. It’s published in many forms, and I’ve downloaded the Pleiades Data for GIS in a CSV format. Using QGIS, I used the Add Delimited Text tool to plot the places.csv to get all of the locations, and joined that file to the places_place_type.csv file which contains different categories of places. I used Select by Attributes to get locations classified as quarries, and exported the selection out to a geopackage.

The Pleiades data includes a category for settlements, but there are about ten thousand of these and there isn’t an easy way to create a subset of the largest places. So I opted to use Hanson’s dataset of the largest settlements in the ancient Roman world to serve as our source for origins (about 1,400 places). This data was packaged in an Excel file; I plotted the coordinates using the Create Points Layer from Table tool in QGIS and converted the result to a geopackage. For testing purposes, I selected a subset of ten major cities and saved them in a separate layer: Athenae, Alexandria (Aegyptus), Antiochia (Syria), Byzantium, Carthago, Ephesus, Lugdunum, Ostia, Pergamum, Roma.

For the friction grid, I downloaded a geoTIFF of the Human Mobility Index by Ozak. The description from the project:

“The Human Mobility Index (HMI) estimates the potential minimum travel time across the globe (measured in hours) accounting for human biological constraints, as well as geographical and technological factors that determined travel time before the widespread use of steam power.”

There are three separate grids that vary in extent based on the availability of seafaring technology. I chose the grid that incorporates seafaring prior to the advent of ocean-going ships, which is appropriate for the Mediterranean world during the classical era. The HMI is a global grid at 925 meter resolution. To minimize processing time, I clipped it to a bounding box that encompasses the area of study. The grid is in the World Cylindrical Equal Area system; I reprojected it to WGS 84 to match the rest of the layers. As long as we’re not measuring actual distances, we don’t need to worry about the system we’re using (but if we were, we’d use an equidistant system). Since the range of values is small and it’s hard to see differences in cell values, I symbolized the grid as single-band psuedo color and used a quantiles classification scheme with 12 categories.

Lastly, I grabbed some modern country boundaries from Natural Earth to serve as a general frame of reference. A screenshot of the workspace is below:

Least Cost Path in QGIS

QGIS has a third-party plug-in for doing a least cost path analysis, which works fine as long as you don’t have too many origin points. Go to Plugins > Manage and Install Plugins > Least Cost Path to turn it on. Then open the Processing toolbox and it will be listed at the bottom. See the screenshot below for the tool’s menu. The Cost raster layer is the friction surface, so the human mobility index in this example. The start points are the ten major cities and the end points are the quarries. The start-point layer dialog only accepts a single point; if you have multiple points, hit the green circular arrow button to iterate across all of them. There’s a checkbox for connecting the start point to just the nearest end point (as opposed to all of them). Save the output to a geopackage.

It took about five minutes to run the analysis and iterate across all ten points. Each path is saved in a separate file, but since they have an identical structure I subsequently used Vector > Data Management Tools > Merge Vector Layers to combine them into one file. The attribute table records the end point ID (for the quarry) and the accumulated cost, but does not include the origin ID; this ID is the number 1 repeated each time, as the tool was iterating over the origin points. We can see the result below; for Athens and Ephesus in the south, land routes were shortest, whereas for Pergamum and Byzantium in the north it was easier (distance and friction-wise) to move across the sea.

While this worked fine for ten cities, it would take a considerable amount of time to compute paths for all 1,400. The problem here is that the plugin was designed for one point at a time. Let’s outline the process so we can understand how alternatives would work.

Cost Surface Analysis

To calculate a least cost path, the first step is to create a cost surface, where we take our friction grid and the destinations and calculate the total cost of movement across all cells to the nearest destination. First, the destinations are placed on the grid, and they become the grid sources. Then, the accumulated cost of moving from each source to its adjacent cells is calculated. For horizontal and vertical movement, it’s the sum of the friction values divided by two, and for diagonal movement it’s the sum of the friction values divided by two then multiplied by 1.4142. Once those calculations are performed, those adjacent cells are assigned to each source. Next, the lowest accumulated cost cell in the grid is identified, the cost for moving to its unassigned neighbors is calculated, and these cells are assigned to the same source. This process is repeated by cycling through the lowest accumulated value until all calculations for the grid are finished. Illustrated in the example below, which I derived from Lloyd’s Spatial Data Analysis (2010) pp. 165-168.

For each cell, three items are recorded, and are saved either as separate bands in one raster, or in three separate raster files:

Accumulated cost of moving to that cell from the nearest source
Assignment or allocation of the cell to its source (the nearest one to which it “belongs”)
A vector that indicates direction from that source

With these cost surfaces, we can take the second step of calculating the least cost path. We place a number of starting points onto this surface, and each point is assigned to the closest destination based on where its grid cell was allocated. The direction to that destination is traced backward using the direction grid, and the total cost of movement is taken from the accumulated cost surface.

Knowing how this process works, there are two practical conclusions we can draw. First, when computing the cost surface, you use your destinations (not the origins) as the source for the cost surface. You use the origins as the start points for the least cost path. Second, there’s no need to recalculate the cost surface for every origin point; you only need to do this once. That’s why the QGIS plugin took so long; it was recomputing the cost surface each time. Knowing this, we can use GRASS GIS to compute the paths, as it’s designed to compute the surface just once (and it’s data structure will also boost performance a bit).

Cost Surface Analysis in GRASS

GRASS GIS comes bundled with QGIS. While it’s possible to run a number of GRASS tools directly within QGIS, it’s a bit undesirable as you’re not able to access the full range of parameters or options for each GRASS command. I opted to create the GRASS environment in QGIS, and loaded all the necessary data into the GRASS format. Then, I flipped over to the GRASS GUI to do the analysis.

GRASS uses a distinct database structure and file format, and we need to create a GRASS workspace and load our data into that database in order to use the cost surface tools. I followed the steps in the QGIS manual for creating a GRASS environment and loading data into a GRASS database. Once you create the database and mapset, you use the QGIS Browser to browse to the grassdata folder and designate your new mapset as your working mapset (mapsets have the little green grass icon beside them). With the GRASS tools open, I used v.in.ogr.qgis to load my my cities and quarries layers into this mapset, and r.in.gdal.qgis to load the mobility index (if these layers weren’t already in your QGIS project, you’d use the tools that don’t have the qgis suffix, i.e. v.in.ogr).

After exiting QGIS and launching GRASS, you select the mapset under the grassdata database at the top, right click and choose Switch mapset, and choose the mapset you want to work with (if you don’t see it, hit the database icon to browse and connect to the grassdata folder). You can display the layers in the GRASS window to visualize them, but it’s not necessary for running the tools. In the tool menu on the right, search for the Cost surface tool, r.cost, and choose the following options:

Required: input raster map with grid cell cost is the human mobility index, and output raster is cost_surface
Optional outputs: output raster with nearest start point as allocation_surface, output raster with movement as direction_surface
Start: vector starting points for the cost surface are the destinations, the quarries
Optional: check verbose output (to get more details on errors)

Running this operation on all 1,400 cities took a matter of seconds, and all three rasters described in the previous section were generated: cost, allocation, direction (shown below).

Using these outputs, we can run the Least cost route or flow tool, which is called r.drain (as it’s often used in earth sciences to chart the path that water will drain based on elevation).

Required: Name of input elevation or cost surface raster is cost_surface, Name of output raster path: is path_raster
Cost surface: check the Input raster map is a cost surface box, Name of input movement direction raster is direction_surface
Start: Name of starting points vector map: are the origins (cities)
Path settings: choose ONE option that you’d like to record (or none)
Optional: check Verbose mode, Name for output drain vector is path_vector

This also took mere seconds to complete (!) and generated the paths from each origin (city) to the closet destination (quarry) over the surface as both raster cells and vector lines. The output in GRASS is shown below.

At this stage, we can hop back into QGIS, and load these output paths into our original project to symbolize and study what’s going on. Notice the settlements in northeastern Italy and along the Dalmatian coast; for many of them the least cost path is to a quarry across the sea rather than through rugged mountainous terrain. Even though some quarries in the mountains may be closer in actual distance, it’s a tougher path to travel.

Conclusion

The benefit of using GRASS is that we can run these processes fairly quickly for large datasets. The GRASS commands can also be compiled into a batch script, so you can create a documented and automated process instead of having to drill through multiple menus.

A big downside of the GRASS tools for this analysis is that the resulting vector paths contain no information about the origin or destination points, and only the raster path output carries along values. You might be able to generate this information through some extra steps; using the QGIS field calculator, you can get the coordinates for the start point and end point of each path and add them explicitly to the attribute table. Then take those coordinates, and for the start point of the line select the closest city and get its attributes, and for the end point select the closet quarry and get its attributes. I say “closest” because the vector paths don’t snap perfectly to the start and end points. Modifying the resolution of the human mobility index to make it coarser (fewer cells) might help to resolve this, or converting the origin and destination points to a raster of the same resolution as the index. Alternatively, if you incorporate the GRASS commands into a Python script, you could iterate over the origins in the least cost path analysis and record the origin IDs as you step through.

I haven’t worked all the pieces, but hopefully this will be useful for those of you who are interested in conducting a basic cost surface analysis in open source. The student I was helping was interested in measuring the density of the paths across a grid, so this process worked for him as he didn’t need to associate the paths with origins and destinations. Beyond FOSS GIS, ArcGIS Pro has a full suite of tools for cost surface analysis, and the underlying methods and logic are the same.

Spatial SQL with Spatialite and QGIS

I’ve recently given a few presentations on the Ocean State Spatial Database, which is a basic geodatabase for Rhode Island that we’ve created in our lab. The database was designed so that new and experienced users alike could easily access a curated collection of foundational layers and data tables for thematic mapping and geospatial analysis. The database is available for download on GitHub, and there is documentation that describes the layers and tables that are included. The database comes in two formats: SQLite/ Spatialite that’s great for QGIS, and a File Geoadatabase version for ArcGIS Pro users.

One of the big advantages of using the Spatialite database in QGIS is that you can take advantage of the Database Manager, and write SQL and spatial SQL queries for selecting records and doing spatial analysis. Instead of using a series of point and click tools that create a bunch of new files, you can write a single block of code to perform an entire operation, and you can save that code to document your work. Access the Database Manager above the toolbars at the top of the QGIS interface. Once you’re in, you can select the Spatialite option, right click and then browse your file system to point to the database to establish a connection. At the top of the DB Manager is a button (piece of paper with wrench) to open a SQL query window.

Database Manager in QGIS with SQL Window Open

The following commands are basic SQL: SELECT some columns FROM some tables WHERE some criteria is met. This returns all rows and columns from the public libraries layer in the database:

SELECT *
FROM d_public_libraries;

This returns just some of the columns for all rows:

SELECT libid, libname, city, cnty
FROM d_public_libraries;

While this returns some of the columns and rows that meet specific criteria, in this case where libraries are located in Providence County, RI:

SELECT libid, libname, city, cnty, geom
FROM d_public_libraries
WHERE cnty='PROVIDENCE'
ORDER BY city;

Traditional database column types include strings (aka text), integers, and decimal numbers, which limit the values that can be stored in the column, and allow specific functions that can operate on values of that type (math on numeric columns, string operations on text columns). Beyond the basic data types, many databases have special ones, such as date types that allow you to store and manipulate dates and times as distinct objects.

Spatial databases incorporate special columns for storing the geometry of features as strings of coordinates, and provide functions that can operate on that geometry. In the example above, the values stored in the geometry column were returned in a binary format. But we can apply a spatial function called ST_AsText to display the geometry as readable text:

SELECT libid, libname, city, cnty, ST_AsText(geom) AS geom
FROM d_public_libraries
WHERE cnty='PROVIDENCE'
ORDER BY city;

We can see that this is point geometry (as opposed to lines or polygons), and we have an X and Y coordinate for each point. The layers in this database are in the Rhode Island State Plane System, so the coordinates that are returned are in that system. We can convert these to longitude and latitude using the ST_Transform function:

SELECT libid, libname, city, cnty, ST_AsText(ST_Transform(geom,4269)) AS geom
FROM d_public_libraries
WHERE cnty='PROVIDENCE'
ORDER BY city;

This illustrates that the functions can be nested, first we transform the geometry and then display the result of that function as text. The number in the transform function is the unique identifier of the spatial reference system that we wish to transform the geometry to. In the open source world these are EPSG codes, and 4269 is the identifier for NAD 83, the basic long / lat system for North America (alternatively, we could use 4326 for WGS 84, the standard global long / lat system). The geometry column in a spatial table is connected to a series of internal tables that store all the definitions of the spatial reference systems. You can view the spatial reference system table:

SELECT * from spatial_ref_sys;

You can also get a read out of all the spatial tables in the database which include their type of geometry and the spatial reference system (3438 is the EPSG code for the RI State Plane zone, geometry of type 6 is a multipolygon, while type 1 is a point):

SELECT * from geometry_columns;

With a spatial database, you perform operations within and between tables by running functions against the geometry columns. For example, to return all public libraries and schools that are within a mile of a library while measuring the distance:

SELECT pl.libid, pl.libname, s.name, s.grade_span, ST_Distance(pl.geom, s.geom) AS dist
FROM d_public_libraries pl, d_schools_pk12 s
WHERE PtDistWithin(pl.geom, s.geom, 5280)
ORDER BY dist;

The ST_Distance function returns the actual distance in a new column, while the PtDistWithin function only returns libraries that have a school within one mile (5,280 feet – we have to express the measurement in the units used by the spatial reference system of both layers). In the FROM statement we provide aliases after each table name, so we can use those as shorthand (if our statement includes multiple tables, we need to indicate which table each column comes from).

You can also do summaries, like you would in standard SQL using GROUP BY. To count the number of schools that are within a mile of every library:

SELECT pl.libid, pl.libname, CAST(COUNT (s.name) AS integer) AS school_count, pl.geom
FROM d_public_libraries pl, d_schools_pk12 s
WHERE PtDistWithin(pl.geom, s.geom, 5280)
GROUP BY pl.libid, pl.libname, pl.geom
ORDER BY school_count DESC;

The rule for GROUP BY is that every column in the select statement must be used as a grouping variable, or has an aggregate function applied to it (COUNT, SUM, MEAN, etc). In this example we added the CAST function, which defines the data type for new columns that you create. Unless we explicitly declare it as an integer or real (decimal), values are returned as strings.

You can save your statements as views, by adding CREATE VIEW [view name] AS followed by the statement. Views are saved statements that appear as objects in the database; by opening a view, the statement is rerun and the result is returned. This approach works if you want to save a non-spatial view, i.e. a table without geometry. To save a spatial one with geometry, omit the VIEW statement and hit the Create a view button below the SQL window (each record must have a unique identifier and the geometry column in order for this to work). That registers the geometry column of the view in the database. Then, you can return to the main QGIS window, add the view and symbolize it. Alternatively, there is a Load as new layer button at the bottom of the screen, which allows you to see a temporary result without saving anything (while you can see features and records returned, you won’t be able to symbolize or manipulate the layer).

Count schools within 1 mile of libraries, and save as a spatial view

Symbolize the spatial query out in the main QGIS window

One of the primary reasons to use a database is to join related data stored in separate tables. This statement has two joins: a tabular join between the census tracts and an ACS data table, and a spatial join between the geometry of public libraries and tracts:

SELECT pl.libid, pl.libname, a.geoidshort, a.name, c.hshd01_e, c.hshd01_m
FROM d_public_libraries pl, a_census_tracts a
INNER JOIN c_tracts_acs2021_socecon c
ON a.geoidlong=c.geoidlong
WHERE ST_Intersects(pl.geom, a.geom);

This returns all public libraries and their intersecting tracts based on the relationship between their two geometries (could also have done ST_Within in this case to get the same result). Spatialite supports most of the spatial relationship functions defined by the OGC. The estimated number of households for these tracts are returned based on the shared unique census identifier between the two census tract tables.

You can visit the following references for a full list of SQLite functions and Spatialite functions. As it’s designed to be “Lite”, SQLite contains a smaller subset of the SQL standard. Spatialite contains a pretty full range of OGC spatial SQL functions, but there are instances where it deviates from the standard. PostgreSQL / PostGIS provides a greater range of functions that adhere more closely to the standard; it also provides you with greater storage, efficiency, and processing power. As a file-based database, SQLite / Spatialite’s strengths are that it’s compact and transportable, and gives you the option to write SQL rather than relying solely on the point and click tools of a desktop GIS package.

In addition to the QGIS DB Manager, you could also use the Spatialite command line tools provided by the developer, and the Spatialite GUI (graphic user interface) that gives you a standard, stand-alone database interface. Downloading it is a bit confusing; Windows users can grab one of the binaries at the bottom of this page. If you’re a Linux person, search for it in your package manager. Mac users can get it via Homebrew.

Comparing ACS Estimates Over Time: Are They Really Different?

I often get questions about comparing American Community Survey (ACS) estimates from the US Census Bureau over time. This process is more complicated than you’d think, as the ACS wasn’t designed as a time series dataset. The Census Bureau does publish comparative profile tables that compare two period estimates (in data.census.gov), but for a limited number of geographies (states, counties, metro areas).

For me, this question often takes the form of comparing change at the census tract-level for mapping and GIS projects. In this post, we’ll look at the primary considerations for comparing estimates over time, and I will walk through an example with spreadsheet formulas for calculating: change and percent change (estimates and margins of error), coefficients of variation, and tests for statistical difference. We’ll conclude with examples of mapping this data.

Primary considerations

The ACS is published in 1-year and 5-year period estimates. 1-year estimates are only available for areas that have at least 65,000 people, which means if you’re looking at small geographies (census tracts, ZCTAs) or rural areas that have small populations (most counties, county subdivisions, places) you will need to use the 5-year series. When comparing 5-year estimates, you should only compare non-overlapping time periods. For example, you would not compare the 2021 ACS (2017-2021) with the 2020 ACS (2016-2020) as these estimates have four years of sample data in common. In contrast, 2021 and 2016 (2012-2016) could be compared as they do not overlap…
…but, census geography changes over time. All statistical areas (block groups, tracts, ZCTAs, PUMAs, census designated-places, etc.) are updated every ten years with each decennial census. Areas can be re-numbered, aggregated, subdivided, or modified as populations change. This complicates comparisons; 2021 data uses geography created in 2020, while 2016 data uses geography from 2010. The only non-overlapping ACS periods with identical geographic areas would be 2014 (2010-2014) and 2019 (2015-2019). The only other alternative would be to use normalized census data, which involves additional work. While most legal areas (states, counties) can change at any time, they are generally more stable and you can make comparisons over a longer-period with modest adjustments.
All ACS estimates are fuzzy, representing a midpoint within a possible range of values (indicated with a margin of error) at a 90% confidence level. Because of sampling variability, any difference that you see between one time period and the next could be noise and not actual change. If you’re working with small geographies or small population groups, you’ll encounter large margins of error and it will be difficult to measure actual change. In addition, it’s often difficult to detect change in any area that isn’t experiencing either substantive growth or decline.

ACS Formulas

Let’s look at an example where we’ll use formulas to: calculate change over time, measure the reliability of a difference estimate, and determine whether two estimates are significantly different. I downloaded table B25064 Median Gross Rent (dollars) from the 5-year 2014 (2010-2014) and 2019 (2015-2019) ACS for all census tracts in Providence County, RI, and stitched them together into one spreadsheet. In this post I’ve replaced the cell references with an abbreviated label that indicates what should be referenced (i.e. Est1_MOE is the margin of error for the first estimate). You can download a copy of the spreadsheet with these examples.

To calculate the change / difference for an estimate, subtract one from the other.
To calculate the margin of error for this difference, take the square root of the sum of the squares for each estimate’s margin of error (MOE):

=ROUND(SQRT((Est1_MOE^2)+(Est2_MOE^2)),0)

Spreadsheet with ACS formula to compute margin of error for change / difference

To calculate percent change, divide the difference by the earliest estimate (Est1), and multiply by 100.
To calculate the margin of error for the percent change, use the ACS formula for computing a ratio:

=ROUND(((SQRT(Est2_MOE^2+((Est2/Est1)^2*Est1_MOE^2)))/Est1)100,1)

Divide the 2nd estimate by the 1st and square it, multiply that by the square of the 1st estimate’s MOE, add that to the square of the 2nd estimate’s MOE. Take the square root of that result, then divide by the 1st estimate and multiply by 100. Note that this is formula for percent change is different from the one used for calculating a percent total (the latter uses the formula for a proportion; switch the plus symbol under the square root to a minus for percent totals).

Spreadsheet with ACS formula to compute margin of error for percent change / difference

To characterize the overall accuracy of the new difference estimate, calculate its coefficient of variation (CV):

=ROUND(ABS((Est_MOE/1.645)/Est)*100,0)

Divide the MOE for the difference by 1.645, which is the Z-value for a 90% confidence interval. Divide that by the difference itself, and multiply by 100. Since we can have positive or negative change, we take the absolute value of the result.

Spreadsheet with ACS formula to compute coefficient of variation

To convert the CV into the generally recognized reliability categories:

=IF(CV<=12,"high",IF(CV>=35,"low","medium"))

If the CV value is between 0 to 12, then it’s considered to be highly reliable, else if the CV value is greater than or equal to 35 it’s considered to be of low reliability, else it is considered to be of medium reliability (between 13 and 34). Note: this is a conservative range; search around and you’ll find more liberal examples that use 0-15, 16-40, 41+.

To measure whether two estimates are significantly different from each other, use the statistical difference formula:

=ROUND(ABS((Est2-Est1)/(SQRT((Est1_MOE/1.645)^2+(Est2_MOE/1.645)^2))),3)

Divide the MOE for both the 1st and 2nd estimate by 1.645 (Z value for 90% confidence), take the sum of their squares, and then square root. Subtract the 1st estimate from the 2nd, and then divide. Again in this case, since we could have a positive or negative value we take the absolute value.

Spreadsheet with ACS formula to compute significant difference

To create a boolean significant or not value:

=IF(SigDif>1.645,1,0)

If the significant difference value is greater than 1.645, then the two estimates are significantly different from each other (TRUE 1), implying that some actual change occurred. Otherwise, the estimates are not significantly different (FALSE 0), which means any difference is likely the result of variability in the sample, or any true difference is hidden by this variability.

ALWAYS CHECK YOUR WORK! It’s easy to put parentheses in the wrong place or transpose a cell reference. Take one or two examples and plug them into Cornell PAD’s ACS Calculator, or into Fairfax County VA’s ACS Tools (spreadsheets with formulas – bottom of page). The Census Bureau also provides a spreadsheet that lets you test multiple values for significant difference. Caveat: for the Cornell calculator use the ratio option instead of change when testing. For some reason its change formula never matches my results, but the Fairfax spreadsheets do. I’ve also checked my formulas against the Census Bureau’s ACS Handbooks, and they clearly say to use the ratio formula for percent change.

Interpreting Results

Let’s take a look at a few of the records to understand the results. In Census Tract 1.01, median gross rent increased from $958 (+/- 125) in 2014 to $1113 (+/- 73) in 2019, a change of $155 (+/- 145) and a percent change of 16.2% (+/- 17%). The CV for the change estimate was 57, indicating that this estimate has low reliability; the margin of error is almost equal to the estimate, and the change could have been as little as $10 or as great as $300! The rent estimates for 2014 and 2019 are statistically different but not by much (1.761, higher than 1.645). The margins of error for the two estimates do overlap slightly (with $1,083 being the highest possible value in 2014 and $1,040 the lowest possible value in 2019).

Spreadsheet comparing values for different census tracts

In Census Tract 4, rent increased from $863 (+/- 122) to $1003 (+/- 126), a change of $140 (+/- 175) and percent change of 16.2% (+/- 22%). The CV for the change estimate was 76, indicating very low reliability; indeed the MOE exceeds the value of the estimate. With a score of 1.313 the two estimates for 2014 / 2019 are not significantly different from each other, so any difference here is clouded by sample noise.

In Census Tract 9, rent increased from $875 (+/- 56) to $1083 (+/- 62), a change of $208 (+/- 84) or 23.8% (+/- 10.6%). Compared to the previous examples, these MOEs are much lower than the estimates, and the CV value for the difference is 25, indicating medium reliability. With a score of 4.095, these two estimates are significantly different from each other, indicating substantive change in rent in this tract. The highest possible value in 2014 was $931, and the lowest possible value in 2019 was $1021, so there is no overlap in the value ranges over time.

Mapping Significant Difference and CVs

I grabbed the Census Cartographic Boundary File for tracts for Rhode Island in 2019, and selected out just the tracts for Providence County. I made a copy of my worksheet where I saved the data as text and values in a separate sheet (removing the formulas and encoding the actual outputs), and joined this sheet to the shapefile using the AFFGEOID. The City of Providence and surrounding cities and suburban areas appear in the southeast corner of the county.

The map on the left displays simple percent change over time. In the map on the right, I applied a filter to select just tracts where change was significantly different (the non-significant tracts are symbolized with hash marks). In the screenshots, the count of the number of tracts in each class appears in brackets; I used natural breaks, then modified to place all negative values in the same class. Of the 141 tracts, only 49 had statistically different values. The first map is a gross misrepresentation, as change for most of the tracts can’t be distinguished from sampling variability.

Map of difference on left, significant difference on right — Percent Change in Median Gross Rent 2010-14 to 2015-19: Change on Left, Change Where Both Rent Estimates were Significantly Different on Right

A refined version of the map on the right appears below. In this one, I converted the tracts from polygons to points in a new layer, applied a filter to select significantly different tracts, and symbolized the points by their CV category. Of the 49 statistically different tracts, the actual estimate of change was of low reliability for 32 and medium reliability for the rest. So even if the difference is significant, the precision of most of these estimates is poor.

Providence County, Significant Difference in Median Rent Map — Percent Change in Median Gross Rent 2010-14 to 2015-19 with CV Values, for Tracts with Significantly Different Estimates, Providence County RI

Conclusion

Comparing change over time for ACS estimates is complex, time consuming, and yields many dubious results. What can you do? The size of the MOE relative to the estimate tends to decline as you look at either larger or more populous areas, or larger and fewer subcategories (i.e. 4 income brackets instead of 8). You could also look at two period estimates that are further apart, making it more likely that you’ll see changes; say 2005-2009 compared to 2016-2020. But – you’ll have to cope with normalizing the data. Places that are rapidly changing will exhibit more difference than places that aren’t. If you are studying basic demographics (age / sex / race / tenure) and not socio-economic indicators, use the decennial census instead, as that’s a count and not a sample survey. Ultimately, it’s important to address these issues, and be honest. There’s a lot of bad research where people ignore these considerations, and thus make faulty claims.

For more information, visit the Census Bureau’s page on Comparing ACS Data. Chapter 6 of my book Exploring the US Census covers the American Community Survey and has additional examples of these formulas. As luck would have it, it’s freely accessible as a preview chapter from my publisher, SAGE.

Final caveat: dollar values in the ACS are based on the release year of the period estimate, so 2010-2014 rent is in 2014 dollars, and 2015-2019 is in 2019 dollars. When comparing dollar values over time you should adjust for inflation; I skipped that here to keep the examples a bit simpler. Inflation in the 2010s was rather modest compared to the 2020s, but still could push tracts that had small changes in rent to none when accounted for.

At These Coordinates

Dispatches from the Geospatial Data World

gis

Points to Areas with Convex Hulls in GIS

Steps in QGIS

Additional Considerations:

ArcGIS Pro Corollaries

Conclusion

Folium Maps with Geopandas and Random Color Schemes

Plotting Routes with OpenRouteService and Python

Cost Surfaces and Least Cost Paths in QGIS and GRASS

Preliminary Work

Least Cost Path in QGIS

Cost Surface Analysis

Cost Surface Analysis in GRASS

Conclusion

Spatial SQL with Spatialite and QGIS

Comparing ACS Estimates Over Time: Are They Really Different?

Primary considerations

ACS Formulas

Interpreting Results

Mapping Significant Difference and CVs

Conclusion