analysis

Python Spyder IDE

Python Tips for Somewhat Bigger Data

I’m fortunate to be on sabbatical for much of this summer, and am working on a project where I’m measuring the effectiveness of comparing census American Community Survey estimates over time. I’ve written a lot of Python code over the past six weeks, and thought I’d share some general tips for working with bigger datasets.

For my project, I’m looking at 317 variables stored in 25 tables for over 406,000 individual geographic areas; approximately 129.5 million data points. Multiply that by two, as I’m comparing two time periods. While this wouldn’t fall into the realm of what data scientists would consider as ‘big data’, it is big enough that you have to think strategically about how handle it, so you don’t run out of memory or have to wait hours while tasks grind away. While you could take advantage of parallel processing, or find access to a high performance computer, with this amount of data you can stick with a decent laptop, if you take steps to ensure that it doesn’t go kaput.

While the following suggestions may seem obvious to experienced programmers, it should be helpful to novices. I work with a lot of students whose exposure to Python programming is using Google Colab with Pandas. While that’s a fine place to start, the basic approaches you learn in an intro course will fall flat once you start working with datasets that are this big.

  • Don’t use a notebook. Ipython notebooks like Jupyter or Colab are popular, and are great for doing iterative analysis, visualization, and annotation of your work. But they run via web browsers which introduce extra overhead memory-wise. Iterative notebooks are unnecessary if you’re processing lots of data and don’t need to see step by step results. Use a traditional development environment instead (Spyder is my favorite – see the pic in this post’s header).
  • Don’t rely so much on Pandas DataFrames. They offer convenience as you can explicitly reference rows and columns, and reading and writing data to and from files is straightforward. But DataFrames can hog memory, and processing them can be inefficient (depending on what you’re doing). Instead of loading all your data from a file into a frame, and then making a copy of it where you filter out records you don’t need, it’s more efficient to read a file line by line and omit records while reading. Appending records to a DataFrame one at a time is terribly slow. Instead, use Python’s basic CSV module for reading and append records to nested lists. When you reach the point where a DataFrame would be easier for subsequent steps, you can convert the nested list to a frame. The basic Python data structures – lists, dictionaries, and sets – give you a lot of power at less cost. Novices would benefit from learning how to use these structures effectively, rather than relying on DataFrames for everything. Case in point: after loading a csv file with 406,000 records and 49 columns into a Pandas DataFrame, the frame consumed 240 MB of memory. Loading that same file with the csv module into a nested list, the list consumed about 3 MB.

Reads a file, skips the header row, adds a key / value pair to a dictionary for each row using the first and second value (assuming the key value is unique).

import os, csv

keep_ids={}
with open(recskeep_file,'r') as csv_file:
    reader=csv.reader(csv_file,delimiter='\t')
    next(reader)
    for row in reader:        
        keep_ids[row[0]]=row[1]

Or, save all the records as a list in a nested list, while keeping the header row in a separate list.

records=[]
with open(recskeep_file,'r') as csv_file:
    reader=csv.reader(csv_file,delimiter='\t')
    header=next(reader)
    for row in reader:        
        records.append(row)]
  • Delete big variables when you’re done with them. The files I was reading were segmented in twos: one file for estimates, and one for margins of error for those same estimates. I read each into separate, nested lists while filtering for records I wanted. I had to associate each set with a header row, filter by columns, and then join the two together. Arguably that part was easier to do with DataFrames, so at that stage I read both into separate frames, filtered by column, and joined the two. Once I had the joined frame as a distinct copy, I deleted the two individual frames to save memory.
  • Take out the garbage. Python automatically frees up memory when it can, but you can force the issue by emptying deleted objects from memory by calling the garbage collection module. After I deleted the two DataFrames in the previous step, I explicitly called gc.collect() to free up the space.
...
del est_df
del moe_df
gc.collect()
  • Write as you read. There’s no way I could read all my data in and hold it in memory before writing it all out. Instead I had to iterate – in my case the data is segmented by data tables, which were logical collections of variables. After I read and processed one table, I wrote it out as a file, then moved on to the next one. The variable that held the table was overwritten each time by the next table, and never grew in size beyond the table I was actively processing.
  • Take a break. You can use the sleep module to build in brief pauses between big operations. This can give your program time to “catch up”, finishing one task and freeing up some juice before proceeding to the next one.
time.sleep(3)
  • Write several small scripts, not one big one. The process for reading, processing, and writing my files was going to be one of the longer processes that I had to run. It’s also one that I’d likely not have to repeat if all went well. In contrast, there were subsequent analytical tasks that I knew would require a lot of back and forth, and revision. So I wrote several scripts to handle individual parts of the process, to avoid having to repeat a lot of long, unnecessary tasks.
  • Lean on a database for heavy stuff. Relational databases can handle large, structured data more efficiently compared to scripts reading data from text files. I installed PostgreSQL on my laptop to operate as a localhost database server. After I created my filtered, processed CSV files, I wrote a second program that loaded them into the database using Psycopg2, a Python module that interacts with PostgreSQL (this is a good tutorial that demonstrates how it works). SQL statements can be long, but you can use Python to iteratively write the statements for you, by building strings and filling placeholders in with the format method. This gives you two options. Option 1, you execute the SQL statements from within Python. This is what I did when I loaded my processed CSV files; I used Python to iterate and read the files into memory, wrote CREATE TABLE and INSERT statements in the script, and then inserted the data from Python’s data structures into the database. Option 2, is you can use Python to write a SQL transaction statement, save the transaction as a SQL text file, and then load it in the database and run it. I followed this approach later in my process, where I had to iterate through two sets of 25 tables for each year, and perform calculations to create a host of new variables. It was much quicker to do these operations within the database rather than have Python do them, and executing the SQL script as a separate process made it easier for me to debug problems.

Connect to a database, save SQL statement as a string, loops through a list of variables IDs, and for each variable format the string by passing the values in as parameters, execute the statement and fetch the result – fetchone() in this case, but could also fetchmany():

# Database connection parameters
pgdb='acs_timeseries'
pguser='postgres'
pgpswd='password'
pgport='5432'
pghost='localhost'
pgschema='public'

conpg = psycopg2.connect(database=pgdb, user=pguser, password=pgpswd,
                             host=pghost, port=pgport)
curpg=conpg.cursor()

sql_varname="SELECT var_lbl from acs{}_variables_mod WHERE var_id='{}'"
year='2019'

for v in varids:
    # Get labels associated with variables
    qvarname=sql_varname.format(year, v)
    curpg.execute(qvarname)
    vname=curpg.fetchone()[0]
... #do stuff...

curpg.close()
  • When using Psycopg2, don’t use the executemany() function. When performing an INSERT statement, you can have the module executeone() statement at a time, or executemany(). But the latter was excruciatingly slow – in my case it ran overnight before it finished. Instead I found this trick called mogrify, where you convert your INSERT arguments into one enormous string, and pass that to the mogrify() function. This was lightning fast, but because the text string is massive I ran out of memory if my tables were too big. My solution was to split tables in half if the number of columns exceeded a certain number, and pass them in one after the other.
  • Use the database and script for what they do best. Once I finished my processing, I was ready to begin analyzing. I needed to do several different cross-tabulations on the entire dataset, which was segmented into 25 tables. PostgreSQL is able to summarize data quickly, but it would be cumbersome to union all these tables together, and calculating percent totals in SQL for groups of data is a pain. Python with Pandas would be much better at the latter, but there’s no way I could load a giant flat file of my data into Python to use as the basis for all my summaries. So, I figured out the minimal level of grouping that I would need to do, which would still allow me to run summaries on the output for different combinations of groups (i.e. in total and by types of geography, tables, types of variables, and by variables). I used Python to write and execute GROUP BY statements in the database, iterating over each table and appending the result to a nested list, where one record represented a summary count for a variable by table and geography type. This gave me a manageable number of records. Since the GROUP BY operation took some time, I did that in one script to produce output files. Creating different summaries and reports was a more iterative process that required many revisions, but was quick to execute, so I performed those operations in a subsequent script.
SQL GROUP BY Output
Instead of 386 mil records for (406k geographies * 317 variables * 3 categories), about 18k summary counts for 19 groups of geography

Lastly, while writing and perfecting your script, run it against a sample of your data and not the entire dataset! This will save you time and needless frustration. If I have to iterate through hundreds of files, I’ll begin by creating a list that has a couple of file names in it and iterate over those. If I have a giant nested list of records to loop through, I’ll take a slice and just go through the first ten. Once I’m confident that all is well, then I’ll go back and make changes to execute the program on everything.

Providence Census Geography Map

Crosswalking Census Data to Neighborhood Geographies

Last semester we completed a project to create a crosswalk between census geographies and local geographies in Providence, RI. Crosswalks are used for relating two disparate sets of geography, so that you can compile data that’s published for one set of geography in another. Many cities have locally-defined jurisdictions like wards or community districts, as well as informally defined areas like neighborhoods. When you’re working with US Census data, you use small statistical areas that the Bureau defines and publishes data for; blocks, block groups, census tracts, and perhaps ZCTAs and PUMAs. A crosswalk allows you to apportion data that’s published for census areas, to create estimates for local areas (there are also crosswalks that are used for relating census geography that changes over time, such as the IPUMS crosswalks).

How the Crosswalk Works

For example, in the Providence Census Geography Crosswalk we have two crosswalks that allow you to take census tract data, and convert it to either neighborhoods or wards. I’ll refer to the neighborhoods in this post. In the crosswalk table, there is one record for each portion of a tract that overlaps a neighborhood. For each record, there are attribute columns that indicate the count and the percentage of a tract’s population, housing units, land area, and total area that fall within a given neighborhood. If a tract appears just once in the table, that means it is located entirely within one neighborhood. In the image below, we see that tract 1.01 appears in the table once, and its population percentage is 1. That means that it falls entirely within the Washington Park neighborhood, and 100% of its population is in that neighborhood. In contrast, tract 1.02 appears in the table twice, which means it’s split between two neighborhoods. Its pct_pop column indicates that 31.5% of its population is in South Elmwood, while 68.5% is in Washington Park. The population count represents the number of people from that tract that are in that neighborhood.

Looking at the map below, we can see that census tract 1.01 falls entirely within Washington Park, and tract 1.02 is split between Washington Park and South Elmwood. To generate estimates for Washington Park, we would sum data for tract 1.01 and the portion of tract 1.02 that falls within it. Estimates for South Elmwood would be based solely on the portion of tract 1.02 that falls within it. With the crosswalk, “portion” can be defined as the percentage of the tract’s population, housing units, land area, or total area that falls within a neighborhood.

The primary purpose of the crosswalk is to generate census data estimates for neighborhoods. You apportion tract data to neighborhoods using an allocation factor (population, housing units, or area) and aggregate the result. For example, if we have a census tract table from the 2020 census with the population that’s 65 years and older, we can use the crosswalk to generate neighborhood-level estimates of the 65+ population. To do that, we would:

  1. Join the data table to the crosswalk using the tract’s unique ID; the crosswalk has both the long and short form of the GEOIDs used by the Census Bureau. So for each crosswalk record, we would associate the 65+ population for the entire tract with it.
  2. Multiply the 65+ population by one of the allocation columns – the percent population in this example. This would give us an estimate of the 65+ population that live in that tract / neighborhood piece.
  3. Group or aggregate this product by the neighborhood name, to obtain a neighborhood-level table of the 65+ population.
  4. Round decimals to whole numbers.

To do the calculations in a spreadsheet, you would import the appropriate crosswalk sheet into the workbook that contains the census data that you want to apportion, so that they appear as separate sheets in the same workbook. In the crosswalk worksheet, use the VLOOKUP formula and reference the GEOID to “join” the census tract data to the crosswalk. The formula requires: cell containing the ID value you wish to look up, the range of cells in a worksheet that you will search through, the number of the column that contains the value you wish to retrieve (column A is 1, Z is 26, etc.), and the parameter “FALSE” to get an exact match. It is assumed that the look up value in the target table (the matching ID) appears in the first column (A).

The tract data is now repeated for each tract / neighborhood segment. Next, use formulas to multiply the allocation percentage (pct_pop in this example) by the census data value (over 65 pop for the entire tract) to create an allocated estimate for each tract / neighborhood piece.

Then you can generate a pivot table (on the Insert ribbon in Excel) where you group and sum that allocated result by neighborhood (neighborhoods as rows, census data as summed values in columns). Final step is to round the estimates.

This process is okay for small projects where you have a few estimates you want to quickly tabulate, but it doesn’t scale well. I’d use a relational database instead; import the crosswalk and census data table into SQLite, where you can easily do a left join, calculated field, and then a group by statement. Or, use the joining / calculating / aggregating equivalents in Python or R.

I used the percentage of population as the allocation factor in this example. If the census data you’re apportioning pertains to housing units, you could use the housing units percentage instead. In any case, there is an implicit assumption that the data you are apportioning has the same distribution as the allocation factor. In reality this may not be true; the distribution of children, seniors, homeowners, people in poverty etc. may vary from the total population’s distribution. It’s important to bear in mind that you’re creating an estimate. If you are apportioning American Community Survey data this process gets more complicated, as the ACS statistics are fuzzy estimates. You’d also need to apportion the margin of error (MOE) and create a new MOE for the neighborhood-level estimates.

The Providence crosswalk has some additional sheets that allow you to go from tracts, ZCTAs, or blocks to neighborhoods or wards. The tract crosswalk is by far the most useful. The ZCTA crosswalk was an exercise in futility; I created it to demonstrate the complete lack of correlation between ZCTAs and the other geographies, and recommend against using it (we also produced a series of maps to visually demonstrate the relationship between all the geographies). There is a limited amount of data published at the block level, but I included it in the crosswalk for another reason…

Creating the Crosswalk

I used census blocks to create the crosswalk. They are the smallest unit of census geography, and nest within all other census geographies. I used GIS to assign each block to a neighborhood or ward based on the geography the block fell within, and then aggregated the blocks into distinct tract / ward and tract / neighborhood combinations. Then I calculated the allocation factors, the percentage of the tract’s total attributes that fell in a particular neighborhood or ward. This operation was straightforward for the wards; the city constructed them using 2020 census blocks, so the blocks nested or fit perfectly within the wards.

The neighborhoods were more complicated, as these were older boundaries that didn’t correspond to the 2020 blocks, and there were many instances where blocks were split between neighborhoods. My approach was to create a new set of neighborhood boundaries based on the 2020 blocks, and then use those new boundaries to create the crosswalk. I began with a spatial join, assigning each block a neighborhood ID based on where the center of the block fell. Then, I manually inspected the borders between each neighborhood, to determine whether I should manually re-assign a block. In almost all instances, blocks I reassigned were unpopulated and consisted of slivers that contained large highways, or blocks of greenspace or water. I struck a balance between remaining as faithful to the original boundaries as possible, while avoiding the separation of unpopulated blocks from a tract IF the rest of the blocks in that tract fell entirely within one neighborhood. In two cases where I had to assign a populated block, I used satellite imagery to determine that the population of the block lived entirely on one side of a neighborhood boundary, and made the assignment accordingly.

In the example below, 2020 tract boundaries are shown in red, 2020 block boundaries are light grey, original neighborhood boundaries are shown with dotted black lines, and reconstituted neighborhoods using 2020 blocks are shown in different colors. The boundaries of Federal Hill and the West End are shifted west, to incorporate thin unpopulated blocks that contain expressways. These empty blocks are part of tracts (10 and 13) that fall entirely within these neighborhoods; so splitting them off to adjacent Olneyville and Silver Lake didn’t make sense (as there would be no population or homes to apportion). Reassigning them doesn’t change the fact that the true boundary between these neighborhoods is still the expressway. We also see an example between Olneyville and Silver Lake where the old neighborhood boundary was just poorly aligned, and in this case blocks are assigned based on where the center of the block fell.

Creating the crosswalk from the ground up with blocks was the best approach for accounting how population is distributed within larger areas. It was primarily an aggregation-based approach, where I could sum blocks that fell within geographies. This approach allowed me to generate allocation factors for population and housing units, since this data was published with the blocks and could be carried along.

Conversely, in GIS 101 you would learn how to calculate the percentage of an area that falls within another area. You could use that approach to create a tract-level crosswalk based on area, i.e. if a tract’s area is split 50/50 between two neighborhoods, we’ll apportion its population 50/50. While this top down approach is simpler to implement, it’s far less ideal because you often can’t assume that population and area are equally distributed. Reconsider the example we began with: 31.5% of tract 1.02’s population is in South Elmwood, while 68.5% is in Washington Park. In contrast, 75.3% of tract 1.02’s land area is in South Elmwood, versus only 24.7% in Washington Park! If we apportioned our census data by area instead of population, we’d get a dramatically different, and less accurate, result. Roger Williams Park is primarily located in the portion of tract 1.02 that falls within Elmwood; it covers a lot of land but includes zero people.

Why can’t we just simply aggregate block-level census data to neighborhoods and skip the whole apportionment step? The answer is that there isn’t much data published at the block level. There’s a small set of tables that capture basic demographic variables as part of the decennial census, and that’s it. There was a sharp reduction in the number of block-level tables in the 2020 census due to new privacy regulations, and the ACS isn’t published at the block-level at all. While you can use the block-level table in the crosswalk to join and aggregate block data, in most cases you’ll need to work with tract-data and apportion it.

I used spatial SQL to create the crosswalks in Spatialite and QGIS , and if you’re interested in seeing all the gory details you can look at the code and spatial database in source folder of the project’s GitHub repo. I always prefer SQL for spatial join and aggregation operations, as I can write a single block of code instead of running 4 or 5 different of desktop GIS tools in a sequence. I’ll be updating the project this semester to include additional geographies (block groups – the level between blocks and tracts), and perhaps an introductory tutorial for using it (there are some basic docs at present).

GeoPandas Python Script in Spyder IDE

Basic Geospatial Python with GeoPandas

Last month I cobbled together bits and pieces of geospatial Python code I’ve written in various scripts into one cohesive example. You can script, automate, and document a lot of GIS operations with Python, and if you use a combination of Pandas, GeoPandas, and Shapely you don’t even need to have desktop GIS software installed (packages like ArcPy and PyQGIS rely on their underlying base software).

I’ve created a GitHub repository that contains sample data, a basic Python script, and a Jupyter Notebook (same code and examples, in two different formats). The script covers these fundamental operations: reading shapefiles into a geodataframe, reading coordinate data into a dataframe and creating geometry, getting coordinate reference system (CRS) information and transforming the CRS of a geodataframe, generating line geometry from groups and sequences of points, measuring length, spatially joining polygons and points to assign the attributes of one to the other, plotting geodataframes to create a basic map, and exporting geodataframes out as shapefiles.

A Pandas dataframe is a Python structure for tabular data that allows you to store and manipulate data in rows and columns. Like a database, Pandas columns are assigned explicit data types (text, integers, decimals, dates, etc). A GeoPandas geodataframe adds a special geometry column for holding and manipulating coordinate data that’s encoded as point, line, or polygon objects (either single or multi). Similar to a spatial database, the geometry column is referenced with standard coordinate reference system definitions, and there are many different spatial functions that you can apply to the geometry. GeoPandas allows you to work with vector GIS datasets; there are wholly different third-party modules for working with rasters (Rasterio for instance – see this post for examples).

First, you’ll likely have to install the packages Pandas, GeoPandas, and Shapely with pip or your distro’s package handler. Then you can import them. The Shapely package is used for building geometry from other geometry. Matplotlib is used for plotting, but isn’t strictly necessary depending on how detailed you want your plots to be (you could simply use Panda’s own plot library).

import os, pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
import matplotlib.pyplot as plt
%matplotlib inline

Reading a shapefile into a geodataframe is a piece of cake with read_file. We use path.join from the os module to build paths that work in any operating system. Reading in a polygon file of Rhode Island counties:

county_file=os.path.join('input','ri_county_bndy.shp')
gdf_cnty=gpd.read_file(county_file)
gdf_cnty.head()
GeoDataframe of County Polygons

If you have coordinate data in a CSV file, there’s a two step process where you load the coordinates as numbers into a dataframe, and then convert the dataframe and coordinates into a geodataframe with actual point geometry. Pandas / GeoPandas makes assumptions about the column types when you read a CSV, but you have the option to explicitly define them. In this example I define the Census Bureau’s IDs as strings to avoid dropping leading zeros (an annoying and perennial problem). The points_from_xy function takes the longitude and latitude (in that order!) and creates the points; you also have to define what system the coordinates are presently in. This sample data came from the US Census Bureau, so they’re in NAD 83 (EPSG 4269) which is what most federal agencies use. For other modern coordinate data, WGS 84 (EPSG 4326) is usually a safe bet. GeoPandas relies on the EPSG / ESRI CRS library, and familiarity with these codes is a must for working with spatial data.

point_file=os.path.join('input','test_points.csv')
df_pnts=pd.read_csv(point_file, index_col='OBS_NUM', delimiter=',',dtype={'GEOID':str})

gdf_pnts = gpd.GeoDataFrame(df_pnts,geometry=gpd.points_from_xy(
df_pnts['INTPTLONG'],df_pnts['INTPTLAT']),crs = 'EPSG:4269')
gdf_pnts

In the output below, you can see the distinction between the coordinates, stored separately in two numeric columns, and point-based geometry in the geometry column. The sample data consists of eleven point locations, ten in Rhode Island and one in Connecticut, labeled alfa through kilo. Each point is assigned to a group labeled a, b, or c.

Geodataframe of Point Features

You can obtain the CRS metadata for a geodataframe with this simple command:

gdf_cnty.crs
CRS Information for GeoDataframe

You can also get the bounding box for the geometry:

gdf_cnty.total_bounds

These commands are helpful for determining whether different geodataframes share the same CRS. If they don’t, you can transform the CRS of one to match the other. The geometry in the frames must share the same CRS if you want to interact with the data. In this example, we transform our points from NAD 83 to the RI State Plane zone that the counties are in with to_crs; the EPSG code is 3438.

gdf_pnts.to_crs(3438,inplace=True)
gdf_pnts.crs

If our points represent a sequence of events, we can do a points to lines operation to create paths. In this example our points are ordered in the correct sequence; if this was not the case, we’d sort the frame on a sequence column first. If there are different events or individuals in the table that have an identifying field, we use this as the group field to create distinct lines. We use lambda to repeat Shapely’s LineString function across the points to build the lines, and then assign them to a new geodataframe. Then we add a column where we compute the length of the lines; this RI CRS uses feet for units, so we divide by 5,280 feet to get miles. The Panda’s loc function grabs all the rows and a subset of the columns to display them on the screen (we could save them to a new geodataframe if we wanted to subset rows or columns).

lines = gdf_pnts.groupby('GROUP')['geometry'].apply(lambda x: LineString(x.tolist()))
gdf_lines = gpd.GeoDataFrame(lines, geometry='geometry',crs = 'EPSG:3438').reset_index()
gdf_lines['length_mi']=(gdf_lines.length)/5280
gdf_lines.loc[:,['GROUP','length_mi']]
New GeoDataFrame with Line Geometry

To assign every point the attributes of the polygon (county) that it intersects with , we do a spatial join with the sjoin function. Here we take all attributes from the points frame, and a select number of columns from the polygon frame; we have to take the geometry from both frames to do the join. In this example we do a left join, keeping all the points on the left regardless of whether they have a matching polygon on the right. There’s one point that falls oustide of RI, so it will be assigned null values on the right. We rename a few of the columns, and use loc again to display a subset of them to the screen.

gdf_pnts_wcnty=gpd.sjoin(gdf_pnts, gdf_cnty[['geoid','namelsad','geometry']],
how='left', predicate='intersects')
gdf_pnts_wcnty.rename(columns={'geoid': 'COUNTY_ID', 'namelsad': 'COUNTY'}, inplace=True)
gdf_pnts_wcnty.loc[:,['OBS_NAME','OBS_DATE','COUNTY']]
GeoDataframe Created from Spatial Join

To see what’s going on, we can generate a basic plot to display the polygons, points, and lines. I used matplotlib to create a figure and axes, and then placed each layer one on top of the other. We could opt to simply use Pandas / GeoPandas internal plotting instead as illustrated in this tutorial, which works for basic plots. If we want more flexibility or need additional functions we can call on matplotlib. In this example the default placement for the tick marks (coordinates in the state plane system) was bad, and the only way I could fix them was by rotating the labels, which required matplotlib.

fig, ax = plt.subplots()
plt.xticks(rotation=315)
gdf_cnty.plot(ax=ax, color='yellow', edgecolor='grey')
gdf_pnts.plot(ax=ax,color='black', markersize=5)
gdf_lines.plot(ax=ax, column="GROUP", legend=True)
Basic Map Plot

Exporting the results out a shapefiles is also pretty straightforward with to_file. Shapefiles come with many limitations, such as a limit on ten characters for column names. You can opt to export to a variety of other vector formats, such a geopackages or geoJSON.

out_points=os.path.join('output','test_points_counties.shp')
out_lines=os.path.join('output','test_lines.shp')

gdf_pnts_wcnty.to_file(out_points)
gdf_lines.to_file(out_lines)

Hopefully this little intro will help get you started with using geospatial Python with GeoPandas. Happy New Year!

Best – Frank

QGIS Convex Hulls header

Points to Areas with Convex Hulls in GIS

Let’s say you have different sets of points, and each set represents a distinct category of features. Maybe villages where residents speak different languages, or historical events that occurred during different epochs. Beyond plotting and symbolizing the points, perhaps you would like to create areas for each set that represent generalized territory, and you’d like to see how these areas correspond. I’ll demonstrate a few approaches for achieving this, using convex hulls, attribute table calculations, and geoprocessing tools like intersection and union. A convex hull is a minimum bounding polygon, where an area is drawn around all points in a set, where the outermost points serve as vertices for creating boundaries.

I’ll use QGIS for this example, but will mention the corresponding tools in ArcGIS Pro at the end. In QGIS we’ll use the tools that are located within the Processing Toolbox (gear icon on the toolbar). Unlike the shortcut tools under the Vector menu, these tools provide more options and allow us to process multiple files at once.

Steps in QGIS

First, we need either distinct point files for each set of features, or a single file with a categorical variable that distinctly identifies different sets of features. For this example I’ll use three distinct files that I’ve generated using phony sample data. The points are in a projected coordinate system (important!) that’s appropriate for the area I’m mapping.

  1. In the QGIS Processing Toolbox, we select the Minimum Bounding Geometry (MBG) tool, and under the Geometry Type specify that we want to create a convex hull. I ran this tool for each file, creating three convex hull files (alternatively, if you had one file with distinct categories, you could use the Field option to generate separate hulls for each category). I’ve symbolized the output below, making the fill hollow and assigning an outline that matches the color of the points. This gives you a good sense for the coverage areas for the points, and how they overlap.
  1. Before running additional tools to explicitly measure overlap, we need to modify the attribute tables of the convex hulls, so we’ll have useful attributes to carry over. The MBG tool creates a new layer with an ID number, area, and perimeter. The ID is set to zero for each hull file, but we should change it to distinctly represent the file / category. With the attribute table open, we can go into an edit mode and type in a new integer value; in this case I’m assigning 1, 2, and 3 to each of the test layers. Alternatively, you could add a new field and assign it a meaningful category value.
  2. The units for area and perimeter match the units used by the map projection of the layer, which is why we want to use a projected coordinate system that uses meters or feet, and not a geographic one (like WGS 84 or NAD 83) that uses degrees. I’m using a state plane system, so the area is in square feet. To convert this to square miles, within the attribute table view I use the Field Calculator to add a new decimal field, and divide the value of the area by 27,878,400 (the number of sq feet in a sq mile; for metric units in meters, we’d divide by 1,000,000 to get sq km). We calculate the area directly from the polygon geometry:
area( @geometry) / 27878400
  1. To generate the area of intersection, we go into the Processing tool box and run the Intersection (multiple) tool. The first convex hull is the input layer, while the overlay layers are the other two hull files (in the dialog box, we check the layers we want, and then use the arrow to navigate back to the tool to run it). The output is a new file with polygon(s) that cover the area where all three layers intersect. Its attribute table contains an ID, area, and perimeter field, and we can calculate a new area field in sq miles and see how it compares to the total areas. In my example, the area where all three territories intersect covers about 112 sq miles, while the areas for the individual territories are 512, 563, and 256 sq miles respectively.
  1. To identify distinct areas of overlap between the territories, we return to the Processing toolbox and run the Union (multiple) tool. The dialog is similar to the intersection tool, where the first hull is the union layer and the additional hulls are overlay layers. The output of this tool is a layer with distinct polygons where the hulls coincide. The attribute table for the union layer carries over the attributes from each of the three layers, with columns suffixed with underscores and sequential integers. So if a polygon consists of area covered by hulls 1 and 2, those attributes will be filled in, while the attributes of 3 will be null. As before, we can calculate an area in sq miles for the new polygons. In this case, we’d see that the area covered by hull 1 without any overlapping hulls is 240 sq miles, the largest of all territories.
QGIS Union attribute table
  1. To explicitly categorize these areas, we can add a new field in the attribute table. This will be a text field, where we take the ID numbers, convert them to strings, and concatenate them. In the example above, IDs 1 and 2 would be concatenated to 12, and since the value for 3 is null, no text is appended. (Variation – if you created distinct text-based category fields instead of using the integer IDs, you could concatenate them directly without having to convert them to strings). Using the symbology tool, we can classify the data using these new categories, and can modify the color scheme to something appropriate for displaying the contributions from each area. So a polygon with category 1 includes areas covered by the first convex hull and no others, while category 12 includes areas where hulls 1 and 2 overlapped.
concat(
to_string( "id" ),
to_string( "id_2" ),
to_string( "id_3" )
)

Additional Considerations:

  • With the areas of the individual union pieces, we can compute the percentages of each territory that fall inside and outside various overlapping zones with the field calculator. For example, we can calculate the total area of the union file (which is NOT the sum of each hull, as there’s overlap between them), and then divide each feature by that total to get its percent total. The expression for doing this is below; the numerator has the name of the field that contains the area of each polygon in sq miles, while the denominator includes the calculation for the sum of all parts (alternatively you could use the QGIS Statistics tool to compute this, and hard code the total into the formula):
area_part / (sum(area( geometry(@feature)))/27878400) *100
  • If the idea is to create areas of territory that the points exert influence on, you may want to add a buffer to each hull, to account for the fact that the outer points that form the boundaries will exert influence on both sides of the boundary. Use the Processing – Buffer tool. For the buffer distance, you can use an arbitrary value that makes sense for the circumstances. Or you can generate a relative value that represents a fraction of each convex hull’s area. The output of the buffer tool would then serve as the input to the intersection and union tools.
  • These examples focus on area. If the number of points that falls within the areas is important, you can use the Points in Polygon tool on each of the hulls to count points, and then do the same for the output of the intersection and union tools to get the different points counts for each set of polygons.

ArcGIS Pro Corollaries

Following the same steps above for QGIS, but with ArcGIS Pro:

  1. In the red toolbox, the Minimum Boundary Geometry tool is used to create convex halls. It’s quite similar to the one in QGIS: specify the geometry type, and there is an option to Group (if you have one file with categories). If you leave the Add geometry characteristics box unchecked, it will still compute basic area and perimeter; the checkbox adds a bunch of additional fields.
  2. Unlike QGIS, ArcGIS will not allow you to modify its OBJECTID field. To create a unique value for each hull, you will have to open the attribute table and use the Calculate tool to create a category field (integer or text). To ensure that you can carry it over, in ArcGIS you need to give this column a different name in each hull: cat1, cat2, cat3. Set the value at the bottom in the expression box.
  3. You can use the calculate tool in the attribute table to generate an area column in sqft or sqkm, or use the Calculate Geometry Tool in the toolbox instead. The latter is actually simpler: create a new column, and choose Area and the output units.
  4. The Intersect tool will create the intersection, and functions similarly to QGIS.
  5. The Union tool creates the union, and also functions similarly.
  6. Creating the category field in the union file is a bit more complicated, as ArcGIS assigns values of 0 instead of NULL for non-overlapping polygons. In the Calculate window, with the input file as Union
    and the field as category, change the Expression type to Arcade (ESRI’s scripting language). First, run an expression to concatenate the categories and convert integers to strings (if necessary). Then, replace that expression with a second one that replace the zeros with nothing.
Concatenate(TEXT($feature.cat1)+
TEXT($feature.cat2)+
TEXT($feature.cat3)
)
Replace($feature.category,'0','')

Conclusion

This is a basic approach, appropriate for certain use cases where you want to generate areas from points; particularly when different point sets have a well defined category, so there’s no question of how to group them. Also appropriate where you don’t have – or don’t want – hard boundaries between sets of points and want to see areas of overlap. More sophisticated methods exist for separating points into clusters based on density, distance, and similar attributes, such as K-Means and DB Scan. You can generate non-overlapping territories for individual points using Thiessen / Voronoi polygons, and for points with a sufficiently high density, you can generate rasters with kernel tools.

Least Cost Paths in QGIS and GRASS

Cost Surfaces and Least Cost Paths in QGIS and GRASS

In this post I’ll demonstrate how to create least cost paths using QGIS and GRASS GIS, and in doing so will describe how a cost surface is constructed. In a surface analysis, you model movement across a grid whose values represent friction encountered in moving across it. In computing a least cost path, you’re seeking an optimal route from an origin to the closest destination, where ‘close’ incorporates distance and ease of movement across that surface. These kinds of analyses are often conducted in the environmental sciences, in modeling the movement of water across terrain, and in zoology in predicting migration paths for land-based animals.

In this example the idea was to chart the origin of settlements and possible trade routes in ancient history. In applications where we’re studying human activity, network analysis is typically used instead. Networks use geometry, where a node is a place or person, and connections between nodes are indicated with lines. Lines typically have a value associated with them that identify either the strength of a connection, or conversely friction associated with moving between nodes. The idea for this project was to identify how networks formed, so the surface analysis served as a proto-network analysis. While there were roads and maritime routes in pre-modern times, these networks were weaker and less dense. Charting movement over a surface representing terrain could provide a decent approximation of routes (but if you’re interested in ancient Roman network routing, check out the ORBIS project at Stanford).

This example stems from a project I was helping a PhD student with; I don’t want to replicate his specific study, so I’ve modified the data sources and area of focus to model movement between large settlements and stone quarries in the ancient Roman world. My goal is to demonstrate the methods with a plausible example; if we were doing this as part of an actual study, we would need to be more discriminating in selecting and processing our data.

Preliminary Work

The Pleiades project will serve as our source for destinations; it’s an academic gazetteer that includes locations and place names for the ancient and early medieval world, stretching from Europe and North Africa through the Middle East to India. It’s published in many forms, and I’ve downloaded the Pleiades Data for GIS in a CSV format. Using QGIS, I used the Add Delimited Text tool to plot the places.csv to get all of the locations, and joined that file to the places_place_type.csv file which contains different categories of places. I used Select by Attributes to get locations classified as quarries, and exported the selection out to a geopackage.

The Pleiades data includes a category for settlements, but there are about ten thousand of these and there isn’t an easy way to create a subset of the largest places. So I opted to use Hanson’s dataset of the largest settlements in the ancient Roman world to serve as our source for origins (about 1,400 places). This data was packaged in an Excel file; I plotted the coordinates using the Create Points Layer from Table tool in QGIS and converted the result to a geopackage. For testing purposes, I selected a subset of ten major cities and saved them in a separate layer: Athenae, Alexandria (Aegyptus), Antiochia (Syria), Byzantium, Carthago, Ephesus, Lugdunum, Ostia, Pergamum, Roma.

For the friction grid, I downloaded a geoTIFF of the Human Mobility Index by Ozak. The description from the project:

“The Human Mobility Index (HMI) estimates the potential minimum travel time across the globe (measured in hours) accounting for human biological constraints, as well as geographical and technological factors that determined travel time before the widespread use of steam power.”

There are three separate grids that vary in extent based on the availability of seafaring technology. I chose the grid that incorporates seafaring prior to the advent of ocean-going ships, which is appropriate for the Mediterranean world during the classical era. The HMI is a global grid at 925 meter resolution. To minimize processing time, I clipped it to a bounding box that encompasses the area of study. The grid is in the World Cylindrical Equal Area system; I reprojected it to WGS 84 to match the rest of the layers. As long as we’re not measuring actual distances, we don’t need to worry about the system we’re using (but if we were, we’d use an equidistant system). Since the range of values is small and it’s hard to see differences in cell values, I symbolized the grid as single-band psuedo color and used a quantiles classification scheme with 12 categories.

Lastly, I grabbed some modern country boundaries from Natural Earth to serve as a general frame of reference. A screenshot of the workspace is below:

Least Cost Path in QGIS

QGIS has a third-party plug-in for doing a least cost path analysis, which works fine as long as you don’t have too many origin points. Go to Plugins > Manage and Install Plugins > Least Cost Path to turn it on. Then open the Processing toolbox and it will be listed at the bottom. See the screenshot below for the tool’s menu. The Cost raster layer is the friction surface, so the human mobility index in this example. The start points are the ten major cities and the end points are the quarries. The start-point layer dialog only accepts a single point; if you have multiple points, hit the green circular arrow button to iterate across all of them. There’s a checkbox for connecting the start point to just the nearest end point (as opposed to all of them). Save the output to a geopackage.

It took about five minutes to run the analysis and iterate across all ten points. Each path is saved in a separate file, but since they have an identical structure I subsequently used Vector > Data Management Tools > Merge Vector Layers to combine them into one file. The attribute table records the end point ID (for the quarry) and the accumulated cost, but does not include the origin ID; this ID is the number 1 repeated each time, as the tool was iterating over the origin points. We can see the result below; for Athens and Ephesus in the south, land routes were shortest, whereas for Pergamum and Byzantium in the north it was easier (distance and friction-wise) to move across the sea.

While this worked fine for ten cities, it would take a considerable amount of time to compute paths for all 1,400. The problem here is that the plugin was designed for one point at a time. Let’s outline the process so we can understand how alternatives would work.

Cost Surface Analysis

To calculate a least cost path, the first step is to create a cost surface, where we take our friction grid and the destinations and calculate the total cost of movement across all cells to the nearest destination. First, the destinations are placed on the grid, and they become the grid sources. Then, the accumulated cost of moving from each source to its adjacent cells is calculated. For horizontal and vertical movement, it’s the sum of the friction values divided by two, and for diagonal movement it’s the sum of the friction values divided by two then multiplied by 1.4142. Once those calculations are performed, those adjacent cells are assigned to each source. Next, the lowest accumulated cost cell in the grid is identified, the cost for moving to its unassigned neighbors is calculated, and these cells are assigned to the same source. This process is repeated by cycling through the lowest accumulated value until all calculations for the grid are finished. Illustrated in the example below, which I derived from Lloyd’s Spatial Data Analysis (2010) pp. 165-168.

For each cell, three items are recorded, and are saved either as separate bands in one raster, or in three separate raster files:

  1. Accumulated cost of moving to that cell from the nearest source
  2. Assignment or allocation of the cell to its source (the nearest one to which it “belongs”)
  3. A vector that indicates direction from that source

With these cost surfaces, we can take the second step of calculating the least cost path. We place a number of starting points onto this surface, and each point is assigned to the closest destination based on where its grid cell was allocated. The direction to that destination is traced backward using the direction grid, and the total cost of movement is taken from the accumulated cost surface.

Knowing how this process works, there are two practical conclusions we can draw. First, when computing the cost surface, you use your destinations (not the origins) as the source for the cost surface. You use the origins as the start points for the least cost path. Second, there’s no need to recalculate the cost surface for every origin point; you only need to do this once. That’s why the QGIS plugin took so long; it was recomputing the cost surface each time. Knowing this, we can use GRASS GIS to compute the paths, as it’s designed to compute the surface just once (and it’s data structure will also boost performance a bit).

Cost Surface Analysis in GRASS

GRASS GIS comes bundled with QGIS. While it’s possible to run a number of GRASS tools directly within QGIS, it’s a bit undesirable as you’re not able to access the full range of parameters or options for each GRASS command. I opted to create the GRASS environment in QGIS, and loaded all the necessary data into the GRASS format. Then, I flipped over to the GRASS GUI to do the analysis.

GRASS uses a distinct database structure and file format, and we need to create a GRASS workspace and load our data into that database in order to use the cost surface tools. I followed the steps in the QGIS manual for creating a GRASS environment and loading data into a GRASS database. Once you create the database and mapset, you use the QGIS Browser to browse to the grassdata folder and designate your new mapset as your working mapset (mapsets have the little green grass icon beside them). With the GRASS tools open, I used v.in.ogr.qgis to load my my cities and quarries layers into this mapset, and r.in.gdal.qgis to load the mobility index (if these layers weren’t already in your QGIS project, you’d use the tools that don’t have the qgis suffix, i.e. v.in.ogr).

After exiting QGIS and launching GRASS, you select the mapset under the grassdata database at the top, right click and choose Switch mapset, and choose the mapset you want to work with (if you don’t see it, hit the database icon to browse and connect to the grassdata folder). You can display the layers in the GRASS window to visualize them, but it’s not necessary for running the tools. In the tool menu on the right, search for the Cost surface tool, r.cost, and choose the following options:

  • Required: input raster map with grid cell cost is the human mobility index, and output raster is cost_surface
  • Optional outputs: output raster with nearest start point as allocation_surface, output raster with movement as direction_surface
  • Start: vector starting points for the cost surface are the destinations, the quarries
  • Optional: check verbose output (to get more details on errors)
GRASS GIS GUI and r.cost

Running this operation on all 1,400 cities took a matter of seconds, and all three rasters described in the previous section were generated: cost, allocation, direction (shown below).

Using these outputs, we can run the Least cost route or flow tool, which is called r.drain (as it’s often used in earth sciences to chart the path that water will drain based on elevation).

  • Required: Name of input elevation or cost surface raster is cost_surface, Name of output raster path: is path_raster
  • Cost surface: check the Input raster map is a cost surface box, Name of input movement direction raster is direction_surface
  • Start: Name of starting points vector map: are the origins (cities)
  • Path settings: choose ONE option that you’d like to record (or none)
  • Optional: check Verbose mode, Name for output drain vector is path_vector

This also took mere seconds to complete (!) and generated the paths from each origin (city) to the closet destination (quarry) over the surface as both raster cells and vector lines. The output in GRASS is shown below.

Least cost path output in GRASS GIS

At this stage, we can hop back into QGIS, and load these output paths into our original project to symbolize and study what’s going on. Notice the settlements in northeastern Italy and along the Dalmatian coast; for many of them the least cost path is to a quarry across the sea rather than through rugged mountainous terrain. Even though some quarries in the mountains may be closer in actual distance, it’s a tougher path to travel.

Conclusion

The benefit of using GRASS is that we can run these processes fairly quickly for large datasets. The GRASS commands can also be compiled into a batch script, so you can create a documented and automated process instead of having to drill through multiple menus.

A big downside of the GRASS tools for this analysis is that the resulting vector paths contain no information about the origin or destination points, and only the raster path output carries along values. You might be able to generate this information through some extra steps; using the QGIS field calculator, you can get the coordinates for the start point and end point of each path and add them explicitly to the attribute table. Then take those coordinates, and for the start point of the line select the closest city and get its attributes, and for the end point select the closet quarry and get its attributes. I say “closest” because the vector paths don’t snap perfectly to the start and end points. Modifying the resolution of the human mobility index to make it coarser (fewer cells) might help to resolve this, or converting the origin and destination points to a raster of the same resolution as the index. Alternatively, if you incorporate the GRASS commands into a Python script, you could iterate over the origins in the least cost path analysis and record the origin IDs as you step through.

I haven’t worked all the pieces, but hopefully this will be useful for those of you who are interested in conducting a basic cost surface analysis in open source. The student I was helping was interested in measuring the density of the paths across a grid, so this process worked for him as he didn’t need to associate the paths with origins and destinations. Beyond FOSS GIS, ArcGIS Pro has a full suite of tools for cost surface analysis, and the underlying methods and logic are the same.

Comparing ACS Estimates Over Time: Are They Really Different?

I often get questions about comparing American Community Survey (ACS) estimates from the US Census Bureau over time. This process is more complicated than you’d think, as the ACS wasn’t designed as a time series dataset. The Census Bureau does publish comparative profile tables that compare two period estimates (in data.census.gov), but for a limited number of geographies (states, counties, metro areas).

For me, this question often takes the form of comparing change at the census tract-level for mapping and GIS projects. In this post, we’ll look at the primary considerations for comparing estimates over time, and I will walk through an example with spreadsheet formulas for calculating: change and percent change (estimates and margins of error), coefficients of variation, and tests for statistical difference. We’ll conclude with examples of mapping this data.

Primary considerations

  1. The ACS is published in 1-year and 5-year period estimates. 1-year estimates are only available for areas that have at least 65,000 people, which means if you’re looking at small geographies (census tracts, ZCTAs) or rural areas that have small populations (most counties, county subdivisions, places) you will need to use the 5-year series. When comparing 5-year estimates, you should only compare non-overlapping time periods. For example, you would not compare the 2021 ACS (2017-2021) with the 2020 ACS (2016-2020) as these estimates have four years of sample data in common. In contrast, 2021 and 2016 (2012-2016) could be compared as they do not overlap…
  2. …but, census geography changes over time. All statistical areas (block groups, tracts, ZCTAs, PUMAs, census designated-places, etc.) are updated every ten years with each decennial census. Areas can be re-numbered, aggregated, subdivided, or modified as populations change. This complicates comparisons; 2021 data uses geography created in 2020, while 2016 data uses geography from 2010. The only non-overlapping ACS periods with identical geographic areas would be 2014 (2010-2014) and 2019 (2015-2019). The only other alternative would be to use normalized census data, which involves additional work. While most legal areas (states, counties) can change at any time, they are generally more stable and you can make comparisons over a longer-period with modest adjustments.
  3. All ACS estimates are fuzzy, representing a midpoint within a possible range of values (indicated with a margin of error) at a 90% confidence level. Because of sampling variability, any difference that you see between one time period and the next could be noise and not actual change. If you’re working with small geographies or small population groups, you’ll encounter large margins of error and it will be difficult to measure actual change. In addition, it’s often difficult to detect change in any area that isn’t experiencing either substantive growth or decline.

ACS Formulas

Let’s look at an example where we’ll use formulas to: calculate change over time, measure the reliability of a difference estimate, and determine whether two estimates are significantly different. I downloaded table B25064 Median Gross Rent (dollars) from the 5-year 2014 (2010-2014) and 2019 (2015-2019) ACS for all census tracts in Providence County, RI, and stitched them together into one spreadsheet. In this post I’ve replaced the cell references with an abbreviated label that indicates what should be referenced (i.e. Est1_MOE is the margin of error for the first estimate). You can download a copy of the spreadsheet with these examples.

  1. To calculate the change / difference for an estimate, subtract one from the other.
  2. To calculate the margin of error for this difference, take the square root of the sum of the squares for each estimate’s margin of error (MOE):
=ROUND(SQRT((Est1_MOE^2)+(Est2_MOE^2)),0)
Spreadsheet with ACS formula to compute margin of error for change / difference
  1. To calculate percent change, divide the difference by the earliest estimate (Est1), and multiply by 100.
  2. To calculate the margin of error for the percent change, use the ACS formula for computing a ratio:
=ROUND(((SQRT(Est2_MOE^2+((Est2/Est1)^2*Est1_MOE^2)))/Est1)100,1)

Divide the 2nd estimate by the 1st and square it, multiply that by the square of the 1st estimate’s MOE, add that to the square of the 2nd estimate’s MOE. Take the square root of that result, then divide by the 1st estimate and multiply by 100. Note that this is formula for percent change is different from the one used for calculating a percent total (the latter uses the formula for a proportion; switch the plus symbol under the square root to a minus for percent totals).

Spreadsheet with ACS formula to compute margin of error for percent change / difference
  1. To characterize the overall accuracy of the new difference estimate, calculate its coefficient of variation (CV):
=ROUND(ABS((Est_MOE/1.645)/Est)*100,0)

Divide the MOE for the difference by 1.645, which is the Z-value for a 90% confidence interval. Divide that by the difference itself, and multiply by 100. Since we can have positive or negative change, we take the absolute value of the result.

Spreadsheet with ACS formula to compute coefficient of variation
  1. To convert the CV into the generally recognized reliability categories:
=IF(CV<=12,"high",IF(CV>=35,"low","medium"))

If the CV value is between 0 to 12, then it’s considered to be highly reliable, else if the CV value is greater than or equal to 35 it’s considered to be of low reliability, else it is considered to be of medium reliability (between 13 and 34). Note: this is a conservative range; search around and you’ll find more liberal examples that use 0-15, 16-40, 41+.

  1. To measure whether two estimates are significantly different from each other, use the statistical difference formula:
=ROUND(ABS((Est2-Est1)/(SQRT((Est1_MOE/1.645)^2+(Est2_MOE/1.645)^2))),3)

Divide the MOE for both the 1st and 2nd estimate by 1.645 (Z value for 90% confidence), take the sum of their squares, and then square root. Subtract the 1st estimate from the 2nd, and then divide. Again in this case, since we could have a positive or negative value we take the absolute value.

Spreadsheet with ACS formula to compute significant difference
  1. To create a boolean significant or not value:
=IF(SigDif>1.645,1,0)

If the significant difference value is greater than 1.645, then the two estimates are significantly different from each other (TRUE 1), implying that some actual change occurred. Otherwise, the estimates are not significantly different (FALSE 0), which means any difference is likely the result of variability in the sample, or any true difference is hidden by this variability.

ALWAYS CHECK YOUR WORK! It’s easy to put parentheses in the wrong place or transpose a cell reference. Take one or two examples and plug them into Cornell PAD’s ACS Calculator, or into Fairfax County VA’s ACS Tools (spreadsheets with formulas – bottom of page). The Census Bureau also provides a spreadsheet that lets you test multiple values for significant difference. Caveat: for the Cornell calculator use the ratio option instead of change when testing. For some reason its change formula never matches my results, but the Fairfax spreadsheets do. I’ve also checked my formulas against the Census Bureau’s ACS Handbooks, and they clearly say to use the ratio formula for percent change.

Interpreting Results

Let’s take a look at a few of the records to understand the results. In Census Tract 1.01, median gross rent increased from $958 (+/- 125) in 2014 to $1113 (+/- 73) in 2019, a change of $155 (+/- 145) and a percent change of 16.2% (+/- 17%). The CV for the change estimate was 57, indicating that this estimate has low reliability; the margin of error is almost equal to the estimate, and the change could have been as little as $10 or as great as $300! The rent estimates for 2014 and 2019 are statistically different but not by much (1.761, higher than 1.645). The margins of error for the two estimates do overlap slightly (with $1,083 being the highest possible value in 2014 and $1,040 the lowest possible value in 2019).

Spreadsheet comparing values for different census tracts

In Census Tract 4, rent increased from $863 (+/- 122) to $1003 (+/- 126), a change of $140 (+/- 175) and percent change of 16.2% (+/- 22%). The CV for the change estimate was 76, indicating very low reliability; indeed the MOE exceeds the value of the estimate. With a score of 1.313 the two estimates for 2014 / 2019 are not significantly different from each other, so any difference here is clouded by sample noise.

In Census Tract 9, rent increased from $875 (+/- 56) to $1083 (+/- 62), a change of $208 (+/- 84) or 23.8% (+/- 10.6%). Compared to the previous examples, these MOEs are much lower than the estimates, and the CV value for the difference is 25, indicating medium reliability. With a score of 4.095, these two estimates are significantly different from each other, indicating substantive change in rent in this tract. The highest possible value in 2014 was $931, and the lowest possible value in 2019 was $1021, so there is no overlap in the value ranges over time.

Mapping Significant Difference and CVs

I grabbed the Census Cartographic Boundary File for tracts for Rhode Island in 2019, and selected out just the tracts for Providence County. I made a copy of my worksheet where I saved the data as text and values in a separate sheet (removing the formulas and encoding the actual outputs), and joined this sheet to the shapefile using the AFFGEOID. The City of Providence and surrounding cities and suburban areas appear in the southeast corner of the county.

The map on the left displays simple percent change over time. In the map on the right, I applied a filter to select just tracts where change was significantly different (the non-significant tracts are symbolized with hash marks). In the screenshots, the count of the number of tracts in each class appears in brackets; I used natural breaks, then modified to place all negative values in the same class. Of the 141 tracts, only 49 had statistically different values. The first map is a gross misrepresentation, as change for most of the tracts can’t be distinguished from sampling variability.

Map of difference on left, significant difference on right
Percent Change in Median Gross Rent 2010-14 to 2015-19: Change on Left, Change Where Both Rent Estimates were Significantly Different on Right

A refined version of the map on the right appears below. In this one, I converted the tracts from polygons to points in a new layer, applied a filter to select significantly different tracts, and symbolized the points by their CV category. Of the 49 statistically different tracts, the actual estimate of change was of low reliability for 32 and medium reliability for the rest. So even if the difference is significant, the precision of most of these estimates is poor.

Providence County, Significant Difference in Median Rent Map
Percent Change in Median Gross Rent 2010-14 to 2015-19 with CV Values, for Tracts with Significantly Different Estimates, Providence County RI

Conclusion

Comparing change over time for ACS estimates is complex, time consuming, and yields many dubious results. What can you do? The size of the MOE relative to the estimate tends to decline as you look at either larger or more populous areas, or larger and fewer subcategories (i.e. 4 income brackets instead of 8). You could also look at two period estimates that are further apart, making it more likely that you’ll see changes; say 2005-2009 compared to 2016-2020. But – you’ll have to cope with normalizing the data. Places that are rapidly changing will exhibit more difference than places that aren’t. If you are studying basic demographics (age / sex / race / tenure) and not socio-economic indicators, use the decennial census instead, as that’s a count and not a sample survey. Ultimately, it’s important to address these issues, and be honest. There’s a lot of bad research where people ignore these considerations, and thus make faulty claims.

For more information, visit the Census Bureau’s page on Comparing ACS Data. Chapter 6 of my book Exploring the US Census covers the American Community Survey and has additional examples of these formulas. As luck would have it, it’s freely accessible as a preview chapter from my publisher, SAGE.

Final caveat: dollar values in the ACS are based on the release year of the period estimate, so 2010-2014 rent is in 2014 dollars, and 2015-2019 is in 2019 dollars. When comparing dollar values over time you should adjust for inflation; I skipped that here to keep the examples a bit simpler. Inflation in the 2010s was rather modest compared to the 2020s, but still could push tracts that had small changes in rent to none when accounted for.