analysis

Clipping Rasters and Extracting Values with Geospatial Python

In an earlier post, I described how to summarize and extract raster temperature data using GIS. In this post I’ll demonstrate some alternate methods using spatial Python. I’ll describe some scripts I wrote for batch clipping rasters, overlaying them with point locations, and extracting raster values (mean temperature) at those locations based on attributes of the points (a matching date). I used a number of third party modules, including geopandas (storing vector data in a tabular form), rasterio (working with raster grids), shapely (building vector geometry), matplotlib (plotting), and datetime (working with date data types). Using Anaconda Python, I searched for and added each of these modules via its package handler. I opted for this modular approach instead of using something like ArcPy, because I don’t want the scripts to be wedded to a specific software package. My scripts and sample data are available in GitHub; I’ll add snippets of code to this post for illustration purposes. The repo includes the full batch scripts that I’ll describe here, plus some earlier, shorter, sample scripts that are not batch-based and are useful for basic experimentation.

Overview

I was working with a medical professor who had point observations of where patients lived, which included a date attribute of when they had visited a clinic to receive certain treatment. For the study we needed to know what the mean temperature was on that day, as well as the temperature of each day of the preceding week. We opted to use daily temperature data from the PRISM Climate Group at Oregon State, where you can download a raster of the continental US for a given day that has the mean temperature (degrees Celsius) in one band, at 4km resolution. There are separate files for min and max temperature, as well as precipitation. You can download a year’s worth of data in one go, with one file per date.

Our challenge was that we had thousands of observations than spanned five years, so doing this one by one in GIS wasn’t going to be feasible. A custom script in Python seemed to be the best solution. Each raster temperature file has the date embedded in the file name. If we iterate through the point observations, we could grab its observation date, and using string manipulation grab the raster with the matching date in its file name, and then do the overlay and extraction. We would need to use Python’s datetime module to convert each date to a common format, and use a function to iterate over dates from the previous week.

Prior to doing that, we needed to clip or mask the rasters to the study area, which consists of the three southern New England states (Connecticut, Rhode Island, and Massachusetts). The PRISM rasters cover the lower 48 states, and clipping them to our small study area would speed processing time. I downloaded the latest Census TIGER file for states, and extracted the three SNE states. ArcGIS Pro does have batch clipping tools, but I found they were terribly slow. I opted to write one Python script to do the clipping, and a second to do the overlay and extraction.

Batch Clipping Rasters

I downloaded a sample of PRISM’s raster data that included two full months of daily mean temperature files, from Jan and Feb 2020. At the top of the clipper script, we import all the modules we need, and set our input and output paths. It’s best to use the path.join method from the os module to construct cross platform paths, so we don’t encounter the forward / backward \ slash issues between Mac and Linux versus Windows. Using geopandas I read in the shapefile of the southern New England (SNE) states into a geodataframe.

import os
import matplotlib.pyplot as plt
import geopandas as gpd
import rasterio
from rasterio.mask import mask
from shapely.geometry import Polygon
from rasterio.plot import show

#Inputs
clip_file=os.path.join('input_raster','mask','states_southern_ne.shp')
# new file created by script:
box_file=os.path.join('input_raster','mask','states_southern_ne_bbox.shp') 
raster_path=os.path.join('input_raster','to_clip')
out_folder=os.path.join('input_raster','clipped')

clip_area = gpd.read_file(clip_file)

Next, I create a new geodataframe that represents the bounding box for the SNE states. The total_bounds method provides a list of the four coordinates (west, south, east, north) that form a minimum bounding rectangle for the states. Using shapely, I build polygon geometry from those coordinates by assigning them to pairs, beginning with the northwest corner. This data is from the Census Bureau, so the coordinates are in NAD83. Why bother with the bounding box when we can simply mask the raster using the shapefile itself? Since the bounding box is a simple rectangle, the process will go much faster than if we used the shapefile that contains thousands of coordinate pairs.

corners=clip_area.total_bounds
minx=corners[0]
miny=corners[1]
maxx=corners[2]
maxy=corners[3]
areabbox = gpd.GeoDataFrame({'geometry':Polygon([(minx,maxy),
                                                (maxx,maxy),
                                                (maxx,miny),
                                                (minx,miny),
                                                (minx,maxy)])},index=[0],crs="EPSG:4269")

Once we have the bounding box as geometry, we proceed to iterate through the rasters in the folder in a loop, reading in each raster (PRISM files are in the .bil format) using rasterio, and its mask function to clip the raster to the bounding box. The PRISM rasters and the TIGER states both use NAD83, so we didn’t need to do any coordinate reference system (CRS) transformation prior to doing the mask (if they were in different systems, we’d have to convert one to match the other). In creating a new raster, we need to specify metadata for it. We copy the metadata from the original input file to the output file, and update specific attributes for the output file (such as the pixel height and width, and the output CRS). Here’s a mask example and update from the rasterio docs. Once that’s done, we write the new file out as a simple GeoTIFF, using the name of the input raster with the prefix “clipped_”.

idx=0
for rf in os.listdir(raster_path):
    if rf.endswith('.bil'):
        raster_file=os.path.join(raster_path,rf)
        in_raster=rasterio.open(raster_file)
        # Do the clip operation
        out_raster, out_transform = mask(in_raster, areabbox.geometry, filled=False, crop=True)
        # Copy the metadata from the source and update the new clipped layer 
        out_meta=in_raster.meta.copy() 
        out_meta.update({
            "driver":"GTiff",
            "height":out_raster.shape[1], # height starts with shape[1]
            "width":out_raster.shape[2], # width starts with shape[2]
            "transform":out_transform})  
        # Write output to file
        out_file=rf.split('.')[0]+'.tif'
        out_path=os.path.join(out_folder,'clipped_'+out_file)
        with rasterio.open(out_path,'w',**out_meta) as dest:
            dest.write(out_raster)
        idx=idx+1
        if idx % 20 ==0:
            print('Processed',idx,'rasters so far...')
    else:
        pass
    
print('Finished clipping',idx,'raster files to bounding box: \n',corners)

Just to see some evidence that things worked, outside of the loop I take the last raster that was processed, and plot that to the screen. I also export the bounding box out as a shapefile, to verify what it looks like in GIS.

#Show last clipped raster
fig, ax = plt.subplots(figsize=(12,12))
areabbox.plot(ax=ax, facecolor='none', edgecolor='black', lw=1.0)
show(in_raster,ax=ax)

fig, ax = plt.subplots(figsize=(12,12))
show(out_raster,ax=ax)

# Write bbox to shapefile 
areabbox.to_file(box_file)

Clipped raster with bounding box — PRISM US mean daily temperature raster, clipped / masked to bounding box of southern New England

Extract Raster Values by Date at Point Locations

In the second script, we begin with reading in the modules and setting paths. I added an option at the top with a variable called temp_many_days; if it’s set to True, it will take the date range below it and retrieve temperatures for x to y days before the observation date in the point file. If it’s False, it will retrieve just the matching date. I also specify the names of columns in the input point observation shapefile that contain a unique ID number, name, and date. In this case the input data consists of ten sample points and dates that I’ve concocted, labeled alfa through juliett, all located in Rhode Island and stored as a shapefile.

import os,csv,rasterio
import matplotlib.pyplot as plt
import geopandas as gpd
from rasterio.plot import show
from datetime import datetime as dt
from datetime import timedelta
from datetime import date

#Calculate temps over multiple previous days from observation
temp_many_days=True # True or False
date_range=(1,7) # Range of past dates 

#Inputs
point_file=os.path.join('input_points','test_obsv.shp')
raster_dir=os.path.join('input_raster','clipped')
outfolder='output'
if not os.path.exists(outfolder):
    os.makedirs(outfolder)

# Column names in point file that contain: unique ID, name, and date
obnum='OBS_NUM'
obname='OBS_NAME'
obdate='OBS_DATE'

Next, we loop through the folder of clipped raster files, and for each raster (ending in .tif) we grab the file name and extract the date from it. We take that date and store it in Python’s standard date format. The date becomes a key, and the path to the raster its value, which get added to a dictionary called rf_dict. For example, if we split the file name clipped_PRISM_tmean_stable_4kmD2_20200131_bil.tif using the underscores, counting from zero we get the date in the 5th position, 20200131. Converting that to the standard datetime format gives us datetime.date(2020, 1, 31).

rf_dict={} # Create dictionary of dates and raster file names

for rf in os.listdir(raster_dir):
    if rf.endswith('.tif'):
        rfdatestr=rf.split('_')[5]
        rfdate=dt.strptime(rfdatestr,'%Y%m%d').date() #format of dates is 20200131
        rfpath=os.path.join(raster_dir,rf)
        rf_dict[rfdate]=rfpath
    else:
        pass

Then we read the observation point shapefile into a geodataframe, create an empty result_list that will hold each of our extracted values, and construct the header row for the list. If we are grabbing temperatures for multiple days, we generate extra header values to add to that row.

#open point shapefile
point_data = gpd.read_file(point_file)

result_list=[]
result_list.append(['OBS_NUM','OBS_NAME','OBS_DATE','RASTER_ROW','RASTER_COL','RASTER_FILE','TEMP'])

if temp_many_days==True:
    for d in range(date_range[0],date_range[1]):
        tcol='TMINUS_'+str(d)
        result_list[0].append(tcol)
    result_list[0].append('TEMP_RANGE')
    result_list[0].append('AVG_TEMP')
    temp_ftype='multiday_'
else:
    temp_ftype='singleday_'

Now the preliminaries are out of the way, and processing can begin. This post and tutorial helped me to grasp the basics of the process. We loop through the point data in the geodataframe (we indicate point.data.index because these are dataframe records we’re looping through). We get the observation date for the point and store that it the standard Python date format. Then we take that date, compare it to the dictionary, and get the path to the corresponding temperature raster for that date. We open that raster with rasterio, isolate the x and y coordinate from the geometry of the point observation, and retrieve the corresponding row and column for that coordinate pair from the raster. Then we read the value that’s associated with the grid cell at those coordinates. We take some info from the observation points (the number, name, and date) and the raster data we’ve retrieved (the row, column, file name, and temperature from the raster) and add it to a list called record.

#Pull out and format the date, and use date to look up file
for idx in point_data.index:
    obs_date=dt.strptime(point_data[obdate][idx],'%m/%d/%Y').date() #format of dates is 1/31/2020
    obs_raster=rf_dict.get(obs_date)
    if obs_raster == None:
        print('No raster available for observation and date',
              point_data[obnum][idx],point_data[obdate][idx])
    #Open raster for matching date, overlay point coordinates, get cell location and value
    else:
        raster=rasterio.open(obs_raster)
        xcoord=point_data['geometry'][idx].x
        ycoord=point_data['geometry'][idx].y
        row, col = raster.index(xcoord,ycoord)
        tempval=raster.read(1)[row,col]
        rfile=os.path.split(obs_raster)[1]
        record=[point_data[obnum][idx],point_data[obname][idx],
                point_data[obdate][idx],row,col,rfile,tempval]

If we had specified that we wanted a single day (option near the top of the script), we’d skip down to the bottom of the next block, append the record to the main result_list, and continue iterating through the observation points. Else, if we wanted multiple dates, we enter into a sub-loop to get data from a range of previous dates. The datetime timedelta function allows us to do date subtraction; if we subtract 1 from the current date, we get the previous day. We loop through and get rasters and the temperature values for the points from each previous date in the range and append them to an old_temps list; we also build in a safety mechanism in case we don’t have a raster file for a particular date. Once we have all the dates, we do some calculations to get the average temperature and range for that entire period. We do this on a copy of old_temps called all_temps, where we delete null values and add the current observation date. Then we add the average and range to old_temps, and old_temps to our record list for this point observation, and when finished we append the observation record to our main result_list, and proceed to the next observation.

       # Optional block, if pulling past dates
        if temp_many_days==True:
            old_temps=[]
            for d in range(date_range[0],date_range[1]):
                past_date=obs_date-timedelta(days=d) # iterate through days, subtracting
                past_raster=rf_dict.get(past_date)
                if past_raster == None: # if no raster exists for that date
                    old_temps.append(None)
                else:
                    old_raster=rasterio.open(past_raster)
                    # Assumes rasters from previous dates are identical in structure to 1st date
                    past_temp=old_raster.read(1)[row,col]
                    old_temps.append(past_temp)
            # Calculate avg and range, must exclude None values and include obs day
            all_temps=[t for t in old_temps if t is not None]
            all_temps.append(tempval)
            temp_range=max(all_temps)-min(all_temps)
            avg_temp=sum(all_temps)/len(all_temps)
            old_temps.extend([temp_range,avg_temp])
            record.extend(old_temps)
            result_list.append(record)
        else: # if NOT doing many days, just append data for observation day
            result_list.append(record)
    if (idx+1)%200==0:
        print('Processed',idx+1,'records so far...')

Once the loop is complete, we plot the last point and raster to the screen just to check that it looks good, and we write the results out to a CSV.

#Plot the points over the final raster that was processed    
fig, ax = plt.subplots(figsize=(12,12))
point_data.plot(ax=ax, color='black')
show(raster, ax=ax)

today=str(date.today()).replace('-','_')
outfile='temp_observations_'+temp_ftype+today+'.csv'
outpath=os.path.join(outfolder,outfile)

with open(outpath, 'w', newline='') as writefile:
    writer=csv.writer(writefile, quoting=csv.QUOTE_MINIMAL, delimiter=',')
    writer.writerows(result_list)  

print('Done. {} observations in input file, {} records in results'.format(len(point_data),len(result_list)-1))

Output data for script — CSV output from script, temperatures extracted from raster by date for observation points

Results and Wrap-up

Visit the GitHub repo for full copies of the scripts, plus input and output data. In creating test observation points, I purposefully added some locations that had identical coordinates, identical dates, dates that varied by a single day, and dates for which there would be no corresponding raster file in the sample data if we went one week back in time. I looked up single dates for all point observations manually, and a sample of multi-day selections as well, and they matched the output of the script. The scripts ran quickly, and the overall process seemed intuitive to me; resetting the metadata for rasters after masking is the one part that wouldn’t have occurred to me, and took a little bit of time to figure out. This solution worked well for this case, and I would definitely apply geospatial Python to a problem like this again. An alternative would have been to use a spatial database like PostGIS; this would be an attractive option if we were working with a bigger dataset and processing time became an issue. The benefit of using this Python approach is that it’s easier to share the script and replicate the process without having to set up a database.

Observation points on raster in QGIS — Observation points plotted on temperature raster with single-day output temperatures in QGIS

Parsing the Internet Archive’s Twitter Stream Grab with Python

In this post I’ll share a process for getting geo-located tweets from Twitter, using large files of tweets archived by the Internet Archive. These are tweets where the user opted to have their phone or device record the longitude and latitude coordinates for their location, at the time of the tweet. I’ve created some straightforward scripts in Python without any 3rd party modules for processing a daily file of tweets. Given all the turmoil at Twitter in early 2023, most of the tried and true solutions for scraping tweets or using their APIs no longer function. What I’m presenting here is one, simple solution.

Social media data is not my forte, as I specialize in working with official government datasets. When such questions turn up from students, I’ve always turned to the great Web Scraping Toolkit developed by our library’s Center for Digital Scholarship. But the graduate student I was helping last week and I discovered that both the Twint and TAGS tools no longer function due to changes in Twitter’s developer policies. Surely there must be another solution – there are millions of posts on the internet that show how easy it is to grab tweets via R or Python! Alas, we tried several alternatives to no avail. Many of these projects rely on third party modules that are deprecated or dodgy (or both), and even if you can escape from dependency hell and get everything working, the changed policies rendered them moot.

You can register under Twitter’s new API policy and get access to a paltry number of records. But I thought – surely, someone else has scraped tons of tweets for academic research purposes and has archived them somewhere – could we just access those? Indeed, the folks at Harvard have. They have an archive of geolocated tweets in their dataverse repository, and another one for political tweets. They are also affiliated with a much larger project called DocNow with other schools that have different tweet archives. But alas, there are rules to follow, and to comply with Twitter’s license agreement Harvard and these institutions can’t directly share the raw tweets with anyone outside their institutions. You can search and get IDs to the tweets, using their Hydrator application, which you can use in turn to get the actual tweets. But then in small print:

“Twitter’s changes to their API which greatly reduce the amount of read-only access means that the Hydrator is no longer a useful application. The application keys, which functioned for the last 7 years, have been rescinded by Twitter.”

Fortunately, there is the Internet Archive, which has been working to preserve pieces of the internet for posterity for several decades. Their Twitter Stream Grab consists of monthly collections of daily files for the past few years, from 2016 to 2022. This project is no longer active, but there’s a newer one called the Twitter Archiving Project which has data from 2017 to now. I didn’t investigate this latter one, because I wasn’t sure if it provided the actual tweets or just metadata about them, while the older project definitely did. The IA describes the Stream Grab as the “spritzer” version of Twitter grabs (as opposed to a sprinkler or garden hose). Thanks to the internet, it’s easy to find statistics but hard to find reliable ones – this one, credible looking source (the GDELT Project) suggests that there are between 400 and 500 million tweets a day in recent years. The file I downloaded from IA for one day had over 4 million tweets, so that’s about 1% of all tweets.

I went into the November 2022 collection and downloaded the file for Nov 1st. It’s a TAR file that’s about 3 GB. Unzipping it gives you a folder for that data named for the date, with hundreds of gz ZIP files. Unzip those, and you have tons of JSON Line files. These are JSON files where each JSON record has been collapsed into one line.

Python to the rescue. See GitHub for the full scripts – I’ll just add some snippets here for illustration. I wrote two scripts: the first reads in and aggregates all the tweets from the JSONL files, parses them into a Python dictionary, and writes out the geo-located records into regular JSON. The second reads in that file, selects the elements and values that we want into a list format, and writes those out to a CSV. The rationale was to separate importing and parsing from making these selections, as we’re not going to want to repeat the time-consuming first part while we’re tweaking and modifying the second part.

In the sample data I used for 11/01/2022, unzipping the downloaded TAR file gave me a date folder, and in that date folder were hundreds of gz ZIP files. Unzipping those revealed the JSONL files. I wrote the script to look in that date folder, one level below the folder that holds the scripts, and read in anything that ended with .json. Not all of the Internet Archive’s stream’s are structured this way; if your downloads are structured differently, you can simply move all the unzipped json files to one directory below the script to read them. Or, you can modify the script to iterate through sub-directories.

Because the data was stored as JSONL, I wasn’t able to read it in as regular JSON. I read each line as a string that I appended to a list, iterated through that list to convert it into a dictionary, pulled out the records that had geo-located elements, and added those records to a larger dictionary where I used an identifier in the record as a key and the value as a dictionary with all elements and values for a tweet. This gets written out as regular JSON at the end. Reading the data in didn’t take long; parsing the strings into dictionaries was the time consuming part. Originally, I wanted to parse and save all 4 million records, but the process stalled around 750k as I ran out of memory. Since so few records are geo-located, just selecting these circumvented this problem. If you wanted to modify this part to get other kinds of records, you would need to apply some filter, or implement a more efficient process than what I’m using.

json_list=[] # list of lists, each sublist has 1 string element = 1 line

for f in os.listdir(json_dir):
    if f.endswith('.json'):
        json_file=os.path.join(json_dir,f)
        with open(json_file,'r',encoding='utf-8') as jf:
            jfile_list = list(jf) # create list with one element, a line saved as a string 
            json_list.extend(jfile_list)
            print('Processed file',f,'...')

geo_dict={} # dictionary of dicts, each dict has line parsed into keys / values
i=0   
for json_str in json_list:
    result = json.loads(json_str) # convert line / string to dict
    if result.get('geo')!=None: # only take records that were geocoded
        geo_dict[result['id']]=result 
    i=i+1
    if i%100000==0:
        print('Processed',i,'records...')

The second script reads the JSON output from the first, and simply iterates through the dictionary and chooses the elements and values I want and assigns them to variables. Some of these are straightforward, such as grabbing the timestamp and tweet. Others required additional work. The source element provides HTML code with a source link and name, so I split and strip this value to get them separately. The coordinates are stored as a list, so to get longitude and latitude as separate values I indicate the list position. In cases where I’m delving into a sub-dictionary to get a value (like the coordinates), I added if statements to set values to None if they don’t exist in the JSON, otherwise you get an error. Once I finish iterating, I append all these variables to a list, and add this list to the main one that captures every record. I create a matching header row list, and both are written out as a CSV.

with open(input_json) as json_file:
    twit_data = json.load(json_file)

twit_list=[]

# In this block, select just the keys / values to save
for k,v in twit_data.items():
    tweet_id=k
    timestamp=v.get('created_at')
    tweet=v.get('text')
    # Source is in HTML with anchors. Separate the link and source name
    source=v.get('source') # This is in HTML
    source_url=source.split('"')[1] # This gets the url
    source_name=source.strip('</a>').split('>')[-1] # This gets the name
    lang=v.get('lang')
    # Value for long / lat is stored in a list, must specify position
    if v['geo'] !=None:
        longitude=v.get('geo').get('coordinates')[1]
        latitude=v.get('geo').get('coordinates')[0]
    else:
        longitude=None
        latitude=None
...

My code could use improvement – much of this could be abstracted into a function to avoid repetition. We were in a hurry, and I’m also working with folks who need data but aren’t necessarily familiar with Python, so something that’s inefficient but understandable is okay (although I will polish this up in the future).

I provide the output in GitHub, examples of the final CSV appear below. Every language in the world is captured in these tweets, so Windows users need to import the CSV into Excel (Data – From Text/CSV) and choose UTF-8 encoding. Double-clicking the CSV to open it in Excel in Windows will render most of the text as junk, in the default Windows-1252 encoding.

Tweets extracted from Internet Archive with timestamp, tweets, and source information

Geolocated Twitter Data 1 — Tweets extracted from Internet Archive, showing geo-located information

So, is this data actually useful? That’s an open question. Of the 4 million tweets in this file, just over 1,158 were geo-located! I checked and this is not a mistake. The metadata record for the Harvard geolocated tweets mentions that only 1% to 2% of all tweets are geo-located. So of the 400 million daily tweets, only 4 million. And out of our daily 4 million sample from IA, just 1,158 (less than 1%). What we ended up with does give you a sense of variety and global coverage (see map at the top of the post, showing sample of tweets by language Nov 1, 2022). In this sample, the top five countries represented were: US (35%), Japan (17%), Brazil (4%), UK (4%), Mexico and Turkey (tied 3%). For languages, the top five: English (51%), Japanese (17%), Spanish (9%), Portuguese (5%), and Turkish (3%).

In many cases, I think you’d need a larger sample than a single day, assuming you’re interested in just geo-located records. Perhaps 4 million is large enough for certain non-spatial research? Again, not my area of expertise, but you would want to be aware of events that happened on a certain date that would influence what was tweeted. My graduate student wanted to see differences in certain kinds of tweets in the LA metro area versus the rest of the US, but this sample includes less than 20 tweets from LA. To do anything meaningful, she’d have to download and process a whole month of tweets (at least). Even then, there are certain tweeters that show up repeatedly in given areas. In NYC, most of the tweets on this date were from the 511 service, warning people where that day’s potholes were.

Beyond the location of the tweet, there is a lot of information about the user, including their self-reported location. This data is available in all tweets (not just the geo-located ones). But there are a lot problems with this attribute: the user isn’t necessarily tweeting from that location, as it represents their “static” home. This location is not geocoded, and it’s self reported and uncontrolled. In this example, some users dutifully reported their home as ‘Cleveland, OH’ or ‘New York City’. Other folks listed ‘NYC – LA – ATL – MIA’, ‘CIUDAD DE LAS BAJAS PASIONES’, ‘H E L L’, and ‘Earth. For now’.

Even for research that incorporated geo-located tweets from other, larger data sources that were previously accessible, how representative are all those studies when the data represents only 1% of the total tweet volume? I am skeptical. Also consider the information from the good folks at the Pew Research Center, that tells us that only one in five US adults use Twitter, and that the minority of Twitter users generate the vast majority of tweets: “The top 25% of US users by tweet volume produce 97% of all tweets, while the bottom 75% of users produce just 3%” (10 Facts About Americans and Twitter May 5, 2022).

For what’s it worth, if you need access to Twitter data for academic, non-commercial research purposes and the old methods aren’t working, perhaps the Internet Archive’s data and the solution posed here will fit the bill. You can see the geo-located output (JSON and CSV) from this example in the GitHub repo’s output folder. There is also a samples folder, which contains JSON and CSV for about 77k records that include both geo-located and non-geolocated examples. Looking at the examples can help you decide how to modify the scripts, to pull out different elements and values of interest.

Summarizing Raster Data for Areas and Assigning Values to Points

It’s been a busy few months, but I have a few days to catch my breath now that it’s spring break and most people (except me) have gone away! One question that’s come up quite a bit this semester is how to associate raster data with coinciding vector data. I’ll summarize some approaches in this post using ArcGIS Pro and QGIS, to summarize raster values for polygons (zonal statistics) and to assign raster values to points (aka raster sampling).

Zonal Statistics: Summarize Rasters by Area

Imagine that you have quantitative values such as temperature or a vegetation index in a raster grid, and you want to use this data to calculate an average for counties or metro areas. The goal is to have a new attribute column in the vector layer that contains the summarized raster value, perhaps because you want to make thematic maps of that value, or you want to use it in conjunction with other variables to run spatial statistics, or you just want a plain and simple summary for given places.

The term zonal statistics is used to define any operation that calculates statistics on cell values of a raster within an area or zone defined by another dataset, either a raster or a vector. The ArcGIS Pro toolbox has a Zonal Statistics tool where the output is a new raster file with cells that are summarized by the input zones. That’s not desirable for the use case I’m presenting here; the better choice is the Zonal Statistics as Table tool. The output is a table containing the unique identifiers of the raster and vector, the summary stats you’ve generated (average, sum, min, max, etc), and a count of the number of cells used to generate the summary. You can join this resulting table back to the vector file using their common unique identifier in a table join.

In the example below, I’m using counties from the census TIGER files for southern New England as my Input Feature Zone, the AFFGEOID (Census ANSI / FIPS code) to identify the Zone Field, and a temperature grid for January 1, 2020 from PRISM as the Input Value Raster. I’m calculating the mean temperature for the counties on that day.

ArcGIS Zonal Statistics as Table Tool — ArcGIS Pro Zonal Statistics as Table; Temperature Grid and Southern New England Counties

The output table consists of one record for each zone / county, with the count of the cells used to create the average, and the mean temperature (in degrees Celsius). This table can be joined back to the original vector feature (select the county feature in the Contents, right click, Joins and Relates – Join) to thematically map the average temp.

ArcGIS Zonal Statistics Result — ArcGIS Pro Zonal Statistics; Table Output and Join to Show Average Temperature per County

In QGIS, this tool is simply called Zonal Statistics; search for it in the Processing toolbox. The vector with the zones is the Input layer, and the Raster layer is the grid with the values. By default the summary stats are the count, sum, and mean, but you can check the Statistics to calculate box to select others. Unlike ArcGIS, QGIS allows you to write output as a table or a new shapefile / geopackage, which carries along the feature geometry from the Input zones and adds the summaries, allowing you to skip the step of having to do a table join (if you opted to create a table, you could join it to the zones using the Joins tab under the Properties menu for the vector features).

QGIS Zonal Stats — QGIS Zonal Statistics

Extract Raster Values for Point Features

Zonal stats allows you to summarize raster data within a polygon. But what if you had point features, and wanted to assign each point the value of the raster cell that it falls within? In ArcGIS Pro, search the toolbox for the Extract Values to Points tool. You select your input points and raster, and a new point feature that will include the raster values. The default is to take the value for the cell that the point falls within, but there is an Interpolate option that will calculate the value from adjacent cells. The output point feature contains a new column called RASTERVALU. I created some phony point data and used it to generate the output below.

ArcGIS Extract Values to Points — ArcGIS Pro Extract Values to Points (assign raster cell values to points)

In QGIS the name of this tool is Sample raster values, which you can find in the Processing toolbox. Input the points, choose a raster layer, and write the output to a new vector point file. Unlike ArcGIS, there isn’t an option for interpolation from surrounding cells; you simply get the value for the cell that the point falls within. If you needed to interpolate, you can go to the Plugins menu, enable the SAGA plugin, and in the Processing toolbox try the SAGA tool Raster Values to Points instead.

QGIS Sample Raster Values (assign raster cell values to points)

A variation on this theme would be to create and assign an average value around each point at a given distance, such as the average temperature within five miles. One way to achieve this would be to use the buffer tools in either ArcGIS or QGIS to create distinct buffers around each point at the specified distance. The buffer will automatically carry over all the attributes from the point features, including unique identifiers. Then you can run the zonal statistics tools against the buffer polygons and raster to compute the average, and if need be do a table join between the output table and the original point layer using their common identifier.

Wrap-up

In using any of these tools, it’s important to consider the resolution of the raster (i.e. the size of the grid cell):

1. Relative to the size of the zonal areas or number of points, and

2. In relation to the phenomena that you’re studying.

When larger grid cells or zonal areas are used for measurement, any phenomena becomes more generalized, and any variations within these large areas become masked. The temperature grid cells in this example had a resolution of 2.5 miles, which was suitable for creating county summaries. Summarizing data for census tracts at this resolution would be less ideal, as the tracts are much smaller than the cells, with the cell value characterizing a much larger area. This might be okay in the case of temperature, which tends not to vary considerably over a distance of a few miles. In contrast, averaging temperature data for states is not worthwhile, as states vary considerably in size and most are large enough that they contain multiple ecosystems and elevation levels.

The solutions I’ve described here are the desktop GIS solutions. You could also use either spatial SQL in a geodatabase or a spatial extension in a scripting language like Python or R to perform similar operations. In both cases a basic overlay and intersection statement is used, in conjunction with some grouping function for calculating summaries. I’ve been doing a lot more spatial Python work with geopandas these past few months – perhaps a topic for a subsequent post…

Introduction to Stata Tutorial

This month’s post will be brief, but helpful for anyone who wants to learn Stata. I wrote a short tutorial called First Steps with Stata for an introductory social science data course I visited earlier this month. It’s intended for folks who have never used a statistical package or a command-driven interface, and represents initial steps prior to doing any statistical analysis:

Loading and viewing Stata dta files
Describing and summarizing data
Modifying and recoding data
Batch processing with Do files / scripts
Importing data from other formats

I chose the data that I used in my examples to illustrate the difference between census microdata and summary data, using sample data from the Current Population Survey to illustrate the former, and a table from the American Community Survey to represent the latter.

I’m not a statistician by training; I know the basics but rely on Python, databases, Excel, or GIS packages instead of stats packages. I learned a bit of Stata on my own in order to maintain some datasets I’m responsible for hosting, but to prepare more comprehensively to write this tutorial I relied on Using Stata for Quantitative Analysis, which I highly recommend. There’s also an excellent collection of Stata learning modules created by UCLA’s Advance Research Computing Center. Stata’s official user documentation is second to none for clearly introducing and explaining the individual commands and syntax.

In my years working in higher ed, the social science and public policy faculty I’ve met have all sworn by Stata over the alternatives. A study of citations in the health sciences, where stats packages used for the research were referenced in the texts, illustrates that SPSS is employed most often, but that Stata and R have increased in importance / usage over the last twenty years, while SAS has declined. I’ve had some students tell me that Stata commands remind them of R. In searching through the numerous shallow reviews and comparisons on the web, I found this post from a data science company that compares R, Python, SPSS, SAS, and Stata to be comprehensive and even-handed in summarizing the strengths and weaknesses of each. In short, I thought Stata was fairly intuitive, and the ability to batch script commands in Do files and to capture all input / output in logs makes it particularity appealing for creating reproducible research. It’s also more affordable than the other proprietary stats packages and runs on all operating systems.

Example – print the first five records of the active dataset to the screen for specific variables:

list statefip age sex race cinethp in f/5

     +-------------------------------------------+
     | statefip   age      sex    race   cinethp |
     |-------------------------------------------|
  1. |  alabama    76   female   white       yes |
  2. |  alabama    68   female   black       yes |
  3. |  alabama    24     male   black        no |
  4. |  alabama    56   female   black        no |
  5. |  alabama    80   female   white       yes |
     |-------------------------------------------|

Kernel Density and Contours in QGIS: Noisy NYC

In spatial analysis, kernel density estimation (colloquially referred to as a type of “hot spot analysis”) is used to explore the intensity or clustering of point-based events. Crimes, parking tickets, traffic accidents, bird sightings, forest fires, incidents of infections disease, anything that you can plot as a point at a specific period in time can be studied using KDE. Instead of looking at these features as a distribution of discrete points, you generate a raster that represents a continuous surface of values. You can either measure the density of the incidents themselves, or the concentration of a specific attribute that is tied to those incidents (like the dollar amount of parking tickets or the number of injuries in traffic accidents).

In this post I’ll demonstrate how to do a KDE analysis in QGIS, but you can easily implement KDE in other software like ArcGIS Pro or R. Understanding the inputs you have to provide to produce a meaningful result is more important than the specific tool. This YouTube video produced by the SEER Lab at the University of Florida helped me understand what these inputs are. They used the SAGA kernel tool within QGIS, but I’ll discuss the regular QGIS tool and will cover some basic data preparation steps when working with coordinate data. The video illustrates a KDE based on a weight, where there were single points that had a count-based attribute they wanted to interpolate (number of flies in a trap). In this post I’ll cover simple density based on the number of incidents (individual noise complaints), and will conclude by demonstrating how to generate contour lines from the KDE raster.

For a summary of how KDE works, take a look at the entry for “Kernel” in the Encyclopedia of Geographic Information Science (2007) p 247-248. For a fuller treatment, I always recommend Christopher Lloyd’s Spatial Data Analysis: An Introduction to GIS Users (2010) p 93-97 by Oxford Press. There’s also an explanation in the ArcGIS Pro documentation.

Data Preparation

I visited the NYC Open Data page and pulled up the entry for 311 Service Requests. When previewing the data I used the filter option to narrow the records down to a small subset; I chose complaints that were created between June 1st and 30th 2022, where the complaint type began with “Noise”, which gave me about 75,000 records (it’s a noisy town). Then I hit the Export button and chose one of the CSV formats. CSV is a common export option from open data portals; as long as you have columns that contain latitude and longitude coordinates, you will be able to plot the records. The NYC portal allows you to filter up front; other data portals like the ones in Philly and DC package data into sets of CSV files for each year, so if you wanted to apply filters you’d use the GIS or stats package to do that post-download. If shapefiles or geoJSON are provided, that will save you the step of having to plot coordinates from a CSV.

With the CSV, I launched QGIS, went to the Data Source Manager, and selected Delimited Text. Browsed for the file I downloaded, gave the layer a common sense name, and under geometry specified Point coordinates, and confirmed that the X field was my longitude column and the Y field was latitude. Ran the tool, and the points were plotted in the basic WGS 84 longitude / latitude system in degrees, which is the system the coordinates in the data file were in (generally a safe bet for modern coordinate data, but not always the case).

QGIS Add Delimited Text and Plot Coordinates

The next step was to save these plotted points in a file format that stores geometry and allows us to do spatial analysis. In doing that step, I recommend taking two additional ones. First, verify that all of the plotted data have coordinates – if there are any records where lat and long are missing, those records will be carried along into the spatial file but there will be no geometry for them, which will cause problems. I used the Select Features by Expression tool, and in the expression window typed “Latitude” is not null to select all the features that have coordinates.

Second, transform the coordinate reference system (CRS) of the layer to a projected system that uses meters or feet. When we run the kernel tool, it will ask us to specify a radius for defining the density, as well as the size of the pixels for the output raster. Using degrees doesn’t make sense, as it’s hard for us to conceptualize distances in degrees, and they are not a constant unit of measurement. If you’ve googled around and read Stack Exchange posts or watched videos where a person says “You just have to experiment and adjust these numbers until your map looks Ok”, they were working with units in fractions of degrees. This is not smart. Transform the system of your layers!

I selected the layer, right clicked, Export, Save Selected Features As. The default output is a geopackage, which is fine. Otherwise you could select ESRI shapefile, both are vector formats that store geometry. For file name I browse … and save the file in a specific folder. Beside CRS I hit the globe button, and in the CRS Selector window typed NAD83 Long Island in the filter at the top, and at the bottom I selected the NAD83 / New York Long Island (ftUS) EPSG 2263 option system in the list. Every state in the US has one or more state plane zones that you can select for making optimal maps for that area, in feet or meters. Throughout the world, you could choose an appropriate UTM zone that covers your area in meters. For countries or continents, look for an equidistant projection (meters again).

Clicked a series of Oks to create the new file. To reset my map window to match CRS of the new file, I selected that file, right clicked, Layer CRS, Set Project CRS from Layer. Removed my original CSV to avoid confusion, and saved my project.

Kernel Density Estimation

Now our data is ready. Under the Processing menu I opened the toolbox and searched for kernel to find Heatmap (Kernel Density Estimation) under the Interpolation tools. The tool asks for an input point layer, and then a radius. The radius is used to define an area for calculating a local density estimate around each point. We can use a formula to determine an ideal radius; the hopt method seems to be commonly employed for this purpose.

To use the hopt formula, we need to know the standard distance for our layer, which measures the degree to which features are dispersed around the spatial mean or center of the distribution. A nice 3rd party plugin was created for calculating this. I went to the the plugins menu, searched for the Standard Distance plugin, and added it. Searched for it in the Processing toolbox and launched it. I provided my point layer for input, and specified an output file. The other fields are optional (if we were measuring an attribute of the points instead of the density of the points, we could specify the attribute as a weight column). The output layer consists of a circle where the center is the mean center of the distribution, and the circle represents the standard deviation. The attribute table contains one record, with the standard distance attribute of 36,046.18 feet (if no feature was created, the likely problem is you have records in the point file that don’t have geometry – delete them and try again).

Output from the Standard Distance Plugin

Knowing this, I used the hopt formula:

=((2/(3N))^0.25)SD

Where N is the number of features and SD is the standard distance. I used Excel to plug in these values and do the calculation.

((2/(374526))^0.25)36046.18 = 1971.33

Finally, I launched the heatmap kernel tool, specified my noise points as input, and the radius as 1,971 feet. The output raster size does take some experimentation. The larger the pixel size, the coarser or more general the resolution will be. You want to choose something that makes sense based on the size of the area, the number of points, and / or some other contextual information. Just like the radius, the units are based on the map units of your layer. If I type in 100 feet for Pixel X, I see I’ll have a raster with 1,545 rows and 1,565 columns. Change it to 200 feet, and I get 773 by 783. I’ll go with 200 feet (the distance between a “standard” numbered street block in midtown Manhattan). I kept the defaults for the other options.

QGIS Heatmap Kernel Density Estimation Window

The resulting raster was initially displayed in black and white. I opened the properties and symbology menu and changed the render type from Singleband gray to Singleband pseudocolor, and kept the default yellow to red scheme. Voila!

Kernel Density Estimate of NYC Noise Complaints June 2022

In June 2022 there were high clusters of noise complaints in north central Brooklyn, northern Manhattan, and the southwest portion of the Bronx. There’s a giant red hot spot in the north central Bronx that looks like the storm on planet Jupiter. What on earth is going on there? I flipped back to the noise point layer and selected points in that area, and discovered a single address where over 2,700 noise complaints about a loud party were filed on June 18 and 19! There’s also an address on the adjacent block that registered over 900 complaints. And yet the records do not appear to be duplicates, as they have different time stamps and closing dates. A mistake in coding this address, multiple times? A vengeful person spamming the 311 system? Or just one helluva loud party? It’s hard to say, but beware of garbage in, garbage out. Beyond this demo, I would spend more time investigating, would try omitting these complaints as outliers and run the heatmap tool again, and compare this output to different months. It’s also worth experimenting with the color classification scheme, and some different pixel sizes.

Contour Lines

Another interesting way to visualize this data would be to generate contour lines based on the kernel output. I did a search for contour in the processing toolbox, and in the contour tool I provided the kernel noise raster as the input. For intervals between contour lines I tried 20 feet, and changed the attribute name to reflect what the contour represents: COMPLAINT instead of ELEV. Generated the new file, overlaid on top of the kernel, and now you can see how it represents the “elevation” of complaints.

Switch the kernel off, symbolize the contours and add some labels, and throw the OpenStreetMap underneath, and now you can explore New York’s hills and valleys of noise. Or more precisely, the hills and valleys of noise complainers! In looking at these contours, it’s important to remember that they’re generated from the kernel raster’s grid cells and not from the original point layer. The raster is a generalization of the point layer, so it’s possible that if you look within the center of some of the denser circles you may not find, say, 340 or 420 actual point complaints. To generate a more precise set of contours, you would need to decrease the pixel size in the kernel tool (from say 200 feet to 100).

Noise Complaint Contours in Lower Manhattan, Northwest Brooklyn, and Long Island City

It’s interesting what you can create with just one set of points as input. Happy mapping!

Creating STATA Variable Lists in Excel and Do Files With Python

In this post I demonstrate how export a list of variables from a STATA dta file to an Excel spreadsheet, and how to create a STATA do file by using Python to read in a list of variables from a spreadsheet; the do file will generate an extract of attributes and observations from a larger dta file. Gallup Analytics microdata serves as the example.

Gallup Analytics Microdata

Many academic libraries subscribe to an online database called Gallup Analytics, which lets users explore and download summary statistics from a number of on-going polls and surveys conducted by the Gallup Organization, such as the US Daily Tracker poll, World Poll, and SPSS polling series. As part of the package, subscribing institutions also receive microdata files for some of the surveys, in STATA and SPSS formats. These files contain the anonymized, individual responses to the surveys. The microdata is valuable to social science researchers who use the responses to conduct statistical analyses.

Naturally, the microdata is copyrighted and licensed for non-commercial research purposes to members of the university or institution who are covered by the license agreement, and cannot be shared outside the institution. Another stipulation is that the files cannot be shared in their entirety, even for members of the licensed institution; researchers must request individual extracts of variables and observations to answer a specific research question. This poses a challenge for the data librarian, who somehow has to communicate to the researcher what’s available in the files and mediate the request. Option 1 is to share the codebooks (which are also copyrighted and can’t be publicly distributed) with the researcher and haggle back and forth via email to iron out the details of the request. Option 2 is to have a stand-alone computer set up in the library, where a researcher can come and generate their own extract from files stored on a secure, internal network. In both cases, the manual creation of the extract and the researcher’s lack of familiarity with the contents of the data makes for a tedious process.

My solution was to create spreadsheets that list all of the variables in each dataset, and have the researcher check the ones they want. I created a resource guide that advertises and describes the datasets, and provides secure links to the Gallup codebooks and these spreadsheets, which are stored on a Google Drive and are protected via university authentication. The researcher can then fill out a Google form (also linked to from that page), where they describe the nature of the request, select the specific dataset of interest, specify filters on observations (rows), and upload the spreadsheet of requested variables (columns). Then, I can read the spreadsheet variables into Python and generate a STATA do file (STATA scripts stored in plain text format), to create the desired extract which I can share with the researcher.

Create List of STATA Variables in Excel Spreadsheet

First, I created a standard set of STATA do files to output lists of all variables to a spreadsheet for the different data files. An example for the US Daily Tracker poll from pre-2018 is below. I was completely unfamiliar with STATA, but the online docs and forums taught me what I needed to pull this together.

Some commands are the same across all the do files. I use describe and then translate to create a simple text file that saves a summary from the screen that counts rows and columns. Describe gives a description of the data stored in memory, while replace is used to swap out existing variables with a new subset. Then, generate select_vars gives me codebook information about the dataset (select_vars is a variable name I created), which I sort using the name column. The export excel command is followed by the specific summary fields I wish to output; the position of the variable, data type, variable label, and the variable name itself.

* Create variable list for Gallup US Tracker Survey 2008-2017

local y = YEAR in 1

describe,short
summarize YEAR
translate @Results gallup_tracker_`y'_summary.txt, replace

describe, replace
generate select_vars = ""
sort name

export excel position name type varlab select_vars using gallup_tracker_`y'_vars.xlsx, firstrow(variables) replace

The variation for this particular US Daily Tracker dataset is that the files are packaged as one file per year. I load the first file for 2008, and the do file saves the YEAR attribute as a local variable, which allows me to include the year in the summary and excel output file names. I had to run this do file for each subsequent year up to 2017. This is not a big deal as I’ll never have to repeat the process on the old files, as new data will be released in separate, new files. Other datasets imposed different requirements; the GPSS survey is packaged in eleven separate files for different surveys, and the updates are cumulative (each file contains older data plus any updates – Gallup sends us updated files a few times each year). For the GPSS, I prompt the user for input to specify the survey file name, and overwrite the previous Excel file.

With the do file in hand, you open STATA and the data file you want to process, change the working directory from the default user folder to a better location for storing the output, open the do file, and it runs and creates the variable list spreadsheet.

Excel spreadsheet of variables generated from STATA — List of variables in Excel generated from STATA file. Users check the variables they want in an extract in the select_vars column

Create a STATA Do File with Python and Excel

Once a researcher submits their Google form and their selected variable spreadsheet (placing an X in a dedicated column to indicate that they want to include a variable), I run the Python script below. I use the openpyxl module to read the Excel file. I have to modify the paths, spreadsheet file name, and an integer for the particular survey each time I run it. I use the os module to navigate up and down through folders to store outputs in specific places. If the researcher specifies in the Google form that they want to filter observations, for example records for specific states or age ranges, I have to add those manually but I commented out a few examples that I can copy and modify. One caveat is that you must filter using the coded variable and not its label (i.e. if a month value is coded as 2 and its label is February, I must reference the code and not the label). Reading in the requested columns is straightforward; the script identifies cells in the selection column (E) that have an X, then grabs the variable name from the adjacent column.

# -*- coding: utf-8 -*-
"""
Pull selected gallup variables from spreadsheet to create STATA Do File
Frank Donnelly / GIS and Data Librarian / Brown University
"""

import openpyxl as xl, os
from datetime import date

thedate=date.today().strftime("%m%d%Y")
surveys={1:'gallup_covid',2:'gallup_gpss',3:'gallup_tracker',4:'gallup_world'}

rpath=os.path.join('requests','test') # MODIFY BASED ON INPUT
select_file=os.path.join(rpath,'gallup_tracker_2017_vars_TEST.xlsx') #MODIFY BASED ON INPUT
survey_file=surveys[3] #MODIFY BASED ON INPUT

dofile=os.path.join(rpath,'{}_vars_{}.do'.format(survey_file,thedate))
dtafile=os.path.join(os.path.abspath(os.getcwd()),rpath,'{}_extract_{}.dta'.format(survey_file,thedate))


#MODIFY to filter by observations - DO NOT ERASE EXAMPLES - copy, then modify
obsfilter=None
# obsfilter=None
# obsfilter='keep if inlist(STATE_NAME,"CT","MA","ME","NH","RI","VT")'
# obsfilter='keep if inrange(WP1220,18,64)'
# obsfilter='keep if SC7==2 & MONTH > 6'
# obsfilter='keep if FIPS_CODE=="44007" | FIPS_CODE=="25025"'

workbook = xl.load_workbook(select_file)
ws = workbook['Sheet1']

# May need to modify ws col and cell values based on user input
vlist=[]
for cell in ws['E']:
    if cell.value in ('x','X'): 
        vlist.append((ws.cell(row=cell.row, column=2).value))
outfile = open(dofile, "w")
outfile.writelines('keep ')
outfile.writelines(" ".join(vlist)+"\n")
if obsfilter==None:
    pass
else:
    outfile.writelines(obsfilter+"\n")
outfile.writelines('save '+dtafile+"\n")
outfile.close()
print('Created',dofile)

The plain text do file begins with the command keep followed by the columns, and if requested, an additional keep statement to filter by records. The final save command will direct the output to a specific location.

keep CENREG D17A D23 D24 D5 FIPS_CODE HISPANIC INT_DATE MONTH MOTHERLODE_ID PE_WEIGHT RACE SC7 STATE_NAME WP10202 WP10208 WP10209 WP10215 WP10216 WP10229 WP10230 WP1220 WP1223 YEAR ZIPGALLUPREGION ZIPSTATE
save S:\gallup\processing\scripts\reques\test\gallup_tracker_extract_02202022.dta

All that remains is to open the requested data file in STATA, open the do file, and an extract is created. Visit my GitHub for the do files, Python script, and sample output. The original source data and the variable spreadsheets are NOT included due to licensing issues; if you have the original data files you can generate what I’ve described here. Sorry, I can’t share the Gallup data with you (so please don’t ask). You’ll need to contact your own university or institution to determine if you have access.

At These Coordinates

Dispatches from the Geospatial Data World