pandas

Python Spyder IDE

Python Tips for Somewhat Bigger Data

I’m fortunate to be on sabbatical for much of this summer, and am working on a project where I’m measuring the effectiveness of comparing census American Community Survey estimates over time. I’ve written a lot of Python code over the past six weeks, and thought I’d share some general tips for working with bigger datasets.

For my project, I’m looking at 317 variables stored in 25 tables for over 406,000 individual geographic areas; approximately 129.5 million data points. Multiply that by two, as I’m comparing two time periods. While this wouldn’t fall into the realm of what data scientists would consider as ‘big data’, it is big enough that you have to think strategically about how handle it, so you don’t run out of memory or have to wait hours while tasks grind away. While you could take advantage of parallel processing, or find access to a high performance computer, with this amount of data you can stick with a decent laptop, if you take steps to ensure that it doesn’t go kaput.

While the following suggestions may seem obvious to experienced programmers, it should be helpful to novices. I work with a lot of students whose exposure to Python programming is using Google Colab with Pandas. While that’s a fine place to start, the basic approaches you learn in an intro course will fall flat once you start working with datasets that are this big.

  • Don’t use a notebook. Ipython notebooks like Jupyter or Colab are popular, and are great for doing iterative analysis, visualization, and annotation of your work. But they run via web browsers which introduce extra overhead memory-wise. Iterative notebooks are unnecessary if you’re processing lots of data and don’t need to see step by step results. Use a traditional development environment instead (Spyder is my favorite – see the pic in this post’s header).
  • Don’t rely so much on Pandas DataFrames. They offer convenience as you can explicitly reference rows and columns, and reading and writing data to and from files is straightforward. But DataFrames can hog memory, and processing them can be inefficient (depending on what you’re doing). Instead of loading all your data from a file into a frame, and then making a copy of it where you filter out records you don’t need, it’s more efficient to read a file line by line and omit records while reading. Appending records to a DataFrame one at a time is terribly slow. Instead, use Python’s basic CSV module for reading and append records to nested lists. When you reach the point where a DataFrame would be easier for subsequent steps, you can convert the nested list to a frame. The basic Python data structures – lists, dictionaries, and sets – give you a lot of power at less cost. Novices would benefit from learning how to use these structures effectively, rather than relying on DataFrames for everything. Case in point: after loading a csv file with 406,000 records and 49 columns into a Pandas DataFrame, the frame consumed 240 MB of memory. Loading that same file with the csv module into a nested list, the list consumed about 3 MB.

Reads a file, skips the header row, adds a key / value pair to a dictionary for each row using the first and second value (assuming the key value is unique).

import os, csv

keep_ids={}
with open(recskeep_file,'r') as csv_file:
    reader=csv.reader(csv_file,delimiter='\t')
    next(reader)
    for row in reader:        
        keep_ids[row[0]]=row[1]

Or, save all the records as a list in a nested list, while keeping the header row in a separate list.

records=[]
with open(recskeep_file,'r') as csv_file:
    reader=csv.reader(csv_file,delimiter='\t')
    header=next(reader)
    for row in reader:        
        records.append(row)]
  • Delete big variables when you’re done with them. The files I was reading were segmented in twos: one file for estimates, and one for margins of error for those same estimates. I read each into separate, nested lists while filtering for records I wanted. I had to associate each set with a header row, filter by columns, and then join the two together. Arguably that part was easier to do with DataFrames, so at that stage I read both into separate frames, filtered by column, and joined the two. Once I had the joined frame as a distinct copy, I deleted the two individual frames to save memory.
  • Take out the garbage. Python automatically frees up memory when it can, but you can force the issue by emptying deleted objects from memory by calling the garbage collection module. After I deleted the two DataFrames in the previous step, I explicitly called gc.collect() to free up the space.
...
del est_df
del moe_df
gc.collect()
  • Write as you read. There’s no way I could read all my data in and hold it in memory before writing it all out. Instead I had to iterate – in my case the data is segmented by data tables, which were logical collections of variables. After I read and processed one table, I wrote it out as a file, then moved on to the next one. The variable that held the table was overwritten each time by the next table, and never grew in size beyond the table I was actively processing.
  • Take a break. You can use the sleep module to build in brief pauses between big operations. This can give your program time to “catch up”, finishing one task and freeing up some juice before proceeding to the next one.
time.sleep(3)
  • Write several small scripts, not one big one. The process for reading, processing, and writing my files was going to be one of the longer processes that I had to run. It’s also one that I’d likely not have to repeat if all went well. In contrast, there were subsequent analytical tasks that I knew would require a lot of back and forth, and revision. So I wrote several scripts to handle individual parts of the process, to avoid having to repeat a lot of long, unnecessary tasks.
  • Lean on a database for heavy stuff. Relational databases can handle large, structured data more efficiently compared to scripts reading data from text files. I installed PostgreSQL on my laptop to operate as a localhost database server. After I created my filtered, processed CSV files, I wrote a second program that loaded them into the database using Psycopg2, a Python module that interacts with PostgreSQL (this is a good tutorial that demonstrates how it works). SQL statements can be long, but you can use Python to iteratively write the statements for you, by building strings and filling placeholders in with the format method. This gives you two options. Option 1, you execute the SQL statements from within Python. This is what I did when I loaded my processed CSV files; I used Python to iterate and read the files into memory, wrote CREATE TABLE and INSERT statements in the script, and then inserted the data from Python’s data structures into the database. Option 2, is you can use Python to write a SQL transaction statement, save the transaction as a SQL text file, and then load it in the database and run it. I followed this approach later in my process, where I had to iterate through two sets of 25 tables for each year, and perform calculations to create a host of new variables. It was much quicker to do these operations within the database rather than have Python do them, and executing the SQL script as a separate process made it easier for me to debug problems.

Connect to a database, save SQL statement as a string, loops through a list of variables IDs, and for each variable format the string by passing the values in as parameters, execute the statement and fetch the result – fetchone() in this case, but could also fetchmany():

# Database connection parameters
pgdb='acs_timeseries'
pguser='postgres'
pgpswd='password'
pgport='5432'
pghost='localhost'
pgschema='public'

conpg = psycopg2.connect(database=pgdb, user=pguser, password=pgpswd,
                             host=pghost, port=pgport)
curpg=conpg.cursor()

sql_varname="SELECT var_lbl from acs{}_variables_mod WHERE var_id='{}'"
year='2019'

for v in varids:
    # Get labels associated with variables
    qvarname=sql_varname.format(year, v)
    curpg.execute(qvarname)
    vname=curpg.fetchone()[0]
... #do stuff...

curpg.close()
  • When using Psycopg2, don’t use the executemany() function. When performing an INSERT statement, you can have the module executeone() statement at a time, or executemany(). But the latter was excruciatingly slow – in my case it ran overnight before it finished. Instead I found this trick called mogrify, where you convert your INSERT arguments into one enormous string, and pass that to the mogrify() function. This was lightning fast, but because the text string is massive I ran out of memory if my tables were too big. My solution was to split tables in half if the number of columns exceeded a certain number, and pass them in one after the other.
  • Use the database and script for what they do best. Once I finished my processing, I was ready to begin analyzing. I needed to do several different cross-tabulations on the entire dataset, which was segmented into 25 tables. PostgreSQL is able to summarize data quickly, but it would be cumbersome to union all these tables together, and calculating percent totals in SQL for groups of data is a pain. Python with Pandas would be much better at the latter, but there’s no way I could load a giant flat file of my data into Python to use as the basis for all my summaries. So, I figured out the minimal level of grouping that I would need to do, which would still allow me to run summaries on the output for different combinations of groups (i.e. in total and by types of geography, tables, types of variables, and by variables). I used Python to write and execute GROUP BY statements in the database, iterating over each table and appending the result to a nested list, where one record represented a summary count for a variable by table and geography type. This gave me a manageable number of records. Since the GROUP BY operation took some time, I did that in one script to produce output files. Creating different summaries and reports was a more iterative process that required many revisions, but was quick to execute, so I performed those operations in a subsequent script.
SQL GROUP BY Output
Instead of 386 mil records for (406k geographies * 317 variables * 3 categories), about 18k summary counts for 19 groups of geography

Lastly, while writing and perfecting your script, run it against a sample of your data and not the entire dataset! This will save you time and needless frustration. If I have to iterate through hundreds of files, I’ll begin by creating a list that has a couple of file names in it and iterate over those. If I have a giant nested list of records to loop through, I’ll take a slice and just go through the first ten. Once I’m confident that all is well, then I’ll go back and make changes to execute the program on everything.

GeoPandas Python Script in Spyder IDE

Basic Geospatial Python with GeoPandas

Last month I cobbled together bits and pieces of geospatial Python code I’ve written in various scripts into one cohesive example. You can script, automate, and document a lot of GIS operations with Python, and if you use a combination of Pandas, GeoPandas, and Shapely you don’t even need to have desktop GIS software installed (packages like ArcPy and PyQGIS rely on their underlying base software).

I’ve created a GitHub repository that contains sample data, a basic Python script, and a Jupyter Notebook (same code and examples, in two different formats). The script covers these fundamental operations: reading shapefiles into a geodataframe, reading coordinate data into a dataframe and creating geometry, getting coordinate reference system (CRS) information and transforming the CRS of a geodataframe, generating line geometry from groups and sequences of points, measuring length, spatially joining polygons and points to assign the attributes of one to the other, plotting geodataframes to create a basic map, and exporting geodataframes out as shapefiles.

A Pandas dataframe is a Python structure for tabular data that allows you to store and manipulate data in rows and columns. Like a database, Pandas columns are assigned explicit data types (text, integers, decimals, dates, etc). A GeoPandas geodataframe adds a special geometry column for holding and manipulating coordinate data that’s encoded as point, line, or polygon objects (either single or multi). Similar to a spatial database, the geometry column is referenced with standard coordinate reference system definitions, and there are many different spatial functions that you can apply to the geometry. GeoPandas allows you to work with vector GIS datasets; there are wholly different third-party modules for working with rasters (Rasterio for instance – see this post for examples).

First, you’ll likely have to install the packages Pandas, GeoPandas, and Shapely with pip or your distro’s package handler. Then you can import them. The Shapely package is used for building geometry from other geometry. Matplotlib is used for plotting, but isn’t strictly necessary depending on how detailed you want your plots to be (you could simply use Panda’s own plot library).

import os, pandas as pd
import geopandas as gpd
from shapely.geometry import LineString
import matplotlib.pyplot as plt
%matplotlib inline

Reading a shapefile into a geodataframe is a piece of cake with read_file. We use path.join from the os module to build paths that work in any operating system. Reading in a polygon file of Rhode Island counties:

county_file=os.path.join('input','ri_county_bndy.shp')
gdf_cnty=gpd.read_file(county_file)
gdf_cnty.head()
GeoDataframe of County Polygons

If you have coordinate data in a CSV file, there’s a two step process where you load the coordinates as numbers into a dataframe, and then convert the dataframe and coordinates into a geodataframe with actual point geometry. Pandas / GeoPandas makes assumptions about the column types when you read a CSV, but you have the option to explicitly define them. In this example I define the Census Bureau’s IDs as strings to avoid dropping leading zeros (an annoying and perennial problem). The points_from_xy function takes the longitude and latitude (in that order!) and creates the points; you also have to define what system the coordinates are presently in. This sample data came from the US Census Bureau, so they’re in NAD 83 (EPSG 4269) which is what most federal agencies use. For other modern coordinate data, WGS 84 (EPSG 4326) is usually a safe bet. GeoPandas relies on the EPSG / ESRI CRS library, and familiarity with these codes is a must for working with spatial data.

point_file=os.path.join('input','test_points.csv')
df_pnts=pd.read_csv(point_file, index_col='OBS_NUM', delimiter=',',dtype={'GEOID':str})

gdf_pnts = gpd.GeoDataFrame(df_pnts,geometry=gpd.points_from_xy(
df_pnts['INTPTLONG'],df_pnts['INTPTLAT']),crs = 'EPSG:4269')
gdf_pnts

In the output below, you can see the distinction between the coordinates, stored separately in two numeric columns, and point-based geometry in the geometry column. The sample data consists of eleven point locations, ten in Rhode Island and one in Connecticut, labeled alfa through kilo. Each point is assigned to a group labeled a, b, or c.

Geodataframe of Point Features

You can obtain the CRS metadata for a geodataframe with this simple command:

gdf_cnty.crs
CRS Information for GeoDataframe

You can also get the bounding box for the geometry:

gdf_cnty.total_bounds

These commands are helpful for determining whether different geodataframes share the same CRS. If they don’t, you can transform the CRS of one to match the other. The geometry in the frames must share the same CRS if you want to interact with the data. In this example, we transform our points from NAD 83 to the RI State Plane zone that the counties are in with to_crs; the EPSG code is 3438.

gdf_pnts.to_crs(3438,inplace=True)
gdf_pnts.crs

If our points represent a sequence of events, we can do a points to lines operation to create paths. In this example our points are ordered in the correct sequence; if this was not the case, we’d sort the frame on a sequence column first. If there are different events or individuals in the table that have an identifying field, we use this as the group field to create distinct lines. We use lambda to repeat Shapely’s LineString function across the points to build the lines, and then assign them to a new geodataframe. Then we add a column where we compute the length of the lines; this RI CRS uses feet for units, so we divide by 5,280 feet to get miles. The Panda’s loc function grabs all the rows and a subset of the columns to display them on the screen (we could save them to a new geodataframe if we wanted to subset rows or columns).

lines = gdf_pnts.groupby('GROUP')['geometry'].apply(lambda x: LineString(x.tolist()))
gdf_lines = gpd.GeoDataFrame(lines, geometry='geometry',crs = 'EPSG:3438').reset_index()
gdf_lines['length_mi']=(gdf_lines.length)/5280
gdf_lines.loc[:,['GROUP','length_mi']]
New GeoDataFrame with Line Geometry

To assign every point the attributes of the polygon (county) that it intersects with , we do a spatial join with the sjoin function. Here we take all attributes from the points frame, and a select number of columns from the polygon frame; we have to take the geometry from both frames to do the join. In this example we do a left join, keeping all the points on the left regardless of whether they have a matching polygon on the right. There’s one point that falls oustide of RI, so it will be assigned null values on the right. We rename a few of the columns, and use loc again to display a subset of them to the screen.

gdf_pnts_wcnty=gpd.sjoin(gdf_pnts, gdf_cnty[['geoid','namelsad','geometry']],
how='left', predicate='intersects')
gdf_pnts_wcnty.rename(columns={'geoid': 'COUNTY_ID', 'namelsad': 'COUNTY'}, inplace=True)
gdf_pnts_wcnty.loc[:,['OBS_NAME','OBS_DATE','COUNTY']]
GeoDataframe Created from Spatial Join

To see what’s going on, we can generate a basic plot to display the polygons, points, and lines. I used matplotlib to create a figure and axes, and then placed each layer one on top of the other. We could opt to simply use Pandas / GeoPandas internal plotting instead as illustrated in this tutorial, which works for basic plots. If we want more flexibility or need additional functions we can call on matplotlib. In this example the default placement for the tick marks (coordinates in the state plane system) was bad, and the only way I could fix them was by rotating the labels, which required matplotlib.

fig, ax = plt.subplots()
plt.xticks(rotation=315)
gdf_cnty.plot(ax=ax, color='yellow', edgecolor='grey')
gdf_pnts.plot(ax=ax,color='black', markersize=5)
gdf_lines.plot(ax=ax, column="GROUP", legend=True)
Basic Map Plot

Exporting the results out a shapefiles is also pretty straightforward with to_file. Shapefiles come with many limitations, such as a limit on ten characters for column names. You can opt to export to a variety of other vector formats, such a geopackages or geoJSON.

out_points=os.path.join('output','test_points_counties.shp')
out_lines=os.path.join('output','test_lines.shp')

gdf_pnts_wcnty.to_file(out_points)
gdf_lines.to_file(out_lines)

Hopefully this little intro will help get you started with using geospatial Python with GeoPandas. Happy New Year!

Best – Frank

SQL View

SQL Views to Excel and Back with Pandas

I had lists of businesses that I queried from a large table and saved in individual views in SQLite, where each view contained related businesses based on their industrial classification code (NAICS). There were about 8,000 records in total. Another team needed to review these records and verify whether we needed to keep them in the study or not. The simplest approach was to segment the businesses based on activity, grab a subset of the necessary columns from the main table into a SQL view, and export them to individual Google Sheets so that everyone could access and edit the files. When they were finished, I had to re-aggregate the sheets and get them back into the database, to use a filter for records to keep. I wrote two python / pandas scripts for doing this, which I’ll walk through here.

Since I had already written and saved SQL views in the database (see sample image in the post’s header), I wanted to simply access those using pandas, rather than having to write the queries all over again in pandas. My solution is below. At the top I establish variables that specify file names and paths using the os module. I have an Excel file that will serve as my template; it contains one metadata README sheet that will be the same each year. Next, I create a list of the views, plus a list of new columns that I want to add to each sheet that the team will use for verifying the records. Since this is a process I will need to run each year, I provide the year as a variable and insert it into the output files and the view names rather than hard coding it. For example, ‘convenience_stores’ is formatted to ‘v_2023_convenience_stores’ to retrieve the current view from the database.

The work happens in the loop. I iterate through the list of views, and build a query string where I insert the view name. I use pandas.read_sql to execute a SELECT statement, and the result is saved in a dataframe; the result is essentially the result of the view when its executed. Then I iterate through a list of new columns that the reviewers will use, inserting them one by one. They will appear at the front of the worksheet, in the reverse order in which they appear in the list. I use pandas.ExcelWriter and the append mode so I can insert multiple sheets into the workbook template. And that’s it!

import sqlite3, os, pandas as pd

# CHANGE THE YEAR VARIABLE to reflect year we are processing
year='2022' # must be a string - quote!
outfolder='yr{}'.format(year)
vsuffix='v_{}_'.format(year)
outfile='business_lists_{}.xlsx'.format(year)

outpath=os.path.join('business_output',outfolder,outfile)
con = sqlite3.connect('project_db.sqlite') 

# views within the database that contain business lists
views=['convenience_stores','department_stores','drinking_places',
       'food_manufacturing', 'gas_stations', 'grocery_stores',
       'liquor_stores','pharmacies','restaurants',
       'specialty_food_stores', 'variety_stores','wholesale_clubs']

# blank columns to insert in each sheet to hold verification
newcols=['notes','maps_verified','recategorize','remove']

for v in views:
    vname=vsuffix+v # creates the actual name of the view in the db
    query='SELECT * FROM {}'.format(vname)
    df=pd.read_sql(query, con)
    for n in newcols:
        df.insert(0,n,'')
    with pd.ExcelWriter(outpath, mode='a') as writer:  
        df.to_excel(writer, sheet_name=v, index=False)
    print('Wrote',v,'to output')

print('Done')
con.close()

The final step is to upload the Excel workbook into Google Sheets, and then manually apply some formatting. I looked at some options for writing to Google Sheets directly and skipping Excel as an intermediary, but decided that it looked like more trouble than it was worth. You can’t trust that Google isn’t going to suddenly change something without notice, so this intermediary approach seemed safer.

Once the records have been verified, I needed to combine these sheets into one file and get them back into the database again, where I can use the results to filter the original business table and pull the records we want to keep. My solution for this part is below.

First, I download the finished Google spreadsheet as an Excel file, and provide that as input. Again, I set up input and output paths at the top. I use pandas.read_excel to read the sheets into a dictionary, where the key is the name of the sheet and the value is a dataframe that contains everything in that sheet. I loop through the dictionary, skip the metadata README sheet, and create a list of the dataframes where I add the name of the sheet as a dedicated column. Next, I compare the column names and number of columns in the first dataframe / sheet to each of the others to ensure they are the same in terms or order, name, and number. Lastly, I concatenate all the sheets into one and write them out to a CSV file.

import os, csv, pandas as pd

# CHANGE THE YEAR VARIABLE to reflect year we are processing
year='2022' # must be a string - quote!
folder='yr{}'.format(year)
infile='business_lists_{}.xlsx'.format(year)
outfile='checked_biz_{}.csv'.format(year)

inpath=os.path.join(folder,infile)
outpath=os.path.join(folder,outfile)

# Read sheets to dict, key sheet name and value df
# read all vals as strings to preserve ID codes
sheets_dict = pd.read_excel(inpath, sheet_name=None, dtype=str)

all_sheets_dfs = [] # a list of dataframes, one df per worksheet
for name, sheet in sheets_dict.items():
    if name !='README': # don't include the readme sheet
        sheet['biz_category'] = name # add the sheet name to the data
        all_sheets_dfs.append(sheet)

# This block checks number of columns and names of all sheets against the first one
f=all_sheets_dfs[0]
for i,s in enumerate(all_sheets_dfs):
    check_cols = (s.columns == f.columns).all() and s.shape[1] == f.shape[1]
    if check_cols is False:  
        print('Warning: difference in column names or number between first worksheet and number:',i)
    else:
        pass

# Block creates single dataframe of all records and writes to CSV
biz_df = pd.concat(all_sheets_dfs)
biz_df.reset_index(inplace=True, drop=True)
biz_df.to_csv(outpath, index=True, index_label='pid')

print('Done, record count:',len(biz_df))

With that, I can launch the database (using the DB Browser for SQLite), import that CSV to a table, and proceed to join it back to my original table and filter. Alternatively, I could have written the concatenated dataframe directly into the database, but in a pinch this works fine. It’s been a hectic semester and as soon as I get something working I polish it off and move on to the next thing…

Folium Map Zoomed In

Folium Maps with Geopandas and Random Color Schemes

In this post I’ll demonstrate one method for assigning random colors to features on a Folium map, with specific consideration for plotting features from a geopandas geodataframe. I’m picking up where I left off in my previous post, where we plotted routes from a file of origins and destinations using the OpenRouteService. I wrote a regular Python script where I made a simple plot of the points and lines using geopanda’s plot, but in a Jupyter Notebook version I went a step further and used Folium to plot the features on the OpenStreetMap.

Folium is a Python port of Leaflet, a popular javascript library for making basic interactive web maps. I’m not going to cover the basics in this post, so for starters you can view Folium’s documentation: there’s a basic getting started tutorial, and a more detailed user guide. Geopanda’s documentation includes Folium examples that illustrate how you can work with the two packages in tandem.

One important concept to grasp, is that many of the basic examples assume you are working with latitude and longitude coordinates that are either hard-coded as variables, stored in lists, or stored in dedicated columns in a dataframe as attributes. But if you are working with actual geometry stored in a geodataframe, to display features you need to use the folium.GeoJson function instead. For example, this tutorial illustrates how to render points stored in a dedicated geometry column, and not as coordinates stored as numeric attributes in separate columns.

In my example, I have the following geodataframe, where each record is a route with linear geometry, with a sequential integer as the index value:

All I wanted to do was assign random colors to each line to depict them differently on the map – and I could not find a straightforward way to do that! In odd contrast, making a choropleth map (i.e. values shaded by quantity) seems easy. I found a couple posts on stack exchange, here and here, that demonstrated how to create and assign colors by randomly creating hexadecimal color strings. Another post illustrated how to assign specific non-random colors, and this thorough blog post walked through assigning colors to categorical data (where each category gets a distinct color).

I pooled aspects of the last two solutions together to come up with the following:

# Get colors for lines
gdfcount=len(gdf) # number of routes
colors=['red','green','blue','gray','purple','brown']
clist=[] # list of colors, one per route
c=0
for i in range(gdfcount):
    clist.append(colors[c])
    c=c+1
    if c>len(colors)-1:
        c=0 # if we run out of colors, start over
color_series = pd.Series(clist,name='color') # create series in order to...
gdf_c=pd.merge(gdf, color_series, left_index=True,right_index=True) # join to routes on seq index #
gdf_c

I get the number of features in my geodataframe, and create a list with a set number of colors. I iterate through my colors list, generating a new list with one color for each record in my frame. If I cycle through all the colors, I reset a counter and cycle through again from the beginning. When finished, I turn the list into a pandas series, which generates a sequential integer index paired with the color. Then I merge the series with my geodataframe (called gdf) using the sequential index, so the color is now part of each record:

Then I can create the map. When creating a Folium map you need to provide a location on which the map will be centered. I used Geopanda’s total_bounds method to get the bounding box of my features, which returns a list of coordinates: min X, min Y, max X, max Y. I sum the appropriate coordinates for X and Y and divide each by two, which gives me the coordinate pair for the center of the bounding box.

bnds=gdf.total_bounds 
clong=(bnds[0]+bnds[2])/2
clat=(bnds[1]+bnds[3])/2

Next, I generate the map using those coordinates (the name of the map object in the example is “m”). I create a popup menu first, specifying which fields from the geodataframe to display when you click on a line. Then I add the actual geodataframe to the map, applying a style_function parameter to apply the colors and popup to each feature. The lambda business is necessary here, apparently when you’re applying styles that differ for each feature you have to iterate over the features. I guess this is the norm when you’re styling a GeoJSON file (but is at odds with how you would normally operate in a dataframe or GIS environment).

m = folium.Map(location=[clat,clong], tiles="OpenStreetMap")
popup = folium.GeoJsonPopup(
    fields=["ogn_name", "dest_name","distance","travtime"],
    localize=True,
    labels=True)
folium.GeoJson(gdf_c,style_function=lambda x: {'color':x['properties']['color']},popup=popup).add_to(m)

Once the lines are added to the map, we can continue to add additional features. I use folium.GeoJson twice more to plot my origin and destination points. It took me quite a bit of searching around to get the syntax right, so I could change the icon and color. This tutorial helped me identify the possibilities, but again the basic iteration examples don’t work if you’re using GeoJSON or geometry in a geodataframe. If you’re assigning several different colors you have to apply lambda, as in this example in the Folium docs. In my case I just wanted to change the default icon and color and apply them uniformly to all symbols, so I didn’t need to iterate. Once you’ve added everything to the map, you can finally display it!

folium.GeoJson(ogdf,marker=folium.Marker(icon=folium.Icon(icon='home',color='black'))).add_to(m)
folium.GeoJson(dgdf,marker=folium.Marker(icon=folium.Icon(icon='star',color='lightgray'))).add_to(m)
m.fit_bounds(m.get_bounds()) # zoom to bounding box
m

A static screenshot of the result follows, as I can’t embed Folium maps into my WordPress site. Everything works perfectly when using the notebook locally, but GitHub won’t display the map either, even if the Notebook is saved as a trusted source. If you open the notebook in the nbviewer, it renders just fine (scroll to the bottom of the notebook to see the map and interact).

Undoubtedly there are other options for achieving this same result. I saw examples where people had embedded all the styles they needed within a dedicated dataframe column as code and applied them, or wrote a function for applying the styles. I have rarely worked with javascript or coded my own web maps, so I expect there may be aspects of doings things in that world that were ported over to Folium that aren’t intuitive to me. But not being able to readily apply a random color scheme is bizarre, and is a big shortcoming of the module.

I’d certainly benefit from a more formal introduction to Folium / Leaflet, as scattered blog posts and stack exchange solutions can only take you so far. But here’s hoping that this post has added some useful scraps to your knowledge!

Route from SciLi to Libraries on OSM

Plotting Routes with OpenRouteService and Python

I made my first foray into network routing recently, and drafted a python script and notebook that plots routes using the OpenRouteService (ORS) API. ORS is based on underlying data from the OpenStreetMap (OSM), and was created by the Heidelberg Institute for Geoinformation Technology, at Heidelberg University in Germany. They publish several routing APIs that include directions, isochrones, distance matricies, geocoding, and route optimization. You can access them via a basic REST API, but they also have a dedicated Python wrapper and an R package which makes things a bit easier. For non-programmers, there is a plugin for QGIS.

Regardless of which tool you use, you need to register for an API key first. The standard plan is free for small projects; for example you can make 2,000 direction requests per day with a limit of 40 per minute. If you’re affiliated with higher ed, government, or a non-profit and are doing non-commercial research, you can upgrade to a collaborative plan that ups the limits. It’s also possible to install OSR locally on your own server for large jobs.

I opted for Python and used the openrouteservice Python module, in conjunction with other geospatial modules including geopandas and shapely. In my script / notebook I read in two CSV files, one with origins and the other with destinations. At minimum both files must contain a header row, and attributes for unique identifier, place label, longitude, and latitude in the WGS 84 spatial reference system. The script plots a route between each origin and every destination, and outputs three shapefiles that include the origin points, destination points, and routes. Each line in the route file includes the ID and names of each origin and destination, as well as distance and travel time. The script and notebook are identical, except that the script plots the end result (points and lines) using geopanda’s plot function, while the Jupyter Notebook plots the results on a Folium map (Folium is a Python implementation of the popular Leaflet JS langauge).

Visit the GitHub repo to access the scripts; a basic explanation with code snippets follows.

After importing the modules, you define several variables that determine the output, including a general label for naming the output file (routename), and several parameters for the API including the mode of travel (driving, walking, cycling, etc), distance units (meters, kilometers, miles), and route preference (fastest or shortest). Next, you provide the positions or “column” locations of attributes in the origin and destination CSV files for the id, name, longitude, and latitude. Lastly, you specify the location of those input files and the file that contains your API key. The location and names of output files are generated automatically based on the input; all will contain today’s date stamp, and the route file name includes route mode and preference. I always use the os module’s path function to ensure the scripts are cross-platform.

import openrouteservice, os, csv, pandas as pd, geopandas as gpd
from shapely.geometry import shape
from openrouteservice.directions import directions
from openrouteservice import convert
from datetime import date
from time import sleep

# VARIABLES
# general description, used in output file
routename='scili_to_libs'
# transit modes: [“driving-car”, “driving-hgv”, “foot-walking”, “foot-hiking”, “cycling-regular”, “cycling-road”,”cycling-mountain”, “cycling-electric”,]
tmode='driving-car'
# distance units: [“m”, “km”, “mi”]
dunits='mi'
# route preference: [“fastest, “shortest”, “recommended”]
rpref='fastest'

# Column positions in csv files that contain: unique ID, name, longitude, latitude
# Origin file
ogn_id=0
ogn_name=1
ogn_long=2
ogn_lat=3
# Destination file
d_id=0
d_name=1
d_long=2
d_lat=3

# INPUTS and OUTPUTS
today=str(date.today()).replace('-','_')

keyfile='ors_key.txt'
origin_file=os.path.join('input','origins.csv') #CSV must have header row
dest_file=os.path.join('input','destinations.csv') #CSV must have header row
route_file=routename+'_'+tmode+'_'+rpref+'_'+today+'.shp'
out_file=os.path.join('output',route_file)
out_origin=os.path.join('output',os.path.basename(origin_file).split('.')[0]+'_'+today+'.shp')
out_dest=os.path.join('output',os.path.basename(dest_file).split('.')[0]+'_'+today+'.shp')

I define some general functions for reading the origin and destination files into nested lists, and for taking those lists and generating shapefiles out of them (by converting them to geopanda’s geodataframes). We read the origin and destination data in, grab the API key, set up a list to hold the routes, and create a header for the eventual output.

# For reading origin and dest files
def file_reader(infile,outlist):
    with open(infile,'r') as f:
        reader = csv.reader(f)    
        for row in reader:
            rec = [i.strip() for i in row]
            outlist.append(rec)
            
# For converting origins and destinations to geodataframes            
def coords_to_gdf(data_list,long,lat,export):
    """Provide: list of places that includes a header row,
    positions in list that have longitude and latitude, and
    path for output file.
    """
    df = pd.DataFrame(data_list[1:], columns=data_list[0])
    longcol=data_list[0][long]
    latcol=data_list[0][lat]
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df[longcol], df[latcol]), crs='EPSG:4326')
    gdf.to_file(export,index=True)
    print('Wrote shapefile',export,'\n')
    return(gdf)
      
origins=[]
dest=[]
file_reader(origin_file,origins)
file_reader(dest_file,dest)

# Read api key in from file
with open(keyfile) as key:
    api_key=key.read().strip()

route_count=0
route_list=[]
# Column header for route output file:
header=['ogn_id','ogn_name','dest_id','dest_name','distance','travtime','route']

Here are some nested lists from my sample origin and destination CSV files:

[['origin_id', 'name', 'long', 'lat'], ['0', 'SciLi', '-71.4', '41.8269']]
[['dest_id', 'name', 'long', 'lat'],
 ['1', 'Rock', '-71.405089', '41.825725'],
 ['2', 'Hay', '-71.404947', '41.826433'],
 ['3', 'Orwig', '-71.396609', '41.824581'],
 ['4', 'Champlin', '-71.408194', '41.818912']]

Then the API call begins. For every record in the origin list, we iterate through each record in the destination list (in both cases starting at index 1, skipping the header row) and calculate a route. We create a tuple with each pair of origin and destination coordinates (coords), which we supply to the OSR directions API. We pass in the parameters supplied earlier, and specify instructions as False (instructions are the actual turn by turn directions returned as text).

The result is returned as a JSON object, which we can manipulate like a nested Python dictionary. At the first level in the dictionary, we have three keys and values: a bounding box for the route area with a list value, metadata with a dictionary value, and routes with a list value. Dive into route, and the list contains a single dictionary, and inside that dictionary – more dictionaries that contain the values we want!

1st level, dictionary with three keys, the routes key has a single list value
2nd level, the routes list has a single element, another dictionary
3rd level, inside the dictionary in that list element, four keys with route data

The next step is we extract the values that we need from this container by specifying their location. For example, the distance value is inside the first list of routes, inside summary and inside distance. Travel time is in a similar spot, and we take an extra step of dividing by 60 to get minutes instead of seconds. The geometry is trickier as its encoded in some binary line format. We use OSR’s decoding function to turn it into plain text, and shapely to convert it into WKT text; we’ll need WKT in order to get the geometry into a geodataframe, and eventually output as a shapefile. Once we have the bits we need, we string them together as a list for that origin / destination pair, and append this to our route list.

# API CALL
for ogn in origins[1:]:
    for d in dest[1:]:
        try:
            coords=((ogn[ogn_long],ogn[ogn_lat]),(d[d_long],d[d_lat]))
            client = openrouteservice.Client(key=api_key) 
            # Take the returned object, save into nested dicts:
            results = directions(client, coords, 
                                profile=tmode,instructions=False, preference=rpref,units=dunits)
            dist = results['routes'][0]['summary']['distance']
            travtime=results['routes'][0]['summary']['duration']/60 # Get minutes
            encoded_geom = results['routes'][0]['geometry']
            decoded_geom = convert.decode_polyline(encoded_geom) #convert from binary to txt
            wkt_geom=shape(decoded_geom).wkt #convert from json polyline to wkt
            route=[ogn[ogn_id],ogn[ogn_name],d[d_id],d[d_name],dist,travtime,wkt_geom]
            route_list.append(route)
            route_count=route_count+1
            if route_count%40==0: # API limit is 40 requests per minute
                print('Pausing 1 minute, processed',route_count,'records...')
                sleep(60)
        except Exception as e:
            print(str(e))
            
api_key=''
print('Plotted',route_count,'routes...' )

Here is some sample output for the final origin / destination pair, which contains the IDs and labels for the origin and destination, distance in miles, time in minutes, and a string of coordinates that represents the route:

['0', 'SciLi', '4', 'Champlin', 1.229, 3.8699999999999997,
 'LINESTRING (-71.39989 41.82704, -71.39993 41.82724, -71.39959 41.82727, -71.39961 41.82737, -71.39932 41.8274, -71.39926 41.82704, -71.39924000000001 41.82692, -71.39906000000001 41.82564, -71.39901999999999 41.82534, -71.39896 41.82489, -71.39885 41.82403, -71.39870000000001 41.82308, -71.39863 41.82269, -71.39861999999999 41.82265, -71.39858 41.82248, -71.39855 41.82216, -71.39851 41.8218, -71.39843 41.82114, -71.39838 41.82056, -71.39832 41.82, -71.39825999999999 41.8195, -71.39906000000001 41.81945, -71.39941 41.81939, -71.39964999999999 41.81932, -71.39969000000001 41.81931, -71.39978000000001 41.81931, -71.40055 41.81915, -71.40098999999999 41.81903, -71.40115 41.81899, -71.40186 41.81876, -71.40212 41.81866, -71.40243 41.81852, -71.40266 41.81844, -71.40276 41.81838, -71.40452000000001 41.81765, -71.405 41.81749, -71.40551000000001 41.81726, -71.40639 41.81694, -71.40647 41.81688, -71.40664 41.81712, -71.40705 41.81769, -71.40725 41.81796, -71.40748000000001 41.81827, -71.40792 41.81891, -71.40794 41.81895)']

Finally, we can write the output. We convert the nested route list to a pandas dataframe and use the header row for column names, and convert that dataframe to a geodataframe, building the geometry from the WKT string, and write that out. In contrast, the origins and destinations have simple coordinates (not in WKT), and we create XY geometry from those coordinates. Writing the geodataframe out to a shapefile is straightforward, but for debugging purposes it’s helpful to see the result without having to launch GIS. We can use geopandas’s plot function to draw the resulting geometry. I’m using the Spyder IDE, which displays plots in a dedicated window (in my example the coordinate labels for the X axis look strange, as the distances I’m plotting are small).

# Create shapefiles for routes
df = pd.DataFrame(route_list, columns=header)
gdf = gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkt(df["route"]),crs = 'EPSG:4326')
gdf.drop(['route'],axis=1,inplace=True) # drop the wkt text
gdf.to_file(out_file,index=True)
print('Wrote route shapefile to:',out_file,'\n')

# Create shapefiles for origins and destinations
ogdf=coords_to_gdf(origins,ogn_long,ogn_lat,out_origin)
dgdf=coords_to_gdf(dest,d_long,d_lat,out_dest)

# Plot
base=gdf.plot(column="dest_id", kind='geo',cmap="Set1")
ogdf.plot(ax=base, marker='o',color='black')
dgdf.plot(ax=base, marker='x', color='red');

In a notebook environment we can employ something like Folium more readily, which gives us a basemap and some basic interactivity for zooming around and clicking on features to see attributes. Implementing this was a more complex than I thought it would be, and took me longer to figure out compared to the routing process. I’ll return to those details in a subsequent post…

In my sample data (output rendered below in QGIS) I was plotting fastest driving distance from the Brown University Sciences Library to the other libraries in our system. Compared to Google or Apple Maps the result made sense, although the origin coordinates I used for the SciLi had an impact on the outcome (assumed you left from the loading dock behind the building as opposed to the front door as Google did, which produces different routes in this area of one-way streets). My real application was plotting distances of hundreds of miles across South America, which went well and was useful for generating different outcomes using fastest or shortest route.

Take a look at the full script in GitHub, or if programming is not your thing check out the QGIS plugin instead (activate in the Plugins menu, search for OSR). Remember to get your API key first.

PRISM Temperature Raster and Test Points Jan 15, 2020

Clipping Rasters and Extracting Values with Geospatial Python

In an earlier post, I described how to summarize and extract raster temperature data using GIS. In this post I’ll demonstrate some alternate methods using spatial Python. I’ll describe some scripts I wrote for batch clipping rasters, overlaying them with point locations, and extracting raster values (mean temperature) at those locations based on attributes of the points (a matching date). I used a number of third party modules, including geopandas (storing vector data in a tabular form), rasterio (working with raster grids), shapely (building vector geometry), matplotlib (plotting), and datetime (working with date data types). Using Anaconda Python, I searched for and added each of these modules via its package handler. I opted for this modular approach instead of using something like ArcPy, because I don’t want the scripts to be wedded to a specific software package. My scripts and sample data are available in GitHub; I’ll add snippets of code to this post for illustration purposes. The repo includes the full batch scripts that I’ll describe here, plus some earlier, shorter, sample scripts that are not batch-based and are useful for basic experimentation.

Overview

I was working with a medical professor who had point observations of where patients lived, which included a date attribute of when they had visited a clinic to receive certain treatment. For the study we needed to know what the mean temperature was on that day, as well as the temperature of each day of the preceding week. We opted to use daily temperature data from the PRISM Climate Group at Oregon State, where you can download a raster of the continental US for a given day that has the mean temperature (degrees Celsius) in one band, at 4km resolution. There are separate files for min and max temperature, as well as precipitation. You can download a year’s worth of data in one go, with one file per date.

Our challenge was that we had thousands of observations than spanned five years, so doing this one by one in GIS wasn’t going to be feasible. A custom script in Python seemed to be the best solution. Each raster temperature file has the date embedded in the file name. If we iterate through the point observations, we could grab its observation date, and using string manipulation grab the raster with the matching date in its file name, and then do the overlay and extraction. We would need to use Python’s datetime module to convert each date to a common format, and use a function to iterate over dates from the previous week.

Prior to doing that, we needed to clip or mask the rasters to the study area, which consists of the three southern New England states (Connecticut, Rhode Island, and Massachusetts). The PRISM rasters cover the lower 48 states, and clipping them to our small study area would speed processing time. I downloaded the latest Census TIGER file for states, and extracted the three SNE states. ArcGIS Pro does have batch clipping tools, but I found they were terribly slow. I opted to write one Python script to do the clipping, and a second to do the overlay and extraction.

Batch Clipping Rasters

I downloaded a sample of PRISM’s raster data that included two full months of daily mean temperature files, from Jan and Feb 2020. At the top of the clipper script, we import all the modules we need, and set our input and output paths. It’s best to use the path.join method from the os module to construct cross platform paths, so we don’t encounter the forward / backward \ slash issues between Mac and Linux versus Windows. Using geopandas I read in the shapefile of the southern New England (SNE) states into a geodataframe.

import os
import matplotlib.pyplot as plt
import geopandas as gpd
import rasterio
from rasterio.mask import mask
from shapely.geometry import Polygon
from rasterio.plot import show

#Inputs
clip_file=os.path.join('input_raster','mask','states_southern_ne.shp')
# new file created by script:
box_file=os.path.join('input_raster','mask','states_southern_ne_bbox.shp') 
raster_path=os.path.join('input_raster','to_clip')
out_folder=os.path.join('input_raster','clipped')

clip_area = gpd.read_file(clip_file)

Next, I create a new geodataframe that represents the bounding box for the SNE states. The total_bounds method provides a list of the four coordinates (west, south, east, north) that form a minimum bounding rectangle for the states. Using shapely, I build polygon geometry from those coordinates by assigning them to pairs, beginning with the northwest corner. This data is from the Census Bureau, so the coordinates are in NAD83. Why bother with the bounding box when we can simply mask the raster using the shapefile itself? Since the bounding box is a simple rectangle, the process will go much faster than if we used the shapefile that contains thousands of coordinate pairs.

corners=clip_area.total_bounds
minx=corners[0]
miny=corners[1]
maxx=corners[2]
maxy=corners[3]
areabbox = gpd.GeoDataFrame({'geometry':Polygon([(minx,maxy),
                                                (maxx,maxy),
                                                (maxx,miny),
                                                (minx,miny),
                                                (minx,maxy)])},index=[0],crs="EPSG:4269")

Once we have the bounding box as geometry, we proceed to iterate through the rasters in the folder in a loop, reading in each raster (PRISM files are in the .bil format) using rasterio, and its mask function to clip the raster to the bounding box. The PRISM rasters and the TIGER states both use NAD83, so we didn’t need to do any coordinate reference system (CRS) transformation prior to doing the mask (if they were in different systems, we’d have to convert one to match the other). In creating a new raster, we need to specify metadata for it. We copy the metadata from the original input file to the output file, and update specific attributes for the output file (such as the pixel height and width, and the output CRS). Here’s a mask example and update from the rasterio docs. Once that’s done, we write the new file out as a simple GeoTIFF, using the name of the input raster with the prefix “clipped_”.

idx=0
for rf in os.listdir(raster_path):
    if rf.endswith('.bil'):
        raster_file=os.path.join(raster_path,rf)
        in_raster=rasterio.open(raster_file)
        # Do the clip operation
        out_raster, out_transform = mask(in_raster, areabbox.geometry, filled=False, crop=True)
        # Copy the metadata from the source and update the new clipped layer 
        out_meta=in_raster.meta.copy() 
        out_meta.update({
            "driver":"GTiff",
            "height":out_raster.shape[1], # height starts with shape[1]
            "width":out_raster.shape[2], # width starts with shape[2]
            "transform":out_transform})  
        # Write output to file
        out_file=rf.split('.')[0]+'.tif'
        out_path=os.path.join(out_folder,'clipped_'+out_file)
        with rasterio.open(out_path,'w',**out_meta) as dest:
            dest.write(out_raster)
        idx=idx+1
        if idx % 20 ==0:
            print('Processed',idx,'rasters so far...')
    else:
        pass
    
print('Finished clipping',idx,'raster files to bounding box: \n',corners)

Just to see some evidence that things worked, outside of the loop I take the last raster that was processed, and plot that to the screen. I also export the bounding box out as a shapefile, to verify what it looks like in GIS.

#Show last clipped raster
fig, ax = plt.subplots(figsize=(12,12))
areabbox.plot(ax=ax, facecolor='none', edgecolor='black', lw=1.0)
show(in_raster,ax=ax)

fig, ax = plt.subplots(figsize=(12,12))
show(out_raster,ax=ax)

# Write bbox to shapefile 
areabbox.to_file(box_file)
Clipped raster with bounding box
PRISM US mean daily temperature raster, clipped / masked to bounding box of southern New England

Extract Raster Values by Date at Point Locations

In the second script, we begin with reading in the modules and setting paths. I added an option at the top with a variable called temp_many_days; if it’s set to True, it will take the date range below it and retrieve temperatures for x to y days before the observation date in the point file. If it’s False, it will retrieve just the matching date. I also specify the names of columns in the input point observation shapefile that contain a unique ID number, name, and date. In this case the input data consists of ten sample points and dates that I’ve concocted, labeled alfa through juliett, all located in Rhode Island and stored as a shapefile.

import os,csv,rasterio
import matplotlib.pyplot as plt
import geopandas as gpd
from rasterio.plot import show
from datetime import datetime as dt
from datetime import timedelta
from datetime import date

#Calculate temps over multiple previous days from observation
temp_many_days=True # True or False
date_range=(1,7) # Range of past dates 

#Inputs
point_file=os.path.join('input_points','test_obsv.shp')
raster_dir=os.path.join('input_raster','clipped')
outfolder='output'
if not os.path.exists(outfolder):
    os.makedirs(outfolder)

# Column names in point file that contain: unique ID, name, and date
obnum='OBS_NUM'
obname='OBS_NAME'
obdate='OBS_DATE'

Next, we loop through the folder of clipped raster files, and for each raster (ending in .tif) we grab the file name and extract the date from it. We take that date and store it in Python’s standard date format. The date becomes a key, and the path to the raster its value, which get added to a dictionary called rf_dict. For example, if we split the file name clipped_PRISM_tmean_stable_4kmD2_20200131_bil.tif using the underscores, counting from zero we get the date in the 5th position, 20200131. Converting that to the standard datetime format gives us datetime.date(2020, 1, 31).

rf_dict={} # Create dictionary of dates and raster file names

for rf in os.listdir(raster_dir):
    if rf.endswith('.tif'):
        rfdatestr=rf.split('_')[5]
        rfdate=dt.strptime(rfdatestr,'%Y%m%d').date() #format of dates is 20200131
        rfpath=os.path.join(raster_dir,rf)
        rf_dict[rfdate]=rfpath
    else:
        pass

Then we read the observation point shapefile into a geodataframe, create an empty result_list that will hold each of our extracted values, and construct the header row for the list. If we are grabbing temperatures for multiple days, we generate extra header values to add to that row.

#open point shapefile
point_data = gpd.read_file(point_file)

result_list=[]
result_list.append(['OBS_NUM','OBS_NAME','OBS_DATE','RASTER_ROW','RASTER_COL','RASTER_FILE','TEMP'])

if temp_many_days==True:
    for d in range(date_range[0],date_range[1]):
        tcol='TMINUS_'+str(d)
        result_list[0].append(tcol)
    result_list[0].append('TEMP_RANGE')
    result_list[0].append('AVG_TEMP')
    temp_ftype='multiday_'
else:
    temp_ftype='singleday_'

Now the preliminaries are out of the way, and processing can begin. This post and tutorial helped me to grasp the basics of the process. We loop through the point data in the geodataframe (we indicate point.data.index because these are dataframe records we’re looping through). We get the observation date for the point and store that it the standard Python date format. Then we take that date, compare it to the dictionary, and get the path to the corresponding temperature raster for that date. We open that raster with rasterio, isolate the x and y coordinate from the geometry of the point observation, and retrieve the corresponding row and column for that coordinate pair from the raster. Then we read the value that’s associated with the grid cell at those coordinates. We take some info from the observation points (the number, name, and date) and the raster data we’ve retrieved (the row, column, file name, and temperature from the raster) and add it to a list called record.

#Pull out and format the date, and use date to look up file
for idx in point_data.index:
    obs_date=dt.strptime(point_data[obdate][idx],'%m/%d/%Y').date() #format of dates is 1/31/2020
    obs_raster=rf_dict.get(obs_date)
    if obs_raster == None:
        print('No raster available for observation and date',
              point_data[obnum][idx],point_data[obdate][idx])
    #Open raster for matching date, overlay point coordinates, get cell location and value
    else:
        raster=rasterio.open(obs_raster)
        xcoord=point_data['geometry'][idx].x
        ycoord=point_data['geometry'][idx].y
        row, col = raster.index(xcoord,ycoord)
        tempval=raster.read(1)[row,col]
        rfile=os.path.split(obs_raster)[1]
        record=[point_data[obnum][idx],point_data[obname][idx],
                point_data[obdate][idx],row,col,rfile,tempval]

If we had specified that we wanted a single day (option near the top of the script), we’d skip down to the bottom of the next block, append the record to the main result_list, and continue iterating through the observation points. Else, if we wanted multiple dates, we enter into a sub-loop to get data from a range of previous dates. The datetime timedelta function allows us to do date subtraction; if we subtract 1 from the current date, we get the previous day. We loop through and get rasters and the temperature values for the points from each previous date in the range and append them to an old_temps list; we also build in a safety mechanism in case we don’t have a raster file for a particular date. Once we have all the dates, we do some calculations to get the average temperature and range for that entire period. We do this on a copy of old_temps called all_temps, where we delete null values and add the current observation date. Then we add the average and range to old_temps, and old_temps to our record list for this point observation, and when finished we append the observation record to our main result_list, and proceed to the next observation.

       # Optional block, if pulling past dates
        if temp_many_days==True:
            old_temps=[]
            for d in range(date_range[0],date_range[1]):
                past_date=obs_date-timedelta(days=d) # iterate through days, subtracting
                past_raster=rf_dict.get(past_date)
                if past_raster == None: # if no raster exists for that date
                    old_temps.append(None)
                else:
                    old_raster=rasterio.open(past_raster)
                    # Assumes rasters from previous dates are identical in structure to 1st date
                    past_temp=old_raster.read(1)[row,col]
                    old_temps.append(past_temp)
            # Calculate avg and range, must exclude None values and include obs day
            all_temps=[t for t in old_temps if t is not None]
            all_temps.append(tempval)
            temp_range=max(all_temps)-min(all_temps)
            avg_temp=sum(all_temps)/len(all_temps)
            old_temps.extend([temp_range,avg_temp])
            record.extend(old_temps)
            result_list.append(record)
        else: # if NOT doing many days, just append data for observation day
            result_list.append(record)
    if (idx+1)%200==0:
        print('Processed',idx+1,'records so far...')

Once the loop is complete, we plot the last point and raster to the screen just to check that it looks good, and we write the results out to a CSV.

#Plot the points over the final raster that was processed    
fig, ax = plt.subplots(figsize=(12,12))
point_data.plot(ax=ax, color='black')
show(raster, ax=ax)

today=str(date.today()).replace('-','_')
outfile='temp_observations_'+temp_ftype+today+'.csv'
outpath=os.path.join(outfolder,outfile)

with open(outpath, 'w', newline='') as writefile:
    writer=csv.writer(writefile, quoting=csv.QUOTE_MINIMAL, delimiter=',')
    writer.writerows(result_list)  

print('Done. {} observations in input file, {} records in results'.format(len(point_data),len(result_list)-1))
Output data for script
CSV output from script, temperatures extracted from raster by date for observation points

Results and Wrap-up

Visit the GitHub repo for full copies of the scripts, plus input and output data. In creating test observation points, I purposefully added some locations that had identical coordinates, identical dates, dates that varied by a single day, and dates for which there would be no corresponding raster file in the sample data if we went one week back in time. I looked up single dates for all point observations manually, and a sample of multi-day selections as well, and they matched the output of the script. The scripts ran quickly, and the overall process seemed intuitive to me; resetting the metadata for rasters after masking is the one part that wouldn’t have occurred to me, and took a little bit of time to figure out. This solution worked well for this case, and I would definitely apply geospatial Python to a problem like this again. An alternative would have been to use a spatial database like PostGIS; this would be an attractive option if we were working with a bigger dataset and processing time became an issue. The benefit of using this Python approach is that it’s easier to share the script and replicate the process without having to set up a database.

Observation points on raster in QGIS
Observation points plotted on temperature raster with single-day output temperatures in QGIS