climate

Historic County Climate Data for the US

I recently had a question about finding historic climate data in the United States at the county-level. In this post I’ll show you how to access it, and how to parse fixed-width text files in Excel. Weather data is captured and reported by point-based weather stations, and then is often interpolated and modeled over gridded surfaces (rasters). The National Centers for Environmental Information at NOAA have used their models to create zonal statistics for counties, which they publish via the Climate at a Glance County Mapping program (I described what zonal statistics are in an earlier post).

The basic application lets you map the continental US or an individual state (includes AK but not HI). You choose a parameter (Avg / Min / Max temperature, precipitation, cooling / heating days, drought indexes), year (1895 to present), month, and time scale (1 month to 5 years). This creates a map that you can modify to depict that value, or to display ranks or anomalies. You can download the map as an image, or the underlying data as CSV or JSON.

A separate app allows you to create a time series profile for a particular county, with a table, chart, and data that you can download.

These apps are great for the basics, but bulk downloading the underlying data for all counties and years is a bit trickier. You crash land in a file directory and have to choose from an array of zipped files. Fortunately there is good documentation. In that folder, these are the county-level files for precipitation, min temp, max temp, and avg temp:

climdiv-pcpncy-vx.y.z-YYYYMMDD
climdiv-tmaxcy-vx.y.z-YYYYMMDD
climdiv-tmincy-vx.y.z-YYYYMMDD
climdiv-tmpccy-vx.y.z-YYYYMMDD

Where v is for “version”, the xyz is a version number, and the final portion is the date. The archive is updated monthly. The other files in the directory are for climate divisions, states and regions, and data that pertains to the drought indexes. There are also files that have climate normals for each of these areas. If you’re interested in these, you can go up to the parent-level directory and view the relevant documentation.

The county files are fixed-width text files, which means you have to parse them to separate the values. If you treat them as delimited files (using spaces), then all of the fields at the beginning of the file will be lumped together, which is not useful. Spreadsheets and stats packages have tools for importing delimited text, or you could script something in Python or R. Modern versions of Excel will allow you to parse fixed-width data by supplying a list of endpoints for each column; older versions of Excel and other spreadsheets have you “eyeball” the columns and manually insert breaks in an import screen.

If you’re using a modern version of Excel: open a blank workbook and on the Data ribbon click the From Text/CSV button. Browse and select the county text file you’ve downloaded. In the import screen change the Delimiter drop down to Fixed Width.

In the box underneath, begin with zero and type the end points for each position (with the exception of the final endpoint, 95) as a comma separated list. You’ll find these in the README file, but I’ve also tacked on the most salient bits to the end of this post. For your convenience:

0,5,7,11,18,25,32,39,46,53,60,67,74,81,88

If you click on the preview grid, it will parse the columns.

In this example, I’m not parsing the state and county code separately, but am keeping them together to create a single unique identifier. Once everything is parsed, hit the Transform Data button. For column 1, hit the small 123 button, and change the option to Text, and choose Replace data.

This will preserve the leading zero in the state/county code. It’s important to do this, so the codes in this table with match the codes in other county data table or spatial data files that you may wish to join this table to. Do the same for the element code in column 2. The remaining Year and Month columns can be left alone, as they’re already appropriately saved as integers and decimals respectively.

Hit the Close and Load button in the upper left hand corner, and Excel will parse and load the data. It formats the columns and applies a filter option. To get rid of the styling and filter dropdowns, I’d copy the entire table, and do a Paste-Special-Values in a new worksheet. Then replace the generic column labels with these:

CNTYCODE,ELEMENT,YEAR,JAN,FEB,MAR,APR,MAY,JUNE,JULY,AUG,SEPT,OCT,NOV,DEC

Save the file, and now you have something to work with. Each record represents the monthly temperature or precipitation for a particular county for a particular year. To create a unique record ID, you can concatenate the state/county code, element code, and year values. For GIS applications, you would need to pivot the data to a wide form, so that the year becomes a column to give you month-year as a column, and each row represents each county with no repeats. With over 120 years of monthly data, that would give you over 1500 columns – so filter out what you don’t need. The state / county code can be used to join the table to the Census Bureau’s Cartographic Boundary Files, using the CBF’s GEOID field.

When would you use this data? If you’re creating data profiles or are running a statistical analysis and are using counties as your geographic unit, and temperature or precipitation is one variable among many that you need. Or, you’re making a series of county-level maps, and this is one of your variables. This dataset is clearly pretty convenient for doing time series analyses, as compiling data for a times series is usually time consuming. The counties in this dataset represent present day boundaries, so normalizing geography over time isn’t necessary.

When not to use it? Counties vary in size and can encompass a great deal of internal variety in terms of elevation, land use and land cover, and proximity to / presence of water bodies, all of which impact the climate. So the weather in one part of a county could be quite different from another part. To capture these internal differences, it would be better to use gridded data, such as the 4×4 km rasters that PRISM produces for daily, monthly, annual, and normal summaries.

Gridded climate data and zonal stats derived from grids are estimates based on models; if you wanted or needed the actual measurements as they were recorded, you would need to go back and get point-based weather station data, from the Local Climatological Database for instance. There are a limited number of stations, and not one for every county. The closest station to a given place could be used to represent or approximate the weather for that place.

Codebook for county data files (extracted from README):

Element Record
Name Position Element Description

STATE-CODE 1-2 as indicated in State Code Table as described in FILE 1. Range of values is 01-48.

DIVISION-NUMBER 3-5 COUNTY FIPS - Range of values 001-999.

ELEMENT CODE 6-7

01 = Precipitation
02 = Average Temperature
25 = Heating Degree Days
26 = Cooling Degree Days
27 = Maximum Temperature
28 = Minimum Temperature

YEAR 8-11 This is the year of record. Range is 1895 to current year processed.

Monthly Divisional Temperature format (f7.2) Range of values -50.00 to 140.00 degrees Fahrenheit. Decimals retain a position in the 7-character field.  Missing values in the latest year are indicated by -99.99.

Monthly Divisional Precipitation format (f7.2) Range of values 00.00 to 99.99.  Decimal point retains a position in the 7-character field. Missing values in the latest year are indicated by -9.99.

JAN-VALUE 12-18

FEB-VALUE 19-25

MAR-VALUE 26-32

APR-VALUE 33-39

MAY-VALUE 40-46

JUNE-VALUE 47-53

JULY-VALUE 54-60

AUG-VALUE 61-67

SEPT-VALUE 68-74

OCT-VALUE 75-81

NOV-VALUE 82-88

DEC-VALUE 89-95

Clipping Rasters and Extracting Values with Geospatial Python

In an earlier post, I described how to summarize and extract raster temperature data using GIS. In this post I’ll demonstrate some alternate methods using spatial Python. I’ll describe some scripts I wrote for batch clipping rasters, overlaying them with point locations, and extracting raster values (mean temperature) at those locations based on attributes of the points (a matching date). I used a number of third party modules, including geopandas (storing vector data in a tabular form), rasterio (working with raster grids), shapely (building vector geometry), matplotlib (plotting), and datetime (working with date data types). Using Anaconda Python, I searched for and added each of these modules via its package handler. I opted for this modular approach instead of using something like ArcPy, because I don’t want the scripts to be wedded to a specific software package. My scripts and sample data are available in GitHub; I’ll add snippets of code to this post for illustration purposes. The repo includes the full batch scripts that I’ll describe here, plus some earlier, shorter, sample scripts that are not batch-based and are useful for basic experimentation.

Overview

I was working with a medical professor who had point observations of where patients lived, which included a date attribute of when they had visited a clinic to receive certain treatment. For the study we needed to know what the mean temperature was on that day, as well as the temperature of each day of the preceding week. We opted to use daily temperature data from the PRISM Climate Group at Oregon State, where you can download a raster of the continental US for a given day that has the mean temperature (degrees Celsius) in one band, at 4km resolution. There are separate files for min and max temperature, as well as precipitation. You can download a year’s worth of data in one go, with one file per date.

Our challenge was that we had thousands of observations than spanned five years, so doing this one by one in GIS wasn’t going to be feasible. A custom script in Python seemed to be the best solution. Each raster temperature file has the date embedded in the file name. If we iterate through the point observations, we could grab its observation date, and using string manipulation grab the raster with the matching date in its file name, and then do the overlay and extraction. We would need to use Python’s datetime module to convert each date to a common format, and use a function to iterate over dates from the previous week.

Prior to doing that, we needed to clip or mask the rasters to the study area, which consists of the three southern New England states (Connecticut, Rhode Island, and Massachusetts). The PRISM rasters cover the lower 48 states, and clipping them to our small study area would speed processing time. I downloaded the latest Census TIGER file for states, and extracted the three SNE states. ArcGIS Pro does have batch clipping tools, but I found they were terribly slow. I opted to write one Python script to do the clipping, and a second to do the overlay and extraction.

Batch Clipping Rasters

I downloaded a sample of PRISM’s raster data that included two full months of daily mean temperature files, from Jan and Feb 2020. At the top of the clipper script, we import all the modules we need, and set our input and output paths. It’s best to use the path.join method from the os module to construct cross platform paths, so we don’t encounter the forward / backward \ slash issues between Mac and Linux versus Windows. Using geopandas I read in the shapefile of the southern New England (SNE) states into a geodataframe.

import os
import matplotlib.pyplot as plt
import geopandas as gpd
import rasterio
from rasterio.mask import mask
from shapely.geometry import Polygon
from rasterio.plot import show

#Inputs
clip_file=os.path.join('input_raster','mask','states_southern_ne.shp')
# new file created by script:
box_file=os.path.join('input_raster','mask','states_southern_ne_bbox.shp') 
raster_path=os.path.join('input_raster','to_clip')
out_folder=os.path.join('input_raster','clipped')

clip_area = gpd.read_file(clip_file)

Next, I create a new geodataframe that represents the bounding box for the SNE states. The total_bounds method provides a list of the four coordinates (west, south, east, north) that form a minimum bounding rectangle for the states. Using shapely, I build polygon geometry from those coordinates by assigning them to pairs, beginning with the northwest corner. This data is from the Census Bureau, so the coordinates are in NAD83. Why bother with the bounding box when we can simply mask the raster using the shapefile itself? Since the bounding box is a simple rectangle, the process will go much faster than if we used the shapefile that contains thousands of coordinate pairs.

corners=clip_area.total_bounds
minx=corners[0]
miny=corners[1]
maxx=corners[2]
maxy=corners[3]
areabbox = gpd.GeoDataFrame({'geometry':Polygon([(minx,maxy),
                                                (maxx,maxy),
                                                (maxx,miny),
                                                (minx,miny),
                                                (minx,maxy)])},index=[0],crs="EPSG:4269")

Once we have the bounding box as geometry, we proceed to iterate through the rasters in the folder in a loop, reading in each raster (PRISM files are in the .bil format) using rasterio, and its mask function to clip the raster to the bounding box. The PRISM rasters and the TIGER states both use NAD83, so we didn’t need to do any coordinate reference system (CRS) transformation prior to doing the mask (if they were in different systems, we’d have to convert one to match the other). In creating a new raster, we need to specify metadata for it. We copy the metadata from the original input file to the output file, and update specific attributes for the output file (such as the pixel height and width, and the output CRS). Here’s a mask example and update from the rasterio docs. Once that’s done, we write the new file out as a simple GeoTIFF, using the name of the input raster with the prefix “clipped_”.

idx=0
for rf in os.listdir(raster_path):
    if rf.endswith('.bil'):
        raster_file=os.path.join(raster_path,rf)
        in_raster=rasterio.open(raster_file)
        # Do the clip operation
        out_raster, out_transform = mask(in_raster, areabbox.geometry, filled=False, crop=True)
        # Copy the metadata from the source and update the new clipped layer 
        out_meta=in_raster.meta.copy() 
        out_meta.update({
            "driver":"GTiff",
            "height":out_raster.shape[1], # height starts with shape[1]
            "width":out_raster.shape[2], # width starts with shape[2]
            "transform":out_transform})  
        # Write output to file
        out_file=rf.split('.')[0]+'.tif'
        out_path=os.path.join(out_folder,'clipped_'+out_file)
        with rasterio.open(out_path,'w',**out_meta) as dest:
            dest.write(out_raster)
        idx=idx+1
        if idx % 20 ==0:
            print('Processed',idx,'rasters so far...')
    else:
        pass
    
print('Finished clipping',idx,'raster files to bounding box: \n',corners)

Just to see some evidence that things worked, outside of the loop I take the last raster that was processed, and plot that to the screen. I also export the bounding box out as a shapefile, to verify what it looks like in GIS.

#Show last clipped raster
fig, ax = plt.subplots(figsize=(12,12))
areabbox.plot(ax=ax, facecolor='none', edgecolor='black', lw=1.0)
show(in_raster,ax=ax)

fig, ax = plt.subplots(figsize=(12,12))
show(out_raster,ax=ax)

# Write bbox to shapefile 
areabbox.to_file(box_file)

Clipped raster with bounding box — PRISM US mean daily temperature raster, clipped / masked to bounding box of southern New England

Extract Raster Values by Date at Point Locations

In the second script, we begin with reading in the modules and setting paths. I added an option at the top with a variable called temp_many_days; if it’s set to True, it will take the date range below it and retrieve temperatures for x to y days before the observation date in the point file. If it’s False, it will retrieve just the matching date. I also specify the names of columns in the input point observation shapefile that contain a unique ID number, name, and date. In this case the input data consists of ten sample points and dates that I’ve concocted, labeled alfa through juliett, all located in Rhode Island and stored as a shapefile.

import os,csv,rasterio
import matplotlib.pyplot as plt
import geopandas as gpd
from rasterio.plot import show
from datetime import datetime as dt
from datetime import timedelta
from datetime import date

#Calculate temps over multiple previous days from observation
temp_many_days=True # True or False
date_range=(1,7) # Range of past dates 

#Inputs
point_file=os.path.join('input_points','test_obsv.shp')
raster_dir=os.path.join('input_raster','clipped')
outfolder='output'
if not os.path.exists(outfolder):
    os.makedirs(outfolder)

# Column names in point file that contain: unique ID, name, and date
obnum='OBS_NUM'
obname='OBS_NAME'
obdate='OBS_DATE'

Next, we loop through the folder of clipped raster files, and for each raster (ending in .tif) we grab the file name and extract the date from it. We take that date and store it in Python’s standard date format. The date becomes a key, and the path to the raster its value, which get added to a dictionary called rf_dict. For example, if we split the file name clipped_PRISM_tmean_stable_4kmD2_20200131_bil.tif using the underscores, counting from zero we get the date in the 5th position, 20200131. Converting that to the standard datetime format gives us datetime.date(2020, 1, 31).

rf_dict={} # Create dictionary of dates and raster file names

for rf in os.listdir(raster_dir):
    if rf.endswith('.tif'):
        rfdatestr=rf.split('_')[5]
        rfdate=dt.strptime(rfdatestr,'%Y%m%d').date() #format of dates is 20200131
        rfpath=os.path.join(raster_dir,rf)
        rf_dict[rfdate]=rfpath
    else:
        pass

Then we read the observation point shapefile into a geodataframe, create an empty result_list that will hold each of our extracted values, and construct the header row for the list. If we are grabbing temperatures for multiple days, we generate extra header values to add to that row.

#open point shapefile
point_data = gpd.read_file(point_file)

result_list=[]
result_list.append(['OBS_NUM','OBS_NAME','OBS_DATE','RASTER_ROW','RASTER_COL','RASTER_FILE','TEMP'])

if temp_many_days==True:
    for d in range(date_range[0],date_range[1]):
        tcol='TMINUS_'+str(d)
        result_list[0].append(tcol)
    result_list[0].append('TEMP_RANGE')
    result_list[0].append('AVG_TEMP')
    temp_ftype='multiday_'
else:
    temp_ftype='singleday_'

Now the preliminaries are out of the way, and processing can begin. This post and tutorial helped me to grasp the basics of the process. We loop through the point data in the geodataframe (we indicate point.data.index because these are dataframe records we’re looping through). We get the observation date for the point and store that it the standard Python date format. Then we take that date, compare it to the dictionary, and get the path to the corresponding temperature raster for that date. We open that raster with rasterio, isolate the x and y coordinate from the geometry of the point observation, and retrieve the corresponding row and column for that coordinate pair from the raster. Then we read the value that’s associated with the grid cell at those coordinates. We take some info from the observation points (the number, name, and date) and the raster data we’ve retrieved (the row, column, file name, and temperature from the raster) and add it to a list called record.

#Pull out and format the date, and use date to look up file
for idx in point_data.index:
    obs_date=dt.strptime(point_data[obdate][idx],'%m/%d/%Y').date() #format of dates is 1/31/2020
    obs_raster=rf_dict.get(obs_date)
    if obs_raster == None:
        print('No raster available for observation and date',
              point_data[obnum][idx],point_data[obdate][idx])
    #Open raster for matching date, overlay point coordinates, get cell location and value
    else:
        raster=rasterio.open(obs_raster)
        xcoord=point_data['geometry'][idx].x
        ycoord=point_data['geometry'][idx].y
        row, col = raster.index(xcoord,ycoord)
        tempval=raster.read(1)[row,col]
        rfile=os.path.split(obs_raster)[1]
        record=[point_data[obnum][idx],point_data[obname][idx],
                point_data[obdate][idx],row,col,rfile,tempval]

If we had specified that we wanted a single day (option near the top of the script), we’d skip down to the bottom of the next block, append the record to the main result_list, and continue iterating through the observation points. Else, if we wanted multiple dates, we enter into a sub-loop to get data from a range of previous dates. The datetime timedelta function allows us to do date subtraction; if we subtract 1 from the current date, we get the previous day. We loop through and get rasters and the temperature values for the points from each previous date in the range and append them to an old_temps list; we also build in a safety mechanism in case we don’t have a raster file for a particular date. Once we have all the dates, we do some calculations to get the average temperature and range for that entire period. We do this on a copy of old_temps called all_temps, where we delete null values and add the current observation date. Then we add the average and range to old_temps, and old_temps to our record list for this point observation, and when finished we append the observation record to our main result_list, and proceed to the next observation.

       # Optional block, if pulling past dates
        if temp_many_days==True:
            old_temps=[]
            for d in range(date_range[0],date_range[1]):
                past_date=obs_date-timedelta(days=d) # iterate through days, subtracting
                past_raster=rf_dict.get(past_date)
                if past_raster == None: # if no raster exists for that date
                    old_temps.append(None)
                else:
                    old_raster=rasterio.open(past_raster)
                    # Assumes rasters from previous dates are identical in structure to 1st date
                    past_temp=old_raster.read(1)[row,col]
                    old_temps.append(past_temp)
            # Calculate avg and range, must exclude None values and include obs day
            all_temps=[t for t in old_temps if t is not None]
            all_temps.append(tempval)
            temp_range=max(all_temps)-min(all_temps)
            avg_temp=sum(all_temps)/len(all_temps)
            old_temps.extend([temp_range,avg_temp])
            record.extend(old_temps)
            result_list.append(record)
        else: # if NOT doing many days, just append data for observation day
            result_list.append(record)
    if (idx+1)%200==0:
        print('Processed',idx+1,'records so far...')

Once the loop is complete, we plot the last point and raster to the screen just to check that it looks good, and we write the results out to a CSV.

#Plot the points over the final raster that was processed    
fig, ax = plt.subplots(figsize=(12,12))
point_data.plot(ax=ax, color='black')
show(raster, ax=ax)

today=str(date.today()).replace('-','_')
outfile='temp_observations_'+temp_ftype+today+'.csv'
outpath=os.path.join(outfolder,outfile)

with open(outpath, 'w', newline='') as writefile:
    writer=csv.writer(writefile, quoting=csv.QUOTE_MINIMAL, delimiter=',')
    writer.writerows(result_list)  

print('Done. {} observations in input file, {} records in results'.format(len(point_data),len(result_list)-1))

Output data for script — CSV output from script, temperatures extracted from raster by date for observation points

Results and Wrap-up

Visit the GitHub repo for full copies of the scripts, plus input and output data. In creating test observation points, I purposefully added some locations that had identical coordinates, identical dates, dates that varied by a single day, and dates for which there would be no corresponding raster file in the sample data if we went one week back in time. I looked up single dates for all point observations manually, and a sample of multi-day selections as well, and they matched the output of the script. The scripts ran quickly, and the overall process seemed intuitive to me; resetting the metadata for rasters after masking is the one part that wouldn’t have occurred to me, and took a little bit of time to figure out. This solution worked well for this case, and I would definitely apply geospatial Python to a problem like this again. An alternative would have been to use a spatial database like PostGIS; this would be an attractive option if we were working with a bigger dataset and processing time became an issue. The benefit of using this Python approach is that it’s easier to share the script and replicate the process without having to set up a database.

Observation points on raster in QGIS — Observation points plotted on temperature raster with single-day output temperatures in QGIS

Summarizing Raster Data for Areas and Assigning Values to Points

It’s been a busy few months, but I have a few days to catch my breath now that it’s spring break and most people (except me) have gone away! One question that’s come up quite a bit this semester is how to associate raster data with coinciding vector data. I’ll summarize some approaches in this post using ArcGIS Pro and QGIS, to summarize raster values for polygons (zonal statistics) and to assign raster values to points (aka raster sampling).

Zonal Statistics: Summarize Rasters by Area

Imagine that you have quantitative values such as temperature or a vegetation index in a raster grid, and you want to use this data to calculate an average for counties or metro areas. The goal is to have a new attribute column in the vector layer that contains the summarized raster value, perhaps because you want to make thematic maps of that value, or you want to use it in conjunction with other variables to run spatial statistics, or you just want a plain and simple summary for given places.

The term zonal statistics is used to define any operation that calculates statistics on cell values of a raster within an area or zone defined by another dataset, either a raster or a vector. The ArcGIS Pro toolbox has a Zonal Statistics tool where the output is a new raster file with cells that are summarized by the input zones. That’s not desirable for the use case I’m presenting here; the better choice is the Zonal Statistics as Table tool. The output is a table containing the unique identifiers of the raster and vector, the summary stats you’ve generated (average, sum, min, max, etc), and a count of the number of cells used to generate the summary. You can join this resulting table back to the vector file using their common unique identifier in a table join.

In the example below, I’m using counties from the census TIGER files for southern New England as my Input Feature Zone, the AFFGEOID (Census ANSI / FIPS code) to identify the Zone Field, and a temperature grid for January 1, 2020 from PRISM as the Input Value Raster. I’m calculating the mean temperature for the counties on that day.

ArcGIS Zonal Statistics as Table Tool — ArcGIS Pro Zonal Statistics as Table; Temperature Grid and Southern New England Counties

The output table consists of one record for each zone / county, with the count of the cells used to create the average, and the mean temperature (in degrees Celsius). This table can be joined back to the original vector feature (select the county feature in the Contents, right click, Joins and Relates – Join) to thematically map the average temp.

ArcGIS Zonal Statistics Result — ArcGIS Pro Zonal Statistics; Table Output and Join to Show Average Temperature per County

In QGIS, this tool is simply called Zonal Statistics; search for it in the Processing toolbox. The vector with the zones is the Input layer, and the Raster layer is the grid with the values. By default the summary stats are the count, sum, and mean, but you can check the Statistics to calculate box to select others. Unlike ArcGIS, QGIS allows you to write output as a table or a new shapefile / geopackage, which carries along the feature geometry from the Input zones and adds the summaries, allowing you to skip the step of having to do a table join (if you opted to create a table, you could join it to the zones using the Joins tab under the Properties menu for the vector features).

QGIS Zonal Stats — QGIS Zonal Statistics

Extract Raster Values for Point Features

Zonal stats allows you to summarize raster data within a polygon. But what if you had point features, and wanted to assign each point the value of the raster cell that it falls within? In ArcGIS Pro, search the toolbox for the Extract Values to Points tool. You select your input points and raster, and a new point feature that will include the raster values. The default is to take the value for the cell that the point falls within, but there is an Interpolate option that will calculate the value from adjacent cells. The output point feature contains a new column called RASTERVALU. I created some phony point data and used it to generate the output below.

ArcGIS Extract Values to Points — ArcGIS Pro Extract Values to Points (assign raster cell values to points)

In QGIS the name of this tool is Sample raster values, which you can find in the Processing toolbox. Input the points, choose a raster layer, and write the output to a new vector point file. Unlike ArcGIS, there isn’t an option for interpolation from surrounding cells; you simply get the value for the cell that the point falls within. If you needed to interpolate, you can go to the Plugins menu, enable the SAGA plugin, and in the Processing toolbox try the SAGA tool Raster Values to Points instead.

QGIS Sample Raster Values (assign raster cell values to points)

A variation on this theme would be to create and assign an average value around each point at a given distance, such as the average temperature within five miles. One way to achieve this would be to use the buffer tools in either ArcGIS or QGIS to create distinct buffers around each point at the specified distance. The buffer will automatically carry over all the attributes from the point features, including unique identifiers. Then you can run the zonal statistics tools against the buffer polygons and raster to compute the average, and if need be do a table join between the output table and the original point layer using their common identifier.

Wrap-up

In using any of these tools, it’s important to consider the resolution of the raster (i.e. the size of the grid cell):

1. Relative to the size of the zonal areas or number of points, and

2. In relation to the phenomena that you’re studying.

When larger grid cells or zonal areas are used for measurement, any phenomena becomes more generalized, and any variations within these large areas become masked. The temperature grid cells in this example had a resolution of 2.5 miles, which was suitable for creating county summaries. Summarizing data for census tracts at this resolution would be less ideal, as the tracts are much smaller than the cells, with the cell value characterizing a much larger area. This might be okay in the case of temperature, which tends not to vary considerably over a distance of a few miles. In contrast, averaging temperature data for states is not worthwhile, as states vary considerably in size and most are large enough that they contain multiple ecosystems and elevation levels.

The solutions I’ve described here are the desktop GIS solutions. You could also use either spatial SQL in a geodatabase or a spatial extension in a scripting language like Python or R to perform similar operations. In both cases a basic overlay and intersection statement is used, in conjunction with some grouping function for calculating summaries. I’ve been doing a lot more spatial Python work with geopandas these past few months – perhaps a topic for a subsequent post…

GIS Data for US Coastal Storms and Floods

Over the course of this academic year I’ve helped many students find GIS data related to coastal storms and flooding in the US. There’s a ton of data available, particularly from NOAA, but there are so many projects and initiatives that it can be tough to find what you’re looking for. So I’ll share a few key resources here.

NOAA’s DigitalCoast is a good place to start; it’s a catalog of federal, state, and US territory projects and websites that provide both spatial and non-spatial datasets related to coastal storms and flooding. You can filter by place and data type; there are even a few global sources. Most of the projects I mention below are cataloged there.

Given the size of many of these datasets, the ArcGIS File Geodatabase is often used for packaging and distribution. Once you’ve downloaded and unzipped one, it looks like a folder with lots of subfolders and files. If you’re an ArcGIS user, use the Catalog pane to browse your file system and add a connection to the database / folder to access its contents. If you’re a QGIS user, use the Data Manager and on the Vector tab change the source type from File to Directory. In the Source Type dropdown you can choose OpenFileGDB, and browse and select the database, which appears as a folder. Once you hit the Add button, you’ll be prompted to choose the features in the DB that you wish to add to the project.

FEMA Flood Hazards and Disasters

The FEMA flood maps are usually the first thing that comes to mind when folks set out to find data on flooding, but good luck finding their GIS data. I’ve searched through their main program site for the National Flood Hazard Layer and followed every link, but can’t for the life of me find the connection to the page that has actual GIS data; there are map viewer tools, scanned paper maps, web mapping services, and everything else under the sun.

If you want FEMA flood data in a GIS format: GO HERE! You have to search by state, county, and jurisdiction, but after searching under Effective products at the bottom choose NFHL-Data State, and you’ll get the database for the whole state (or choose county if you prefer). The data is packaged in an ArcGIS File Geodatabase, and among the many layers there is a flood hazard area layer. Features are categorized into different types of flood zones, open water bodies, areas outside of flood zones, and areas outside flood zones protected by levees. The pic below illustrates 100 and 500 year zones overlaid on the OpenTopoMap.

FEMA Flood Maps. Light blue areas are 500 year zones, dark blue are 100 year — FEMA Flood Hazard Layer, 100 year zones in dark blue, 500 year in light blue

FEMA also has a GIS data feed for current and historical emergencies and disasters, that are available in a variety of formats both spatial and non-spatial. These are county-level layers that indicate where disaster areas were declared and what kind of funding or assistance is / was available.

NOAA Sea Level Rise

The FEMA maps assess both past events and current conditions to model the likelihood of flooding in a 100 or 500 year period for a major storm event. A different way of looking at flooding is to consider sea level rise due to climate change, where the impact of sea level rise is measured in different increments. Instead of the impact of a one-shot event, this illustrates potential long term change. NOAA’s Sea Level Rise (SLR) viewer allows you to easily visualize the impact of sea level rise in 1 foot increments, between 1 and 10 feet. You can download the data by US state or territory for coastal areas. There are separate downloads for sea level rise, rise depth, the confidence intervals for the models, as well as DEMs and flood frequency. The sea level rise data is package in an ArcGIS file geodatabase, with two sets of files (a low estimate and high estimate) in one foot increments. An example of 6 feet in sea level rise is shown below.

NOAA National Hurricane Center

Beyond showing the general impact of flooding or sea level rise, you can also look at the track of individual hurricanes and tropical storms. The National Hurricane Center’s GIS data page provides historical forecasts – the projected path and cone of storms, windspeeds, storm surges, etc. You choose your year, then can choose a storm, and then a particular day. You can use this data to see how the forecasts evolved as the storm moved. When we’re in hurricane season, you can also see what the circumstances are day by day for tracking new storms.

If you want to see what actually happened (as opposed to a forecast), you can dig through the data page and browse the different options. There’s the Tropical Cyclone Report (TCR) which provides “information on each tropical cyclone, including synoptic history, meteorological statistics, casualties and damages, and the post-analysis best track (six-hourly positions and intensities). Tropical cyclones include depressions, storms and hurricanes.” The default page shows you the Atlantic, but you can swap to Eastern or Central Pacific using the link at the top. Storms are listed alphabetically (and thus by date) and your format options are shapefile or KML. There’s a map at the bottom that depicts and labels all the storms for that season. You actually get four shapefiles in a download; a point file that contains a number of measurements, a line file for the storm track, a polygon file for the radius of the storm, and another polygon with the wind swath. The layers for 2021’s Tropical Storm Henri are illustrated below.

NOAA Tropical Cyclone Report Layers — Layers from NOAA”s NHC Tropical Cyclone Report, Tropical Storm Henri 2021

GIS data for the storms begins in 2010 with KMZ files (which you’ll need to convert in ArcGIS or QGIS to make them useful beyond display purposes), and shapefiles appear in 2015. Further back in time are just PDF reports and map scans.

If you really want to go back and time and get all the tracks at once, there’s the HURDAT2 database; one for the Atlantic (1851 to present) and another for the Pacific (1949 to present). It’s a csv file that contains coordinates for the track of every storm, which you can process to create a geospatial file using a points to line tool. Or – you can grab a version where that’s already been created! The International Best Track Archive for Climate Stewardship (IBTrACS) keeps a running CSV and shapefile of all global storms. Scroll down and choose shapefile (CSV is another option). The download page is just a list of files – you can choose points or lines, storms by ocean (East Pacific, North Atlantic, North Indian, South Atlantic, South Indian, South Pacific, West Pacific), or grab everything in lists that are: active, everything (ALL), last 3 years, or since 1980. Below is an example of all storms in the North Atlantic – there are quite a lot (see below)! You get storm speed and direction, wind speed and direction, coordinates, and identifiers associated with the storm as points and lines. A subset of this data for the 2021 season is displayed in the feature image at the top of this post.

IBTrACS Historical Hurricane Tracks — Historical hurricane / storm tracks from 1851 to 2021 in the North Atlantic from IBTrACS

How About the Weather?

There are many places you can go for this and the best source depends on the use case. More often than not, I end up using the Local Climatological Database. Choose a geographic type, then a specific area, and you’ll see all the weather stations in this area. Add them to the cart, and then view the cart once you have all the stations you want. On the next screen choose an output format (CSV or TXT fixed width) and a date range. You submit an order and wait a bit for it to be compiled, and are notified by email when it’s ready for download. Mixed in this CSV are records that are monthly, daily, and hourly, so after downloading you’ll want to extract just the period you’re interested in. Data includes temperature, precipitation, dew point, wind speed and direction, humidity, barometric pressure, and cloud cover.

NOAA Local Climatological Data Map Tool — Map Tool search interface for NOAA Local Climatological Data

Some processing is required to make these files GIS ready. Each record represents an observation at a station at a given point in time, so if you plot these “as is” the likely idea is you’re making an illustrated time series of some sort, as you’ll have tons of observations plotted on a few spots (where the stations are). If this isn’t desirable, then you’ll filter records to create extracts for just a given point in time, maybe separate features for each time period. For monthly summaries you can pivot time to columns, to create a column for each month and indicator. This would be impractical for daily or hourly summaries, unless you’re focusing on a single month for the former or day / week for the latter (otherwise you’ll have a bazillion columns).

Annoyingly, the CSV option doesn’t include any of the station information in the download (like the standard WBAN ID, name, longitude, latitude, and elevation) except for one unique identifier. I know that this information was all included in the past, and am not sure why it was dropped. The TXT version includes the station info, but fixed-width files are a pain to work with. If you are working with a small number of stations, you can pull the station info individually by previewing the station on the download screen (click on the station title or little eye symbol). The five digit WBAN number is included as the last 5 digits of the identifier in the CSV, so you can identify and relate each one. If you don’t want to mess with copying and pasting, you can generate a second extract for all the stations for just a single day and download that in the TXT format, and then parse just the station columns and associate them with your main table.

There are multiple ways that you can create extracts for this data beyond the example I just provided, available from the main data tools page. For a more refined search you can select the summary period (yearly, monthly, daily, hourly) and targeted variables in advance. There are also FTP options for bulk downloads.

One thing that surprises folks who are new to working with this data, is that there aren’t many weather stations. For the LCD, my home state of Delaware only has three, one in each county. The entire City of New York only has three as well, at each of the airports and one in Central Park. If you’re not interested in points and want areas, then you would need to gather a significant number of stations and do interpolation. Or – use data that’s already modeled. I mentioned PRISM at Oregon State in a previous post, as a nice source for national US rasters of temperature and precipitation that you can generate for dailies, monthlies, and normals.

At These Coordinates

Dispatches from the Geospatial Data World