gis

Anything about GIS software

Coordinates Plotted in Rhode Island

Using PyProj to Transform Coordinates

I’ve written a number of spatial Python posts over the past few months; I’ll cap off this series with a short one on using PyProj to convert coordinates from one spatial reference system to another. PyProj is Python’s interface to PROJ, a library of coordinate system functions that power projection handling in many open source GIS and spatial packages.

A few months back I geocoded a large batch addresses against the Rhode Island DOT’s geocoding API, which returns coordinates in the local state plane system in feet. I decided to run the non-matching addresses against the Census Bureau’s Batch Geocoder, which returns coordinates in NAD 83 longitude and latitude. You can upload a CSV file of 10k addresses and get nearly instant results (one of my students recently wrote a tutorial on how to use it). So I split the unmatched records from my original CSV, uploaded it to the Census geocoder, and got matches.

Next, I needed to get the results from both processes into the same spatial reference system back in one unified file. The kludgy way to do this would be to plot each file separately in their respective systems in QGIS or ArcGIS, convert the NAD 83 plot to the state plane system, and merge the two vector files together. I used PyProj instead, to convert the NAD 83 coordinate data in the CSV to state plane, added that data to my main address CSV file, and plotted them all at once in the state plane system.

PyProj’s Transformer function does the job. I pass the EPSG / WKID codes for the input and output systems (4269 for NAD 83 and 3438 for NAD 83 RI State Plane ft-US) to Transformer.from_crs, and specify that I’m working with XY coordinates. I open the CSV file that contains the results from the Census Geocoder and read it in as a nested list, with each record as a sublist. Here are some sample records:

[["42221","1720 Victory Hwy, Glendale, RI, ","Match","Exact","1720 VICTORY HWY, GLENDALE, RI, 02826","-71.63746768099998,41.96913042600005","647200684","L","44","007","013002","1083"],
["44882","129 SHORE RD, Riverside, RI, ","No_Match"]]

Then I iterate through the records; in my example any record with less than 3 variables was a non-match, so I skip those. The Census geocoder returns longitude and latitude in the 5th position, in the same field separated with a comma (notice quotes around the coordinates in the example above, indicating that these are part of the same field so the comma is not used as a delimiter). I split this value on the comma, read the longitude as x1 and latitude as y1. The output of the transformer function returns coordinates x2 and y2 in the new system. I tack these new coordinates on to the existing record. Once the loop is finished, I write the result out as a new CSV; I used the name of the input file and tacked “stateplane” plus today’s date to the end. Here are the results for the same records:

[["42221","1720 Victory Hwy, Glendale, RI, ","Match","Exact","1720 VICTORY HWY, GLENDALE, RI, 02826","-71.63746768099998,41.96913042600005","647200684","L","44","007","013002","1083","290699.10687381076","322797.1874965105"],
["44882","129 SHORE RD, Riverside, RI, ","No_Match"]]

That’s it! I took the resulting CSV and tacked it to end of my primary CSV, which contained the successful matches from the RIDOT geocoder, in such a way that matching fields lined up. I can still identify which results came from what geocoder, as a few of the fields are different.

import csv
from datetime import date
from pyproj import Transformer

reproject = Transformer.from_crs(4269,3438,always_xy=True)

records=[]

addfile='GeocodeResults.csv'
with open(addfile,'r') as infile:
    reader = csv.reader(infile)
    for row in reader:
        records.append(row)

for r in records:
    if len(r)>3:
        x1,y1=r[5].split(',')
        x2,y2=reproject.transform(float(x1),float(y1))
        r.extend([str(x2),str(y2)])

today=str(date.today())        

outfile=addfile.split('.')[0]+'_stateplane_'+today+'.csv'
with open(outfile, 'w', newline='') as writefile:
    writer = csv.writer(writefile, quoting=csv.QUOTE_ALL, delimiter=',')
    writer.writerows(records)

print('Done')
PRISM Temperature Raster and Test Points Jan 15, 2020

Clipping Rasters and Extracting Values with Geospatial Python

In an earlier post, I described how to summarize and extract raster temperature data using GIS. In this post I’ll demonstrate some alternate methods using spatial Python. I’ll describe some scripts I wrote for batch clipping rasters, overlaying them with point locations, and extracting raster values (mean temperature) at those locations based on attributes of the points (a matching date). I used a number of third party modules, including geopandas (storing vector data in a tabular form), rasterio (working with raster grids), shapely (building vector geometry), matplotlib (plotting), and datetime (working with date data types). Using Anaconda Python, I searched for and added each of these modules via its package handler. I opted for this modular approach instead of using something like ArcPy, because I don’t want the scripts to be wedded to a specific software package. My scripts and sample data are available in GitHub; I’ll add snippets of code to this post for illustration purposes. The repo includes the full batch scripts that I’ll describe here, plus some earlier, shorter, sample scripts that are not batch-based and are useful for basic experimentation.

Overview

I was working with a medical professor who had point observations of where patients lived, which included a date attribute of when they had visited a clinic to receive certain treatment. For the study we needed to know what the mean temperature was on that day, as well as the temperature of each day of the preceding week. We opted to use daily temperature data from the PRISM Climate Group at Oregon State, where you can download a raster of the continental US for a given day that has the mean temperature (degrees Celsius) in one band, at 4km resolution. There are separate files for min and max temperature, as well as precipitation. You can download a year’s worth of data in one go, with one file per date.

Our challenge was that we had thousands of observations than spanned five years, so doing this one by one in GIS wasn’t going to be feasible. A custom script in Python seemed to be the best solution. Each raster temperature file has the date embedded in the file name. If we iterate through the point observations, we could grab its observation date, and using string manipulation grab the raster with the matching date in its file name, and then do the overlay and extraction. We would need to use Python’s datetime module to convert each date to a common format, and use a function to iterate over dates from the previous week.

Prior to doing that, we needed to clip or mask the rasters to the study area, which consists of the three southern New England states (Connecticut, Rhode Island, and Massachusetts). The PRISM rasters cover the lower 48 states, and clipping them to our small study area would speed processing time. I downloaded the latest Census TIGER file for states, and extracted the three SNE states. ArcGIS Pro does have batch clipping tools, but I found they were terribly slow. I opted to write one Python script to do the clipping, and a second to do the overlay and extraction.

Batch Clipping Rasters

I downloaded a sample of PRISM’s raster data that included two full months of daily mean temperature files, from Jan and Feb 2020. At the top of the clipper script, we import all the modules we need, and set our input and output paths. It’s best to use the path.join method from the os module to construct cross platform paths, so we don’t encounter the forward / backward \ slash issues between Mac and Linux versus Windows. Using geopandas I read in the shapefile of the southern New England (SNE) states into a geodataframe.

import os
import matplotlib.pyplot as plt
import geopandas as gpd
import rasterio
from rasterio.mask import mask
from shapely.geometry import Polygon
from rasterio.plot import show

#Inputs
clip_file=os.path.join('input_raster','mask','states_southern_ne.shp')
# new file created by script:
box_file=os.path.join('input_raster','mask','states_southern_ne_bbox.shp') 
raster_path=os.path.join('input_raster','to_clip')
out_folder=os.path.join('input_raster','clipped')

clip_area = gpd.read_file(clip_file)

Next, I create a new geodataframe that represents the bounding box for the SNE states. The total_bounds method provides a list of the four coordinates (west, south, east, north) that form a minimum bounding rectangle for the states. Using shapely, I build polygon geometry from those coordinates by assigning them to pairs, beginning with the northwest corner. This data is from the Census Bureau, so the coordinates are in NAD83. Why bother with the bounding box when we can simply mask the raster using the shapefile itself? Since the bounding box is a simple rectangle, the process will go much faster than if we used the shapefile that contains thousands of coordinate pairs.

corners=clip_area.total_bounds
minx=corners[0]
miny=corners[1]
maxx=corners[2]
maxy=corners[3]
areabbox = gpd.GeoDataFrame({'geometry':Polygon([(minx,maxy),
                                                (maxx,maxy),
                                                (maxx,miny),
                                                (minx,miny),
                                                (minx,maxy)])},index=[0],crs="EPSG:4269")

Once we have the bounding box as geometry, we proceed to iterate through the rasters in the folder in a loop, reading in each raster (PRISM files are in the .bil format) using rasterio, and its mask function to clip the raster to the bounding box. The PRISM rasters and the TIGER states both use NAD83, so we didn’t need to do any coordinate reference system (CRS) transformation prior to doing the mask (if they were in different systems, we’d have to convert one to match the other). In creating a new raster, we need to specify metadata for it. We copy the metadata from the original input file to the output file, and update specific attributes for the output file (such as the pixel height and width, and the output CRS). Here’s a mask example and update from the rasterio docs. Once that’s done, we write the new file out as a simple GeoTIFF, using the name of the input raster with the prefix “clipped_”.

idx=0
for rf in os.listdir(raster_path):
    if rf.endswith('.bil'):
        raster_file=os.path.join(raster_path,rf)
        in_raster=rasterio.open(raster_file)
        # Do the clip operation
        out_raster, out_transform = mask(in_raster, areabbox.geometry, filled=False, crop=True)
        # Copy the metadata from the source and update the new clipped layer 
        out_meta=in_raster.meta.copy() 
        out_meta.update({
            "driver":"GTiff",
            "height":out_raster.shape[1], # height starts with shape[1]
            "width":out_raster.shape[2], # width starts with shape[2]
            "transform":out_transform})  
        # Write output to file
        out_file=rf.split('.')[0]+'.tif'
        out_path=os.path.join(out_folder,'clipped_'+out_file)
        with rasterio.open(out_path,'w',**out_meta) as dest:
            dest.write(out_raster)
        idx=idx+1
        if idx % 20 ==0:
            print('Processed',idx,'rasters so far...')
    else:
        pass
    
print('Finished clipping',idx,'raster files to bounding box: \n',corners)

Just to see some evidence that things worked, outside of the loop I take the last raster that was processed, and plot that to the screen. I also export the bounding box out as a shapefile, to verify what it looks like in GIS.

#Show last clipped raster
fig, ax = plt.subplots(figsize=(12,12))
areabbox.plot(ax=ax, facecolor='none', edgecolor='black', lw=1.0)
show(in_raster,ax=ax)

fig, ax = plt.subplots(figsize=(12,12))
show(out_raster,ax=ax)

# Write bbox to shapefile 
areabbox.to_file(box_file)
Clipped raster with bounding box
PRISM US mean daily temperature raster, clipped / masked to bounding box of southern New England

Extract Raster Values by Date at Point Locations

In the second script, we begin with reading in the modules and setting paths. I added an option at the top with a variable called temp_many_days; if it’s set to True, it will take the date range below it and retrieve temperatures for x to y days before the observation date in the point file. If it’s False, it will retrieve just the matching date. I also specify the names of columns in the input point observation shapefile that contain a unique ID number, name, and date. In this case the input data consists of ten sample points and dates that I’ve concocted, labeled alfa through juliett, all located in Rhode Island and stored as a shapefile.

import os,csv,rasterio
import matplotlib.pyplot as plt
import geopandas as gpd
from rasterio.plot import show
from datetime import datetime as dt
from datetime import timedelta
from datetime import date

#Calculate temps over multiple previous days from observation
temp_many_days=True # True or False
date_range=(1,7) # Range of past dates 

#Inputs
point_file=os.path.join('input_points','test_obsv.shp')
raster_dir=os.path.join('input_raster','clipped')
outfolder='output'
if not os.path.exists(outfolder):
    os.makedirs(outfolder)

# Column names in point file that contain: unique ID, name, and date
obnum='OBS_NUM'
obname='OBS_NAME'
obdate='OBS_DATE'

Next, we loop through the folder of clipped raster files, and for each raster (ending in .tif) we grab the file name and extract the date from it. We take that date and store it in Python’s standard date format. The date becomes a key, and the path to the raster its value, which get added to a dictionary called rf_dict. For example, if we split the file name clipped_PRISM_tmean_stable_4kmD2_20200131_bil.tif using the underscores, counting from zero we get the date in the 5th position, 20200131. Converting that to the standard datetime format gives us datetime.date(2020, 1, 31).

rf_dict={} # Create dictionary of dates and raster file names

for rf in os.listdir(raster_dir):
    if rf.endswith('.tif'):
        rfdatestr=rf.split('_')[5]
        rfdate=dt.strptime(rfdatestr,'%Y%m%d').date() #format of dates is 20200131
        rfpath=os.path.join(raster_dir,rf)
        rf_dict[rfdate]=rfpath
    else:
        pass

Then we read the observation point shapefile into a geodataframe, create an empty result_list that will hold each of our extracted values, and construct the header row for the list. If we are grabbing temperatures for multiple days, we generate extra header values to add to that row.

#open point shapefile
point_data = gpd.read_file(point_file)

result_list=[]
result_list.append(['OBS_NUM','OBS_NAME','OBS_DATE','RASTER_ROW','RASTER_COL','RASTER_FILE','TEMP'])

if temp_many_days==True:
    for d in range(date_range[0],date_range[1]):
        tcol='TMINUS_'+str(d)
        result_list[0].append(tcol)
    result_list[0].append('TEMP_RANGE')
    result_list[0].append('AVG_TEMP')
    temp_ftype='multiday_'
else:
    temp_ftype='singleday_'

Now the preliminaries are out of the way, and processing can begin. This post and tutorial helped me to grasp the basics of the process. We loop through the point data in the geodataframe (we indicate point.data.index because these are dataframe records we’re looping through). We get the observation date for the point and store that it the standard Python date format. Then we take that date, compare it to the dictionary, and get the path to the corresponding temperature raster for that date. We open that raster with rasterio, isolate the x and y coordinate from the geometry of the point observation, and retrieve the corresponding row and column for that coordinate pair from the raster. Then we read the value that’s associated with the grid cell at those coordinates. We take some info from the observation points (the number, name, and date) and the raster data we’ve retrieved (the row, column, file name, and temperature from the raster) and add it to a list called record.

#Pull out and format the date, and use date to look up file
for idx in point_data.index:
    obs_date=dt.strptime(point_data[obdate][idx],'%m/%d/%Y').date() #format of dates is 1/31/2020
    obs_raster=rf_dict.get(obs_date)
    if obs_raster == None:
        print('No raster available for observation and date',
              point_data[obnum][idx],point_data[obdate][idx])
    #Open raster for matching date, overlay point coordinates, get cell location and value
    else:
        raster=rasterio.open(obs_raster)
        xcoord=point_data['geometry'][idx].x
        ycoord=point_data['geometry'][idx].y
        row, col = raster.index(xcoord,ycoord)
        tempval=raster.read(1)[row,col]
        rfile=os.path.split(obs_raster)[1]
        record=[point_data[obnum][idx],point_data[obname][idx],
                point_data[obdate][idx],row,col,rfile,tempval]

If we had specified that we wanted a single day (option near the top of the script), we’d skip down to the bottom of the next block, append the record to the main result_list, and continue iterating through the observation points. Else, if we wanted multiple dates, we enter into a sub-loop to get data from a range of previous dates. The datetime timedelta function allows us to do date subtraction; if we subtract 1 from the current date, we get the previous day. We loop through and get rasters and the temperature values for the points from each previous date in the range and append them to an old_temps list; we also build in a safety mechanism in case we don’t have a raster file for a particular date. Once we have all the dates, we do some calculations to get the average temperature and range for that entire period. We do this on a copy of old_temps called all_temps, where we delete null values and add the current observation date. Then we add the average and range to old_temps, and old_temps to our record list for this point observation, and when finished we append the observation record to our main result_list, and proceed to the next observation.

       # Optional block, if pulling past dates
        if temp_many_days==True:
            old_temps=[]
            for d in range(date_range[0],date_range[1]):
                past_date=obs_date-timedelta(days=d) # iterate through days, subtracting
                past_raster=rf_dict.get(past_date)
                if past_raster == None: # if no raster exists for that date
                    old_temps.append(None)
                else:
                    old_raster=rasterio.open(past_raster)
                    # Assumes rasters from previous dates are identical in structure to 1st date
                    past_temp=old_raster.read(1)[row,col]
                    old_temps.append(past_temp)
            # Calculate avg and range, must exclude None values and include obs day
            all_temps=[t for t in old_temps if t is not None]
            all_temps.append(tempval)
            temp_range=max(all_temps)-min(all_temps)
            avg_temp=sum(all_temps)/len(all_temps)
            old_temps.extend([temp_range,avg_temp])
            record.extend(old_temps)
            result_list.append(record)
        else: # if NOT doing many days, just append data for observation day
            result_list.append(record)
    if (idx+1)%200==0:
        print('Processed',idx+1,'records so far...')

Once the loop is complete, we plot the last point and raster to the screen just to check that it looks good, and we write the results out to a CSV.

#Plot the points over the final raster that was processed    
fig, ax = plt.subplots(figsize=(12,12))
point_data.plot(ax=ax, color='black')
show(raster, ax=ax)

today=str(date.today()).replace('-','_')
outfile='temp_observations_'+temp_ftype+today+'.csv'
outpath=os.path.join(outfolder,outfile)

with open(outpath, 'w', newline='') as writefile:
    writer=csv.writer(writefile, quoting=csv.QUOTE_MINIMAL, delimiter=',')
    writer.writerows(result_list)  

print('Done. {} observations in input file, {} records in results'.format(len(point_data),len(result_list)-1))
Output data for script
CSV output from script, temperatures extracted from raster by date for observation points

Results and Wrap-up

Visit the GitHub repo for full copies of the scripts, plus input and output data. In creating test observation points, I purposefully added some locations that had identical coordinates, identical dates, dates that varied by a single day, and dates for which there would be no corresponding raster file in the sample data if we went one week back in time. I looked up single dates for all point observations manually, and a sample of multi-day selections as well, and they matched the output of the script. The scripts ran quickly, and the overall process seemed intuitive to me; resetting the metadata for rasters after masking is the one part that wouldn’t have occurred to me, and took a little bit of time to figure out. This solution worked well for this case, and I would definitely apply geospatial Python to a problem like this again. An alternative would have been to use a spatial database like PostGIS; this would be an attractive option if we were working with a bigger dataset and processing time became an issue. The benefit of using this Python approach is that it’s easier to share the script and replicate the process without having to set up a database.

Observation points on raster in QGIS
Observation points plotted on temperature raster with single-day output temperatures in QGIS
Raster Temperature Jan 1, 2020 Southern NE

Summarizing Raster Data for Areas and Assigning Values to Points

It’s been a busy few months, but I have a few days to catch my breath now that it’s spring break and most people (except me) have gone away! One question that’s come up quite a bit this semester is how to associate raster data with coinciding vector data. I’ll summarize some approaches in this post using ArcGIS Pro and QGIS, to summarize raster values for polygons (zonal statistics) and to assign raster values to points (aka raster sampling).

Zonal Statistics: Summarize Rasters by Area

Imagine that you have quantitative values such as temperature or a vegetation index in a raster grid, and you want to use this data to calculate an average for counties or metro areas. The goal is to have a new attribute column in the vector layer that contains the summarized raster value, perhaps because you want to make thematic maps of that value, or you want to use it in conjunction with other variables to run spatial statistics, or you just want a plain and simple summary for given places.

The term zonal statistics is used to define any operation that calculates statistics on cell values of a raster within an area or zone defined by another dataset, either a raster or a vector. The ArcGIS Pro toolbox has a Zonal Statistics tool where the output is a new raster file with cells that are summarized by the input zones. That’s not desirable for the use case I’m presenting here; the better choice is the Zonal Statistics as Table tool. The output is a table containing the unique identifiers of the raster and vector, the summary stats you’ve generated (average, sum, min, max, etc), and a count of the number of cells used to generate the summary. You can join this resulting table back to the vector file using their common unique identifier in a table join.

In the example below, I’m using counties from the census TIGER files for southern New England as my Input Feature Zone, the AFFGEOID (Census ANSI / FIPS code) to identify the Zone Field, and a temperature grid for January 1, 2020 from PRISM as the Input Value Raster. I’m calculating the mean temperature for the counties on that day.

ArcGIS Zonal Statistics as Table Tool
ArcGIS Pro Zonal Statistics as Table; Temperature Grid and Southern New England Counties

The output table consists of one record for each zone / county, with the count of the cells used to create the average, and the mean temperature (in degrees Celsius). This table can be joined back to the original vector feature (select the county feature in the Contents, right click, Joins and Relates – Join) to thematically map the average temp.

ArcGIS Zonal Statistics Result
ArcGIS Pro Zonal Statistics; Table Output and Join to Show Average Temperature per County

In QGIS, this tool is simply called Zonal Statistics; search for it in the Processing toolbox. The vector with the zones is the Input layer, and the Raster layer is the grid with the values. By default the summary stats are the count, sum, and mean, but you can check the Statistics to calculate box to select others. Unlike ArcGIS, QGIS allows you to write output as a table or a new shapefile / geopackage, which carries along the feature geometry from the Input zones and adds the summaries, allowing you to skip the step of having to do a table join (if you opted to create a table, you could join it to the zones using the Joins tab under the Properties menu for the vector features).

QGIS Zonal Stats
QGIS Zonal Statistics

Extract Raster Values for Point Features

Zonal stats allows you to summarize raster data within a polygon. But what if you had point features, and wanted to assign each point the value of the raster cell that it falls within? In ArcGIS Pro, search the toolbox for the Extract Values to Points tool. You select your input points and raster, and a new point feature that will include the raster values. The default is to take the value for the cell that the point falls within, but there is an Interpolate option that will calculate the value from adjacent cells. The output point feature contains a new column called RASTERVALU. I created some phony point data and used it to generate the output below.

ArcGIS Extract Values to Points
ArcGIS Pro Extract Values to Points (assign raster cell values to points)

In QGIS the name of this tool is Sample raster values, which you can find in the Processing toolbox. Input the points, choose a raster layer, and write the output to a new vector point file. Unlike ArcGIS, there isn’t an option for interpolation from surrounding cells; you simply get the value for the cell that the point falls within. If you needed to interpolate, you can go to the Plugins menu, enable the SAGA plugin, and in the Processing toolbox try the SAGA tool Raster Values to Points instead.

QGIS Sample Raster Values
QGIS Sample Raster Values (assign raster cell values to points)

A variation on this theme would be to create and assign an average value around each point at a given distance, such as the average temperature within five miles. One way to achieve this would be to use the buffer tools in either ArcGIS or QGIS to create distinct buffers around each point at the specified distance. The buffer will automatically carry over all the attributes from the point features, including unique identifiers. Then you can run the zonal statistics tools against the buffer polygons and raster to compute the average, and if need be do a table join between the output table and the original point layer using their common identifier.

Wrap-up

In using any of these tools, it’s important to consider the resolution of the raster (i.e. the size of the grid cell):

1. Relative to the size of the zonal areas or number of points, and

2. In relation to the phenomena that you’re studying.

When larger grid cells or zonal areas are used for measurement, any phenomena becomes more generalized, and any variations within these large areas become masked. The temperature grid cells in this example had a resolution of 2.5 miles, which was suitable for creating county summaries. Summarizing data for census tracts at this resolution would be less ideal, as the tracts are much smaller than the cells, with the cell value characterizing a much larger area. This might be okay in the case of temperature, which tends not to vary considerably over a distance of a few miles. In contrast, averaging temperature data for states is not worthwhile, as states vary considerably in size and most are large enough that they contain multiple ecosystems and elevation levels.

The solutions I’ve described here are the desktop GIS solutions. You could also use either spatial SQL in a geodatabase or a spatial extension in a scripting language like Python or R to perform similar operations. In both cases a basic overlay and intersection statement is used, in conjunction with some grouping function for calculating summaries. I’ve been doing a lot more spatial Python work with geopandas these past few months – perhaps a topic for a subsequent post…

Project Linework Wargames

Snazzy Thematic Maps with Project Linework

When I’m making global thematic maps, I usually turn to Natural Earth. They provide country polygons and boundary lines, as well as features like cities and rivers, at several different scales. I always reference it in workshops that I teach, including the 2-week GIS Institute that I participated in earlier this month. It’s a solid, free data source and a good example for illustrating how scale and generalization work in cartography. It’s a “natural choice”, as they provide boundaries that depict the way the world actually looks.

I also discussed aesthetics and map design during the Institute. What if you don’t necessarily care about representing the boundaries exactly the way they are? If you rely on the map reader’s knowledge of the relative shape of the countries and their position on the globe, and you employ good labeling, you can choose boundaries that are more artistic and fun (provided that your only goal is making a basic thematic map and it’s not being published in a stodgy journal).

Project Linework is part of Something About Maps, an excellent blog by Daniel Huffman. The project consists of different series of public domain boundary files that have been generalized to provide interesting and visually attractive alternatives to standard features. The gallery contains a sample image and brief description of each series, including details on geographic coverage. Most of the series cover just North America or select portions of the world.

The three I’ll mention below are global in coverage. They come in shapefile and geojson formats, are projected in World Gall Stereographic (ESRI 54016), and include line and polygon coverages. The attribute tables have fields for ISO country codes, which are standard unique identifiers that allow for table joins for thematic mapping. I took my map of Wheat and Meslin Exports from Ukraine from an earlier post to create the following examples.

With the Wargames series, the world has been rendered using the little hexagon grids familiar to many war board gamers, and plenty of non-war gamers for that matter (think Settlers of Catan). Hexes are a an alternative to grids for determining adjacency.

Project Lineworks Map - Wargames
Project Linework: Wargames

Moriarty Hand is a more whimsical interpretation. It was drawn by hand by tracing line work from Natural Earth. The end result is more organic compared to Wargames. It comes in two scales, small and large (with an example of the latter below):

Project Lineworks Map - Moriarty Hand
Project Linework: Moriarty Hand

My personal favorite is 1981. It’s inspired by the basic polygon shapes that you would have seen in early computer graphics. When I was little I remember loading a DOS-based atlas program from a floppy disk, and slowly panning across a CGA monochrome screen as the machine chunked away to render countries that looked like these. Good if you’re looking for a retro vibe.

Project Lineworks Map - 1981
Project Linework: 1981

Happy mapping! Also from Something About Maps, check out this excellent poster and related post about families of map projections.

Dingo Paths from ZoaTrack

Wildlife Tracking GIS Data Sources

I’ve also received a number of questions this semester about animal observation and tracking data. Since I usually study people and not animals, I was a bit out of my element and had some homework to do. If you’ve ever watched nature shows, you’ve seen scientists tagging animals with collars or bands to track them by radio or satellite, or setting up cameras to record them. Many scientists upload their GPS coordinate data into publicly accessible repositories for others to download and use.

I’ve written a short, three-part document that I’ve posted on our tutorials page: GIS Data Sources for Wildlife Tutorial. In the first part, I provide summaries, links, and guidance on using large portals like Movebank and Zoatrack* that include many species from all over the world (wild and domestic), as well a government repositories including NOAA’s National Center for Environment Information Geoportal and the National Park Service’s Data Store. The second part focuses on search strategies, crawling the web and combing through academic literature in library databases to find additional data. Since these datasets are highly diffuse, it’s worth going beyond the portals to see what else you can discover.

I describe how you can add and visualize this data in QGIS and ArcGIS Pro in the third and final part. Wildlife data comes packaged in a number of formats; in some cases you’ll find shapefiles or geodatabases that you can readily add and visualize, but more often than not the data is packaged in a plain CSV / TXT format. This requires you to plot the coordinates (X for longitude, Y for latitude) to create a dot map of the observations. Data files will often contain a number of individual animals, which can be uniquely identified with a tag ID, allowing you to symbolize the points by category so you have a different color or symbol for each individual. Alternatively, there might be separate data files for each individual, that you could add and symbolize differently. The files will contain either a sequential integer or a timestamp that indicates the order of the observations. With one field that indicates the order and another that identifies each individual, you can use a Points to Line or Points to Path tool to generate lines (tracks or trajectories) from the points (observations or detections).

You can see where dingos in Queensland, Australia are going in the screenshot below, which displays individual observation points, and the screenshot in the header of this post where the points were connected to form paths. I obtained the data from ZoaTrack and used QGIS for mapping. Check out the tutorial for details on how to find and map your favorite animals.

* NOTE: ZoaTrack went offline in July 2024. You can still access an archive of the site and its datasets via the Internet Archive’s Wayback Machine. Here is a cached version of Zoatrack from June 2024. The tutorial will be updated to reflect this change soon.

Dingo observations from ZoaTrack plotted in QGIS
Map of Ukraine Wheat Exports in 2021

Import and Export Data for Countries: Grain from Ukraine

I’ve been receiving more questions about geospatial data sources as the semester draws to a close. I’ll describe some sources that I haven’t used extensively before in the next couple of posts, beginning with data on bilateral trade: imports and exports between countries. We’ll look at the IMF’s Direction of Trade Statistics (DOTS) and the UN’s COMTRADE database. Both sources provide web-based portals, APIs, and bulk downloading. I’ll focus on the portals.

IMF Direction of Trade Statistics

IMF DOTS provides monthly, quarterly, and annual import and export data, represented as total dollar values for all goods exchanged. The annual data goes back to 1947, while the monthly / quarterly data goes back to 1960. All countries that are part of the IMF are included, plus a few others. Data for exports are published on a Free and On Board (FOB) price basis, while imports are published on a Cost, Insurance, Freight (CIF) price basis. Here are definitions for each term, quoted directly from the OECD’s Statistical Glossary:

The f.o.b. price (free on board price) of exports and imports of goods is the market value of the goods at the point of uniform valuation, (the customs frontier of the economy from which they are exported). It is equal to the c.i.f. price less the costs of transportation and insurance charges, between the customs frontier of the exporting (importing) country and that of the importing (exporting) country.

The c.i.f. price (i.e. cost, insurance and freight price) is the price of a good delivered at the frontier of the importing country, including any insurance and freight charges incurred to that point, or the price of a service delivered to a resident, before the payment of any import duties or other taxes on imports or trade and transport margins within the country.

OECD Glossary of Statistical Terms

There are a few different ways to browse and search for data. Start with the Data Tables tab at the top, and Exports and Imports by Areas and Countries. The default table displays monthly exports by region and country for the entire world (you could switch to imports by selecting the Imports CIF tab beside the Export sFOB tab). Hitting the Calendar dropdown allows you to change the date range and frequency. Hitting the Country dropdown lets you select a specific region or country. In the example below, I’ve changed the calendar from months to years, and the country to Ukraine. By doing so, the table now depicts the total US dollar value of exports and imports between Ukraine and all other countries. The Export button at the top allows you to save the report in a number of formats, Excel being the most data friendly option.

IMF DOTS Basic Report – Total Value of Exports from Ukraine, Last Five Years

While this is the quickest option, it comes with some downsides; the biggest one is that there are no unique identifiers for the countries, which is important if you wanted to join this table to a GIS vector file for mapping, or another country-level table in a database.

A better approach is to return to the home page and use the Query tab, which allows you to get a unique identifier and filter out countries and regions that are not of interest.

DOTS Query Tab
  1. Under Columns, select the time frame and interval. For example, check Years for Frequency at the top, and change the dropdowns at the bottom from Months to Years. From -5 to 0 would give you the last five years in ascending order.
  2. Rows allows you to filter out countries or regions that you don’t want to see in the results. You can also change the attribute that is displayed. Once the menu is open, right click in an empty area and choose Attribute. Here you can choose a variant country name, or an ISO country code. ISO codes are commonly used for uniquely identifying countries.
  3. Indicator lets you choose Exports (FOB), Imports (CIF or FOB), or Trade Balance, all in US dollars.
  4. Counterpart country is the country or region that you want to show trade for, such as Ukraine in our previous example.
  5. The tabs along the top allow you to produce graphs instead of a table (View – Table), to pivot the table (Adjust), and calculate summaries like sums or averages (Advanced).
  6. Export to produce an Excel file. By choosing the ISO codes you’ll lose the country names, but you can join the result to another country data table or shapefile and grab the names from there.
Modify Time
Modify Rows – Country – Change Attribute
DOTS Modified Table to Export: Total Value of Exports from Ukraine Last Five Years

UN COMTRADE

If you want data on the exchange of specific goods and services, quantities in addition to dollar values, and exchanges beyond simple imports and exports, then the UN’s COMTRADE database will be your source. You need to register to download data, but you can generate previews without having to log in. There is an extensive wiki that describes how to use the different database tools, and summaries of technical terms that you need to know for extracting and interpreting the data. You’ll need some understanding of the different systems for classifying commodities and goods. Your options (the links that follow lead to documentation and code lists) are: the Harmonized Classification System (HC), the Standard Industrial Trade Classification (SITC), and the Broad Economic Categories (BEC). What’s the difference? Here are some summaries, quoted directly from a UN report on the BEC:

The HS classification is maintained by the World Customs Organization. Its main purpose is to classify goods crossing the border for import tariffs or for application of some non-tariff measures for safety or health reasons. The HS classification is revised on a five-year cycle (p. 18)

The original SITC was designed in the 1950s as a tool for collection and dissemination of international merchandise trade statistics that would help in establishing internationally comparable trade statistics. By its introduction in 1988, the HS took over as collection and dissemination tool, and SITC was thereon used mostly as an analytical tool. (p. 19)

The Classification by Broad Economic Categories (BEC) is an international product classification. Its main purpose is to provide a set of broad product categories for the analysis of trade statistics. Since its adoption in 1971, statistical offices around the world have used BEC to report trade statistics in a concise and meaningful way (p. iii). The broad economic categories of BEC include all subheadings of the HS classification. Therefore, the total trade in terms of HS equals the total trade of the goods side of BEC. (p. 18)

Classification of Broad Economic Categories Rev 5 (2018)

In short, go with the BEC if you’re interested in high-level groupings, or the HS if you need detailed subdivisions. The SITC would be useful if you need to go further back in time, or if it facilitates looking at certain subdivisions or groupings that the other systems don’t capture.

From COMTRADE’s homepage, I suggest leaving the defaults in place and doing a basic, preliminary search for all global exports for the most recent year, so you can see basic output on the next screen. Then you can apply filters for a narrower search.

For example, let’s look at annual exports of wheat from Ukraine to other countries. Under the HS filter, remove the TOTAL code. Start typing wheat, and you’ll see various product categories: 6-digit codes are the most specific, while 4-digit codes are broader groups that encapsulate the 6-digit categories. We’ll choose wheat and meslin 1001. We’ll select Ukraine as the Reporter (the country that supplied the statistics and represents the origin point), and for the 1st partner we’ll choose All to get a list of all countries that Ukraine exported wheat to. The 2nd partner country we’ll leave as World (alternatively, you would add specific countries here if you wanted to know if there were intermediary nations between the origin and destination).

UN COMTARDE Refine Search with Filters

Hit Preview to see the results. You can click on a heading to sort by dollar value, weight, or country name. Like IMF DOTS, UN COMTRADE measures dollar amounts of exports as FOB and imports as CIF. At this point, you would need to log in to download the data as a CSV (creating an account is free). You would also need to be logged in if you generated an extract that has more than 500 records, otherwise the results will be truncated. You could always copy and paste data for shorter extracts directly from the screen to a spreadsheet, but you wouldn’t get any of the extra metadata fields that come with download, like ISO Country Codes and the classification codes for goods and merchandise.

COMTRADE Filtered Results – Exports of Wheat and Meslin from Ukraine 2021
Data Exported from COMTRADE to CSV with Identifiers

Mapping

For data from either source, if you wanted to map it you’d need to have a data table where there is one row for each country with columns of attributes, and with one column that has the ISO country code to serve as a unique identifier. Save the data table in an Excel file or as a table in a database. Download a country shapefile from Natural Earth. Add the shapefile and data table to a project and join them using the ISO code. Natural Earth shapefiles have several different ISO code columns that represent nations, sovereigns, and parent – child relationships; be sure you select the right one. Data table records that represent regions or groupings of countries (i.e. the EU, ASEAN, sum of smaller countries per continent not enumerated, etc.) will fall out of the dataset, as they won’t have a matching feature in the country shapefile. The map at the top of this post was created in QGIS, using COMTRADE and Natural Earth.