In this post I demonstrate how to export a list of variables from a STATA dta file to an Excel spreadsheet, and how to create a STATA do file by using Python to read a list of selected variables back in from a spreadsheet; the do file generates an extract of attributes and observations from a larger dta file. Gallup Analytics microdata serves as the example.
Gallup Analytics Microdata
Many academic libraries subscribe to an online database called Gallup Analytics, which lets users explore and download summary statistics from a number of ongoing polls and surveys conducted by the Gallup Organization, such as the US Daily Tracker poll, the World Poll, and the Gallup Poll Social Series (GPSS). As part of the package, subscribing institutions also receive microdata files for some of the surveys, in STATA and SPSS formats. These files contain the anonymized, individual responses to the surveys. The microdata is valuable to social science researchers who use the responses to conduct statistical analyses.
Microdata in STATA
Naturally, the microdata is copyrighted and licensed for non-commercial research purposes to members of the university or institution who are covered by the license agreement, and cannot be shared outside the institution. Another stipulation is that the files cannot be shared in their entirety, even for members of the licensed institution; researchers must request individual extracts of variables and observations to answer a specific research question. This poses a challenge for the data librarian, who somehow has to communicate to the researcher what’s available in the files and mediate the request. Option 1 is to share the codebooks (which are also copyrighted and can’t be publicly distributed) with the researcher and haggle back and forth via email to iron out the details of the request. Option 2 is to have a stand-alone computer set up in the library, where a researcher can come and generate their own extract from files stored on a secure, internal network. In both cases, the manual creation of the extract and the researcher’s lack of familiarity with the contents of the data makes for a tedious process.
My solution was to create spreadsheets that list all of the variables in each dataset, and have the researcher check the ones they want. I created a resource guide that advertises and describes the datasets, and provides secure links to the Gallup codebooks and these spreadsheets, which are stored on a Google Drive and are protected via university authentication. The researcher can then fill out a Google form (also linked to from that page), where they describe the nature of the request, select the specific dataset of interest, specify filters on observations (rows), and upload the spreadsheet of requested variables (columns). Then, I can read the spreadsheet variables into Python and generate a STATA do file (STATA scripts stored in plain text format), to create the desired extract which I can share with the researcher.
Create List of STATA Variables in Excel Spreadsheet
First, I created a standard set of STATA do files to output lists of all variables to a spreadsheet for the different data files. An example for the US Daily Tracker poll from pre-2018 is below. I was completely unfamiliar with STATA, but the online docs and forums taught me what I needed to pull this together.
Some commands are the same across all the do files. I use describe and then translate to save a brief summary from the screen, with counts of rows and columns, to a simple text file. Then describe, replace swaps the data in memory for a codebook-style dataset with one row per variable, listing each variable's position, name, type, and label. The generate command adds an empty column called select_vars (a variable name I created) where researchers can mark the variables they want, and sort orders the rows by the name column. The export excel command is followed by the specific summary fields I wish to output: the position of the variable, the variable name itself, the data type, the variable label, and the blank selection column.
* Create variable list for Gallup US Tracker Survey 2008-2017
local y = YEAR in 1    // save the year of the first record for use in the output file names
describe, short        // brief summary of the data: number of observations and variables
summarize YEAR
translate @Results gallup_tracker_`y'_summary.txt, replace    // save the screen output to a text file
describe, replace      // replace the data in memory with one row per variable (the codebook info)
generate select_vars = ""    // blank column for researchers to mark the variables they want
sort name
export excel position name type varlab select_vars using gallup_tracker_`y'_vars.xlsx, firstrow(variables) replace
The variation for this particular US Daily Tracker dataset is that the files are packaged as one file per year. I load the first file for 2008, and the do file saves the YEAR attribute as a local variable, which allows me to include the year in the summary and excel output file names. I had to run this do file for each subsequent year up to 2017. This is not a big deal as I’ll never have to repeat the process on the old files, as new data will be released in separate, new files. Other datasets imposed different requirements; the GPSS survey is packaged in eleven separate files for different surveys, and the updates are cumulative (each file contains older data plus any updates – Gallup sends us updated files a few times each year). For the GPSS, I prompt the user for input to specify the survey file name, and overwrite the previous Excel file.
With the do file in hand, you open STATA and the data file you want to process, change the working directory from the default user folder to a better location for storing the output, open the do file, and it runs and creates the variable list spreadsheet.
List of variables in Excel generated from STATA file. Users check the variables they want in an extract in the select_vars column
Create a STATA Do File with Python and Excel
Once a researcher submits their Google form and their selected variable spreadsheet (placing an X in a dedicated column to indicate that they want to include a variable), I run the Python script below. I use the openpyxl module to read the Excel file. I have to modify the paths, spreadsheet file name, and an integer for the particular survey each time I run it. I use the os module to navigate up and down through folders to store outputs in specific places. If the researcher specifies in the Google form that they want to filter observations, for example records for specific states or age ranges, I have to add those manually, but I commented out a few examples that I can copy and modify. One caveat is that you must filter using the coded variable and not its label (i.e. if a month value is coded as 2 and its label is February, I must reference the code and not the label). Reading in the requested columns is straightforward; the script identifies cells in the selection column (E) that have an X, then grabs the variable name from the name column.
# -*- coding: utf-8 -*-
"""
Pull selected gallup variables from spreadsheet to create STATA Do File
Frank Donnelly / GIS and Data Librarian / Brown University
"""
import openpyxl as xl, os
from datetime import date
thedate=date.today().strftime("%m%d%Y")
surveys={1:'gallup_covid',2:'gallup_gpss',3:'gallup_tracker',4:'gallup_world'}
rpath=os.path.join('requests','test') # MODIFY BASED ON INPUT
select_file=os.path.join(rpath,'gallup_tracker_2017_vars_TEST.xlsx') #MODIFY BASED ON INPUT
survey_file=surveys[3] #MODIFY BASED ON INPUT
dofile=os.path.join(rpath,'{}_vars_{}.do'.format(survey_file,thedate))
dtafile=os.path.join(os.path.abspath(os.getcwd()),rpath,'{}_extract_{}.dta'.format(survey_file,thedate))
#MODIFY to filter by observations - DO NOT ERASE EXAMPLES - copy, then modify
obsfilter=None
# obsfilter=None
# obsfilter='keep if inlist(STATE_NAME,"CT","MA","ME","NH","RI","VT")'
# obsfilter='keep if inrange(WP1220,18,64)'
# obsfilter='keep if SC7==2 & MONTH > 6'
# obsfilter='keep if FIPS_CODE=="44007" | FIPS_CODE=="25025"'
workbook = xl.load_workbook(select_file)
ws = workbook['Sheet1']
# May need to modify ws col and cell values based on user input
vlist=[]
for cell in ws['E']:
    if cell.value in ('x','X'):
        vlist.append((ws.cell(row=cell.row, column=2).value))
outfile = open(dofile, "w")
outfile.writelines('keep ')
outfile.writelines(" ".join(vlist)+"\n")
if obsfilter is None:
    pass
else:
    outfile.writelines(obsfilter+"\n")
outfile.writelines('save '+dtafile+"\n")
outfile.close()
print('Created',dofile)
The plain text do file begins with the command keep followed by the columns, and if requested, an additional keep statement to filter by records. The final save command will direct the output to a specific location.
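For illustration, a generated do file might contain something like the following; the variable names echo the commented filter examples above and the path and date are hypothetical:
keep STATE_NAME WP1220 SC7 MONTH
keep if inlist(STATE_NAME,"CT","MA","ME","NH","RI","VT")
save C:\data\requests\test\gallup_tracker_extract_01152021.dta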
All that remains is to open the requested data file in STATA, open the do file, and an extract is created. Visit my GitHub for the do files, Python script, and sample output. The original source data and the variable spreadsheets are NOT included due to licensing issues; if you have the original data files you can generate what I’ve described here. Sorry, I can’t share the Gallup data with you (so please don’t ask). You’ll need to contact your own university or institution to determine if you have access.
Even though I've left New York, there are still occasions where I refer back to NYC resources in order to help students and faculty here with NYC-based research. Most recently I've revisited NYC DOITT's Geoclient API for geocoding addresses, and I discovered that a number of things have changed since I last used it a few years ago. I'll walk through my latest geocoding script in this post.
First and foremost: if you landed on this page because you’re trying to figure out how to get your Geoclient API key to work, the answer is:
&subscription-key=YOURKEYHERE
This replaces the old format that required you to pass an app id and key. I searched through two websites and scanned through hundreds of pages of documentation, only to find this solution in a cached Google search result, as the new docs don’t mention this change and the old docs still have the previous information and examples of the application ID and key. So – hopefully this should save you some hours of frustration.
I was working with someone who needed to geocode a subset of the city’s traffic violation data from the open data portal, as the data lacks coordinates. It’s also missing postal city names and ZIP Codes, which precludes using most geocoders that rely on this information. Even if we had these fields, I’ve found that many geocoders struggle with the hyphenated addresses used throughout Queens, and some work-around is needed to get matches. NYC’s geoclient is naturally able to handle those Queens addresses, and can use the borough name or code for locating addresses in lieu of ZIP Codes. The traffic data uses pseudo-county codes, but it’s easy to replace those with the corresponding borough codes.
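As a rough sketch of that swap, something like the dictionary below would do. The county abbreviations on the left are hypothetical placeholders (check how counties are actually coded in your copy of the violations data); the borough names on the right are values the Geoclient accepts, and the county-to-borough equivalences themselves are fixed.
# Hypothetical county codes mapped to the boroughs the Geoclient expects;
# the equivalences are: New York=Manhattan, Kings=Brooklyn, Queens=Queens,
# Bronx=Bronx, Richmond=Staten Island
county_to_borough={'NY':'MANHATTAN','K':'BROOKLYN','Q':'QUEENS',
                   'BX':'BRONX','R':'STATEN ISLAND'}
boro=county_to_borough.get(county_code,'') # county_code read from the violation record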
The older documentation is still solid for illustrating the different APIs and the variables that are returned; you can search for a parsed or non-parsed street address, street intersections, places of interest or landmarks, parcel blocks and lots, and a few others.
I wrote some Python code that I’ve pasted below for geocoding addresses that have house numbers, street, and borough stored in separate fields using the address API, and if the house number is missing we try again by doing an intersection search, as an intersecting street is stored in a separate field in the traffic data. In the past I used a thin client for accessing the API, but I’m skipping that as it’s simpler to just build the URLs directly with the requests module.
The top of the script has the standard stuff: the name of the input file, the column locations (counting from zero) in the input file that contain each of the four address components, the base URL for the API, a time function for progress updates, reading the API key in from a file, and looping through the input CSV with the addresses to save the header row in one list and the records in a nested list. I created a list of fields that are returned from the API that I want to hold on to and add them to the header row, along with a final variable that records the results of the match. In addition to longitude and latitude you can also get xCoordinate and yCoordinate, which are in the NY State Plane Long Island (ft-US) map projection. I added a counts dictionary to keep track of the result of each match attempt.
Then we begin a long loop – this is a bit messy and if I had more time I’d collapse much of this into a series of functions, as there is repetitive code. I loop through the index and value of each record beginning with the first one. The loop is in a try / except block, so in the event that something goes awry it should exit cleanly and write out the data that was captured. We take the base url and append the address request, slicing the record to get the values for house, street, and borough into the URL. An example of a URL after passing address components in:
https://api.nyc.gov/geo/geoclient/v1/address.json?houseNumber=12345&street=CONEY ISLAND AVE&borough=BROOKLYN&subscription-key=KEYGOESHERE
Pass that URL to the requests module and get a response back. If an address is returned, the JSON resembles a Python dictionary, with ‘address’ as the key and the value as another dictionary with key value pairs of several variables. Otherwise, we get an error message that something was wrong with the request.
A successful address match returns an address dictionary, with a sub-dictionary of keys and values
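Trimmed down, the structure resembles something like this (the values are made up, and only the handful of fields the script captures are shown):
package={'address': {'houseNumber': '12345',
                     'firstStreetNameNormalized': 'CONEY ISLAND AVENUE',
                     'uspsPreferredCityName': 'BROOKLYN',
                     'zipCode': '11230',
                     'longitude': -73.965,
                     'latitude': 40.617,
                     'xCoordinate': '995000',
                     'yCoordinate': '165000'}}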
The loop logic:
- If the package contains an 'address' key, flatten to get the sub-dictionary
  - If 'longitude' is present as a key, a match is returned: get the relevant fields and append them to the record
  - Else, if the dictionary contains a 'message' key with a value indicating the house number was missing, do an intersection match
    - If the package contains an 'intersection' key, flatten to get the sub-dictionary
      - If 'longitude' is present as a key, a match is returned: get the relevant fields and append them to the record
      - If not, there was no intersection match: get the messages and append blanks for each value to the record
    - If not, an error was returned: capture the error, append blanks for each value to the record, and continue
  - If not, there was no address match: get the messages and append blanks for each value to the record
- If not, an error was returned: capture the error, append blanks for each value to the record, and continue
The API has limits of 2,500 matches per minute and 500k per day, so I built in a 15-second pause after every 2,000 records. Once the process finishes, successfully or not, the records are written out to a CSV file, header row first followed by the records. If the process bailed prematurely, the last record and its index are printed to the screen. This allows you to rerun the script where you left off, by changing the start index in the variables list at the top of the script from 0 to the last record that was read. When it comes time to write output, the previous file is appended rather than overwritten and the header row isn't written again.
It took about 90 minutes to match a file of 25,000 records. I’d occasionally get an error message that the API key was bad for a given record; the error would be recorded and the script continued. It’s likely that there are illegal characters in the input fields for the address that end up creating a URL where the key parameter can’t be properly interpreted. I thought the results were pretty good; beyond streets it was able to recognize landmarks like large parks and return matched coordinates with relevant error messages (example below). Most of the flops were, not surprisingly, due to missing borough codes or house numbers.
Output fields from the NYC Geoclient written to CSV
To use this code you'll need to sign up for an NYC Developer API account, and then you can request a key for the NYC Geoclient service. Store the key in a text file in the same folder as the script. I'm also storing inputs and outputs in the same folder, but with a few functions from the os module you can manipulate paths and change directories. If I get time over the winter break I may try rewriting to incorporate this, plus functions to simplify the loops. An alternative to the API would be to download the LION street network geodatabase, and you could set up a local address locator in ArcGIS Pro. Might be worth doing if you had tons of matches to do. I quickly got frustrated with the ArcGIS documentation and after a number of failed attempts I opted to use the Geoclient instead.
"""
Match addresses to NYC Geoclient using house number, street name, and borough
Frank Donnelly / GIS and Data Librarian / Brown University
11/22/2021 - Python 3.7
"""
import requests, csv, time
#Variables
addfile='parking_nov2021_nyc.csv' #Input file with addresses
matchedfile=addfile[:-4]+'_output.csv' #Output file with matched data
keyfile='nycgeo_key.txt' #File with API key
start_idx=0 #If program breaks, change this to pick up with record where you left off
#Counting from 0, positions in the CSV that contain the address info
hous_idx=23
st_idx=24
boro_idx=21
inter_idx=25
base_url='https://api.nyc.gov/geo/geoclient/v1/'
def get_time():
    time_now = time.localtime() # get struct_time
    pretty_time = time.strftime("%m/%d/%Y, %H:%M:%S", time_now)
    return pretty_time
print('*** Process launched at', get_time())
#Read api key in from file
with open(keyfile) as key:
    api_key=key.read().strip()
records=[]
with open(addfile,'r') as infile:
    reader = csv.reader(infile)
    header = next(reader) # Capture column names as separate list
    for row in reader:
        records.append(row)
# Fields returned by the API to capture
# https://maps.nyc.gov/geoclient/v1/doc
fields=['message','message2','houseNumber','firstStreetNameNormalized',
'uspsPreferredCityName','zipCode','longitude','latitude','xCoordinate',
'yCoordinate']
header.extend(fields)
header.append('match_result')
datavals=len(fields)-2 # Number of fields that are not messages
counts={'address match':0, 'intersection match':0,
'failed address':0, 'failed intersection':0,
'error':0}
print('Finished reading data from', addfile)
print('*** Geocoding process launched at',get_time())
for i,v in enumerate(records[start_idx:]):
    try:
        data_url = f'{base_url}address.json?houseNumber={v[hous_idx]}&street={v[st_idx]}&borough={v[boro_idx]}&subscription-key={api_key}'
        response=requests.get(data_url)
        package=response.json()
        # If an address is returned, continue
        if 'address' in package:
            result=package['address']
            # If longitude is returned, grab data
            if 'longitude' in result:
                for f in fields:
                    item=result.get(f,'')
                    v.append(item)
                v.append('address match')
                counts['address match']=counts['address match']+1
            # If there was no house number, try street intersection match instead
            elif 'message' in result and result['message']=='INPUT CONTAINS NO ADDRESS NUMBER' and v[inter_idx] not in ('',None):
                try:
                    data_url = f'{base_url}intersection.json?crossStreetOne={v[st_idx]}&crossStreetTwo={v[inter_idx]}&borough={v[boro_idx]}&subscription-key={api_key}'
                    response=requests.get(data_url)
                    package=response.json()
                    # If an intersection is returned, continue
                    if 'intersection' in package:
                        result=package['intersection']
                        # If longitude is returned, grab data
                        if 'longitude' in result:
                            for f in fields:
                                item=result.get(f,'')
                                v.append(item)
                            v.append('intersection match')
                            counts['intersection match']=counts['intersection match']+1
                        # Intersection match fails, append messages and blank values
                        else:
                            v.append(result.get('message',''))
                            v.append(result.get('message2',''))
                            v.extend(['']*datavals)
                            v.append('failed intersection')
                            counts['failed intersection']=counts['failed intersection']+1
                    # Error returned instead of intersection
                    else:
                        v.append(package.get('message',''))
                        v.append(package.get('message2',''))
                        v.extend(['']*datavals)
                        v.append('error')
                        counts['error']=counts['error']+1
                        print(package.get('message',''))
                        print('Geocoder error at record',i,'continuing the matching process...')
                except Exception as e:
                    print(str(e))
            # Address match fails, append messages and blank values
            else:
                v.append(result.get('message',''))
                v.append(result.get('message2',''))
                v.extend(['']*datavals)
                v.append('failed address')
                counts['failed address']=counts['failed address']+1
        # Error is returned instead of address
        else:
            v.append(package.get('message',''))
            v.append(package.get('message2',''))
            v.extend(['']*datavals)
            v.append('error')
            counts['error']=counts['error']+1
            print(package.get('message',''))
            print('Geocoder error at record',i,'continuing the matching process...')
        if i%2000==0:
            print('Processed',i,'records so far...')
            time.sleep(15)
    except Exception as e:
        print(str(e))
# First attempt, write to new file, but if break happened, append to existing file
if start_idx==0:
    wtype='w'
else:
    wtype='a'
end_idx=start_idx+i
with open(matchedfile,wtype,newline='') as outfile:
    writer = csv.writer(outfile, delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    if wtype=='w':
        writer.writerow(header)
        writer.writerows(records[start_idx:end_idx])
    else:
        writer.writerows(records[start_idx:end_idx])
print('Wrote',i+1,'records to file',matchedfile)
print('Final record written was number',i,':\n',v)
for k,val in counts.items():
    print(k,val)
print('*** Process finished at',get_time())
I’m working on a project where I needed to generate a list of all the administrative subdivisions (i.e. states / provinces, counties / departments, etc) and their ID codes for several different countries. In this post I’ll demonstrate how I accomplished this using the Geonames API and Python. Geonames (https://www.geonames.org/) is a gazetteer, which is a directory of geographic features that includes coordinates, variant place names, and identifiers that classify features by type and location. Geonames includes many different types of administrative, populated, physical, and built-environment features. Last year I wrote a post about gazetteers where I compared Geonames with the NGA Geonet Names Server, and illustrated how to download CSV files for all places within a given country and load them into a database.
In this post I’ll focus on using an API to retrieve gazetteer data. Geonames provides over 40 different REST API services for accessing their data, all of them well documented with examples. You can search for places by name, return all the places that are within a distance of a set of coordinates, retrieve all places that are subdivisions of another place, geocode addresses, obtain lists of centroids and bounding boxes, and much more. Their data is crowd sourced, but is largely drawn from a body of official gazetteers and directories published by various countries.
This makes it an ideal source for generating lists of administrative divisions and subdivisions with codes for multiple countries. This information is difficult to find, because there isn't an international body that collates and freely provides it. ISO 3166-1 includes the standard country codes that most of the world uses. ISO 3166-2 includes codes for 1st-level administrative divisions, but ISO doesn't freely publish them. You can find them from various sources; Wikipedia lists them and there are several gist and GitHub repos with screen-scraped copies. The US GNA is a more official source that includes both ISO 3166-1 and 3166-2. But as far as I know there isn't a solid source for codes below the 1st-level divisions. Many countries create their own systems and freely publish their codes (i.e. ANSI FIPS codes in the US, INSEE COG codes in France), but that would require you to tie them all together. GADM is the go-to source for vector-based GIS files of country subdivisions (see the map at the top of this post for example). For some countries they include ISO division codes, but for others they don't (they do employ the HASC codes from Statoids, but it's not clear if those codes are still being actively maintained).
Geonames to the rescue – you can browse the countries on the website to see the country and 1st-level admin codes (see image below), but the API will give us a quick way to obtain all division levels. First, you have to register to get an API username – it's free – and you'll tack that username onto the end of your requests. That gives you 20k credits per day, which in most cases works out to one credit per request. I recommend downloading one of their prepackaged country files first, to give you a sense of how the records are structured and what attributes are available. A readme file that describes all of the available fields accompanies each download.
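For example, a search request for all administrative features in Dominica, with French place names and full details, can be built like this (substitute your own username); it mirrors the URL my script constructs below:
http://api.geonames.org/searchJSON?name=*&country=DM&featureClass=A&lang=fr&style=full&username=yourusername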
1st Level Admin Divisions for Dominica from the Geonames website
My goal was to get all administrative divisions – names and codes and how the divisions nest within each other – for all of the countries in the French-speaking Caribbean (countries that are currently, or formerly, overseas territories of France). I also needed to get place names as they’re written in French. I’ll walk through my Python script that I’ve pasted below.
import requests,csv
from time import strftime
ccodes=['BL','DM','GD','GF','GP','HT','KN','LC','MF','MQ','VC']
fclass='A'
lang='fr'
uname='REQUEST FROM GEONAMES'
#Columns to keep
fields=['countryId','countryName','countryCode','geonameId','name','asciiName',
'alternateNames','fcode','fcodeName','adminName1','adminCode1',
'adminName2','adminCode2','adminName3','adminCode3','adminName4','adminCode4',
'adminName5','adminCode5','lng','lat']
fcode=fields.index('fcode')
#Divisions to keep
divisions=['ADM1','ADM2','ADM3','ADM4','ADM5','PCLD','PCLF','PCLI','PCLIX','PCLS']
base_url='http://api.geonames.org/searchJSON?'
def altnames(names,lang):
    "Given a list of alternate name dicts, extract the preferred name for a given language"
    aname=''
    for entry in names:
        if 'isPreferredName' in entry.keys() and entry.get('lang')==lang:
            aname=entry.get('name')
        else:
            pass
    return aname
places=[]
tossed=[]
for country in ccodes:
    data_url = f'{base_url}name=*&country={country}&featureClass={fclass}&lang={lang}&style=full&username={uname}'
    response=requests.get(data_url)
    data=response.json() #total retrieved and results in list of dicts
    gnames=response.json()['geonames'] #create list of dicts only
    gnames.sort(key=lambda e: (e.get('countryCode',''),e.get('fcode',''),
                               e.get('adminCode1',''),e.get('adminCode2',''),
                               e.get('adminCode3',''),e.get('adminCode4',''),
                               e.get('adminCode5','')))
    for record in gnames:
        r=[]
        for f in fields:
            item=record.get(f,'')
            if f=='alternateNames' and item !='':
                aname=altnames(item,'en')
                r.append(aname)
            else:
                r.append(item)
        if r[fcode] in divisions: #keep certain admin divs, toss others
            places.append(r)
        else:
            tossed.append(r)
filetoday=strftime('%Y_%m_%d')
outfile='geonames_fwi_adm_'+filetoday+'.csv'
writefile=open(outfile,'w', newline='', encoding='utf8')
writer=csv.writer(writefile, delimiter=",", quotechar='"',quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(fields) #header row
writer.writerows(places)
writefile.close()
print(len(places),'records written to file',outfile)
First, I identify all of the variables I need: the two-letter ISO codes of the countries, a list of the Geonames attributes that I want to keep, the two-letter language code, and the specific feature type I'm interested in. Features are grouped into classes identified by a single letter, with a number of more specific feature codes below that. Feature class A is for records that represent administrative divisions, and within that class I needed records that represented the country as a whole (PCL codes) and its subdivisions (ADM codes). There are several different place name variables that include official names, short forms, and an ASCII form that only includes characters found in the Latin alphabet used in English. The language code that you pass into the url will alter these results, but you still have the option to obtain preferred place names from an alternate names field. The admin codes I'm retrieving are the actual admin codes; you can also opt to retrieve the unique Geonames integer IDs for each admin level, if you wanted to use these for bridging places together (not necessary in my case).
There are a few different approaches for achieving this goal. I decided to use the Geonames full text search, where you search for features by name (separate APIs for working with hierarchies for parent and child entities are another option). I used an asterisk as a wildcard to retrieve all names, and the other parameters to filter for specific countries and feature classes. At the end of the base url I added JSON for the search; if you leave this off the records are returned as XML.
base_url='http://api.geonames.org/searchJSON?'
My primary for loop loops through each country, and passes the parameters into the data url to retrieve the data for that country: I pass in country code, feature class A, and French as the language for the place names. It took me a while to figure out that I also needed to add style=full to retrieve all of the possible info that’s available for a given record; the default is to capture a subset of basic information, which lacked the admin codes I needed.
I use the Python Requests module to interact with the API. Geonames returns two objects in JSON: an integer count of the total records retrieved, and another JSON object that essentially represents a list of Python dictionaries, where each dictionary contains all the attributes for a record as a series of key and value pairs, with the attribute name as the key (see examples below). I create a new gnames variable to isolate just this list, and then I sort the list based on how I want the final output to appear: by country and by admin codes, so that like levels of admin codes are grouped together. The trick of using lambda to sort nested lists or dictionaries is well documented, but one variation I needed was to use the dictionary get method. Some features may not have five levels of admin codes; if they don't, there is no key for that attribute, and the simple dict[key] approach raises a KeyError in those cases. Using dict.get(key,'') allows you to pass in a default value if no key is present. I provide a blank string as a placeholder, as ultimately I want each record to have the same number of columns in the output and need the attributes to line up correctly.
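As a stripped-down illustration of that pattern (toy records, not actual API output):
recs=[{'countryCode':'DM','adminCode1':'05'},
      {'countryCode':'DM'},                     # no adminCode1 key at all
      {'countryCode':'BL','adminCode1':'02'}]
recs.sort(key=lambda e: (e.get('countryCode',''),e.get('adminCode1','')))
# a record with no admin code sorts first within its country instead of raising a KeyError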
Records returned from Geonames as a list, where each list item is a dictionary of key / value pairs for a given place
Example of an individual list item, a dictionary of key / value pairs for the Parish of Charlotte, a 1st order admin division of Saint-Vincent-et-les-Grenadines. Variable names are keys.
Once I have records for the first country, I loop through them and choose just the attributes that I want from my field list. The attribute name is the key; I get the associated value, but if that key isn't present I insert an empty string. In most cases the value associated with a key is a string or integer, but in a few instances it's another container, as in the case of alternate names, which is another list of dictionaries. If there are alternate names I want to pull out a preferred name in English if one exists. I handle this with a function so the loop looks less cluttered. Lastly, if this record represents an admin division or is a country-level record then I want to keep it; otherwise I append it to a throw-away list that I'll inspect later.
Once the records returned for that country have been processed, we move on to the next country and keep appending records to the main list of places (image below). When we’re done, we write the results out to a CSV file. I write the list of fields out first as my header row, and then the records follow.
Final list called places that contains records for all admin divisions for specific countries and feature classes, where items are sublists that represent each place
Overall I think this approach worked well, but there are some small caveats. A number of the countries I'm studying are not independent, but are dependencies of France. For dependent countries, their 1st and sometimes even 2nd level subdivision codes appear identical to their top-level country code, as they represent a subdivision of an independent country (many overseas territories are departments of France). If I need to harmonize these codes between countries I may have to adjust the dependencies. The alternate English place names always appear for the country-level record, but usually not below that. I think I'd need to do some additional tweaking or even run a second set of requests in English if I wanted all the English spellings; for example, in French many compound place names like Saint-Paul are separated by a hyphen, but in English they're separated by a space. Not a big deal in my case as I was primarily interested in the alternate spellings for countries, i.e. Guyane versus French Guiana. See the final output below for Guyane; these subdivision codes are from INSEE COG, which are the official codes used by the French government for identifying all geographic areas for both metropolitan France and overseas departments and collectivities.
1st half of CSV file imported into spreadsheet, records showing admin divisions of Guyane / French Guiana
2nd half of CSV file imported into spreadsheet, records showing admin codes and hierarchy of divisions for Guyane / French Guiana
Two final things to point out. First, my script lacks any exception handling, since my request is rather small and the API is reliable. If I was pulling down a lot of data I would replace my main for loop with a try and except block to handle errors, and would capture retrieved data as the process unfolds in case some problem forces a bail out. Second, when importing the CSV into a spreadsheet or database, it’s important to import the admin codes as text, as many of them have leading zeros that need to be preserved.
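For what it's worth, a sketch of that more defensive version might look like the following, reusing the variables from the script above (this is just the shape of it, not code I actually ran):
for country in ccodes:
    data_url = f'{base_url}name=*&country={country}&featureClass={fclass}&lang={lang}&style=full&username={uname}'
    try:
        response=requests.get(data_url,timeout=30)
        response.raise_for_status()                  # treat HTTP errors as exceptions
        gnames=response.json().get('geonames',[])
    except Exception as e:
        print('Problem retrieving',country,':',e)
        continue                                     # skip this country, keep what we have so far
    # process gnames as before, then append this country's rows to the CSV right away,
    # so a later failure doesn't lose everything retrieved up to that point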
This example represents just the tip of the iceberg in terms of things you can do with Geonames APIs. Check it out!
In this post I’ll demonstrate how I created annotated heatmaps (or what I’m calling a rank change grid) showing change in rank over time using Python and Matplotlib’s imshow plots. I was writing a report on population trends and internal migration using the IRS county to county migration dataset, and wanted to depict the top origins and destinations of migrants for New York City and the New York Metropolitan Area and how they changed from year to year.
I hit upon this idea based on an example in the Matplotlib documentation using the imshow plot. Imshow was designed for manipulating and creating images, but since images are composed of rows and columns of pixels you can use this function to create grids (for GIS folks, think of a raster). The rows can indicate rank from 1 to N, while the columns could represent time, which in my case is years. I could label each grid cell with the name of a place (i.e. origin or destination), and if a place changes ranks over time I could assign the cell a color indicating increase or decrease; otherwise I’d assign a neutral color indicating no change. The idea is that you could look at place at a given rank in year 1 and follow it across the chart by looking at the label. If a new place appears in a given position, the color change clues you in, and you can quickly scan to see whether a given place went up or down.
The image below shows change in rank for the top metro area destinations for migrants leaving the NYC metro from 2011 to 2018. You can see that metro Miami was the top destination for several years, up until 2016-17 when it flips positions with metro Philadelphia, which had been the number 2 destination. The sudden switch from a neutral color indicates that the place occupying this rank is new. You can also follow how 3rd ranked Bridgeport falls to 4th place in the 2nd year (displaced by Los Angeles), remains in 4th place for a few years, and then falls to 5th place (again bumped by Los Angeles, which falls from 3rd to 4th as it’s bumped by Poughkeepsie).
Annual Change in Ranks for Top Destinations for NYC Metro Migrants (Metro Outflows)
I opted for this over a more traditional approach called a bump chart (also referred to as a slope chart or graph), with time on the x-axis and ranks on the y-axis, and observations labeled at either the first or last point in time. Each observation is assigned a specific color or symbol, and lines connect each observation to its changing position in rank so you can follow it along the chart. Interpreting these charts can be challenging; if there are frequent changes in rank the whole thing begins to look like spaghetti, and the more observations you have the tougher it gets to interpret. Most of the examples I found depicted a small and finite number of observations. I have hundreds of observations and only want to see the top ten, and if observations fall in and out of the top N ranks you get several discontinuous lines, which look odd. Lastly, neither Matplotlib nor Pandas has a default function for creating bump charts, although I found a few examples where you could create your own.
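For readers who haven't seen one, a bare-bones bump chart can be pieced together with a few lines of Matplotlib; the sketch below uses made-up ranks, not my migration data:
import matplotlib.pyplot as plt

years=[2011,2012,2013,2014,2015]
ranks={'Miami':[1,1,1,1,2],'Philadelphia':[2,2,2,2,1],'Bridgeport':[3,4,4,3,3]}  # made-up ranks

fig,ax=plt.subplots()
for place,r in ranks.items():
    ax.plot(years,r,marker='o',label=place)   # one line per place, following its rank over time
ax.invert_yaxis()            # rank 1 belongs at the top
ax.set_yticks([1,2,3,4])
ax.set_xticks(years)
ax.legend()
plt.show()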
Creating the rank change grids was a three-part process that required: taking the existing data and transforming it into an array of the top or bottom N values that you want to show, using that array to generate an array that shows change in ranks over time, and generating a plot using both arrays, one for the value and the other for the labels. I’ll tackle each piece in this post. I’ve embedded the functions at the end of each explanation; you can also look at my GitHub repo that has the Jupyter Notebook I used for the analysis for the paper (to be published in Sept 2020).
Create the Initial Arrays
In the paper I was studying flows between NYC and other counties, and the NYC metro area and other metropolitan statistical areas. I'll refer just to the metro areas as my example in this post, but my functions were written to handle both types of places, stored in separate dataframes. I began with a large dataframe with every metro that exchanged migrants with the NYC metro. There is a row for each metro where the index is the Census Bureau's unique FIPS code, and columns that show inflows, outflows, and net flows year by year (see image below). There are some rows that represent aggregates, such as flows to all non-metro areas and the sum of individual metro flows that could not be disclosed due to privacy regulations.
Initial Dataframe
The first step is to create an array that has just the top or bottom N places that I want to depict, just for one flow variable (in, out, or net). Why an array? Arrays are pretty solid structures that allow you to select specific rows and columns, and they mesh nicely with imshow charts as each location in the matrix can correspond with the same location in the chart. Most of the examples I looked at used arrays. It’s possible to use other structures but it’s more tedious; nested Python lists don’t have explicit rows and columns so a lot of looping and slicing is required, and with dataframes there always seems to be some catch with data types, messing with the index versus the values, or something else. I went with NumPy’s array type.
I wrote a function where I pass in the dataframe, the type of variable (in, out, or net flow), the number of places I want, whether they are counties or metro areas, and whether I want the top or bottom N records (true or false). Two arrays are returned: the first shows the FIPS unique ID numbers of each place, while the second returns the labels. You don’t have to do anything to calculate actual ranks, because once the data is sorted the ranks become implicit; each row represents ranks 1 through 10, each column represents a year, and the ID or label for a place that occupies each position indicates its rank for that year.
In my dataframe, the names of the columns are prefixed based on the type of variable (inflow, outflow, or net flow), followed by the year, i.e. inflows_2011_12. In the function, I subset the dataframe by selecting columns that start with the variable I want. I have to deal with different issues based on whether I'm looking at counties or metro areas, and I need to get rid of any IDs that are for summary values like the non-metro areas; these IDs are stored in a list called suppressed, and the ~df.index.isin(suppressed) syntax is pandas-speak for taking anything that's not in this list (the tilde acts as not). Then, I select the top or bottom values for each year, and append them to lists in a nested list (each sub-list represents the top / bottom N places in order for a given year). Next, I get the labels I want by creating a dictionary that relates all ID codes to label names, pull out the labels for the actual N values that I have, and format them before appending them to lists in a nested list. For example, the metro labels are really long and won't fit in the chart, so I split them and grab just the first piece: Albany-Schenectady-Troy, NY becomes Albany (split using the dash) while Akron, OH becomes Akron (if no dash is present, split at the comma). At the end, I use np.array to turn the nested lists into arrays, and transpose (T) them so rows become ranks and columns become years. The result is below:
Function and Result for Creating Array of IDs Top N Places
# Create array of top N geographies by flow type, with rows as ranks and columns as years
# Returns 2 arrays with values for geographies (id codes) and place names
# Must specify: number of places to rank, counties or metros, or sort by largest or smallest (True or False)
def create_arrays(df,flowtype,nsize,gtype,largest):
    geogs=[]
    cols=[c for c in df if c.startswith(flowtype)]
    for c in cols:
        if gtype=='counties':
            row=df.loc[~df.index.isin(suppressed),[c]]
        elif gtype=='metros':
            row=df.loc[~df.index.isin(msuppressed),[c]]
        if largest is True:
            row=row[c].nlargest(nsize)
        elif largest is False:
            row=row[c].nsmallest(nsize)
        idxs=list(row.index)
        geogs.append(idxs)
    if gtype=='counties':
        fips=df.to_dict()['co_name']
    elif gtype=='metros':
        fips=df.to_dict()['mname']
    labels=[]
    for row in geogs:
        line=[]
        for uid in row:
            if gtype=='counties':
                if fips[uid]=='District of Columbia, DC':
                    line.append('Washington\n DC')
                else:
                    line.append(fips[uid].replace('County, ','\n')) #creates short labels
            elif gtype=='metros':
                if '-' in fips[uid]:
                    line.append(fips[uid].split('-')[0]) #creates short labels
                else:
                    line.append(fips[uid].split(',')[0])
        labels.append(line)
    a_geogs=np.array(geogs).T
    a_labels=np.array(labels).T
    return a_geogs, a_labels
Change in Rank Array
Using the array of geographic ID codes, I can feed this into function number two to create a new array that indicates change in rank over time. It’s better to use the ID code array as we guarantee that the IDs are unique; labels (place names) may not be unique and pose all kinds of formatting issues. All places are assigned a value of 0 for the first year, as there is no previous year to compare them to. Then, for each subsequent year, we look at each value (ID code) and compare it to the value in the same position (rank) in the previous column (year). If the value is the same, that place holds the same rank and is assigned a 0. Otherwise, if it’s different we look at the new value and see what position it was in in the previous year. If it was in a higher position last year, then it has declined and we assign -1. If it was in a lower position last year or was not in the array in that column (i.e. below the top 10 in that year) it has increased and we assign it a value of 1. This result is shown below:
Function and Result for Creating Change in Rank Array
# Create array showing how top N geographies have changed ranks over time, with rows as rank changes and
# columns as years. Returns 1 array with values: 0 (no change), 1 (increased rank), and -1 (decreased rank)
def rank_change(geoarray):
    rowcount=geoarray.shape[0]
    colcount=geoarray.shape[1]
    # Create a number of blank lists
    changelist = [[] for _ in range(rowcount)]
    for i in range(colcount):
        if i==0:
            # Rank change for 1st year is 0, as there is no previous year
            for j in range(rowcount):
                changelist[j].append(0)
        else:
            col=geoarray[:,i] #Get all values in this col
            prevcol=geoarray[:,i-1] #Get all values in previous col
            for v in col:
                array_pos=np.where(col == v) #returns array
                current_pos=int(array_pos[0]) #get first array value
                array_pos2=np.where(prevcol == v) #returns array
                if len(array_pos2[0])==0: #if array is empty, because place was not in previous year
                    previous_pos=current_pos+1
                else:
                    previous_pos=int(array_pos2[0]) #get first array value
                if current_pos==previous_pos:
                    changelist[current_pos].append(0)
                    #No change in rank
                elif current_pos < previous_pos: #Smaller value = higher rank
                    changelist[current_pos].append(1)
                    #Rank has increased
                elif current_pos > previous_pos: #Larger value = smaller rank
                    changelist[current_pos].append(-1)
                    #Rank has decreased
                else:
                    pass
    rankchange=np.array(changelist)
    return rankchange
Create the Plot
Now we can create the actual chart! The rank change array is what will actually be charted, but we will use the labels array to display the names of each place. The values that occupy the positions in each array pertain to the same place. The chart function takes the names of both these arrays as input. I do some fiddling around at the beginning to get the labels for the x and y axis the way I want them. Matplotlib allows you to modify every iota of your plot, which is in equal measures flexible and overwhelming. I wanted to make sure that I showed all the tick labels, and changed the default grid lines to make them thicker and lighter. It took a great deal of fiddling to get these details right, but there were plenty of examples to look at (Matplotlib docs, cookbook, Stack Overflow, and this example in particular). For the legend, shrinking the colorbar was a nice option so it's not ridiculously huge, and I assign -1, 0, and 1 to specific colors denoting decrease, no change, and increase. I loop over the data values to get their corresponding labels, and depending on the color that's assigned I can modify whether the text is dark or light (so you can see it against the background of the cell). The result is what you saw at the beginning of this post for outflows (top destinations for migrants leaving the NY metro). The function is below:
Function for Creating Rank Change Grid
# Create grid plot based on an array that shows change in ranks and an array of cell labels
def rank_grid(rank_change,labels):
    alabels=labels
    xlabels=[yr.replace('_','-') for yr in years]
    ranklabels=['1st','2nd','3rd','4th','5th','6th','7th','8th','9th','10th',
                '11th','12th','13th','14th','15th','16th','17th','18th','19th','20th']
    nsize=rank_change.shape[0]
    ylabels=ranklabels[:nsize]
    mycolors = colors.ListedColormap(['#de425b','#f7f7f7','#67a9cf'])
    fig, ax = plt.subplots(figsize=(10,10))
    im = ax.imshow(rank_change, cmap=mycolors)
    # Show all ticks...
    ax.set_xticks(np.arange(len(xlabels)))
    ax.set_yticks(np.arange(len(ylabels)))
    # ... and label them with the respective list entries
    ax.set_xticklabels(xlabels)
    ax.set_yticklabels(ylabels)
    # Create white grid.
    ax.set_xticks(np.arange(rank_change.shape[1]+1)-.5, minor=True)
    ax.set_yticks(np.arange(rank_change.shape[0]+1)-.5, minor=True)
    ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
    ax.grid(which="major",visible=False)
    cbar = ax.figure.colorbar(im, ax=ax, ticks=[1,0,-1], shrink=0.5)
    cbar.ax.set_yticklabels(['Increased','No Change','Decreased'])
    # Loop over data dimensions and create text annotations.
    for i in range(len(ylabels)):
        for j in range(len(xlabels)):
            if rank_change[i,j] < 0:
                text = ax.text(j, i, alabels[i, j],
                               ha="center", va="center", color="w", fontsize=10)
            else:
                text = ax.text(j, i, alabels[i, j],
                               ha="center", va="center", color="k", fontsize=10)
    #ax.set_title("Change in Rank Over Time")
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    fig.tight_layout()
    plt.show()
    return ax
Conclusions and Alternatives
I found that this approach worked well for my particular circumstances, where I had a limited number of data points to show and the ranks didn’t fluctuate much from year to year. The charts for ten observations displayed over seven points in time fit easily onto standard letter-sized paper; I could even get away with adding two additional observations and an eighth point in time if I modified the size and placement of the legend. However, beyond that you can begin to run into trouble. I generated charts for the top twenty places so I could see the results for my own analysis, but it was much too large to create a publishable graphic (at least in print). If you decrease the dimensions for the chart or reduce the size of the grid cells, the labels start to become unreadable (print that’s too small or overlapping labels).
There are a number of possibilities for circumventing this. One would be to use shorter labels; if we were working with states or provinces we could use the two-letter postal codes, or ISO country codes in the case of countries. Not an option in my example. Alternatively, we could move the place names to the y-axis (sorted alphabetically or by first or final year rank) and then use the rank as the annotation label. This would be a fundamentally different chart; you could see how one place changes in rank over time, but it would be tougher to discern which places were the most important source / destination for the area you're studying (you'd have to skim through the whole chart). Or, you could keep ranks on the y-axis and assign each place a unique color in the legend, shade the grid cells using that color, and thus follow the changing colors with your eye. But this flops if you have too many places / colors.
A different caveat is that this approach doesn't work so well if there is a lot of fluctuation in ranks from year to year. In this example, the top inflows and outflows were relatively stable from year to year. There were enough places that held the same rank that you could follow the places that changed positions. We saw the example above for outflows, below is an example for inflows (i.e. the top origins or sources of migrants moving to the NY metro):
Annual Change in Ranks for Top Origins for NYC Metro Migrants (Metro Inflows)
In contrast, the ranks for net flows were highly variable. There was so much change that the chart appears as a solid block of colors with few neutral (unchanged) values, making it difficult to see what’s going on. An example of this is below, representing net flows for the NYC metro area. This is the difference between inflows and outflows, and the chart represents metros that receive more migrants from New York than they send (i.e. net receivers of NY migrants). While I didn’t use the net flow charts in my paper, it was still worth generating as it made it clear to me that net flow ranks fluctuate quite a bit, which was a fact I could state in the text.
Annual Change in Ranks for Net Receivers of NYC Metro Migrants (Metro Net Flows)
There are also a few alternatives to using imshow. Matplotlib’s pcolor plot can produce similar effects but with rectangles instead of square grid cells. That could allow for more observations and longer labels. I thought it was less visually pleasing than the equal grid, and early on I found that implementing it was clunkier so I went no further. My other idea was to create a table instead of a chart. Pandas has functions for formatting dataframes in a Jupyter Notebook, and there are options for exporting the results out to HTML. Formatting is the downside – if you create a plot as an image, you export it out and can then embed it into any document format you like. When you’re exporting tables out of a notebook, you’re only exporting the content and not the format. With a table, the content and formatting is separate, and the latter is often tightly bound to the publication format (Word, LaTeX, HTML, etc.) You can design with this in mind if you’re self-publishing a blog post or report, but this is not feasible when you’re submitting something for publication where an editor or designer will be doing the layout.
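As a rough sketch of that table route (not something I pursued), the change-in-rank array could be dropped into a dataframe and styled; this assumes the array and label lists created by the functions above:
import pandas as pd

# rankchange, ylabels, and xlabels are assumed to exist from the steps described earlier
df=pd.DataFrame(rankchange,index=ylabels,columns=xlabels)
palette={1:'#67a9cf',0:'#f7f7f7',-1:'#de425b'}
styled=df.style.applymap(lambda v: 'background-color: '+palette.get(v,'white'))
html=styled.to_html()   # newer pandas; older versions use styled.render()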
I really wanted to produce something that I could code and run automatically in many different iterations, and was happy with this solution. It was an interesting experiment, as I grappled with taking something that seemed intuitive to do the old-fashioned way (see below) and reproducing it in a digital, repeatable format.
NOTE – in mid-2021 the GeoBlacklight metadata standard was refined and renamed as the OpenGeoMetadata Aardvark Metadata Schema. The documentation for the standard was moved from the GB to the OGM GitHub repo.
During the COVID-19 lock down this past spring, I took the opportunity to tackle a project that was simmering on my back burner for quite a while: creating a coherent metadata standard for describing all of the GIS data we create in the lab. For a number of years we had been creating ISO 19115 / 19139 metadata records in XML, which is the official standard. But the process for creating them was cumbersome and I felt it provided little benefit. We don’t have a sophisticated repository where the records are part of a search and retrieval system, and that format wasn’t great for descriptive documentation either, i.e. people weren’t reading or referring to the metadata to learn about the datasets.
We were already creating basic Dublin Core (DC) records for our spatial databases and non-spatial data layers, so I decided to create some in-house guidelines and build processes and scripts for creating all of our metadata in DC, while adhering to guidelines and best practices established by the GeoBlacklight community. I’ll describe my process in this post, and hopefully it should be useful for others who are looking to do similar work. I’ll mention some resources for learning more about geospatial metadata in the conclusion.
FGDC and ISO 19115 / 19139 have long been the standards for creating geospatial metadata and expressing it in XML. The ISO standard has eclipsed the FGDC standard, but some institutions continue to use FGDC for legacy reasons. Since ISO was the standard, I decided that we should use it for all the layers we create in the lab. So why am I abandoning it?
The ISO standard is enormous and complex. Even though we were using just the required elements in the North American Profile, there were still over 100 fields that we had to enter. Many of them seemed silly, like having to input a name and mailing address for our organization for every instance where we were a publisher, distributor, creator, etc. We were creating the records in ArcCatalog, which only permits you to create and edit records using the ArcGIS metadata standard. You have to export the results out to ISO, but if you have changes to make you need to edit the original ArcGIS metadata, which forces you to keep two copies of every record. The metadata can be easily viewed in ArcCatalog, but not in other software. Creating your own stylesheet for displaying ISO records is a real pain, because there are dozens of namespaces and lots of nesting in the XML. So we ended up printing the styled records from ArcCatalog as PDFs (ugh), to include alongside the XML. Processing and validating the records to our own standards was equally painful.
In short, metadata creation became a real bottleneck in our data creation process, and the records didn’t serve a useful purpose for either search or description. Each record was about six pages long, and at least two-thirds of the information wasn’t relevant for people who were trying to understand our data.
In with the New
Dublin Core (DC) is a basic, open-ended metadata standard that was created in the mid 1990s, and has been adopted by numerous libraries and archives for documenting collections. There is a set of core elements and an additional set of terms that refine these elements. For example, there is a core Coverage element intended for describing time and place, but then there are more specific DC terms for Temporal and Spatial description. DC hasn’t been widely used for geospatial metadata because it’s not able to express all of the details that are important to those datasets, such as coordinate reference systems, resolution, and scale.
At least that’s what I thought. GeoBlacklight is an open source search and discovery platform for GIS data which relies on metadata records to power the platform. It was developed at Stanford and is used by several universities such as NYU and Cornell. A working group behind the project created and adopted a GeoBlacklight Metadata Schema (GB) which uses a mix of DC elements and GB-specific elements that are necessary for that platform. GB metadata is stored in a JSON format.
The primary aim of the standard is to provide records that aid searching, and there is less of an emphasis on fully describing the datasets for the purpose of providing documentation. But for my purposes, it was fine. I could adopt a subset of the DC elements and terms that I felt were necessary for describing our resources, while keeping an eye on the GB standard and following it wherever I could. If there was a conflict between what we wanted to do and the GB standard, I made sure that I could crosswalk one to the other. I would express our records in XML, and would write a script to transform them to the GB standard in a JSON format. That way, I can share our data and metadata with NYU, which has a GB-backed spatial data repository and re-hosts several of our datasets.
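The crosswalk itself doesn't have to be fancy; a bare-bones sketch is below. The GeoBlacklight field names used here (dc_title_s, dc_creator_sm, dc_subject_sm, dct_provenance_s) reflect my reading of the GB 1.0 schema and should be checked against the current documentation, and the rest of the mapping is elided.
import json
import xml.etree.ElementTree as ET

def dc_to_gb(xml_file,json_file,institution):
    "Crosswalk one of our DC XML records to a GeoBlacklight-style JSON record (sketch)"
    root=ET.parse(xml_file).getroot()   # our records have a single metadata root element, no namespaces
    grab=lambda tag: [e.text for e in root.findall(tag)]
    record={'dc_title_s':grab('title')[0],
            'dc_creator_sm':grab('creator'),
            'dc_subject_sm':grab('subject'),
            'dct_provenance_s':institution}
    # ... the remaining required GB elements would be added here, including the
    # envelope built from our Spatial element (see the bounding box example below)
    with open(json_file,'w',encoding='utf8') as f:
        json.dump(record,f,indent=2)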
Data Application Profile
The first step was to establish our in-house guidelines. I created a data application profile (DAP) to specify the DC elements we would use, which are required versus optional, and whether or not they can repeat. For example, we require a Title and Creator element, where there can be only one title but multiple creators (as more than one person may have created the resource). We also use the Contributor element to reference people who had some role in creating the dataset, but one less significant than the creator's. This element is optional, as there might not be a contributor.
For some of the elements we require specific formatting, while for others we specify a set of controlled terms that must be used. For the Title, we follow Open Geoportal best practices where the title is written as: Layer Name, Place, Date (OGP is a separate but related geospatial data discovery platform and community). Publication dates are written as YYYY-MM. For the Subject element, we use the ISO 19115 categories that are commonly used for describing geospatial data. These are useful for filtering records by topic, and are much easier to consistently implement than Library of Congress Subject Headings. In some cases I adopted elements and vocabulary that were required by GB, such as the Provenance field which is used by GB to record the name of the institution that holds the dataset.
Snippet of an XML metadata record:
<title>Real Estate Sales, New York NY, 2019</title>
<creator>Frank Donnelly</creator>
<creator>Anastasia Clark</creator>
<subject>economy</subject>
<subject>planningCadastre</subject>
<subject>society</subject>
<subject>structure</subject>
In other cases I went contrary to the GB standard, but kept an eye on how to satisfy it, so that when records are crosswalked from our DC standard in XML to the GB standard in JSON, the GB records will contain all the required information. I used the DC Medium element to specify how the GIS data is represented (point, line, polygon, raster, etc.), which is required as a non-DC element in GB records. I used the Source element to express the original data sources we used for creating our data, which is something that's important to us; GB doesn't use the Source element this way, so when our records are crosswalked this information will be dropped. I used Coverage for expressing place names from Geonames, and Spatial to record the bounding box for the features using the DC Box format. I also used the Spatial field as a way of expressing the coordinate reference system for the layer: the coordinates and system name published in the record are for the system the layer uses. When this information gets crosswalked to a GB record, the data in Coverage is transferred to Spatial (as GB uses this DC field for designating place names), and the Spatial bounding box data is migrated and transformed to a GB-specific envelope element (in GB, the bounding box is used for spatial searching and is expressed in WGS 84).
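To make that concrete, here is a hypothetical snippet of how these elements could look for the real estate sales record above, assuming the DCMI Box syntax for the Spatial element; the medium, source text, place name, and coordinates are placeholders (the coordinate values are only approximate bounds for New York City in EPSG:2263), not taken from one of our actual records:
<medium>polygon</medium>
<source>Name of the original data provider and dataset</source>
<coverage>New York City</coverage>
<spatial>northlimit=273000; southlimit=120000; eastlimit=1067000; westlimit=913000; units=US survey feet; projection=EPSG:2263 NAD83 / New York Long Island (ftUS)</spatial>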
Creating Records and Expressing them in XML
I created a blank XML template that contains all our required and optional elements, with sample values to indicate how the elements should be populated. If we are creating new records from scratch, we can copy the template and fill it in using a plain text editor. Alternatively, we can use the Dublin Core Metadata Generator, but then we have to be careful to follow our DAP, selecting just our required elements and omitting ones that we don’t use. All of our elements are expressed under a single root element called “metadata”. While it is common to prefix DC elements with a namespace, we do NOT include these namespaces because it makes validating the records a pain… more about that in the next section.
In most cases, we won’t be creating records from scratch but will be modifying existing ones, and only a handful of elements need updating in any given year. To this end, I wrote two scripts that make copies of specific records: one that makes a copy for each year or time period (for a single layer that repeats over many years – one file to many), and one that makes a copy for each file name (for a series of different layers produced for one time period – one file to one). That way, we can edit the copies and change just the fields that need updating.
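As a rough illustration of the first approach (one file to many), something like the snippet below would take the record for the most recent year and copy it once for each prior year. The paths, file name pattern, and year range are examples rather than the exact values from our scripts:
import os, shutil

# example values: the template record for the latest year, an output folder, and the years to copy it for
template = os.path.join('projects','nyc_real_estate','real_property_sales_nyc_2019.xml')
outfolder = os.path.join('projects','nyc_real_estate','newfiles')
years = range(2010, 2019)

for year in years:
    # copy the template, substituting the year into the new file name
    newfile = os.path.join(outfolder, 'real_property_sales_nyc_{}.xml'.format(year))
    shutil.copyfile(template, newfile)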
The process of updating the records is going to vary with each series and with each particular iteration. I wrote some Python functions that can be used repeatedly, and import these into scripts that are custom designed for each dataset. I use the ElementTree module to parse the XML and update specific fields. For elements that require updated values, I either hard code them into the script or read them from a JSON file that I construct. For example, to update our real estate sales layers I took the file for the most recent year and made copies of it, named for each year in the past. Then I created a JSON file that’s read into a Python dictionary, and, using the year in the file name as the key, the elements for year issued, year published, creator, and contributor in the XML are modified with the dictionary values. In some cases the substring of an element must be changed, such as the year that appears at the end of the Title field. In other cases an entire value can simply be swapped out.
A portion of a JSON file is below, where the year is the key and the value is a dictionary whose keys are element names and whose values are the element text to be swapped in:
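For example (the dates and names shown here are placeholders standing in for the actual values, and only a few of the element keys are shown):
{
  "2018": {"issued": "2018-12", "creator": "Creator Name", "contributor": "Contributor Name"},
  "2017": {"issued": "2017-12", "creator": "Creator Name", "contributor": "Contributor Name"}
}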
Several files called “real_property_sales_nyc_YEAR.xml” are read in. These files are all copies of the record for the most recent year (2019), but YEAR in the file name is substituted with each year of data that we have (2018, 2017, etc.). The script reads the year from the file name and updates the appropriate elements using that year as the key (in the actual code the functions are imported from a separate file so they can be used in multiple scripts). With ElementTree, you parse the XML file into a tree object, call getroot to get the root element of the tree, and from there you can find specific elements (like the title element), get an element’s text value (the actual title), and modify that value:
import os
import json
import xml.etree.ElementTree as ET

def esimple(root,elem,value):
    """Replace the value of an element"""
    e=root.find(elem)
    e.text=value

def esubstring(root,elem,oldstring,newstring):
    """Replace a portion (substring) of the value of an element"""
    e=root.find(elem)
    e.text=e.text.replace(oldstring,newstring)

# example paths to the JSON file of updates and the folder of copied XML records
json_update_file=os.path.join('projects','nyc_real_estate','updates.json')
infolder=os.path.join('projects','nyc_real_estate','newfiles')

with open(json_update_file) as jfile:
    updates = json.load(jfile)

for file in os.listdir(infolder):
    if file[-4:]=='.xml':
        xfilein=os.path.join(infolder,file)
        year=file[-8:-4] #assumes year is at end of file name
        tree = ET.parse(xfilein)
        root = tree.getroot()
        esimple(root,'temporal',year)
        esimple(root,'issued',updates[year]['issued'])
        esubstring(root,'title','2019',year)
        ...
Validating Records
Since I’m already operating within Python, I decided to use a 3rd party Python module called xmlschema to validate our records. It’s simple and works quite nicely, but first we need to have a schema. An XML schema or XSD file is an XML file that contains instructions on how our XML metadata should be structured: all the elements we use, whether they are required or not, whether they can repeat or not, the order in which they must appear, and whether there are controlled vocabularies. I hard-coded shorter vocabs like the ISO categories and many of the GB requirements into the schema, but I didn’t include absolutely everything. Since we’re a small shop that makes a limited amount of metadata, we’ll still be checking the records manually.
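For a sense of what the schema contains, here is a hypothetical excerpt (not our actual bcgis_dc_schema.xsd) of the kind of declarations that would sit inside the sequence for the root metadata element, showing how an element can be required, allowed or forbidden to repeat, and restricted to a controlled list such as the ISO topic categories:
<!-- title: required and cannot repeat -->
<xs:element name="title" type="xs:string" minOccurs="1" maxOccurs="1"/>
<!-- subject: required, can repeat, limited to the ISO 19115 topic categories -->
<xs:element name="subject" minOccurs="1" maxOccurs="unbounded">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="economy"/>
      <xs:enumeration value="planningCadastre"/>
      <xs:enumeration value="society"/>
      <xs:enumeration value="structure"/>
      <!-- remaining categories omitted -->
    </xs:restriction>
  </xs:simpleType>
</xs:element>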
The Python validator script checks for well-formedness first (i.e. every open tag has a closing tag), and then checks the record against the schema file; if it hits something that’s invalid, it stops there. I wrote the script as a batch process, so if one file fails, the error is recorded and the script moves on to the next file. It produces an error report, we make the corrections, and we go back and try validating again until everything passes.
In the snippet below, we read in the schema file and a folder of records to be validated. my_schema.is_valid() tests whether a record is valid against the schema. If the record is valid (True), I perform some other check (not shown), and if that turns out OK then we have no problems. Otherwise, if my check fails, or the record fails schema validation (False), the problem is recorded and we move on to the next file. Problems are printed to screen, and other status messages are output (not shown):
import xmlschema as xmls
import os

schemafile=os.path.join('bcgis_dc_schema.xsd')
xmlfolder=os.path.join('projects','nyc_real_estate','newfiles')

problems={}   # file name : validation error or check message
noproblems=[] # files that passed all checks

my_schema = xmls.XMLSchema(schemafile)

for file in os.listdir(xmlfolder):
    if file[-4:]=='.xml':
        filepath=os.path.join(xmlfolder,file)
        my_xml = xmls.XMLResource(filepath)
        test=my_schema.is_valid(my_xml)
        if test==True:
            msg=spatial_check(my_xml) # additional custom check, defined elsewhere (not shown)
            if msg==None:
                noproblems.append(file)
            else:
                problems[file]=msg
        else:
            try:
                my_schema.validate(my_xml) # re-validate to capture the specific error
            except xmls.XMLSchemaException as e:
                problems[file]=e
            continue

if len(problems)>0:
    print('***VALIDATION ERRORS***\n')
    for k, v in problems.items():
        print(k,v)
else:
    pass
Creating the schema was an area where I ran into trouble, because I initially tried to refer to the DC namespaces in my records and kept all of the namespace prefixes with the elements, i.e. dc:title instead of title. This seemed like the “right” thing to do. It turns out that schema creation and validation for XML records with multiple namespaces is a royal pain; you have to create a separate schema for each namespace, which in this case would mean three: DC elements, DC terms, and my own root metadata element. Then I’d have to write all kinds of stuff to override the default DC schema files, because they were designed to be flexible: their only requirement is that just the listed DC elements and terms can appear. My requirements are much more stringent, as I wanted to use a subset of the DC elements and impose restrictions to ensure consistency. I watched countless YouTube videos on XML schemas and read several tutorials that didn’t address this complexity, and what I did find suggested that it was going to be arduous.
Ultimately I decided – why bother? I jettisoned the namespaces from our records, and have yet another script for stripping them out of records (for older records where we had included namespaces). As long as I have a solid DAP, am following best practices consistently, am using vocabularies that others are using, and we validate and check our records, others should be able to interpret them and either ingest or transform them if they need to.
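A minimal sketch of how namespace stripping can be done with ElementTree (the file names are just examples; this relies on the fact that ElementTree parses a prefixed tag as '{namespace-uri}element'):
import xml.etree.ElementTree as ET

def strip_namespaces(tree):
    """Drop the namespace URI from every tag, e.g. '{http://purl.org/dc/elements/1.1/}title' becomes 'title'."""
    for elem in tree.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}',1)[1]

tree = ET.parse('old_record.xml')   # example file name
strip_namespaces(tree)
tree.write('old_record_stripped.xml', encoding='utf-8', xml_declaration=True)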
Styling the Records
While XML is fine for search and data exchange, it’s not friendly for humans to read, so it’s a good idea to apply a stylesheet to render the XML in a readable form in a web browser. XSLT stylesheets give you lots of control, as you can actually process and transform the underlying XML to display what you want. But to my dismay, this was also needlessly complicated. Searching around, I discovered that many web browsers restrict the display of XML via locally stored XSLT stylesheets for security reasons, which makes development cumbersome. I also got the sense that this is an atrophying technology; most of the references I found to XSLT were 10 to 20 years old.
So, I took a simpler route and created a CSS stylesheet, which can style XML essentially the same way that it styles HTML. You must display all of the underlying data in the XML file and your control is limited, but there are enough tweaks where you can insert labels, and vary the display for single items versus lists of items. I keep the stylesheet stored on our website, and as part of the metadata creation process a link to it is embedded in each of our records. So when you open the XML file in a browser, it looks to the online CSS file and renders it. The XML snippet at the beginning of this post looks like the image below, when a user clicks on the file to view it:
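The mechanics are roughly as follows: each record carries a processing instruction pointing at the stylesheet, and the CSS uses element names as selectors. The URL and rules below are illustrative, not our actual stylesheet. In the XML record:
<?xml-stylesheet type="text/css" href="https://www.example.edu/geodata/metadata.css"?>
And in the stylesheet, where :before rules can be used to insert labels:
title { display: block; font-size: 1.4em; font-weight: bold; }
subject { display: list-item; margin-left: 2em; }
creator:before { content: "Creator: "; font-weight: bold; }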
Migrating the DC XML records to GB JSON was actually straightforward, using Python and a combination of the ElementTree and JSON modules. I walked each of our elements over to its corresponding GB element, making modifications when needed using the ElementTree functions and methods mentioned previously: I find each element in our XML and save its value in a dictionary using the GB element name as the key, and at the end I write the items out to JSON. In some cases elements from our records are dropped if there isn’t a suitable counterpart. The bounding box element was trickiest, as I had to convert the northing and easting coordinate format used by DC Box into a standard OGC envelope structure, and transform the CRS from whatever the layer uses to WGS 84; I used the pyproj module to help with both. Some of the elements are left blank, because I’d have to coordinate with my colleagues at NYU for filling those values, like the unique id or “slug” that’s native to their repository.
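A rough sketch of the bounding box step, assuming a layer in EPSG:2263 (New York State Plane Long Island) and an envelope written as ENVELOPE(west, east, north, south) in WGS 84; the corner coordinates are the same approximate placeholders used earlier, not values from an actual record:
from pyproj import Transformer

# placeholder corner coordinates in the layer's CRS (EPSG:2263, US feet)
west, south, east, north = 913000, 120000, 1067000, 273000

# transform the corners from the layer's CRS to WGS 84 (EPSG:4326)
transformer = Transformer.from_crs('EPSG:2263', 'EPSG:4326', always_xy=True)
minlon, minlat = transformer.transform(west, south)
maxlon, maxlat = transformer.transform(east, north)

# format the result as an envelope string for spatial search
envelope = 'ENVELOPE({},{},{},{})'.format(minlon, maxlon, maxlat, minlat)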
Conclusion
You should always create metadata and rely on existing standards and vocabularies to the extent possible, but create something that works for you given your circumstances. Figure out whether your metadata serves a discovery function, a documentation function, or some combination of the two. Create a clear set of rules that you can implement, and that will provide useful information to your data users without becoming a serious impediment to your data creation process. Creating metadata does take time and you need to budget for it accordingly, but don’t waste that time (as I did) wrestling with clunky metadata creation software and piles of information that will largely go unused, simply because it’s “the right thing to do”. Ideally, whatever you create should be consistent and flexible enough that you can write some code to transfer it to a different standard or format if need be.
This month’s post is a bit shorter, as I have just two announcements I wanted to share about some resources I’ve created.
First, I’ve written a short technical paper that’s just been published as part of the Weissman Center for International Business’ Occasional Papers Series. Exploring US Census Datasets: A Summary of Surveys and Sources provides an overview of several different datasets (decennial census, American Community Survey, Population Estimates Program, and County Business Patterns) and sources for accessing data. The paper illustrates basic themes that are part of all my census-related talks: the census isn’t just the thing that happens every ten years, but an ecosystem of datasets updated on an on-going basis, and there are many sources for accessing data that are suitable for different purposes and designed for users with varying levels of technical skill. In some respects this paper is a super-abridged version of my book, designed to serve as an introduction and brief reference.
Second, I’ve created a series of introductory notebooks on GitHub that illustrate how to use the Census Bureau’s API with Python and Jupyter Notebooks. I designed these for a demonstration I gave at NYU’s Love Data Week back on Feb 10 (the slides for the talk are also available in the repo). I structured the talk around three examples. Example A demonstrates the basics of how the API works along with some best practices, such as defining your variables at the top and progressively building links to retrieve data. It also illustrates the utility of using these technologies in concert, as you can pull data into your script and process and visualize it in one go. I also demonstrate how to retrieve lists of census variables and their corresponding metadata, which isn’t something that’s widely documented. Example B is a variation of A, extended by adding an API key and storing data in a file immediately after retrieval. Example C introduces more complexity, reading variables in from files and looping through lists of geographies to make multiple API calls.
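To give a flavor of the pattern in Example A, a minimal sketch might look like the following; the year, dataset, and variable codes are examples, chosen here for illustration rather than taken from the notebooks:
import requests
import pandas as pd

# define the pieces of the request up front
year = '2021'
dataset = 'acs/acs5'
variables = 'NAME,B01003_001E'   # geography name and total population
geography = 'county:*'           # all counties...
state = 'state:36'               # ...in New York

# progressively build the link and retrieve the data
base = 'https://api.census.gov/data/{}/{}'.format(year, dataset)
data_url = '{}?get={}&for={}&in={}'.format(base, variables, geography, state)
response = requests.get(data_url)
data = response.json()

# the first row returned is the header; the rest are records
df = pd.DataFrame(data[1:], columns=data[0])
print(df.head())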
Since I’ve written a few posts on the census API recently, I went back and added an api tag to group them together, so you can access them via a single link.
Define census API variables, build links, and retrieve data