scripting

Anything about scripting – Python, R, JavaScript, etc.

Sample of Geolocated Tweets Nov 1, 2022

Parsing the Internet Archive’s Twitter Stream Grab with Python

In this post I’ll share a process for getting geo-located tweets from Twitter, using large files of tweets archived by the Internet Archive. These are tweets where the user opted to have their phone or device record the longitude and latitude coordinates for their location at the time of the tweet. I’ve created some straightforward scripts in Python, without any third-party modules, for processing a daily file of tweets. Given all the turmoil at Twitter in early 2023, most of the tried and true solutions for scraping tweets or using their APIs no longer function. What I’m presenting here is one simple solution.

Social media data is not my forte, as I specialize in working with official government datasets. When such questions turn up from students, I’ve always turned to the great Web Scraping Toolkit developed by our library’s Center for Digital Scholarship. But last week, a graduate student I was helping and I discovered that both the Twint and TAGS tools no longer function due to changes in Twitter’s developer policies. Surely there must be another solution – there are millions of posts on the internet that show how easy it is to grab tweets via R or Python! Alas, we tried several alternatives to no avail. Many of these projects rely on third-party modules that are deprecated or dodgy (or both), and even if you can escape from dependency hell and get everything working, the changed policies rendered them moot.

You can register under Twitter’s new API policy and get access to a paltry number of records. But I thought – surely, someone else has scraped tons of tweets for academic research purposes and has archived them somewhere – could we just access those? Indeed, the folks at Harvard have. They have an archive of geolocated tweets in their dataverse repository, and another one for political tweets. They are also affiliated with a much larger project called DocNow with other schools that have different tweet archives. But alas, there are rules to follow, and to comply with Twitter’s license agreement Harvard and these institutions can’t directly share the raw tweets with anyone outside their institutions. You can search and get IDs for the tweets, and then use their Hydrator application to retrieve the actual tweets. But then in small print:

“Twitter’s changes to their API which greatly reduce the amount of read-only access means that the Hydrator is no longer a useful application. The application keys, which functioned for the last 7 years, have been rescinded by Twitter.”

Fortunately, there is the Internet Archive, which has been working to preserve pieces of the internet for posterity for several decades. Their Twitter Stream Grab consists of monthly collections of daily files, from 2016 to 2022. This project is no longer active, but there’s a newer one called the Twitter Archiving Project which has data from 2017 to now. I didn’t investigate this latter one, because I wasn’t sure if it provided the actual tweets or just metadata about them, while the older project definitely did. The IA describes the Stream Grab as the “spritzer” version of Twitter grabs (as opposed to a sprinkler or garden hose). Thanks to the internet, it’s easy to find statistics but hard to find reliable ones – one credible-looking source (the GDELT Project) suggests that there are between 400 and 500 million tweets a day in recent years. The file I downloaded from IA for one day had over 4 million tweets, so that’s about 1% of all tweets.

I went into the November 2022 collection and downloaded the file for Nov 1st. It’s a TAR file that’s about 3 GB. Unzipping it gives you a folder for that day’s data, named for the date, with hundreds of gzipped (.gz) files. Decompress those, and you have tons of JSON Lines (JSONL) files. These are JSON files where each JSON record has been collapsed into one line.

Internet Archive Twitter Stream Grab
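
If you’d rather not decompress everything by hand, the extraction can be scripted with Python’s standard library. This is a minimal sketch (not part of my scripts), and the archive and date folder names are assumptions – adjust both to match your download.

import gzip, os, shutil, tarfile

tar_name = 'twitter-stream-2022-11-01.tar' # hypothetical file name - adjust to your download
date_dir = '20221101' # hypothetical name of the extracted date folder - adjust as needed

# Extract the daily TAR into the current folder (creates the date folder)
with tarfile.open(tar_name) as tar:
    tar.extractall()

# Decompress every .gz file in the date folder into a JSONL file
for f in os.listdir(date_dir):
    if f.endswith('.gz'):
        gz_path = os.path.join(date_dir, f)
        # strips the .gz extension (assumes files are named like *.json.gz)
        with gzip.open(gz_path, 'rb') as zipped, open(gz_path[:-3], 'wb') as unzipped:
            shutil.copyfileobj(zipped, unzipped)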

Python to the rescue. See GitHub for the full scripts – I’ll just add some snippets here for illustration. I wrote two scripts: the first reads in and aggregates all the tweets from the JSONL files, parses them into a Python dictionary, and writes out the geo-located records into regular JSON. The second reads in that file, selects the elements and values that we want into a list format, and writes those out to a CSV. The rationale was to separate importing and parsing from making these selections, as we’re not going to want to repeat the time-consuming first part while we’re tweaking and modifying the second part.

In the sample data I used for 11/01/2022, unzipping the downloaded TAR file gave me a date folder, and in that date folder were hundreds of gz files. Decompressing those revealed the JSONL files. I wrote the script to look in that date folder, one level below the folder that holds the scripts, and read in anything that ends with .json. Not all of the Internet Archive’s streams are structured this way; if your downloads are structured differently, you can simply move all the unzipped json files to one directory below the script to read them. Or, you can modify the script to iterate through sub-directories, as in the sketch below.
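
If your copy does come with nested sub-folders, one hedged variation is to use os.walk instead of os.listdir (used in the snippet further down), since it visits the top-level folder and every sub-folder beneath it. A short sketch, with json_dir standing in for whatever your top-level folder is called:

import os

json_dir = '20221101' # hypothetical top-level folder name - adjust as needed

json_files = []
# os.walk yields each directory under json_dir along with the files it contains
for root, dirs, files in os.walk(json_dir):
    for f in files:
        if f.endswith('.json'):
            json_files.append(os.path.join(root, f))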

Because the data was stored as JSONL, I wasn’t able to read it in as regular JSON. I read each line as a string that I appended to a list, iterated through that list to convert each line into a dictionary, pulled out the records that had geo-located elements, and added those records to a larger dictionary, where an identifier in the record serves as the key and the value is a dictionary with all elements and values for a tweet. This gets written out as regular JSON at the end. Reading the data in didn’t take long; parsing the strings into dictionaries was the time-consuming part. Originally, I wanted to parse and save all 4 million records, but the process stalled around 750k as I ran out of memory. Since so few records are geo-located, just selecting these circumvented this problem. If you wanted to modify this part to get other kinds of records, you would need to apply some filter, or implement a more efficient process than what I’m using.

import json, os

# json_dir = path to the date folder containing the unzipped .json files (set earlier in the full script)

json_list=[] # flat list of strings, each string = 1 line = 1 tweet record

for f in os.listdir(json_dir):
    if f.endswith('.json'):
        json_file=os.path.join(json_dir,f)
        with open(json_file,'r',encoding='utf-8') as jf:
            jfile_list = list(jf) # list of strings, one string per line in the file
            json_list.extend(jfile_list)
            print('Processed file',f,'...')

geo_dict={} # dictionary of dicts; each value is a line parsed into keys / values
i=0
for json_str in json_list:
    result = json.loads(json_str) # convert line / string to dict
    if result.get('geo') is not None: # only keep records that were geocoded
        geo_dict[result['id']]=result
    i=i+1
    if i%100000==0:
        print('Processed',i,'records...')
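
If you do want records beyond the geo-located ones, one way around the memory problem mentioned above is to skip the intermediate list of strings entirely: parse each line as you read it and keep only what passes your filter. This is a rough sketch of that streaming approach, not the script I actually used, with placeholder names for the folder and output file:

import json, os

json_dir = 'date_folder' # folder of unzipped .json (JSONL) files - adjust as needed

filtered = {} # only records that pass the filter are ever held in memory
for f in os.listdir(json_dir):
    if f.endswith('.json'):
        with open(os.path.join(json_dir, f), 'r', encoding='utf-8') as jf:
            for line in jf: # stream one line (one tweet) at a time
                record = json.loads(line)
                if record.get('geo') is not None: # swap in whatever filter you need
                    filtered[record['id']] = record

with open('filtered_tweets.json', 'w', encoding='utf-8') as outfile:
    json.dump(filtered, outfile)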

The second script reads the JSON output from the first, and simply iterates through the dictionary and chooses the elements and values I want and assigns them to variables. Some of these are straightforward, such as grabbing the timestamp and tweet. Others required additional work. The source element provides HTML code with a source link and name, so I split and strip this value to get them separately. The coordinates are stored as a list, so to get longitude and latitude as separate values I indicate the list position. In cases where I’m delving into a sub-dictionary to get a value (like the coordinates), I added if statements to set values to None if they don’t exist in the JSON, otherwise you get an error. Once I finish iterating, I append all these variables to a list, and add this list to the main one that captures every record. I create a matching header row list, and both are written out as a CSV.

import json

# input_json = path to the geo-located JSON file written by the first script

with open(input_json, 'r', encoding='utf-8') as json_file:
    twit_data = json.load(json_file)

twit_list=[]

# In this block, select just the keys / values to save
for k,v in twit_data.items():
    tweet_id=k
    timestamp=v.get('created_at')
    tweet=v.get('text')
    # Source is in HTML with anchors. Separate the link and source name
    source=v.get('source') # This is in HTML
    source_url=source.split('"')[1] # This gets the url
    source_name=source.strip('</a>').split('>')[-1] # This gets the name
    lang=v.get('lang')
    # Value for long / lat is stored in a list, must specify position
    if v['geo'] !=None:
        longitude=v.get('geo').get('coordinates')[1]
        latitude=v.get('geo').get('coordinates')[0]
    else:
        longitude=None
        latitude=None
...
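
The remainder of that script isn’t shown here, but the ending described above boils down to appending the variables for each tweet to a row, adding that row to the main list, and writing a header plus all rows out with the csv module. A hedged sketch of what that looks like, using placeholder names for the header fields and output file:

import csv

# Inside the loop above: gather the values for one tweet into a row,
# then add that row to the main list
# twit_list.append([tweet_id, timestamp, tweet, source_url, source_name,
#                   lang, longitude, latitude])

header = ['tweet_id', 'timestamp', 'tweet', 'source_url', 'source_name',
          'lang', 'longitude', 'latitude'] # placeholder column names

with open('tweets_geo.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header) # header row first
    writer.writerows(twit_list) # then one row per tweet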

My code could use improvement – much of this could be abstracted into a function to avoid repetition. We were in a hurry, and I’m also working with folks who need data but aren’t necessarily familiar with Python, so something that’s inefficient but understandable is okay (although I will polish this up in the future).

I provide the output on GitHub; examples of the final CSV appear below. Just about every language in the world is captured in these tweets, so Windows users need to import the CSV into Excel (Data – From Text/CSV) and choose UTF-8 encoding. Double-clicking the CSV to open it in Excel on Windows will render most of the text as junk, in the default Windows-1252 encoding.

Tweets extracted from Internet Archive with timestamp, tweets, and source information
Tweets extracted from Internet Archive, showing geo-located information

So, is this data actually useful? That’s an open question. Of the 4 million tweets in this file, just 1,158 were geo-located! I checked, and this is not a mistake. The metadata record for the Harvard geolocated tweets mentions that only 1% to 2% of all tweets are geo-located. So of the 400 million daily tweets, only about 4 million would be. And out of our daily 4 million tweet sample from IA, just 1,158 were geo-located (roughly 0.03%). What we ended up with does give you a sense of variety and global coverage (see the map at the top of the post, showing the sample of tweets by language for Nov 1, 2022). In this sample, the top countries represented were: US (35%), Japan (17%), Brazil (4%), UK (4%), and Mexico and Turkey (tied at 3%). For languages, the top five were: English (51%), Japanese (17%), Spanish (9%), Portuguese (5%), and Turkish (3%).
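
If you want to reproduce that kind of language summary from the final CSV yourself, a quick tally with the csv module and collections.Counter would look something like the sketch below; the file name and the 'lang' column name are assumptions, so match them to your own header row.

import csv
from collections import Counter

lang_counts = Counter()
with open('tweets_geo.csv', 'r', encoding='utf-8') as infile: # placeholder file name
    for row in csv.DictReader(infile):
        lang_counts[row['lang']] += 1 # assumes a 'lang' column in the header

total = sum(lang_counts.values())
for lang, count in lang_counts.most_common(5):
    print(lang, count, round(100 * count / total, 1), '%')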

In many cases, I think you’d need a larger sample than a single day, assuming you’re interested in just geo-located records. Perhaps 4 million is large enough for certain non-spatial research? Again, not my area of expertise, but you would want to be aware of events that happened on a certain date that would influence what was tweeted. My graduate student wanted to see differences in certain kinds of tweets in the LA metro area versus the rest of the US, but this sample includes fewer than 20 tweets from LA. To do anything meaningful, she’d have to download and process a whole month of tweets (at least). Even then, there are certain tweeters that show up repeatedly in given areas. In NYC, most of the tweets on this date were from the 511 service, warning people where that day’s potholes were.

Beyond the location of the tweet, there is a lot of information about the user, including their self-reported location. This data is available in all tweets (not just the geo-located ones). But there are a lot of problems with this attribute: the user isn’t necessarily tweeting from that location, as it represents their “static” home. This location is not geocoded, and it’s self-reported and uncontrolled. In this example, some users dutifully reported their home as ‘Cleveland, OH’ or ‘New York City’. Other folks listed ‘NYC – LA – ATL – MIA’, ‘CIUDAD DE LAS BAJAS PASIONES’, ‘H E L L’, and ‘Earth. For now’.

Even for research that incorporated geo-located tweets from other, larger data sources that were previously accessible, how representative are all those studies when the data represents only 1% of the total tweet volume? I am skeptical. Also consider the information from the good folks at the Pew Research Center, which tells us that only one in five US adults use Twitter, and that a minority of Twitter users generates the vast majority of tweets: “The top 25% of US users by tweet volume produce 97% of all tweets, while the bottom 75% of users produce just 3%” (10 Facts About Americans and Twitter, May 5, 2022).

For what it’s worth, if you need access to Twitter data for academic, non-commercial research purposes and the old methods aren’t working, perhaps the Internet Archive’s data and the solution posed here will fit the bill. You can see the geo-located output (JSON and CSV) from this example in the GitHub repo’s output folder. There is also a samples folder, which contains JSON and CSV for about 77k records that include both geo-located and non-geolocated examples. Looking at the examples can help you decide how to modify the scripts, to pull out different elements and values of interest.

Stata splash screen

Introduction to Stata Tutorial

This month’s post will be brief, but helpful for anyone who wants to learn Stata. I wrote a short tutorial called First Steps with Stata for an introductory social science data course I visited earlier this month. It’s intended for folks who have never used a statistical package or a command-driven interface, and represents initial steps prior to doing any statistical analysis:

  1. Loading and viewing Stata dta files
  2. Describing and summarizing data
  3. Modifying and recoding data
  4. Batch processing with Do files / scripts
  5. Importing data from other formats

I chose the data that I used in my examples to illustrate the difference between census microdata and summary data, using sample data from the Current Population Survey to illustrate the former, and a table from the American Community Survey to represent the latter.

I’m not a statistician by training; I know the basics but rely on Python, databases, Excel, or GIS packages instead of stats packages. I learned a bit of Stata on my own in order to maintain some datasets I’m responsible for hosting, but to prepare more comprehensively to write this tutorial I relied on Using Stata for Quantitative Analysis, which I highly recommend. There’s also an excellent collection of Stata learning modules created by UCLA’s Advanced Research Computing Center. Stata’s official user documentation is second to none for clearly introducing and explaining the individual commands and syntax.

In my years working in higher ed, the social science and public policy faculty I’ve met have all sworn by Stata over the alternatives. A study of citations in the health sciences, where the stats packages used for the research were referenced in the texts, illustrates that SPSS is employed most often, but that Stata and R have increased in importance / usage over the last twenty years, while SAS has declined. I’ve had some students tell me that Stata commands remind them of R. In searching through the numerous shallow reviews and comparisons on the web, I found this post from a data science company, which compares R, Python, SPSS, SAS, and Stata, to be comprehensive and even-handed in summarizing the strengths and weaknesses of each. In short, I thought Stata was fairly intuitive, and the ability to batch script commands in Do files and to capture all input / output in logs makes it particularly appealing for creating reproducible research. It’s also more affordable than the other proprietary stats packages and runs on all operating systems.

Example – print the first five records of the active dataset to the screen for specific variables:

list statefip age sex race cinethp in f/5
     +-------------------------------------------+
     | statefip   age      sex    race   cinethp |
     |-------------------------------------------|
  1. |  alabama    76   female   white       yes |
  2. |  alabama    68   female   black       yes |
  3. |  alabama    24     male   black        no |
  4. |  alabama    56   female   black        no |
  5. |  alabama    80   female   white       yes |
     +-------------------------------------------+

UN ICSC Retail Price Index Map

UN Retail Price Index Time Series

We recently launched our fledgling geodata portal on GitHub for the open datasets we’ll create in our new lab. In the spring we carved out a space on the 11th floor of the Sciences Library at Brown which we’ve christened GeoData@SciLi, a GIS and data consultation and work space. We’ll be doing renovations on both the webspace and workspace over the summer.

Our inaugural dataset was created by Ethan McIntosh, a senior (now graduate) who began working with me this spring. The dataset is the United Nations International Civil Service Commission’s (UN ICSC) Retail Price Indices with Details (RPID). The index measures the cost of living based on several categories of goods and services in duty stations around the world. It’s used to adjust the salaries of the UN’s international staff relative to UN headquarters in New York City (an index value of 100 = the cost of living in New York). The data is updated six times a year and published in an Excel spreadsheet that contains a macro that allows you to look up the value of each duty station via a dropdown menu. The UN ICSC makes the data public by request; you register and are granted access to download the data in PDF and Excel format, in files that are packaged one month / year at a time.

We were working with a PhD student in economics who wanted to construct a time series of this data. Ethan wrote a Python script to aggregate all of the files from 2004 to the present into a single CSV; the actual values for each country / duty station are stored in hidden cells that the macro pulls from, and he was able to extract the values directly from those cells. He parsed the data into logical divisions, and added the standard 3-letter ISO 3166 country code to each duty station so that each record now has a unique place identifier. His script generates three outputs: a basic CSV of the data in separate month / year files, a “long” (aka flat) time series file where each record represents a specific duty station and retail index category or weight for a given month and year, and a “wide” time series file where the category / weight has been pivoted to a column, so each record represents all values for a duty station for a given month / year. He’s written the program to process and incorporate additional files as they’re published.
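
To give a sense of what the long-to-wide reshaping involves (this is not Ethan’s actual code, and the file and column names are invented for the example), a pandas pivot handles it in a few lines:

import pandas as pd

# Hypothetical long file: one row per duty station, month / year, and index category
long_df = pd.read_csv('rpid_long.csv')

# Pivot the category column into separate columns, leaving one row per station and month / year
wide_df = long_df.pivot_table(index=['iso_code', 'duty_station', 'year', 'month'],
                              columns='category', values='value').reset_index()

wide_df.to_csv('rpid_wide.csv', index=False)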

While the primary intention was to study this data as a time series in a statistical analysis, it can also be used for geospatial analysis and mapping. Using the wide file, I created the map in the header of this post, which depicts the total retail index for February 2022 for each country, where the value represents the duty station within the country (usually the capital city). I grabbed some boundaries from Natural Earth and joined the data to it using the ISO code. I classified the data using natural breaks, but manually adjusted the top level category to include all countries with a value greater than or equal to the base value of 100.

There were only five duty stations that were more expensive than New York, with values between 102 and 124: Tokyo, Ashkhabad (Turkmenistan), Singapore, Beirut, and Hong Kong. Beijing and Geneva were equivalent in price at 100. The least expensive stations with values between 52 and 69 were: Caracas (Venezuela), Tripoli, Damascus, Ankara (Turkey), Bucharest (Romania), Mbabane (Eswatini – formerly Swaziland), and Sofia (Bulgaria). There appears to be regional clustering of like values, although I didn’t run any tests. The station in the US that’s measured relative to NYC is Washington DC (index value of 89).

The final datasets and code used to generate them are available on GitHub, and we’ll update it at least once, if not a couple times, a year. We are not providing the original month / year macro spreadsheets; if you want those you should register with the UN ICSC and access them there. If you’re using our data files, you should still register with them, as they would like to be aware of how their data is being used.

We will post additional projects, datasets, and code in individual repos as we create them, linked to from our main page. I’m working on creating a basic metadata profile for our lab, so we’ll provide structured metadata for each of our datasets in the near future.

STATA records

Creating STATA Variable Lists in Excel and Do Files With Python

In this post I demonstrate how to export a list of variables from a STATA dta file to an Excel spreadsheet, and how to create a STATA do file by using Python to read a list of variables in from a spreadsheet; the do file will generate an extract of attributes and observations from a larger dta file. Gallup Analytics microdata serves as the example.

Gallup Analytics Microdata

Many academic libraries subscribe to an online database called Gallup Analytics, which lets users explore and download summary statistics from a number of on-going polls and surveys conducted by the Gallup Organization, such as the US Daily Tracker poll, World Poll, and the Gallup Poll Social Series (GPSS). As part of the package, subscribing institutions also receive microdata files for some of the surveys, in STATA and SPSS formats. These files contain the anonymized, individual responses to the surveys. The microdata is valuable to social science researchers who use the responses to conduct statistical analyses.

STATA
Microdata in STATA

Naturally, the microdata is copyrighted and licensed for non-commercial research purposes to members of the university or institution who are covered by the license agreement, and cannot be shared outside the institution. Another stipulation is that the files cannot be shared in their entirety, even for members of the licensed institution; researchers must request individual extracts of variables and observations to answer a specific research question. This poses a challenge for the data librarian, who somehow has to communicate to the researcher what’s available in the files and mediate the request. Option 1 is to share the codebooks (which are also copyrighted and can’t be publicly distributed) with the researcher and haggle back and forth via email to iron out the details of the request. Option 2 is to have a stand-alone computer set up in the library, where a researcher can come and generate their own extract from files stored on a secure, internal network. In both cases, the manual creation of the extract and the researcher’s lack of familiarity with the contents of the data makes for a tedious process.

My solution was to create spreadsheets that list all of the variables in each dataset, and have the researcher check the ones they want. I created a resource guide that advertises and describes the datasets, and provides secure links to the Gallup codebooks and these spreadsheets, which are stored on a Google Drive and are protected via university authentication. The researcher can then fill out a Google form (also linked to from that page), where they describe the nature of the request, select the specific dataset of interest, specify filters on observations (rows), and upload the spreadsheet of requested variables (columns). Then, I can read the spreadsheet variables into Python and generate a STATA do file (STATA scripts stored in plain text format), to create the desired extract which I can share with the researcher.

Create List of STATA Variables in Excel Spreadsheet

First, I created a standard set of STATA do files to output lists of all variables to a spreadsheet for the different data files. An example for the US Daily Tracker poll from pre-2018 is below. I was completely unfamiliar with STATA, but the online docs and forums taught me what I needed to pull this together.

Some commands are the same across all the do files. I use describe and then translate to create a simple text file that saves the on-screen summary of the number of rows and columns. Running describe with the replace option then swaps out the data in memory for a new dataset that describes the original variables – essentially codebook information, with one row per variable. The generate command adds an empty column called select_vars to that dataset (select_vars is a variable name I created, where researchers will mark their selections), and I sort the rows using the name column. The export excel command is followed by the specific fields I wish to output: the position of the variable, the variable name itself, the data type, the variable label, and the empty select_vars column.

* Create variable list for Gallup US Tracker Survey 2008-2017

* Save the survey year from the first observation, to use in the output file names
local y = YEAR in 1

* Write the on-screen summary (observations and variables) to a text file
describe, short
summarize YEAR
translate @Results gallup_tracker_`y'_summary.txt, replace

* Replace the data in memory with a description of the variables,
* add an empty selection column, and sort by variable name
describe, replace
generate select_vars = ""
sort name

* Export the variable list to Excel for researchers to mark their selections
export excel position name type varlab select_vars using gallup_tracker_`y'_vars.xlsx, firstrow(variables) replace

The variation for this particular US Daily Tracker dataset is that the files are packaged as one file per year. I load the first file for 2008, and the do file saves the YEAR attribute as a local variable, which allows me to include the year in the summary and excel output file names. I had to run this do file for each subsequent year up to 2017. This is not a big deal as I’ll never have to repeat the process on the old files, as new data will be released in separate, new files. Other datasets imposed different requirements; the GPSS survey is packaged in eleven separate files for different surveys, and the updates are cumulative (each file contains older data plus any updates – Gallup sends us updated files a few times each year). For the GPSS, I prompt the user for input to specify the survey file name, and overwrite the previous Excel file.

With the do file in hand, you open STATA and the data file you want to process, change the working directory from the default user folder to a better location for storing the output, open the do file, and it runs and creates the variable list spreadsheet.

Excel spreadsheet of variables generated from STATA
List of variables in Excel generated from STATA file. Users check the variables they want in an extract in the select_vars column

Create a STATA Do File with Python and Excel

Once a researcher submits their Google form and their selected variable spreadsheet (placing an X in a dedicated column to indicate that they want to include a variable), I run the Python script below. I use the openpyxl module to read the Excel file. I have to modify the paths, spreadsheet file name, and an integer for the particular survey each time I run it. I use the os module to navigate up and down through folders to store outputs in specific places. If the researcher specifies in the Google form that they want to filter observations, for example records for specific states or age ranges, I have to add those manually, but I commented out a few examples that I can copy and modify. One caveat is that you must filter using the coded variable and not its label (i.e. if a month value is coded as 2 and its label is February, I must reference the code and not the label). Reading in the requested columns is straightforward; the script identifies cells in the selection column (E) that have an X, then grabs the variable name from the name column (B) in the same row.

# -*- coding: utf-8 -*-
"""
Pull selected gallup variables from spreadsheet to create STATA Do File
Frank Donnelly / GIS and Data Librarian / Brown University
"""

import openpyxl as xl, os
from datetime import date

thedate=date.today().strftime("%m%d%Y")
surveys={1:'gallup_covid',2:'gallup_gpss',3:'gallup_tracker',4:'gallup_world'}

rpath=os.path.join('requests','test') # MODIFY BASED ON INPUT
select_file=os.path.join(rpath,'gallup_tracker_2017_vars_TEST.xlsx') #MODIFY BASED ON INPUT
survey_file=surveys[3] #MODIFY BASED ON INPUT

dofile=os.path.join(rpath,'{}_vars_{}.do'.format(survey_file,thedate))
dtafile=os.path.join(os.path.abspath(os.getcwd()),rpath,'{}_extract_{}.dta'.format(survey_file,thedate))


#MODIFY to filter by observations - DO NOT ERASE EXAMPLES - copy, then modify
obsfilter=None
# obsfilter=None
# obsfilter='keep if inlist(STATE_NAME,"CT","MA","ME","NH","RI","VT")'
# obsfilter='keep if inrange(WP1220,18,64)'
# obsfilter='keep if SC7==2 & MONTH > 6'
# obsfilter='keep if FIPS_CODE=="44007" | FIPS_CODE=="25025"'

workbook = xl.load_workbook(select_file)
ws = workbook['Sheet1']

# May need to modify ws col and cell values based on user input
vlist=[]
for cell in ws['E']:
    if cell.value in ('x','X'): 
        vlist.append((ws.cell(row=cell.row, column=2).value))
outfile = open(dofile, "w")
outfile.writelines('keep ')
outfile.writelines(" ".join(vlist)+"\n")
if obsfilter==None:
    pass
else:
    outfile.writelines(obsfilter+"\n")
outfile.writelines('save '+dtafile+"\n")
outfile.close()
print('Created',dofile) 

The plain text do file begins with the command keep followed by the columns, and if requested, an additional keep statement to filter by records. The final save command will direct the output to a specific location.

keep CENREG D17A D23 D24 D5 FIPS_CODE HISPANIC INT_DATE MONTH MOTHERLODE_ID PE_WEIGHT RACE SC7 STATE_NAME WP10202 WP10208 WP10209 WP10215 WP10216 WP10229 WP10230 WP1220 WP1223 YEAR ZIPGALLUPREGION ZIPSTATE
save S:\gallup\processing\scripts\requests\test\gallup_tracker_extract_02202022.dta

All that remains is to open the requested data file in STATA, open the do file, and an extract is created. Visit my GitHub for the do files, Python script, and sample output. The original source data and the variable spreadsheets are NOT included due to licensing issues; if you have the original data files you can generate what I’ve described here. Sorry, I can’t share the Gallup data with you (so please don’t ask). You’ll need to contact your own university or institution to determine if you have access.

Python API Code

Geocoding with the NYC Geoclient API and Python

Even though I’ve left New York, there are still occasions where I refer back to NYC resources in order to help students and faculty here with NYC-based research. Most recently I’ve revisited NYC DOITT’s Geoclient API for geocoding addresses, and I discovered that a number of things have changed since I last used it a few years ago. I’ll walk through my latest geocoding script in this post.

First and foremost: if you landed on this page because you’re trying to figure out how to get your Geoclient API key to work, the answer is:

&subscription-key=YOURKEYHERE

This replaces the old format that required you to pass an app id and key. I searched through two websites and scanned through hundreds of pages of documentation, only to find this solution in a cached Google search result, as the new docs don’t mention this change and the old docs still have the previous information and examples of the application ID and key. So – hopefully this should save you some hours of frustration.

I was working with someone who needed to geocode a subset of the city’s traffic violation data from the open data portal, as the data lacks coordinates. It’s also missing postal city names and ZIP Codes, which precludes using most geocoders that rely on this information. Even if we had these fields, I’ve found that many geocoders struggle with the hyphenated addresses used throughout Queens, and some work-around is needed to get matches. NYC’s Geoclient is naturally able to handle those Queens addresses, and can use the borough name or code for locating addresses in lieu of ZIP Codes. The traffic data uses pseudo-county codes, but it’s easy to replace those with the corresponding borough codes.
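
As an aside, that recoding is a simple lookup once you know which codes appear in the data. The mapping below is illustrative rather than definitive – check the codes in your copy of the violations data against the borough values the Geoclient accepts:

# Illustrative lookup from county-style codes to borough names the Geoclient accepts;
# verify these keys against the values that actually appear in your download
county_to_borough = {'NY': 'MANHATTAN', 'BX': 'BRONX', 'K': 'BROOKLYN',
                     'Q': 'QUEENS', 'R': 'STATEN ISLAND'}

borough = county_to_borough.get('K', '') # a record coded K becomes BROOKLYN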

The older documentation is still solid for illustrating the different APIs and the variables that are returned; you can search for a parsed or non-parsed street address, street intersections, places of interest or landmarks, parcel blocks and lots, and a few others.

I wrote some Python code that I’ve pasted below for geocoding addresses that have house numbers, street, and borough stored in separate fields using the address API, and if the house number is missing we try again by doing an intersection search, as an intersecting street is stored in a separate field in the traffic data. In the past I used a thin client for accessing the API, but I’m skipping that as it’s simpler to just build the URLs directly with the requests module.

The top of the script has the standard stuff: the name of the input file, the column locations (counting from zero) in the input file that contain each of the four address components, the base URL for the API, a time function for progress updates, reading the API key in from a file, and looping through the input CSV with the addresses to save the header row in one list and the records in a nested list. I created a list of fields that are returned from the API that I want to hold on to, and added them to the header row, along with a final variable that records the result of the match. In addition to longitude and latitude you can also get xCoordinate and yCoordinate, which are in the NY State Plane Long Island (ft-US) map projection. I added a counts dictionary to keep track of the result of each match attempt.

Then we begin a long loop – this is a bit messy and if I had more time I’d collapse much of this into a series of functions, as there is repetitive code. I loop through the index and value of each record beginning with the first one. The loop is in a try / except block, so in the event that something goes awry it should exit cleanly and write out the data that was captured. We take the base url and append the address request, slicing the record to get the values for house, street, and borough into the URL. An example of a URL after passing address components in:

https://api.nyc.gov/geo/geoclient/v1/address.json?houseNumber=12345&street=CONEY ISLAND AVE&borough=BROOKLYN&subscription-key=KEYGOESHERE

Pass that URL to the requests module and get a response back. If an address is returned, the JSON resembles a Python dictionary, with ‘address’ as the key and the value as another dictionary with key value pairs of several variables. Otherwise, we get an error message that something was wrong with the request.

An address dictionary with sub-dictionaries returned by the NYC Geoclient
A successful address match returns an address dictionary, with a sub-dictionary of keys and values

The loop logic:

  • If the package contains an ‘address’ key, flatten to get the sub-dictionary
    • If ‘longitude’ is present as a key, a match is returned, get the relevant fields and append to the record
    • Else if the dictionary contains a ‘message’ key with a value that the house number was missing, do an intersection match
      • If the package contains an ‘intersection’ key, flatten to get the sub-dictionary
        • If ‘longitude’ is present as a key, a match is returned, get the relevant fields and append to the record
        • If not, there was no intersection match, just get the messages and append blanks for each value to the record
      • If not, an error was returned, capture the error and append blanks for each value to the record, and continue
    • If not, there was no address match, just get the messages and append blanks for each value to the record
  • If not, an error was returned, capture the error and append blanks for each value to the record, and continue

The API has limits of 2500 matches per minute and 500k per day, so after 2000 records I built in a pause of 15 seconds. Once the process finishes, successfully or not, the records are written out to a CSV file, header row first followed by the records. If the process bailed prematurely, the last record and its index are printed to the screen. This allows you to rerun the script where you left off, by changing the start index in the variables list at the top of the script from 0 to the last record that was read. When it comes time to write output, the previous file is appended rather than overwritten and the header row isn’t written again.

It took about 90 minutes to match a file of 25,000 records. I’d occasionally get an error message that the API key was bad for a given record; the error would be recorded and the script continued. It’s likely that there are illegal characters in the input fields for the address that end up creating a URL where the key parameter can’t be properly interpreted. I thought the results were pretty good; beyond streets it was able to recognize landmarks like large parks and return matched coordinates with relevant error messages (example below). Most of the flops were, not surprisingly, due to missing borough codes or house numbers.

Output from the NYC Geoclient
Output fields from the NYC Geoclient written to CSV
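
On those occasional bad-key errors: one way to guard against stray characters in the address fields is to let the requests module build and encode the query string itself, by passing the components as a params dictionary instead of splicing them into the URL. A minimal sketch using the same fields as the example URL earlier:

import requests

# Same request as the example URL, with requests handling the URL-encoding
params = {'houseNumber': '12345', 'street': 'CONEY ISLAND AVE',
          'borough': 'BROOKLYN', 'subscription-key': 'YOURKEYHERE'}
response = requests.get('https://api.nyc.gov/geo/geoclient/v1/address.json', params=params)
package = response.json()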

To use this code you’ll need to sign up for an NYC Developer API account, and then you can request a key for the NYC Geoclient service. Store the key in a text file in the same folder as the script. I’m also storing inputs and outputs in the same folder, but with a few functions from the os module you can manipulate paths and change directories. If I get time over the winter break I may try rewriting to incorporate this, plus functions to simplify the loops. An alternative to the API would be to download the LION street network geodatabase and set up a local address locator in ArcGIS Pro. That might be worth doing if you had tons of matches to do. I quickly got frustrated with the ArcGIS documentation, and after a number of failed attempts I opted to use the Geoclient instead.

"""
Match addresses to NYC Geoclient using house number, street name, and borough
Frank Donnelly / GIS and Data Librarian / Brown University
11/22/2021 - Python 3.7
"""

import requests, csv, time

#Variables
addfile='parking_nov2021_nyc.csv' #Input file with addresses
matchedfile=addfile[:-4]+'_output.csv' #Output file with matched data
keyfile='nycgeo_key.txt' #File with API key
start_idx=0 #If program breaks, change this to pick up with record where you left off
#Counting from 0, positions in the CSV that contain the address info 
hous_idx=23
st_idx=24
boro_idx=21
inter_idx=25
base_url='https://api.nyc.gov/geo/geoclient/v1/'

def get_time():
    time_now = time.localtime() # get struct_time
    pretty_time = time.strftime("%m/%d/%Y, %H:%M:%S", time_now)
    return pretty_time

print('*** Process launched at', get_time())

#Read api key in from file
with open(keyfile) as key:
    api_key=key.read().strip()

records=[]

with open(addfile,'r') as infile:
    reader = csv.reader(infile)
    header = next(reader) # Capture column names as separate list
    for row in reader:
        records.append(row)

# Fields returned by the API to capture
# https://maps.nyc.gov/geoclient/v1/doc
fields=['message','message2','houseNumber','firstStreetNameNormalized',
        'uspsPreferredCityName','zipCode','longitude','latitude','xCoordinate',
        'yCoordinate']
header.extend(fields)
header.append('match_result')
datavals=len(fields)-2 # Number of fields that are not messages
counts={'address match':0, 'intersection match':0,
        'failed address':0, 'failed intersection':0,
        'error':0}

print('Finished reading data from', addfile)
print('*** Geocoding process launched at',get_time())

for i,v in enumerate(records[start_idx:]):
    try:
        data_url = f'{base_url}address.json?houseNumber={v[hous_idx]}&street={v[st_idx]}&borough={v[boro_idx]}&subscription-key={api_key}'
        response=requests.get(data_url)
        package=response.json()
        # If an address is returned, continue
        if 'address' in package:
            result=package['address']     
            # If longitude is returned, grab data
            if 'longitude' in result:
                for f in fields:
                    item=result.get(f,'')
                    v.append(item)
                v.append('address match')
                counts['address match']=counts['address match']+1
            # If there was no house number, try street intersection match instead
            elif 'message' in result and result['message']=='INPUT CONTAINS NO ADDRESS NUMBER' and v[inter_idx] not in ('',None):
                try:
                    data_url = f'{base_url}intersection.json?crossStreetOne={v[st_idx]}&crossStreetTwo={v[inter_idx]}&borough={v[boro_idx]}&subscription-key={api_key}'
                    response=requests.get(data_url)
                    package=response.json()
                    # If an intersection is returned, continue
                    if 'intersection' in package:
                        result=package['intersection']
                        # If longitude is returned, grab data
                        if 'longitude' in result:
                            for f in fields:
                                item=result.get(f,'')
                                v.append(item)
                            v.append('intersection match')
                            counts['intersection match']=counts['intersection match']+1
                        # Intersection match fails, append messages and blank values
                        else:
                            v.append(result.get('message',''))
                            v.append(result.get('message2',''))
                            v.extend(['']*datavals)
                            v.append('failed intersection')
                            counts['failed intersection']=counts['failed intersection']+1
                    # Error returned instead of intersection
                    else:
                        v.append(package.get('message',''))
                        v.append(package.get('message2',''))
                        v.extend(['']*datavals)
                        v.append('error')
                        counts['error']=counts['error']+1
                        print(package.get('message',''))
                        print('Geocoder error at record',i,'continuing the matching process...')
                except Exception as e:
                     print(str(e))
            # Address match fails, append messages and blank values
            else:
                v.append(result.get('message',''))
                v.append(result.get('message2',''))
                v.extend(['']*datavals)
                v.append('failed address')
                counts['failed address']=counts['failed address']+1
        # Error is returned instead of address
        else:
            v.append(package.get('message',''))
            v.append(package.get('message2',''))
            v.extend(['']*datavals)
            v.append('error')
            counts['error']=counts['error']+1
            print(package.get('message',''))
            print('Geocoder error at record',i,'continuing the matching process...')
        if i>0 and i%2000==0: # pause every 2,000 records to stay under the API rate limit
            print('Processed',i,'records so far...')
            time.sleep(15)         
    except Exception as e:
        print(str(e))

# First attempt, write to new file, but if break happened, append to existing file
if start_idx==0:
    wtype='w' 
else:
    wtype='a'

end_idx=start_idx+i+1 # +1 so the slice includes the final record processed

with open(matchedfile,wtype,newline='') as outfile:
    writer = csv.writer(outfile, delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    if wtype=='w':
        writer.writerow(header)
        writer.writerows(records[start_idx:end_idx])
    else:
        writer.writerows(records[start_idx:end_idx])
print('Wrote',i+1,'records to file',matchedfile)
print('Final record written was number',i,':\n',v)
for k,val in counts.items():
    print(k,val)
print('*** Process finished at',get_time())