api

Geocoding with the NYC Geoclient API and Python

Even though I’ve left New York, there are still occasions where I refer back to NYC resources in order to help students and faculty here with NYC-based research. Most recently I’ve revisited NYC DOITT’s Geoclient API for geocoding addresses, and I discovered a number of things have changed since I’ve last used it a few years ago. I’ll walk through my latest geocoding script in this post.

First and foremost: if you landed on this page because you’re trying to figure out how to get your Geoclient API key to work, the answer is:

&subscription-key=YOURKEYHERE

This replaces the old format that required you to pass an app id and key. I searched through two websites and scanned through hundreds of pages of documentation, only to find this solution in a cached Google search result, as the new docs don’t mention this change and the old docs still have the previous information and examples of the application ID and key. So – hopefully this should save you some hours of frustration.

I was working with someone who needed to geocode a subset of the city’s traffic violation data from the open data portal, as the data lacks coordinates. It’s also missing postal city names and ZIP Codes, which precludes using most geocoders that rely on this information. Even if we had these fields, I’ve found that many geocoders struggle with the hyphenated addresses used throughout Queens, and some work-around is needed to get matches. NYC’s geoclient is naturally able to handle those Queens addresses, and can use the borough name or code for locating addresses in lieu of ZIP Codes. The traffic data uses pseudo-county codes, but it’s easy to replace those with the corresponding borough codes.

The older documentation is still solid for illustrating the different APIs and the variables that are returned; you can search for a parsed or non-parsed street address, street intersections, places of interest or landmarks, parcel blocks and lots, and a few others.

I wrote some Python code that I’ve pasted below for geocoding addresses that have house numbers, street, and borough stored in separate fields using the address API, and if the house number is missing we try again by doing an intersection search, as an intersecting street is stored in a separate field in the traffic data. In the past I used a thin client for accessing the API, but I’m skipping that as it’s simpler to just build the URLs directly with the requests module.

The top of the script has the standard stuff: the name of the input file, the column locations (counting from zero) in the input file that contain each of the four address components, the base URL for the API, a time function for progress updates, reading the API key in from a file, and looping through the input CSV with the addressees to save the header row in one list and the records in a nested list. I created a list of fields that are returned from the API that I want to hold on to and add them to the header row, along with a final variable that records the results of the match. In addition to longitude and latitude you can also get xCoordinate and yCoordinate, which are in the NY State Plane Long Island (ft-US) map projection. I added a counts dictionary to keep track of the result of each match attempt.

Then we begin a long loop – this is a bit messy and if I had more time I’d collapse much of this into a series of functions, as there is repetitive code. I loop through the index and value of each record beginning with the first one. The loop is in a try / except block, so in the event that something goes awry it should exit cleanly and write out the data that was captured. We take the base url and append the address request, slicing the record to get the values for house, street, and borough into the URL. An example of a URL after passing address components in:

https://api.nyc.gov/geo/geoclient/v1/address.json?houseNumber=12345&street=CONEY ISLAND AVE&borough=BROOKLYN&subscription-key=KEYGOESHERE

Pass that URL to the requests module and get a response back. If an address is returned, the JSON resembles a Python dictionary, with ‘address’ as the key and the value as another dictionary with key value pairs of several variables. Otherwise, we get an error message that something was wrong with the request.

An address dictionary with sub-dictionaries returned by the NYC Geoclient — A successful address match returns an address dictionary, with a sub-dictionary of keys and values

The loop logic:

If the package contains an ‘address’ key, flatten to get the sub-dictionary
- If ‘longitude’ is present as a key, a match is returned, get the relevant fields and append to the record
- Else if the dictionary contains a ‘message’ key with a value that the house number was missing, do an intersection match
  - If the package contains an ‘intersection’ key, flatten to get the sub-dictionary
    - If ‘longitude’ is present as a key, a match is returned, get the relevant fields and append to the record
    - If not, there was no intersection match, just get the messages and append blanks for each value to the record
  - If not, an error was returned, capture the error and append blanks for each value to the record, and continue
- If not, there was no address match, just get the messages and append blanks for each value to the record
If not, an error was returned, capture the error and append blanks for each value to the record, and continue

The API has limits of 2500 matches per minute and 500k per day, so after 2000 records I built in a pause of 15 seconds. Once the process finishes, successfully or not, the records are written out to a CSV file, header row first followed by the records. If the process bailed prematurely, the last record and its index are printed to the screen. This allows you to rerun the script where you left off, by changing the start index in the variables list at the top of the script from 0 to the last record that was read. When it comes time to write output, the previous file is appended rather than overwritten and the header row isn’t written again.

It took about 90 minutes to match a file of 25,000 records. I’d occasionally get an error message that the API key was bad for a given record; the error would be recorded and the script continued. It’s likely that there are illegal characters in the input fields for the address that end up creating a URL where the key parameter can’t be properly interpreted. I thought the results were pretty good; beyond streets it was able to recognize landmarks like large parks and return matched coordinates with relevant error messages (example below). Most of the flops were, not surprisingly, due to missing borough codes or house numbers.

Output from the NYC Geoclient — Output fields from the NYC Geoclient written to CSV

To use this code you’ll need to sign up for an NYC Developer API account, and then you can request a key for the NYC Geoclient service. Store the key in a text file in the same folder as the script. I’m also storing inputs and outputs in the same folder, but with a few functions from the os module you can manipulate paths and change directories. If I get time over the winter break I may try rewriting to incorporate this, plus functions to simplify the loops. An alternative to the API would be to download the LION street network geodatabase, and you could set up a local address locator in ArcGIS Pro. Might be worth doing if you had tons of matches to do. I quickly got frustrated with with the ArcGIS documentation and after a number of failed attempts I opted to use the Geoclient instead.

"""
Match addresses to NYC Geoclient using house number, street name, and borough
Frank Donnelly / GIS and Data Librarian / Brown University
11/22/2021 - Python 3.7
"""

import requests, csv, time

#Variables
addfile='parking_nov2021_nyc.csv' #Input file with addresses
matchedfile=addfile[:-4]+'_output.csv' #Output file with matched data
keyfile='nycgeo_key.txt' #File with API key
start_idx=0 #If program breaks, change this to pick up with record where you left off
#Counting from 0, positions in the CSV that contain the address info 
hous_idx=23
st_idx=24
boro_idx=21
inter_idx=25
base_url='https://api.nyc.gov/geo/geoclient/v1/'

def get_time():
    time_now = time.localtime() # get struct_time
    pretty_time = time.strftime("%m/%d/%Y, %H:%M:%S", time_now)
    return pretty_time

print('*** Process launched at', get_time())

#Read api key in from file
with open(keyfile) as key:
    api_key=key.read().strip()

records=[]

with open(addfile,'r') as infile:
    reader = csv.reader(infile)
    header = next(reader) # Capture column names as separate list
    for row in reader:
        records.append(row)

# Fields returned by the API to capture
# https://maps.nyc.gov/geoclient/v1/doc
fields=['message','message2','houseNumber','firstStreetNameNormalized',
        'uspsPreferredCityName','zipCode','longitude','latitude','xCoordinate',
        'yCoordinate']
header.extend(fields)
header.append('match_result')
datavals=len(fields)-2 # Number of fields that are not messages
counts={'address match':0, 'intersection match':0,
        'failed address':0, 'failed intersection':0,
        'error':0}

print('Finished reading data from', addfile)
print('*** Geocoding process launched at',get_time())

for i,v in enumerate(records[start_idx:]):
    try:
        data_url = f'{base_url}address.json?houseNumber={v[hous_idx]}&street={v[st_idx]}&borough={v[boro_idx]}&subscription-key={api_key}'
        response=requests.get(data_url)
        package=response.json()
        # If an address is returned, continue
        if 'address' in package:
            result=package['address']     
            # If longitude is returned, grab data
            if 'longitude' in result:
                for f in fields:
                    item=result.get(f,'')
                    v.append(item)
                v.append('address match')
                counts['address match']=counts['address match']+1
            # If there was no house number, try street intersection match instead
            elif 'message' in result and result['message']=='INPUT CONTAINS NO ADDRESS NUMBER' and v[inter_idx] not in ('',None):
                try:
                    data_url = f'{base_url}intersection.json?crossStreetOne={v[st_idx]}&crossStreetTwo={v[inter_idx]}&borough={v[boro_idx]}&subscription-key={api_key}'
                    response=requests.get(data_url)
                    package=response.json()
                    # If an intersection is returned, continue
                    if 'intersection' in package:
                        result=package['intersection']
                        # If longitude is returned, grab data
                        if 'longitude' in result:
                            for f in fields:
                                item=result.get(f,'')
                                v.append(item)
                            v.append('intersection match')
                            counts['intersection match']=counts['intersection match']+1
                        # Intersection match fails, append messages and blank values
                        else:
                            v.append(result.get('message',''))
                            v.append(result.get('message2',''))
                            v.extend(['']*datavals)
                            v.append('failed intersection')
                            counts['failed intersection']=counts['failed intersection']+1
                    # Error returned instead of intersection
                    else:
                        v.append(package.get('message',''))
                        v.append(package.get('message2',''))
                        v.extend(['']*datavals)
                        v.append('error')
                        counts['error']=counts['error']+1
                        print(package.get('message',''))
                        print('Geocoder error at record',i,'continuing the matching process...')
                except Exception as e:
                     print(str(e))
            # Address match fails, append messages and blank values
            else:
                v.append(result.get('message',''))
                v.append(result.get('message2',''))
                v.extend(['']*datavals)
                v.append('failed address')
                counts['failed address']=counts['failed address']+1
        # Error is returned instead of address
        else:
            v.append(package.get('message',''))
            v.append(package.get('message2',''))
            v.extend(['']*datavals)
            v.append('error')
            counts['error']=counts['error']+1
            print(package.get('message',''))
            print('Geocoder error at record',i,'continuing the matching process...')
        if i%2000==0:
            print('Processed',i,'records so far...')
            time.sleep(15)         
    except Exception as e:
        print(str(e))

# First attempt, write to new file, but if break happened, append to existing file
if start_idx==0:
    wtype='w' 
else:
    wtype='a'

end_idx=start_idx+i

with open(matchedfile,wtype,newline='') as outfile:
    writer = csv.writer(outfile, delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    if wtype=='w':
        writer.writerow(header)
        writer.writerows(records[start_idx:end_idx])
    else:
        writer.writerows(records[start_idx:end_idx])
print('Wrote',i+1,'records to file',matchedfile)
print('Final record written was number',i,':\n',v)
for k,val in counts.items():
    print(k,val)
print('*** Process finished at',get_time())

Creating Lists of Country Admin Divisions with Geonames and Python

I’m working on a project where I needed to generate a list of all the administrative subdivisions (i.e. states / provinces, counties / departments, etc) and their ID codes for several different countries. In this post I’ll demonstrate how I accomplished this using the Geonames API and Python. Geonames (https://www.geonames.org/) is a gazetteer, which is a directory of geographic features that includes coordinates, variant place names, and identifiers that classify features by type and location. Geonames includes many different types of administrative, populated, physical, and built-environment features. Last year I wrote a post about gazetteers where I compared Geonames with the NGA Geonet Names Server, and illustrated how to download CSV files for all places within a given country and load them into a database.

In this post I’ll focus on using an API to retrieve gazetteer data. Geonames provides over 40 different REST API services for accessing their data, all of them well documented with examples. You can search for places by name, return all the places that are within a distance of a set of coordinates, retrieve all places that are subdivisions of another place, geocode addresses, obtain lists of centroids and bounding boxes, and much more. Their data is crowd sourced, but is largely drawn from a body of official gazetteers and directories published by various countries.

This makes it an ideal source for generating lists of administrative divisions and subdivisions with codes for multiple countries. This information is difficult to find, because there isn’t an international body that collates and freely provides it. ISO 3166-1 includes the standard country codes that most of the world uses. ISO 3166-2 includes codes for 1st-level administrative divisions, but ISO doesn’t freely publish them. You can find them from various sources; Wikipedia lists them and there are several gist and github repos with screen scraped copies. The US GNA is a more official source that includes both ISO 3166 1 and 2. But as far as I know there isn’t a solid source for codes below the 1st level divisions. Many countries create their own systems and freely publish their codes (i.e. ANSI FIPS codes in the US, INSEE COG codes in France), but that would require you to tie them altogether. GADM is the go to source for vector-based GIS files of country subdivisions (map at the top of this post for example). For some countries they include ISO division codes, but for others they don’t (they do employ the HASC codes from Statoids, but it’s not clear if these codes are still being actively maintained) .

Geonames to the rescue – you can browse the countries on the website to see the country and 1st level admin codes (see image below), but the API will give us a quick way to obtain all division levels. First, you have to register to get an API username – it’s free – and you’ll tack that username on to the end of your requests. That gives you 20k credits per day, which in most instances equates with 1 request per credit. I recommend downloading one of their prepackaged country files first, to give you a sense for how the records are structured and what attributes are available. A readme file that describes all of the available fields accompanies each download.

1st Level Admin Divisions from Geonames — 1st Level Admin Divisions for Dominica from the Geonames website

My goal was to get all administrative divisions – names and codes and how the divisions nest within each other – for all of the countries in the French-speaking Caribbean (countries that are currently, or formerly, overseas territories of France). I also needed to get place names as they’re written in French. I’ll walk through my Python script that I’ve pasted below.

import requests,csv
from time import strftime

ccodes=['BL','DM','GD','GF','GP','HT','KN','LC','MF','MQ','VC']
fclass='A'
lang='fr' 
uname='REQUEST FROM GEONAMES'

#Columns to keep
fields=['countryId','countryName','countryCode','geonameId','name','asciiName',
        'alternateNames','fcode','fcodeName','adminName1','adminCode1',
        'adminName2','adminCode2','adminName3','adminCode3','adminName4','adminCode4',
        'adminName5','adminCode5','lng','lat']
fcode=fields.index('fcode')

#Divisions to keep
divisions=['ADM1','ADM2','ADM3','ADM4','ADM5','PCLD','PCLF','PCLI','PCLIX','PCLS']

base_url='http://api.geonames.org/searchJSON?'

def altnames(names,lang):
    "Given a dict of names, extract preferred names for a given language"
    aname=''
    for entry in names:
        if 'isPreferredName' in entry.keys() and entry['lang']==lang:
            aname=entry.get('name')
        else:
            pass
    return aname

places=[]
tossed=[]

for country in ccodes:
    data_url = f'{base_url}?name=*&country={country}&featureClass={fclass}&lang={lang}&style=full&username={uname}'
    response=requests.get(data_url)
    data=response.json() #total retrieved and results in list of dicts
    gnames=response.json()['geonames'] #create list of dicts only
    gnames.sort(key=lambda e: (e.get('countryCode',''),e.get('fcode',''),
                               e.get('adminCode1',''),e.get('adminCode2',''),
                               e.get('adminCode3',''),e.get('adminCode4',''),
                               e.get('adminCode5','')))
    for record in gnames:
        r=[]
        for f in fields:
            item=record.get(f,'')
            if f=='alternateNames' and f !='': 
                aname=altnames(item,'en')
                r.append(aname)             
            else:
                r.append(item)             
        if r[fcode] in divisions: #keep certain admin divs, toss others
            places.append(r)
        else:
            tossed.append(r)
        
filetoday=strftime('%Y_%m_%d')
outfile='geonames_fwi_adm_'+filetoday+'.csv'
    
writefile=open(outfile,'w', newline='', encoding='utf8')
writer=csv.writer(writefile, delimiter=",", quotechar='"',quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(fields) #header row
writer.writerows(places)
writefile.close()
        
print(len(places),'records written to file',outfile)

First, I identify all of the variables I need: the two-letter ISO codes of the countries, a list of the Geonames attributes that I want to keep, the two-letter language code, and the specific feature type I’m interested in. There are different features codes classified with a single letter, and a number of subtypes below that. Feature class A is for records that represent administrative divisions, and within that class I needed records that represented the country as a whole (PCL codes) and its subdivisions (ADM codes). There are several different place name variables that include official names, short forms, and an ASCII form that only includes characters found in the Latin alphabet used in English. The language code that you pass into the url will alter these results, but you still have the option to obtain preferred place names from an alternate languages field. The admin codes I’m retrieving are the actual admin codes; you can also opt to retrieve the unique Geonames integer IDs for each admin level, if you wanted to use these for bridging places together (not necessary in my case).

There are a few different approaches for achieving this goal. I decided to use the Geonames full text search, where you search for features by name (separate APIs for working with hierarchies for parent and child entities are another option). I used an asterisk as a wildcard to retrieve all names, and the other parameters to filter for specific countries and feature classes. At the end of the base url I added JSON for the search; if you leave this off the records are returned as XML.

base_url='http://api.geonames.org/searchJSON?'

My primary for loop loops through each country, and passes the parameters into the data url to retrieve the data for that country: I pass in country code, feature class A, and French as the language for the place names. It took me a while to figure out that I also needed to add style=full to retrieve all of the possible info that’s available for a given record; the default is to capture a subset of basic information, which lacked the admin codes I needed.

data_url = f'{base_url}?name=*&country={country}&featureClass={fclass}&lang={lang}&style=full&username={uname}'

I use the Python Requests module to interact with the API. Geonames returns two objects in JSON: an integer count of the total records retrieved, and another JSON object that essentially represents a list of python dictionaries, where each dictionary contains all the attributes for a record as a series of key and value pairs where the key is the attribute name (see examples below). I create a new gnames variable to isolate just this list, and then I sort the list based on how I want the final output to appear; by country and by admin codes, so that like-levels of admin codes are grouped together. The trick of using lamba to sort nested lists or dictionaries is well documented, but one variation I needed was to use the dictionary get method. Some features may not have five levels of admin codes; if they don’t then there is no key for that attribute, and using the simple dict[key] approach returns an error in those cases. Using dict.get(key,”) allows you to pass in a default value if no key is present. I provide a blank string as a placeholder, as ultimately I want each record to have the same number of columns in the output and need the attributes to line up correctly.

Gnames List — Records returned from Geonames as a list, where each list item is a dictionary of key / value pairs for a given place

Example of an individual list item, a dictionary of key / value pairs for the Parish of Charlotte, a 1st order admin division of Saint-Vincent-et-les-Grenadines. Variable names are keys.

Once I have records for the first country, I loop through them and choose just the attributes that I want from my field list. The attribute name is the key, I get the associated value, but if that key isn’t present I insert an empty string. In most cases the value associated with a key is a string or integer, but in a few instance it’s another container, as in the case of alternate names which is another list of dictionaries. If there are alternate names I want to pull out a preferred name in English if one exists. I handle this with a function so the loop looks less cluttered. Lastly, if this record represents an admin division or is a country-level record then I want to keep it, otherwise I append it to a throw-away list that I’ll inspect later.

Once the records returned for that country have been processed, we move on to the next country and keep appending records to the main list of places (image below). When we’re done, we write the results out to a CSV file. I write the list of fields out first as my header row, and then the records follow.

Final list called places that contains records for all admin divisions for specific countries and feature classes, where items are sublists that represent each place

Overall I think this approach worked well, but there are some small caveats. A number of the countries I’m studying are not independent, but are dependencies of France. For dependent countries, their 1st and sometimes even 2nd level subdivision codes appear identical to their top-level country code, as they represent a subdivision of an independent country (many overseas territories are departments of France). If I need to harmonize these codes between countries I may have to adjust the dependencies. The alternate English places names always appear for the country-level record, but usually not below that. I think I’d need to do some additional tweaking or even run a second set of requests in English if I wanted all the English spellings; for example in French many compound place names like Saint-Paul are separated by a hyphen, but in English they’re separated by a space. Not a big deal in my case as I was primarily interested in the alternate spellings for countries, i.e. Guyane versus French Guiana. See the final output below for Guyane; these subdivision codes are from INSEE COG, which are the official codes used by the French government for identifying all geographic areas for both metropolitan France and overseas departments and collectivities.

1st half of CSV file imported into spreadsheet, records showing admin divisions of Guyane / French Guiana

2nd half of CSV file imported into spreadsheet, records showing admin codes and hierarchy of divisions for Guyane / French Guiana

Two final things to point out. First, my script lacks any exception handling, since my request is rather small and the API is reliable. If I was pulling down a lot of data I would replace my main for loop with a try and except block to handle errors, and would capture retrieved data as the process unfolds in case some problem forces a bail out. Second, when importing the CSV into a spreadsheet or database, it’s important to import the admin codes as text, as many of them have leading zeros that need to be preserved.

This example represents just the tip of the iceberg in terms of things you can do with Geonames APIs. Check it out!

An Overview of Census Datasets and Census API Examples

This month’s post is a bit shorter, as I have just two announcements I wanted to share about some resources I’ve created.

First, I’ve written a short technical paper that’s just been published as part of the Weissman Center of International Business’ Occasional Papers Series. Exploring US Census Datsets: A Summary of Surveys and Sources provides an overview of several different datasets (decennial census, American Community Survey, Population Estimates Program, and County Business Patterns) and sources for accessing data. The paper illustrates basic themes that are part of all my census-related talks: the census isn’t just the thing that happens every ten years but is an ecosystem of datasets updated on an on-going basis, and there are many sources for accessing data which are suitable for different purposes and designed for users with varying levels of technical skill. In some respects this paper is a super-abridged version of my book, designed to serve as an introduction and brief reference.

Second, I’ve created a series of introductory notebooks on GitHub that illustrate how to use the Census Bureau’s API with Python and Jupyter Notebooks. I designed these for a demonstration I gave at NYU’s Love Data Week back on Feb 10 (the slides for the talk are also available in the repo). I structured the talk around three examples. Example A demonstrates the basics of how the API works along with some best practices, such as defining your variables at the top and progressively building links to retrieve data. It also illustrates the utility of using these technologies in concert, as you can pull data into your script and process and visualize it in one go. I also demonstrate how to retrieve lists of census variables and their corresponding metadata, which isn’t something that’s widely documented. Example B is a variation of A, extended by adding an API key and storing data in a file immediately after retrieval. Example C introduces more complexity, reading variables in from files and looping through lists of geographies to make multiple API calls.

Since I’ve written a few posts on the census API recently, I went back and added an api tag to group them together, so you can access them via a single link.

Define census API variables, build links, and retrieve data

County and ZIP Code Business Patterns 2017 and the Census API

The U.S. Census Bureau’s County and ZIP Code Business Patterns (CBP and ZBP) datasets are generated annually from the Business Register, a large administrative database updated by several federal agencies which contains every business establishment in the U.S. with paid employees. Business establishments are defined as single physical locations where business is conducted or where services or industrial operations are performed. Establishments are assigned to industries, which are groups of businesses that produce similar products or provide similar services, using the North American Industrial Classification System (NAICS). The ZBP contains tables with total establishments, employment, and wages by ZIP and counts of business establishments by NAICS and ZIP. The CBP has these tables plus a few others for counties.

The 2017 Business Patterns was recently released, and there are a few important changes to the dataset over previous iterations. I’ll summarize what they are and how they impact data retrieval using the Census Bureau’s ZBP API. I unwittingly discovered these issues this week as I was trying to use a Python / Pandas notebook I’d written for extracting ZBP data and aggregating the USPS ZIP codes to Zip Code Tabulation Areas (ZCTAs), which are used for publishing decennial and ACS census data. Everything went smoothly when I tested the scripts against the 2016 ZBP, but a few things went awry with 2017 and I was forced to make some revisions.

If you’re not familiar with the API, take a look at this earlier post for a basic introduction. The notebooks I’ll refer to are available on my github; zbp_to_zcta.ipynb works for the 2017 ZBP release, and I kept the earlier version that worked for 2016.

2017 NAICS Codes

NAICS codes are revised every five years in tandem with the Economic Census (conducted in years ending in 2 and 7), to effectively capture the changing nature of the economy. The CBP and ZBP employ the latest NAICS series in the year that it’s released, so beginning with 2012 the 2012 NAICS were used for categorizing establishments into industries. The 2012 definitions were used up through 2016, but now that we’re in 2017 we have a new NAICS 2017 series, and this was employed for the 2017 CBP and ZBP and will be used through 2021.

How different are the categories? If you’re working at the broad two-digit sector level nothing has changed. The more detailed the categories are (3 to 6 digit), the more likely it is that you’ll encounter changes: industries that were created, or removed (aggregated into a broader miscellaneous category), or modified. You can use the concordance tables to see how definitions have changed, and in some cases crosswalk data from one category to another.

If you’re using the API, you’ll need to modify your url to access the 2017 NAICS variables (&NAICS2017=) as opposed to the 2012 series (&NAICS2012= ).

New Privacy Regulations

For confidentiality purposes, the Census Bureau has always employed various methods to insure that the summary data produced for the CBP and ZBP can’t be used to identify characteristics of an individual business. If a geographic area or industrial category had fewer than 3 establishments in it, or if one establishment in an area or category constituted an overwhelming majority of the employment or wages, then those values were not disclosed or published. The only characteristic that was always published was the number of establishments.

Not any more – beginning with the 2017 CBP and ZBP, the following applies:

> Prior to reference year 2017, the number of establishments in a particular tabulation cell was not considered sensitive; therefore, counts of establishments were released without any disclosure avoidance methods applied. Beginning with reference year 2017, cells with fewer than 3 establishments have been omitted from the release.

So what does this mean? First, for any county or ZIP Code that has fewer than 3 business establishments in total, records for that county or ZIP Code will not appear in the dataset at all (although establishments in these areas will be counted in summaries of larger areas, like states or metro areas). In my script, about 30 ZIP Codes for NYC fell out of my results compared to last year; these were primarily non-residential ZIPs that represented a single business that processes lots of mail, and post office box ZIPs.

Second, for a given geographic area, if a given NAICS category has less than three business establishments, the number of establishments won’t be reported for that category, but they will be included in the sum total. Once again, in my case I’m working with two-digit sector codes. There is a 00 code that captures the sum of all establishments. When I was summing the values of all of the two-digit codes together, I discovered that these sums rarely matched the 00 total, like they did in the past, because of the new non-disclosure policy. To account for this, and to calculate percent totals correctly, I had to create a category that takes the difference between the total 00 category and the sum of all the others, to count how many businesses were not disclosed (see pic below). I could then treat that category like the others, and the sum of the parts would equal the whole again.

These data frames show counts of establishments by two digit NAICS sectors. In the top df, the totals column N00 does not equal the sum of the others columns. A column was added to the bottom df to get the difference between the two.

Subsequently, I replaced the zeros for any ZIP code that had businesses that weren’t disclosed with NULLs, as I can’t know for certain if the values are truly zero. The most likely categories (at the two digit level for ZIPs) where data was not disclosed were: 11 (agriculture), 21 (mining), 22 (utilities), and 99 (unclassified businesses).

Looping Through and Retrieving Geographies

The API allows you to select all geographies within another geography using the ‘in’ clause (visit the ZBP API to see a list of variables and examples). For example, you can select all the counties in a particular state – in the example below, values would be passed into the variables in braces, and you would pass ANSI FIPS codes into the geography variables:

base_url = f'https://api.census.gov/data/{year}/{dsource}'
edata_url=f'{base_url}?get={ecols}&for={county}:*&in=state:{state}&key={api_key}'

This option is only available for geographies that nest, according to the Census Bureau’s geographic hierarchy. ZIP Codes are not a census geography and don’t nest within anything, so we can’t use the ‘in’ clause. For the 2016 and prior versions of the ZBP API, there was a trick for getting around this; there was a state variable called ST, which you could use in a similar fashion to get all the ZIP Codes in a state in a ‘for’ clause:

edata_url = f'{base_url}?get={ecols}&for=zipcode:*&ST={state}&key={api_key}'

Not any more – the ST variable disappeared in the 2017 API for the ZBP. So what can you do instead? Option one is to loop through a list of ZIP codes, passing them to the API one by one. This is fine if you just need a few, but pretty slow if you need the 260 something that I needed. Option two is to pass in several ZIP codes into the URL at once, but there’s a catch: you’re only allowed to pass in 50 values at a time to any variable. To do this, you need to divide your list of ZIPs into chunks of no more than 50, loop through the sub-lists to insert them into the url, and append the results to a big list as you go along.

A function for breaking a list of ZIP Codes (or any list of variables) into chunks:

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

Call the function to generate a list of lists with an equal number of values (in my case, my ZIP Codes are an index in a dataframe):

reqzips=list(chunks(zip2zcta.index.tolist(),48))

Then run the following to iterate through the list of ZIP code lists. I use enumerate so I can grab both the indices and values in the list. The ZIP codes values (v) have to be strung together and separated by commas before passing them into the url. The ecols variable is a list of columns I want to retrieve, which is also a single string with columns separated by commas. Once I receive the first chunk I append everything to a list (emp_data), but for every subsequent chunk I start reading from the second value [1:] and skip the first [0] because I only want to append the column headers once.

emp_data=[]
for i, v in enumerate (reqzips):
    batchzips=','.join(v)
    edata_url = f'{base_url}?get={ecols}&for=zipcode:{batchzips}&key={api_key}'
    response=requests.get(edata_url)
    if response.status_code==200:
        clear_output(wait=True)
        data=response.json()
        if i == 0:
            for record in data:
                emp_data.append(record)
        else:
            for record in data[1:]:
                emp_data.append(record)
        print('Retrieved data for chunk',i)
    else:
        print('***Problem with retrieval***, response code',response.status_code)
        break

The key here is to get the looping right, to insure that you end up with a list of lists where each list represents a row of data, in this case a ZIP code record with establishment data. I employed something similar (but a bit more complicated) with an ACS script that I wrote, but in that case I was looping through lists of columns / attributes instead of geographies.

If you’d like to learn more about the census business datasets and understand how to navigate NAICS, check out chapter 8 in my book. I don’t cover the APIs, but I do demonstrate how to use the new data.census.gov and I delve into the concepts behind these datasets in good detail.

Examples of using the Census Bureau’s API with Python

At the end of my book I briefly illustrate how the Census Bureau’s API works using Python. I’ll expand on that in this post; we’ll pull data from the Population Estimates Program, transform it, and create a chart using Python with Pandas in a Notebook. I’ll conclude with an additional example using the ZIP Code Business Patterns.

The Census Bureau has dedicated API pages for each dataset (decennial, acs, pop estimates, and more), and you need to familiarize yourself with the geographies and variables that are available for each. The API is a basic REST API, where you insert parameters into a base url and retrieve data based on the link you submit. Python has several modules you can use for interacting with APIs – the requests module is a popular choice.

The following pop estimates example is on github (but if github flops see the nbviewer example instead).

The top of the script contains basic stuff – import the modules you need, read in your key, and define the variables that you want to pull. You don’t have to use an API key, but if you don’t you’re limited to pulling in 500 records a day. Requesting a key is simple and free. A best practice is to store your key (a big integer) in a file that you read in, so you’re not exposing it in the script. Most of the census APIs require that you pass in a year and a dataset (dsource). Larger datasets may be divided into subsets (dname); for example the population estimates is divided into estimates, components of change, and characteristics (age, sex, race, etc.). Save the columns and geographies that you want to get in a comma-separated string. You have to consult the documentation and variable lists that are available for each dataset to build these, and the geography requires ANSI / FIPS codes.

%matplotlib inline
import requests,pandas as pd

with open('census_key.txt') as key:
    api_key=key.read().strip()

year='2018'
dsource='pep'
dname='components'
cols='GEONAME,NATURALINC,DOMESTICMIG,INTERNATIONALMIG'
state='42'
county='017,029,045,091,101'

Next, you can create the url. I’ve been doing this in two parts. The first part:

base_url = f'https://api.census.gov/data/{year}/{dsource}/{dname}'

Includes the base https://api.census.gov/data/ followed by parameters that you fill in. The year, data source, and dataset name are the standard pieces. The output looks like this:

'https://api.census.gov/data/2018/pep/components'

Then you take that base_url and add additional parameters that are going to vary within the script, in this case the columns and the geography, which all appear in the ‘get’ portion of the url. The ‘for’ and ‘in’ options allow you to select the type of geography within another geography, in this case counties within states, and you pass in the appropriate ANSI FIPS codes from the string you’ve created. The key appears at the end of the url, but if you opt not to use it you can omit that part. Once the link is fully constructed you use the requests module to fetch the data using that url. You can print the result out as text (assuming it’s not too long).

data_url = f'{base_url}?get={cols}&for=county:{county}&in=state:{state}&key={api_key}'
response=requests.get(data_url)
print(response.text)

The result looks like a nested list, but is actually a string that’s structured in a non-standard JSON format:

[["GEONAME","NATURALINC","DOMESTICMIG","INTERNATIONALMIG","state","county"],
["Bucks County, Pennsylvania","-178","-605","862","42","017"],
["Chester County, Pennsylvania","1829","-887","1374","42","029"],
["Delaware County, Pennsylvania","1374","-2513","1579","42","045"],
["Montgomery County, Pennsylvania","1230","-1987","2315","42","091"],
["Philadelphia County, Pennsylvania","8617","-11796","8904","42","101"]]

To do anything with it, convert it to JSON with response.json(). Then you can convert it into a list, dictionary, or in this example a Pandas dataframe. Here, I build the dataframe with everything from row one forward [1:]; row zero contains the column headers[0]. I rename some of the columns, build a unique ID by concatenating the state and county FIPS codes and set that as the new index, and drop the individual county and state FIPS columns. By default every object that’s returned is a string, so I convert the numeric columns to integers:

data=response.json()
df=pd.DataFrame(data[1:], columns=data[0]).\
    rename(columns={"NATURALINC": "Natural Increase", "DOMESTICMIG": "Net Domestic Mig", "INTERNATIONALMIG":"Net Foreign Mig"})
df['fips']=df.state+df.county
df.set_index('fips',inplace=True)
df.drop(columns=['state','county'],inplace=True)
df=df.astype(dtype={'Natural Increase':'int64','Net Domestic Mig':'int64','Net Foreign Mig':'int64'},inplace=True)
df

Then I can see the result:

Once the data is in good shape, you can begin to analyze and visualize it. Here’s the components of population change for Philadelphia and the surrounding suburban counties in Pennsylvania from 2017 to 2018 – natural increase is the difference between births and deaths, and there’s net migration within the US (domestic) and between the US and other countries (foreign):

labels=df['GEONAME'].str.split(' ',expand=True)[0]
ax=df.plot.bar(rot=0, title='Components of Population Change 2017-18')
ax.set_xticklabels(labels)
ax.set_xlabel('')

Each request is going to vary based on your specific needs and the construction of the particular dataset. Here’s another example where I pull data on business establishments, employees, and wages (in $1,000s of dollars) from the ZIP Code Business Patterns (ZBP). This dataset is smaller, so it doesn’t have a dataset name, just a data source. To get all the ZIP Codes in Delaware I use the asterisk * wildcard. Because ZIP Codes do not nest within states I can’t use the ‘in’ option, it’s simply not available. A state code is stored in a special field called ST, and I can use it as a general limiter with equals in the query:

year='2016'
dsource='zbp'
cols='ESTAB,EMP,PAYQTR1,PAYANN'
state='10'

base_url = f'https://api.census.gov/data/{year}/{dsource}'

data_url = f'{base_url}?get={cols}&for=zipcode:*&ST={state}&key={api_key}'
response=requests.get(data_url)
print(response.text)

[["ESTAB","EMP","PAYQTR1","PAYANN","ST","zipcode"],
["982","26841","448380","1629024","10","19713"],
["22","628","3828","15848","10","19716"],
["8","15","371","2030","10","19732"],
["7","0","0","0","10","19718"],
["738","9824","83844","353310","10","19709"]...

data=response.json()
zbp_data=pd.DataFrame(data[1:], columns=data[0]).set_index('zipcode')
zbp_data.drop(columns=['ST'],inplace=True)
for field in cols.split(','):
    zbp_data=zbp_data.astype(dtype={field:'int64'},inplace=True)
zbp_data.head()

One of the issues with the ZBP is that many variables are not disclosed due to privacy regulations; instead of returning nulls a zero is returned, but in this dataset they are not true zeros. Once you retrieve the data and set the types you can replace zeros with NaNs, which are numpy / Panda nulls – although there’s a quirk in that dataframe columns declared as integers cannot contain null values. Instead you can use a float, or a workaround that’s been implemented for new Pandas versions (for my specific use case this data will be inserted into a database, so I’ll use SQL to accomplish the zero to null conversion). ZBP data is also injected with noise to protect privacy, and you can retrieve special columns that contain noise flags.

The API is convenient for automating the data acquisition process, and allows you to cherry pick the variables you want. To avoid accessing the API over and over again as you build your scripts (which is prohibitive when requesting lots of data) you can pickle the data right after you retrieve it – a pickle is a python data object that efficiently stores data locally, and pandas has special functions for creating and accessing them. Once you pull your data and pickle it, you can comment out (or in a notebook, don’t rerun) the requests block, and subsequently pull the data from the pickle as you tweak your code (see caveat in the postscript – perhaps best to use json instead of pickle).

#Write to a pickle
zbp_data.to_pickle('insert path here.pickle')

#Read from a pickle to dataframe
zbp_new=pd.read_pickle('insert path here.pickle')

Take a look at the Census Data API User Guide to learn more. The guide focuses just on the REST API, and is not specific to a scripting language. Of course, you also need to familiarize yourself with the datasets and how they’re created and organized, and with census geography (which is why I wrote this book).

Postscript

Since I’ve finished this post I’ve created a notebook that pulls ZBP data from the API (alt nbviewer here) and have some additional thoughts I’d like to share:

I decided to dump the data I retrieved from the API to a json file and then pull data from it instead of using a pickle. Pickles come with serious security issues. If you don’t intend to share your code with anyone pickles are fine, otherwise consider an alternative.
My method for parsing the retrieved data into a dataframe worked fine because the census API uses non-standard JSON; essentially the string that’s returned resembles a nested Python list. If this was true JSON, we may need to employ a different method to account for the fact that the number of elements per record may vary.
Wildcards are not always available to build urls for certain data; for example to download the number of establishments classified by industry I wasn’t able to grab everything for one state using the method I illustrated in this post. Instead I had to loop through a list of ZIP and NAICS codes to retrieve what I wanted one at a time.
In the case of retrieving establishments classified by industry there were many cases when there was no data for a particular ZIP Code (i.e. no farms and mines in midtown Manhattan). Since I needed records that showed zero establishments, I had to insert them myself if the API returned no result. Even if you didn’t need records with zeros, it’s important to consider the potential impact of getting nothing back from the API on your subsequent code.
Given my experience thus far these APIs were pretty reliable, in that I haven’t had issues with time outs and partially returned data. If this was not the case and you had lots of data to retrieve, you would need to build in some try – except statements to handle exceptions, save data as you go along, and pick up where you left off if something breaks. Read about this geocoding script I wrote a few years back for examples.

At These Coordinates

Dispatches from the Geospatial Data World

api