Series – Python Geocoding

A series of articles about geocoding addresses with Python

Python API Code

Geocoding with the NYC Geoclient API and Python

Even though I’ve left New York, there are still occasions where I refer back to NYC resources in order to help students and faculty here with NYC-based research. Most recently I revisited NYC DOITT’s Geoclient API for geocoding addresses, and discovered that a number of things have changed since I last used it a few years ago. I’ll walk through my latest geocoding script in this post.

First and foremost: if you landed on this page because you’re trying to figure out how to get your Geoclient API key to work, the answer is:

&subscription-key=YOURKEYHERE

This replaces the old format that required you to pass an app ID and key. I searched through two websites and scanned through hundreds of pages of documentation, only to find this solution in a cached Google search result: the new docs don’t mention the change, and the old docs still show the previous application ID and key examples. So – hopefully this saves you some hours of frustration.

I was working with someone who needed to geocode a subset of the city’s traffic violation data from the open data portal, as the data lacks coordinates. It’s also missing postal city names and ZIP Codes, which precludes using most geocoders that rely on this information. Even if we had these fields, I’ve found that many geocoders struggle with the hyphenated addresses used throughout Queens, and some work-around is needed to get matches. NYC’s Geoclient is naturally able to handle those Queens addresses, and can use the borough name or code for locating addresses in lieu of ZIP Codes. The traffic data uses pseudo-county codes, but it’s easy to replace those with the corresponding borough codes.

The older documentation is still solid for illustrating the different APIs and the variables that are returned; you can search for a parsed or non-parsed street address, street intersections, places of interest or landmarks, parcel blocks and lots, and a few others.

I wrote some Python code, pasted below, for geocoding addresses that have the house number, street, and borough stored in separate fields using the address API; if the house number is missing, we try again by doing an intersection search, as an intersecting street is stored in a separate field in the traffic data. In the past I used a thin client for accessing the API, but I’m skipping that here, as it’s simpler to just build the URLs directly with the requests module.

The top of the script has the standard stuff: the name of the input file, the column locations (counting from zero) in the input file that contain each of the four address components, the base URL for the API, a time function for progress updates, reading the API key in from a file, and looping through the input CSV with the addresses to save the header row in one list and the records in a nested list. I created a list of the fields returned from the API that I want to hold on to, and added them to the header row, along with a final variable that records the result of the match. In addition to longitude and latitude you can also get xCoordinate and yCoordinate, which are in the NY State Plane Long Island (ft-US) map projection. I added a counts dictionary to keep track of the result of each match attempt.

Then we begin a long loop – this is a bit messy, and if I had more time I’d collapse much of it into a series of functions, as there is repetitive code. I loop through the index and value of each record, beginning with the first one. The loop is in a try / except block, so in the event that something goes awry it should exit cleanly and write out the data that was captured. We take the base URL and append the address request, slicing the record to get the values for house, street, and borough into the URL. An example of a URL after the address components are passed in:

https://api.nyc.gov/geo/geoclient/v1/address.json?houseNumber=12345&street=CONEY ISLAND AVE&borough=BROOKLYN&subscription-key=KEYGOESHERE

Pass that URL to the requests module and get a response back. If an address is returned, the JSON resembles a Python dictionary, with ‘address’ as the key and the value as another dictionary with key value pairs of several variables. Otherwise, we get an error message that something was wrong with the request.

A successful address match returns an address dictionary, with a sub-dictionary of keys and values

The loop logic:

  • If the package contains an ‘address’ key, flatten to get the sub-dictionary
    • If ‘longitude’ is present as a key, a match is returned, get the relevant fields and append to the record
    • Else if the dictionary contains a ‘message’ key with a value that the house number was missing, do an intersection match
      • If the package contains an ‘intersection’ key, flatten to get the sub-dictionary
        • If ‘longitude’ is present as a key, a match is returned, get the relevant fields and append to the record
        • If not, there was no intersection match, just get the messages and append blanks for each value to the record
      • If not, an error was returned, capture the error and append blanks for each value to the record, and continue
    • If not, there was no address match, just get the messages and append blanks for each value to the record
  • If not, an error was returned, capture the error and append blanks for each value to the record, and continue

The API has limits of 2500 matches per minute and 500k per day, so after every 2000 records I built in a pause of 15 seconds. Once the process finishes, successfully or not, the records are written out to a CSV file, header row first followed by the records. If the process bailed prematurely, the last record and its index are printed to the screen. This allows you to rerun the script where you left off, by changing the start index in the variables list at the top of the script from 0 to the last record that was read. When it comes time to write output, the previous file is appended rather than overwritten and the header row isn’t written again.

It took about 90 minutes to match a file of 25,000 records. I’d occasionally get an error message that the API key was bad for a given record; the error would be recorded and the script continued. It’s likely that there are illegal characters in the input fields for the address that end up creating a URL where the key parameter can’t be properly interpreted. I thought the results were pretty good; beyond streets it was able to recognize landmarks like large parks and return matched coordinates with relevant error messages (example below). Most of the flops were, not surprisingly, due to missing borough codes or house numbers.

Output fields from the NYC Geoclient written to CSV

To use this code you’ll need to sign up for an NYC Developer API account, and then you can request a key for the NYC Geoclient service. Store the key in a text file in the same folder as the script. I’m also storing inputs and outputs in the same folder, but with a few functions from the os module you can manipulate paths and change directories. If I get time over the winter break I may try rewriting to incorporate this, plus functions to simplify the loops. An alternative to the API would be to download the LION street network geodatabase and set up a local address locator in ArcGIS Pro, which might be worth doing if you had tons of matches. But I quickly got frustrated with the ArcGIS documentation, and after a number of failed attempts I opted to use the Geoclient instead.

"""
Match addresses to NYC Geoclient using house number, street name, and borough
Frank Donnelly / GIS and Data Librarian / Brown University
11/22/2021 - Python 3.7
"""

import requests, csv, time

#Variables
addfile='parking_nov2021_nyc.csv' #Input file with addresses
matchedfile=addfile[:-4]+'_output.csv' #Output file with matched data
keyfile='nycgeo_key.txt' #File with API key
start_idx=0 #If the program breaks, set this to the index of the record where it stopped
#Counting from 0, positions in the CSV that contain the address info 
hous_idx=23
st_idx=24
boro_idx=21
inter_idx=25
base_url='https://api.nyc.gov/geo/geoclient/v1/'

def get_time():
    time_now = time.localtime() # get struct_time
    pretty_time = time.strftime("%m/%d/%Y, %H:%M:%S", time_now)
    return pretty_time

print('*** Process launched at', get_time())

#Read api key in from file
with open(keyfile) as key:
    api_key=key.read().strip()

records=[]

with open(addfile,'r') as infile:
    reader = csv.reader(infile)
    header = next(reader) # Capture column names as separate list
    for row in reader:
        records.append(row)

# Fields returned by the API to capture
# https://maps.nyc.gov/geoclient/v1/doc
fields=['message','message2','houseNumber','firstStreetNameNormalized',
        'uspsPreferredCityName','zipCode','longitude','latitude','xCoordinate',
        'yCoordinate']
header.extend(fields)
header.append('match_result')
datavals=len(fields)-2 # Number of fields that are not messages
counts={'address match':0, 'intersection match':0,
        'failed address':0, 'failed intersection':0,
        'error':0}

print('Finished reading data from', addfile)
print('*** Geocoding process launched at',get_time())

finished=True # Flipped to False if the loop exits early on an error

for i,v in enumerate(records[start_idx:],start=start_idx): # i is the absolute record index
    try:
        data_url = f'{base_url}address.json?houseNumber={v[hous_idx]}&street={v[st_idx]}&borough={v[boro_idx]}&subscription-key={api_key}'
        response=requests.get(data_url)
        package=response.json()
        # If an address is returned, continue
        if 'address' in package:
            result=package['address']     
            # If longitude is returned, grab data
            if 'longitude' in result:
                for f in fields:
                    item=result.get(f,'')
                    v.append(item)
                v.append('address match')
                counts['address match']=counts['address match']+1
            # If there was no house number, try street intersection match instead
            elif 'message' in result and result['message']=='INPUT CONTAINS NO ADDRESS NUMBER' and v[inter_idx] not in ('',None):
                try:
                    data_url = f'{base_url}intersection.json?crossStreetOne={v[st_idx]}&crossStreetTwo={v[inter_idx]}&borough={v[boro_idx]}&subscription-key={api_key}'
                    response=requests.get(data_url)
                    package=response.json()
                    # If an intersection is returned, continue
                    if 'intersection' in package:
                        result=package['intersection']
                        # If longitude is returned, grab data
                        if 'longitude' in result:
                            for f in fields:
                                item=result.get(f,'')
                                v.append(item)
                            v.append('intersection match')
                            counts['intersection match']=counts['intersection match']+1
                        # Intersection match fails, append messages and blank values
                        else:
                            v.append(result.get('message',''))
                            v.append(result.get('message2',''))
                            v.extend(['']*datavals)
                            v.append('failed intersection')
                            counts['failed intersection']=counts['failed intersection']+1
                    # Error returned instead of intersection
                    else:
                        v.append(package.get('message',''))
                        v.append(package.get('message2',''))
                        v.extend(['']*datavals)
                        v.append('error')
                        counts['error']=counts['error']+1
                        print(package.get('message',''))
                        print('Geocoder error at record',i,'continuing the matching process...')
                except Exception as e:
                    print(str(e))
                    v.extend(['']*len(fields)) # Keep the row aligned with the header
                    v.append('error')
                    counts['error']=counts['error']+1
            # Address match fails, append messages and blank values
            else:
                v.append(result.get('message',''))
                v.append(result.get('message2',''))
                v.extend(['']*datavals)
                v.append('failed address')
                counts['failed address']=counts['failed address']+1
        # Error is returned instead of address
        else:
            v.append(package.get('message',''))
            v.append(package.get('message2',''))
            v.extend(['']*datavals)
            v.append('error')
            counts['error']=counts['error']+1
            print(package.get('message',''))
            print('Geocoder error at record',i,'continuing the matching process...')
        if i>start_idx and i%2000==0: # Pause periodically to respect the rate limit
            print('Processed',i,'records so far...')
            time.sleep(15)         
    except Exception as e:
        print(str(e))
        finished=False
        break # Exit cleanly so everything captured so far gets written out

# First attempt, write to new file, but if break happened, append to existing file
if start_idx==0:
    wtype='w' 
else:
    wtype='a'

# One past the last record processed; if the loop broke, the failed record is excluded
end_idx=i+1 if finished else i

with open(matchedfile,wtype,newline='') as outfile:
    writer = csv.writer(outfile, delimiter=',', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)
    if wtype=='w':
        writer.writerow(header)
        writer.writerows(records[start_idx:end_idx])
    else:
        writer.writerows(records[start_idx:end_idx])
print('Wrote',end_idx-start_idx,'records to file',matchedfile)
if not finished:
    print('Process stopped at record',i,'- rerun with start_idx set to',i,':\n',v)
for k,val in counts.items():
    print(k,val)
print('*** Process finished at',get_time())

Python Geocoding Take 2 – US Addresses

In Python Geocoding Take 1 – International Addresses I discussed my recent adventures with geocoding addresses outside the US. In contrast, there are countless options for batch geocoding addresses within the United States. I’ll discuss a few of those options here, but will focus primarily on the US Census Geocoder and a Python script I’ve written to batch match addresses using their API. The code and documentation are available on my lab’s resources page.

A Few Different Options

ESRI’s geocoding services allow you (with an account) to access their geocoding servers through tools in the ArcToolbox, or you can write a script and access them through an API. QGIS has a third-party plugin for accessing Google’s services (2500 records a day for free) or OpenStreetMap. You can still do things the old-fashioned way, by downloading geocoded street files and creating a matching service.

Alternatively, you can subscribe to any number of commercial or academic services where you can upload a file, do the matching, and download results. For years I’ve used the geocoding services at Texas A&M that allow you to do just that. Their rates are reasonable, or if you’re an academic institution and partner with them (by placing links to their service on your website) you can request free credits for doing matches in batches.

The Census Geocoder and API, and a Python Script for Batch Geocoding

The Census Bureau’s TIGER and address files are often used as the foundational layers for building these other services, to which the service providers add refinements and improvements. You can access the Census Bureau’s services directly through the Census Geocoder, where you can match an address one at a time, or you can upload a batch of 1000 records. It returns longitude and latitude coordinates in NAD 83, and you can get names and codes for all the census geographies where the address is located. The service is pretty picky about the structure of the upload file (must be plain text, csv, with an id column and then columns with the address components in a specific order – with no other attributes allowed) but the nice thing is it requires no login and no key. It’s also public domain, so you can do whatever you want with the data you’ve retrieved. A tutorial for using it is available on our lab’s census tutorials page.
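For reference, the upload file takes this shape – no header row, just a unique ID followed by the address components in order (the addresses here are purely illustrative):

1,1600 Pennsylvania Ave NW,Washington,DC,20500
2,4600 Silver Hill Rd,Suitland,MD,20746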

census geocoder

They also have an API with some basic documentation. You can match parsed and unparsed addresses, and can even do reverse geocoding. So I took a stab at writing a script to batch process addresses in text-delimited files (csv or txt). Unfortunately, the Census Geocoding API is not one of the services covered by the Python Geocoder module that I mentioned in my previous post, but I did find another third-party module called censusgeocode which provides a thin wrapper you can use. I incorporated that module into my Python 3 script, which I wrote as a function that takes the following inputs:

census_geocode(datafile,delim,header,start,addcol)
(str,str,str,int,list[int]) -> files

  • datafile – this is the name of the file you want to process (file name and extension). If you place the geocode_census_funct.py file in the same directory as your data file, then you just need to provide the name of the file. Otherwise, you need to provide the full path to the file.
  • delim – this is the delimiter or character that separates the values in your data file. Common delimiters include commas ',', tabs '\t', and pipes '|'.
  • header – here you specify whether your file has a header row, i.e. column names. Enter ‘y’ or ‘yes’ if it does, ‘n’ or ‘no’ if it doesn’t.
  • start – type 0 to specify that you want to start reading the file from the beginning. If you were previously running the script and it broke and exited for some reason, it provides an index number where it stopped reading; if that’s the case you can provide that index number here, to pick up where you left off.
  • addcol – provide a list that indicates the position number of the columns that contain the address components in your data file. For an unparsed address, you provide just one position number. For a parsed address, you provide 4 positions: address, city, state, and ZIP code. Whether you provide 1 or 4, the numbers must be supplied in brackets, as the function requires a Python list.

You can open the script in IDLE, run it to load it into memory, and then type the function with the necessary parameters in the shell to execute it. Some examples:

  • A tab-delimited, unparsed address file with a header that’s stored in the same folder as the script. Start from the beginning and the address is in the 2nd column: census_geocode('my_addresses.txt','\t','y',0,[2])
  • A comma-delimited, parsed address file with no header that’s stored in the same folder as the script. Start from the beginning and the addresses are in the 2nd through 5th columns: census_geocode('addresses_to_match.csv',',','n',0,[2,3,4,5])
  • A comma-delimited, unparsed address file with a header that’s not in the same folder as the script. We ran the file before and it stopped at index 250, so restart there – the address is in the 3rd column: census_geocode('C:\\address_data\\data1.csv',',','y',250,[3])

The beginning of the script “sets the table”: we read the address columns into variables, create the output files (one for matches, one for non-matches, and a summary report), and we handle whether or not there’s a header row. For reading the file I used Python’s CSV module. Typically I don’t use this module, as I find it’s much simpler to do the basics: read a line in, split it on a delimiter, strip whitespace, read it into a list, etc. But in this case the CSV module allows you to handle a wider array of input files; if the input is a csv and there happen to be commas embedded in the values themselves, the CSV module takes care of it easily, whereas ignoring it would throw off the parsing for those records.
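As a rough illustration (a sketch rather than the script itself – the file name and column positions are stand-ins, and check the censusgeocode docs for the exact result structure), reading a parsed-address file and matching each row looks something like this:

import csv
import censusgeocode as cg # third-party wrapper around the Census Geocoder API

datafile='addresses_to_match.csv' # hypothetical comma-delimited input with a header

with open(datafile,'r',newline='') as infile:
    reader=csv.reader(infile,delimiter=',')
    header=next(reader) # set the header row aside
    for row in reader:
        # Parsed match: address, city, state, and ZIP pulled by column position
        result=cg.address(row[1],city=row[2],state=row[3],zipcode=row[4])
        if result: # an empty list back means no match
            lng=result[0]['coordinates']['x']
            lat=result[0]['coordinates']['y']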

Handling Exceptions and Server Errors

In terms of expanding my skills, the new things I had to learn were exception handling and control flow. Since the censusgeocode module is a thin wrapper, it had no built-in mechanism for retrying a match a certain number of times if the server timed out. This is an absolute necessity, because the census server often times out, is busy, or just hiccups, returning a generic error message. I had already learned how to handle crashes in my earlier geocoding experiments, where I would write the script to match and write a record one by one as it went along. It would try to do a match, but if any error was raised, it would exit that loop cleanly, write a report, and all would be saved and you could pick up where you left off. But in this case, if that server non-response error was returned I didn’t want to give up – I wanted to keep trying.

So on the outside there is a loop to try and do a match, and if any unexpected error happens, we exit the loop cleanly and wrap up. But inside there is another try block, where we attempt a match, and if we get that specific server error, continue: go back to the top of the loop and try again. That inner loop begins with while True – if we successfully get to the end, then we start on the next record. If we get that server error we stay in the while loop and keep trying until we get a match, or we run out of tries (5) and write it out as a non-match.
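In outline, the retry pattern looks like this – a simplified sketch rather than the script itself, where records, do_match, write_match, and write_nonmatch are hypothetical stand-ins:

import time

MAXTRIES=5

for record in records:
    tries=0
    while True: # keep retrying this record until success or we run out of tries
        tries=tries+1
        try:
            result=do_match(record) # hypothetical stand-in for the API request
        except Exception: # server timed out, was busy, or hiccuped
            if tries>=MAXTRIES:
                write_nonmatch(record) # give up and record as a non-match
                break
            time.sleep(2) # brief pause, then back to the top to retry
            continue
        write_match(record,result) # success: move on to the next record
        break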

error handling

In doing an actual match, the script does a parsed or unparsed match based on user input. But there was another sticking point: in some instances the API would return a matched result (we got coordinates!), but some of the objects that it returned were actually errors caused by some Java problem on their end (failed to get the tract number or county name – here’s an error message instead!). To handle this, we have a for i in range loop. If we have a matched record and we don’t have a status message (which indicates an error), then we move along and grab all the info we need – the coordinates, and all the census geography where that coordinate falls – write it out, and that for loop ends with a break. But if we receive an error message we continue – go back to the top of that loop and try doing the match again. After 3 tries we give up and write no match.
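That validation loop follows this pattern – again a sketch with hypothetical helper names, where 'status' stands in for whichever key flags the embedded error:

for attempt in range(3): # up to three tries to get a clean result back
    result=do_match(record) # hypothetical stand-in for the API request
    if result and 'status' not in result[0]: # a match with no embedded error
        write_match(record,result[0]) # grab the coordinates and geographies
        break # done with this record
    # otherwise an error object came back: loop around and try the match again
else:
    write_nonmatch(record) # three strikes: write as a non-match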

Figuring all that out took a while – where do these loops go and what goes in them, how do I make sure that I retry a record rather than passing over it to the next one, etc. Stack Exchange to the rescue! Difference between continue, pass and break, returning to the beginning of a loop, breaking out of a nested loop, and retrying after an exception. The rest is pretty straightforward. Once the matching is done we close the files, and write out a little report that tells us how many matches we got versus fails. The Census Geocoder via the API is pretty unforgiving; it either finds a match, or it doesn’t. There is no match score or partial matching, and it doesn’t give you a ZIP Code or municipal centroid if it can’t find the address. It’s all or nothing; if you have partial or messy addresses or PO Boxes, it’s pretty much guaranteed that you won’t get matches.

There’s no limit on the number of matches, but I’ve built in a number of pauses so I’m not hammering the server too hard – one second after each match, 5 seconds after every 1000 matches, and a couple seconds before retrying after an error. Your mileage will vary, but the other day I did about 2500 matches in just under 2 hours. Their server can be balky at times – in some cases I’ve encountered only a couple problems for every 100 records, but on other occasions there were hang-ups on every other record. For diagnostic purposes the script prints every 100th record to the screen, as well as any problems it encountered (see pic below). If you launch a process and notice the server is hanging on every other record and repeatedly failing to get matches, it’s probably best to bail out and come back later. Recently, I’ve noticed fewer problems during off-peak times: evenings and weekends.

script_running

Wrap Up

The script and the documentation are posted on our lab’s resources page, for all to see and use – you just have to install the third-party censusgeocode module before using it. When would you want to use this? Well, if you need something that’s free, this is a good choice. If you have batches in the 10ks to do, this would be a good solution. If you’re in the 100ks, it could be a feasible solution – one of my colleagues has confirmed that he’s used the script to match about 40k addresses, so the service is up to the task for doing larger jobs.

If you have fewer than a couple thousand records, you might as well use their website and upload files directly. If you’re pushing a million or more – well, you’ll probably want to set up something locally. PostGIS has a TIGER module that lets you do desktop matching if you need to go into the millions, or if you simply have a lot to do on a consistent basis. The excellent book PostGIS in Action has a chapter dedicated to this.

In some cases, large cities or counties may offer their own geocoding services, and if you know you’re just going to be doing matches for your local area those sources will probably have greater accuracy, since they add value with local knowledge. For example, my results with NYC’s geocoding API for addresses in the five boroughs are better than the Census Bureau’s, and it’s customized for local quirks; I can pass in a borough name instead of a postal city and ZIP Code, and it’s able to handle those funky addresses in Queens that have dashes, as well as the similar names given to multiple streets (35th St, 35th Ave, 35th Dr…). But for a free, public domain service that requires no registration, no keys, covers the entire country, and is the foundation for just about every US geocoding platform out there, the Census Geocoder is hard to beat.

Python Geocoding Take 1 – International Addresses

This past semester has been the semester of geocoding. I’ve had a number of requests for processing large batches of addresses. Now that the term is drawing to a close, I’ll share some of my trials and tribulations. In this post, I’ll focus on my adventures in international geocoding.

First, it’s necessary to provide some context. As an academic librarian I’m primarily engaged with assisting students and faculty with their coursework and their research. My users are interested in getting coordinates for data so that they can do both analysis and visualization, which requires them to download the actual coordinate data in a batch and integrate it with the rest of their projects.

This is an important distinction to make, because in many cases the large web mapping companies (Google, Bing, Mapquest, etc.) are not catering to this population – they provide services and APIs to web developers, so those folks can integrate geocoding into the Google, Bing, etc. maps they are embedding in their websites. These geocoding providers specifically prohibit (in the fine print of their terms of use) anyone from using their services to create and download geocoded data. This essentially excludes a lot of academic use – which is something I hadn’t fully grasped at the outset.

Google’s Geocoding API Perhaps?

My adventure began when a professor asked me for help in geocoding about 1 million addresses – in Turkey. Right from the beginning, many of the usual sources I would turn to (for US addresses) were out the window. I knew that I could do small batches of international addresses with the mmQGIS geocoding plugin, so I started testing there. The address file we had consisted of unparsed addresses, and the formatting looked rather chaotic – and after doing some research I discovered that geocoding Turkish addresses was a tough proposition. The OpenStreetMap plugin (using Nominatim) returned no matches for our 1000 test cases. The Google results were much better, so we decided to investigate writing a script that used their API, and to pay for the matching. According to the documentation, it would end up costing $500 to do 1 million addresses.

I searched around for some Python APIs and found what I believed was the official one for Google Maps geocoding. So I spent a day writing a script that would loop through the addresses, which we divided into batches of 100k records each (the max you can do per day with Google if you set up billing), and the professor obtained an API key and set up billing for the account. The interface for setting up and managing the Google APIs was ridiculously confusing. Eventually we were set and I let the script rip, and found that it wouldn’t rip for long. It would consistently stop after doing a few thousand records. I had written it to write results one by one as they were obtained, and to exit cleanly in the case of errors. Upon exit, it provides the index number of the record where it stopped, so I was able to pick up where it left off. But the server would constantly time out – sometimes it could do 10 to 12k records in a stretch, but often less, so I could never leave it unattended for long. The matches themselves were a mixed bag – you could throw absolute garbage at the Google geocoder and still get a match: if not to an address or property, then to a street segment, and beyond that to useless things like postal codes, administrative districts, and the country as a whole (i.e. I can’t find your address, so here are the coordinates for the geographic center of Istanbul, or for all of Turkey. Have a nice day).

It seemed like it was going to be a long climb to get to 1 million – but after about 100k we could go no further. Google simply refused every additional request. A new API key would get us a little further, but soon after that nothing would work and we wouldn’t get any useful error messages to explain why. Having never done anything like this before, I started to investigate why, and eventually discovered the problem: these web mapping geocoding services, even if you pay for them, are not meant to be used this way. Buried in the documentation I found the license restrictions, which stipulate that you are not allowed to download any of the data, and you had to plot every coordinate you retrieved onto a Google map. This is a service for web mapping developers, not researchers.

Why hadn’t I realized this before? One, I simply had never made this distinction as I thought geocoding was geocoding, and in my world of course people are going to want to download the coordinates. Two, the Internet is full of thousands of little blog posts and tutorials which demonstrate how to use the Google Maps APIs, so I thought this was possible. But they never mention any of the caveats about what you can and can’t do with these services. In addition to violating the service terms, what I was doing was akin to yelling in the back of a crowded room, as I was hammering their server, sending requests as fast as I could with no limit. A normal web mapping application (which is what the service is designed for) would send a fraction of those requests in that amount of time. No wonder the requests were refused. Thus ended my Google geocoding experiment.

Nope – How About ESRI Instead?

So what to do next? I found that most of the other commercial web mapping services didn’t provide anything near the maximum caps and low prices that Google was offering. Mapquest for example requires that you subscribe to an account on a monthly or annual basis, and 100k is the amount you could do in a month. Most of the other commercial services also prohibit any downloads.

The big exception is ESRI – they are one of the few that understand and cater to the academic market, and they do allow downloads. They say quite plainly: “Take your Coordinates with you. Once you have the results of a Geocode operation, they’re yours to take anywhere.” My university has a site license for ArcGIS, but it doesn’t include geocoding. You can create an account and get a certain number of free credits, and after that you pay. 1 million records was going to cost about $4,000 – substantially more than Google, but totally legal. ESRI provides lists of countries and ranks them according to how complete their street network coverage is. You can use their API via a script, or you can set up the service in ArcGIS Desktop and do the matching through the ArcToolbox. This would be painfully slow for a large job (like this one), but for the purpose of testing it out with a few hundred records this is what I tried. Unfortunately, in our case the results still weren’t good. Most of the addresses were matched to administrative or postal areas; not specific enough.

The Python Geocoder and a Wealth of Options

What often happens in librarianship when a patron makes an initial request (this should be a piece of cake, right?) and then discovers that what they’re looking for is more involved (ahhh this will be tougher than we thought), is that they reframe the question. He went back through the addresses with a research partner and winnowed them down based on what they really, absolutely needed, so now we were down from 1 million to just finding a match for about 300k. His colleague also suggested that we use Yandex, the Russian search and mapping engine. The structure of Russian addresses is quite similar to Turkish ones, and since Russia is closer to Turkey geographically and economically Yandex might do a better job.

I was dubious of this at first, but was quickly surprised. I found the Python Geocoder module, which provides a common, uniform API to over two dozen different geocoding services – including Google. Given the simplicity and flexibility of this module, it’s the one I should have used in the first place. And while Google limits you to 2500 free matches in one day, Yandex allows you 25k – that’s 25,000 – free matches in one day, without having to request an API key! I modified the original script I wrote to use the Python Geocoder module with Yandex, and the initial small-batch tests were successful. Here’s a small portion of the code – it loops through a file where the address is stored in one field (unparsed):

for index, line in enumerate(readfile):
    address=line.strip().split(delim)
    result=geocoder.yandex(address[add]).json

And it spits you back this JSON result (you could also do XML if you prefer):

{'quality': 'street', 'address': 'Türkiye, İstanbul, Fatih, Cankurtaran Mh., Ayasofya Meydanı', 'location': 'Hagia Sophia Museum, Sultanahmet Mh., Ayasofya Meydanı, Fatih/İstanbul', 'state': 'İstanbul', 'lng': '28.979031', 'accuracy': 'street', 'encoding': 'utf-8', 'provider': 'yandex', 'country_code': 'TR', 'ok': True, 'status_code': 200, 'lat': '41.00772', 'country': 'Türkiye', 'county': 'Fatih', 'confidence': 10, 'bbox': {'northeast': [41.008156, 28.979714], 'southwest': [41.007285, 28.978349]}, 'street': 'Ayasofya Meydanı', 'status': 'OK'}

If the result you get back is not OK (ok is False – nothing matched), then write the record to the unmatched file. Otherwise, get the bits and pieces out of the json object that you want, append them to the record, and write the whole record out to a matched file.

    if result.get('ok')==False:
        nomatch.append(address)
        nomatchfile.writelines('\t'.join(address)+'\n')
    else:
        lng=result.get('lng')
        lat=result.get('lat')
        qual=result.get('quality')
        accu=result.get('accuracy')
        matchadd=result.get('address')
        newitems=lng,lat,qual,accu,matchadd
        address.extend(newitems)
        matched.append(address)
        matchfile.writelines('\t'.join(address)+'\n')
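Stitched together, a minimal self-contained version of this loop might look like the following – a sketch, not the full script: the file names and the add column position are placeholders, and newer versions of the Yandex service may require an API key:

import geocoder # the Python Geocoder module

delim='\t'
add=1 # position of the unparsed address column (illustrative)

with open('addresses.txt') as readfile, \
     open('matched.txt','w') as matchfile, \
     open('unmatched.txt','w') as nomatchfile:
    for index, line in enumerate(readfile):
        address=line.strip().split(delim)
        result=geocoder.yandex(address[add]).json
        if result.get('ok')==False: # no match: write the record as-is
            nomatchfile.writelines('\t'.join(address)+'\n')
        else: # match: append the coordinates and descriptors, then write
            newitems=(result.get('lng'),result.get('lat'),result.get('quality'),
                      result.get('accuracy'),result.get('address'))
            address.extend('' if n is None else str(n) for n in newitems)
            matchfile.writelines('\t'.join(address)+'\n')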

But is it legal? It was unclear to me; they specify that map data is meant for personal/noncommercial use and in the same sentence: “Any copying of the Data, their reproduction, conversion, distribution, promulgation (publication) in the Internet, any use of the Data in mass media and/or for commercial purposes without a prior written consent of the right holder, shall be prohibited”. Does that mean any copying, or just copying for commercial use or for redistributing the data? In our case, this is for academic non-profit use and the data (individual geocoded records) wasn’t going to be republished – it would be used for plotting distances between locations and making highly generalized static dot maps for an article. At this stage we seemed to be out of options – if you need to geocode a large batch of international addresses, AND you are willing to pay for it, where on Earth can you go?

Ultimately, I left it up to the professor to contact them or not, and we decided to roll the dice. For my part, I engineered the script to put a minimum load on their servers – essentially I could take 24 hours to do 25k records. I used the time and random modules in Python to build pauses in between records to slow things down. In sharp contrast to Google, the Yandex servers were amazingly reliable – they were able to do batches of 25k records every single time without timing out – not even once – and in less than a couple weeks we were finished. About 50% of the matches were good, and for the others he and a research assistant went back and cleaned up unmatched records, and I gave them the script so they could try again.
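The pauses themselves can be as simple as a line like this at the bottom of the loop (the interval here is arbitrary):

import time, random

time.sleep(2+random.random()*3) # wait between 2 and 5 seconds before the next record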

International Geocoding: The Take-Aways

  1. If you need to geocode a large batch of foreign addresses for academic or research purposes, forget Google. Their service was less than stellar (to put it mildly) and anyway it’s a violation of their license agreement. And all those lousy little blog posts out there that show you how to use the Google Map APIs with Python and say “Gee isn’t this great!” are largely useless for practical purposes.
  2. The Python Geocoder module is simple to use and lets you write a single script to access a ton of different geocoding services, including OpenStreetMap, Yandex, and ESRI. But you still need to review the terms of service for each one to see what’s allowed and what the daily limits are.
  3. If you have funding for your research project, and ESRI geocoding has good coverage for your geographic area (based on their documentation but also on your own testing), then go with them, as you’re free and clear to download data under their terms. ArcGIS Desktop will be too sluggish for large batches, so write a script – you can use the Python Geocoder.
  4. Otherwise – the Open Street Map / Nominatim services are worth a try but your success will vary by country. I had used them before for addresses in France with fair success, but it didn’t help me with Turkey.
  5. You can also crawl through the GIS Stackexchange for advice. I’ve found that most of the suggestions are either for US geocoding, or are companies that are answering posts saying “Hey you can try my service!”

Happy geocoding, comrades! In my next post I’ll discuss my experience with batch geocoding addresses here in the US of A with Python.