esri

OSM Web Feature Service

OpenStreetMap Data with ArcGIS Pro and QGIS

A couple years ago I wrote a post that demonstrated how to use the QuickOSM plugin for QGIS to easily extract features from the OpenStreetMap (OSM). The OSM is a great source for free and open GIS data, especially for types of features that are not captured in government sources, and for parts of the world that don’t possess a free or robust GIS data infrastructure. I’ve been using ArcGIS Pro more extensively in my new job and was wondering how I could do the same thing: query features from the OSM based on keys and values (denoting feature type) and geographic area and extract them as a vector layer. I’m looking for straightforward solutions that I could use for answering questions from students (so no command line tricks or database stuff). In this post I’ll cover three approaches for achieving this in ArcGIS Pro, with references to QGIS.

File Approach

The most straightforward method would be to export data directly from the main OSM page by zooming into an area and hitting the Export button. This is a pretty blunt approach, as you have to be zoomed in pretty close and you grab every possible feature in the view. The “native” file format of OSM is the osm / pbf format; .osm is an XML file while .pbf is a compressed binary version of the osm. QGIS is able to handle these files directly; you just add them as a vector layer. ArcGIS Pro cannot. You have to download and install a special Data Interoperability extension, which is an esoteric thing that’s not part of the standard package and requires a special license from your site license coordinator.

A better and more targeted approach is to download pre-created extracts that are provided by a number of organizations listed in the OSM wiki. I started with Geofabrik in Germany, as it was a source I recognized. They package OSM data by geographic area and feature type. On their main page they list files that contain all features for each of the continents. These are enormous files, and as such they are only provided in the osm pbf format as shapefiles can’t effectively handle data that size. Even if you downloaded the osm pbf files and added them to QGIS, the software will struggle to render something that big.

But all is not lost; Geofabrik and many other providers package data in a shapefile format for smaller areas, provided that the size and number of features is not too great. For instance, on Geofabrik’s download page if you click on North America you’re presented with country extracts for that continent (see images below). You can get shapefiles for Greenland and Mexico, but not Canada or the US as the files are still too big. Click on US, and you’re presented with files for each of the states. No luck for California (too big), but the rest of states are small enough that you can get shapefiles for all of them.

Geofabrik OSM data: download continents
Default Geofrabrik OSM download page for continents. Click on a continent name…
Geofabrik OSM data downloads: countries in North America
…to access files for countries. Click on a country name…
Geofabrik OSM data downloads: states of the US
…to access files for states / provinces / admin divisions

I downloaded and unzipped the file for Rhode Island. It contains a number of individual shapefiles classified by type of feature: buildings, land use, natural, places, places of worship (pofw), points of interest (pois), railways, roads, traffic, transport, water, and waterways. Many of the files appear twice: files with an “a” suffix represent polygons (areas) while files without that suffix are points or lines. Some OSM features are stored as polygons when such detail is available, while others are represented as points.

For example, if I add the two places of workship files to a map, for some features you have the outline of the actual building, while for most you simply have a point. After adding the layers to the map, you’ll probably want to use Select by Attribute to select the features you want based on OSM tags with keys and values, and Select by Location in conjunction with a separate boundary file to pull data out for a smaller area. The Geofabrik OSM attribute table is limited to basic attributes: an OSM ID, feature code and class, and name. It’s also likely that you’ll want to unify the point and polygon features of the same type into one layer, as they’re usually mutually exclusive. Use the Centroid (Polygon) tool in the toolbox to turn the polygons into points, and the Merge tool to meld the two point layers together. In QGIS the comparable tools under the Vector menu are Centroids and Merge Vector Layers. WGS 84 is the default CRS for the layers.

ArcGIS Pro with OSM Places of Worship from Geofabrik
OSM Places of Worship. Some features are stored as points while others are polygons

Geofabrik is just one option. There are several others and they take different approaches for structuring their extracts. For example, BBBike.org organizes their layers by city for over 200 cities around the world, and they provide a number of additional formats beyond OSM PBF and shapefiles, such as Garmin GPS, GeoJSON, and CSV. They divide the data into fewer files, and if they don’t compile data for the area you’re interested in you can use a web-based tool to create a custom extract.

Plugin Approach

It would be nice to use a plugin, as that would allow you to specify a custom geographic area and retrieve just the specific features you want. QuickOSM works quite nicely for QGIS. Fortunately there is a good ArcGIS Pro solution called OSMquery. It works for both Pro and Desktop, tested for Pro 2.2 and Desktop 10.6. I’m using Pro 2.7 and the basic tool worked fine. It’s well documented, with good instructions for installation and use.

The plugin is written in Python and you add it as a tool to your ArcToolbox. Download the repo from the OSMquery GitHub as a ZIP file (click the green code button and choose Download ZIP). Save it in or near your ArcGIS project folders, and unzip it. In Pro, go into a project and open a Catalog Pane in the View ribbon. Right click on Toolbox to add a new one, and browse to the folder you unzipped to add the tool. There are two scripts in the box, a basic and an advanced version. The basic tool functioned without trouble for me. The advanced tool threw an error, probably some Python dependency issue (I didn’t investigate as the basic tool met my needs).

In the basic tool you choose the key and value for the features you want to extract; the dropdown menu is automatically populated with these options. For the geographic extent you can enter a place name, or you can use the extent of the current map window or of a layer in the project, or you can manually type in bounding box coordinates. Another nice option is you can transform the CRS of the extracted features from WGS 84 to another system, so it matches the CRS of layers in your existing project. Run the tool, and the features are extracted. If the features exist as both points and polygons, you get two separate files for each. If you choose, you can merge them together as described in the previous section; this is a bit tougher as the plugin approach yields a much wider selection of fields in the attribute table, and not all of the point and polygon attributes align. With the Merge tool in Pro you can select which attributes you want to hold on to, and common ones will be merged. QGIS is a bit messier in this regard, but in my earlier post I outlined a work-around using a spatial database.

OSMquery tool in ArcGIS Pro
The basic OSMquery tool in an ArcGIS Pro toolbox

Web Feature Service

This initially seemed to be the most promising route, but it turned out to be a dud. Like QGIS, Pro allows you to add OSM as a tiled base map. But ESRI also offers OSM as a web feature service: by hitting Add Data on the Map ribbon and searching the Living Atlas for “OpenStreetMap” you can select from a number of OSM web feature services, organized by continent and feature type. Once you add them to a map, you can select and click on individual features to see their name and feature type. The big problem is that you are not allowed to extract features from these layers, which leaves you with an enormous and heterogeneous mix of features for an entire continent. You can interact with the features, selecting by attribute and location in reference to other spatial layers, but that’s about it.

OSM web feature service in ArcGIS Pro
OSM web feature service in ArcGIS Pro

In Summary

I would recommend taking the step of downloading the OSMquery plugin for ArcGIS Pro if you want to take a highly targeted approach to OSM feature extraction (for QGIS users, enable the QuickOSM plugin). This approach is also best if you can’t download a pre-existing extract for your area because it’s too large or has too many features, and if you want to access the fullest possible range of attribute values. Otherwise, you can simply download one of the pre-created extracts, and use your software to winnow it down to what you need (or if you do need everything, the file approach makes more sense). Since the file-based option includes fewer attributes, converting polygon features to points and merging them with the other point features is a bit simpler.

Stamen Watercolor Map Tiles

Adding Basemaps to QGIS With Web Mapping Services

For this final post of 2020, I was looking back through recent projects for something interesting yet brief; I’ve been writing some encyclopedia-length posts lately and wanted to keep this one on the lighter side. In that vein, I’ve decided to share a short list of free web mapping services that I use as basemaps in QGIS (they’ll work in ArcGIS too). This has been on my mind as I’ve recently stumbled upon the OpenTopoMap, which is an alternate stylized version of the OpenStreetMap that looks pretty sharp.

See this earlier post for details, but in short, to connect to these services in QGIS:

QGIS Browser Panel
  1. Select the appropriate web map service type in the browser panel (usually WMS / WMTS or XYZ Tiles), right click, and add new connection.
  2. Give it a meaningful name, paste the appropriate URL into the URL box, click OK.
  3. In the browser panel drill down to see the service, and for WMS / WMTS layers you can drill down further to see specific layers you can add.
  4. Select the layer and drag it into the window, or select, right click, and add the layer to the project.
  5. If the resolution looks off, right click on a blank area of the toolbar and check the Tile Scale Panel. Use this to adjust the zoom for the web map. If the scale bar is greyed out you’ll need to set the map window to the same CRS as the map service: select the layer in the panel, right click, and choose set CRS – set project CRS from layer.
  6. Some web layers may render slowly if you’re zoomed out to the full extent, or even not at all if they contain many features or are super detailed. Conversely, some layers may not render if you’re zoomed too far in, as tiles may not be available at that resolution. Experiment!

If you’re an ArcGIS user see these concise instructions for adding various tile layers. This isn’t something that I’ve ever done, as ArcGIS already has a number of accessible basemaps that you can add.

In the list below, links for the service name take you to either the website version of the service, or to a list of additional layers that you can connect to. The URLs that follow are the actual connections to the service that you’ll use within your GIS package. If you use OSM, OTP, or Stamen in your maps, make sure to cite them (they use Creative Commons Licenses – follow links to their websites for details). The government sources are public domain, but you should still cite them anyway. Happy mapping, and happy holidays!

OpenStreetMap XYZ Tile (global)

http://tile.openstreetmap.org/{z}/{x}/{y}.png

OpenTopoMap XYZ Tile (global)

https://tile.opentopomap.org/{z}/{x}/{y}.png

Stamen XYZ Tile (global) see their website for examples; the image topping this post is from watercolor

http://tile.stamen.com/terrain/{z}/{x}/{y}.png
http://tile.stamen.com/toner/{z}/{x}/{y}.png
http://tile.stamen.com/watercolor/{z}/{x}/{y}.jpg

USGS National Map WMTS (global, but fine detail is US only)

Imagery:
https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryOnly/MapServer/WMTS/1.0.0/WMTSCapabilities.xml

Imagery & Topo:
https://basemap.nationalmap.gov/arcgis/rest/services/USGSImageryTopo/MapServer/WMTS/1.0.0/WMTSCapabilities.xml

Shaded Relief: 
https://basemap.nationalmap.gov/arcgis/rest/services/USGSShadedReliefOnly/MapServer/WMTS/1.0.0/WMTSCapabilities.xml

Topographic:
https://basemap.nationalmap.gov/arcgis/rest/services/USGSTopo/MapServer/WMTS/1.0.0/WMTSCapabilities.xml

US Census Bureau TIGERweb WMS (US only) see their website for older vintages

Current TIGER features:
https://tigerweb.geo.census.gov/arcgis/services/TIGERweb/tigerWMS_Current/MapServer/WMSServer 

Current physical features:
https://tigerweb.geo.census.gov/arcgis/services/TIGERweb/tigerWMS_PhysicalFeatures/MapServer/WMSServer

Percentage of Children in Households Without the Internet

Kids with No Internet at Home: Data Processing for US Census Mapping

In this post I’ll demonstrate some essential data processing steps prior to joining census American Community Survey (ACS) tables downloaded from data.census.gov to TIGER shapefiles, in order to create thematic maps. I thought this would be helpful for students in my university who are now doing GIS-related courses from home, due to COVID-19. I’ll illustrate the following with Excel and QGIS: choosing an appropriate boundary file for making your map, manipulating geographic id codes (GEOIDs) to insure you can match data file to shapefile, prepping your spreadsheet to insure that the join will work, and calculating new summaries and percent totals with ACS formulas. Much of this info is drawn from the chapters in my book that cover census geography (chapter 3), ACS data (chapter 6), and GIS (chapter 10). I’m assuming that you already have some basic spreadsheet, GIS, and US census knowledge.

For readers who are not interested in the technical details, you still may be interested in the map we’ll create in this example: how many children under 18 lack access to a computer with internet access at home? With COVID-19 there’s a sudden expectation that all school children will take classes remotely from home. There are 73.3 million children living in households in the US, and approximately 9.3 million (12.7%) either have no computer at home, or have a computer but no internet access. The remaining children have a computer with either broadband or dial-up at home. Click on the map below to explore the county distribution of the under 18 population who lack internet access at home, or follow this link: https://arcg.is/0TrGTy.

arcgis_webmap

Click on the Map to View Full Screen and Interact

Preliminaries

First, we need to get some ACS data. Read this earlier post to learn how to use data.census.gov (or for a shortcut download the files we’re using here). I downloaded ACS table B28005 Age by Presence of a Computer and Types of Internet Subscription in Household at the county-level. This is one of the detailed tables from the latest 5-year ACS from 2014-2018. Since many counties in the US have less than 65,000 people, we need to use the 5-year series (as opposed to the 1-year) to get data for all of them. The universe for this table is the population living in households; it does not include people living in group quarters (dormitories, barracks, penitentiaries, etc.).

Second, we need a boundary file of counties. You could go to the TIGER Line Shapefiles, which provides precise boundaries of every geographic area. Since we’re using this data to make a thematic map, I suggest using the Cartographic Boundary Files (CBF) instead, which are generalized versions of TIGER. Coastal water has been removed and boundaries have been smoothed to make the file smaller and less detailed. We don’t need all the detail if we’re making a national-scale map of the US that’s going on a small screen or an 8 1/2 by 11 piece of paper. I’m using the medium (5m) generalized county file for 2018. Download the files, put them together in a new folder on your computer, and unzip them.

TIGER Line shapefile

TIGER Line shapefile

CBF shapefile

CBF shapefile

GEOIDs

Downloads from data.census.gov include three csv files per table that contain: the actual data (data_with_overlays), metadata (list of variable ids and names), and a description of the table (table_title). There are some caveats when opening csv files with Excel, but they don’t apply to this example (see addendum to this post for details). Open your csv file in Excel, and save it as an Excel workbook (don’t keep it in a csv format).

The first column contains the GEOID, which is a code that uniquely identifies each piece of geography in the US. In my file, 0500000US28151 is the first record. The part before ‘US’ indicates the summary level of the data, i.e. what the geography is and where it falls in the census hierarchy. The 050 indicates this is a county. The part after the ‘US’ is the specific identifier for the geography, known as an ANSI / FIPS code: 28 is the state code for Mississippi, and 151 is the county code for Washington County, MS. You will need to use this code when joining this data to your shapefile, assuming that the shapefile has the same code. Will it?

That depends. There are two conventions for storing these codes; the full code 0500000US28151 can be used, or just the ANSI / FIPS portion, 28151. If your shapefile uses just the latter (find out by adding the shapefile in GIS and opening its attribute table), you won’t have anything to base the join on. The regular 2018 TIGER file uses just the ANSI / FIPS, but the 2018 CBF has both the full GEOID and the ANSI FIPS. So in this case we’re fine, but for the sake of argument if you needed to create the shorter code it’s easy to do using Excel’s RIGHT formula:

Excel formula: RIGHT

The formulas RIGHT, LEFT, and MID are used to return sub-strings of text

The formulas reads X characters from the right side of the value in the cell you reference and returns the result. You just have to count the number of characters up to the “S’ in the “US”. Copy and paste the formula all the way down the column. Then, select the entire column, right click and chose copy, select it again, right click and choose Paste Special and Values (in Excel, the little clipboard image with numbers on top of it). This overwrites all the formulas in the column with the actual result of the formula. You need to do this, as GIS can’t interpret your formulas. Put some labels in the two header spaces, like GEO_ID2 and id2.

Excel: Paste Special

Copy a column, and use Paste Special – Values on top of that column to overwrite formulas with values

Subsets and Headers

It’s common that you’ll download census tables that have more variables than you need for your intended purpose. In this example we’re interested in children (people under 18) living in households. We’re not going to use the other estimates for the population 18 to 64 and 65 and over. Delete all the columns you don’t need (if you ever needed them, you’ve got them saved in your csv as a backup).

Notice there are two header rows: one has a variable ID and the other has a label. In ACS tables the variables always come in pairs, where the first is the estimate and the second is the margin of error (MOE). For example, in Washington County, Mississippi there are 46,545 people living in households +/- 169. Columns are arranged and named to reflect how values nest: Estimate!!Total is the total number of people in households, Estimate!!Total!!Under 18 years is the number people under 18 living in households, which is a subset of the total estimate.

The rub here is that we’re not allowed to have two header rows when we join this table to our shapefile – we can only have one. We can’t keep the labels because they’re too long – once joined, the labels will be truncated to 10 characters and will be indistinguishable from each other. We’ll have to delete that row, leaving us with the cryptic variable IDs. We can choose to keep those IDs – remember we have a separate metadata csv file where we can look up the labels – or we can rename them. The latter is feasible if we don’t have too many. If you do rename them, you have to keep them short, no more than 10 characters or they’ll be truncated. You can’t use spaces (underscores are ok), any punctuation, and can’t begin variables names with a number. In this example I’m going to keep the variable IDs.

Two odd gotchas: first, find the District of Columbia in your worksheet and look at the MOE for total persons in households (variable 001M). There is a footnote for this value, five asterisks *****. Replace it with a zero. Keep an eye out for footnotes, as they wreak havoc. If you ever notice that a numeric column gets saved as text in GIS, it’s probably because there’s a footnote somewhere. Second, change the label for the county name from NAME to GEO_NAME (our shapefile already has a column called NAME, and it will cause problems if we have duplicates). If you save your workbook now, it’s ready to go if you want to map the data in it. But in this example we have some more work to do.

Create New ACS Values

We want to map the percentage of children that do not have access to either a computer or the internet at home. In this table these estimates are distinct for children with a computer and no internet (variable 006), and without a computer (variable 007). We’ll need to aggregate these two. For most thematic maps it doesn’t make sense to map whole counts or estimates; naturally places that have more people are going to have more computers. We need to normalize the data by calculating a percent total. We could do this work in the GIS package, but I think it’s easier to use the spreadsheet.

To calculate a new estimate for children with no internet access at home, we simply add the two values together (006_E and 007_E). To calculate a new margin of error, we take the square root of the sum of the squares for the MOEs that we’re combining (006_M and 007_M). We also use the ROUND formula so our result is a whole number. Pretty straightforward:

Excel Sum of Squares

When summing ACS estimates, take the square root of the sum of the squares for each MOE to calculate a MOE for the new estimate.

To calculate a percent total, divide our new estimate by the number of people under 18 in households (002_E). The formula for calculating a MOE for a percent total is tougher: square the percent total and the MOE for the under 18 population (002_M), multiply them, subtract that result from the MOE for the under 18 population with no internet, take the square root of that result and divide it by the under 18 population (002_E):

MOE for percentage

The formula for calculating the MOE for a proportion includes: the percentage, MOE for the subset population (numerator), and the estimate and MOE for the total population (denominator)

In Washington County, MS there are 3,626 +/- 724 children that have no internet access at home. This represents 29.4% +/- 5.9% of all children in the county who live in a household. It’s always a good idea to check your math: visit the ACS Calculator at Cornell’s Program for Applied Demographics and punch in some values to insure that your spreadsheet formulas are correct.

You should scan the results for errors. In this example, there is just one division by zero error for Kalawao County in Hawaii. In this case, replace the formula with 0 for both percentage values. In some cases it’s also possible that the MOE proportion formula will fail for certain values. Not a problem in our example, but if it does the solution is to modify the formula for the failed cases to calculate a ratio instead. Replace the percentage in the formula with the ratio (the total population divided by the subset population) AND change the minus sign under the square root to a plus sign.

Some of these MOE’s look quite high relative to the estimate. If you’d like to quantify this, you can calculate a coefficient of variation for the estimate (not the percentage). This formula is straightforward: divide the MOE by 1.645, divide that result by the estimate, and multiply by 100:

Calculate coefficient of variation

A CV can be used to gauge the reliability of an estimate

Generally speaking, a CV value between 0-15 indicates that as estimate is highly reliable, 12-34 is of medium reliability, and 35 and above is low reliability.

That’s it!. Make sure to copy the columns that have the formulas we created, and do a paste-special values over top of them to replace the formulas with the actual values. Some of the CV values have errors because of division by zero. Select the CV column and do a find and replace, to find #DIV/0! and replace it with nothing. Then save and close the workbook.

For more guidance on working with ACS formulas, take a look at this Census Bureau guidebook, or review Chapter 6 in my book.

Add Data to QGIS and Join

In QGIS, we select the Data Source Manager buttonQGIS Data Source Manager, and in the vector menu add the CBF shapefile. All census shapefiles are in the basic NAD83 system by default, which is not great for making a thematic map.  Go to the Vector Menu – Data Management Tools – Reproject Layer. Hit the little globe beside Target CRS. In the search box type ‘US National’, select the US National Atlas Equal Area option in the results, and hit OK. Lastly, we press the little ellipses button beside the Reprojected box, Save to File, and save the file in a good spot. Hit Run to create the file.

In the layers menu, we remove the original counties file, then select the new one (listed as Reprojected), right click, Set CRS, Set Project CRS From Layer. That resets our window to match the map projection of this layer. Now we have a projected counties layer that looks better for a thematic map. If we right click the layer and open its attribute table, we can see that there are two columns we could use for joining: AFFGEOID is the full census code, and GEOID is the shorter ANSI / FIPS.

Hit the Data Source Manager button again, stay under the vector menu, and browse to add the Excel spreadsheet. If our workbook had multiple sheets we’d be prompted to choose which one. Close the menu and we’ll see the table in the layers panel. Open it up to insure it looks ok.

To do a join, select the counties layer, right click, and choose properties. Go to the Joins tab. Hit the green plus symbol at the bottom. Choose the spreadsheet as the join layer, GEO_ID as the join field in the spreadsheet, and AFFGEOID as the target field in the counties file. Go down and check Custom Field Name, and delete what’s in the box. Hit OK, and OK again in the Join properties. Open the attribute table for the shapefile, scroll over and we should see the fields from the spreadsheet at the end (if you don’t, check and verify that you chose the correct IDs in the join menu).

QGIS Join Menu

QGIS Map

We’re ready to map. Right click the counties and go to the properties. Go to the Symbology tab and flip the dropdown from Single symbol to Graduated. This lets us choose a Column (percentage of children in households with no internet access) and create a thematic map. I’ve chosen Natural Breaks as the Mode and changed the colors to blues. You can artfully manipulate the legend to show the percentages as whole numbers by typing *100 in the Column box beside the column name, and adding a % at the end of the Legend format string. I also prefer to alter the default settings for boundary thickness: click the Change button beside Symbol, select Simple fill, and reduce the width of the boundaries from .26 to .06, and hit OK.

QGIS Symbology Menu

There we have a map! If you right click on the counties in the layers panel and check the Show Feature Count box, you’ll see how many counties fall in each category. Of course, to make a nice finished map with title, legend, and inset maps for AK, HI, and PR, you’d go into the Print Layout Manager. To incorporate information about uncertainty, you can add the county layer to your map a second time, and style it differently – maybe apply crosshatching for all counties that have a CV over 34. Don’t forget to save your project.

QGIS Map

Percentage of Children in Households without Internet Access by County 2014-2018

How About that Web Map?

I used my free ArcGIS Online account to create the web map at the top of the page. I followed all the steps I outlined here, and at the end exported the shapefile that had my data table joined to it out as a new shapefile; in doing so the data became fused to the new shapefile. I uploaded the shapefile to ArcGIS online, chose a base map, and re-applied the styling and classification for the county layer. The free account includes a legend editor and expression builder that allowed me to show my percentages as fractions of 100 and to modify the text of the entries. The free account does not allow you to do joins, so you have to do this prep work in desktop GIS. ArcGIS Online is pretty easy to learn if you’re already familiar with GIS. For a brief run through check out the tutorial Ryan and I wrote as part of my lab’s tutorial series.

Addendum – Excel and CSVs

While csv files can be opened in Excel with one click, csv files are NOT Excel files. Excel interprets the csv data (plain text values separated by commas, with records separated by line breaks) and parses it into rows and columns for us. Excel also makes assumptions about whether values represents text or numbers. In the case of ID codes like GEOIDs or ZIP Codes, Excel guesses wrong and stores these codes as numbers. If the IDs have leading zeros, the zeros are dropped and the codes become incorrect. If they’re incorrect, when you join them to a shapefile the join will fail. Since data.census.gov uses the longer GEOID this doesn’t happen, as the letters ‘US’ are embedded in the code, which forces Excel to recognize it as text. But if you ever deal with files that use the shorter ANSI / FIPS you’ll run into trouble.

Instead of clicking on csvs to open them in Excel: launch Excel to a blank workbook, go to the data ribbon and choose import text files, select your csv file from your folder system, indicate that it’s a delimited text file, and select your ID column and specify that it’s text. This will import the csv and save it correctly in Excel.

Python Geocoding Take 1 – International Addresses

This past semester has been the semester of geocoding. I’ve had a number of requests for processing large batches of addresses. Now that the term is drawing to a close, I’ll share some of my trials and tribulations. In this post, I’ll focus on my adventures in international geocoding.

First, it’s necessary to provide some context. As an academic librarian I’m primarily engaged with assisting students and faculty with their coursework and their research. My users are interested in getting coordinates for data so that they can do both analysis and visualization, which requires them to download the actual coordinate data in a batch and integrate it with the rest of their projects.

This is an important distinction to make, because in many cases the large web mapping companies (Google, Bing, Mapquest, etc) are not catering to this population – they provide services and APIs to web developers, so these folks can integrate geocoding services into the Google, Bing, etc maps they are embedding in their website. They geocoding providers specifically prohibit (in the fine print of their terms of use) anyone from using their services to create and download geocoded data. This essentially excludes a lot of academic use – which, is something I hadn’t fully grasped at the outset.

Google’s Geocoding API Perhaps?

My adventure began when a professor asked me for help in geocoding about 1 million addresses – in Turkey. Right from the beginning, many of the usual sources I would turn to (for US addresses) were out the window. I knew that I could do small scale batches of international addresses with the mmQGIS geocoding plugin, so I started testing there. The address file we had consisted of unparsed addresses, and the formating looked rather chaotic – but after doing some research I discovered that geocoding Turkish addresses was a tough proposition. The Open Street Map plugin (using Nominatim) returned no matches for our 1000 test cases. The Google results were much better, so we decided to investigate writing a script and using an API and to pay for the matching. According to the documentation, it would end up costing $500 to do 1 million addresses.

I searched around for some Python APIs and found what I believed was the official one for google maps geocoding. So I spent a day writing a script that would loop through the addresses, which we divided into batches of 100k records each (which is the max you can do per day with Google if you set up billing), and the professor obtained an API key and set up billing for the account. The interface for setting up and managing the Google APIs was ridiculously confusing. Eventually we were set and I let the script rip, and found that it wouldn’t rip for long. It would consistently stop after doing a few thousand records. I had written it to write results one by one as they were obtained, and to exit cleanly in the case of errors. Upon exit, it provides the index number of the record where it stopped, so I was able to pick up where it left off. But the server would constantly time out – sometimes it could do 10 to 12k records in a stretch, but often less, so I could never leave it unattended for long. The matches themselves were a mixed bag – you could throw absolute garbage at the Google geocoder and still get a match – if not to an address or property, then to a street segment, and beyond that to useless things like postal codes, administrative districts, and the country as a whole (i.e. I can’t find your address, so here are the coordinates for the geographic center of Istanbul, or for all of Turkey. Have a nice day).

It seemed like it was going to be a long climb to get to 1 million – but after about 100k we could go no further. Google simply refused every additional request. A new API key would get us a little further, but soon after that nothing would work and we wouldn’t get any useful error messages to explain why. Having never done anything like this before, I started to investigate why, and eventually discovered the problem: these web mapping geocoding services, even if you pay for them, are not meant to be used this way. Buried in the documentation I found the license restrictions, which stipulate that you are not allowed to download any of the data, and you had to plot every coordinate you retrieved onto a Google map. This is a service for web mapping developers, not researchers.

Why hadn’t I realized this before? One, I simply had never made this distinction as I thought geocoding was geocoding, and in my world of course people are going to want to download the coordinates. Two, the Internet is full of thousands of little blog posts and tutorials which demonstrate how to use the Google Maps APIs, so I thought this was possible. But they never mention any of the caveats about what you can and can’t do with these services. In addition to violating the service terms, what I was doing was akin to yelling in the back of a crowded room, as I was hammering their server, sending requests as fast as I could with no limit. A normal web mapping application (which is what the service is designed for) would send a fraction of those requests in that amount of time. No wonder the requests were refused. Thus ended my Google geocoding experiment.

Nope – How About ESRI Instead?

So what to do next? I found that most of the other commercial web mapping services didn’t provide anything near the maximum caps and low prices that Google was offering. Mapquest for example requires that you subscribe to an account on a monthly or annual basis, and 100k is the amount you could do in a month. Most of the other commercial services also prohibit any downloads.

The big exception is ESRI – they are one of the few that understand and cater to the academic market, and they do allow downloads: they say quite plainly: “Take your Coordinates with you. Once you have the results of a Geocode operation, they’re yours to take anywhere.” My university has a site license for ArcGIS, but it doesn’t include geocoding. You can create an account and have a certain number of free credits, and after that you pay. 1 mil records was going to cost about $4000 – substantially more than Google, but totally legal. ESRI provides lists of countries and ranks them according to how complete their street network coverage is. You can use their API via a script, or you can set up the service in ArcGIS Desktop and do the matching through the ArcToolbox. This would be painfully slow if you were doing a large job (like this one) but for the purpose of testing it out with a few hundred records this is what I tried. Unfortunately, in our case the results still weren’t good. Most of the addresses were to administrative or postal areas; not specific enough.

The Python Geocoder and a Wealth of Options

What often happens in librarianship when a patron makes an initial request (this should be a piece of cake, right?) and then discovers that what they’re looking for is more involved (ahhh this will be tougher than we thought), is that they reframe the question. He went back through the addresses with a research partner and winnowed them down based on what they really, absolutely needed, so now we were down from 1 million to just finding a match for about 300k. His colleague also suggested that we use Yandex, the Russian search and mapping engine. The structure of Russian addresses is quite similar to Turkish ones, and since Russia is closer to Turkey geographically and economically Yandex might do a better job.

I was dubious of this at first, but was quickly surprised. I found the Python Geocoder module, which provides a common, uniform API to over two dozen different geocoding services – including Google. Given the simplicity and flexibility of this module, it’s the one I should have used in the first place. And while Google limits you to 2500 free matches in one day, Yandex allows you 25k – that’s 25,000 – free matches in one day, without having to request an API key! I modified the original script I wrote to use the Python Geocoder module with Yandex, and the initial small-batch tests were successful. Here’s a small portion of the code – it loops through a file where the address is stored in one field (unparsed):

for index, line in enumerate(readfile):
        address=line.strip().split(delim)
        result=geocoder.yandex(address[add]).json

And it spits you back this JSON result (you could also do XML if you prefer):

{‘quality’: ‘street’, ‘address’: ‘Türkiye, İstanbul, Fatih, Cankurtaran Mh., Ayasofya Meydanı’, ‘location’: ‘Hagia Sophia Museum, Sultanahmet Mh., Ayasofya Meydanı, Fatih/İstanbul’, ‘state’: ‘İstanbul’, ‘lng’: ‘28.979031’, ‘accuracy’: ‘street’, ‘encoding’: ‘utf-8’, ‘provider’: ‘yandex’, ‘country_code’: ‘TR’, ‘ok’: True, ‘status_code’: 200, ‘lat’: ‘41.00772’, ‘country’: ‘Türkiye’, ‘county’: ‘Fatih’, ‘confidence’: 10, ‘bbox’: {‘northeast’: [41.008156, 28.979714], ‘southwest’: [41.007285, 28.978349]}, ‘street’: ‘Ayasofya Meydanı’, ‘status’: ‘OK’}

If the result you get back is not OK (ok is False – nothing matched), then write the record to the unmatched file. Otherwise, get the bits and pieces out of the json object that you want, append them to the record, and write the whole record out to a matched file.

        if result.get('ok')==False:
                nomatch.append(address)
                nomatchfile.writelines('t'.join(address)+'n')
        else:
                lng=result.get('lng')
                lat=result.get('lat')
                qual=result.get('quality')
                accu=result.get('accuracy')
                matchadd=result.get('address')
                newitems=lng,lat,qual,accu,matchadd
                address.extend(newitems)
                matched.append(address)
                matchfile.writelines('t'.join(address)+'n')

But is it legal? It was unclear to me; they specify that map data is meant for personal/noncommercial use and in the same sentence: “Any copying of the Data, their reproduction, conversion, distribution, promulgation (publication) in the Internet, any use of the Data in mass media and/or for commercial purposes without a prior written consent of the right holder, shall be prohibited”. Does that mean any copying, or just copying for commercial use or for redistributing the data? In our case, this is for academic non-profit use and the data (individual geocoded records) wasn’t going to be republished – it would be used for plotting distances between locations and making highly generalized static dot maps for an article. At this stage we seemed to be out of options – if you need to geocode a large batch of international addresses, AND you are willing to pay for it, where on Earth can you go?

Ultimately, I left it up to the professor to contact them or not, and we decided to roll the dice. For my part, I engineered the script to put a minimum load on their servers – essentially I could take 24 hours to do 25k records. I used the time and random modules in Python to build pauses in between records to slow things down. In sharp contrast to Google, the Yandex servers were amazingly reliable – they were able to do batches of 25k records every single time without timing out – not even once – and in less than a couple weeks we were finished. About 50% of the matches were good, and for the others he and a research assistant went back and cleaned up unmatched records, and I gave them the script so they could try again.

International Geocoding: The Take-Aways

  1. If you need to geocode a large batch of foreign addresses for academic or research purposes, forget Google. Their service was less than stellar (to put it mildly) and anyway it’s a violation of their license agreement. And all those lousy little blog posts out there that show you how to use the Google Map APIs with Python and say “Gee isn’t this great!” are largely useless for practical purposes.
  2. The Python Geocoder module is simple to use and let’s you write a single script to access a ton of different geocoding services, including Open Streetmap, Yandex, and ESRI. But you still need to review the terms of service for each one to see what’s allowed and what the daily limits are.
  3. If you have funding for your research project, and ESRI geocoding has good coverage for your geographic area (based on their documentation but also on your own testing) then go with them, as you’re free and clear to download data under their terms. Arc Desktop will be too sluggish for large batches so write a script – you can use the Python Geocoder.
  4. Otherwise – the Open Street Map / Nominatim services are worth a try but your success will vary by country. I had used them before for addresses in France with fair success, but it didn’t help me with Turkey.
  5. You can also crawl through the GIS Stackexchange for advice. I’ve found that most of the suggestions are either for US geocoding, or are companies that are answering posts saying “Hey you can try my service!”

Happy geocoding, comrades! In my next post I’ll discuss my experience with batch geocoding addresses here in the US of A with Python.