Anything about GIS software

OSM Merida

Extracting OpenStreetMap Data in QGIS 3

The OpenStreetMap (OSM) can be a good source of geospatial data for all sorts of features, particularly for countries where the government doesn’t provide publicly accessible GIS data, and for features that most governments don’t publish data for. In this post I’ll demonstrate how to download a specific feature set for a relatively small area using QGIS 3.x. Instead of simply adding OSM as a web service base map we’ll extract features from OSM to create vector layers.

In the past I followed some straightforward instructions for doing this in QGIS 2.x, but of course with the movement to 3.x the core OSM plugin I previously used is no longer included, and no updated version was released. It’s a miracle that anyone can figure out what’s going on between one version of QGIS and the next. Fortunately, there’s another plugin called QuickOSM that’s quite good, and works fine with 3.x.

Use QuickOSM to Extract Features

Let’s say that we want to create a layer of churches for the city Merida in Mexico. First we launch QGIS, go to the Plugins menu, and choose Manage and Install plugins. Select plugins that are not installed, do a search for QuickOSM, select it, and install it. This adds a couple buttons to the plugins toolbar and a new sub-menu under the Vector menu called Quick OSM.

Next, we add a layer to serve as a frame of reference. We’re going to use the extent of the QGIS window to grab OSM features that fall within that area. We could download some vector files from GADM or Natural Earth; GADM provides several layers of administrative divisions which can be useful for locating and delineating our area. Or we can add a web service like OSM and simply zoom in to our area of interest. Adjust the zoom so that the entire city of Merida fits within the window.

Merida in QGIS

OSM XYZ Tiles in QGIS – Zoomed into Merida

Now we can launch the Quick OSM tool. The default tab is Quick query, which allows us to select features directly from an OSM server (you need to be connected to the internet to do this). OSM data is stored in an XML format, so to extract the data we want we’ll need to specify the correct elements and tags. Ample documentation for all the map features is available. In our example, churches are referred to as places of worship and are classified as an amenity. So we choose amenity as the key and place_of_worship as the value. The drop down box allows us to search for features in or around a place, but as discussed in my previous post place names can be ambiguous. Choose the option for canvas extent, and that will capture any churches in our map window. Hit the advanced drop down arrow, and you have the option to select specific types of geometry (keep them all). Hit the run query button to execute.

Quick OSM Interface

Quick OSM Interface

We’ll see there are two results: one for places of worship that are points, and another for polygons. If you right click on one of these layers and open the attribute table, you’ll see a number of tags that have been extracted and saved as columns, such as the name, religion, and denomination. The Quick query tools pulls a series of pre-selected attributes that are appropriate for the type of feature.

Places of Worship

The data is saved temporarily in memory, so to keep it you need to save each as a shapefile or geopackage (right click, Export, Save Features As). But before we do that – why do have two separate layers to begin with? In some cases the OSM has the full shape of the building saved as a polygon, while in other cases the church is saved as a point feature, with a cross or other religious symbol appropriate for the type of worship space. It simply depends on the level of detail that was available when the feature was added.

Polygon versus Point

Church as polygon (lower left-hand corner) and as point (upper right-hand corner)

If we needed a single unified layer we would need to merge the two, but this process can be a pain. Using the vector menu you can convert the polygons to points using the centroid tool, and then use the merge tool to combine the two point layers. This is problematic as the number of fields in each file is different, and because the centroid tool changes the data type of the polygon’s id number to a type that doesn’t match the points. I think the easiest solution is to load both layers into a Spatialite database and create a unified layer in the DB.

Use SpatiaLite to Create a Single Point Layer

To do that, right click on the SpatiaLite option in the Browser Panel, choose Create Database, and name it (merida_churches). Then select the church point file, right click, export, save features as. Choose SpatiaLite as the format, for the file select the database we just created, and for layer name call it church_points. The default CRS (used by OSM) is WGS 84. Hit OK. Then repeat the steps for the polygons, creating a layer called church_polygons in that same database.

Once the features are database layers, we can write a SQL script (see below) where you create one table that has columns that you want to capture from both tables. You load the data from each of the tables into the unified one, and as you are loading the polygons you convert their geometry to points. The brackets around the names like [addr:full] allows you to overcome the illegal character designation in the original files (you shouldn’t use colons in db column names). I like to manually insert a date so to remember when I downloaded the feature set.


CREATE TABLE all_churches (
osm_type TEXT,
name TEXT,
religion TEXT,
denomination TEXT,
addr_housenumber TEXT,
addr_street TEXT,
addr_city TEXT,
addr_full TEXT,
download_date TEXT);

SELECT AddGeometryColumn('all_churches','geom',4326,'POINT','XY');

INSERT INTO all_churches
SELECT full_id, osm_id, osm_type, name, religion, denomination,
[addr:housenumber], [addr:street], [addr:city], [addr:full],
'02/11/2019', ST_CENTROID(geometry)
FROM church_polygons;

INSERT INTO all_churches
SELECT full_id, osm_id, osm_type, name, religion, denomination,
[addr:housenumber], [addr:street], [addr:city], [addr:full],
'02/11/2019', geometry
FROM church_points;

SELECT CreateSpatialIndex('all_churches', 'geom');


Unfortunately the QGIS DB Browser does not allow you to run SQL transactions / scripts. You can paste the entire script into the window, highlight the first statement (CREATE TABLE), execute it, then highlight the next one (SELECT AddGeometryColumn), execute it, etc. Alternatively if you use the Spatialite CLI or GUI, you can save your script in a file, load it, and execute it in one go.

QGIS DB Browser

When finished we hit the refresh button and can see the new all_churches layer in the DB. We can preview the table and geometry and add it to the QGIS map window. If you prefer to work with a shapefile or geopackage you can always export it out of the db.

Other Options

The QuickOSM tool has a few other handy features. Under the Quick query tool is a plain old Query tool, which shows you the actual query being passed to the server. If you’re familiar with the map features and XML structure of OSM you can modify this query directly. Under the Query tool is the OSM File tool. Instead of grabbing features from the server, you can download an OSM pbf file (Geofabrik provides data for each country) and use this tool to load data from that file. It loads all features from the file for the geometries you choose, so the process can take awhile. You’ll want to load the data into a temporary file instead of saving in memory, to avoid a crash.


Place Names: Comparing Two Global Gazetteers

Gazetteers are directories of place names and locations, which are useful for:

  1. Identifying variations in place names
  2. Obtaining coordinates
  3. Locating a place within a hierarchy of places
  4. Generating lists of types of features

For example, if you’re working with data that’s associated with specific cities, mountains, or bodies of water, and you have the names of these features but not the coordinates or the country or state / province where they’re located, you can use a gazetteer to obtain all three. Or, if you want to create a map of a specific type of feature (i.e. populated places, ruins, mines) or want map labels for features (forests, bodies of water) you can extract and plot the gazetteer data in GIS.

In this post I’ll provide an overview of two major global gazetteers: the GEOnet Names Server and Geonames. Each one provides several different interfaces and services for exploring and accessing data which I’ll briefly mention, but I’ll focus on on the data files that you can download and what’s contained in them. I’ll conclude with a strategy for relating a small to medium place-based data file of your own to the gazetteer to obtain coordinates. If you have a file with hundreds or a few thousand records and were planning to get coordinates by eyeballing Google Maps and clicking one by one, try this instead.


File Downloads | Documentation and code book

The US National Geospatial-Intelligence Agency (NGA) maintains a vast gazetteer with data for all of the countries in the world (almost) and provides it to the public via the GEOnet Names Server (GNS). The GNS gazetteer does NOT include features in the United States or any of its territories; the US Geological Survey maintains a separate system called the Geographic Names Information System (GNIS) whose structure and organization is different.

The GNS is updated on a weekly basis and is provided through a number of interfaces that include a map-based and a text-based search, and Web Mapping (WMS) and Web Feature (WFS) Services that allow you to display data in a GIS or a web map.

Data files are packaged on a country by country basis. Alternatively you can download one file that has the whole world in it, or an archive with separate files for each country. The data is stored in tab-delimited text files that include a header row (i.e. the column names). ZIP files for each country include a primary file that contains all the country’s features, and a series of files that contain a subset of the primary file based on feature type. So, if you wanted to work with just populated places or with hydrographic features you can work with the specific file instead of having to filter them out of the primary one.

Each record in the GNS represents a name for a feature, as opposed to a feature itself. Thus, if a feature is known by more than one name it will appear multiple times in the file. Each record has a unique feature identifier (UFI) and a unique name identifier (UNI) which are large integers. The UFI number is repeated in the data, while the UNI is unique. The GNS files contain a number of different columns containing several feature names (short names, long ones, with and without diacritics) and a name type column (NT) that indicates whether the record is for a an approved (N), or variant name (V). If you want a list of features without duplicates, you would need to create a subset of the records that only includes the approved name.

Features are classified into nine broad classes (FC), which in turn are subdivided into many different designations (DSG). The nine classes are: administrative region, populated place, vegetation, locality or area, undersea, roads and railroads, hypsographic (terrain), hydrographic (water), and spot (point-based features). Additional columns include codes designating the size of a populated place (PC) and relative importance of the feature (DISPLAY) which is useful when mapping data at varying scales. The GNS does not contain information on actual population or elevation (this was included in the past but is no longer available).

The GNS includes a few geographic references that indicate where the feature is located. There is a global region code (RC) in the first column, a primary country code (cc1) and an administrative division (state or province) code for the primary country, and a secondary country code (cc2). Geographic features like rivers, seas, mountains, and forests may span the boundary of more than one country, so the cc1 and cc2 columns indicate this. Data in these fields may be stored as a comma-separated list or array with the different codes. The GNS uses two-letter FIPS 10-4 country codes created by the US government.

Country codes in the GNS

This SQL query illustrates how country and admin1 codes are stored in the GNS, and how some features (streams in this case) span several countries.

Lastly, longitude and latitude coordinates are provided in separate fields in two formats: decimal degrees (needed for plotting and mapping) and degrees-minutes-seconds. The coordinates are in the WGS 84 CRS (EPSG 4326).


File Downloads | Documentation and code book

Geonames is the Wikipedia or OpenStreetMap of gazetteers. It’s a collaborative, crowd-sourced project. Many users may contribute a few locations or make a correction or two, but by and large most of the data comes from public or government sources that is loaded into Geonames en masse and subsequently modified. Geonames provides a text and map-based search, and an API that let’s scripters and programmers directly access the data.

Data files are packaged country by country, or globally by certain types (i.e. all countries or the largest cities). The data is stored in tab-delimited text files without a header row, so you need to consult the documentation to identify the columns. All data for each country is packaged in a single file.

Unlike the GNS, each Geonames record represents a specific feature. There is a conventional name (name) and a variant that uses plain ascii characters (asciiname). Some variant names are included in a single list / array column called alternatenames; to get a full list of variants and spellings in different languages you would download a separate alternate names file that you could link to this one. Each feature is assigned a geonameid, which is simply a large unique integer.

Features are divided into the same nine classes that are used in the GNS, and the subdivisions are the same as well. Documentation for the classes and subdivisions is provided. Population and elevation data is provided when available and relevant, but there’s no information on timeliness or source in the data file (but you can view the full edit history for a record in the online interface).

Geonames goes to great lengths to provide the geographic framework or hierarchy for each feature, so you can get instant geographic context. They use two-letter ISO country codes to designate countries (country_code), a list of alternate or secondary countries (cc2), and for the primary country up to four different levels of administrative divisions (i.e. state / province, county, municipality, etc). There’s also a field that indicates what timezone each feature is in.

There is one set of longitude and latitude coordinates in decimal degrees in the WGS84 CRS.

Geonames Belize City

Geonames search result for Belize City, illustrating options and available data.

Summary Comparison

To compare the different files I downloaded data for Belize, since it has a small number of records. The GNS file had 2,801 records for names, but if you look at unique features the record count was 2,180. The Geonames file for Belize has a comparable number of 2,309.


  • Free and publicly available
  • Tab-delimited text in country-based files
  • Longitude and latitude coordinates in decimal degrees in WGS84
  • Same feature classification system with nine classes and multiple sub-classes


  • A single, official government source
  • A file of feature names: must filter out variants to get unique feature records
  • File comes with column header
  • Files are divided into sub-files for feature classes
  • Uses FIPS codes for countries
  • Useful fields for ranking features for mapping
  • Limited data on geographic hierarchy
  • No data on population or elevation
  • Lacks data for the United States and territories (obtainable via the USGS GNIS)


  • Collaborative project with data from many sources
  • A file of features, variant names included in separate column
  • Additional alternate names and spellings in most languages available in separate files
  • File lacks column header
  • Uses ISO codes for countries
  • Extensive information on geographic hierarchy
  • Has population, elevation, and timezone for certain features
  • No ranking columns for map display

Gazetteer Caveats

1. It’s important to recognize that each source uses different codes for classifying countries: the GNS uses FIPS and Geonames uses ISO. While they appear similar (two-letter abbreviations) they are NOT the same: The FIPS code for Belize is BH and the ISO Code if BZ; in the ISO system BH is for Bahrain while the FIPS system doesn’t use BZ as a code. The CIA World Factbook includes a table comparing different country code systems. The GNS will convert to ISO at some uncertain date in the future.

2. Gazetteer data must be imported using UTF-8 encoding to preserve all the characters from the various alphabets.

3. Each feature in a gazetteer will have longitude and latitude coordinates that represent the geographic center of a feature. That means that a large areal feature like a country, a linear feature like a road, and a small point feature like a monument will have one coordinate pair. The coordinates for the monument will be pretty precise, while the set for the road and country are broad generalizations. Long linear features like roads and rivers may appear in the datasets several times as distinct feature records at different points. While it’s possible to get bounding box coordinates from Geonames, this data is not included in the downloadable country files.

4. A place name may appear multiple times in a gazetteer because names are not unique. Several different places of the same type may have the same name, and several features of different types may have the same name. For example, the Geonames file for Belize has four places name Santa Elena; two are populated places in different parts of the country while the other two are spot features (a camp and an estate) that are located near each of the populated places. The GNS file has even more records for this place, some with the approved name Santa Elena and others with the variant Saint Helena.

GNS Names and Variants

GNS records for Santa Elena, Belize. Notice the UFI is duplicated for features that have multiple names while the UNI is unique. The NT field indicates approved names (N) versus other types like variants (V). Records are for a mix of populated places (P, PPL) and spot (S) types of various kinds (ancient site, campground, and estate).

For all these reasons, it rarely makes sense to use the files in their entirety for obtaining names and coordinates or plotting places. You’ll want to extract data just for the types of features that you need. If you’re trying to match a list of place names to the gazetteer you’ll need to insure that you’re matching the right name to the right place. You can use the feature classes and the administrative divisions of the country to narrow down the location, and when in doubt use the gazetteer map interfaces to locate a specific place.

Matching Your Own Data to a Gazetteer

Winnow down the gazetteer file to just the features you need. Make sure that all the place names in your own data file are standardized so you don’t have variant spellings for the same place. In your data add a column for a unique identifier at the beginning of the sheet. Locate each place in your file in the gazetteer, then copy the unique ID from that file into your sheet. Then, if you’re using a spreadsheet you can use the VLOOKUP formula to use the ID from your sheet to pull related data from the gazetteer sheet (the longitude and latitude coordinates, codes for the administrative divisions, etc). This saves you a lot of copying and pasting. Similarly, if you were using a relational database you can write a JOIN statement to tie the two tables together using the ID.

This approach saves you the time of manually clicking on Google Maps or OSM to look up coordinates for a place and transcribing them, and you get the added benefit of grabbing any extra useful information the gazetteer provides. If you haven’t started the process of gathering your own data, start with the gazetteer file: winnow it down and append your own data to it as your research progresses.

But what if you had tons of coordinates that you need to retrieve? Because of the ambiguity in place names using a VLOOKUP or JOIN based on the name will be imprecise, because there may be more than one place with the same name and you’ll have no way of knowing if you selected the right one. You could modify your own data and the data in the gazetteer by concatenating administrative codes to the place name (i.e. St. Elena, 02) to make the name more precise and increase the chances of an accurate join. This approach requires you to be familiar with the administrative subdivisions in the areas you’re researching.

If you were trying to identify coordinates for tens of thousands of towns, cities, and larger administrative divisions you could try using a geocoder instead of a gazetteer. Geocoders are designed primarily for obtaining coordinates for addresses, but if an exact match can’t be found many will return coordinates for the smallest possible area that’s part of the address. If you provided a list of cities that also include a state / province and country, you could obtain the coordinates for just the city.

A final alternative where you can get a wider range of features in a geospatial format in bulk is the OpenStreetMap. I’ll return to this in a future post, but there’s an excellent OSM – QGIS tutorial that can help get you started.

Interested in learning more? If you’re in the spatial sciences or digital humanities check out this book: Placing Names: Enriching and Integrating Gazetteers.

LISA map of Broad Band Subscription by Household

Mapping US Census Data on Internet Access

ACS Data on Computers and the Internet

The Census Bureau recently released the latest five-year period estimates from the American Community Survey (ACS), with averages covering the years from 2013 to 2017.

Back in 2013 the Bureau added new questions to the ACS on computer and internet use: does a household have a computer or not, and if yes what type (desktop or laptop, smartphone, tablet, or other), and does a household have an internet subscription or not, and if so what kind (dial-up, broadband, and type of broadband). 1-year averages for geographies with 65,000 people or more have been published since 2013, but now that five years have passed there is enough data to publish reliable 5-year averages for all geographies down to the census tract level. So with this 2013-2017 release we have complete coverage for computer and internet variables for all counties, ZCTAs, places (cities and towns), and census tracts for the first time.

Summaries of this data are published in table S2801, Types of Computers and Internet Subscriptions. Detailed tables are numbered B28001 through B28010 and are cross-tabulated with each other (presence of computer and type of internet subscription) and by age, educational attainment, labor force status, and race. You can access them all via the American Factfinder or the Census API, or from third-party sites like the Census Reporter. The basic non-cross-tabbed variables have also been incorporated into the Census Bureau’s Social Data Profile table DP02, and in the MCDC Social profile.

The Census Bureau issued a press-release that discusses trends for median income, poverty rates, and computer and internet use (addressed separately) and created maps of broadband subscription rates by county (I’ve inserted one below). According to their analysis, counties that were mostly urban had higher average rates of access to broadband internet (75% of all households) relative to mostly rural counties (65%) and completely rural counties (63%). Approximately 88% of all counties that had subscription rates below 60 percent were mostly or completely rural.

Figure 1. Percentage of Households With Subscription to Any Broadband Service: 2013-2017[Source: U.S. Census Bureau]

Not surprisingly, counties with lower median incomes were also associated with lower rates of subscription. Urban counties with median incomes above $50,000 had an average subscription rate of 80% compared to 71% for completely rural counties. Mostly urban counties with median incomes below $50k had average subscription rates of 70% while completely rural counties had an average rate of 62%. In short, wealthier rural counties have rates similar to less wealthy urban counties, while less wealthy rural areas have the lowest rates of all. There also appear to be some regional clusters of high and low broadband subscriptions. Counties within major metro areas stand out as clusters with higher rates of subscription, while large swaths of the South have low rates of subscription.

Using GeoDa to Identify Broadband Clusters

I was helping a student recently with making LISA maps in GeoDa, so I quickly ran the data (percentage of households with subscription to any broadband service) through to see if there were statistically significant clusters. It’s been a couple years since I’ve used GeoDa and this version (1.12) is significantly more robust than the one I remember. It focuses on spatial statistics but has several additional applications to support basic data mapping and stats. The interface is more polished and the software can import and export a number of different vector and tabular file formats.

The Univariate Local Moran’s I analysis, also known as LISA for local indicators of spatial auto-correlation, identifies statistically significant geographic clusters of a particular variable. Once you have a polygon shapefile or geopackage with the attribute you want to study, you add it to GeoDa and then create a weights file (Tools menu) using the unique identifier for the shapes. The weights file indicates how individual polygons neighbor each other: queens contiguity classifies features as neighbors as long as they share a single node, while rooks contiguity classifies them as neighbors if they share an edge (at least two points that can form a line).

Once you’ve created and saved a weights file you can run the analysis (Shapes menu). You select the variable that you want to map, and can choose to create a cluster map, scatter plot, and significance map. The analysis generates 999 random permutations of your data and compares it to the actual distribution to evaluate whether clusters are likely the result of random chance, or if they are distinct and significant. Once the map is generated you can right click on it to change the number of permutations, or you can filter by significance level. By default a 95% confidence level is used.

The result for the broadband access data is below. The High-High polygons in red are statistically significant clusters of counties that have high percentages of broadband use: the Northeast corridor, much of California, the coastal Pacific Northwest, the Central Rocky Mountains, and certain large metro areas like Atlanta, Chicago, Minneapolis, big cities in Texas, and a few others. There is a relatively equal number of Low-Low counties that are statistically significant clusters of low broadband service. This includes much of the deep South, south Texas, and New Mexico. There are also a small number of outliers. Low-High counties represent statistically significant low values surrounded by higher values. Examples include highly urban counties like Philadelphia, Baltimore City, and Wayne County (Detroit) as well as some rural counties located along the fringe of metro areas. High-Low counties represent significant higher values surrounded by lower values. Examples include urban counties in New Mexico like Santa Fe, Sandoval (Albuquerque), and Otero (Alamogordo), and a number in the deep south. A few counties cannot be evaluated as they are islands (mostly in Hawaii) and thus have no neighbors.

LISA map of Broad Band Subscription by Household

LISA Map of % of Households that have Access to Broadband Internet by County (2013-2017 ACS). 999 permutations, 95% conf interval, queens contiguity

All ACS data is published at a 90% confidence level and margins of error are published for each estimate. Margins of error are typically higher for less populated areas, and for any population group that is small within a given area. I calculated the coefficient of variation for this variable at the county level to measure how precise the estimates are, and used GeoDa to create a quick histogram. The overwhelming majority had CV values below 15, which is regarded as being highly reliable. Only 16 counties had values that ranged from 16 to 24, which puts them in the medium reliability category. If we were dealing with a smaller population (for example, dial-up subscribers) or smaller geographies like ZCTAs or tracts, we would need to be more cautious in analyzing the results, and might have to aggregate smaller populations or areas into larger ones to increase reliability.

Wrap Up

The issue of the digital divide has gained more coverage in the news lately with the exploration of the geography of the “new economy”, and how technology-intensive industries are concentrating in certain major metros while bypassing smaller metros and rural areas. Lack of access to broadband internet and reliable wifi in rural areas and within older inner cities is one of the impediments to future economic growth in these areas.

You can download a shapefile with the data and results of the analysis described in this post.

Final PostGIS Result

Measuring Polygon Overlap in QGIS and PostGIS

I was helping someone with a project this semester where we wanted to calculate overlap between two different polygon layers (postal code areas and grid cells) for over forty countries throughout the world. The process involved calculating the area of overlap and percentage of total overlap between each postal area and grid cell. We began our experiment in QGIS and perfected the process, but ultimately failed because the software was not able to handle the large number of polygons: almost 2 million postal codes and over 60k grid cells. Ultimately we employed PostGIS, which was more efficient and able to do the job.

In this post I’ll outline the steps for calculating area and polygon overlap in both QGIS (as an example of desktop GIS software) and PostGIS (as an example of a spatial database); I’ll assume you have some familiarity with both. For this example I’ll use two layers from the Census Bureau’s TIGER Line Shapefiles: Congressional Districts (CDs) and ZIP Code Tabulation Areas (ZCTAs). We’ll calculate how ZCTAs overlap with CD boundaries.

Before we begin, I should say that overlap is a technical term for a specific type of spatial selection. Overlapping features must share some interior space, and the geometry of one feature is not entirely enclosed within the geometry of another. I am NOT using the term overlap in this technical sense here – I’m using it more generally to refer to features that share any interior space with another, including areas that are entirely enclosed with another (i.e. 100% overlap).


Since we’re measuring areas, the first step is to reproject our layers to a projected coordinate system that preserves area (an equal area projection). If we were working in a local area we could use a UTM or (in the US) a State Plane Zone. For continents and large countries like the US we could use Albers Equal Area Conic. If we were working globally we could use Mollweide or a Cylindrical Equal Area projection. The US Census layers are in the geographic coordinate system NAD 83. To reproject them, we select each one in the layers panel, right click, and choose save as. Browse and save them as new files, hit the CRS button, search for North America Albers Equal Area (AEA), select it, and save the new layers in that system. In the map window we select one of the new layers, right click, and choose Set Project CRS from Layer to apply the new system to the map window.

Map of CDs and ZCTAs in NAD 83

Congressional Districts (red) and ZCTAs (orange) in NAD 83

Map of CDs and ZCTAs in AEA

Congressional Districts (red) and ZCTAs (orange) in North America Albers Equal Area Conic

Next, we need to create a new field where we calculate the area for the ZCTAs. The census layers already come with pre-calculated area attributes, but we’ll need to calculate our own. Open the attribute table for the ZCTAs and hit the field calculator button (looks like an abacus). In the menu we create a new field called areatotal and populate it with the expression:

$area * 0.00000038610

$area is a geometry function that calculates the area of each polygon. Since the AEA projection uses square meters as its unit, the area will be in square meters. Multiplying by this fraction gives us square miles (or if you prefer, divide by 1000000 to get square kilometers). It’s important that we set the field type to a real / decimal number and specify a meaningful length (total number of digits) and precision (number of digits right of the decimal place). A length of 20 and a precision of 5 gives us 15 places to the left of the decimal point and 5 to the right, which should be plenty. Hit Calculate, exit out of the edit mode, and save changes.

QGIS Field Calculator

Calculating area in the QGIS Field Calculator

Before calculating the overlap it’s a good idea to check the geometry of each layer to make sure all of the polygons are valid (i.e. properly constructed), otherwise we will run into errors. Use Vector – Geometry Tools – Check Validity to check geometry, and if anything is broken open the Processing box and search for the Fix Geometry Tool. In this example both layers have valid geometry.

Use Vector – Geoprocessing – Union to meld the ZCTA and CD layers together. This will create unique polygons that consist of geometry occupied by a unique ZCTA and CD combination. So in instances where there is overlap between layers the polygon will be split into two (or more) pieces. See the image below, which illustrates CDs and ZCTAs before and after unioning in the Philadelphia area.

Congressional Disticts and ZCTAs in Philly

CDs and ZCTAs in Philly

ZCTAs in Philly after union with Congressional Districts

Split ZCTAs after union with Congressional Districts

Processing time will vary based on the number of features, their level of detail (nodes per polygon), the number of overlaps, and the number of attributes (columns) per layer. There are 444 CD features and about 33k ZCTAs. While these numbers aren’t huge, the polygons are very detailed and there is a fair amount of overlap: it took me approx 1.5 hours to run. To minimize processing time you could create copies of these layers, modify them by deleting attribute columns, and run the process on this modified layer. You should strip everything out except some unique identifiers and the totalarea field; you can always join the results back to the larger body of attributes later if you need them.

Once the process is complete, open the attribute table for the unioned layer and create a new calculated field called piecearea, where you calculate the area for these smaller pieces. At this stage you have what you need to calculate overlap: for these pieces you have columns with the total area of the original ZCTA and the area of this ZCTA piece that overlaps with a particular CD. You can add an additional calculated field called pct_in (length 5 precision 2) where you divide one by the other to get a percentage:

( “piecearea” / “totalarea” ) * 100

If a ZCTA record appears once in the table that means it’s fully inside one CD, and it should have a percentage of 100%. Otherwise it will appear multiple times, which means there is overlap and this will be reflected in the percentages. The output below is for ZCTAs 19138 through 19141 in Philadelphia, PA. Compare this to the maps above (these ZCTAs are located towards the center of the map). 19138 and 19139 are wholly within one CD, while 19140 and 19141 are split across two CDs. Unfortunately, QGIS doesn’t provide a simple way for hiding columns, so I can’t clearly represent the result in the image below – you’ll see a clearer picture from the PostGIS process. But you’ll end up with the attributes from both layers, so you can see what CD each ZCTA falls in.

QGIS Attribute Table

Attribute table with areas and percentages


The QGIS method is fine if you don’t have many polygons to calculate, but if you have a large number of features the process will either take a long time, or will crash (incidentally ArcGIS would be no different).

PostGIS to the rescue. For this approach, first you create a spatial database and activate the PostGIS extension with the command CREATE EXTENSION postgis. Then you can load the shapefiles into PostGIS using the shapefile loader that is bundled with PostGIS, or you could use the QGIS DB Manager to load them. During the import process you need to specify that the layers are in NAD 83 by specifying the correct EPSG code, changing the SRID from 0 to 4269.

PostGIS doesn’t have many global or continental projected coordinate system definitions, so we’ll have to add one for North America Albers Equal Area to its spatial reference table. A quick visit to Spatial Reference and a search for this system yields the definition, and we can get a PostGIS Insert statement that we can copy and paste into a SQL query window in our database. Before executing it, we have to change the SRID number in the statement from 9102008 to 102008 to avoid violating a check restraint that prevents IDs from being larger than 6 digits.

With the definition in place, we create a series of blank tables that will hold our two layers, and then run an insert statement where we take columns we want from the original tables and bring them into the new tables. In the course of doing this, we also transform the geometry from NAD 83 to Albers. At the end it’s important to create a spatial index on the geometry, as it will really speed up spatial selections.


CREATE TABLE zctas_aea (
zcta5 varchar(5) PRIMARY KEY,
geom geometry (Multipolygon, 102008)

INSERT INTO zctas_aea (zcta5, geom)
SELECT zcta5ce10, ST_Transform(geom, 102008)
FROM tl_2018_us_zcta510;

CREATE INDEX zctas_aea_geom_gist
ON zctas_aea
USING gist (geom);

CREATE TABLE cds_aea (
geoid varchar(4) PRIMARY KEY,
statefp varchar(2),
name text,
session varchar(3),
geom geometry (Multipolygon, 102008)

INSERT INTO cds_aea (geoid, statefp, name, session, geom)
SELECT geoid, statefp, namelsad, cdsessn, ST_Transform(geom, 102008)
FROM tl_2018_us_cd116;

CREATE INDEX cds_aea_geom_gist
ON cds_aea
USING gist (geom);


Once the data is inserted we can check the geometry validity with ST_IsValid, and if there is bad geometry we can fix it with another statement using ST_MakeValid, where IN contains identifiers for bad geometry discovered in the previous statement.

SELECT geoid, ST_IsValid(geom) AS notvalid,
ST_IsValidReason(geom) AS reason
FROM cds_aea
WHERE NOT ST_IsValid(geom);
UPDATE cds_aea
SET geom=ST_MakeValid(geom)

We can execute the overlap operation with a single statement. PostGIS allows you to calculate area on the fly with the ST_Area function, and there are two functions for overlap: ST_Intersects acts as a spatial join that relates one layer to the other by selecting all features that Intersect, while ST_Intersection selects the actual pieces of each feature’s geometry that intersect. This example is just for Pennsylvania, which we select using the state FIPS code ’42’ from the CD layer.  It’s a good idea to get the statement right on a sample of records before executing it on the entire set. The double colons are a PostgreSQL shortcut for casting data types from one type to the other. This is necessary when using the ROUND function to produce a non-integer result (as ROUND can’t be used to round real decimal numbers produced from the AREA function to a fixed number of decimal places).

SELECT z.zcta5 AS zcta, c.geoid AS cd, AS cdname,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) *  0.00000038610)::numeric,2) AS area_piece,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) / ST_Area(z.geom) * 100)::numeric,1) AS pct_in
FROM zctas_aea z, cds_aea c
WHERE ST_Intersects(z.geom, c.geom) AND c.statefp = '42'
ORDER BY z.zcta5, c.geoid, pct_in DESC;

This statement took me about 20 seconds to run. The results (see below) include several records that QGIS didn’t return, where the area and overlap is 0, either due to an infinitely small area of overlap that rounds to zero or strict interpretation of intersect (which includes areas that overlap and touch). While there is an ST_Overlap function, it will not return geometries where one geometry is completely contained within another (so we can’t use that). For example, ZCTAs 19138 and 19139 appear within one district but there are two records for them, one with a 100% value and another with a 0% value.

Query results in PostgreSQL

Result of intersect operations and area calculations in pgAdmin / PostGIS

We can toss these records by either deleting them from the final result when the process is finished, or we can add another statement to our WHERE clause to filter them out:

AND ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) *  0.00000038610)::numeric,2) > 0

This lengthened the execution time to 30 seconds and dropped the number of records from 2,523 to 2,061.

Once the statement looks good, we can drop the AND filter for Pennsylvania and generate a result for the entire country. Using pgAdmin 4 we can write the result directly out as a CSV. Or, you can preface the statement with CREATE VIEW overlap AS to save the statement as a query which you can call up any time. Or, you can preface the statement with CREATE TABLE overlap AS and the result of the query will be saved in a new table. This takes longer than the other two options, but gives you the ability to query and modify the resulting table. Exporting the table out as a CSV can be accomplished quickly, giving you the best of options 1 and 3. The final code and result is shown below.

CREATE TABLE zcta_cd_overlap AS
SELECT z.zcta5 AS zcta, c.geoid AS cdistrict, AS cdname,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) *  0.00000038610)::numeric,2) AS area_piece,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) / ST_Area(z.geom) * 100)::numeric,1) AS pct_in
FROM zctas_aea z, cds_aea c
WHERE ST_Intersects(z.geom, c.geom) AND
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) *  0.00000038610)::numeric,2) > 0
ORDER BY z.zcta5, c.geoid, pct_in DESC;
Final PostGIS Result

Final Result in PostGIS / pgAdmin

Conclusion – which is best?

I’m using a 64-bit Lenovo Thinkpad laptop that has 4 Intel processors at 2.3Ghz and 8 gigs of memory. I’m running Xubuntu 18.04 and am using QGIS 3.4 Madeira, PostgreSQL 10, PostGIS 2.4, and pgAdmin 4. With 444 CDs and 33k ZCTAs it took me over 1.5 hours to run the union operation in QGIS, and that’s without altering the attribute tables to delete unnecessary columns. Executing the PostGIS statement, simply writing the output to the screen with the caveat to exclude areas with 0, took only 12 minutes. Writing the result to a new table took 22 minutes.

For the larger project that I mentioned at the beginning of this post, neither QGIS nor ArcGIS was able to complete the union process between 2 million polygons and 60k grid areas without crashing, even when we reduced the number of attribute columns to a bare minimum. It took PostGIS about 50 minutes to execute the overlap query and print the output to the screen or directly to a CSV, and about 3 hours to write the results to a new table.

I think the PostGIS approach is more straightforward and gives you more control over the process. There’s no need calculate area in advance or to delete attribute columns, as you can simply choose to include or exclude the ones you want. Finding and fixing invalid geometry in PostGIS is simpler, and the process is faster to execute. Rest assured you can handle layers with large numbers of features. I’ve wondered if the problems with QGIS and ArcGIS might be mitigated by using something other than a shapefile, like the newer geopackage format which is built on SQLite. I have no idea but it would be worth trying if you really wanted or needed to go the desktop GIS route for large files.

Sedona Hike

XYZ Tiles and WMS Layers in QGIS 3

I did a lot of hiking around Sedona, Arizona a few weeks ago, and wanted to map my GPS way points and tracks in QGIS over some WMS (web mapping service) base map layers. I recently switched to QGIS 3 since I need to use that in my book (by the time it comes out 2.18 will be old news), and had to spend time starting from scratch since the plugin I always used was no longer available (ahhh the pitfalls of relying on 3rd party plugins – see my last post on SQLite). I thought I’d share what I learned here.

I was using the OpenLayers plugin in QGIS 2.x as an easy resource to add base maps to my projects. You could pull in layers from OSM, Google, Bing, and others. It turns out that plugin is no longer available for QGIS 3.x. So I searched around and found some suggestions for a different plugin called QuickMapServices which was a great replacement. But alas, that worked in QGIS 3.0 but is not compatible (as of now) for QGIS 3.2.

So I’m back to adding WMS layers manually. There is a new feature in QGIS for adding XYZ Tiles; this is a little better than WMS because the base map can be rendered a bit quicker. I found a tip in the Stack Exchange that you can add an OSM tiles layer with this url:{z}/{x}/{y}.png 

Select XYZ Tiles in the Browser, right click, New connection, give it a name, add the URL. You can modify the X Y Z coordinates where the map centers and zooms by default. Once you’ve created the connection, you can simply drag the OSM layer into the map window to render it.

Adding the OSM XYZ Tiles in QGIS

One problem that always creeps up: when you add other layers and adjust the zoom, sometimes the rendering of the base map looks poor, i.e. the features and labels look blurry or blocky. When you’re pulling data from a web map layer, as you zoom in it swaps out the tiles for more detailed ones appropriate for that scale. But when you’re zooming in QGIS things can get out of synch, as your map window zoom may not be enough to trigger the switch in the map tiles, or those map tiles are just not meant to be rendered at that scale. If you right click on a blank area of the toolbar, you can activate the Tile Scale panel and can use the slider to adjust the window zoom in synch with the tiles, so you can operate at the scales that are appropriate for the tiles. The way points and track for our hike alongside Schnebly Hill Road are shown below, and the labels for the points represent our elevation in feet.

OSM Tile Layer with Tile Scale Panel

If the slider is grayed out, select the OSM layer in the Layers menu, right click, and select Set CRS  – Set Project CRS From Layer. Web mapping services typically use EPSG 3857 Pseudo Mercator as the coordinate reference system / map projection by default. If your other vectors layers aren’t in that system, you can have the base map draw to their system or vice versa by selecting the layer, right clicking, and choosing Set CRS. But for the tile scale to work properly EPSG 3857 must be the project CRS.

Lastly, I’ve always liked the USGS WMS layers, which are never included in the plugins that I’ve seen. The USGS provides layers for: imagery, imagery with topographic features, shaded relief, and the USGS topographic map layer:

USGS Link for WMS Layer for Topographic Maps

You can click on one of the services, and at the top (in small print) are urls for their services in WMS and WMTS. The last one is a web mapping tile service, which is a bit faster than WMS. Click on the WMTS link, and copy the url from the address in the browser. Then in QGIS select WMS / WMTS layers, right click, add a new connection, give it a name and paste the url. This is url for the topographic map:

Once again, you can drag the layer into the map window to render it, and you can use the Tile Scale panel to adjust the zoom. Here’s our hike with the topo map as the base:

USGS Topographic Map in QGIS

The Map Reliability Calculator for Classifying ACS Data

The staff at the Population Division at NYC City Planning take the limitations of the American Community Survey (ACS) data seriously. Census estimates for tract-level data tend to be unreliable; to counter this, they aggregate tracts into larger Neighborhood Tabulation Areas (NTAs) to produce estimates that have better precision. In their Census Factfinder tool, they display but grey-out variables where the margin of error (MOE) is unacceptably large. If users want to aggregate geographies, the Factfinder does the work of re-computing the margins of error.

Now they’ve released a new tool for census mappers. The Map Reliability Calculator is an Excel spreadsheet for measuring the reliability of classification schemes for making choropleth maps. Because each ACS estimate is published with a MOE, it’s possible that certain estimates may fall outside their designated classification range.

For example, we’re 90% confident that 60.5% plus or minus 1.5% of resident workers 16 years and older in Forest Hills, Queens took public transit to work during 2011-2015. The actual value could be as low as 59% or as high as 62%. Now let’s say we have a classification scheme that has a class with a range from 60% to 80%. Forest Hills would be placed in this class since its estimate is 60.5%, but it’s possible that it could fall into the class below it given the range of the margin of error (as the value could be as low as 59%).

The tool determines how good your classification scheme is by calculating the percent of estimates that could fall outside their assigned class, based on each MOE and the break point of the class. On the left of the sheet you paste your estimates and MOEs, and then type the number of classes you want. On the right, the reliability of classifying that data is calculated for equal intervals (equal range of values in each class) and quantiles (equal number of data points in each class). You can see the reliability of each class and the overall reliability of the scheme. The scheme is classified as reliable if: no individual class has more than 20% of its values identified as possibly falling outside the class, and less than 10% of all the scheme’s values possibly fall outside their classes.

I pasted some 5-year ACS data for NYC PUMAs below (the percentage of workers 16 years and older who take public transit to work in 2011-2015) under STEP 1. In STEP 2 I entered 5 for the number of classes. In the classification schemes on the right, equal intervals is reliable; only 6.6% of the values may fall outside their class. Quantiles was not reliable; 11.9% fell outside. If I reduce the number of classes to 4, reliability improves and both schemes fall under 10%; although unreliability for one of the classes for quantiles is high at 18%, but still below the 20% threshold. Equal intervals should usually perform better than quantiles, as the latter scheme can make rather arbitrary breaks that result in small differences in value ranges between classes (in order to insure that each class has the same number of data points).

Map reliability calculator with 5 classes

Map reliability calculator with 4 classes

You can also enter custom-defined schemes. For example let’s say you use natural breaks (classes determined by gaps in value ranges). There’s a 2-step process here; first you classify the data in GIS and determine what the breaks are, and then you enter them in the spreadsheet. If you’re using QGIS there’s a snag in doing this; QGIS doesn’t show you the “true” breaks of your data based on the actual values, and when you classify data it displays clean breaks that overlap. For example, natural breaks of this data with 5 classes appears like this:

24.4 – 29.0
29.0 – 45.9
45.9 – 55.8
55.8 – 65.1
65.1 – 73.3

So, does the value for 29.0 fall in the first class or the second? The answer is, the first (test it by selecting that record in the attribute table and see where it is on the map, and what color it is). So you need to adjust the values appropriately, paying attention to the precision and scale of your numbers. In this case I bump the first value of each class up by .1, except for the bottom class which you leave alone:

24.4 – 29.0
29.1 – 45.9
46.0 – 55.8
55.9 – 65.1
65.2 – 73.3

In the calculator you have to enter the top class value first, and just the first value in the range:


Map reliability calculator with user defined classes

In this case only 7.1% of the total values may fall outside their class so things look good – but my bottom class barely makes the minimum class threshold at 19.4%. I can try dropping the classes down to 4 or I can manually adjust this class to see if I can improve reliability.

If you’re unsure if you made the right adjustments to the classes in translating them from QGIS to the calculator, in QGIS turn on the Show Feature Count option for the layer to see how many data points are in each class, and compare that to the class counts in the calculator. If they don’t match, you need to re-adjust.

QGIS natural breaks and feature count

This is a great tool for census mappers who want or need to account for issues with ACS reliability. It’s an Excel spreadsheet but I used it in LibreOffice Calc with no problem. In addition to the calculator sheet there’s a second sheet with instructions and background info. Download the Map Reliability Calculator here. You can try it out with this test data,  workers who commute with mass transit, 2011-2015 ACS for NYC PUMAs.