
Extracting OpenStreetMap Data in QGIS 3

The OpenStreetMap (OSM) can be a good source of geospatial data for all sorts of features, particularly for countries where the government doesn’t provide publicly accessible GIS data, and for features that most governments don’t publish data for. In this post I’ll demonstrate how to download a specific feature set for a relatively small area using QGIS 3.x. Instead of simply adding OSM as a web service base map we’ll extract features from OSM to create vector layers.

In the past I followed some straightforward instructions for doing this in QGIS 2.x, but of course with the movement to 3.x the core OSM plugin I previously used is no longer included, and no updated version was released. It’s a miracle that anyone can figure out what’s going on between one version of QGIS and the next. Fortunately, there’s another plugin called QuickOSM that’s quite good, and works fine with 3.x.

Use QuickOSM to Extract Features

Let’s say that we want to create a layer of churches for the city of Merida in Mexico. First we launch QGIS, go to the Plugins menu, and choose Manage and Install Plugins. Select the plugins that are not installed, do a search for QuickOSM, select it, and install it. This adds a couple of buttons to the plugins toolbar and a new sub-menu under the Vector menu called Quick OSM.

Next, we add a layer to serve as a frame of reference. We’re going to use the extent of the QGIS window to grab OSM features that fall within that area. We could download some vector files from GADM or Natural Earth; GADM provides several layers of administrative divisions which can be useful for locating and delineating our area. Or we can add a web service like OSM and simply zoom in to our area of interest. Adjust the zoom so that the entire city of Merida fits within the window.

Merida in QGIS

OSM XYZ Tiles in QGIS – Zoomed into Merida

Now we can launch the Quick OSM tool. The default tab is Quick query, which allows us to select features directly from an OSM server (you need to be connected to the internet to do this). OSM data is stored in an XML format, so to extract the data we want we’ll need to specify the correct elements and tags. Ample documentation for all the map features is available. In our example, churches are referred to as places of worship and are classified as an amenity. So we choose amenity as the key and place_of_worship as the value. The drop down box allows us to search for features in or around a place, but as discussed in my previous post place names can be ambiguous. Choose the option for canvas extent, and that will capture any churches in our map window. Hit the advanced drop down arrow, and you have the option to select specific types of geometry (keep them all). Hit the run query button to execute.

Quick OSM Interface

Quick OSM Interface

We’ll see there are two results: one for places of worship that are points, and another for polygons. If you right click on one of these layers and open the attribute table, you’ll see a number of tags that have been extracted and saved as columns, such as the name, religion, and denomination. The Quick query tool pulls a series of pre-selected attributes that are appropriate for the type of feature.

Places of Worship

The data is saved temporarily in memory, so to keep it you need to save each as a shapefile or geopackage (right click, Export, Save Features As). But before we do that – why do we have two separate layers to begin with? In some cases the OSM has the full shape of the building saved as a polygon, while in other cases the church is saved as a point feature, with a cross or other religious symbol appropriate for the type of worship space. It simply depends on the level of detail that was available when the feature was added.

Polygon versus Point

Church as polygon (lower left-hand corner) and as point (upper right-hand corner)

If we needed a single unified layer we would need to merge the two, but this process can be a pain. Using the vector menu you can convert the polygons to points using the centroid tool, and then use the merge tool to combine the two point layers. This is problematic as the number of fields in each file is different, and because the centroid tool changes the data type of the polygon’s id number to a type that doesn’t match the points. I think the easiest solution is to load both layers into a Spatialite database and create a unified layer in the DB.

Use SpatiaLite to Create a Single Point Layer

To do that, right click on the SpatiaLite option in the Browser Panel, choose Create Database, and name it (merida_churches). Then select the church point file, right click, export, save features as. Choose SpatiaLite as the format, for the file select the database we just created, and for layer name call it church_points. The default CRS (used by OSM) is WGS 84. Hit OK. Then repeat the steps for the polygons, creating a layer called church_polygons in that same database.

Once the features are database layers, we can write a SQL script (see below) that creates one table with the columns we want to capture from both tables. We load the data from each of the tables into the unified one, and as we load the polygons we convert their geometry to points. The brackets around names like [addr:full] allow us to work around the illegal characters in the original column names (you shouldn’t use colons in db column names). I like to manually insert a date so that I remember when I downloaded the feature set.

BEGIN;

-- Unified table with the columns we want to keep from both source layers
CREATE TABLE all_churches (
full_id TEXT NOT NULL PRIMARY KEY,
osm_id INTEGER NOT NULL,
osm_type TEXT,
name TEXT,
religion TEXT,
denomination TEXT,
addr_housenumber TEXT,
addr_street TEXT,
addr_city TEXT,
addr_full TEXT,
download_date TEXT);

-- Register a point geometry column in WGS 84 (EPSG 4326)
SELECT AddGeometryColumn('all_churches','geom',4326,'POINT','XY');

-- Load the polygon churches, converting each polygon to its centroid
INSERT INTO all_churches
SELECT full_id, osm_id, osm_type, name, religion, denomination,
[addr:housenumber], [addr:street], [addr:city], [addr:full],
'02/11/2019', ST_CENTROID(geometry)
FROM church_polygons;

-- Load the point churches as they are
INSERT INTO all_churches
SELECT full_id, osm_id, osm_type, name, religion, denomination,
[addr:housenumber], [addr:street], [addr:city], [addr:full],
'02/11/2019', geometry
FROM church_points;

-- Index the geometry to speed up spatial queries
SELECT CreateSpatialIndex('all_churches', 'geom');

COMMIT;

Unfortunately the QGIS DB Browser does not allow you to run SQL transactions / scripts. You can paste the entire script into the window, highlight the first statement (CREATE TABLE), execute it, then highlight the next one (SELECT AddGeometryColumn), execute it, etc. Alternatively if you use the Spatialite CLI or GUI, you can save your script in a file, load it, and execute it in one go.

QGIS DB Browser

When finished we hit the refresh button and can see the new all_churches layer in the DB. We can preview the table and geometry and add it to the QGIS map window. If you prefer to work with a shapefile or geopackage you can always export it out of the db.
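As a quick sanity check on the unified layer, you can run a short query in the same SQL window to summarize what was captured. This is just a sketch against the all_churches table created above; the counts will vary with your download.

SELECT religion, denomination, COUNT(*) AS features
FROM all_churches
GROUP BY religion, denomination
ORDER BY features DESC;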

Other Options

The QuickOSM tool has a few other handy features. Under the Quick query tool is a plain old Query tool, which shows you the actual query being passed to the server. If you’re familiar with the map features and XML structure of OSM you can modify this query directly. Under the Query tool is the OSM File tool. Instead of grabbing features from the server, you can download an OSM pbf file (Geofabrik provides data for each country) and use this tool to load data from that file. It loads all features from the file for the geometries you choose, so the process can take a while. You’ll want to load the data into a temporary file instead of saving it in memory, to avoid a crash.


Place Names: Comparing Two Global Gazetteers

Gazetteers are directories of place names and locations, which are useful for:

  1. Identifying variations in place names
  2. Obtaining coordinates
  3. Locating a place within a hierarchy of places
  4. Generating lists of types of features

For example, if you’re working with data that’s associated with specific cities, mountains, or bodies of water, and you have the names of these features but not the coordinates or the country or state / province where they’re located, you can use a gazetteer to obtain all three. Or, if you want to create a map of a specific type of feature (e.g. populated places, ruins, mines) or want map labels for features (forests, bodies of water) you can extract and plot the gazetteer data in GIS.

In this post I’ll provide an overview of two major global gazetteers: the GEOnet Names Server and Geonames. Each one provides several different interfaces and services for exploring and accessing data, which I’ll briefly mention, but I’ll focus on the data files that you can download and what’s contained in them. I’ll conclude with a strategy for relating a small to medium place-based data file of your own to the gazetteer to obtain coordinates. If you have a file with hundreds or a few thousand records and were planning to get coordinates by eyeballing Google Maps and clicking one by one, try this instead.

NGA GNS

File Downloads | Documentation and code book

The US National Geospatial-Intelligence Agency (NGA) maintains a vast gazetteer with data for all of the countries in the world (almost) and provides it to the public via the GEOnet Names Server (GNS). The GNS gazetteer does NOT include features in the United States or any of its territories; the US Geological Survey maintains a separate system called the Geographic Names Information System (GNIS) whose structure and organization is different.

The GNS is updated on a weekly basis and is provided through a number of interfaces that include a map-based and a text-based search, and Web Map Services (WMS) and Web Feature Services (WFS) that allow you to display the data in a GIS or a web map.

Data files are packaged on a country-by-country basis; alternatively, you can download one file that covers the whole world, or an archive with separate files for each country. The data is stored in tab-delimited text files that include a header row (i.e. the column names). The ZIP file for each country includes a primary file that contains all of the country’s features, and a series of files that contain subsets of the primary file based on feature type. So, if you wanted to work with just populated places or with hydrographic features you can work with the specific file instead of having to filter them out of the primary one.

Each record in the GNS represents a name for a feature, as opposed to a feature itself. Thus, if a feature is known by more than one name it will appear multiple times in the file. Each record has a unique feature identifier (UFI) and a unique name identifier (UNI), which are large integers. The UFI number is repeated in the data, while the UNI is unique. The GNS files contain a number of different columns containing several feature names (short names, long ones, with and without diacritics) and a name type column (NT) that indicates whether the record is an approved (N) or variant (V) name. If you want a list of features without duplicates, you would need to create a subset of the records that only includes the approved names.
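For example, if you load a GNS country file into a database (or any tool that speaks SQL), a query along these lines pulls one record per feature by keeping only the approved names. The table name is hypothetical and the column names are a sketch based on the GNS header row, so check them against your own file.

SELECT ufi, uni, full_name_ro, fc, dsg, lat, long
FROM gns_belize
WHERE nt = 'N';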

Features are classified into nine broad classes (FC), which in turn are subdivided into many different designations (DSG). The nine classes are: administrative region, populated place, vegetation, locality or area, undersea, roads and railroads, hypsographic (terrain), hydrographic (water), and spot (point-based features). Additional columns include codes designating the size of a populated place (PC) and relative importance of the feature (DISPLAY) which is useful when mapping data at varying scales. The GNS does not contain information on actual population or elevation (this was included in the past but is no longer available).

The GNS includes a few geographic references that indicate where the feature is located. There is a global region code (RC) in the first column, a primary country code (cc1) and an administrative division (state or province) code for the primary country, and a secondary country code (cc2). Geographic features like rivers, seas, mountains, and forests may span the boundary of more than one country, so the cc1 and cc2 columns indicate this. Data in these fields may be stored as a comma-separated list or array with the different codes. The GNS uses two-letter FIPS 10-4 country codes created by the US government.

Country codes in the GNS

This SQL query illustrates how country and admin1 codes are stored in the GNS, and how some features (streams in this case) span several countries.

Lastly, longitude and latitude coordinates are provided in separate fields in two formats: decimal degrees (needed for plotting and mapping) and degrees-minutes-seconds. The coordinates are in the WGS 84 CRS (EPSG 4326).

Geonames

File Downloads | Documentation and code book

Geonames is the Wikipedia or OpenStreetMap of gazetteers. It’s a collaborative, crowd-sourced project. Many users contribute a few locations or make a correction or two, but by and large most of the data comes from public or government sources and is loaded into Geonames en masse and subsequently modified. Geonames provides a text and map-based search, and an API that lets scripters and programmers directly access the data.

Data files are packaged country by country, or globally by certain types (e.g. all countries or the largest cities). The data is stored in tab-delimited text files without a header row, so you need to consult the documentation to identify the columns. All data for each country is packaged in a single file.

Unlike the GNS, each Geonames record represents a specific feature. There is a conventional name (name) and a variant that uses plain ascii characters (asciiname). Some variant names are included in a single list / array column called alternatenames; to get a full list of variants and spellings in different languages you would download a separate alternate names file that you could link to this one. Each feature is assigned a geonameid, which is simply a large unique integer.

Features are divided into the same nine classes that are used in the GNS, and the subdivisions are the same as well. Documentation for the classes and subdivisions is provided. Population and elevation data is provided when available and relevant, but there’s no information on timeliness or source in the data file (but you can view the full edit history for a record in the online interface).

Geonames goes to great lengths to provide the geographic framework or hierarchy for each feature, so you can get instant geographic context. They use two-letter ISO country codes to designate countries (country_code), a list of alternate or secondary countries (cc2), and for the primary country up to four different levels of administrative divisions (e.g. state / province, county, municipality). There’s also a field that indicates what timezone each feature is in.

There is one set of longitude and latitude coordinates in decimal degrees in the WGS84 CRS.
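Since the country files lack a header row, it helps to define the columns up front when importing them into a database. The table name below is hypothetical and the column order is a sketch based on the Geonames readme, so double-check it against the documentation before loading.

CREATE TABLE geonames_bz (
geonameid INTEGER PRIMARY KEY,
name TEXT,
asciiname TEXT,
alternatenames TEXT,
latitude REAL,
longitude REAL,
feature_class TEXT,
feature_code TEXT,
country_code TEXT,
cc2 TEXT,
admin1_code TEXT,
admin2_code TEXT,
admin3_code TEXT,
admin4_code TEXT,
population INTEGER,
elevation INTEGER,
dem INTEGER,
timezone TEXT,
modification_date TEXT
);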

Geonames Belize City

Geonames search result for Belize City, illustrating options and available data.

Summary Comparison

To compare the different files I downloaded data for Belize, since it has a small number of records. The GNS file had 2,801 records for names; counting unique features, the total was 2,180. The Geonames file for Belize has a comparable 2,309 records.

Commonalities

  • Free and publicly available
  • Tab-delimited text in country-based files
  • Longitude and latitude coordinates in decimal degrees in WGS84
  • Same feature classification system with nine classes and multiple sub-classes

GNS

  • A single, official government source
  • A file of feature names: must filter out variants to get unique feature records
  • File comes with column header
  • Files are divided into sub-files for feature classes
  • Uses FIPS codes for countries
  • Useful fields for ranking features for mapping
  • Limited data on geographic hierarchy
  • No data on population or elevation
  • Lacks data for the United States and territories (obtainable via the USGS GNIS)

Geonames

  • Collaborative project with data from many sources
  • A file of features, variant names included in separate column
  • Additional alternate names and spellings in most languages available in separate files
  • File lacks column header
  • Uses ISO codes for countries
  • Extensive information on geographic hierarchy
  • Has population, elevation, and timezone for certain features
  • No ranking columns for map display

Gazetteer Caveats

1. It’s important to recognize that each source uses different codes for classifying countries: the GNS uses FIPS and Geonames uses ISO. While they appear similar (two-letter abbreviations) they are NOT the same: the FIPS code for Belize is BH and the ISO code is BZ; in the ISO system BH is for Bahrain, while the FIPS system doesn’t use BZ as a code. The CIA World Factbook includes a table comparing different country code systems. The GNS will convert to ISO at some uncertain date in the future.

2. Gazetteer data must be imported using UTF-8 encoding to preserve all the characters from the various alphabets.

3. Each feature in a gazetteer will have longitude and latitude coordinates that represent the geographic center of a feature. That means that a large areal feature like a country, a linear feature like a road, and a small point feature like a monument will have one coordinate pair. The coordinates for the monument will be pretty precise, while the set for the road and country are broad generalizations. Long linear features like roads and rivers may appear in the datasets several times as distinct feature records at different points. While it’s possible to get bounding box coordinates from Geonames, this data is not included in the downloadable country files.

4. A place name may appear multiple times in a gazetteer because names are not unique. Several different places of the same type may have the same name, and several features of different types may have the same name. For example, the Geonames file for Belize has four places named Santa Elena; two are populated places in different parts of the country while the other two are spot features (a camp and an estate) that are located near each of the populated places. The GNS file has even more records for this place, some with the approved name Santa Elena and others with the variant Saint Helena.

GNS Names and Variants

GNS records for Santa Elena, Belize. Notice the UFI is duplicated for features that have multiple names while the UNI is unique. The NT field indicates approved names (N) versus other types like variants (V). Records are for a mix of populated places (P, PPL) and spot (S) types of various kinds (ancient site, campground, and estate).

For all these reasons, it rarely makes sense to use the files in their entirety for obtaining names and coordinates or plotting places. You’ll want to extract data just for the types of features that you need. If you’re trying to match a list of place names to the gazetteer you’ll need to ensure that you’re matching the right name to the right place. You can use the feature classes and the administrative divisions of the country to narrow down the location, and when in doubt use the gazetteer map interfaces to locate a specific place.

Matching Your Own Data to a Gazetteer

Winnow down the gazetteer file to just the features you need. Make sure that all the place names in your own data file are standardized so you don’t have variant spellings for the same place. In your data, add a column at the beginning of the sheet to hold the gazetteer’s unique identifier. Locate each place in your file in the gazetteer, then copy the unique ID from that file into your sheet. Then, if you’re using a spreadsheet, you can use a VLOOKUP formula with that ID to pull related data from the gazetteer sheet (the longitude and latitude coordinates, codes for the administrative divisions, etc.). This saves you a lot of copying and pasting. Similarly, if you’re using a relational database you can write a JOIN statement that ties the two tables together using the ID, as in the sketch below.
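Here is what that join might look like in a database, with hypothetical table names: my_places is your own sheet with the gazetteer ID copied in, and the gazetteer table follows the Geonames layout sketched earlier.

SELECT p.place_id, p.place_name,
g.latitude, g.longitude, g.country_code, g.admin1_code
FROM my_places p
JOIN geonames_bz g ON g.geonameid = p.geonameid;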

This approach saves you the time of manually clicking on Google Maps or OSM to look up coordinates for a place and transcribing them, and you get the added benefit of grabbing any extra useful information the gazetteer provides. If you haven’t started the process of gathering your own data, start with the gazetteer file: winnow it down and append your own data to it as your research progresses.

But what if you have tons of places for which you need to retrieve coordinates? Because of the ambiguity in place names, a VLOOKUP or JOIN based on the name alone will be imprecise: there may be more than one place with the same name and you’ll have no way of knowing whether you selected the right one. You could modify your own data and the data in the gazetteer by concatenating administrative codes to the place name (e.g. St. Elena, 02) to make the name more precise and increase the chances of an accurate join. This approach requires you to be familiar with the administrative subdivisions in the areas you’re researching.
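In a database, the equivalent of the concatenation trick is simply to join on both the name and the administrative code; restricting the match to a feature class narrows it further. Again, the table and column names here are hypothetical.

SELECT p.place_id, p.place_name, g.latitude, g.longitude
FROM my_places p
JOIN geonames_bz g
ON g.name = p.place_name
AND g.admin1_code = p.admin1_code
WHERE g.feature_class = 'P';  -- populated places only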

If you were trying to identify coordinates for tens of thousands of towns, cities, and larger administrative divisions you could try using a geocoder instead of a gazetteer. Geocoders are designed primarily for obtaining coordinates for addresses, but if an exact match can’t be found many will return coordinates for the smallest possible area that’s part of the address. If you provided a list of cities that also include a state / province and country, you could obtain the coordinates for just the city.

A final alternative, where you can get a wider range of features in a geospatial format in bulk, is OpenStreetMap. I’ll return to this in a future post, but there’s an excellent OSM – QGIS tutorial that can help get you started.

Interested in learning more? If you’re in the spatial sciences or digital humanities check out this book: Placing Names: Enriching and Integrating Gazetteers.


Mapping US Census Data on Internet Access

ACS Data on Computers and the Internet

The Census Bureau recently released the latest five-year period estimates from the American Community Survey (ACS), with averages covering the years from 2013 to 2017.

Back in 2013 the Bureau added new questions to the ACS on computer and internet use: does a household have a computer or not, and if yes what type (desktop or laptop, smartphone, tablet, or other), and does a household have an internet subscription or not, and if so what kind (dial-up, broadband, and type of broadband). 1-year averages for geographies with 65,000 people or more have been published since 2013, but now that five years have passed there is enough data to publish reliable 5-year averages for all geographies down to the census tract level. So with this 2013-2017 release we have complete coverage for computer and internet variables for all counties, ZCTAs, places (cities and towns), and census tracts for the first time.

Summaries of this data are published in table S2801, Types of Computers and Internet Subscriptions. Detailed tables are numbered B28001 through B28010 and are cross-tabulated with each other (presence of computer and type of internet subscription) and by age, educational attainment, labor force status, and race. You can access them all via the American Factfinder or the Census API, or from third-party sites like the Census Reporter. The basic non-cross-tabbed variables have also been incorporated into the Census Bureau’s Social Data Profile table DP02, and in the MCDC Social profile.

The Census Bureau issued a press release that discusses trends for median income, poverty rates, and computer and internet use (addressed separately) and created maps of broadband subscription rates by county (I’ve inserted one below). According to their analysis, counties that were mostly urban had higher average rates of access to broadband internet (75% of all households) relative to mostly rural counties (65%) and completely rural counties (63%). Approximately 88% of all counties that had subscription rates below 60 percent were mostly or completely rural.

Figure 1. Percentage of Households With Subscription to Any Broadband Service: 2013-2017 [Source: U.S. Census Bureau]

Not surprisingly, counties with lower median incomes were also associated with lower rates of subscription. Urban counties with median incomes above $50,000 had an average subscription rate of 80% compared to 71% for completely rural counties. Mostly urban counties with median incomes below $50k had average subscription rates of 70% while completely rural counties had an average rate of 62%. In short, wealthier rural counties have rates similar to less wealthy urban counties, while less wealthy rural areas have the lowest rates of all. There also appear to be some regional clusters of high and low broadband subscriptions. Counties within major metro areas stand out as clusters with higher rates of subscription, while large swaths of the South have low rates of subscription.

Using GeoDa to Identify Broadband Clusters

I was recently helping a student make LISA maps in GeoDa, so I quickly ran the data (percentage of households with a subscription to any broadband service) through it to see if there were statistically significant clusters. It’s been a couple of years since I’ve used GeoDa, and this version (1.12) is significantly more robust than the one I remember. It focuses on spatial statistics but has several additional applications to support basic data mapping and stats. The interface is more polished and the software can import and export a number of different vector and tabular file formats.

The Univariate Local Moran’s I analysis, also known as LISA (local indicators of spatial autocorrelation), identifies statistically significant geographic clusters of a particular variable. Once you have a polygon shapefile or geopackage with the attribute you want to study, you add it to GeoDa and then create a weights file (Tools menu) using the unique identifier for the shapes. The weights file indicates how individual polygons neighbor each other: queen contiguity classifies features as neighbors as long as they share a single node, while rook contiguity classifies them as neighbors only if they share an edge (at least two points that can form a line).

Once you’ve created and saved a weights file you can run the analysis (Shapes menu). You select the variable that you want to map, and can choose to create a cluster map, scatter plot, and significance map. The analysis generates 999 random permutations of your data and compares them to the actual distribution to evaluate whether clusters are likely the result of random chance, or whether they are distinct and significant. Once the map is generated you can right click on it to change the number of permutations, or you can filter by significance level. By default a 95% confidence level is used.

The result for the broadband access data is below. The High-High polygons in red are statistically significant clusters of counties that have high percentages of broadband use: the Northeast corridor, much of California, the coastal Pacific Northwest, the Central Rocky Mountains, and certain large metro areas like Atlanta, Chicago, Minneapolis, big cities in Texas, and a few others. There is a roughly equal number of Low-Low counties that are statistically significant clusters of low broadband service. This includes much of the deep South, south Texas, and New Mexico. There are also a small number of outliers. Low-High counties represent statistically significant low values surrounded by higher values. Examples include highly urban counties like Philadelphia, Baltimore City, and Wayne County (Detroit) as well as some rural counties located along the fringe of metro areas. High-Low counties represent statistically significant high values surrounded by lower values. Examples include urban counties in New Mexico like Santa Fe, Sandoval (Albuquerque), and Otero (Alamogordo), and a number in the deep South. A few counties cannot be evaluated as they are islands (mostly in Hawaii) and thus have no neighbors.

LISA map of Broad Band Subscription by Household

LISA Map of % of Households that have Access to Broadband Internet by County (2013-2017 ACS). 999 permutations, 95% conf interval, queens contiguity

All ACS data is published at a 90% confidence level and margins of error are published for each estimate. Margins of error are typically higher for less populated areas, and for any population group that is small within a given area. I calculated the coefficient of variation for this variable at the county level to measure how precise the estimates are, and used GeoDa to create a quick histogram. The overwhelming majority had CV values below 15, which is regarded as being highly reliable. Only 16 counties had values that ranged from 16 to 24, which puts them in the medium reliability category. If we were dealing with a smaller population (for example, dial-up subscribers) or smaller geographies like ZCTAs or tracts, we would need to be more cautious in analyzing the results, and might have to aggregate smaller populations or areas into larger ones to increase reliability.
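For reference, the coefficient of variation can be computed directly from the published estimate and its margin of error: since ACS margins of error are at the 90% confidence level, the standard error is the MOE divided by 1.645, and the CV is the standard error as a percentage of the estimate. A sketch with hypothetical table and column names:

SELECT geoid,
ROUND(((broadband_moe / 1.645) / broadband_est * 100)::numeric, 1) AS cv
FROM county_broadband
WHERE broadband_est > 0;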

Wrap Up

The issue of the digital divide has gained more coverage in the news lately with the exploration of the geography of the “new economy”, and how technology-intensive industries are concentrating in certain major metros while bypassing smaller metros and rural areas. Lack of access to broadband internet and reliable wifi in rural areas and within older inner cities is one of the impediments to future economic growth in these areas.

You can download a shapefile with the data and results of the analysis described in this post.


Measuring Polygon Overlap in QGIS and PostGIS

I was helping someone with a project this semester where we wanted to calculate the overlap between two different polygon layers (postal code areas and grid cells) for over forty countries throughout the world. The process involved calculating the area of overlap and the percentage of total overlap between each postal area and grid cell. We began our experiment in QGIS and perfected the process, but ultimately failed because the software was not able to handle the large number of polygons: almost 2 million postal codes and over 60k grid cells. We turned to PostGIS, which was more efficient and able to do the job.

In this post I’ll outline the steps for calculating area and polygon overlap in both QGIS (as an example of desktop GIS software) and PostGIS (as an example of a spatial database); I’ll assume you have some familiarity with both. For this example I’ll use two layers from the Census Bureau’s TIGER Line Shapefiles: Congressional Districts (CDs) and ZIP Code Tabulation Areas (ZCTAs). We’ll calculate how ZCTAs overlap with CD boundaries.

Before we begin, I should say that overlap is a technical term for a specific type of spatial selection: overlapping features must share some interior space, and the geometry of one feature is not entirely enclosed within the geometry of the other. I am NOT using the term overlap in this technical sense here – I’m using it more generally to refer to features that share any interior space with another, including areas that are entirely enclosed within another (i.e. 100% overlap).

QGIS

Since we’re measuring areas, the first step is to reproject our layers to a projected coordinate system that preserves area (an equal area projection). If we were working in a local area we could use a UTM or (in the US) a State Plane Zone. For continents and large countries like the US we could use Albers Equal Area Conic. If we were working globally we could use Mollweide or a Cylindrical Equal Area projection. The US Census layers are in the geographic coordinate system NAD 83. To reproject them, we select each one in the layers panel, right click, and choose save as. Browse and save them as new files, hit the CRS button, search for North America Albers Equal Area (AEA), select it, and save the new layers in that system. In the map window we select one of the new layers, right click, and choose Set Project CRS from Layer to apply the new system to the map window.

Map of CDs and ZCTAs in NAD 83

Congressional Districts (red) and ZCTAs (orange) in NAD 83

Map of CDs and ZCTAs in AEA

Congressional Districts (red) and ZCTAs (orange) in North America Albers Equal Area Conic

Next, we need to create a new field where we calculate the area for the ZCTAs. The census layers already come with pre-calculated area attributes, but we’ll need to calculate our own. Open the attribute table for the ZCTAs and hit the field calculator button (looks like an abacus). In the menu we create a new field called totalarea and populate it with the expression:

$area * 0.00000038610

$area is a geometry function that calculates the area of each polygon. Since the AEA projection uses square meters as its unit, the area will be in square meters. Multiplying by this fraction gives us square miles (or if you prefer, divide by 1000000 to get square kilometers). It’s important that we set the field type to a real / decimal number and specify a meaningful length (total number of digits) and precision (number of digits right of the decimal place). A length of 20 and a precision of 5 gives us 15 places to the left of the decimal point and 5 to the right, which should be plenty. Hit Calculate, exit out of the edit mode, and save changes.

QGIS Field Calculator

Calculating area in the QGIS Field Calculator

Before calculating the overlap it’s a good idea to check the geometry of each layer to make sure all of the polygons are valid (i.e. properly constructed), otherwise we will run into errors. Use Vector – Geometry Tools – Check Validity to check geometry, and if anything is broken open the Processing Toolbox and search for the Fix Geometries tool. In this example both layers have valid geometry.

Use Vector – Geoprocessing – Union to meld the ZCTA and CD layers together. This will create unique polygons that consist of geometry occupied by a unique ZCTA and CD combination. So in instances where there is overlap between layers the polygon will be split into two (or more) pieces. See the image below, which illustrates CDs and ZCTAs before and after unioning in the Philadelphia area.

Congressional Disticts and ZCTAs in Philly

CDs and ZCTAs in Philly

ZCTAs in Philly after union with Congressional Districts

Split ZCTAs after union with Congressional Districts

Processing time will vary based on the number of features, their level of detail (nodes per polygon), the number of overlaps, and the number of attributes (columns) per layer. There are 444 CD features and about 33k ZCTAs. While these numbers aren’t huge, the polygons are very detailed and there is a fair amount of overlap: it took me approx 1.5 hours to run. To minimize processing time you could create copies of these layers, modify them by deleting attribute columns, and run the process on this modified layer. You should strip everything out except some unique identifiers and the totalarea field; you can always join the results back to the larger body of attributes later if you need them.

Once the process is complete, open the attribute table for the unioned layer and create a new calculated field called piecearea, where you calculate the area for these smaller pieces. At this stage you have what you need to calculate overlap: for these pieces you have columns with the total area of the original ZCTA and the area of this ZCTA piece that overlaps with a particular CD. You can add an additional calculated field called pct_in (length 5 precision 2) where you divide one by the other to get a percentage:

( “piecearea” / “totalarea” ) * 100

If a ZCTA record appears once in the table that means it’s fully inside one CD, and it should have a percentage of 100%. Otherwise it will appear multiple times, which means there is overlap and this will be reflected in the percentages. The output below is for ZCTAs 19138 through 19141 in Philadelphia, PA. Compare this to the maps above (these ZCTAs are located towards the center of the map). 19138 and 19139 are wholly within one CD, while 19140 and 19141 are split across two CDs. Unfortunately, QGIS doesn’t provide a simple way for hiding columns, so I can’t clearly represent the result in the image below – you’ll see a clearer picture from the PostGIS process. But you’ll end up with the attributes from both layers, so you can see what CD each ZCTA falls in.

QGIS Attribute Table

Attribute table with areas and percentages

PostGIS

The QGIS method is fine if you don’t have many polygons to calculate, but if you have a large number of features the process will either take a long time, or will crash (incidentally ArcGIS would be no different).

PostGIS to the rescue. For this approach, first you create a spatial database and activate the PostGIS extension with the command CREATE EXTENSION postgis. Then you can load the shapefiles into PostGIS using the shapefile loader that is bundled with PostGIS, or you could use the QGIS DB Manager to load them. During the import process you need to specify that the layers are in NAD 83 by specifying the correct EPSG code, changing the SRID from 0 to 4269.

PostGIS doesn’t have many global or continental projected coordinate system definitions, so we’ll have to add one for North America Albers Equal Area to its spatial reference table. A quick visit to Spatial Reference and a search for this system yields the definition, and we can get a PostGIS Insert statement that we can copy and paste into a SQL query window in our database. Before executing it, we have to change the SRID number in the statement from 9102008 to 102008 to avoid violating a check constraint that prevents IDs from being larger than 6 digits.
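The statement you copy from Spatial Reference will look roughly like this; the proj4 parameters below are from memory and the WKT is truncated, so paste the real thing from the site rather than retyping it. Note the SRID has already been changed to 102008.

INSERT INTO spatial_ref_sys (srid, auth_name, auth_srid, proj4text, srtext)
VALUES (102008, 'esri', 102008,
'+proj=aea +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs',
'PROJCS["North_America_Albers_Equal_Area_Conic", ... ]');  -- full WKT truncated here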

With the definition in place, we create a series of blank tables that will hold our two layers, and then run an insert statement where we take columns we want from the original tables and bring them into the new tables. In the course of doing this, we also transform the geometry from NAD 83 to Albers. At the end it’s important to create a spatial index on the geometry, as it will really speed up spatial selections.

BEGIN;

-- ZCTAs, transformed from NAD 83 to North America Albers (SRID 102008)
CREATE TABLE zctas_aea (
zcta5 varchar(5) PRIMARY KEY,
geom geometry (Multipolygon, 102008)
);

INSERT INTO zctas_aea (zcta5, geom)
SELECT zcta5ce10, ST_Transform(geom, 102008)
FROM tl_2018_us_zcta510;

-- Spatial index to speed up the intersect queries
CREATE INDEX zctas_aea_geom_gist
ON zctas_aea
USING gist (geom);

COMMIT;

BEGIN;

-- Congressional districts, also transformed to Albers
CREATE TABLE cds_aea (
geoid varchar(4) PRIMARY KEY,
statefp varchar(2),
name text,
session varchar(3),
geom geometry (Multipolygon, 102008)
);

INSERT INTO cds_aea (geoid, statefp, name, session, geom)
SELECT geoid, statefp, namelsad, cdsessn, ST_Transform(geom, 102008)
FROM tl_2018_us_cd116;

CREATE INDEX cds_aea_geom_gist
ON cds_aea
USING gist (geom);

COMMIT;

Once the data is inserted we can check the geometry validity with ST_IsValid, and if there is bad geometry we can fix it with another statement using ST_MakeValid, where IN contains identifiers for bad geometry discovered in the previous statement.

SELECT geoid, ST_IsValid(geom) AS notvalid,
ST_IsValidReason(geom) AS reason
FROM cds_aea
WHERE NOT ST_IsValid(geom);
UPDATE cds_aea
SET geom=ST_MakeValid(geom)
WHERE geoid IN (INSERT LIST OF IDS HERE);

We can execute the overlap operation with a single statement. PostGIS allows you to calculate area on the fly with the ST_Area function, and there are two functions for overlap: ST_Intersects acts as a spatial join that relates one layer to the other by selecting all features that intersect, while ST_Intersection returns the actual pieces of each feature’s geometry that intersect. This example is just for Pennsylvania, which we select using the state FIPS code ’42’ from the CD layer. It’s a good idea to get the statement right on a sample of records before executing it on the entire set. The double colons are a PostgreSQL shortcut for casting data types from one type to another. This is necessary when using the ROUND function to produce a non-integer result, as ROUND can’t round the double precision numbers produced by ST_Area to a fixed number of decimal places, so we cast them to numeric first.

SELECT z.zcta5 AS zcta, c.geoid AS cd, c.name AS cdname,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) *  0.00000038610)::numeric,2) AS area_piece,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) / ST_Area(z.geom) * 100)::numeric,1) AS pct_in
FROM zctas_aea z, cds_aea c
WHERE ST_Intersects(z.geom, c.geom) AND c.statefp = '42'
ORDER BY z.zcta5, c.geoid, pct_in DESC;

This statement took me about 20 seconds to run. The results (see below) include several records that QGIS didn’t return, where the area and overlap are 0, either due to a vanishingly small area of overlap that rounds to zero or to the strict interpretation of intersects (which also includes features that merely touch). While there is an ST_Overlaps function, it will not return geometries where one geometry is completely contained within another (so we can’t use that). For example, ZCTAs 19138 and 19139 each fall within one district, but there are two records for them: one with a 100% value and another with a 0% value.

Query results in PostgreSQL

Result of intersect operations and area calculations in pgAdmin / PostGIS

We can toss these records either by deleting them from the final result when the process is finished, or by adding another condition to our WHERE clause to filter them out:

AND ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) *  0.00000038610)::numeric,2) > 0

This lengthened the execution time to 30 seconds and dropped the number of records from 2,523 to 2,061.

Once the statement looks good, we can drop the AND filter for Pennsylvania and generate a result for the entire country. Using pgAdmin 4 we can write the result directly out as a CSV. Or, you can preface the statement with CREATE VIEW overlap AS to save the statement as a query which you can call up any time. Or, you can preface the statement with CREATE TABLE overlap AS and the result of the query will be saved in a new table. This takes longer than the other two options, but gives you the ability to query and modify the resulting table. Exporting the table out as a CSV can be accomplished quickly, giving you the best of options 1 and 3. The final code and result is shown below.

CREATE TABLE zcta_cd_overlap AS
SELECT z.zcta5 AS zcta, c.geoid AS cdistrict, c.name AS cdname,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) *  0.00000038610)::numeric,2) AS area_piece,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) / ST_Area(z.geom) * 100)::numeric,1) AS pct_in
FROM zctas_aea z, cds_aea c
WHERE ST_Intersects(z.geom, c.geom) AND
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) *  0.00000038610)::numeric,2) > 0
ORDER BY z.zcta5, c.geoid, pct_in DESC;
Final PostGIS Result

Final Result in PostGIS / pgAdmin
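If you would rather script the export than use the pgAdmin dialog, a plain COPY statement will dump the table to a CSV, assuming the database server can write to the path you give it (from psql, the client-side \copy variant works the same way). The output path below is just an example.

COPY zcta_cd_overlap TO '/tmp/zcta_cd_overlap.csv' WITH (FORMAT CSV, HEADER);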

Conclusion – which is best?

I’m using a 64-bit Lenovo Thinkpad laptop that has 4 Intel processors at 2.3 GHz and 8 gigs of memory. I’m running Xubuntu 18.04 and am using QGIS 3.4 Madeira, PostgreSQL 10, PostGIS 2.4, and pgAdmin 4. With 444 CDs and 33k ZCTAs it took me over 1.5 hours to run the union operation in QGIS, and that’s without altering the attribute tables to delete unnecessary columns. Executing the PostGIS statement for the whole country, simply writing the output to the screen with the filter to exclude areas of 0, took only 12 minutes. Writing the result to a new table took 22 minutes.

For the larger project that I mentioned at the beginning of this post, neither QGIS nor ArcGIS was able to complete the union process between 2 million polygons and 60k grid areas without crashing, even when we reduced the number of attribute columns to a bare minimum. It took PostGIS about 50 minutes to execute the overlap query and print the output to the screen or directly to a CSV, and about 3 hours to write the results to a new table.

I think the PostGIS approach is more straightforward and gives you more control over the process. There’s no need to calculate area in advance or to delete attribute columns, as you can simply choose to include or exclude the ones you want. Finding and fixing invalid geometry in PostGIS is simpler, and the process is faster to execute. Rest assured you can handle layers with large numbers of features. I’ve wondered if the problems with QGIS and ArcGIS might be mitigated by using something other than a shapefile, like the newer geopackage format which is built on SQLite. I have no idea, but it would be worth trying if you really wanted or needed to go the desktop GIS route for large files.


Exploring New Worlds in Factorio

The first draft is finished and I sent my book off for review earlier this month, and I’ve been back to work full-time for two months now. It’s been a difficult transition, so I thought I’d write a more lighthearted post this month about imaginary geographic worlds (as luck would have it, the Geo NYC Meetup group is discussing fictional mapping next week).

I’ve always enjoyed top-down simulation games; I still have my original copy of SimCity from 1989, in the box with the diskettes. More recently, I started playing a top-down, world-exploration, operations management, logistical simulator game called Factorio. The premise is you are the sole survivor of a team of scientists and engineers who have crash landed on an unexplored world. Using the scrap metal of your ship, a few simple tools, and the abundant resources on the planet, your goal is to build a rocket to launch a satellite into space to alert the crew of a successive spaceship of your presence. Scattered across the planet are concentrations of resources: water, trees (for wood), stone, iron ore, copper ore, oil, coal, and uranium. With an ax and a few scavenged plates from the ship, you begin by building a stone furnace for smelting metals. You use your ax to mine some stone to build the furnace, some iron for smelting, and some coal for fuel. Once you’ve smelted some metal you can construct a drill to mine the materials and insert them into the smelter automatically.

Smelting the ore converts it into refined material: stone to bricks, iron ore to iron plates, and copper ore to copper plates. Initially you can take these materials and manually craft them to make products: iron plates become iron gear wheels, copper plates become copper wire, which in turn can be crafted to create higher order parts like electronic circuits and finished machine products. Ultimately you’ll construct assembly plants that take the necessary materials and build the products for you, and the outputs can be used as inputs for other products.

Factorio mining drills

Mining ore

Factorio smelters

Smelting ore to plates

Factorio assemblers

Assembling products

The game becomes a logistical puzzle, where you mine ores from various deposits and move them to be smelted, and then move the refined materials to different assembly plants to create higher-order products. You transport everything using conveyor belts and inserters, which grab materials from belts and insert them into the smelters, assemblers, and other structures. You construct pumps, boilers, and steam engines powered by coal or wood to generate electricity to power the entire factory, and in order to keep developing higher-order goods you combine certain materials to produce “science”; little colorful beakers of liquid that you move on belts to laboratories to keep research humming.

As the game and your research progresses you develop technology that allows you to better explore the world and access additional resources, as you’ll eventually deplete the original deposits near the factory. You can develop solar and nuclear power as cleaner electricity alternatives, drill and refine oil to create fuels and plastics, build cars to explore the landscape, and construct railroads to transport more distant materials to your base. As your factory expands you have to grapple with the logistical hurdles of moving products created at disparate ends of the plant together in order to create new products, forcing you to either plan ahead or reconfigure your layout as time passes (or build some drones to fly the materials around). The clock is always ticking as the game is played in real time (it’s not turn-based).

Factorio factory

Main Factorio screen showing portion of a factory and map layout

At some point you face a new problem: you are not alone on this planet. There are some large, scary-looking insect creatures living there that don’t like all the pollution that’s coming from your factory, and they don’t particularly like you. Once they become irritated enough they attack and chew up your factory, and you along with it. Sadly there is no negotiating with them (they’re not sentient), so some of your attention and resources must be spent on weapons. You can take a purely defensive approach, building walls and gun turrets to protect your base as well as armor and shields to protect yourself. Or you can barrel out in a tank or use artillery to destroy them as they encroach on your operations. You can also develop cleaner, less polluting technologies to irritate them less.

An additional challenge is that the game keeps changing. Even though it’s been out for several years Factorio is still in a beta phase, but given its maturity and update cycle it’s super stable. The developers are part of a small company in the Czech Republic that focuses primarily on this game. Factorio is available for purchase via their website and via Steam for all operating systems, and has been downloaded over a million times. The fanatical fan base appreciates the ability to mod practically every aspect of the game, and they form a community that’s crucial to the game’s development through testing and feedback. Factorio is definitely a member of a new generation of games where part of the challenge is learning how to play it. I’ve crawled through the extensive wiki, scoured Reddit for advice, and watched several YouTube series to figure out how it works.

Regardless of how many times you’ve played it, there is always something new to tinker with. Many players enjoy the engineering and mathematical side of the game. Their goal is to build the most efficient system, perfectly balance inputs and outputs, and create the best ratios for production. Others go for scale, building the largest possible factory with the most throughput. There are railroaders who enjoy building the trains, and warriors who focus on combat with the voracious bugs. Beyond building the rocket, the game has a number of challenges that players attempt to master, and it can be played solo or multiplayer for gamers who want to work together or simply explore each other’s layouts and solutions.

As a geographer, I enjoy the actual worlds themselves and the unique challenge each environment presents. While you can create blueprints and use the same design for a railroad station or solar power generator over and over again, you’re forced to change your overall factory layout based on the location of resources and configuration of the terrain. Prior to launching a new game, you specify the general size, frequency, and richness of resources, trees, water, and enemy bugs, and you can keep generating maps until you find one you like. While many of the efficiency aficionados want flat playing surfaces, I enjoy the complexity of fitting your factory in around the oceans and forests, and the challenge of exploring and shipping in materials from far flung places.

The world itself is quite beautiful. The developers provide extensive details about the development and inner-workings of the game, including the processes for generating logical and realistic looking landscapes. There are lush deciduous forests in vibrant autumn colors, desert wastelands strewn with rocks, and clusters of baobab-like trees on the dry plains. Even though they’re just bits and bytes I limit what I harvest, because I hate chopping them down. Unlike the real world, mining ore is much less destructive and the material is simply scraped off the surface, leaving unblemished soil behind. A finite portion of the world is visible on the map when you begin the game, and the surrounding area is cloaked in darkness. You can reveal more of the terrain by building radar stations at your base, and can explore on foot or go further afield once you’ve constructed vehicles. The world has no end, and stretches into infinity.

Factorio forest landscape

Factorio has sparked my curiosity in unexpected ways. As I’m mining ores and moving them into smelters to produce metals, I started to wonder: what is smelting anyway? How do you actually extract metals from rocks? My exposure to chemistry was limited to my junior year of high school where I struggled with balancing formulas and memorizing the periodic table. Fortunately I discovered some fascinating books and videos that made the subject engaging. Material scientist Mark Miodownik’s Stuff Matters: Exploring the Marvelous Materials That Shape Our Man-Made World, is an accessible, informative, and often hilarious exploration of the materials we use everyday. You’ll learn the basic chemistry behind paper, iron, ceramics, even chocolate! Periodic Tales: A Cultural History of the Elements, from Arsenic to Zinc by Hugh Aldersey-Williams is perfect if you want to learn some basics about chemistry and material science from a historical science perspective. NOVA aired a solid three-part series a few years back called Treasures of the Earth that revealed the secrets behind gems, metals, and power sources.

I resisted the temptation to play for the year I was on sabbatical, as it’s too easy to get sucked in. A few hours here and there throughout a month, and by the time I launch that rocket into space 30 hours have gone by! Initially I feel a bit guilty, sinking so much time into a game. But when you consider how much time the average person spends watching TV or looking at stupid stuff on their phone (4 hours and 2.5 hours respectively, EVERY DAY!), enjoying the occasional game that challenges your mind and sparks your imagination is a good alternative. Similar to the Minecraft phenomenon, I think it has great potential as an educational tool for learning about logistics, planning, geology and materials science, and engineering. And for the geographers out there, there are infinite worlds to explore.

Factorio desert landscape


Using the ACS to Calculate Daytime Population

I’m in the home stretch for getting the last chapter of the first draft of my census book completed. The next to last chapter of the book provides an overview of a number of derivatives that you can create from census data, and one of them is the daytime population.

There are countless examples of using census data for site selection analysis and for comparing and ranking places for locating new businesses, providing new public services, and generally measuring potential activity or population in a given area. People tend to forget that census data measures people where they live. If you were trying to measure service or business potential for residents, the census is a good source.

Counts of residents are less meaningful if you want to gauge how crowded or busy a place is during the day. The population of an area changes during the day as people leave their homes to go to work or school, or go shopping or participate in social activities. Given the sharp divisions in the US between residential, commercial, and industrial uses created by zoning, residential areas empty out during weekdays as people travel into the other two zones, and then fill up again at night when people return. Some places function as job centers, others serve as bedroom communities, and still others are a mixture of the two.

The Census Bureau provides recommendations for calculating daytime population using a few tables from the American Community Survey (ACS). These tables capture where workers live and work, which is the largest component of the daytime population.

Using these tables from the ACS:

Total resident population
  • B01003: Total Population

Total workers living in area, and workers who lived and worked in the same area
  • B08007: Sex of Workers by Place of Work–State and County Level (‘Total:’ line and ‘Worked in county of residence’ line)
  • B08008: Sex of Workers by Place of Work–Place Level (‘Total:’ line and ‘Worked in place of residence’ line)
  • B08009: Sex of Workers by Place of Work–Minor Civil Division Level (‘Total:’ line and ‘Worked in MCD of residence’ line)

Total workers working in area
  • B08604: Total Workers for Workplace Geography

They propose two different approaches that lead to the same outcome. The simplest approach: add the total resident population to the total number of workers who work in the area, and then subtract the total resident workforce (workers who live in the area but may work inside or outside the area):

Daytime Population = Total Residents + Total Workers in Area - Total Resident Workers

For example, according to the 2017 ACS Washington DC had an estimated 693,972 residents (from table B01003), 844,345 (+/- 11,107) people who worked in the city (table B08604), and 375,380 (+/- 6,102) workers who lived in the city. We add the total residents and total workers, and subtract the total workers who live in the city. The subtraction allows us to avoid double counting the residents who work in the city (as they are already included in the total resident population) while omitting the residents who work outside the city (who are included in the total resident workers). The result:

693,972 + 844,345 - 375,380 = 1,162,937

And to get the new margin of error:

SQRT(0^2 + 11,107^2 + 6,102^2) = 12,673

So the daytime population of DC is approx 468,965 people (68%) higher than its resident population. The district has a high number of jobs in the government, non-profit, and education sectors, but has a limited amount of expensive real estate where people can live. In contrast, I did the calculation for Philadelphia and its daytime population is only 7% higher than its resident population. Philadelphia has a much higher proportion of resident workers relative to total workers. Geographically the city is larger than DC and has more affordable real estate, and faces stiffer suburban competition for private sector jobs.
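If you are doing this for many areas at once, the same arithmetic is easy to script. Here is a sketch in SQL against a hypothetical input table holding the three ACS estimates and their margins of error for each area:

SELECT geoid,
-- daytime population = residents + workers in area - resident workers
total_pop + workers_in_area - resident_workers AS daytime_pop,
-- combined MOE = square root of the sum of the squared MOEs (0 for DC's controlled total population)
ROUND(SQRT(POWER(total_pop_moe, 2) + POWER(workers_in_area_moe, 2) + POWER(resident_workers_moe, 2))) AS daytime_pop_moe
FROM acs_daytime_inputs;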

The variables in the tables mentioned above are also cross-tabulated in other tables by age, sex, race, Hispanic origin, citizenship status, language, poverty, and tenure, so it’s possible to estimate some characteristics of the daytime population. Margins of error will limit the usefulness of estimates for small population groups, and overall the 5-year period estimates are a better choice for all but the largest areas. Data for workers living in an area who lived and worked in the same area is reported for states, counties, places (incorporated cities and towns), and minor civil divisions (MCDs) for the states that have them.

Data for the total resident workforce is available for other, smaller geographies, but the live-and-work-in-the-same-area component refers only to those larger places: we know how many people in a census tract live and work in their county or place of residence, but not how many live and work in their tract of residence. In contrast, data on the number of workers from B08604 is not available for smaller geographies at all, which limits the application of this method to larger areas.

Download or explore these ACS tables from your favorite source: the American Factfinder, the Census Reporter, or the Missouri Census Data Center.