I was helping someone with a project this semester where we wanted to calculate overlap between two different polygon layers (postal code areas and grid cells) for over forty countries throughout the world. The process involved calculating the area of overlap and percentage of total overlap between each postal area and grid cell. We began our experiment in QGIS and perfected the process, but ultimately failed because the software was not able to handle the large number of polygons: almost 2 million postal codes and over 60k grid cells. Ultimately we employed PostGIS, which was more efficient and able to do the job.
In this post I’ll outline the steps for calculating area and polygon overlap in both QGIS (as an example of desktop GIS software) and PostGIS (as an example of a spatial database); I’ll assume you have some familiarity with both. For this example I’ll use two layers from the Census Bureau’s TIGER Line Shapefiles: Congressional Districts (CDs) and ZIP Code Tabulation Areas (ZCTAs). We’ll calculate how ZCTAs overlap with CD boundaries.
Before we begin, I should say that overlap is a technical term for a specific type of spatial selection. Overlapping features must share some interior space, and the geometry of one feature is not entirely enclosed within the geometry of another. I am NOT using the term overlap in this technical sense here – I’m using it more generally to refer to features that share any interior space with another, including areas that are entirely enclosed with another (i.e. 100% overlap).
Since we’re measuring areas, the first step is to reproject our layers to a projected coordinate system that preserves area (an equal area projection). If we were working in a local area we could use a UTM or (in the US) a State Plane Zone. For continents and large countries like the US we could use Albers Equal Area Conic. If we were working globally we could use Mollweide or a Cylindrical Equal Area projection. The US Census layers are in the geographic coordinate system NAD 83. To reproject them, we select each one in the layers panel, right click, and choose save as. Browse and save them as new files, hit the CRS button, search for North America Albers Equal Area (AEA), select it, and save the new layers in that system. In the map window we select one of the new layers, right click, and choose Set Project CRS from Layer to apply the new system to the map window.
Next, we need to create a new field where we calculate the area for the ZCTAs. The census layers already come with pre-calculated area attributes, but we’ll need to calculate our own. Open the attribute table for the ZCTAs and hit the field calculator button (looks like an abacus). In the menu we create a new field called areatotal and populate it with the expression:
$area * 0.00000038610
$area is a geometry function that calculates the area of each polygon. Since the AEA projection uses square meters as its unit, the area will be in square meters. Multiplying by this fraction gives us square miles (or if you prefer, divide by 1000000 to get square kilometers). It’s important that we set the field type to a real / decimal number and specify a meaningful length (total number of digits) and precision (number of digits right of the decimal place). A length of 20 and a precision of 5 gives us 15 places to the left of the decimal point and 5 to the right, which should be plenty. Hit Calculate, exit out of the edit mode, and save changes.
Before calculating the overlap it’s a good idea to check the geometry of each layer to make sure all of the polygons are valid (i.e. properly constructed), otherwise we will run into errors. Use Vector – Geometry Tools – Check Validity to check geometry, and if anything is broken open the Processing box and search for the Fix Geometry Tool. In this example both layers have valid geometry.
Use Vector – Geoprocessing – Union to meld the ZCTA and CD layers together. This will create unique polygons that consist of geometry occupied by a unique ZCTA and CD combination. So in instances where there is overlap between layers the polygon will be split into two (or more) pieces. See the image below, which illustrates CDs and ZCTAs before and after unioning in the Philadelphia area.
Processing time will vary based on the number of features, their level of detail (nodes per polygon), the number of overlaps, and the number of attributes (columns) per layer. There are 444 CD features and about 33k ZCTAs. While these numbers aren’t huge, the polygons are very detailed and there is a fair amount of overlap: it took me approx 1.5 hours to run. To minimize processing time you could create copies of these layers, modify them by deleting attribute columns, and run the process on this modified layer. You should strip everything out except some unique identifiers and the totalarea field; you can always join the results back to the larger body of attributes later if you need them.
Once the process is complete, open the attribute table for the unioned layer and create a new calculated field called piecearea, where you calculate the area for these smaller pieces. At this stage you have what you need to calculate overlap: for these pieces you have columns with the total area of the original ZCTA and the area of this ZCTA piece that overlaps with a particular CD. You can add an additional calculated field called pct_in (length 5 precision 2) where you divide one by the other to get a percentage:
( “piecearea” / “totalarea” ) * 100
If a ZCTA record appears once in the table that means it’s fully inside one CD, and it should have a percentage of 100%. Otherwise it will appear multiple times, which means there is overlap and this will be reflected in the percentages. The output below is for ZCTAs 19138 through 19141 in Philadelphia, PA. Compare this to the maps above (these ZCTAs are located towards the center of the map). 19138 and 19139 are wholly within one CD, while 19140 and 19141 are split across two CDs. Unfortunately, QGIS doesn’t provide a simple way for hiding columns, so I can’t clearly represent the result in the image below – you’ll see a clearer picture from the PostGIS process. But you’ll end up with the attributes from both layers, so you can see what CD each ZCTA falls in.
The QGIS method is fine if you don’t have many polygons to calculate, but if you have a large number of features the process will either take a long time, or will crash (incidentally ArcGIS would be no different).
PostGIS to the rescue. For this approach, first you create a spatial database and activate the PostGIS extension with the command CREATE EXTENSION postgis. Then you can load the shapefiles into PostGIS using the shapefile loader that is bundled with PostGIS, or you could use the QGIS DB Manager to load them. During the import process you need to specify that the layers are in NAD 83 by specifying the correct EPSG code, changing the SRID from 0 to 4269.
PostGIS doesn’t have many global or continental projected coordinate system definitions, so we’ll have to add one for North America Albers Equal Area to its spatial reference table. A quick visit to Spatial Reference and a search for this system yields the definition, and we can get a PostGIS Insert statement that we can copy and paste into a SQL query window in our database. Before executing it, we have to change the SRID number in the statement from 9102008 to 102008 to avoid violating a check restraint that prevents IDs from being larger than 6 digits.
With the definition in place, we create a series of blank tables that will hold our two layers, and then run an insert statement where we take columns we want from the original tables and bring them into the new tables. In the course of doing this, we also transform the geometry from NAD 83 to Albers. At the end it’s important to create a spatial index on the geometry, as it will really speed up spatial selections.
BEGIN; CREATE TABLE zctas_aea ( zcta5 varchar(5) PRIMARY KEY, geom geometry (Multipolygon, 102008) ); INSERT INTO zctas_aea (zcta5, geom) SELECT zcta5ce10, ST_Transform(geom, 102008) FROM tl_2018_us_zcta510; CREATE INDEX zctas_aea_geom_gist ON zctas_aea USING gist (geom); COMMIT;
BEGIN; CREATE TABLE cds_aea ( geoid varchar(4) PRIMARY KEY, statefp varchar(2), name text, session varchar(3), geom geometry (Multipolygon, 102008) ); INSERT INTO cds_aea (geoid, statefp, name, session, geom) SELECT geoid, statefp, namelsad, cdsessn, ST_Transform(geom, 102008) FROM tl_2018_us_cd116; CREATE INDEX cds_aea_geom_gist ON cds_aea USING gist (geom); COMMIT;
Once the data is inserted we can check the geometry validity with ST_IsValid, and if there is bad geometry we can fix it with another statement using ST_MakeValid, where IN contains identifiers for bad geometry discovered in the previous statement.
SELECT geoid, ST_IsValid(geom) AS notvalid, ST_IsValidReason(geom) AS reason FROM cds_aea WHERE NOT ST_IsValid(geom);
UPDATE cds_aea SET geom=ST_MakeValid(geom) WHERE geoid IN (INSERT LIST OF IDS HERE);
We can execute the overlap operation with a single statement. PostGIS allows you to calculate area on the fly with the ST_Area function, and there are two functions for overlap: ST_Intersects acts as a spatial join that relates one layer to the other by selecting all features that Intersect, while ST_Intersection selects the actual pieces of each feature’s geometry that intersect. This example is just for Pennsylvania, which we select using the state FIPS code ’42’ from the CD layer. It’s a good idea to get the statement right on a sample of records before executing it on the entire set. The double colons are a PostgreSQL shortcut for casting data types from one type to the other. This is necessary when using the ROUND function to produce a non-integer result (as ROUND can’t be used to round real decimal numbers produced from the AREA function to a fixed number of decimal places).
SELECT z.zcta5 AS zcta, c.geoid AS cd, c.name AS cdname, ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) * 0.00000038610)::numeric,2) AS area_piece, ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) / ST_Area(z.geom) * 100)::numeric,1) AS pct_in FROM zctas_aea z, cds_aea c WHERE ST_Intersects(z.geom, c.geom) AND c.statefp = '42' ORDER BY z.zcta5, c.geoid, pct_in DESC;
This statement took me about 20 seconds to run. The results (see below) include several records that QGIS didn’t return, where the area and overlap is 0, either due to an infinitely small area of overlap that rounds to zero or strict interpretation of intersect (which includes areas that overlap and touch). While there is an ST_Overlap function, it will not return geometries where one geometry is completely contained within another (so we can’t use that). For example, ZCTAs 19138 and 19139 appear within one district but there are two records for them, one with a 100% value and another with a 0% value.
We can toss these records by either deleting them from the final result when the process is finished, or we can add another statement to our WHERE clause to filter them out:
AND ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) * 0.00000038610)::numeric,2) > 0
This lengthened the execution time to 30 seconds and dropped the number of records from 2,523 to 2,061.
Once the statement looks good, we can drop the AND filter for Pennsylvania and generate a result for the entire country. Using pgAdmin 4 we can write the result directly out as a CSV. Or, you can preface the statement with CREATE VIEW overlap AS to save the statement as a query which you can call up any time. Or, you can preface the statement with CREATE TABLE overlap AS and the result of the query will be saved in a new table. This takes longer than the other two options, but gives you the ability to query and modify the resulting table. Exporting the table out as a CSV can be accomplished quickly, giving you the best of options 1 and 3. The final code and result is shown below.
CREATE TABLE zcta_cd_overlap AS SELECT z.zcta5 AS zcta, c.geoid AS cdistrict, c.name AS cdname, ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) * 0.00000038610)::numeric,2) AS area_piece, ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) / ST_Area(z.geom) * 100)::numeric,1) AS pct_in FROM zctas_aea z, cds_aea c WHERE ST_Intersects(z.geom, c.geom) AND ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) * 0.00000038610)::numeric,2) > 0 ORDER BY z.zcta5, c.geoid, pct_in DESC;
Conclusion – which is best?
I’m using a 64-bit Lenovo Thinkpad laptop that has 4 Intel processors at 2.3Ghz and 8 gigs of memory. I’m running Xubuntu 18.04 and am using QGIS 3.4 Madeira, PostgreSQL 10, PostGIS 2.4, and pgAdmin 4. With 444 CDs and 33k ZCTAs it took me over 1.5 hours to run the union operation in QGIS, and that’s without altering the attribute tables to delete unnecessary columns. Executing the PostGIS statement, simply writing the output to the screen with the caveat to exclude areas with 0, took only 12 minutes. Writing the result to a new table took 22 minutes.
For the larger project that I mentioned at the beginning of this post, neither QGIS nor ArcGIS was able to complete the union process between 2 million polygons and 60k grid areas without crashing, even when we reduced the number of attribute columns to a bare minimum. It took PostGIS about 50 minutes to execute the overlap query and print the output to the screen or directly to a CSV, and about 3 hours to write the results to a new table.
I think the PostGIS approach is more straightforward and gives you more control over the process. There’s no need calculate area in advance or to delete attribute columns, as you can simply choose to include or exclude the ones you want. Finding and fixing invalid geometry in PostGIS is simpler, and the process is faster to execute. Rest assured you can handle layers with large numbers of features. I’ve wondered if the problems with QGIS and ArcGIS might be mitigated by using something other than a shapefile, like the newer geopackage format which is built on SQLite. I have no idea but it would be worth trying if you really wanted or needed to go the desktop GIS route for large files.