# Calculating Mean Income for Groups of Geographies with Census ACS Data

When aggregating small census geographies to larger ones (census tracts to neighborhoods for example) when you’re working with American Community Survey (ACS) data, you need to sum estimates and calculate new margins of error. This is straightforward for most estimates; you simply sum them, and take the square root of the sum of squares for the margins of error (MOEs) for each estimate that you’re aggregating. But what if you need to group and summarize derived estimates like means or medians? In this post, I’ll demonstrate how to calculate mean household income by aggregating ZCTAs to United Hospital Fund neighborhoods (UHF), which is a type of public health area in NYC created by aggregating ZIP Codes.

I’m occasionally asked how to summarize median household income from tracts to neighborhood-like areas. You can’t simply add up the medians and divide them, the result would be completely erroneous. Calculating a new median requires us to sort individual household-level records and choose the middle-value, which we cannot do as those records are confidential and not public. There are a few statistical interpolation methods that we can use with interval data (number of households summarized by income brackets) to estimate a new median and MOE, but the calculations are rather complex. The State Data Center in California provides an excellent tutorial that demonstrates the process, and in my new book I’ll walk through these steps in the supplemental material.
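
If you're curious what the interpolation involves, below is a minimal Python sketch of the basic grouped-data (linear interpolation) approach, using made-up income brackets. The full ACS procedure covered in the California tutorial also derives a margin of error and is considerably more involved.

```
# Minimal sketch: estimate a median from grouped (bracketed) data using
# linear interpolation. The brackets and household counts are hypothetical;
# the full ACS procedure (including the MOE) is more involved.

# (lower bound, upper bound, number of households)
brackets = [
    (0, 24999, 150),
    (25000, 49999, 320),
    (50000, 74999, 410),
    (75000, 99999, 280),
    (100000, 199999, 240),
]

total = sum(count for lo, hi, count in brackets)
half = total / 2

cumulative = 0
for lo, hi, count in brackets:
    if cumulative + count >= half:
        # interpolate within the bracket that contains the middle household
        width = hi - lo + 1
        median = lo + ((half - cumulative) / count) * width
        break
    cumulative += count

print(round(median))
```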

While a mean isn’t as desirable as a median (as it can be skewed by outliers), it’s much easier to calculate. The ACS includes tables on aggregate income, including the sum of all income earned by households and other population group (like families or total population). If we sum aggregate household income and number of households for our small geographic areas, we can divide the total income by total households to get mean income for the larger area, and can use the ACS formula for computing the MOE for ratios to generate a new MOE for the mean value. The Census Bureau publishes all the ACS formulas in a detailed guidebook for data users, and I’ll cover many of them in the ACS chapter of my book (to be published by the end of 2019).

### Calculating a Derived Mean in Excel

Let’s illustrate this with a simple example. I’ve gathered 5-year 2017 ACS data on number of households (table B11001) and aggregate household income (table B19025) by ZCTA, and constructed a sheet to correlate individual ZCTAs to the UHF neighborhoods they belong to. UHF 101 Kingsbridge-Riverdale in the Bronx is composed of just two ZCTAs, 10463 and 10471. We sum the households and aggregate income to get totals for the neighborhood. To calculate a new MOE, we take the square root of the sum of squares for each of the estimate’s MOEs:

Calculate margin of error for new sum

To calculate mean income, we simply divide the total aggregate household income by total households. Calculating the MOE is more involved. We use the ACS formula for derived ratios, where aggregate income is the numerator of the ratio and households is the denominator. We multiply the square of the ratio (mean income) by the square of the MOE of the denominator (households MOE), add that product to the square of the MOE of the numerator (aggregate income MOE), take the square root, and divide the result by the denominator (households):

```
=(SQRT((moe_ratio_numerator^2)+(ratio^2*moe_ratio_denominator^2))/ratio_denominator)
```

Calculate margin of error for ratio (mean income)

The 2013-2017 mean household income for UHF 101 is \$88,040, +/- \$4,223. I always check my math using the Cornell Program on Applied Demographics' ACS Calculator to make sure I didn't make a mistake.
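
If you'd rather script the check than build it in a spreadsheet, the two ACS formulas are easy to express in a few lines of Python. This is just a sketch: the ZCTA-level values below are placeholders, not the actual estimates for UHF 101.

```
from math import sqrt

def moe_sum(moes):
    # ACS MOE for a sum of estimates: square root of the sum of squared MOEs
    return sqrt(sum(m ** 2 for m in moes))

def moe_ratio(numerator, denominator, moe_num, moe_denom):
    # ACS MOE for a derived ratio (here, mean income = aggregate income / households)
    ratio = numerator / denominator
    return sqrt(moe_num ** 2 + (ratio ** 2) * (moe_denom ** 2)) / denominator

# hypothetical ZCTA-level inputs for a two-ZCTA neighborhood
households = [10000, 12000]
households_moe = [400, 450]
agg_income = [800000000, 1100000000]
agg_income_moe = [40000000, 55000000]

hh_total = sum(households)
inc_total = sum(agg_income)
hh_moe = moe_sum(households_moe)
inc_moe = moe_sum(agg_income_moe)

mean_income = inc_total / hh_total
mean_income_moe = moe_ratio(inc_total, hh_total, inc_moe, hh_moe)

print(round(mean_income), round(mean_income_moe))
```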

This is how it works in principle, but life is more complicated. When I downloaded this data I had number of households by ZCTA and aggregate household income by ZCTA in two different sheets, and the relationship between ZCTAs and UHFs in a third sheet. There are 42 UHF neighborhoods and 211 ZCTAs in the city, of which 182 are actually assigned to UHFs; the others have no household population. I won’t go into the difference between ZIP Codes and ZCTAs here, as it isn’t a problem in this particular example.

Tying them all together would require using the ZCTA in the third sheet in a VLOOKUP formula to carry over the data from the other two sheets. Then I’d have to aggregate the data to UHF using a pivot table. That would easily give me sum of households and aggregate income by UHF, but getting the MOEs would be trickier. I’d have to square them all first, take the sum of these squares when pivoting, and take the square root after the pivot to get the MOEs. Then I could go about calculating the means one neighborhood at a time.

Spreadsheet-wise there might be a better way of doing this, but I figured why do that when I can simply use a database? PostgreSQL to the rescue!

### Calculating a Derived Mean in PostgreSQL

In PostgreSQL I created three empty tables, one each for households, aggregate income, and the ZCTA to UHF relationships, and used pgAdmin to import the ZCTA-level data from CSVs into those tables (alternatively you could use SQLite instead of PostgreSQL, but you would need the optional math module installed, as SQLite can't do square roots on its own).

Portion of households table. A separate aggregate household income table is structured the same way, with income stored as bigint type.

Portion of the ZCTA to UHF relational table.

In my first run-through I simply joined the tables together using the 5-digit ZCTA to get the sum of households and aggregate incomes. I SUM the values for both and use GROUP BY to do the aggregation to UHF. In PostgreSQL the pipe-forward slash operator (|/) takes a square root. I sum the squares of each ZCTA MOE and take the root of the total to get the UHF MOEs. I omit ZCTAs that have zero households so they're not factored into the formulas:

```
SELECT z.uhf42_code, z.uhf42_name, z.borough,
SUM(h.households) AS hholds,
ROUND(|/(SUM(h.households_me^2))) AS hholds_me,
SUM(a.agg_hhold_income) AS agghholds_inc,
ROUND(|/(SUM(a.agg_hhold_income_me^2))) AS agghholds_inc_me
FROM zcta_uhf42 z, hsholds h, agg_income a
WHERE z.zcta=h.gid2 AND z.zcta=a.gid2 AND h.households !=0
GROUP BY z.uhf42_code, z.uhf42_name, z.borough
ORDER BY uhf42_code;
```

Portion of query result, households and income aggregated from ZCTA to UHF district.

Once that was working, I modified the statement to calculate mean income. Calculating the MOE for the mean looks pretty rough, but it’s simply because we have to repeat the calculation for the ratio over again within the formula. This could be avoided if we turned the above query into a temporary table, and then added two columns and populated them with the formulas in an UPDATE – SET statement. Instead I decided to do everything in one go, and just spent time fiddling around to make sure I got all the parentheses in the right place. Once I managed that, I added the ROUND function to each calculation:

```
SELECT z.uhf42_code, z.uhf42_name, z.borough,
SUM(h.households) AS hholds,
ROUND(|/(SUM(h.households_me^2))) AS hholds_me,
SUM(a.agg_hhold_income) AS agghholds_inc,
ROUND(|/(SUM(a.agg_hhold_income_me^2))) AS agghholds_inc_me,
ROUND(SUM(a.agg_hhold_income) / SUM(h.households)) AS hhold_mean_income,
ROUND((|/ (SUM(a.agg_hhold_income_me^2) + ((SUM(a.agg_hhold_income)/SUM(h.households))^2 * SUM(h.households_me^2)))) / SUM(h.households)) AS hhold_meaninc_me
FROM zcta_uhf42 z, hsholds h, agg_income a
WHERE z.zcta=h.gid2 AND z.zcta=a.gid2 AND h.households !=0
GROUP BY z.uhf42_code, z.uhf42_name, z.borough
ORDER BY uhf42_code;
```

Query in pgAdmin and portion of result for calculating mean household income

I chose a couple of examples, one where a UHF had only one ZCTA and another that had two, and tested them in the Cornell ACS calculator to ensure the results were correct. Once I got it right, I added:

`CREATE VIEW household_sums AS`

to the top of the statement and executed it again to save it as a view. Mission accomplished! To make doubly sure that the values were correct, I connected my db to QGIS and joined this view to a UHF shapefile to visually verify that the results made sense (I could also have imported the shapefile into the DB as a spatial table and incorporated it into the query).

Mean household income by UHF neighborhood in QGIS

### Conclusion

While it would be preferable to have a median, calculating a new mean for an aggregated area is a fair alternative if you simply need some summary value for the variable and don't have the time to spend doing statistical interpolation. Besides income, the Census Bureau also publishes aggregate tables for other variables like travel time to work, hours worked, number of vehicles, rooms, rent, home value, and various subsets of income (earnings, wages or salary, interest and dividends, social security, public assistance, etc.) that make it possible to calculate new means for aggregated areas. Just make sure you use the appropriate denominator, whether it's total population, households, owner or renter occupied housing units, etc.

# Mapping US Census Data on Internet Access

## ACS Data on Computers and the Internet

The Census Bureau recently released the latest five-year period estimates from the American Community Survey (ACS), with averages covering the years from 2013 to 2017.

Back in 2013 the Bureau added new questions to the ACS on computer and internet use: does a household have a computer or not, and if yes what type (desktop or laptop, smartphone, tablet, or other), and does a household have an internet subscription or not, and if so what kind (dial-up, broadband, and type of broadband). 1-year averages for geographies with 65,000 people or more have been published since 2013, but now that five years have passed there is enough data to publish reliable 5-year averages for all geographies down to the census tract level. So with this 2013-2017 release we have complete coverage for computer and internet variables for all counties, ZCTAs, places (cities and towns), and census tracts for the first time.

Summaries of this data are published in table S2801, Types of Computers and Internet Subscriptions. Detailed tables are numbered B28001 through B28010 and are cross-tabulated with each other (presence of computer and type of internet subscription) and by age, educational attainment, labor force status, and race. You can access them all via the American Factfinder or the Census API, or from third-party sites like the Census Reporter. The basic non-cross-tabbed variables have also been incorporated into the Census Bureau’s Social Data Profile table DP02, and in the MCDC Social profile.
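
If you prefer pulling the data programmatically, the Census API returns these tables as JSON. A rough Python sketch is below; note that the two variable codes are my assumptions for table B28002 (total households and broadband of any type), so verify them against the API's variable list before relying on them.

```
import requests

# 2013-2017 ACS 5-year endpoint; the variable codes below are assumptions,
# check https://api.census.gov/data/2017/acs/acs5/variables.html for the exact
# IDs for total households and broadband subscriptions in table B28002.
url = "https://api.census.gov/data/2017/acs/acs5"
params = {
    "get": "NAME,B28002_001E,B28002_004E",  # assumed: total households, broadband of any type
    "for": "county:*",
}

resp = requests.get(url, params=params)
rows = resp.json()
header, data = rows[0], rows[1:]

# print the first few counties as dictionaries
for row in data[:5]:
    print(dict(zip(header, row)))
```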

The Census Bureau issued a press release that discusses trends for median income, poverty rates, and computer and internet use (addressed separately) and created maps of broadband subscription rates by county (I've inserted one below). According to their analysis, counties that were mostly urban had higher average rates of access to broadband internet (75% of all households) relative to mostly rural counties (65%) and completely rural counties (63%). Approximately 88% of all counties that had subscription rates below 60 percent were mostly or completely rural.

[Source: U.S. Census Bureau]

Not surprisingly, counties with lower median incomes were also associated with lower rates of subscription. Urban counties with median incomes above \$50,000 had an average subscription rate of 80% compared to 71% for completely rural counties. Mostly urban counties with median incomes below \$50k had average subscription rates of 70% while completely rural counties had an average rate of 62%. In short, wealthier rural counties have rates similar to less wealthy urban counties, while less wealthy rural areas have the lowest rates of all. There also appear to be some regional clusters of high and low broadband subscriptions. Counties within major metro areas stand out as clusters with higher rates of subscription, while large swaths of the South have low rates of subscription.

## Using GeoDa to Identify Broadband Clusters

I was helping a student recently with making LISA maps in GeoDa, so I quickly ran the data (percentage of households with subscription to any broadband service) through to see if there were statistically significant clusters. It’s been a couple years since I’ve used GeoDa and this version (1.12) is significantly more robust than the one I remember. It focuses on spatial statistics but has several additional applications to support basic data mapping and stats. The interface is more polished and the software can import and export a number of different vector and tabular file formats.

The Univariate Local Moran’s I analysis, also known as LISA for local indicators of spatial auto-correlation, identifies statistically significant geographic clusters of a particular variable. Once you have a polygon shapefile or geopackage with the attribute you want to study, you add it to GeoDa and then create a weights file (Tools menu) using the unique identifier for the shapes. The weights file indicates how individual polygons neighbor each other: queens contiguity classifies features as neighbors as long as they share a single node, while rooks contiguity classifies them as neighbors if they share an edge (at least two points that can form a line).

Once you’ve created and saved a weights file you can run the analysis (Shapes menu). You select the variable that you want to map, and can choose to create a cluster map, scatter plot, and significance map. The analysis generates 999 random permutations of your data and compares it to the actual distribution to evaluate whether clusters are likely the result of random chance, or if they are distinct and significant. Once the map is generated you can right click on it to change the number of permutations, or you can filter by significance level. By default a 95% confidence level is used.
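
If you'd rather script this than point and click, the same analysis can be run in Python with geopandas and the PySAL packages (libpysal for the weights, esda for the Local Moran statistic). A rough sketch is below; the file and column names are made up for illustration.

```
import geopandas as gpd
from libpysal.weights import Queen
import esda

# file name and column name are assumptions for illustration
counties = gpd.read_file("counties_broadband.shp")
y = counties["pct_broadband"].values

# queens contiguity: polygons sharing at least one node are neighbors
w = Queen.from_dataframe(counties)
w.transform = "r"  # row-standardize the weights

# Local Moran's I (LISA) with 999 random permutations
lisa = esda.Moran_Local(y, w, permutations=999)

# flag clusters significant at the 95% level; q codes: 1=HH, 2=LH, 3=LL, 4=HL
counties["cluster"] = lisa.q
counties["significant"] = lisa.p_sim < 0.05
print(counties[["cluster", "significant"]].head())
```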

The result for the broadband access data is below. The High-High polygons in red are statistically significant clusters of counties that have high percentages of broadband use: the Northeast corridor, much of California, the coastal Pacific Northwest, the Central Rocky Mountains, and certain large metro areas like Atlanta, Chicago, Minneapolis, big cities in Texas, and a few others. There is a relatively equal number of Low-Low counties that are statistically significant clusters of low broadband service. This includes much of the deep South, south Texas, and New Mexico. There are also a small number of outliers. Low-High counties represent statistically significant low values surrounded by higher values. Examples include highly urban counties like Philadelphia, Baltimore City, and Wayne County (Detroit) as well as some rural counties located along the fringe of metro areas. High-Low counties represent significant higher values surrounded by lower values. Examples include urban counties in New Mexico like Santa Fe, Sandoval (Albuquerque), and Otero (Alamogordo), and a number in the deep south. A few counties cannot be evaluated as they are islands (mostly in Hawaii) and thus have no neighbors.

LISA Map of % of Households that have Access to Broadband Internet by County (2013-2017 ACS). 999 permutations, 95% conf interval, queens contiguity

All ACS data is published at a 90% confidence level and margins of error are published for each estimate. Margins of error are typically higher for less populated areas, and for any population group that is small within a given area. I calculated the coefficient of variation for this variable at the county level to measure how precise the estimates are, and used GeoDa to create a quick histogram. The overwhelming majority had CV values below 15, which is regarded as being highly reliable. Only 16 counties had values that ranged from 16 to 24, which puts them in the medium reliability category. If we were dealing with a smaller population (for example, dial-up subscribers) or smaller geographies like ZCTAs or tracts, we would need to be more cautious in analyzing the results, and might have to aggregate smaller populations or areas into larger ones to increase reliability.
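
For reference, the coefficient of variation simply converts the published 90% margin of error back to a standard error (dividing by 1.645) and expresses it as a percentage of the estimate. A quick sketch with made-up numbers:

```
def coefficient_of_variation(estimate, moe, z=1.645):
    # CV for an ACS estimate: standard error (MOE / 1.645 at the 90% level)
    # expressed as a percentage of the estimate
    return (moe / z) / estimate * 100

# hypothetical county: 30,000 households with broadband, MOE of +/- 1,200
print(round(coefficient_of_variation(30000, 1200), 1))  # about 2.4
```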

## Wrap Up

The issue of the digital divide has gained more coverage in the news lately with the exploration of the geography of the “new economy”, and how technology-intensive industries are concentrating in certain major metros while bypassing smaller metros and rural areas. Lack of access to broadband internet and reliable wifi in rural areas and within older inner cities is one of the impediments to future economic growth in these areas.

You can download a shapefile with the data and results of the analysis described in this post.

# Measuring Polygon Overlap in QGIS and PostGIS

I was helping someone with a project this semester where we wanted to calculate overlap between two different polygon layers (postal code areas and grid cells) for over forty countries throughout the world. The process involved calculating the area of overlap and percentage of total overlap between each postal area and grid cell. We began our experiment in QGIS and perfected the process there, but that attempt failed because the software was not able to handle the large number of polygons: almost 2 million postal codes and over 60k grid cells. Ultimately we employed PostGIS, which was more efficient and able to do the job.

In this post I’ll outline the steps for calculating area and polygon overlap in both QGIS (as an example of desktop GIS software) and PostGIS (as an example of a spatial database); I’ll assume you have some familiarity with both. For this example I’ll use two layers from the Census Bureau’s TIGER Line Shapefiles: Congressional Districts (CDs) and ZIP Code Tabulation Areas (ZCTAs). We’ll calculate how ZCTAs overlap with CD boundaries.

Before we begin, I should say that overlap is a technical term for a specific type of spatial selection. Overlapping features must share some interior space, and the geometry of one feature is not entirely enclosed within the geometry of another. I am NOT using the term overlap in this technical sense here – I'm using it more generally to refer to features that share any interior space with another, including areas that are entirely enclosed within another (i.e. 100% overlap).

## QGIS

Since we’re measuring areas, the first step is to reproject our layers to a projected coordinate system that preserves area (an equal area projection). If we were working in a local area we could use a UTM or (in the US) a State Plane Zone. For continents and large countries like the US we could use Albers Equal Area Conic. If we were working globally we could use Mollweide or a Cylindrical Equal Area projection. The US Census layers are in the geographic coordinate system NAD 83. To reproject them, we select each one in the layers panel, right click, and choose save as. Browse and save them as new files, hit the CRS button, search for North America Albers Equal Area (AEA), select it, and save the new layers in that system. In the map window we select one of the new layers, right click, and choose Set Project CRS from Layer to apply the new system to the map window.

Congressional Districts (red) and ZCTAs (orange) in NAD 83

Congressional Districts (red) and ZCTAs (orange) in North America Albers Equal Area Conic

Next, we need to create a new field where we calculate the area for the ZCTAs. The census layers already come with pre-calculated area attributes, but we'll need to calculate our own. Open the attribute table for the ZCTAs and hit the field calculator button (looks like an abacus). In the menu we create a new field called totalarea and populate it with the expression:

$area * 0.00000038610

$area is a geometry function that calculates the area of each polygon. Since the AEA projection uses square meters as its unit, the area will be in square meters. Multiplying by this fraction gives us square miles (or if you prefer, divide by 1000000 to get square kilometers). It's important that we set the field type to a real / decimal number and specify a meaningful length (total number of digits) and precision (number of digits right of the decimal place). A length of 20 and a precision of 5 gives us 15 places to the left of the decimal point and 5 to the right, which should be plenty. Hit Calculate, exit out of the edit mode, and save changes.

Calculating area in the QGIS Field Calculator

Before calculating the overlap it's a good idea to check the geometry of each layer to make sure all of the polygons are valid (i.e. properly constructed), otherwise we will run into errors. Use Vector – Geometry Tools – Check Validity to check geometry, and if anything is broken, open the Processing toolbox and search for the Fix Geometries tool. In this example both layers have valid geometry.

Use Vector – Geoprocessing – Union to meld the ZCTA and CD layers together. This will create unique polygons that consist of geometry occupied by a unique ZCTA and CD combination. So in instances where there is overlap between layers the polygon will be split into two (or more) pieces. See the image below, which illustrates CDs and ZCTAs before and after unioning in the Philadelphia area.

CDs and ZCTAs in Philly

Split ZCTAs after union with Congressional Districts

Processing time will vary based on the number of features, their level of detail (nodes per polygon), the number of overlaps, and the number of attributes (columns) per layer. There are 444 CD features and about 33k ZCTAs. While these numbers aren’t huge, the polygons are very detailed and there is a fair amount of overlap: it took me approx 1.5 hours to run. To minimize processing time you could create copies of these layers, modify them by deleting attribute columns, and run the process on this modified layer. You should strip everything out except some unique identifiers and the totalarea field; you can always join the results back to the larger body of attributes later if you need them.

Once the process is complete, open the attribute table for the unioned layer and create a new calculated field called piecearea, where you calculate the area for these smaller pieces. At this stage you have what you need to calculate overlap: for these pieces you have columns with the total area of the original ZCTA and the area of this ZCTA piece that overlaps with a particular CD. You can add an additional calculated field called pct_in (length 5 precision 2) where you divide one by the other to get a percentage:

( “piecearea” / “totalarea” ) * 100

If a ZCTA record appears once in the table that means it’s fully inside one CD, and it should have a percentage of 100%. Otherwise it will appear multiple times, which means there is overlap and this will be reflected in the percentages. The output below is for ZCTAs 19138 through 19141 in Philadelphia, PA. Compare this to the maps above (these ZCTAs are located towards the center of the map). 19138 and 19139 are wholly within one CD, while 19140 and 19141 are split across two CDs. Unfortunately, QGIS doesn’t provide a simple way for hiding columns, so I can’t clearly represent the result in the image below – you’ll see a clearer picture from the PostGIS process. But you’ll end up with the attributes from both layers, so you can see what CD each ZCTA falls in.

Attribute table with areas and percentages

## PostGIS

The QGIS method is fine if you don’t have many polygons to calculate, but if you have a large number of features the process will either take a long time, or will crash (incidentally ArcGIS would be no different).

PostGIS to the rescue. For this approach, first you create a spatial database and activate the PostGIS extension with the command CREATE EXTENSION postgis. Then you can load the shapefiles into PostGIS using the shapefile loader that is bundled with PostGIS, or you could use the QGIS DB Manager to load them. During the import process you need to specify that the layers are in NAD 83 by specifying the correct EPSG code, changing the SRID from 0 to 4269.

PostGIS doesn't have many global or continental projected coordinate system definitions, so we'll have to add one for North America Albers Equal Area to its spatial reference table. A quick visit to Spatial Reference and a search for this system yields the definition, and we can grab a PostGIS Insert statement that we can copy and paste into a SQL query window in our database. Before executing it, we have to change the SRID number in the statement from 9102008 to 102008 to avoid violating a check constraint that prevents IDs from being larger than 6 digits.

With the definition in place, we create a series of blank tables that will hold our two layers, and then run an insert statement where we take columns we want from the original tables and bring them into the new tables. In the course of doing this, we also transform the geometry from NAD 83 to Albers. At the end it’s important to create a spatial index on the geometry, as it will really speed up spatial selections.

```
BEGIN;

CREATE TABLE zctas_aea (
    zcta5 varchar(5) PRIMARY KEY,
    geom geometry (Multipolygon, 102008)
);

INSERT INTO zctas_aea (zcta5, geom)
SELECT zcta5ce10, ST_Transform(geom, 102008)
FROM tl_2018_us_zcta510;

CREATE INDEX zctas_aea_geom_gist
ON zctas_aea
USING gist (geom);

COMMIT;
```
```
BEGIN;

CREATE TABLE cds_aea (
    geoid varchar(4) PRIMARY KEY,
    statefp varchar(2),
    name text,
    session varchar(3),
    geom geometry (Multipolygon, 102008)
);

INSERT INTO cds_aea (geoid, statefp, name, session, geom)
SELECT geoid, statefp, namelsad, cdsessn, ST_Transform(geom, 102008)
FROM tl_2018_us_cd116;

CREATE INDEX cds_aea_geom_gist
ON cds_aea
USING gist (geom);

COMMIT;
```

Once the data is inserted we can check the geometry validity with ST_IsValid, and if there is bad geometry we can fix it with another statement using ST_MakeValid, where IN contains identifiers for bad geometry discovered in the previous statement.

```
SELECT geoid, ST_IsValid(geom) AS notvalid,
ST_IsValidReason(geom) AS reason
FROM cds_aea
WHERE NOT ST_IsValid(geom);
```
```
UPDATE cds_aea
SET geom=ST_MakeValid(geom)
WHERE geoid IN (INSERT LIST OF IDS HERE);
```

We can execute the overlap operation with a single statement. PostGIS allows you to calculate area on the fly with the ST_Area function, and there are two functions for overlap: ST_Intersects acts as a spatial join that relates one layer to the other by selecting all features that intersect, while ST_Intersection selects the actual pieces of each feature's geometry that intersect. This example is just for Pennsylvania, which we select using the state FIPS code '42' from the CD layer. It's a good idea to get the statement right on a sample of records before executing it on the entire set. The double colons are a PostgreSQL shortcut for casting data types from one type to another. This is necessary when using the ROUND function to produce a non-integer result (as ROUND can't round the real decimal numbers produced by ST_Area to a fixed number of decimal places).

```
SELECT z.zcta5 AS zcta, c.geoid AS cd, c.name AS cdname,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) * 0.00000038610)::numeric,2) AS area_piece,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) / ST_Area(z.geom) * 100)::numeric,1) AS pct_in
FROM zctas_aea z, cds_aea c
WHERE ST_Intersects(z.geom, c.geom) AND c.statefp = '42'
ORDER BY z.zcta5, c.geoid, pct_in DESC;
```

This statement took me about 20 seconds to run. The results (see below) include several records that QGIS didn't return, where the area and overlap is 0, either due to an infinitesimally small area of overlap that rounds to zero, or to the strict interpretation of intersects (which includes features that merely touch). While there is an ST_Overlaps function, it will not return geometries where one geometry is completely contained within another (so we can't use that). For example, ZCTAs 19138 and 19139 each fall within one district, but there are two records for each, one with a 100% value and another with a 0% value.

Result of intersect operations and area calculations in pgAdmin / PostGIS

We can toss these records by either deleting them from the final result when the process is finished, or we can add another statement to our WHERE clause to filter them out:

`AND ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) *  0.00000038610)::numeric,2) > 0`

This lengthened the execution time to 30 seconds and dropped the number of records from 2,523 to 2,061.

Once the statement looks good, we can drop the Pennsylvania filter and generate a result for the entire country. Using pgAdmin 4 we can write the result directly out as a CSV. Or, you can preface the statement with CREATE VIEW overlap AS to save the statement as a query which you can call up any time. Or, you can preface the statement with CREATE TABLE overlap AS and the result of the query will be saved in a new table. This takes longer than the other two options, but gives you the ability to query and modify the resulting table. Exporting the table out as a CSV can be accomplished quickly, giving you the best of the first and third options. The final code and result is shown below.

```
CREATE TABLE zcta_cd_overlap AS
SELECT z.zcta5 AS zcta, c.geoid AS cdistrict, c.name AS cdname,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) * 0.00000038610)::numeric,2) AS area_piece,
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) / ST_Area(z.geom) * 100)::numeric,1) AS pct_in
FROM zctas_aea z, cds_aea c
WHERE ST_Intersects(z.geom, c.geom) AND
ROUND((ST_Area(ST_Intersection(z.geom, c.geom)) * 0.00000038610)::numeric,2) > 0
ORDER BY z.zcta5, c.geoid, pct_in DESC;
```

Final Result in PostGIS / pgAdmin

## Conclusion – which is best?

I'm using a 64-bit Lenovo Thinkpad laptop that has 4 Intel processors at 2.3GHz and 8 gigs of memory. I'm running Xubuntu 18.04 and am using QGIS 3.4 Madeira, PostgreSQL 10, PostGIS 2.4, and pgAdmin 4. With 444 CDs and 33k ZCTAs it took me over 1.5 hours to run the union operation in QGIS, and that's without altering the attribute tables to delete unnecessary columns. Executing the PostGIS statement, simply writing the output to the screen (excluding areas that round to 0), took only 12 minutes. Writing the result to a new table took 22 minutes.

For the larger project that I mentioned at the beginning of this post, neither QGIS nor ArcGIS was able to complete the union process between 2 million polygons and 60k grid areas without crashing, even when we reduced the number of attribute columns to a bare minimum. It took PostGIS about 50 minutes to execute the overlap query and print the output to the screen or directly to a CSV, and about 3 hours to write the results to a new table.

I think the PostGIS approach is more straightforward and gives you more control over the process. There's no need to calculate area in advance or to delete attribute columns, as you can simply choose to include or exclude the ones you want. Finding and fixing invalid geometry in PostGIS is simpler, and the process is faster to execute. Rest assured you can handle layers with large numbers of features. I've wondered if the problems with QGIS and ArcGIS might be mitigated by using something other than a shapefile, like the newer geopackage format which is built on SQLite. I have no idea, but it would be worth trying if you really wanted or needed to go the desktop GIS route for large files.

# Using the ACS to Calculate Daytime Population

I’m in the home stretch for getting the last chapter of the first draft of my census book completed. The next to last chapter of the book provides an overview of a number of derivatives that you can create from census data, and one of them is the daytime population.

There are countless examples of using census data for site selection analysis and for comparing and ranking places for locating new businesses, providing new public services, and generally measuring potential activity or population in a given area. People tend to forget that census data measures people where they live. If you're trying to measure service or business potential for residents, the census is a good source.

Counts of residents are less meaningful if you wanted to gauge how crowded or busy a place was during the day. The population of an area changes during the day as people leave their homes to go to work or school, or go shopping or participate in social activities. Given the sharp divisions in the US between residential, commercial, and industrial uses created by zoning, residential areas empty out during the weekdays as people travel into the other two zones, and then fill up again at night when people return. Some places function as job centers while others serve as bedroom communities, while other places are a mixture of the two.

The Census Bureau provides recommendations for calculating daytime population using a few tables from the American Community Survey (ACS). These tables capture where workers live and work, which is the largest component of the daytime population.

Using these tables from the ACS:

- Total resident population
  - B01003: Total Population
- Total workers living in area and workers who lived and worked in same area
  - B08007: Sex of Workers by Place of Work–State and County Level (‘Total:’ line and ‘Worked in county of residence’ line)
  - B08008: Sex of Workers by Place of Work–Place Level (‘Total:’ line and ‘Worked in place of residence’ line)
  - B08009: Sex of Workers by Place of Work–Minor Civil Division Level (‘Total:’ line and ‘Worked in MCD of residence’ line)
- Total workers working in area
  - B08604: Total Workers for Workplace Geography

They propose two different approaches that lead to the same outcome. The simplest approach: add the total resident population to the total number of workers who work in the area, and then subtract the total resident workforce (workers who live in the area but may work inside or outside the area):

`Daytime Population = Total Residents + Total Workers in Area - Total Resident Workers`

For example, according to the 2017 ACS, Washington DC had an estimated 693,972 residents (from table B01003), 844,345 (+/- 11,107) people who worked in the city (table B08604), and 375,380 (+/- 6,102) workers who lived in the city. We add the total residents and total workers, and subtract the total workers who live in the city. The subtraction allows us to avoid double counting the residents who work in the city (as they are already included in the total resident population) while omitting the residents who work outside the city (who are included in the total resident workers). The result:

`693,972 + 844,345 - 375,380 = 1,162,937`

And to get the new margin of error:

`SQRT(0^2 + 11,107^2 + 6,102^2) = 12,673`
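
Scripted in Python, the whole calculation for DC (using the figures cited above) is just a few lines, a minimal sketch of the Bureau's formula:

```
from math import sqrt

def daytime_population(residents, workers_in_area, resident_workers):
    # Census Bureau formula: residents + workers in area - resident workers
    return residents + workers_in_area - resident_workers

def moe_sum(*moes):
    # ACS MOE for sums and differences: square root of the sum of squared MOEs
    return sqrt(sum(m ** 2 for m in moes))

# 2017 ACS figures for Washington DC cited above
dc_daytime = daytime_population(693972, 844345, 375380)
dc_moe = moe_sum(0, 11107, 6102)

print(dc_daytime)       # 1162937
print(round(dc_moe))    # roughly 12673
```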

So the daytime population of DC is approx 468,965 people (68%) higher than its resident population. The district has a high number of jobs in the government, non-profit, and education sectors, but has a limited amount of expensive real estate where people can live. In contrast, I did the calculation for Philadelphia and its daytime population is only 7% higher than its resident population. Philadelphia has a much higher proportion of resident workers relative to total workers. Geographically the city is larger than DC and has more affordable real estate, and faces stiffer suburban competition for private sector jobs.

The variables in the tables mentioned above are also cross-tabulated in other tables by age, sex, race, Hispanic origin, citizenship status, language, poverty, and tenure, so it's possible to estimate some characteristics of the daytime population. Margins of error will limit the usefulness of estimates for small population groups, and overall the 5-year period estimates are a better choice for all but the largest areas. Data for workers living in an area who lived and worked in the same area is reported for states, counties, places (incorporated cities and towns), and minor civil divisions (MCDs) for the states that have them.

Data for the total resident workforce is available for other, smaller geographies, but place of work is still reported relative to those larger areas; i.e. we know how many people in a census tract live and work in their county or place of residence, but not how many live and work in their tract of residence. In contrast, data on the number of workers from B08604 is not available for smaller geographies, which limits the application of this method to larger areas.

Download or explore these ACS tables from your favorite source: the American Factfinder, the Census Reporter, or the Missouri Census Data Center.

# Recent Migration Trends for New York City and Metro

The Baruch GIS lab crew just published a paper: New Yorkers on the Move: Recent Migration Trends for the City and Metro Area. The paper (no. 15 Feb 2018) is part of the Weissman Center for International Business Occasional Paper Series, which focuses on New York City’s role in the international and domestic economy.

## Findings

We analyzed recent population trends (2010 to 2016) in New York City and the greater metropolitan area using the US Census Bureau’s Population Estimates to study components of population change (births, deaths, domestic and international migration) and the IRS Statistics of Income division’s county to county migration data to study domestic migration flows.

Here are the main findings:

1. The population of New York City and the New York Metropolitan Area increased significantly between 2010 and 2016, but annual growth has slowed due to greater domestic out-migration.
2. Compared to other large US cities and metro areas, New York’s population growth depends heavily on foreign immigration and natural increase (the difference between births and deaths) to offset losses from domestic out-migration.
3. Between 2011 and 2015 the city had few relationships where it was a net receiver of migrants (receiving more migrants than it sends) from other large counties. The New York metro area had no net-receiver relationships with any major metropolitan area.
4. The city was a net sender (sending more migrants than it received) to all of its surrounding suburban counties and to a number of large urban counties across the US. The metro area was a net sender to metropolitan areas throughout the country.

For the domestic migration portion of the analysis we were interested in seeing the net flows between places. For example, the NYC metro area sends migrants to and receives migrants from the Miami metro. What is the net balance between the two – who receives more versus who sends more?

The answer is: the NYC metro is a net sender to most of the major metropolitan areas in the country, and has no significant net receiver relationships with any other major metropolitan area. For example, for the period from 2011 to 2015 the NYC metro’s largest net sender relationship was with the Miami metro. About 88,000 people left the NYC metro for metro Miami while 58,000 people moved in the opposite direction, resulting in a net gain of 30,000 people for Miami (or in other words, a net loss of 30k people for NYC). The chart below shows the top twenty metros where the NYC metro had a deficit in migration (sending more migrants to these areas than it received). A map of net out-migration from the NYC metro to other metros appears at the top of this post. In contrast, NYC’s largest net receiver relationship (where the NYC metro received more migrants than it sent) was with Ithaca, New York, which lost a mere 300 people to the NYC metro.

## Process

For the IRS data we used the county to county migration SQLite database that Janine meticulously constructed over the course of the last year, which is freely available on the Baruch Geoportal. Anastasia employed her Python and Pandas wizardry to create Jupyter notebooks that we used for doing our analysis and generating our charts, all of which are available on github. I used an alternate approach with Python and the SQLite and prettytable modules to generate estimates independently of Anastasia, so we could compare the two and verify our numbers (we were aggregating migration flows across years and geographies from several tables, and calculating net flows between places).
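
For anyone attempting something similar, the net-flow calculation itself is simple once the flows are in a table: pair each origin-destination flow with the flow running in the opposite direction and take the difference. A rough pandas sketch is below; the table structure and numbers are illustrative, not the actual schema of the IRS database.

```
import pandas as pd

# hypothetical flow table: one row per origin-destination pair with a migrant count
flows = pd.DataFrame({
    "origin":      ["NYC", "Miami", "NYC", "Ithaca"],
    "destination": ["Miami", "NYC", "Ithaca", "NYC"],
    "migrants":    [88000, 58000, 300, 600],
})

# pair each flow with the flow going in the opposite direction
reverse = flows.rename(columns={"origin": "destination", "destination": "origin",
                                "migrants": "migrants_in"})
net = flows.merge(reverse, on=["origin", "destination"], how="left").fillna(0)

# positive net_out means the origin sends more people than it receives
net["net_out"] = net["migrants"] - net["migrants_in"]
print(net[["origin", "destination", "net_out"]])
```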

One of our goals for this project was to use modern tools and avoid the clunky use of email. With the Jupyter notebooks, git and github for storing and syncing our work, and ShareLaTeX for writing the paper, we avoided using email for constantly exchanging revised versions of scripts and papers. Ultimately I had to use latex2rtf to convert the paper to a word processing format that the publisher could use. This post helped me figure out which bibliography packages to choose (in order for latex2rtf to interpret citations and references, you need to use the older natbib & bibtex combo and not biblatex & biber).

If you are doing similar research, Zillow has an excellent post that discusses the merits of the different datasets. There are also good case studies on Washington DC and Philadelphia that employ the same datasets.

# Note Taking for Academic Research

I’ve been reviewing a lot of literature over the past year in preparation for writing my book, so note taking is at the forefront of my mind. Grad students occasionally ask me for suggestions on how to effectively take notes, so I’ll share some pointers here. I’ll begin with my quest to find the right note taking software, followed by my actual process for taking notes.

## Finding the Right Tool

Ten years ago, I suddenly found myself back in a position where I needed to write academic papers, something I hadn’t done since I wrote my master’s thesis about eight years before that. At that time, I was still using the techniques I had learned in high school (a much longer time ago…). Back then, you were either an index card person or a binder person. The card people would write one note on each card, while binder people kept a ledger of notes and would add additional pages as needed. You’d classify your notes as summaries, paraphrases, or quotations.

I assumed that my high school methods must be outdated by now, so I cast around to see what note taking software was available. I knew I wanted to go open source, as I didn't want my notes tethered to a specific tool and stored in a proprietary format. There were a lot of options, and I quickly became bogged down and frustrated with trying them all. I felt that much of the software forced me to conform to it, and I was spending too much time fidgeting and figuring things out.

I abandoned the search and recorded my notes in a simple text (aka notepad) document. I had always been a binder person, so the single document approach appealed to me. I could copy and paste, use spell check, and search for keyword terms that I assigned (the Linux editors like gedit, leafpad, and xed are lightweight but more robust than MS Notepad). This worked fine for a stand-alone paper and I still use this approach for small projects. But as my research became ongoing I needed to rely on these notes for many future projects. The single notepad document grew unwieldy, and browsing and searching became difficult.

A few years later, I made a second attempt at searching for note taking software, and this time I broadened the search to include more general-purpose options. My solution: use a wiki! With a wiki, every single source can have its own page, the sources can be grouped together under thematic categories, you can assign tags, and you can search across all the pages. I could also add links between pages and out to the web, and could link the notes to the source documents. The wiki was so open ended that I didn't feel constrained in writing my notes to fit a particular interface, nor did I have to waste a lot of time sifting through buttons and tools.

I opted for a desktop wiki called Zim, which has been actively maintained since 2008. All of the pages in Zim are saved as individual text files in a basic wiki mark-up, which ensures that they can be accessed outside the program. Pages are stored in a notebook, which is essentially just a folder. If you create hierarchies of pages, these categories become folders and sub-folders. Zim has a ton of extra plugins so you can do spell checking and concept mapping, create formulas and calendars, and more. You can also export your entire notebook or portions of it as HTML or LaTeX files.

Most importantly, the wiki solved one of my most vexing problems. I found that a lot of the note taking software was geared towards just taking notes, and couldn't handle keeping track of citations. Citation software is its own genre, and I found that those packages were poor for taking notes. With Zim, I create a page dedicated to each source, and at the top of each page I embed some BibTeX code for storing the citation data. BibTeX is a format that's used for creating LaTeX bibliographies, but it has become a common standard and can be used by word processors too. I have a template page (see below) with several BibTeX document types that I just copy and paste when I have a new source to add. Since the pages are saved as plain text, I wrote a short Python script (it appears at the end of this post) that loops through my note pages, scrapes out the BibTeX records, and creates a BibTeX file that I can use in LaTeX. Within the BibTeX record I store a link to the source: either to a PDF I have locally, a web page (if it's a site), or a WorldCat catalog record (if it's a book). So all my notes, citations, and the source material are kept together in one place!

Zim is desktop software that you have to download and install locally. Since the notebook consists of text files in folders, it’s easy to back it up into Box or DropBox or whatever you use. Zim doesn’t save histories or have version control, but there’s a plugin that lets you sync your files with Git and other systems.

## Relying on Tried and True Methods

While the right tool is important, it's really the method that counts. I learned that I had to jettison the idea that the note taking process has to be 100% efficient. While you certainly don't want to flail around and waste time, note taking is not supposed to be quick and easy. The only way you can truly learn new material is to spend time with it: reading, re-reading, taking notes, and reading the notes. The process of note taking is just as important as the actual notes themselves, if not more so, as the process is what helps you to synthesize and learn the material. While I left the binder and note cards behind, my actual note taking process was similar to what I did in high school.

I always download articles and bookmark websites or catalog records as I'm doing my searches. Once I complete a series of searches and have gathered material from the web and library databases, I sift through the files and rename them using the first author's last name and the year of publication (e.g. Jones2017). I'll also use this file name as the BibTeX key that uniquely identifies the article. I create a documents folder with sub-folders for articles, books, and reports, and I keep these folders in the same location as the Zim notebook. There's no reason to create lots of topical or thematic folders, as you can use the wiki to categorize and tag the notes, and the wiki becomes the vehicle for searching or browsing through the documents.

As I sort through the sources I identify what’s essential and what’s ancillary. High priority sources will be read thoroughly and covered in detail, while the low-priority stuff will be skimmed and summarized. High priority sources are critical to your research and include touchstone articles in your field, excellent case studies, relevant background material, and any past research that remotely resembles what you’re working on. Low priority sources may have one important fact or concept that you need to remember; these materials are more tangential to your work and ultimately you might cite them in passing, or even not at all.

I always print out the high-priority articles. I'll read an article first, and then I'll go back and do a second read and mark passages with a highlighter. Next, I'll create and type notes directly into the wiki. I might read and mark up a couple of articles before I start note taking, but I don't wait too long, as I want the articles fresh in my memory. For essential books, I'll read a chapter or two at a time and mark passages with little sticky flags. Then I'll go back and take notes in a paper notebook, and will keep doing that until I finish the book. Then I transcribe all the notes for the book onto my laptop. This takes longer, but once again it's not all about efficiency. I get to spend more time with the material and it helps me absorb it. This approach also separates the computer from the reading, which cuts down on distractions and provides more flexibility in terms of where I can work. Reading in a comfortable chair or outside is preferable to reading while sitting at a table with a laptop.

I never print out or mark up low-priority articles; I skim through the digital copies and write a summary directly in the wiki. For books, I read the book in one go and may use sticky flags here and there, and when I’m done I type the notes directly into the wiki.

For the notes themselves, each note page has the title and author prominently at the top followed by a summary of the source, and then the BibTeX citation (see below). Low-priority materials usually get nothing but a summary and a citation. High-priority materials get detailed notes. Each note is written as a bullet point, and can represent one important fact or insight, or can be a summary of a paragraph or several pages, or even a summary of a chapter. It depends on how important the material is relative to my work.

Taking notes is not like writing a book report. I'm not writing an even or objective summary of the material in its entirety. Instead, I'm picking out the pieces that are of interest to me and to the work I'll be doing, and I skip the rest. Sometimes I'll editorialize (this is great, or this stinks), but I write in such a way that my thoughts are distinct from what the author is saying. This is where efficiency comes into the picture: identify sources that are high versus low priority, summarize the source, and identify just the specific details that are relevant to you. You're writing these notes with specific research goals in mind, so don't waste time writing a generic book report.

I always summarize or paraphrase the material as I take notes, putting concepts in my own words. Doing this forces you to wrestle with the concepts and internalize them, which improves your understanding of the material and your memory for it. It also helps guard against plagiarism; once you start writing the paper, you’ll know your notes are already in your own words and you can use them freely. If I do quote something directly, I always surround it with quotation marks. Lastly, at the end of my note I provide the page numbers to indicate what’s been summarized, so I can go back if need be.

Note taking is an idiosyncratic process. What works for you may not work for someone else and vice versa. The key is to figure out what works best for you; create a system, try it out, and once you’re happy go with it. You can always tweak things as you move along. The notes will help you when it comes time to pull your ideas together into a cohesive paper, but it’s the reading and note taking process that helps you to become proficient with the subject matter.

As I was re-learning how to take notes, I found the handouts from the University of Melbourne’s Academic Skills Unit to be particularly valuable. This is their latest version of Taking Notes From Texts, and this is the older version that I stumbled on years ago.

(Python code for scraping BibTeX records out of wiki notes to create a bibliography is posted below).

```
#Parse notes stored in zim wiki to extract all bibtex records and write them
#to a new bibtex file named with today's date.

#Script must be stored directly above the notes folder where the wiki data
#is stored. It will ignore the empty bibtex template files and will only
#read wiki files stored as .txt.

#Within the wiki, all bibtex records in the notes are enclosed in a <bibtex> tag.
#The script reads each line and ignores them until it finds the open
#tag. Then it starts writing each line until it reads the close tag.
#A line return is appended so records are separated in the output file.

#A list and count of extracted records is provided as a diagnostic

import os, datetime

now=datetime.date.today()
path='.'
outfile='sources_'+str(now)+'.bib'

writefile=open(outfile,'w')

counter=0
titles=[]

for (subdir,dirs,files) in os.walk(path):
    if 'Templates' in dirs:
        dirs.remove('Templates')
    if 'documents' in dirs:
        dirs.remove('documents')
    for f in files:
        if f[-4:]=='.txt':
            readfile=open(os.path.join(subdir,f),'r')
            #skip lines until the opening <bibtex> tag is found
            for line in readfile:
                if line.startswith('<bibtex>'):
                    break
            #then write each line until the closing tag is found
            for line in readfile:
                if line.startswith('</bibtex'):
                    titles.append(f)
                    writefile.write('\n')
                    counter=counter+1
                    break
                else:
                    writefile.write(line)
            readfile.close()

writefile.close()

print('Extracted '+str(counter)+' records to '+outfile+':')
print('\n'.join(titles))
```