spatial analysis

Raster Temperature Jan 1, 2020 Southern NE

Summarizing Raster Data for Areas and Assigning Values to Points

It’s been a busy few months, but I have a few days to catch my breath now that it’s spring break and most people (except me) have gone away! One question that’s come up quite a bit this semester is how to associate raster data with coinciding vector data. I’ll summarize some approaches in this post using ArcGIS Pro and QGIS, to summarize raster values for polygons (zonal statistics) and to assign raster values to points (aka raster sampling).

Zonal Statistics: Summarize Rasters by Area

Imagine that you have quantitative values such as temperature or a vegetation index in a raster grid, and you want to use this data to calculate an average for counties or metro areas. The goal is to have a new attribute column in the vector layer that contains the summarized raster value, perhaps because you want to make thematic maps of that value, or you want to use it in conjunction with other variables to run spatial statistics, or you just want a plain and simple summary for given places.

The term zonal statistics is used to define any operation that calculates statistics on cell values of a raster within an area or zone defined by another dataset, either a raster or a vector. The ArcGIS Pro toolbox has a Zonal Statistics tool where the output is a new raster file with cells that are summarized by the input zones. That’s not desirable for the use case I’m presenting here; the better choice is the Zonal Statistics as Table tool. The output is a table containing the unique identifiers of the raster and vector, the summary stats you’ve generated (average, sum, min, max, etc), and a count of the number of cells used to generate the summary. You can join this resulting table back to the vector file using their common unique identifier in a table join.
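To make the idea concrete outside of a desktop GIS, here is a minimal sketch in plain Python. It assumes the zones have already been rasterized to the same grid as the values, which is a simplification of what the ArcGIS Pro and QGIS tools do with vector zones. The function name, grids, and zone IDs are all made up for illustration.

```python
from collections import defaultdict


def zonal_stats(values, zones, nodata=None):
    """Summarize raster cell values by zone.

    values and zones are equally sized 2D lists (rows of cells). zones holds
    a zone identifier for each cell, or None where a cell is outside all zones.
    """
    cells = defaultdict(list)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is None or value == nodata:
                continue  # skip cells outside the zones and nodata cells
            cells[zone].append(value)
    return {
        zone: {
            "count": len(vals),
            "sum": sum(vals),
            "mean": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for zone, vals in cells.items()
    }


# Hypothetical 3 x 4 temperature grid (degrees Celsius) and a matching zone grid
temps = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 0.0, 5.0, 5.0],
]
zones = [
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
    ["A", None, "B", "B"],
]

table = zonal_stats(temps, zones)
for zone_id, stats in sorted(table.items()):
    print(zone_id, stats["count"], round(stats["mean"], 2))
```

The result is keyed by zone identifier, which plays the same role as the common unique identifier used for the table join.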

In the example below, I’m using counties from the census TIGER files for southern New England as my Input Feature Zone, the AFFGEOID (Census ANSI / FIPS code) to identify the Zone Field, and a temperature grid for January 1, 2020 from PRISM as the Input Value Raster. I’m calculating the mean temperature for the counties on that day.

ArcGIS Zonal Statistics as Table Tool
ArcGIS Pro Zonal Statistics as Table; Temperature Grid and Southern New England Counties

The output table consists of one record for each zone / county, with the count of the cells used to create the average, and the mean temperature (in degrees Celsius). This table can be joined back to the original vector feature (select the county feature in the Contents, right click, Joins and Relates – Join) to thematically map the average temp.

ArcGIS Zonal Statistics Result
ArcGIS Pro Zonal Statistics; Table Output and Join to Show Average Temperature per County

In QGIS, this tool is simply called Zonal Statistics; search for it in the Processing toolbox. The vector with the zones is the Input layer, and the Raster layer is the grid with the values. By default the summary stats are the count, sum, and mean, but you can check the Statistics to calculate box to select others. Unlike ArcGIS, QGIS allows you to write output as a table or a new shapefile / geopackage, which carries along the feature geometry from the Input zones and adds the summaries, allowing you to skip the step of having to do a table join (if you opted to create a table, you could join it to the zones using the Joins tab under the Properties menu for the vector features).

QGIS Zonal Stats
QGIS Zonal Statistics

Extract Raster Values for Point Features

Zonal stats allows you to summarize raster data within a polygon. But what if you had point features, and wanted to assign each point the value of the raster cell that it falls within? In ArcGIS Pro, search the toolbox for the Extract Values to Points tool. You select your input points and raster, and specify a new output point feature that will include the raster values. The default is to take the value for the cell that the point falls within, but there is an Interpolate option that will calculate the value from adjacent cells. The output point feature contains a new column called RASTERVALU. I created some phony point data and used it to generate the output below.
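The default behavior (take the value of the cell the point falls within) boils down to converting a coordinate into a row and column index. Here is a rough sketch of that lookup, assuming a north-up grid with square cells. The grid, coordinates, and function name are hypothetical, and this is not how either GIS package implements it internally.

```python
import math


def sample_raster(grid, x_min, y_max, cell_size, x, y):
    """Return the value of the cell containing point (x, y), or None if outside.

    grid is a list of rows with row 0 along the top (north) edge; x_min and
    y_max are the coordinates of the grid's upper-left corner.
    """
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y_max - y) / cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


# Hypothetical 2 x 3 grid whose upper-left corner sits at (0, 2), with 1-unit cells
grid = [
    [10.0, 11.0, 12.0],
    [20.0, 21.0, 22.0],
]
points = {"p1": (0.5, 1.5), "p2": (2.9, 0.1), "p3": (5.0, 5.0)}

sampled = {
    name: sample_raster(grid, 0.0, 2.0, 1.0, x, y)
    for name, (x, y) in points.items()
}
print(sampled)
```

A point that falls outside the grid comes back as None here; the desktop tools have their own conventions for marking those records.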

ArcGIS Extract Values to Points
ArcGIS Pro Extract Values to Points (assign raster cell values to points)

In QGIS the name of this tool is Sample raster values, which you can find in the Processing toolbox. Input the points, choose a raster layer, and write the output to a new vector point file. Unlike ArcGIS, there isn’t an option for interpolation from surrounding cells; you simply get the value for the cell that the point falls within. If you need to interpolate, you can go to the Plugins menu, enable the SAGA plugin, and in the Processing toolbox try the SAGA tool Raster Values to Points instead.

QGIS Sample Raster Values
QGIS Sample Raster Values (assign raster cell values to points)

A variation on this theme would be to create and assign an average value around each point at a given distance, such as the average temperature within five miles. One way to achieve this would be to use the buffer tools in either ArcGIS or QGIS to create distinct buffers around each point at the specified distance. The buffer will automatically carry over all the attributes from the point features, including unique identifiers. Then you can run the zonal statistics tools against the buffer polygons and raster to compute the average, and if need be do a table join between the output table and the original point layer using their common identifier.
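As a rough illustration of the buffer-then-summarize idea, the sketch below averages every cell whose center lies within a given distance of a point. Using cell centers as the inclusion rule is my assumption for the sketch; the desktop tools may handle partially covered cells differently. All names and values are made up.

```python
import math


def mean_within_distance(grid, x_min, y_max, cell_size, x, y, radius):
    """Average the cells whose centers fall within radius of point (x, y)."""
    selected = []
    for row, row_values in enumerate(grid):
        for col, value in enumerate(row_values):
            center_x = x_min + (col + 0.5) * cell_size
            center_y = y_max - (row + 0.5) * cell_size
            if math.hypot(center_x - x, center_y - y) <= radius:
                selected.append(value)
    # None signals that no cell centers were inside the buffer
    return sum(selected) / len(selected) if selected else None


# Hypothetical 3 x 3 grid with 1-unit cells; upper-left corner at (0, 3)
grid = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 10.0],
]

# A point in the middle cell; a radius of 1.1 reaches that cell and the
# four cells that share an edge with it, but not the diagonal cells
buffer_mean = mean_within_distance(grid, 0.0, 3.0, 1.0, 1.5, 1.5, 1.1)
print(buffer_mean)
```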

Wrap-up

In using any of these tools, it’s important to consider the resolution of the raster (i.e. the size of the grid cell):

1. Relative to the size of the zonal areas or number of points, and

2. In relation to the phenomena that you’re studying.

When larger grid cells or zonal areas are used for measurement, any phenomenon becomes more generalized, and any variations within these large areas become masked. The temperature grid cells in this example had a resolution of 2.5 miles, which was suitable for creating county summaries. Summarizing data for census tracts at this resolution would be less ideal, as the tracts are much smaller than the cells, with the cell value characterizing a much larger area. This might be okay in the case of temperature, which tends not to vary considerably over a distance of a few miles. In contrast, averaging temperature data for states is not worthwhile, as states vary considerably in size and most are large enough that they contain multiple ecosystems and elevation levels.

The solutions I’ve described here are the desktop GIS solutions. You could also use either spatial SQL in a geodatabase or a spatial extension in a scripting language like Python or R to perform similar operations. In both cases a basic overlay and intersection statement is used, in conjunction with some grouping function for calculating summaries. I’ve been doing a lot more spatial Python work with geopandas these past few months – perhaps a topic for a subsequent post…

Noise Complaint Kernels and Contours

Kernel Density and Contours in QGIS: Noisy NYC

In spatial analysis, kernel density estimation (colloquially referred to as a type of “hot spot analysis”) is used to explore the intensity or clustering of point-based events. Crimes, parking tickets, traffic accidents, bird sightings, forest fires, incidents of infectious disease, anything that you can plot as a point at a specific period in time can be studied using KDE. Instead of looking at these features as a distribution of discrete points, you generate a raster that represents a continuous surface of values. You can either measure the density of the incidents themselves, or the concentration of a specific attribute that is tied to those incidents (like the dollar amount of parking tickets or the number of injuries in traffic accidents).

In this post I’ll demonstrate how to do a KDE analysis in QGIS, but you can easily implement KDE in other software like ArcGIS Pro or R. Understanding the inputs you have to provide to produce a meaningful result is more important than the specific tool. This YouTube video produced by the SEER Lab at the University of Florida helped me understand what these inputs are. They used the SAGA kernel tool within QGIS, but I’ll discuss the regular QGIS tool and will cover some basic data preparation steps when working with coordinate data. The video illustrates a KDE based on a weight, where there were single points that had a count-based attribute they wanted to interpolate (number of flies in a trap). In this post I’ll cover simple density based on the number of incidents (individual noise complaints), and will conclude by demonstrating how to generate contour lines from the KDE raster.

For a summary of how KDE works, take a look at the entry for “Kernel” in the Encyclopedia of Geographic Information Science (2007) p 247-248. For a fuller treatment, I always recommend Christopher Lloyd’s Spatial Data Analysis: An Introduction for GIS Users (2010) p 93-97 by Oxford University Press. There’s also an explanation in the ArcGIS Pro documentation.

Data Preparation

I visited the NYC Open Data page and pulled up the entry for 311 Service Requests. When previewing the data I used the filter option to narrow the records down to a small subset; I chose complaints that were created between June 1st and 30th 2022, where the complaint type began with “Noise”, which gave me about 75,000 records (it’s a noisy town). Then I hit the Export button and chose one of the CSV formats. CSV is a common export option from open data portals; as long as you have columns that contain latitude and longitude coordinates, you will be able to plot the records. The NYC portal allows you to filter up front; other data portals like the ones in Philly and DC package data into sets of CSV files for each year, so if you wanted to apply filters you’d use the GIS or stats package to do that post-download. If shapefiles or geoJSON are provided, that will save you the step of having to plot coordinates from a CSV.

NYC Open Data 311 Service Requests

With the CSV, I launched QGIS, went to the Data Source Manager, and selected Delimited Text. Browsed for the file I downloaded, gave the layer a common sense name, and under geometry specified Point coordinates, and confirmed that the X field was my longitude column and the Y field was latitude. Ran the tool, and the points were plotted in the basic WGS 84 longitude / latitude system in degrees, which is the system the coordinates in the data file were in (generally a safe bet for modern coordinate data, but not always the case).

QGIS Add Delimited Text and Plot Coordinates

The next step was to save these plotted points in a file format that stores geometry and allows us to do spatial analysis. In doing that step, I recommend taking two additional ones. First, verify that all of the plotted data have coordinates – if there are any records where lat and long are missing, those records will be carried along into the spatial file but there will be no geometry for them, which will cause problems. I used the Select Features by Expression tool, and in the expression window typed “Latitude” is not null to select all the features that have coordinates.
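If you would rather do this filtering before the data ever reaches the GIS, the same two checks (complaint type begins with “Noise”, coordinates are present) can be sketched with Python’s csv module. The sample rows below are invented, and apart from Latitude the column names are my assumption about the export, so adjust them to match your file.

```python
import csv
import io

# Invented stand-in for the exported CSV; column names other than
# Latitude are assumed, not taken from the actual download
SAMPLE = """Unique Key,Created Date,Complaint Type,Latitude,Longitude
1,06/18/2022 11:05:00 PM,Noise - Residential,40.85,-73.88
2,06/18/2022 11:07:00 PM,Noise - Street/Sidewalk,,
3,06/19/2022 01:15:00 AM,Illegal Parking,40.70,-73.95
4,06/19/2022 02:30:00 AM,Noise - Commercial,40.72,-73.99
"""


def noise_rows_with_coords(handle):
    """Keep noise complaints that have both a latitude and a longitude."""
    keep = []
    for row in csv.DictReader(handle):
        if not row["Complaint Type"].startswith("Noise"):
            continue
        if not row["Latitude"].strip() or not row["Longitude"].strip():
            continue  # no coordinates means no geometry later on
        row["Latitude"] = float(row["Latitude"])
        row["Longitude"] = float(row["Longitude"])
        keep.append(row)
    return keep


rows = noise_rows_with_coords(io.StringIO(SAMPLE))
print([r["Unique Key"] for r in rows])
```

For a real file you would pass an open file handle in place of the io.StringIO wrapper.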

QGIS Select by Expression

Second, transform the coordinate reference system (CRS) of the layer to a projected system that uses meters or feet. When we run the kernel tool, it will ask us to specify a radius for defining the density, as well as the size of the pixels for the output raster. Using degrees doesn’t make sense, as it’s hard for us to conceptualize distances in degrees, and they are not a constant unit of measurement. If you’ve googled around and read Stack Exchange posts or watched videos where a person says “You just have to experiment and adjust these numbers until your map looks Ok”, they were working with units in fractions of degrees. This is not smart. Transform the system of your layers!
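To see why degrees are not a constant unit, this small sketch uses the haversine formula to measure the ground length of one degree of longitude and one degree of latitude at a few latitudes. It assumes a spherical Earth with a mean radius of 6,371 km, so the numbers are approximations.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Length of one degree of longitude and latitude at several latitudes
for lat in (0.0, 40.7, 60.0):
    lon_km = haversine_km(lat, -74.0, lat, -73.0)
    lat_km = haversine_km(lat, -74.0, lat + 1.0, -74.0)
    print(f"lat {lat:5.1f}: 1 deg lon = {lon_km:6.1f} km, 1 deg lat = {lat_km:6.1f} km")

# Roughly New York City's latitude
one_deg_lon_nyc = haversine_km(40.7, -74.0, 40.7, -73.0)
one_deg_lat_nyc = haversine_km(40.7, -74.0, 41.7, -74.0)
```

At New York’s latitude a degree of longitude covers noticeably less ground than a degree of latitude, so a radius typed in degrees would not describe a circle on the ground.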

I selected the layer, right clicked, Export, Save Selected Features As. The default output is a geopackage, which is fine. Otherwise you could select ESRI shapefile; both are vector formats that store geometry. For file name I browse … and save the file in a specific folder. Beside CRS I hit the globe button, and in the CRS Selector window typed NAD83 Long Island in the filter at the top, and at the bottom I selected the NAD83 / New York Long Island (ftUS) EPSG 2263 option in the list. Every state in the US has one or more state plane zones that you can select for making optimal maps for that area, in feet or meters. Throughout the world, you could choose an appropriate UTM zone that covers your area in meters. For countries or continents, look for an equidistant projection (meters again).

QGIS Export – Save As

Clicked a series of Oks to create the new file. To reset my map window to match the CRS of the new file, I selected that file, right clicked, Layer CRS, Set Project CRS from Layer. Removed my original CSV to avoid confusion, and saved my project.

QGIS Noise Complaints in Projected CRS

Kernel Density Estimation

Now our data is ready. Under the Processing menu I opened the toolbox and searched for kernel to find Heatmap (Kernel Density Estimation) under the Interpolation tools. The tool asks for an input point layer, and then a radius. The radius is used to define an area for calculating a local density estimate around each point. We can use a formula to determine an ideal radius; the hopt method seems to be commonly employed for this purpose.

To use the hopt formula, we need to know the standard distance for our layer, which measures the degree to which features are dispersed around the spatial mean or center of the distribution. A nice 3rd party plugin was created for calculating this. I went to the plugins menu, searched for the Standard Distance plugin, and added it. Searched for it in the Processing toolbox and launched it. I provided my point layer for input, and specified an output file. The other fields are optional (if we were measuring an attribute of the points instead of the density of the points, we could specify the attribute as a weight column). The output layer consists of a circle where the center is the mean center of the distribution, and the radius of the circle is the standard distance. The attribute table contains one record, with the standard distance attribute of 36,046.18 feet (if no feature was created, the likely problem is you have records in the point file that don’t have geometry – delete them and try again).

Output from the Standard Distance Plugin

Knowing this, I used the hopt formula:

=((2/(3*N))^0.25)*SD

Where N is the number of features and SD is the standard distance. I used Excel to plug in these values and do the calculation.

((2/(3*74526))^0.25)*36046.18 = 1971.33
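If you would rather not open a spreadsheet, the same calculation is a one-liner in Python. The function name is my own; the inputs are the feature count and standard distance from this post.

```python
def hopt(n_features, standard_distance):
    """Radius from the hopt formula: ((2 / (3 * N)) ** 0.25) * SD."""
    return ((2 / (3 * n_features)) ** 0.25) * standard_distance


# N = 74,526 noise complaints, SD = 36,046.18 feet
radius_ft = hopt(74526, 36046.18)
print(round(radius_ft, 2))
```

The result is in the same units as the standard distance, feet in this case, and should land on roughly the 1,971 feet used in the next step.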

Finally, I launched the heatmap kernel tool, specified my noise points as input, and the radius as 1,971 feet. The output raster size does take some experimentation. The larger the pixel size, the coarser or more general the resolution will be. You want to choose something that makes sense based on the size of the area, the number of points, and / or some other contextual information. Just like the radius, the units are based on the map units of your layer. If I type in 100 feet for Pixel X, I see I’ll have a raster with 1,545 rows and 1,565 columns. Change it to 200 feet, and I get 773 by 783. I’ll go with 200 feet (the distance across a “standard” numbered street block in midtown Manhattan). I kept the defaults for the other options.
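To show how the radius and pixel size interact, here is a bare-bones sketch of a kernel density surface. It assumes a quartic kernel shape and sums unscaled weights at each pixel center, so it will not reproduce the exact values QGIS writes to its raster. The points, extent, and function names are all hypothetical.

```python
import math


def quartic_density(points, x, y, radius):
    """Sum unscaled quartic kernel weights of the points within radius of (x, y)."""
    total = 0.0
    for px, py in points:
        d = math.hypot(px - x, py - y)
        if d < radius:
            total += (1 - (d / radius) ** 2) ** 2
    return total


def heatmap(points, x_min, y_min, x_max, y_max, pixel, radius):
    """Evaluate the density at every pixel center; row 0 is the top of the extent."""
    cols = math.ceil((x_max - x_min) / pixel)
    rows = math.ceil((y_max - y_min) / pixel)
    grid = []
    for r in range(rows):
        center_y = y_max - (r + 0.5) * pixel
        grid.append([
            quartic_density(points, x_min + (c + 0.5) * pixel, center_y, radius)
            for c in range(cols)
        ])
    return grid


# Three made-up complaint locations in feet: two close together, one far away
points = [(100.0, 100.0), (120.0, 110.0), (900.0, 900.0)]
grid = heatmap(points, 0, 0, 1000, 1000, pixel=200, radius=400)
for row in grid:
    print([round(v, 2) for v in row])
```

Halving the pixel size roughly doubles the number of rows and columns, which matches the jump from 773 by 783 to 1,545 by 1,565 when going from 200 to 100 feet.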

QGIS Heatmap Kernel Density Estimation Window

The resulting raster was initially displayed in black and white. I opened the properties and symbology menu and changed the render type from Singleband gray to Singleband pseudocolor, and kept the default yellow to red scheme. Voila!

Kernel Density Estimate of NYC Noise Complaints June 2022

In June 2022 there were high clusters of noise complaints in north central Brooklyn, northern Manhattan, and the southwest portion of the Bronx. There’s a giant red hot spot in the north central Bronx that looks like the storm on planet Jupiter. What on earth is going on there? I flipped back to the noise point layer and selected points in that area, and discovered a single address where over 2,700 noise complaints about a loud party were filed on June 18 and 19! There’s also an address on the adjacent block that registered over 900 complaints. And yet the records do not appear to be duplicates, as they have different time stamps and closing dates. A mistake in coding this address, multiple times? A vengeful person spamming the 311 system? Or just one helluva loud party? It’s hard to say, but beware of garbage in, garbage out. Beyond this demo, I would spend more time investigating, would try omitting these complaints as outliers and run the heatmap tool again, and compare this output to different months. It’s also worth experimenting with the color classification scheme, and some different pixel sizes.

Kernel Results Zoomed In

Contour Lines

Another interesting way to visualize this data would be to generate contour lines based on the kernel output. I did a search for contour in the processing toolbox, and in the contour tool I provided the kernel noise raster as the input. For intervals between contour lines I tried 20 (the interval is expressed in the units of the raster’s cell values, not in feet), and changed the attribute name to reflect what the contour represents: COMPLAINT instead of ELEV. Generated the new file, overlaid on top of the kernel, and now you can see how it represents the “elevation” of complaints.

Noise Complaint Kernel Density with Contour Lines

Switch the kernel off, symbolize the contours and add some labels, and throw the OpenStreetMap underneath, and now you can explore New York’s hills and valleys of noise. Or more precisely, the hills and valleys of noise complainers! In looking at these contours, it’s important to remember that they’re generated from the kernel raster’s grid cells and not from the original point layer. The raster is a generalization of the point layer, so it’s possible that if you look within the center of some of the denser circles you may not find, say, 340 or 420 actual point complaints. To generate a more precise set of contours, you would need to decrease the pixel size in the kernel tool (from say 200 feet to 100).

Noise Complaint Contours in Lower Manhattan, Northwest Brooklyn, and Long Island City

It’s interesting what you can create with just one set of points as input. Happy mapping!

Census Tracts

Call for Proposals: Celebrating the Census in the Journal of Maps

I’m serving as a co-editor for a special issue for the Journal of Maps entitled “Celebrating the Census”. The Journal of Maps is an open access, peer reviewed journal published by the Taylor & Francis Group. The journal is distinct in that all articles feature maps and spatial diagrams as the focal point for studying geographic phenomena from both a physical / environmental and social science perspective.

Here’s the official synopsis for this census-themed special issue:

We invite contributions to a special issue of the Journal of Maps focused upon the evolving character and cartographic opportunities offered by traditional census statistics and the impact of transitioning from these sources of population data at a range of spatial scales into a new era of big data assembly. In so doing, the special issue marks two important events taking place in the UK during 2021 in the history of British Censuses and seeks contributions that reflect the past transition of population data cartography through the digital era of the last 50 years and anticipates its transformation into the big data era of the foreseeable future.

While the issue marks the 100th anniversary of the UK census, submissions concerning census mapping from around the world are welcome and encouraged in these topic areas, including but not limited to:

  • Spatial and statistical consistency over time
  • People on the move
  • Mapping people through space and time
  • Mapping morbidity and mortality
  • Politics and population data
  • International comparison of demographic mapping
  • Before and after population mapping using censuses and administrative sources
  • Population data and mapping human-environmental interaction
  • Transition and evolution in population mapping

Visit the special issue announcement for full details. Deadlines:

  • April 30, 2021: a short draft (500-word limit) outlining themes and scope of the paper, preferably with a sample map
  • June 14, 2021: abstracts will be selected by the editorial team by this date
  • Sept 5, 2021: completed paper (4000-word limit) is due

The issue will be published sometime in 2022.

NYC and NYMA Pop Change Graph 2000 to 2019

New York’s Population and Migration Trends in the 2010s

The Weissman Center for International Business at Baruch College just published my paper, “New York’s Population and Migration Trends in the 2010s”, as part of their Occasional Paper Series. In the paper I study population trends over the last ten years for both New York City (NYC) and the greater New York Metropolitan Area (NYMA) using annual population estimates from the Census Bureau (vintage 2019), county to county migration data (2011-2018) from the IRS SOI, and the American Community Survey (2014-2018). I compare NYC to the nine counties that are home to the largest cities in the US (cities with population greater than 1 million) and the NYMA to the 13 largest metro areas (population over 4 million) to provide some context. I conclude with a brief discussion of the potential impact of COVID-19 on both the 2020 census count and future population growth. Most of the analysis was conducted using Python and Pandas in Jupyter Notebooks available on my GitHub. I discussed my method for creating rank change grids, which appear in the paper’s appendix and illustrate how the sources and destinations for migrants change each year, in my previous post.

Terminology

  • Natural increase: the difference between births and deaths
  • Domestic migration: moves between two points within the United States
  • Foreign migration: moves between the United States and a US territory or foreign country
  • Net migration: the difference between in-migration and out-migration (measured separately for domestic and foreign)
  • NYC: the five counties / boroughs that comprise New York City
  • NYMA: the New York Metropolitan Area as defined by the Office of Management and Budget in Sept 2018, consisting of 10 counties in NY State (including the 5 NYC counties), 12 in New Jersey, and one in Pennsylvania
Map of the New York Metropolitan Area
The New York-Newark-Jersey City, NY-NJ-PA Metropolitan Area

Highlights

  • Population growth in both NYC and the NYMA was driven by positive net foreign migration and natural increase, which offset negative net domestic migration.
  • Population growth for both NYC and the NYMA was strong over the first half of the decade, but population growth slowed as domestic out-migration increased from 2011 to 2017.
  • NYC and the NYMA began experiencing population loss from 2017 forward, as both foreign migration and natural increase began to decelerate. Declines in foreign migration are part of a national trend; between 2016 and 2019 net foreign migration for the US fell by 43% (from 1.05 million to 595 thousand).
  • The city and metro’s experience fits within national trends. Most of the top counties in the US that are home to the largest cities and many of the largest metropolitan areas experienced slower population growth over the decade. In addition to NYC, three counties: Cook (Chicago), Los Angeles, and Santa Clara (San Jose) experienced actual population loss towards the decade’s end. The New York, Los Angeles, and Chicago metro areas also had declining populations by the latter half of the decade.
  • Most of NYC’s domestic out-migrants moved to suburban counties within the NYMA (representing 38% of outflows and 44% of net out-migration), and to Los Angeles County, Philadelphia County, and counties in Florida. Out-migrants from the NYMA moved to other large metros across the country, as well as smaller, neighboring metros like Poughkeepsie NY, Fairfield CT, and Trenton NJ. Metro Miami and Philadelphia were the largest sources of both in-migrants and out-migrants.
  • NYC and the NYMA lack any significant relationships with other counties and metro areas where they are net receivers of domestic migrants, receiving more migrants from those places than they send to those places.
  • NYC and the NYMA are similar to the cities and metros of Los Angeles and Chicago, in that they rely on high levels of foreign migration and natural increase to offset high levels of negative domestic migration, and have few substantive relationships where they are net receivers of domestic migrants. Academic research suggests that the absolute largest cities and metros behave this way, attracting both low and high skilled foreign migrants while redistributing middle and working class domestic migrants to suburban areas and smaller metros. This pattern of positive foreign migration offsetting negative domestic migration has characterized population trends in NYC for many decades.
  • During the 2010s, most of the City and Metro’s foreign migrants came from Latin America and Asia. Compared to the US as a whole, NYC and the NYMA have slightly higher levels of Latin American and European migrants and slightly lower levels of Asian and African migrants.
  • Given the Census Bureau’s usual residency concept and the overlap of the onset of the COVID-19 pandemic lockdown with the 2020 Census, in theory the pandemic should not alter how most New Yorkers identify their usual residence as of April 1, 2020. In practice, the pandemic has been highly disruptive to the census-taking process, which raises the risk of an undercount.
  • The impact of COVID-19 on future domestic migration is difficult to gauge. Many of the pandemic destinations cited in recent cell phone (NYT and WSJ) and mail forwarding (NYT) studies mirror the destinations that New Yorkers have moved to between 2011 and 2018. Foreign migration will undoubtedly decline in the immediate future given pandemic disruptions, border closures, and restrictive immigration policies. The number of COVID-19 deaths will certainly push down natural increase for 2020.

LISA map of Broad Band Subscription by Household

Mapping US Census Data on Internet Access

ACS Data on Computers and the Internet

The Census Bureau recently released the latest five-year period estimates from the American Community Survey (ACS), with averages covering the years from 2013 to 2017.

Back in 2013 the Bureau added new questions to the ACS on computer and internet use: does a household have a computer or not, and if yes what type (desktop or laptop, smartphone, tablet, or other), and does a household have an internet subscription or not, and if so what kind (dial-up, broadband, and type of broadband). 1-year averages for geographies with 65,000 people or more have been published since 2013, but now that five years have passed there is enough data to publish reliable 5-year averages for all geographies down to the census tract level. So with this 2013-2017 release we have complete coverage for computer and internet variables for all counties, ZCTAs, places (cities and towns), and census tracts for the first time.

Summaries of this data are published in table S2801, Types of Computers and Internet Subscriptions. Detailed tables are numbered B28001 through B28010 and are cross-tabulated with each other (presence of computer and type of internet subscription) and by age, educational attainment, labor force status, and race. You can access them all via the American Factfinder or the Census API, or from third-party sites like the Census Reporter. The basic non-cross-tabbed variables have also been incorporated into the Census Bureau’s Social Data Profile table DP02, and in the MCDC Social profile.

The Census Bureau issued a press-release that discusses trends for median income, poverty rates, and computer and internet use (addressed separately) and created maps of broadband subscription rates by county (I’ve inserted one below). According to their analysis, counties that were mostly urban had higher average rates of access to broadband internet (75% of all households) relative to mostly rural counties (65%) and completely rural counties (63%). Approximately 88% of all counties that had subscription rates below 60 percent were mostly or completely rural.

Figure 1. Percentage of Households With Subscription to Any Broadband Service: 2013-2017 [Source: U.S. Census Bureau]

Not surprisingly, counties with lower median incomes were also associated with lower rates of subscription. Urban counties with median incomes above $50,000 had an average subscription rate of 80% compared to 71% for completely rural counties. Mostly urban counties with median incomes below $50k had average subscription rates of 70% while completely rural counties had an average rate of 62%. In short, wealthier rural counties have rates similar to less wealthy urban counties, while less wealthy rural areas have the lowest rates of all. There also appear to be some regional clusters of high and low broadband subscriptions. Counties within major metro areas stand out as clusters with higher rates of subscription, while large swaths of the South have low rates of subscription.

Using GeoDa to Identify Broadband Clusters

I was helping a student recently with making LISA maps in GeoDa, so I quickly ran the data (percentage of households with subscription to any broadband service) through to see if there were statistically significant clusters. It’s been a couple years since I’ve used GeoDa and this version (1.12) is significantly more robust than the one I remember. It focuses on spatial statistics but has several additional applications to support basic data mapping and stats. The interface is more polished and the software can import and export a number of different vector and tabular file formats.

The Univariate Local Moran’s I analysis, also known as LISA for local indicators of spatial association, identifies statistically significant geographic clusters of a particular variable. Once you have a polygon shapefile or geopackage with the attribute you want to study, you add it to GeoDa and then create a weights file (Tools menu) using the unique identifier for the shapes. The weights file indicates how individual polygons neighbor each other: queen contiguity classifies features as neighbors as long as they share a single node, while rook contiguity classifies them as neighbors only if they share an edge (at least two points that can form a line).
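The difference between the two contiguity rules is easiest to see on a regular grid, where a cell touches up to eight others at a corner or edge but only four along an edge. This sketch builds neighbor lists for grid cells only; real county polygons need a geometry library or GeoDa’s own weights tool. The function name is made up.

```python
def grid_neighbors(rows, cols, contiguity="queen"):
    """Map each (row, col) cell of a regular grid to a list of its neighbors."""
    if contiguity == "rook":
        # shared edges only
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif contiguity == "queen":
        # shared edges or shared corners
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    else:
        raise ValueError("contiguity must be 'queen' or 'rook'")

    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            neighbors[(r, c)] = [
                (r + dr, c + dc)
                for dr, dc in offsets
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
    return neighbors


queen = grid_neighbors(3, 3, "queen")
rook = grid_neighbors(3, 3, "rook")
print(len(queen[(1, 1)]), len(rook[(1, 1)]))  # center cell
print(len(queen[(0, 0)]), len(rook[(0, 0)]))  # corner cell
```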

Once you’ve created and saved a weights file you can run the analysis (Space menu). You select the variable that you want to map, and can choose to create a cluster map, scatter plot, and significance map. The analysis generates 999 random permutations of your data and compares them to the actual distribution to evaluate whether clusters are likely the result of random chance, or if they are distinct and significant. Once the map is generated you can right click on it to change the number of permutations, or you can filter by significance level. By default a 95% confidence level is used.
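For a feel of what the permutation step is doing, here is a simplified sketch of a local Moran’s I with a pseudo p-value. It assumes row-standardized weights and a one-sided comparison against shuffled neighbor values; I have not checked it against GeoDa’s implementation, so treat it as an illustration of the idea only. The values and neighbor lists are invented.

```python
import random


def local_moran(values, neighbors, permutations=999, seed=0):
    """Return a list of (I_i, pseudo p-value) pairs, one per location.

    neighbors[i] lists the indexes of location i's neighbors. Locations
    without neighbors (islands) get (None, None).
    """
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    z = [(v - mean) / std for v in values]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    results = []
    for i in range(n):
        k = len(neighbors[i])
        if k == 0:
            results.append((None, None))
            continue
        observed = z[i] * sum(z[j] for j in neighbors[i]) / k
        others = z[:i] + z[i + 1:]
        extreme = 0
        for _ in range(permutations):
            # hold location i fixed and draw random neighbors for it
            simulated = z[i] * sum(rng.sample(others, k)) / k
            if observed >= 0 and simulated >= observed:
                extreme += 1
            elif observed < 0 and simulated <= observed:
                extreme += 1
        results.append((observed, (extreme + 1) / (permutations + 1)))
    return results


# Invented data: a chain of five high values, five low values, and one island
values = [10, 10, 10, 10, 10, 1, 1, 1, 1, 1, 5.5]
neighbors = [[1]] + [[i - 1, i + 1] for i in range(1, 9)] + [[8], []]
results = local_moran(values, neighbors)
for i, (stat, p) in enumerate(results):
    print(i, stat, p)
```

A positive I with a small p-value marks a High-High or Low-Low location, a negative I marks an outlier, and the island at the end cannot be evaluated because it has no neighbors.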

The result for the broadband access data is below. The High-High polygons in red are statistically significant clusters of counties that have high percentages of broadband use: the Northeast corridor, much of California, the coastal Pacific Northwest, the Central Rocky Mountains, and certain large metro areas like Atlanta, Chicago, Minneapolis, big cities in Texas, and a few others. There is a relatively equal number of Low-Low counties that are statistically significant clusters of low broadband service. This includes much of the deep South, south Texas, and New Mexico. There are also a small number of outliers. Low-High counties represent statistically significant low values surrounded by higher values. Examples include highly urban counties like Philadelphia, Baltimore City, and Wayne County (Detroit) as well as some rural counties located along the fringe of metro areas. High-Low counties represent significant higher values surrounded by lower values. Examples include urban counties in New Mexico like Santa Fe, Sandoval (Albuquerque), and Otero (Alamogordo), and a number in the deep south. A few counties cannot be evaluated as they are islands (mostly in Hawaii) and thus have no neighbors.

LISA map of Broad Band Subscription by Household

LISA Map of % of Households that have Access to Broadband Internet by County (2013-2017 ACS). 999 permutations, 95% confidence level, queen contiguity
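
The four cluster labels on the map come from comparing each county's value, and the average of its neighbors, to the overall mean. Here is a minimal Python sketch of that labeling step; the function name and the percentages are hypothetical, and it leaves out the significance test that decides whether a county is colored at all.

```python
from statistics import mean

def lisa_quadrant(value, neighbor_values, overall_mean):
    """Label a location by its own value and its neighbors' average.

    The first word describes the location, the second its surroundings.
    """
    own = "High" if value >= overall_mean else "Low"
    around = "High" if mean(neighbor_values) >= overall_mean else "Low"
    return f"{own}-{around}"

# Hypothetical broadband percentages with an overall mean of 75
print(lisa_quadrant(88, [85, 90, 86], 75))  # High-High
print(lisa_quadrant(60, [85, 88, 90], 75))  # Low-High
print(lisa_quadrant(85, [60, 62], 75))      # High-Low
print(lisa_quadrant(58, [61, 55, 63], 75))  # Low-Low
```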

All ACS data is published at a 90% confidence level and margins of error are published for each estimate. Margins of error are typically higher for less populated areas, and for any population group that is small within a given area. I calculated the coefficient of variation for this variable at the county level to measure how precise the estimates are, and used GeoDa to create a quick histogram. The overwhelming majority had CV values below 15, which is regarded as being highly reliable. Only 16 counties had values that ranged from 16 to 24, which puts them in the medium reliability category. If we were dealing with a smaller population (for example, dial-up subscribers) or smaller geographies like ZCTAs or tracts, we would need to be more cautious in analyzing the results, and might have to aggregate smaller populations or areas into larger ones to increase reliability.
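
For reference, the coefficient of variation can be computed from an ACS estimate and its margin of error as sketched below. The estimates shown are hypothetical, not values from the broadband table, and the cutoff of 15 is the one used in this post. The calculation assumes the margin of error was published at the 90% confidence level, so it is divided by 1.645 to get the standard error.

```python
Z_90 = 1.645  # z-score for the 90% confidence level used for ACS margins of error

def coefficient_of_variation(estimate, moe):
    """Return the CV as a percentage: standard error relative to the estimate."""
    standard_error = moe / Z_90
    return standard_error / estimate * 100

def is_highly_reliable(cv, cutoff=15):
    """CV values below the cutoff are treated as highly reliable in this post."""
    return cv < cutoff

# Hypothetical estimates and margins of error
cv_large = coefficient_of_variation(5000, 600)
cv_small = coefficient_of_variation(800, 250)
print(round(cv_large, 1), is_highly_reliable(cv_large))  # 7.3 True
print(round(cv_small, 1), is_highly_reliable(cv_small))  # 19.0 False
```

The second example shows how a small estimate with a relatively wide margin of error drifts out of the highly reliable range.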

Wrap Up

The issue of the digital divide has gained more coverage in the news lately with the exploration of the geography of the “new economy”, and how technology-intensive industries are concentrating in certain major metros while bypassing smaller metros and rural areas. Lack of access to broadband internet and reliable wifi in rural areas and within older inner cities is one of the impediments to future economic growth in these areas.

You can download a shapefile with the data and results of the analysis described in this post.

Washington DC street

Using the ACS to Calculate Daytime Population

I’m in the home stretch for getting the last chapter of the first draft of my census book completed. The next to last chapter of the book provides an overview of a number of derivatives that you can create from census data, and one of them is the daytime population.

There are countless examples of using census data for site selection analysis and for comparing and ranking places for locating new businesses, providing new public services, and generally measuring potential activity or population in a given area. People tend to forget that census data measures people where they live. If you are trying to measure service or business potential for residents, the census is a good source.

Counts of residents are less meaningful if you want to gauge how crowded or busy a place is during the day. The population of an area changes during the day as people leave their homes to go to work or school, or go shopping or participate in social activities. Given the sharp divisions in the US between residential, commercial, and industrial uses created by zoning, residential areas empty out during the weekdays as people travel into the other two zones, and then fill up again at night when people return. Some places function as job centers, others serve as bedroom communities, and still others are a mixture of the two.

The Census Bureau provides recommendations for calculating daytime population using a few tables from the American Community Survey (ACS). These tables capture where workers live and work, which is the largest component of the daytime population.

Using these tables from the ACS:

Total resident population:
  B01003: Total Population
Total workers living in area, and workers who lived and worked in same area:
  B08007: Sex of Workers by Place of Work–State and County Level (‘Total:’ line and ‘Worked in county of residence’ line)
  B08008: Sex of Workers by Place of Work–Place Level (‘Total:’ line and ‘Worked in place of residence’ line)
  B08009: Sex of Workers by Place of Work–Minor Civil Division Level (‘Total:’ line and ‘Worked in MCD of residence’ line)
Total workers working in area:
  B08604: Total Workers for Workplace Geography

They propose two different approaches that lead to the same outcome. The simplest approach: add the total resident population to the total number of workers who work in the area, and then subtract the total resident workforce (workers who live in the area but may work inside or outside the area):

Daytime Population = Total Residents + Total Workers in Area - Total Resident Workers
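
Here is a short Python sketch of why two approaches can land on the same number. It assumes the second approach is framed in terms of commuter flows (residents, plus workers who commute in, minus workers who commute out), which is where the count of workers who live and work in the same area comes in. The function names and the toy numbers are hypothetical.

```python
def daytime_population(residents, workers_in_area, resident_workers):
    """Simplest approach: residents + workers in area - resident workers."""
    return residents + workers_in_area - resident_workers

def daytime_population_commuters(residents, workers_in_area,
                                 resident_workers, live_and_work):
    """Commuter-flow framing (assumed): residents + in-commuters - out-commuters."""
    in_commuters = workers_in_area - live_and_work    # work here, live elsewhere
    out_commuters = resident_workers - live_and_work  # live here, work elsewhere
    return residents + in_commuters - out_commuters

# Toy numbers for illustration only
toy = dict(residents=1000, workers_in_area=800, resident_workers=500)
simple = daytime_population(**toy)
flows = daytime_population_commuters(**toy, live_and_work=300)
print(simple, flows)  # 1300 1300
```

The count of people who live and work in the same area is subtracted from both commuter flows, so it cancels out and the two results match no matter what that count is.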

For example, according to the 2017 ACS Washington DC had an estimated 693,972 residents (from table B01003), 844,345 (+/- 11,107) people who worked in the city (table B08604), and 375,380 (+/- 6,102) workers who lived in the city. We add the total residents and total workers, and subtract the total workers who live in the city. The subtraction allows us to avoid double counting the residents who work in the city (as they are already included in the total resident population) while omitting the residents who work outside the city (who are included in the total resident workers). The result:

693,972 + 844,345 - 375,380 = 1,162,937

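And to get the new margin of error, take the square root of the sum of the squared margins of error for each estimate (the resident total is entered with a margin of zero here):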

SQRT(0^2 + 11,107^2 + 6,102^2) = 12,673

So the daytime population of DC is approx 468,965 people (68%) higher than its resident population. The district has a high number of jobs in the government, non-profit, and education sectors, but has a limited amount of expensive real estate where people can live. In contrast, I did the calculation for Philadelphia and its daytime population is only 7% higher than its resident population. Philadelphia has a much higher proportion of resident workers relative to total workers. Geographically the city is larger than DC and has more affordable real estate, and faces stiffer suburban competition for private sector jobs.
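
The DC figures can be reproduced with a few lines of Python. This is just the arithmetic from the example above, using the 2017 ACS values quoted in this post; the variable names are my own.

```python
from math import sqrt

# 2017 ACS estimates and margins of error for Washington DC, as quoted in this post
residents, residents_moe = 693_972, 0
workers_in_city, workers_in_city_moe = 844_345, 11_107
resident_workers, resident_workers_moe = 375_380, 6_102

# Daytime Population = Total Residents + Total Workers in Area - Total Resident Workers
daytime = residents + workers_in_city - resident_workers

# Margin of error for a sum or difference: root of the summed squared margins
moe = sqrt(residents_moe**2 + workers_in_city_moe**2 + resident_workers_moe**2)

# How much larger the daytime population is than the resident population
difference = daytime - residents
pct_higher = difference / residents * 100

print(daytime)            # 1162937
print(round(moe))         # 12673
print(difference)         # 468965
print(round(pct_higher))  # 68
```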

The variables in the tables mentioned above are also cross-tabulated in other tables by age, sex, race, Hispanic origin, citizenship status, language, poverty, and tenure, so it’s possible to estimate some characteristics of the daytime population. Margins of error will limit the usefulness of estimates for small population groups, and overall the 5-year period estimates are a better choice for all but the largest areas. Data for workers living in an area who lived and worked in the same area is reported for states, counties, places (incorporated cities and towns), and minor civil divisions (MCDs) for the states that have them.

Data for the total resident workforce is available for other, smaller geographies, but place of work is still reported relative to those larger areas: we know how many people in a census tract live and work in their county or place of residence, but not how many live and work in their tract of residence. In contrast, data on the number of workers from B08604 is not available for smaller geographies, which limits the application of this method to larger areas.

Download or explore these ACS tables from your favorite source: American FactFinder, Census Reporter, or the Missouri Census Data Center.