Map of Ukraine Wheat Exports in 2021

Import and Export Data for Countries: Grain from Ukraine

I’ve been receiving more questions about geospatial data sources as the semester draws to a close. I’ll describe some sources that I haven’t used extensively before in the next couple of posts, beginning with data on bilateral trade: imports and exports between countries. We’ll look at the IMF’s Direction of Trade Statistics (DOTS) and the UN’s COMTRADE database. Both sources provide web-based portals, APIs, and bulk downloading. I’ll focus on the portals.

IMF Direction of Trade Statistics

IMF DOTS provides monthly, quarterly, and annual import and export data, represented as total dollar values for all goods exchanged. The annual data goes back to 1947, while the monthly / quarterly data goes back to 1960. All countries that are part of the IMF are included, plus a few others. Data for exports are published on a Free and On Board (FOB) price basis, while imports are published on a Cost, Insurance, Freight (CIF) price basis. Here are definitions for each term, quoted directly from the OECD’s Statistical Glossary:

The f.o.b. price (free on board price) of exports and imports of goods is the market value of the goods at the point of uniform valuation, (the customs frontier of the economy from which they are exported). It is equal to the c.i.f. price less the costs of transportation and insurance charges, between the customs frontier of the exporting (importing) country and that of the importing (exporting) country.

The c.i.f. price (i.e. cost, insurance and freight price) is the price of a good delivered at the frontier of the importing country, including any insurance and freight charges incurred to that point, or the price of a service delivered to a resident, before the payment of any import duties or other taxes on imports or trade and transport margins within the country.

OECD Glossary of Statistical Terms

There are a few different ways to browse and search for data. Start with the Data Tables tab at the top, and Exports and Imports by Areas and Countries. The default table displays monthly exports by region and country for the entire world (you could switch to imports by selecting the Imports CIF tab beside the Export sFOB tab). Hitting the Calendar dropdown allows you to change the date range and frequency. Hitting the Country dropdown lets you select a specific region or country. In the example below, I’ve changed the calendar from months to years, and the country to Ukraine. By doing so, the table now depicts the total US dollar value of exports and imports between Ukraine and all other countries. The Export button at the top allows you to save the report in a number of formats, Excel being the most data friendly option.

IMF DOTS Basic Report – Total Value of Exports from Ukraine, Last Five Years

While this is the quickest option, it comes with some downsides; the biggest one is that there are no unique identifiers for the countries, which is important if you wanted to join this table to a GIS vector file for mapping, or another country-level table in a database.

A better approach is to return to the home page and use the Query tab, which allows you to get a unique identifier and filter out countries and regions that are not of interest.

DOTS Query Tab
  1. Under Columns, select the time frame and interval. For example, check Years for Frequency at the top, and change the dropdowns at the bottom from Months to Years. From -5 to 0 would give you the last five years in ascending order.
  2. Rows allows you to filter out countries or regions that you don’t want to see in the results. You can also change the attribute that is displayed. Once the menu is open, right click in an empty area and choose Attribute. Here you can choose a variant country name, or an ISO country code. ISO codes are commonly used for uniquely identifying countries.
  3. Indicator lets you choose Exports (FOB), Imports (CIF or FOB), or Trade Balance, all in US dollars.
  4. Counterpart country is the country or region that you want to show trade for, such as Ukraine in our previous example.
  5. The tabs along the top allow you to produce graphs instead of a table (View – Table), to pivot the table (Adjust), and calculate summaries like sums or averages (Advanced).
  6. Export to produce an Excel file. By choosing the ISO codes you’ll lose the country names, but you can join the result to another country data table or shapefile and grab the names from there.
Modify Time
Modify Rows – Country – Change Attribute
DOTS Modified Table to Export: Total Value of Exports from Ukraine Last Five Years

UN COMTRADE

If you want data on the exchange of specific goods and services, quantities in addition to dollar values, and exchanges beyond simple imports and exports, then the UN’s COMTRADE database will be your source. You need to register to download data, but you can generate previews without having to log in. There is an extensive wiki that describes how to use the different database tools, and summaries of technical terms that you need to know for extracting and interpreting the data. You’ll need some understanding of the different systems for classifying commodities and goods. Your options (the links that follow lead to documentation and code lists) are: the Harmonized Classification System (HC), the Standard Industrial Trade Classification (SITC), and the Broad Economic Categories (BEC). What’s the difference? Here are some summaries, quoted directly from a UN report on the BEC:

The HS classification is maintained by the World Customs Organization. Its main purpose is to classify goods crossing the border for import tariffs or for application of some non-tariff measures for safety or health reasons. The HS classification is revised on a five-year cycle (p. 18)

The original SITC was designed in the 1950s as a tool for collection and dissemination of international merchandise trade statistics that would help in establishing internationally comparable trade statistics. By its introduction in 1988, the HS took over as collection and dissemination tool, and SITC was thereon used mostly as an analytical tool. (p. 19)

The Classification by Broad Economic Categories (BEC) is an international product classification. Its main purpose is to provide a set of broad product categories for the analysis of trade statistics. Since its adoption in 1971, statistical offices around the world have used BEC to report trade statistics in a concise and meaningful way (p. iii). The broad economic categories of BEC include all subheadings of the HS classification. Therefore, the total trade in terms of HS equals the total trade of the goods side of BEC. (p. 18)

Classification of Broad Economic Categories Rev 5 (2018)

In short, go with the BEC if you’re interested in high-level groupings, or the HS if you need detailed subdivisions. The SITC would be useful if you need to go further back in time, or if it facilitates looking at certain subdivisions or groupings that the other systems don’t capture.

From COMTRADE’s homepage, I suggest leaving the defaults in place and doing a basic, preliminary search for all global exports for the most recent year, so you can see basic output on the next screen. Then you can apply filters for a narrower search.

For example, let’s look at annual exports of wheat from Ukraine to other countries. Under the HS filter, remove the TOTAL code. Start typing wheat, and you’ll see various product categories: 6-digit codes are the most specific, while 4-digit codes are broader groups that encapsulate the 6-digit categories. We’ll choose wheat and meslin 1001. We’ll select Ukraine as the Reporter (the country that supplied the statistics and represents the origin point), and for the 1st partner we’ll choose All to get a list of all countries that Ukraine exported wheat to. The 2nd partner country we’ll leave as World (alternatively, you would add specific countries here if you wanted to know if there were intermediary nations between the origin and destination).

UN COMTARDE Refine Search with Filters

Hit Preview to see the results. You can click on a heading to sort by dollar value, weight, or country name. Like IMF DOTS, UN COMTRADE measures dollar amounts of exports as FOB and imports as CIF. At this point, you would need to log in to download the data as a CSV (creating an account is free). You would also need to be logged in if you generated an extract that has more than 500 records, otherwise the results will be truncated. You could always copy and paste data for shorter extracts directly from the screen to a spreadsheet, but you wouldn’t get any of the extra metadata fields that come with download, like ISO Country Codes and the classification codes for goods and merchandise.

COMTRADE Filtered Results – Exports of Wheat and Meslin from Ukraine 2021
Data Exported from COMTRADE to CSV with Identifiers

Mapping

For data from either source, if you wanted to map it you’d need to have a data table where there is one row for each country with columns of attributes, and with one column that has the ISO country code to serve as a unique identifier. Save the data table in an Excel file or as a table in a database. Download a country shapefile from Natural Earth. Add the shapefile and data table to a project and join them using the ISO code. Natural Earth shapefiles have several different ISO code columns that represent nations, sovereigns, and parent – child relationships; be sure you select the right one. Data table records that represent regions or groupings of countries (i.e. the EU, ASEAN, sum of smaller countries per continent not enumerated, etc.) will fall out of the dataset, as they won’t have a matching feature in the country shapefile. The map at the top of this post was created in QGIS, using COMTRADE and Natural Earth.

Topo Claymont

Digital USGS Historic Topographic and Scientific Investigation Maps

This semester we launched a project to inventory our USGS topographic map collection. Our holdings include tens of thousands (probably over a 100,000) of these maps that depict the nation’s physical terrain and built environment in great detail. One of my former students wrote a Python program using the tkinter module to create a GUI, which we’re using to filter a list of published maps in a SQLite database to match ones that we have in hand. Here’s a short guide that documents our process.

The list we’re using as our base table is what powers USGS topoView, which allows you to browse and download over 200,000 historic topos (1880 to 2006) that have been digitized and georferenced. The application also includes maps produced from 2009 forward that are part of the newer US Topo project; these maps are created on an on-going basis by pulling together a number of existing government data sources (unlike the historic maps, which were created by manual field surveys and updated over time using aerial photographs and satellite imagery).

You can search topoView using the name of a location or quadrangle (the grid cell that represents the area of each map, named after the most prominent feature in that area) to find all available maps for that location. There’s a set of filters that allows you to focus on the Historic Topographic Map Collection (HTMC) versus the US Topo Collection (2009 to present), or a specific scale. Choose a scale and zoom in, and you’ll see the grid cells for that series so you can identify map coverage. The 24k scale is the most familiar series; as the largest scale / smallest area maps that the USGS produced, it provides the most detail and covers every state and US territory. Each map covers an area of 7.5 x 7.5 minutes (think of a degree as 60 mins) and an inch on these maps represents 2,000 feet. This scale was introduced in the late 1940s, and replaced both the 63k scale map (a 15 x 15 minute map where 1 inch = 1 mile) that was the previous standard, and the less common 48k scale.

USGS topoView application

There are also smaller scale maps, which cover larger areas. The 100k series was introduced in the mid 1970s and covers the lower 48 states and Hawaii. Each map covers an area of 30 x 60 minutes and uses metric units (1 inch = 1.6 miles). The 250k series was introduced in the 1940s by the US Army Map Service and was eventually taken over by the USGS. These maps include all 50 states, cover an area of 1 x 2 degrees, and use imperial units (1 inch = 4 miles). There are about 1,800 quads for the 100k series and only 900 or so for the 250k, versus over 60,000 for the 24k series.

Once you search for an area or click on a quad, you’ll see all the maps available in that area over time. Applying the scale filter shows you just maps at that scale, plus some similar but odd scale maps that are not numerous enough to get their own filter. The predominate year listed for each record is the “map year”, which is when field work was done to either create the map or substantively update it. There’s also an edition or “print year” that indicates when the map was printed. If you look at the metadata (use the info button) or preview the map, there may be an edit or photo revision year, indicating if the map was updated back at headquarters using air photos or imagery. The image below illustrates where you can find this information on a standard 24k scale map.

Collar of USGS 24k Topo Map
1: Map Scale 2: Quad Name 3: Map Year and Revision Year 4: Print Year

Clicking on the thumbnail of the map in the results gives you a quick full screen preview. There are several download options, including a JPEG if you want a small compressed image, or a GeoTiff if you want a lossless format with the best resolution, and if you want to use it in GIS software as a raster layer.

The changes you can see over time on these maps can be striking, illustrating the suburban sprawl of the 20th century. Consider the snippets from a 24k map of the Orlando West, Florida quadrangle below.

Orland West 1957
Orlando West 1956
Orlando West 1980
Orlando West 1980

While many people are familiar with the topographic series, the USGS also publishes a number of other map and report series that cover topics like hydrography, oil and gas exploration, mining, land use and land cover, and special scientific investigations. They have digitized (but not georeferenced) many of these maps, from the 1950s to present. You can browse through a list of all these publications, or you can search across them in the Publications Warehouse. If you search, try the Advanced Search and specify publication type and subtype as filters. Most of the maps are classified as publication type: Report, and subtype: USGS Numbered Series.

For example, the IMAP series includes special investigation maps that cover tectonic, geologic, mineral, topographic, and bathymetric maps of specific small or regional areas in the US. They also include maps of Antarctica, special investigations in other countries, the moon, and other planets and moons. Every report / map has a landing page with a permanent URL and doi that uses the series number of the map, with links to a PDF of the map as well as a Dublin Core metadata record. For example, here’s a Geologic Map of Io from 1992, part of the IMAP series.

Portion of a Geologic Map of the Jovian Moon Io

This is great, as you can use these records and metadata for building other interactive finding aids, and can link directly to individual maps. The USGS has created different portals for accessing subsets of these materials, such as this special topics page for identifying different planetary maps in the SIM and IMAP series.

Some other gems I’ve discovered stashed away in the publications warehouse: a poster of map projections (with a flip side portrait of Gerardus Merctor) which should be familiar to most 1990s university geography students; it was often hung in classrooms and provided as an insert in cartography textbooks. Also, a digitized copy of the book Maps for America. Originally published for the USGS centenary in 1979, this book provides a comprehensive history and overview of the topographic map series. The scanned copy is the 3rd edition, printed in 1987. If you suddenly find yourself in the position of having to curate a hundred thousand 20th century topo maps, there is no better guide than this book.



IPUMS CPS Table

Creating Geographic Estimates with the IPUMS CPS Online Data Analysis System

Introduction

In this post I’ll demonstrate how to use the IPUMS CPS Online Data Analysis System to generate summary data from the US Census Bureau’s Current Population Survey (CPS). The tool employs the Survey Documentation and Analysis system (SDA) created at UC Berkeley.

The CPS is a monthly stratified sample survey of 60k households. It includes a wide array of statistics, some captured routinely each month, others at various intervals (such as voter registration and participation, captured every November in even-numbered years). The same households are interviewed over a four-month period, then rotated out for four months, then rotated back in for a final four months. Given its consistency, breadth, high response rate and accuracy (interviews are conducted in-person and over the phone), researchers use the CPS microdata (individual responses to surveys that have been de-identified) to study demographic and socio-economic trends among and between different population groups. It captures many of the same variables as the American Community Survey, but includes a fair number that are not.

I think the CPS is used less often by geographers, as the sample size is too small to produce reliable estimates below the state or metropolitan area levels. I find that students and researchers who are only familiar with working with summary data often don’t use it; generating your own estimates from microdata records can be time consuming.

The IPUMS project provides an online analyzer that lets you generate summary estimates from the CPS without having to handle the individual sample records. I’ve used it in undergraduate courses where students want to generate extracts of data at the regional or state level, and who are interested in variables not collected in the ACS, such as generational households for immigrants. The online analyzer doesn’t include the full CPS, but only the data that’s collected in March as part of the core CPS series, and the Annual Social & Economic Supplement (ASEC). It includes data from 1962 to the present.

To access any of the IPUMS tools, you must register and create an account, but it’s free and non-commercial. They provide an ample amount of documentation; I’ll give you the highlights for generating a basic geographic-based extract in this post.

Creating a Basic Geographic Summary Table

Once you launch the tool, the first thing you need to do is select some variables. You can use the drill-down folder menus at the bottom left, but I find it’s easier to hit the Codebook button and peruse the alphabetical list. Let’s say we want to generate state-level estimates for nativity for a recent year. If we go into the codebook and look-up nativity, we see it captures foreign birthplace or parentage. Also in the list is a variable called statefip, which are the two-digit codes that uniquely identify every state.

Codebook for Nativity – Foreign Birthplace or Parentage

Back on the main page for the Analyzer in the tables tab, we provide the following inputs:

  1. Row represents our records or observations. We want a record for every state, so we enter the variable: statefip.
  2. Column represents our attributes or variables. In this example, it’s: nativity.
  3. Selection filter is used to specify that we want to generate estimates from a subset of all the responses. To generate estimates for the most recent year, we enter year as the variable and specify the filter value in parentheses: year(2020). If we didn’t specify a year, the program would use all the responses back to 1962 to generate the estimates.
  4. Weight is the value that’s used to weight the samples to generate the estimates. The supplemental person weight sdawt is what we’ll use, as nativity is measured for individual persons. There is a separate weight for household-level variables.
  5. Under the output option dropdown, we change the Percentaging option from column to row. Row calculates the percentage of the population in each nativity category within the state (row). The column option would provide the percentage of the population in each nativity category between the states. In this example, the row option makes more sense.
  6. For the confidence interval, check the box to display it, as this will help us gauge the precision of the estimate. The default level is 95%; I often change this to 90% as that’s what the American Community Survey uses.
  7. At the bottom of the screen, run the table to see the result.
CPS Online Analyzer – Generate Basic Extract for Nativity by State for a Single Year

In the result, the summary of your parameters appears at the top, with the table underneath. At the top of the table, the Cells contain legend lists what appears in each of the cells in order. In this example, the row percent is first, in bold. For the first cell in Alabama: 91.0% of the total population have parents who were both born in the US, the confidence interval is 90.2% to 91.8% (and we’re 90% confident that the true value falls within this range), and the large number at the bottom is the estimated number of cases that fall in this category. Since we filtered our observations for one year, this represents the total population in that category for that year (if we check the totals in the last column against published census data, they are roughly equivalent to the total population of each state).

Output Table – Nativity by State in 2020

Glancing through the table, we see that Alabama and Alaska have more cases where both parents are born in the US (91.0% and 85.1%) relative to Arizona (68.7%). Arizona has a higher percentage of cases where both parents are foreign born, or the persons themselves are foreign-born. The color coding indicates the Z value (see bottom of the table for legend), which indicates how far a variable deviates from the mean, with dark red being higher than the expected mean and dark blue being lower than expected. Not surprisingly, states with fewer immigrants have higher than average values for both parents native born (Alabama, Alaska, Arkansas) while this value is lower than average for more diverse states (Arizona, California).

To capture the table, you could highlight / copy and paste the screen from the website into a spreadsheet. Or, if you return to the previous tab, at the bottom of the screen before running the table, you can choose the option to export to CSV.

Variations for Creating Detailed Crosstabs

To generate a table to show nativity for all races:

Input Parameters – Generate Tables for Nativity by State for each Race

In the control box, type the variable race. The control box will generate separate tables in the results for each category in the control variable. In this case, one nativity table per racial group.

To generate a table for nativity specifically Asians:

Input Parameters – Generate Table for Nativity by State for Asians

Remove race from the control box, and add it in the filter box after the year. In parentheses, enter the race code for Asians; you can find this in the codebook for race: year(2020), race(651).

Now that we’re drilling down to smaller populations, the reliability of the estimates is declining. For example, in Arkansas the estimate for Asians for both parents foreign born is 32.4%, but the value could be as low as 22.2% or as high as 44.5%. The confidence interval for California is much narrower, as the Asian population is much larger there. How can we get a better estimate?

Output Table – Nativity by State for Asians in 2020

Generate a table for nativity for Asians over a five year period:

Input Parameters – Generate Table for Nativity by State for Asians for 5-year Period

Add more years in the year filter, separated by commas. In this version, our confidence intervals are much narrower; notice for Asians for both parents foreign born in Arkansas is now 19.2% with a range of 14.1% to 25.6%. By increasing the sample pool with more years of data, we’ve increased the precision of the estimate at the cost of accepting a broader time period. One big caveat here: the Weighted N represents the total number of estimated cases, but since we are looking at five years of data instead of one it no longer represents a total population value for a single year. If we wanted to get an average annual estimate for this 5-year time period (similar to what the ACS produces), we’d have to divide each of estimates by five for a rough approximation. You can download the table and do these calculations in a spreadsheet or stats program.

Output Table – Nativity by State for Asians between 2016-2020 (weighted N = estimate of total cases over 5 years, not an average annual value)

You can also add control variables to a crosstab. For example, if you added sex as a control variable, you would generate separate female and male tables for the nativity by state for the Asian population over a given time period,

Example of a Profile Table

If we wanted to generate a profile for a given place as opposed to a comparison table, we can swap the variables we have in the rows and columns. For example, to see nativity for all Hispanic subgroups within California for a single year:

Input Parameters – Generate a Profile Table for California of Nativity by Hispanic Groups in 2020

In this case, you could opt to calculate percentages by column instead of row, if you wanted to see percent totals across groups for the categories. You could show both in the same chart, but I find it’s difficult to read. In this last example, note the large confidence intervals and thus lower precision for smaller Hispanic groups in California (Cuban, Dominican) versus larger groups (Mexican, Salvadoran).

Output Table – Nativity by Hispanic Groups in California 2020 (confidence interval is much larger for smaller groups)

In short – this is handy tool if you want to generate some quick estimates and crosstabs from the CPS without having to download and weight microdata records yourself. You can create geographic data for regions, divisions, states, and metro areas. Just be mindful of confidence intervals as you drill down to smaller subgroups; you can aggregate by year, geography, or category / group to get better precision. What I’ve demonstrated is the tip of the iceberg; read the documentation and experiment with creating charts, statistical summaries, and more.

BEA Population Change Map

Population and Economic Time Series Data from the BEA

In this post, I’ll demonstrate how to access and download multiple decades of annual population data for US states, counties, and metropolitan areas in a single table. Last semester, I was helping a student in a GIS course find data on tuberculosis cases by state and metro area that stretched back several decades. In order to make meaningful comparisons, we needed to calculate rates, which meant finding an annual time series for total population. Using data directly from the Census Bureau is tough going, as they don’t focus on time series and you’d have to stitch together several decades of population estimates. Metropolitan areas add more complexity, as they are modified at least a few times each decade.

In searching for alternatives, I landed at the Bureau of Economic Analysis (BEA). As part of their charge is studying the economy over time, they gather data from the Census Bureau, Bureau of Labor Statistics, and others to build time series, and they update them as geography and industrial classification schemes change. They publish national, state, metropolitan area, and county-level GDP, employment, income, wage, and population tables that span multiple decades. Their economic profile table for metros and counties covers 1969 to present, while the state profile table goes back to 1958. Metropolitan areas are aggregates of counties. As metro boundaries change, the BEA normalizes the data, adjusting the series by taking older county-level data and molding it to fit the most recent metro definitions.

Finding the population data was a bit tricky, as it is embedded as one variable in the Economic Profile table (identified as CAINC30) that includes multiple indicators. Here’s the path to get it:

  • From the BEA website, choose Tools – Interactive Data.
  • From the options on the next page, choose Regional from the National, Industry, International or Regional data options. There’s also a link to a video that illustrates how to use the BEA interactive data tool.
  • From the Regional Data page, click “Begin using the data” (but note you can alternatively “Begin mapping the data” to make some basic web maps too, like the one in the header of this post).
  • On the next page are categories, and under each category are data tables for specific series. In this case, Personal Income and Employment by County and Metropolitan Area was what I wanted, and under that the Economic Profile CAINC30 table (states appear under a different heading, but there’s a comparable Economic Profile table for them, SAINC30).
  • On the multi-screen table builder, you choose a type of geography (county or different metro area options), and on the next tab you can choose individual places, hold down CTRL and select several, or grab them all with the option at the top of the dropdown. Likewise for the Statistic, choose Population (persons), or grab a selection, or take all the stats. Under the Unit of Measure dropdown, Levels gives you the actual statistic, but you can also choose percent change, index values, and more. On the next tab, choose your years.
  • On the final page, if your selection is small enough you can preview the result and then download. In this case it’s too large, so you’re prompted to grab an Excel or CSV file to download.

And there you have it! One table with 50+ years of annual population data, using current metro boundaries. The footnotes at the bottom of the file indicate that the latest years of population data are based on the most recent vintage estimates from the Census Bureau’s population estimates. Once the final intercensal estimates for the 2010s are released, the data for that decade will be replaced a final time, and the estimates from the 2020s will be updated annually as each new vintage is released until we pass the 2030 census. Their documentation is pretty thorough.

BEA 5-decade Time Series of Population Data by Metro Area

The Interactive Data table approach allows you to assemble your series step by step. If you wanted to skip all the clicking you can grab everything in one download (all years for all places for all stats in a given table). Of course, that means some filtering and cleaning post-download to isolate what you need. There’s also an API, and several other data series and access options to explore. The steps for creating the map that appears at the top of this post were similar to the steps for building the table.

Noise Complaint Kernels and Contours

Kernel Density and Contours in QGIS: Noisy NYC

In spatial analysis, kernel density estimation (colloquially referred to as a type of “hot spot analysis”) is used to explore the intensity or clustering of point-based events. Crimes, parking tickets, traffic accidents, bird sightings, forest fires, incidents of infections disease, anything that you can plot as a point at a specific period in time can be studied using KDE. Instead of looking at these features as a distribution of discrete points, you generate a raster that represents a continuous surface of values. You can either measure the density of the incidents themselves, or the concentration of a specific attribute that is tied to those incidents (like the dollar amount of parking tickets or the number of injuries in traffic accidents).

In this post I’ll demonstrate how to do a KDE analysis in QGIS, but you can easily implement KDE in other software like ArcGIS Pro or R. Understanding the inputs you have to provide to produce a meaningful result is more important than the specific tool. This YouTube video produced by the SEER Lab at the University of Florida helped me understand what these inputs are. They used the SAGA kernel tool within QGIS, but I’ll discuss the regular QGIS tool and will cover some basic data preparation steps when working with coordinate data. The video illustrates a KDE based on a weight, where there were single points that had a count-based attribute they wanted to interpolate (number of flies in a trap). In this post I’ll cover simple density based on the number of incidents (individual noise complaints), and will conclude by demonstrating how to generate contour lines from the KDE raster.

For a summary of how KDE works, take a look at the entry for “Kernel” in the Encyclopedia of Geographic Information Science (2007) p 247-248. For a fuller treatment, I always recommend Christopher Lloyd’s Spatial Data Analysis: An Introduction to GIS Users (2010) p 93-97 by Oxford Press. There’s also an explanation in the ArcGIS Pro documentation.

Data Preparation

I visited the NYC Open Data page and pulled up the entry for 311 Service Requests. When previewing the data I used the filter option to narrow the records down to a small subset; I chose complaints that were created between June 1st and 30th 2022, where the complaint type began with “Noise”, which gave me about 75,000 records (it’s a noisy town). Then I hit the Export button and chose one of the CSV formats. CSV is a common export option from open data portals; as long as you have columns that contain latitude and longitude coordinates, you will be able to plot the records. The NYC portal allows you to filter up front; other data portals like the ones in Philly and DC package data into sets of CSV files for each year, so if you wanted to apply filters you’d use the GIS or stats package to do that post-download. If shapefiles or geoJSON are provided, that will save you the step of having to plot coordinates from a CSV.

NYC Open Data 311 Service Requests

With the CSV, I launched QGIS, went to the Data Source Manager, and selected Delimited Text. Browsed for the file I downloaded, gave the layer a common sense name, and under geometry specified Point coordinates, and confirmed that the X field was my longitude column and the Y field was latitude. Ran the tool, and the points were plotted in the basic WGS 84 longitude / latitude system in degrees, which is the system the coordinates in the data file were in (generally a safe bet for modern coordinate data, but not always the case).

QGIS Add Delimited Text and Plot Coordinates

The next step was to save these plotted points in a file format that stores geometry and allows us to do spatial analysis. In doing that step, I recommend taking two additional ones. First, verify that all of the plotted data have coordinates – if there are any records where lat and long are missing, those records will be carried along into the spatial file but there will be no geometry for them, which will cause problems. I used the Select Features by Expression tool, and in the expression window typed “Latitude” is not null to select all the features that have coordinates.

QGIS Select by Expression

Second, transform the coordinate reference system (CRS) of the layer to a projected system that uses meters or feet. When we run the kernel tool, it will ask us to specify a radius for defining the density, as well as the size of the pixels for the output raster. Using degrees doesn’t make sense, as it’s hard for us to conceptualize distances in degrees, and they are not a constant unit of measurement. If you’ve googled around and read Stack Exchange posts or watched videos where a person says “You just have to experiment and adjust these numbers until your map looks Ok”, they were working with units in fractions of degrees. This is not smart. Transform the system of your layers!

I selected the layer, right clicked, Export, Save Selected Features As. The default output is a geopackage, which is fine. Otherwise you could select ESRI shapefile, both are vector formats that store geometry. For file name I browse … and save the file in a specific folder. Beside CRS I hit the globe button, and in the CRS Selector window typed NAD83 Long Island in the filter at the top, and at the bottom I selected the NAD83 / New York Long Island (ftUS) EPSG 2263 option system in the list. Every state in the US has one or more state plane zones that you can select for making optimal maps for that area, in feet or meters. Throughout the world, you could choose an appropriate UTM zone that covers your area in meters. For countries or continents, look for an equidistant projection (meters again).

QGIS Export – Save As

Clicked a series of Oks to create the new file. To reset my map window to match CRS of the new file, I selected that file, right clicked, Layer CRS, Set Project CRS from Layer. Removed my original CSV to avoid confusion, and saved my project.

QGIS Noise Complaints in Projected CRS

Kernel Density Estimation

Now our data is ready. Under the Processing menu I opened the toolbox and searched for kernel to find Heatmap (Kernel Density Estimation) under the Interpolation tools. The tool asks for an input point layer, and then a radius. The radius is used to define an area for calculating a local density estimate around each point. We can use a formula to determine an ideal radius; the hopt method seems to be commonly employed for this purpose.

To use the hopt formula, we need to know the standard distance for our layer, which measures the degree to which features are dispersed around the spatial mean or center of the distribution. A nice 3rd party plugin was created for calculating this. I went to the the plugins menu, searched for the Standard Distance plugin, and added it. Searched for it in the Processing toolbox and launched it. I provided my point layer for input, and specified an output file. The other fields are optional (if we were measuring an attribute of the points instead of the density of the points, we could specify the attribute as a weight column). The output layer consists of a circle where the center is the mean center of the distribution, and the circle represents the standard deviation. The attribute table contains one record, with the standard distance attribute of 36,046.18 feet (if no feature was created, the likely problem is you have records in the point file that don’t have geometry – delete them and try again).

Output from the Standard Distance Plugin

Knowing this, I used the hopt formula:

=((2/(3N))^0.25)SD

Where N is the number of features and SD is the standard distance. I used Excel to plug in these values and do the calculation.

((2/(374526))^0.25)36046.18 = 1971.33

Finally, I launched the heatmap kernel tool, specified my noise points as input, and the radius as 1,971 feet. The output raster size does take some experimentation. The larger the pixel size, the coarser or more general the resolution will be. You want to choose something that makes sense based on the size of the area, the number of points, and / or some other contextual information. Just like the radius, the units are based on the map units of your layer. If I type in 100 feet for Pixel X, I see I’ll have a raster with 1,545 rows and 1,565 columns. Change it to 200 feet, and I get 773 by 783. I’ll go with 200 feet (the distance between a “standard” numbered street block in midtown Manhattan). I kept the defaults for the other options.

QGIS Heatmap Kernel Density Estimation Window

The resulting raster was initially displayed in black and white. I opened the properties and symbology menu and changed the render type from Singleband gray to Singleband pseudocolor, and kept the default yellow to red scheme. Voila!

Kernel Density Estimate of NYC Noise Complaints June 2022

In June 2022 there were high clusters of noise complaints in north central Brooklyn, northern Manhattan, and the southwest portion of the Bronx. There’s a giant red hot spot in the north central Bronx that looks like the storm on planet Jupiter. What on earth is going on there? I flipped back to the noise point layer and selected points in that area, and discovered a single address where over 2,700 noise complaints about a loud party were filed on June 18 and 19! There’s also an address on the adjacent block that registered over 900 complaints. And yet the records do not appear to be duplicates, as they have different time stamps and closing dates. A mistake in coding this address, multiple times? A vengeful person spamming the 311 system? Or just one helluva loud party? It’s hard to say, but beware of garbage in, garbage out. Beyond this demo, I would spend more time investigating, would try omitting these complaints as outliers and run the heatmap tool again, and compare this output to different months. It’s also worth experimenting with the color classification scheme, and some different pixel sizes.

Kernel Results Zoomed In

Contour Lines

Another interesting way to visualize this data would be to generate contour lines based on the kernel output. I did a search for contour in the processing toolbox, and in the contour tool I provided the kernel noise raster as the input. For intervals between contour lines I tried 20 feet, and changed the attribute name to reflect what the contour represents: COMPLAINT instead of ELEV. Generated the new file, overlaid on top of the kernel, and now you can see how it represents the “elevation” of complaints.

Noise Complaint Kernel Density with Contour Lines

Switch the kernel off, symbolize the contours and add some labels, and throw the OpenStreetMap underneath, and now you can explore New York’s hills and valleys of noise. Or more precisely, the hills and valleys of noise complainers! In looking at these contours, it’s important to remember that they’re generated from the kernel raster’s grid cells and not from the original point layer. The raster is a generalization of the point layer, so it’s possible that if you look within the center of some of the denser circles you may not find, say, 340 or 420 actual point complaints. To generate a more precise set of contours, you would need to decrease the pixel size in the kernel tool (from say 200 feet to 100).

Noise Complaint Contours in Lower Manhattan, Northwest Brooklyn, and Long Island City

It’s interesting what you can create with just one set of points as input. Happy mapping!

US Census Data ALA Tech Report

ALA Tech Report on Using Census Data for Research

I have written a new report that’s just been released: US Census Data: Concepts and Applications for Supporting Research, was published as the May / June 2022 issue of the American Library Association’s Library and Technology Reports. It’s available for purchase digitally or in hard copy from the ALA from now through next year. It will also be available via EBSCOhost as full text, sometime this month. One year from now, the online version will transition to become a free and open publication available via the tech report archives.

The report was designed to be a concise primer (about 30 pages) for librarians who want to be knowledgeable with assisting researchers and students with finding, accessing, and using public summary census data, or who want to apply it to their own work as administrators or LIS researchers. But I also wrote it in such a way that it’s relevant for anyone who is interested in learning more about the census. In some respects it’s a good distillation of my “greatest hits”, drawing on work from my book, technical census-related blog posts, and earlier research that used census data to study the distribution of public libraries in the United States.

Chapter Outline

  1. Introduction
  2. Roles of the Census: in American society, the open data landscape, and library settings
  3. Census Concepts: geography, subject categories, tables and universes
  4. Datasets: decennial census, American Community Survey, Population Estimates, Business Establishments
  5. Accessing Data: data.census.gov, API with python, reports and data summaries
  6. GIS, historical research, and microdata: covers these topics plus the Current Population Survey
  7. The Census in Library Applications: overview of the LIS literature on site selection analysis and studying library access and user populations

I’m pleased with how it turned out, and in particular I hope that it will be used by MLIS students in data services and government information courses.

Although… I must express my displeasure with the ALA. The editorial team for the Library Technology Reports was solid. But once I finished the final reviews of the copy edits, I was put on the spot to write a short article for the American Libraries magazine, primarily to promote the report. This was not part of the contract, and I was given little direction and a month at a busy time of the school year to turn it around. I submitted a draft and never heard about it again – until I saw it in the magazine last week. They cut and revised it to focus on a narrow aspect of the census that was not the original premise, and they introduced errors to boot! As a writer I have never had an experience where I haven’t been given the opportunity to review revisions. It’s thoroughly unprofessional, and makes it difficult to defend the traditional editorial process as somehow being more accurate or thorough compared to the web posting and tweeting masses. They were apologetic, and are posting corrections. I was reluctant to contribute to the magazine to begin with, as I have a low opinion of it and think it’s deteriorated in recent years, but that’s a topic for a different discussion.

Stepping off the soapbox… I’ll be attending the ALA annual conference in DC later this month, to participate on a panel that will discuss the 2020 census, and to reconnect with some old colleagues. So if you want to talk about the census, you can buy me some coffee (or beer) and check out the report.

A final research and publication related note – the map that appears at the top of my post on the distribution of US public libraries from several years back has also made it into print. It appears on page 173 of The Argument Toolbox by K.J. Peters, published by Broadview Press. It was selected as an example of using visuals for communicating research findings, making compelling arguments in academic writing, and citing underlying sources to establish credibility. I’m browsing through the complimentary copy I received and it looks excellent. If you’re an academic librarian or a writing center professional and are looking for core research method guides, I would recommend checking it out.