research

Census Book

Exploring the US Census Book Published!

My book, Exploring the US Census: Your Guide to America’s Data, has been published! You can purchase it directly from SAGE Publishing or from Barnes and Nobles, Amazon, or your bookstore of choice (it’s currently listed for pre-order on Amazon but its availability there is imminent). It’s $45 for the paperback, $36 for the ebook. Data for the exercises and supplemental material is available on the publisher’s website, and I’ve created a landing page for the book on this site.

Exploring the US Census is the definitive researcher’s guide to working with census data. I place the census within the context of: US society, the open data movement, and the big data universe, provide a crash course on using the new data.census.gov, and introduce the fundamental concepts of census geography and subject categories (aka universes). One chapter is devoted to each of the primary datasets: decennial census (with details about the 2020 census that’s just over the horizon), American Community Survey, Population Estimates Program, and business data from the Business Patterns, Economic Census, and BLS. Subsequent chapters demonstrate how to: integrate census data into writing and research, map census data in GIS, create derivative measures, and work with historic data and microdata with a focus on the Current Population Survey.

I wrote the book as a hybrid between a techie guidebook and an academic text. I provide hands-on exercises so that you learn by doing (techie) while supplying sufficient context so you can understand and evaluate why you’re doing it (academic). I demonstrate how to find and download data from several different sources, and how to work with the data using free and open source software: spreadsheets (LibreOffice Calc), SQL databases (DB Browser for SQLite), and GIS (QGIS). I point out the major caveats and pitfalls of working with the census, along with many helpful tools and resources.

The US census data ecosystem provides us with excellent statistics for describing, studying, and understanding our communities and our nation. It is a free and public domain resource that’s a vital piece of the country’s social, political, and economic infrastructure and a foundational element of American democracy. This book is your indispensable road map for navigating the census. Have a good trip!

See the series – census book tag for posts about the content of the book, additional material that expands on that content (but didn’t make it between the covers), and the writing process.

A-Train Classic

Neighborhood Research and the Census for Undergrads

Each semester I visit several undergraduate classes in public affairs and journalism, to introduce students to census data. They’re researching or reporting on particular issues and trends in neighborhoods in New York City, and they are looking for statistics to either support their work or generate ideas for a story. I usually showcase the NYC Population Factfinder as a starting point, mention the Census Reporter for areas outside the city , and provide background info on the decennial census, American Community Survey, and census geography and subjects. This year I included two new examples toward the beginning of the lecture to spark their interest.

I recently helped reporter Susannah Jacob navigate census data for an article she wrote on hyper-gentrification in the West Village for the New York Review of Books. A perfect example, as it’s what the students are expected to do for their assignment! Like any good journalist (and human geographer), Susannah pounded the pavement of the neighborhood, interviewing residents and small businesses and observing and documenting the urban landscape and how it was changing. But she also wanted to see what the data could tell her, and whether it would corroborate or refute what she was seeing and hearing.

NYRB Article on the West Village

Source: Jacob & Roye, New York Review of Books, Oct 2019. https://www.nybooks.com/daily/2019/10/09/what-happened-to-the-west-village/

We used the NYC Population Factfinder to assemble census tracts to approximate the neighborhood, and I did a little legwork to pull data from the County / ZIP Code Business Patterns so we could see how the business landscape was changing. The most surprising stat we discovered was that the number of 1-unit detached homes had doubled. This wouldn’t be odd in many rapidly growing places in the US, but it’s unusual for an old, built-out urban neighborhood. A 1-unit detached home is a free-standing single family structure that doesn’t share walls with other buildings. Most homes in Manhattan are either attached (row houses / town houses) or units in multi-unit buildings (apartments / condos / co-ops). How could this be? Uber-wealthy people are buying up adjoining row homes, knocking down the walls, and turning them into urban mansions. Seems extraordinary, but apparently is part of a trend.

We certainly ran up against the limitations of ACS data. The estimates for tracts have large margins of error, and when comparing two short time frames it’s difficult to detect actual change, as differences in estimates are clouded by sampling noise. Even after aggregating several tracts, many of the estimates for change weren’t reliable enough to report. When they were (as in the housing example) you could only say that there has been a relative increase without becoming wedded to a precise number. In this case, from 214 (+/- 127) detached units in 2006-2010 to 627 (+/-227) in 2013-2017, an increase of 386 (+/- 260). Not great estimates, but you can say it’s an increase as the low end for change is still positive at 126 units. Considering the time frame and character of the neighborhood, that’s still noteworthy (bearing in mind we’re working with a 90% confidence interval). In cases where the differences overlap and could represent either an increase or decrease there are few claims you can make, and it’s best to walk away (or look at larger area). I always discuss the margin of error with students and caution them about treating these numbers as counts.

While census data is invaluable for describing and studying individual places, it’s inherent geographic nature also allows us to study places in relation to each other, and to illustrate geographic patterns. For my second example, I zoom out and show them this map of racial-ethnic distribution in the United States:

Map of US Racial and Ethnic Diversity

Source: William H. Frey analysis of US Census population estimates, 2018. https://www.brookings.edu/research/americas-racial-diversity-in-six-maps/

This is one of a series of six maps by demographer William Frey at the Brookings Institute that highlights the geographic diversity of the United States. In this map, each county is shaded for a particular race / ethnicity if the population of that group in that county is greater than that group’s share of the national population. For example, Hispanics / Latinos represent 18.3% of the total US population, so counties where they represent more than this percentage are shaded.

For the purpose of the class, it helps make the census ‘pop’ and gets the students to think about the statistics as geospatial datasets that they can see and relate to, and that can form the basis for interesting research.

Some footnotes – if you like Frey’s maps, I highly recommend his book Diversity Explosion: How New Racial Demographics are Remaking America. It explores the evolving demographic and geographic landscape of the US with clear, accessible writing and more of these great maps (in color).

I used the pic at the top of this post as the background for my intro slide. It’s a screenshot of a city from A-Train, a 1992 city-building train simulator that was ported from Japan to the world by Artdink and Maxis, following the success of something called SimCity. It wasn’t nearly as successful, but I always liked the graphics which have now attained a retro-gaming vibe.

FRED Chart - Pesronal Savings Rate

Finding Economic Data with FRED

I attended ALA’s annual conference in DC last month, where I met FRED. Not a person, but a database. I can’t believe I hadn’t met FRED before – it is an amazingly valuable resource for national, time-series economic data.

FRED was created by the Economic Research unit of the Federal Reserve Bank of St. Louis. It was designed to aggregate economic data from many government sources into a centralized database, with straightforward interface for creating charts and tables. At present, it contains 567,000 US and international time series datasets from 87 sources.

Categories of data include banking and finance (interest and exchange rates, lending, monetary data), labor markets (basic demographics, employment and unemployment, job openings, taxes, real estate), national accounts (national income, debt, trade), production and business (business cycles, production, retail trade, sector-level information about industries),  prices (commodities, consumer price indexes) and a lot more. Sources include the Federal Reserve, the Bureau of Labor Statistics, the Census Bureau, the Bureau of Economic Analysis, the Treasury Department, and a mix of other government and corporate sources from the US and around the world.

On their home page at https://fred.stlouisfed.org/ you can search for indicators or choose one of several options for browsing. The default dashboard shows you some of the most popular series and newest releases at a glance. Click on Civilian Unemployment Rate, and you retrieve a chart with monthly stats that stretch from the late 1940s to the present. Most of FRED’s plots highlight periods of recession since these have a clear impact on economic trends. You can modify the chart’s date range, change the frequency (monthly, quarterly, annually – varies by indicator), download the chart or the underlying data in a number of formats, and share a link to it. There are also a number of advanced customization features, such as adding other series to the chart. Directly below the chart are notes that provide a clear definition of the indicator and its source (in this case, the Bureau of Labor Statistics) and links to related tables and resources.

FRED - Chart of Civilian Unemployment Rate

The unemployment rate is certainly something that you’d expect to see, but once you browse around a bit you’ll be surprised by the mix of statistics and the level of detail. I happened to stumble across a monthly Condo Price Index for the New York City Metro Area.

Relative to other sources or portals, FRED is great for viewing and retrieving national (US and other countries) economic and fiscal data and charts gathered from many sources. It’s well suited for time-series data; there are lots of indexes and you can opt for seasonally adjusted or unadjusted values. Many of the series include data for large regions of the US, states, metro areas, and counties. The simplest way to find sub-national data is to do a search, and once you do you can apply filters for concepts, frequencies, geographies, and sources. FRED is not the place to go if you need data for small geographies below the county level. If you opt to create a FRED account (purely optional) you’ll be able to save and track indicators that you’re interested in and build your own dashboards.

If you’re interested in maps, visit FRED’s brother GeoFRED at https://geofred.stlouisfed.org/.  The homepage has a series of sample thematic maps for US counties and states and globally for countries. Choose any map, and once it opens you can change the geography and indicator to something else. You can modify the frequency, units, and time periods for many of the indicators, and you have basic options for customizing the map (colors, labels, legend, etc.) The maps are interactive, so you can zoom in and out and click on a place to see its data value. Most of the county-level data comes from the Census Bureau, but as you move up to states or metro areas the number of indicators and sources increase. For example, the map below shows individual income taxes collected per capita by state in 2018.

GeoFRED - State Income Tax

There’s a basic search function for finding specific indicators. Just like the charts, maps can be downloaded as static images, shared and embedded in websites, and you can download the data behind the map (it’s simpler to download the same indicator for multiple geographies using GeoFRED compared to FRED).

Take a few minutes and check it out. For insights and analyses of data published via FRED, visit FRED’s blog at https://fredblog.stlouisfed.org/.

Census Workshop Recap

I’ve been swamped these past few months, revising my census book, teaching a spatial database course, and keeping the GIS Lab running. Thus, this will be a shorter post!

Last week I taught a workshop on understanding, finding, and accessing US Census Data at the Metropolitan Library Council of New York. If you couldn’t make it, here are the presentation slides and the group exercise questions.

Most of the participants were librarians who were interested in learning how to help patrons find and understand census data, but there were also some data analysts in the crowd. We began with an overview of how the census is structured by dataset, geography, and subject categories. I always cover the differences between the decennial census and the ACS, with a focus on how to interpret ACS estimates and gauge their reliability.

For workshops I think it’s best to start with searching for profiles (lots of different data for one place). This gives new users a good overview of the breadth and depth of the types of variables that are available in the census. Since this was a New York City-centric crowd we looked at the City’s excellent NYC Population Factfinder first. The participants formed small groups and searched through the application to answer a series of fact-finding questions that I typically receive. Beyond familiarizing themselves with the applications and data, the exercises also helped to spark additional questions about how the census is structured and organized.

Then we switched over to the Missouri Census Data Center’s profile and trends applications (listed on the right hand side of their homepage) to look up data for other parts of the country, and in doing so we were able to discuss the different census geographies that are available for different places. Everyone appreciated the simple and easy to use interface and the accessible tables and graphics. The MCDC doesn’t have a map-based search, so I did a brief demo of TIGERweb for viewing census geography across the country.

Once everyone had this basic exposure, we hopped into the American Factfinder to search for comparison tables (a few pieces of data for many places). We discussed how census data is structured in tables and what the difference between the profile, summary, and detailed tables are. We used the advanced search and I introduced my tried and true method of filtering by dataset, geography, and topic to find what we need. I mentioned the Census Reporter as good place to go for ACS documentation, and as an alternate source of data. Part of my theme was that there are many tools that are suitable for different needs and skill levels, and you can pick your favorite or determine what’s suitable for a particular purpose.

We took a follow-the-leader approach for the AFF, where I stepped through the website and the process for downloading two tables and importing them into a spreadsheet, high-lighting gotchas along the way. We did some basic formulas for aggregating ACS estimates to create new margins of error, and a VLOOKUP for tying data from two tables together.

We wrapped up the morning with a foreshadowing of what’s to come with the new data.census.gov (which will replace the AFF) and the 2020 census. While there’s still much uncertainty around the citizenship question and fears of an under count, the structure of the dataset won’t be too different from 2010 and the timeline for release should be similar.

iceland_placename

Place Names: Comparing Two Global Gazetteers

Gazetteers are directories of place names and locations, which are useful for:

  1. Identifying variations in place names
  2. Obtaining coordinates
  3. Locating a place within a hierarchy of places
  4. Generating lists of types of features

For example, if you’re working with data that’s associated with specific cities, mountains, or bodies of water, and you have the names of these features but not the coordinates or the country or state / province where they’re located, you can use a gazetteer to obtain all three. Or, if you want to create a map of a specific type of feature (i.e. populated places, ruins, mines) or want map labels for features (forests, bodies of water) you can extract and plot the gazetteer data in GIS.

In this post I’ll provide an overview of two major global gazetteers: the GEOnet Names Server and Geonames. Each one provides several different interfaces and services for exploring and accessing data which I’ll briefly mention, but I’ll focus on on the data files that you can download and what’s contained in them. I’ll conclude with a strategy for relating a small to medium place-based data file of your own to the gazetteer to obtain coordinates. If you have a file with hundreds or a few thousand records and were planning to get coordinates by eyeballing Google Maps and clicking one by one, try this instead.

NGA GNS

File Downloads | Documentation and code book

The US National Geospatial-Intelligence Agency (NGA) maintains a vast gazetteer with data for all of the countries in the world (almost) and provides it to the public via the GEOnet Names Server (GNS). The GNS gazetteer does NOT include features in the United States or any of its territories; the US Geological Survey maintains a separate system called the Geographic Names Information System (GNIS) whose structure and organization is different.

The GNS is updated on a weekly basis and is provided through a number of interfaces that include a map-based and a text-based search, and Web Mapping (WMS) and Web Feature (WFS) Services that allow you to display data in a GIS or a web map.

Data files are packaged on a country by country basis. Alternatively you can download one file that has the whole world in it, or an archive with separate files for each country. The data is stored in tab-delimited text files that include a header row (i.e. the column names). ZIP files for each country include a primary file that contains all the country’s features, and a series of files that contain a subset of the primary file based on feature type. So, if you wanted to work with just populated places or with hydrographic features you can work with the specific file instead of having to filter them out of the primary one.

Each record in the GNS represents a name for a feature, as opposed to a feature itself. Thus, if a feature is known by more than one name it will appear multiple times in the file. Each record has a unique feature identifier (UFI) and a unique name identifier (UNI) which are large integers. The UFI number is repeated in the data, while the UNI is unique. The GNS files contain a number of different columns containing several feature names (short names, long ones, with and without diacritics) and a name type column (NT) that indicates whether the record is for a an approved (N), or variant name (V). If you want a list of features without duplicates, you would need to create a subset of the records that only includes the approved name.

Features are classified into nine broad classes (FC), which in turn are subdivided into many different designations (DSG). The nine classes are: administrative region, populated place, vegetation, locality or area, undersea, roads and railroads, hypsographic (terrain), hydrographic (water), and spot (point-based features). Additional columns include codes designating the size of a populated place (PC) and relative importance of the feature (DISPLAY) which is useful when mapping data at varying scales. The GNS does not contain information on actual population or elevation (this was included in the past but is no longer available).

The GNS includes a few geographic references that indicate where the feature is located. There is a global region code (RC) in the first column, a primary country code (cc1) and an administrative division (state or province) code for the primary country, and a secondary country code (cc2). Geographic features like rivers, seas, mountains, and forests may span the boundary of more than one country, so the cc1 and cc2 columns indicate this. Data in these fields may be stored as a comma-separated list or array with the different codes. The GNS uses two-letter FIPS 10-4 country codes created by the US government.

Country codes in the GNS

This SQL query illustrates how country and admin1 codes are stored in the GNS, and how some features (streams in this case) span several countries.

Lastly, longitude and latitude coordinates are provided in separate fields in two formats: decimal degrees (needed for plotting and mapping) and degrees-minutes-seconds. The coordinates are in the WGS 84 CRS (EPSG 4326).

Geonames

File Downloads | Documentation and code book

Geonames is the Wikipedia or OpenStreetMap of gazetteers. It’s a collaborative, crowd-sourced project. Many users may contribute a few locations or make a correction or two, but by and large most of the data comes from public or government sources that is loaded into Geonames en masse and subsequently modified. Geonames provides a text and map-based search, and an API that let’s scripters and programmers directly access the data.

Data files are packaged country by country, or globally by certain types (i.e. all countries or the largest cities). The data is stored in tab-delimited text files without a header row, so you need to consult the documentation to identify the columns. All data for each country is packaged in a single file.

Unlike the GNS, each Geonames record represents a specific feature. There is a conventional name (name) and a variant that uses plain ascii characters (asciiname). Some variant names are included in a single list / array column called alternatenames; to get a full list of variants and spellings in different languages you would download a separate alternate names file that you could link to this one. Each feature is assigned a geonameid, which is simply a large unique integer.

Features are divided into the same nine classes that are used in the GNS, and the subdivisions are the same as well. Documentation for the classes and subdivisions is provided. Population and elevation data is provided when available and relevant, but there’s no information on timeliness or source in the data file (but you can view the full edit history for a record in the online interface).

Geonames goes to great lengths to provide the geographic framework or hierarchy for each feature, so you can get instant geographic context. They use two-letter ISO country codes to designate countries (country_code), a list of alternate or secondary countries (cc2), and for the primary country up to four different levels of administrative divisions (i.e. state / province, county, municipality, etc). There’s also a field that indicates what timezone each feature is in.

There is one set of longitude and latitude coordinates in decimal degrees in the WGS84 CRS.

Geonames Belize City

Geonames search result for Belize City, illustrating options and available data.

Summary Comparison

To compare the different files I downloaded data for Belize, since it has a small number of records. The GNS file had 2,801 records for names, but if you look at unique features the record count was 2,180. The Geonames file for Belize has a comparable number of 2,309.

Commonalities

  • Free and publicly available
  • Tab-delimited text in country-based files
  • Longitude and latitude coordinates in decimal degrees in WGS84
  • Same feature classification system with nine classes and multiple sub-classes

GNS

  • A single, official government source
  • A file of feature names: must filter out variants to get unique feature records
  • File comes with column header
  • Files are divided into sub-files for feature classes
  • Uses FIPS codes for countries
  • Useful fields for ranking features for mapping
  • Limited data on geographic hierarchy
  • No data on population or elevation
  • Lacks data for the United States and territories (obtainable via the USGS GNIS)

Geonames

  • Collaborative project with data from many sources
  • A file of features, variant names included in separate column
  • Additional alternate names and spellings in most languages available in separate files
  • File lacks column header
  • Uses ISO codes for countries
  • Extensive information on geographic hierarchy
  • Has population, elevation, and timezone for certain features
  • No ranking columns for map display

Gazetteer Caveats

1. It’s important to recognize that each source uses different codes for classifying countries: the GNS uses FIPS and Geonames uses ISO. While they appear similar (two-letter abbreviations) they are NOT the same: The FIPS code for Belize is BH and the ISO Code if BZ; in the ISO system BH is for Bahrain while the FIPS system doesn’t use BZ as a code. The CIA World Factbook includes a table comparing different country code systems. The GNS will convert to ISO at some uncertain date in the future.

2. Gazetteer data must be imported using UTF-8 encoding to preserve all the characters from the various alphabets.

3. Each feature in a gazetteer will have longitude and latitude coordinates that represent the geographic center of a feature. That means that a large areal feature like a country, a linear feature like a road, and a small point feature like a monument will have one coordinate pair. The coordinates for the monument will be pretty precise, while the set for the road and country are broad generalizations. Long linear features like roads and rivers may appear in the datasets several times as distinct feature records at different points. While it’s possible to get bounding box coordinates from Geonames, this data is not included in the downloadable country files.

4. A place name may appear multiple times in a gazetteer because names are not unique. Several different places of the same type may have the same name, and several features of different types may have the same name. For example, the Geonames file for Belize has four places name Santa Elena; two are populated places in different parts of the country while the other two are spot features (a camp and an estate) that are located near each of the populated places. The GNS file has even more records for this place, some with the approved name Santa Elena and others with the variant Saint Helena.

GNS Names and Variants

GNS records for Santa Elena, Belize. Notice the UFI is duplicated for features that have multiple names while the UNI is unique. The NT field indicates approved names (N) versus other types like variants (V). Records are for a mix of populated places (P, PPL) and spot (S) types of various kinds (ancient site, campground, and estate).

For all these reasons, it rarely makes sense to use the files in their entirety for obtaining names and coordinates or plotting places. You’ll want to extract data just for the types of features that you need. If you’re trying to match a list of place names to the gazetteer you’ll need to insure that you’re matching the right name to the right place. You can use the feature classes and the administrative divisions of the country to narrow down the location, and when in doubt use the gazetteer map interfaces to locate a specific place.

Matching Your Own Data to a Gazetteer

Winnow down the gazetteer file to just the features you need. Make sure that all the place names in your own data file are standardized so you don’t have variant spellings for the same place. In your data add a column for a unique identifier at the beginning of the sheet. Locate each place in your file in the gazetteer, then copy the unique ID from that file into your sheet. Then, if you’re using a spreadsheet you can use the VLOOKUP formula to use the ID from your sheet to pull related data from the gazetteer sheet (the longitude and latitude coordinates, codes for the administrative divisions, etc). This saves you a lot of copying and pasting. Similarly, if you were using a relational database you can write a JOIN statement to tie the two tables together using the ID.

This approach saves you the time of manually clicking on Google Maps or OSM to look up coordinates for a place and transcribing them, and you get the added benefit of grabbing any extra useful information the gazetteer provides. If you haven’t started the process of gathering your own data, start with the gazetteer file: winnow it down and append your own data to it as your research progresses.

But what if you had tons of coordinates that you need to retrieve? Because of the ambiguity in place names using a VLOOKUP or JOIN based on the name will be imprecise, because there may be more than one place with the same name and you’ll have no way of knowing if you selected the right one. You could modify your own data and the data in the gazetteer by concatenating administrative codes to the place name (i.e. St. Elena, 02) to make the name more precise and increase the chances of an accurate join. This approach requires you to be familiar with the administrative subdivisions in the areas you’re researching.

If you were trying to identify coordinates for tens of thousands of towns, cities, and larger administrative divisions you could try using a geocoder instead of a gazetteer. Geocoders are designed primarily for obtaining coordinates for addresses, but if an exact match can’t be found many will return coordinates for the smallest possible area that’s part of the address. If you provided a list of cities that also include a state / province and country, you could obtain the coordinates for just the city.

A final alternative where you can get a wider range of features in a geospatial format in bulk is the OpenStreetMap. I’ll return to this in a future post, but there’s an excellent OSM – QGIS tutorial that can help get you started.

Interested in learning more? If you’re in the spatial sciences or digital humanities check out this book: Placing Names: Enriching and Integrating Gazetteers.

LISA map of Broad Band Subscription by Household

Mapping US Census Data on Internet Access

ACS Data on Computers and the Internet

The Census Bureau recently released the latest five-year period estimates from the American Community Survey (ACS), with averages covering the years from 2013 to 2017.

Back in 2013 the Bureau added new questions to the ACS on computer and internet use: does a household have a computer or not, and if yes what type (desktop or laptop, smartphone, tablet, or other), and does a household have an internet subscription or not, and if so what kind (dial-up, broadband, and type of broadband). 1-year averages for geographies with 65,000 people or more have been published since 2013, but now that five years have passed there is enough data to publish reliable 5-year averages for all geographies down to the census tract level. So with this 2013-2017 release we have complete coverage for computer and internet variables for all counties, ZCTAs, places (cities and towns), and census tracts for the first time.

Summaries of this data are published in table S2801, Types of Computers and Internet Subscriptions. Detailed tables are numbered B28001 through B28010 and are cross-tabulated with each other (presence of computer and type of internet subscription) and by age, educational attainment, labor force status, and race. You can access them all via the American Factfinder or the Census API, or from third-party sites like the Census Reporter. The basic non-cross-tabbed variables have also been incorporated into the Census Bureau’s Social Data Profile table DP02, and in the MCDC Social profile.

The Census Bureau issued a press-release that discusses trends for median income, poverty rates, and computer and internet use (addressed separately) and created maps of broadband subscription rates by county (I’ve inserted one below). According to their analysis, counties that were mostly urban had higher average rates of access to broadband internet (75% of all households) relative to mostly rural counties (65%) and completely rural counties (63%). Approximately 88% of all counties that had subscription rates below 60 percent were mostly or completely rural.

Figure 1. Percentage of Households With Subscription to Any Broadband Service: 2013-2017[Source: U.S. Census Bureau]

Not surprisingly, counties with lower median incomes were also associated with lower rates of subscription. Urban counties with median incomes above $50,000 had an average subscription rate of 80% compared to 71% for completely rural counties. Mostly urban counties with median incomes below $50k had average subscription rates of 70% while completely rural counties had an average rate of 62%. In short, wealthier rural counties have rates similar to less wealthy urban counties, while less wealthy rural areas have the lowest rates of all. There also appear to be some regional clusters of high and low broadband subscriptions. Counties within major metro areas stand out as clusters with higher rates of subscription, while large swaths of the South have low rates of subscription.

Using GeoDa to Identify Broadband Clusters

I was helping a student recently with making LISA maps in GeoDa, so I quickly ran the data (percentage of households with subscription to any broadband service) through to see if there were statistically significant clusters. It’s been a couple years since I’ve used GeoDa and this version (1.12) is significantly more robust than the one I remember. It focuses on spatial statistics but has several additional applications to support basic data mapping and stats. The interface is more polished and the software can import and export a number of different vector and tabular file formats.

The Univariate Local Moran’s I analysis, also known as LISA for local indicators of spatial auto-correlation, identifies statistically significant geographic clusters of a particular variable. Once you have a polygon shapefile or geopackage with the attribute you want to study, you add it to GeoDa and then create a weights file (Tools menu) using the unique identifier for the shapes. The weights file indicates how individual polygons neighbor each other: queens contiguity classifies features as neighbors as long as they share a single node, while rooks contiguity classifies them as neighbors if they share an edge (at least two points that can form a line).

Once you’ve created and saved a weights file you can run the analysis (Shapes menu). You select the variable that you want to map, and can choose to create a cluster map, scatter plot, and significance map. The analysis generates 999 random permutations of your data and compares it to the actual distribution to evaluate whether clusters are likely the result of random chance, or if they are distinct and significant. Once the map is generated you can right click on it to change the number of permutations, or you can filter by significance level. By default a 95% confidence level is used.

The result for the broadband access data is below. The High-High polygons in red are statistically significant clusters of counties that have high percentages of broadband use: the Northeast corridor, much of California, the coastal Pacific Northwest, the Central Rocky Mountains, and certain large metro areas like Atlanta, Chicago, Minneapolis, big cities in Texas, and a few others. There is a relatively equal number of Low-Low counties that are statistically significant clusters of low broadband service. This includes much of the deep South, south Texas, and New Mexico. There are also a small number of outliers. Low-High counties represent statistically significant low values surrounded by higher values. Examples include highly urban counties like Philadelphia, Baltimore City, and Wayne County (Detroit) as well as some rural counties located along the fringe of metro areas. High-Low counties represent significant higher values surrounded by lower values. Examples include urban counties in New Mexico like Santa Fe, Sandoval (Albuquerque), and Otero (Alamogordo), and a number in the deep south. A few counties cannot be evaluated as they are islands (mostly in Hawaii) and thus have no neighbors.

LISA map of Broad Band Subscription by Household

LISA Map of % of Households that have Access to Broadband Internet by County (2013-2017 ACS). 999 permutations, 95% conf interval, queens contiguity

All ACS data is published at a 90% confidence level and margins of error are published for each estimate. Margins of error are typically higher for less populated areas, and for any population group that is small within a given area. I calculated the coefficient of variation for this variable at the county level to measure how precise the estimates are, and used GeoDa to create a quick histogram. The overwhelming majority had CV values below 15, which is regarded as being highly reliable. Only 16 counties had values that ranged from 16 to 24, which puts them in the medium reliability category. If we were dealing with a smaller population (for example, dial-up subscribers) or smaller geographies like ZCTAs or tracts, we would need to be more cautious in analyzing the results, and might have to aggregate smaller populations or areas into larger ones to increase reliability.

Wrap Up

The issue of the digital divide has gained more coverage in the news lately with the exploration of the geography of the “new economy”, and how technology-intensive industries are concentrating in certain major metros while bypassing smaller metros and rural areas. Lack of access to broadband internet and reliable wifi in rural areas and within older inner cities is one of the impediments to future economic growth in these areas.

You can download a shapefile with the data and results of the analysis described in this post.