
FRED Chart - Personal Savings Rate

Finding Economic Data with FRED

I attended ALA’s annual conference in DC last month, where I met FRED. Not a person, but a database. I can’t believe I hadn’t met FRED before – it is an amazingly valuable resource for national, time-series economic data.

FRED was created by the Economic Research unit of the Federal Reserve Bank of St. Louis. It was designed to aggregate economic data from many government sources into a centralized database, with a straightforward interface for creating charts and tables. At present, it contains 567,000 US and international time series datasets from 87 sources.

Categories of data include banking and finance (interest and exchange rates, lending, monetary data), labor markets (basic demographics, employment and unemployment, job openings, taxes, real estate), national accounts (national income, debt, trade), production and business (business cycles, production, retail trade, sector-level information about industries), prices (commodities, consumer price indexes) and a lot more. Sources include the Federal Reserve, the Bureau of Labor Statistics, the Census Bureau, the Bureau of Economic Analysis, the Treasury Department, and a mix of other government and corporate sources from the US and around the world.

On their home page at https://fred.stlouisfed.org/ you can search for indicators or choose one of several options for browsing. The default dashboard shows you some of the most popular series and newest releases at a glance. Click on Civilian Unemployment Rate, and you retrieve a chart with monthly stats that stretch from the late 1940s to the present. Most of FRED’s plots highlight periods of recession since these have a clear impact on economic trends. You can modify the chart’s date range, change the frequency (monthly, quarterly, annually – varies by indicator), download the chart or the underlying data in a number of formats, and share a link to it. There are also a number of advanced customization features, such as adding other series to the chart. Directly below the chart are notes that provide a clear definition of the indicator and its source (in this case, the Bureau of Labor Statistics) and links to related tables and resources.

FRED - Chart of Civilian Unemployment Rate

The unemployment rate is certainly something that you’d expect to see, but once you browse around a bit you’ll be surprised by the mix of statistics and the level of detail. I happened to stumble across a monthly Condo Price Index for the New York City Metro Area.

Relative to other sources or portals, FRED is great for viewing and retrieving national (US and other countries) economic and fiscal data and charts gathered from many sources. It’s well suited for time-series data; there are lots of indexes and you can opt for seasonally adjusted or unadjusted values. Many of the series include data for large regions of the US, states, metro areas, and counties. The simplest way to find sub-national data is to do a search, and once you do you can apply filters for concepts, frequencies, geographies, and sources. FRED is not the place to go if you need data for small geographies below the county level. If you opt to create a FRED account (purely optional) you’ll be able to save and track indicators that you’re interested in and build your own dashboards.

If you’re interested in maps, visit FRED’s brother GeoFRED at https://geofred.stlouisfed.org/. The homepage has a series of sample thematic maps for US counties and states and globally for countries. Choose any map, and once it opens you can change the geography and indicator to something else. You can modify the frequency, units, and time periods for many of the indicators, and you have basic options for customizing the map (colors, labels, legend, etc.). The maps are interactive, so you can zoom in and out and click on a place to see its data value. Most of the county-level data comes from the Census Bureau, but as you move up to states or metro areas the number of indicators and sources increases. For example, the map below shows individual income taxes collected per capita by state in 2018.

GeoFRED - State Income Tax

There’s a basic search function for finding specific indicators. Just like the charts, maps can be downloaded as static images, shared and embedded in websites, and you can download the data behind the map (it’s simpler to download the same indicator for multiple geographies using GeoFRED compared to FRED).

Take a few minutes and check it out. For insights and analyses of data published via FRED, visit FRED’s blog at https://fredblog.stlouisfed.org/.

Census Workshop Recap

I’ve been swamped these past few months, revising my census book, teaching a spatial database course, and keeping the GIS Lab running. Thus, this will be a shorter post!

Last week I taught a workshop on understanding, finding, and accessing US Census Data at the Metropolitan Library Council of New York. If you couldn’t make it, here are the presentation slides and the group exercise questions.

Most of the participants were librarians who were interested in learning how to help patrons find and understand census data, but there were also some data analysts in the crowd. We began with an overview of how the census is structured by dataset, geography, and subject categories. I always cover the differences between the decennial census and the ACS, with a focus on how to interpret ACS estimates and gauge their reliability.

For workshops I think it’s best to start with searching for profiles (lots of different data for one place). This gives new users a good overview of the breadth and depth of the types of variables that are available in the census. Since this was a New York City-centric crowd we looked at the City’s excellent NYC Population Factfinder first. The participants formed small groups and searched through the application to answer a series of fact-finding questions that I typically receive. Beyond familiarizing themselves with the applications and data, the exercises also helped to spark additional questions about how the census is structured and organized.

Then we switched over to the Missouri Census Data Center’s profile and trends applications (listed on the right hand side of their homepage) to look up data for other parts of the country, and in doing so we were able to discuss the different census geographies that are available for different places. Everyone appreciated the simple and easy to use interface and the accessible tables and graphics. The MCDC doesn’t have a map-based search, so I did a brief demo of TIGERweb for viewing census geography across the country.

Once everyone had this basic exposure, we hopped into the American Factfinder to search for comparison tables (a few pieces of data for many places). We discussed how census data is structured in tables and what the differences between the profile, summary, and detailed tables are. We used the advanced search and I introduced my tried and true method of filtering by dataset, geography, and topic to find what we need. I mentioned the Census Reporter as a good place to go for ACS documentation, and as an alternate source of data. Part of my theme was that there are many tools that are suitable for different needs and skill levels, and you can pick your favorite or determine what’s suitable for a particular purpose.

We took a follow-the-leader approach for the AFF, where I stepped through the website and the process for downloading two tables and importing them into a spreadsheet, highlighting gotchas along the way. We did some basic formulas for aggregating ACS estimates to create new margins of error, and a VLOOKUP for tying data from two tables together.

We wrapped up the morning with a foreshadowing of what’s to come with the new data.census.gov (which will replace the AFF) and the 2020 census. While there’s still much uncertainty around the citizenship question and fears of an undercount, the structure of the dataset won’t be too different from 2010 and the timeline for release should be similar.

iceland_placename

Place Names: Comparing Two Global Gazetteers

Gazetteers are directories of place names and locations, which are useful for:

  1. Identifying variations in place names
  2. Obtaining coordinates
  3. Locating a place within a hierarchy of places
  4. Generating lists of types of features

For example, if you’re working with data that’s associated with specific cities, mountains, or bodies of water, and you have the names of these features but not the coordinates or the country or state / province where they’re located, you can use a gazetteer to obtain all three. Or, if you want to create a map of a specific type of feature (e.g. populated places, ruins, mines) or want map labels for features (forests, bodies of water) you can extract and plot the gazetteer data in GIS.

In this post I’ll provide an overview of two major global gazetteers: the GEOnet Names Server and Geonames. Each one provides several different interfaces and services for exploring and accessing data which I’ll briefly mention, but I’ll focus on the data files that you can download and what’s contained in them. I’ll conclude with a strategy for relating a small to medium place-based data file of your own to the gazetteer to obtain coordinates. If you have a file with hundreds or a few thousand records and were planning to get coordinates by eyeballing Google Maps and clicking one by one, try this instead.

NGA GNS

File Downloads | Documentation and code book

The US National Geospatial-Intelligence Agency (NGA) maintains a vast gazetteer with data for all of the countries in the world (almost) and provides it to the public via the GEOnet Names Server (GNS). The GNS gazetteer does NOT include features in the United States or any of its territories; the US Geological Survey maintains a separate system called the Geographic Names Information System (GNIS) whose structure and organization is different.

The GNS is updated on a weekly basis and is provided through a number of interfaces that include a map-based and a text-based search, and Web Mapping (WMS) and Web Feature (WFS) Services that allow you to display data in a GIS or a web map.

Data files are packaged on a country by country basis. Alternatively you can download one file that has the whole world in it, or an archive with separate files for each country. The data is stored in tab-delimited text files that include a header row (i.e. the column names). ZIP files for each country include a primary file that contains all the country’s features, and a series of files that contain a subset of the primary file based on feature type. So, if you wanted to work with just populated places or with hydrographic features you can work with the specific file instead of having to filter them out of the primary one.

Each record in the GNS represents a name for a feature, as opposed to a feature itself. Thus, if a feature is known by more than one name it will appear multiple times in the file. Each record has a unique feature identifier (UFI) and a unique name identifier (UNI) which are large integers. The UFI number is repeated in the data, while the UNI is unique. The GNS files contain a number of different columns containing several feature names (short names, long ones, with and without diacritics) and a name type column (NT) that indicates whether the record is for an approved (N) or variant (V) name. If you want a list of features without duplicates, you would need to create a subset of the records that only includes the approved name.
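As a minimal sketch of that subsetting step, the Python below keeps only the approved-name records from a tab-delimited GNS-style file. The three sample records, their ID values, and the FULL_NAME column label are invented for illustration; check the code book for the actual name columns in your download.

```python
import csv
import io

# Hypothetical excerpt in the GNS layout: tab-delimited text with a header row.
# The ID values and the FULL_NAME column label are invented for this example.
SAMPLE = (
    "UFI\tUNI\tNT\tFULL_NAME\n"
    "1001\t5001\tN\tSanta Elena\n"
    "1001\t5002\tV\tSaint Helena\n"
    "1002\t5003\tN\tBelize City\n"
)


def approved_names(text):
    """Return only the records whose name type (NT) is N, the approved name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row for row in reader if row["NT"] == "N"]


features = approved_names(SAMPLE)
ufis = [row["UFI"] for row in features]
print(len(features), "approved names for", len(set(ufis)), "features")
```

For a real country file you would pass the contents of the downloaded text file (opened with UTF-8 encoding) instead of the sample string.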

Features are classified into nine broad classes (FC), which in turn are subdivided into many different designations (DSG). The nine classes are: administrative region, populated place, vegetation, locality or area, undersea, roads and railroads, hypsographic (terrain), hydrographic (water), and spot (point-based features). Additional columns include codes designating the size of a populated place (PC) and relative importance of the feature (DISPLAY) which is useful when mapping data at varying scales. The GNS does not contain information on actual population or elevation (this was included in the past but is no longer available).

The GNS includes a few geographic references that indicate where the feature is located. There is a global region code (RC) in the first column, a primary country code (cc1) and an administrative division (state or province) code for the primary country, and a secondary country code (cc2). Geographic features like rivers, seas, mountains, and forests may span the boundary of more than one country, so the cc1 and cc2 columns indicate this. Data in these fields may be stored as a comma-separated list or array with the different codes. The GNS uses two-letter FIPS 10-4 country codes created by the US government.

Country codes in the GNS

This SQL query illustrates how country and admin1 codes are stored in the GNS, and how some features (streams in this case) span several countries.
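If you are working in a script rather than a database, a comma-separated country field can be split the same way. This is a small Python sketch; the sample values are placeholders (BH is the FIPS code for Belize mentioned in this post, GT is simply a second code used for illustration), and which column holds the list should be confirmed in the code book.

```python
def country_codes(field_value):
    """Split a comma-separated country code field into a list of codes."""
    return [code.strip() for code in field_value.split(",") if code.strip()]


def spans_borders(field_value):
    """True when a feature lists more than one country."""
    return len(country_codes(field_value)) > 1


# Placeholder values, not taken from an actual GNS file
single = "BH"
shared = "BH,GT"
print(country_codes(shared), spans_borders(single), spans_borders(shared))
```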

Lastly, longitude and latitude coordinates are provided in separate fields in two formats: decimal degrees (needed for plotting and mapping) and degrees-minutes-seconds. The coordinates are in the WGS 84 CRS (EPSG 4326).

Geonames

File Downloads | Documentation and code book

Geonames is the Wikipedia or OpenStreetMap of gazetteers. It’s a collaborative, crowd-sourced project. Many users may contribute a few locations or make a correction or two, but by and large most of the data comes from public or government sources and is loaded into Geonames en masse and subsequently modified. Geonames provides a text and map-based search, and an API that lets scripters and programmers directly access the data.

Data files are packaged country by country, or globally by certain types (e.g. all countries or the largest cities). The data is stored in tab-delimited text files without a header row, so you need to consult the documentation to identify the columns. All data for each country is packaged in a single file.

Unlike the GNS, each Geonames record represents a specific feature. There is a conventional name (name) and a variant that uses plain ascii characters (asciiname). Some variant names are included in a single list / array column called alternatenames; to get a full list of variants and spellings in different languages you would download a separate alternate names file that you could link to this one. Each feature is assigned a geonameid, which is simply a large unique integer.

Features are divided into the same nine classes that are used in the GNS, and the subdivisions are the same as well. Documentation for the classes and subdivisions is provided. Population and elevation data is provided when available and relevant, but there’s no information on timeliness or source in the data file (but you can view the full edit history for a record in the online interface).

Geonames goes to great lengths to provide the geographic framework or hierarchy for each feature, so you can get instant geographic context. They use two-letter ISO country codes to designate countries (country_code), a list of alternate or secondary countries (cc2), and for the primary country up to four different levels of administrative divisions (e.g. state / province, county, municipality, etc). There’s also a field that indicates what timezone each feature is in.

There is one set of longitude and latitude coordinates in decimal degrees in the WGS84 CRS.
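Because the Geonames country files have no header row, you have to supply the column names yourself when reading them. The sketch below shows one way to do that in Python. The column list reflects my reading of the Geonames documentation and should be checked against the current readme; the single sample record uses invented placeholder values.

```python
import csv
import io

# Column names as I understand the Geonames readme to list them (an assumption:
# verify the names and their order against the current documentation).
GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1_code",
    "admin2_code", "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]


def read_geonames(handle):
    """Read a headerless, tab-delimited Geonames file into a list of dicts."""
    reader = csv.DictReader(
        handle, fieldnames=GEONAMES_COLUMNS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


# One invented record with placeholder values, just to exercise the reader
SAMPLE = "\t".join([
    "1", "Belize City", "Belize City", "", "17.5", "-88.2", "P", "PPL", "BZ",
    "", "02", "", "", "", "0", "", "0", "America/Belize", "2020-01-01",
]) + "\n"

rows = read_geonames(io.StringIO(SAMPLE))
print(rows[0]["name"], rows[0]["latitude"], rows[0]["longitude"])
```

With a downloaded file you would open it with `encoding="utf-8"` and `newline=""` and pass the file handle to `read_geonames` in place of the `StringIO` object.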

Geonames Belize City

Geonames search result for Belize City, illustrating options and available data.

Summary Comparison

To compare the different files I downloaded data for Belize, since it has a small number of records. The GNS file had 2,801 records for names, but if you look at unique features the record count was 2,180. The Geonames file for Belize has a comparable number of 2,309.

Commonalities

  • Free and publicly available
  • Tab-delimited text in country-based files
  • Longitude and latitude coordinates in decimal degrees in WGS84
  • Same feature classification system with nine classes and multiple sub-classes

GNS

  • A single, official government source
  • A file of feature names: must filter out variants to get unique feature records
  • File comes with column header
  • Files are divided into sub-files for feature classes
  • Uses FIPS codes for countries
  • Useful fields for ranking features for mapping
  • Limited data on geographic hierarchy
  • No data on population or elevation
  • Lacks data for the United States and territories (obtainable via the USGS GNIS)

Geonames

  • Collaborative project with data from many sources
  • A file of features, variant names included in separate column
  • Additional alternate names and spellings in most languages available in separate files
  • File lacks column header
  • Uses ISO codes for countries
  • Extensive information on geographic hierarchy
  • Has population, elevation, and timezone for certain features
  • No ranking columns for map display

Gazetteer Caveats

1. It’s important to recognize that each source uses different codes for classifying countries: the GNS uses FIPS and Geonames uses ISO. While they appear similar (two-letter abbreviations) they are NOT the same: The FIPS code for Belize is BH and the ISO Code is BZ; in the ISO system BH is for Bahrain while the FIPS system doesn’t use BZ as a code. The CIA World Factbook includes a table comparing different country code systems. The GNS will convert to ISO at some uncertain date in the future.

2. Gazetteer data must be imported using UTF-8 encoding to preserve all the characters from the various alphabets.

3. Each feature in a gazetteer will have longitude and latitude coordinates that represent the geographic center of a feature. That means that a large areal feature like a country, a linear feature like a road, and a small point feature like a monument will have one coordinate pair. The coordinates for the monument will be pretty precise, while the set for the road and country are broad generalizations. Long linear features like roads and rivers may appear in the datasets several times as distinct feature records at different points. While it’s possible to get bounding box coordinates from Geonames, this data is not included in the downloadable country files.

4. A place name may appear multiple times in a gazetteer because names are not unique. Several different places of the same type may have the same name, and several features of different types may have the same name. For example, the Geonames file for Belize has four places named Santa Elena; two are populated places in different parts of the country while the other two are spot features (a camp and an estate) that are located near each of the populated places. The GNS file has even more records for this place, some with the approved name Santa Elena and others with the variant Saint Helena.

GNS Names and Variants

GNS records for Santa Elena, Belize. Notice the UFI is duplicated for features that have multiple names while the UNI is unique. The NT field indicates approved names (N) versus other types like variants (V). Records are for a mix of populated places (P, PPL) and spot (S) types of various kinds (ancient site, campground, and estate).

For all these reasons, it rarely makes sense to use the files in their entirety for obtaining names and coordinates or plotting places. You’ll want to extract data just for the types of features that you need. If you’re trying to match a list of place names to the gazetteer you’ll need to ensure that you’re matching the right name to the right place. You can use the feature classes and the administrative divisions of the country to narrow down the location, and when in doubt use the gazetteer map interfaces to locate a specific place.
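To illustrate that narrowing step, here is a short Python sketch that filters candidate records by name, then by feature class and first-level administrative division. The four Santa Elena records mirror the situation described above, but the IDs and admin codes are invented, and the field names are assumptions about how you loaded the file.

```python
def narrow(records, name, feature_class=None, admin1=None):
    """Filter gazetteer records by name, then optionally by class and admin1 code."""
    matches = [r for r in records if r["name"] == name]
    if feature_class is not None:
        matches = [r for r in matches if r["feature_class"] == feature_class]
    if admin1 is not None:
        matches = [r for r in matches if r["admin1_code"] == admin1]
    return matches


# Invented IDs and admin codes: two populated places (P) and two spot features (S)
records = [
    {"id": "a", "name": "Santa Elena", "feature_class": "P", "admin1_code": "02"},
    {"id": "b", "name": "Santa Elena", "feature_class": "P", "admin1_code": "04"},
    {"id": "c", "name": "Santa Elena", "feature_class": "S", "admin1_code": "02"},
    {"id": "d", "name": "Santa Elena", "feature_class": "S", "admin1_code": "04"},
]

by_name = narrow(records, "Santa Elena")
towns = narrow(records, "Santa Elena", feature_class="P")
one_town = narrow(records, "Santa Elena", feature_class="P", admin1="02")
print(len(by_name), len(towns), len(one_town))
```

If more than one candidate remains after filtering, that is the cue to check the gazetteer's map interface rather than guess.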

Matching Your Own Data to a Gazetteer

Winnow down the gazetteer file to just the features you need. Make sure that all the place names in your own data file are standardized so you don’t have variant spellings for the same place. In your data add a column for a unique identifier at the beginning of the sheet. Locate each place in your file in the gazetteer, then copy the unique ID from that file into your sheet. Then, if you’re using a spreadsheet you can use the VLOOKUP formula to use the ID from your sheet to pull related data from the gazetteer sheet (the longitude and latitude coordinates, codes for the administrative divisions, etc). This saves you a lot of copying and pasting. Similarly, if you were using a relational database you can write a JOIN statement to tie the two tables together using the ID.
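For readers who script rather than use spreadsheets, the same ID-based lookup can be sketched in Python with a dictionary keyed on the gazetteer ID. All of the IDs, field names, and coordinate values below are invented placeholders; the logic is the point, not the data.

```python
# Gazetteer records keyed on their unique ID (IDs and coordinates are invented)
gazetteer = {
    "1001": {"name": "Santa Elena", "latitude": 17.1, "longitude": -89.0},
    "1002": {"name": "Belize City", "latitude": 17.5, "longitude": -88.2},
}

# Your own data, with the gazetteer ID copied into a gaz_id column (hypothetical name)
my_rows = [
    {"gaz_id": "1002", "site": "Belize City", "interviews": 12},
    {"gaz_id": "1001", "site": "Santa Elena", "interviews": 5},
    {"gaz_id": "9999", "site": "Unknown", "interviews": 1},
]


def attach_coordinates(rows, gazetteer, id_field="gaz_id"):
    """Copy latitude and longitude onto each row, like a VLOOKUP or a LEFT JOIN."""
    joined = []
    for row in rows:
        match = gazetteer.get(row[id_field])
        merged = dict(row)
        merged["latitude"] = match["latitude"] if match else None
        merged["longitude"] = match["longitude"] if match else None
        joined.append(merged)
    return joined


result = attach_coordinates(my_rows, gazetteer)
for row in result:
    print(row["site"], row["latitude"], row["longitude"])
```

Rows whose ID is missing from the gazetteer keep empty coordinates instead of being dropped, which makes it easy to spot places that still need attention.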

This approach saves you the time of manually clicking on Google Maps or OSM to look up coordinates for a place and transcribing them, and you get the added benefit of grabbing any extra useful information the gazetteer provides. If you haven’t started the process of gathering your own data, start with the gazetteer file: winnow it down and append your own data to it as your research progresses.

But what if you have tons of coordinates to retrieve? A VLOOKUP or JOIN based on the name alone will be imprecise: there may be more than one place with the same name, and you’ll have no way of knowing whether you selected the right one. You could modify your own data and the data in the gazetteer by concatenating administrative codes to the place name (e.g. St. Elena, 02) to make the name more precise and increase the chances of an accurate join. This approach requires you to be familiar with the administrative subdivisions in the areas you’re researching.
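A rough sketch of the concatenated-key idea in Python follows. It builds a key from the place name plus an administrative code and only accepts a match when exactly one gazetteer record has that key. The records, field names, and codes are invented for illustration.

```python
from collections import defaultdict


def make_key(name, admin1):
    """Combine a normalized place name with its admin code, e.g. 'santa elena|02'."""
    return f"{name.strip().lower()}|{admin1.strip()}"


def build_index(gazetteer_rows):
    """Group gazetteer records by their name-plus-admin key."""
    index = defaultdict(list)
    for row in gazetteer_rows:
        index[make_key(row["name"], row["admin1_code"])].append(row)
    return index


def match(name, admin1, index):
    """Return the single matching record, or None if the key is missing or ambiguous."""
    candidates = index.get(make_key(name, admin1), [])
    return candidates[0] if len(candidates) == 1 else None


# Invented records: the last two share both a name and an admin code
gazetteer_rows = [
    {"id": "a", "name": "Santa Elena", "admin1_code": "02"},
    {"id": "b", "name": "Santa Elena", "admin1_code": "04"},
    {"id": "c", "name": "San Pedro", "admin1_code": "01"},
    {"id": "d", "name": "San Pedro", "admin1_code": "01"},
]

index = build_index(gazetteer_rows)
print(match("Santa Elena", "02", index))
print(match("San Pedro", "01", index))
```

Returning nothing for ambiguous keys is a deliberate choice here: it is safer to review those places by hand than to accept the first record silently.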

If you were trying to identify coordinates for tens of thousands of towns, cities, and larger administrative divisions you could try using a geocoder instead of a gazetteer. Geocoders are designed primarily for obtaining coordinates for addresses, but if an exact match can’t be found many will return coordinates for the smallest possible area that’s part of the address. If you provided a list of cities that also include a state / province and country, you could obtain the coordinates for just the city.

A final alternative, where you can get a wider range of features in bulk in a geospatial format, is OpenStreetMap. I’ll return to this in a future post, but there’s an excellent OSM – QGIS tutorial that can help get you started.

Interested in learning more? If you’re in the spatial sciences or digital humanities check out this book: Placing Names: Enriching and Integrating Gazetteers.

LISA map of Broad Band Subscription by Household

Mapping US Census Data on Internet Access

ACS Data on Computers and the Internet

The Census Bureau recently released the latest five-year period estimates from the American Community Survey (ACS), with averages covering the years from 2013 to 2017.

Back in 2013 the Bureau added new questions to the ACS on computer and internet use: does a household have a computer or not, and if yes what type (desktop or laptop, smartphone, tablet, or other), and does a household have an internet subscription or not, and if so what kind (dial-up, broadband, and type of broadband). 1-year averages for geographies with 65,000 people or more have been published since 2013, but now that five years have passed there is enough data to publish reliable 5-year averages for all geographies down to the census tract level. So with this 2013-2017 release we have complete coverage for computer and internet variables for all counties, ZCTAs, places (cities and towns), and census tracts for the first time.

Summaries of this data are published in table S2801, Types of Computers and Internet Subscriptions. Detailed tables are numbered B28001 through B28010 and are cross-tabulated with each other (presence of computer and type of internet subscription) and by age, educational attainment, labor force status, and race. You can access them all via the American Factfinder or the Census API, or from third-party sites like the Census Reporter. The basic non-cross-tabbed variables have also been incorporated into the Census Bureau’s Social Data Profile table DP02, and in the MCDC Social profile.

The Census Bureau issued a press release that discusses trends for median income, poverty rates, and computer and internet use (addressed separately) and created maps of broadband subscription rates by county (I’ve inserted one below). According to their analysis, counties that were mostly urban had higher average rates of access to broadband internet (75% of all households) relative to mostly rural counties (65%) and completely rural counties (63%). Approximately 88% of all counties that had subscription rates below 60% were mostly or completely rural.

Figure 1. Percentage of Households With Subscription to Any Broadband Service: 2013-2017 [Source: U.S. Census Bureau]

Not surprisingly, counties with lower median incomes were also associated with lower rates of subscription. Urban counties with median incomes above $50,000 had an average subscription rate of 80% compared to 71% for completely rural counties. Mostly urban counties with median incomes below $50k had average subscription rates of 70% while completely rural counties had an average rate of 62%. In short, wealthier rural counties have rates similar to less wealthy urban counties, while less wealthy rural areas have the lowest rates of all. There also appear to be some regional clusters of high and low broadband subscriptions. Counties within major metro areas stand out as clusters with higher rates of subscription, while large swaths of the South have low rates of subscription.

Using GeoDa to Identify Broadband Clusters

I was helping a student recently with making LISA maps in GeoDa, so I quickly ran the data (percentage of households with subscription to any broadband service) through to see if there were statistically significant clusters. It’s been a couple years since I’ve used GeoDa and this version (1.12) is significantly more robust than the one I remember. It focuses on spatial statistics but has several additional applications to support basic data mapping and stats. The interface is more polished and the software can import and export a number of different vector and tabular file formats.

The Univariate Local Moran’s I analysis, also known as LISA for local indicators of spatial auto-correlation, identifies statistically significant geographic clusters of a particular variable. Once you have a polygon shapefile or geopackage with the attribute you want to study, you add it to GeoDa and then create a weights file (Tools menu) using the unique identifier for the shapes. The weights file indicates how individual polygons neighbor each other: queens contiguity classifies features as neighbors as long as they share a single node, while rooks contiguity classifies them as neighbors if they share an edge (at least two points that can form a line).
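The difference between the two contiguity rules can be sketched with a toy example in Python. This is not how GeoDa builds its weights file; it simply counts shared vertices between three unit squares, which is a simplification that only works when neighboring polygons use identical vertex coordinates along their common boundary.

```python
from itertools import combinations


def build_weights(polygons, rule="queen"):
    """Return a dict of neighbor sets; queen needs 1 shared vertex, rook needs 2."""
    needed = 1 if rule == "queen" else 2
    weights = {key: set() for key in polygons}
    for a, b in combinations(polygons, 2):
        shared = set(polygons[a]) & set(polygons[b])
        if len(shared) >= needed:
            weights[a].add(b)
            weights[b].add(a)
    return weights


# Three unit squares: B sits to the right of A, C sits diagonally above A (on top of B)
squares = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(1, 1), (2, 1), (2, 2), (1, 2)],
}

queen = build_weights(squares, "queen")
rook = build_weights(squares, "rook")
print("queen:", queen)
print("rook:", rook)
```

A and C touch only at the corner (1, 1), so they are neighbors under the queen rule but not under the rook rule.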

Once you’ve created and saved a weights file you can run the analysis (Space menu). You select the variable that you want to map, and can choose to create a cluster map, scatter plot, and significance map. The analysis generates 999 random permutations of your data and compares them to the actual distribution to evaluate whether clusters are likely the result of random chance, or if they are distinct and significant. Once the map is generated you can right click on it to change the number of permutations, or you can filter by significance level. By default a 95% confidence level is used.
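To give a feel for what a permutation test does, here is a simplified Python sketch of a local Moran statistic with a pseudo p-value for a single unit. It is an illustration under several assumptions, not GeoDa's implementation: weights are row-standardized, the unit's own value is held fixed while the others are shuffled, extremeness is judged by absolute value, and the six values and the chain of neighbors are invented.

```python
import random
import statistics


def local_moran(values, neighbors, i):
    """Local Moran statistic for unit i: its z-score times the mean z-score of its neighbors."""
    if not neighbors[i]:
        raise ValueError("unit has no neighbors (an island) and cannot be evaluated")
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    z = [(v - mean) / sd for v in values]
    lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])
    return z[i] * lag


def pseudo_p_value(values, neighbors, i, permutations=999, seed=0):
    """Share of random rearrangements at least as extreme as the observed statistic."""
    rng = random.Random(seed)
    observed = local_moran(values, neighbors, i)
    others = [v for k, v in enumerate(values) if k != i]
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(others)
        # Put unit i's own value back in its original position
        shuffled = others[:i] + [values[i]] + others[i:]
        if abs(local_moran(shuffled, neighbors, i)) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (permutations + 1)


# Invented data: six units in a row, each neighboring the units beside it
values = [10, 9, 8, 1, 2, 1]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

statistic = local_moran(values, neighbors, 0)
p = pseudo_p_value(values, neighbors, 0)
print(statistic, p)
```

With 999 permutations the smallest possible pseudo p-value is 1 / 1000, which is why increasing the number of permutations allows finer significance filtering.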

The result for the broadband access data is below. The High-High polygons in red are statistically significant clusters of counties that have high percentages of broadband use: the Northeast corridor, much of California, the coastal Pacific Northwest, the Central Rocky Mountains, and certain large metro areas like Atlanta, Chicago, Minneapolis, big cities in Texas, and a few others. There is a relatively equal number of Low-Low counties that are statistically significant clusters of low broadband service. This includes much of the deep South, south Texas, and New Mexico. There are also a small number of outliers. Low-High counties represent statistically significant low values surrounded by higher values. Examples include highly urban counties like Philadelphia, Baltimore City, and Wayne County (Detroit) as well as some rural counties located along the fringe of metro areas. High-Low counties represent significant higher values surrounded by lower values. Examples include urban counties in New Mexico like Santa Fe, Sandoval (Albuquerque), and Otero (Alamogordo), and a number in the deep south. A few counties cannot be evaluated as they are islands (mostly in Hawaii) and thus have no neighbors.

LISA map of Broad Band Subscription by Household

LISA Map of % of Households that have Access to Broadband Internet by County (2013-2017 ACS). 999 permutations, 95% conf interval, queens contiguity

All ACS data is published at a 90% confidence level and margins of error are published for each estimate. Margins of error are typically higher for less populated areas, and for any population group that is small within a given area. I calculated the coefficient of variation for this variable at the county level to measure how precise the estimates are, and used GeoDa to create a quick histogram. The overwhelming majority had CV values below 15, which is regarded as being highly reliable. Only 16 counties had values that ranged from 16 to 24, which puts them in the medium reliability category. If we were dealing with a smaller population (for example, dial-up subscribers) or smaller geographies like ZCTAs or tracts, we would need to be more cautious in analyzing the results, and might have to aggregate smaller populations or areas into larger ones to increase reliability.
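The coefficient of variation can be computed from any ACS estimate and its margin of error. The Python sketch below converts the 90% margin of error to a standard error by dividing by 1.645, then expresses it as a percentage of the estimate. The cutoffs loosely follow the ranges mentioned above, the "low" label for anything higher is my assumption, and the sample estimate is hypothetical.

```python
Z_90 = 1.645  # z-score for the 90% confidence level used for published ACS margins of error


def coefficient_of_variation(estimate, moe):
    """CV as a percentage: standard error divided by the estimate, times 100."""
    standard_error = moe / Z_90
    return standard_error / estimate * 100


def reliability(cv, high_below=15, medium_below=25):
    """Classify a CV value; the cutoffs are adjustable rules of thumb."""
    if cv < high_below:
        return "high"
    if cv < medium_below:
        return "medium"
    return "low"


# Hypothetical estimate of 10,000 households with a margin of error of +/- 1,645
cv = coefficient_of_variation(10_000, 1_645)
print(round(cv, 1), reliability(cv))
```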

Wrap Up

The issue of the digital divide has gained more coverage in the news lately with the exploration of the geography of the “new economy”, and how technology-intensive industries are concentrating in certain major metros while bypassing smaller metros and rural areas. Lack of access to broadband internet and reliable wifi in rural areas and within older inner cities is one of the impediments to future economic growth in these areas.

You can download a shapefile with the data and results of the analysis described in this post.

Washington DC street

Using the ACS to Calculate Daytime Population

I’m in the home stretch for getting the last chapter of the first draft of my census book completed. The next to last chapter of the book provides an overview of a number of derivatives that you can create from census data, and one of them is the daytime population.

There are countless examples of using census data for site selection analysis and for comparing and ranking places for locating new businesses, providing new public services, and generally measuring potential activity or population in a given area. People tend to forget that census data measures people where they live. If you were trying to measure service or business potential for residents, the census is a good source.

Counts of residents are less meaningful if you wanted to gauge how crowded or busy a place was during the day. The population of an area changes during the day as people leave their homes to go to work or school, or go shopping or participate in social activities. Given the sharp divisions in the US between residential, commercial, and industrial uses created by zoning, residential areas empty out during the weekdays as people travel into the other two zones, and then fill up again at night when people return. Some places function as job centers while others serve as bedroom communities, while other places are a mixture of the two.

The Census Bureau provides recommendations for calculating daytime population using a few tables from the American Community Survey (ACS). These tables capture where workers live and work, which is the largest component of the daytime population.

Using these tables from the ACS:

Total resident population:
  B01003: Total Population
Total workers living in area, and workers who lived and worked in same area:
  B08007: Sex of Workers by Place of Work–State and County Level (‘Total:’ line and ‘Worked in county of residence’ line)
  B08008: Sex of Workers by Place of Work–Place Level (‘Total:’ line and ‘Worked in place of residence’ line)
  B08009: Sex of Workers by Place of Work–Minor Civil Division Level (‘Total:’ line and ‘Worked in MCD of residence’ line)
Total workers working in area:
  B08604: Total Workers for Workplace Geography

They propose two different approaches that lead to the same outcome. The simplest approach: add the total resident population to the total number of workers who work in the area, and then subtract the total resident workforce (workers who live in the area but may work inside or outside the area):

Daytime Population = Total Residents + Total Workers in Area - Total Resident Workers

For example, according to the 2017 ACS, Washington DC had an estimated 693,972 residents (from table B01003), 844,345 (+/- 11,107) people who worked in the city (table B08604), and 375,380 (+/- 6,102) workers who lived in the city. We add the total residents and total workers, and subtract the total workers who live in the city. The subtraction allows us to avoid double counting the residents who work in the city (as they are already included in the total resident population) while omitting the residents who work outside the city (who are included in the total resident workers). The result:

693,972 + 844,345 - 375,380 = 1,162,937

And to get the new margin of error:

SQRT(0^2 + 11,107^2 + 6,102^2) = 12,673

So the daytime population of DC is approximately 468,965 people (68%) higher than its resident population. The District has a high number of jobs in the government, non-profit, and education sectors, but has a limited amount of expensive real estate where people can live. In contrast, I did the calculation for Philadelphia and its daytime population is only 7% higher than its resident population. Philadelphia has a much higher proportion of resident workers relative to total workers. Geographically the city is larger than DC and has more affordable real estate, and faces stiffer suburban competition for private sector jobs.
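The DC arithmetic above can be sketched in a few lines of Python. This is only an illustration using the figures quoted in this post; the function names are my own, and the margin of error uses the same root-sum-of-squares approximation shown above, with the resident population's margin treated as zero.

```python
import math

def daytime_population(residents, workers_in_area, resident_workers):
    """Residents plus people who work in the area, minus workers who live there."""
    return residents + workers_in_area - resident_workers

def combined_moe(*moes):
    """Approximate margin of error for a sum or difference of estimates."""
    return math.sqrt(sum(m ** 2 for m in moes))

# 2017 ACS figures for Washington DC, as quoted in the text
residents = 693_972         # B01003, margin of error treated as 0
workers_in_area = 844_345   # B08604, +/- 11,107
resident_workers = 375_380  # workers living in the city, +/- 6,102

daytime = daytime_population(residents, workers_in_area, resident_workers)
moe = combined_moe(0, 11_107, 6_102)
difference = daytime - residents
pct_change = difference / residents * 100

print(f"Daytime population: {daytime:,} (+/- {moe:,.0f})")
print(f"Difference from residents: {difference:,} ({pct_change:.0f}%)")
```

Swapping in the three estimates and their margins for another county, place, or MCD should reproduce the same calculation for that area.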

The variables in the tables mentioned above are also cross-tabulated in other tables by age, sex, race, Hispanic origin, citizenship status, language, poverty, and tenure, so it’s possible to estimate some characteristics of the daytime population. Margins of error will limit the usefulness of estimates for small population groups, and overall the 5-year period estimates are a better choice for all but the largest areas. Data for workers living in an area who lived and worked in the same area is reported for states, counties, places (incorporated cities and towns), and minor civil divisions (MCDs) for the states that have them.

Data for the total resident workforce is available for other, smaller geographies, but place of work is still reported in terms of those larger areas: we know how many people in a census tract live and work in their county or place of residence, but not how many live and work in their tract of residence. In contrast, data on the number of workers from B08604 is not available for smaller geographies, which limits the application of this method to larger areas.

Download or explore these ACS tables from your favorite source: American FactFinder, Census Reporter, or the Missouri Census Data Center.

Net Out-Migration from the NY Metro Area to Other Metro Areas 2011-2015

Recent Migration Trends for New York City and Metro

The Baruch GIS lab crew just published a paper: New Yorkers on the Move: Recent Migration Trends for the City and Metro Area. The paper (no. 15, Feb 2018) is part of the Weissman Center for International Business Occasional Paper Series, which focuses on New York City’s role in the international and domestic economy.

Findings

We analyzed recent population trends (2010 to 2016) in New York City and the greater metropolitan area using the US Census Bureau’s Population Estimates to study components of population change (births, deaths, domestic and international migration) and the IRS Statistics of Income division’s county to county migration data to study domestic migration flows.

Here are the main findings:

  1. The population of New York City and the New York Metropolitan Area increased significantly between 2010 and 2016, but annual growth has slowed due to greater domestic out-migration.
  2. Compared to other large US cities and metro areas, New York’s population growth depends heavily on foreign immigration and natural increase (the difference between births and deaths) to offset losses from domestic out-migration.
  3. Between 2011 and 2015 the city had few relationships where it was a net receiver of migrants (receiving more migrants than it sends) from other large counties. The New York metro area had no net-receiver relationships with any major metropolitan area.
  4. The city was a net sender (sending more migrants than it received) to all of its surrounding suburban counties and to a number of large urban counties across the US. The metro area was a net sender to metropolitan areas throughout the country.

For the domestic migration portion of the analysis we were interested in seeing the net flows between places. For example, the NYC metro area sends migrants to and receives migrants from the Miami metro. What is the net balance between the two – who receives more versus who sends more?

The answer is: the NYC metro is a net sender to most of the major metropolitan areas in the country, and has no significant net receiver relationships with any other major metropolitan area. For example, for the period from 2011 to 2015 the NYC metro’s largest net sender relationship was with the Miami metro. About 88,000 people left the NYC metro for metro Miami while 58,000 people moved in the opposite direction, resulting in a net gain of 30,000 people for Miami (or in other words, a net loss of 30k people for NYC). The chart below shows the top twenty metros where the NYC metro had a deficit in migration (sending more migrants to these areas than it received). A map of net out-migration from the NYC metro to other metros appears at the top of this post. In contrast, NYC’s largest net receiver relationship (where the NYC metro received more migrants than it sent) was with Ithaca, New York, which lost a mere 300 people to the NYC metro.
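The net balance between two places is just the difference between what one receives and what it sends. A minimal sketch using the rounded NYC–Miami figures quoted above (the function and variable names are mine, chosen for illustration):

```python
def net_flow(sent, received):
    """Net migration from the origin's point of view; negative means a net loss."""
    return received - sent

def describe(balance):
    """Label a balance as net sender, net receiver, or balanced."""
    if balance < 0:
        return "net sender"
    if balance > 0:
        return "net receiver"
    return "balanced"

# NYC metro and Miami metro, 2011 to 2015, rounded figures from the text
sent_to_miami = 88_000
received_from_miami = 58_000

balance = net_flow(sent_to_miami, received_from_miami)
print(f"NYC metro balance with Miami: {balance:,} ({describe(balance)})")
```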

All of our summary data is available here.

domestic migration to NYMA 2011-2015: top 20 deficit metro areas

Process

For the IRS data we used the county to county migration SQLite database that Janine meticulously constructed over the course of the last year, which is freely available on the Baruch Geoportal. Anastasia employed her Python and Pandas wizardry to create Jupyter notebooks that we used for doing our analysis and generating our charts, all of which are available on GitHub. I used an alternate approach with Python and the SQLite and prettytable modules to generate estimates independently of Anastasia, so we could compare the two and verify our numbers (we were aggregating migration flows across years and geographies from several tables, and calculating net flows between places).
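To give a rough idea of what aggregating flows across years and calculating net flows looks like with Python's built-in sqlite3 module, here is a small sketch. It is not our actual script: the `flows` table, its columns, the places A, B, and C, and every number in it are placeholders invented for illustration, and the real database is organized differently (it spreads the flows across several tables).

```python
import sqlite3

# In-memory database with a hypothetical single-table layout
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE flows (year INTEGER, origin TEXT, destination TEXT, migrants INTEGER)"
)

# Placeholder values only, NOT real IRS migration data
rows = [
    (2011, "A", "B", 120), (2011, "B", "A", 100),
    (2012, "A", "B", 130), (2012, "B", "A", 90),
    (2011, "A", "C", 40), (2011, "C", "A", 55),
    (2012, "A", "C", 45), (2012, "C", "A", 50),
]
con.executemany("INSERT INTO flows VALUES (?, ?, ?, ?)", rows)

# Sum outflows and inflows for one place across all years, then take the difference
query = """
SELECT o.destination AS partner,
       o.sent,
       i.received,
       i.received - o.sent AS net
FROM (SELECT destination, SUM(migrants) AS sent
      FROM flows WHERE origin = ? GROUP BY destination) AS o
JOIN (SELECT origin, SUM(migrants) AS received
      FROM flows WHERE destination = ? GROUP BY origin) AS i
  ON o.destination = i.origin
ORDER BY net
"""
net_flows = con.execute(query, ("A", "A")).fetchall()

for partner, sent, received, net in net_flows:
    print(partner, sent, received, net)
```

Sorting by the net column puts the largest deficits first, which is the ordering you would want for a chart of top deficit areas.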

One of our goals for this project was to use modern tools and avoid the clunky use of email. With the Jupyter notebooks, git and GitHub for storing and syncing our work, and ShareLaTeX for writing the paper, we avoided using email for constantly exchanging revised versions of scripts and papers. Ultimately I had to use latex2rtf to convert the paper to a word processing format that the publisher could use. This post helped me figure out which bibliography packages to choose (in order for latex2rtf to interpret citations and references, you need to use the older natbib & bibtex combo and not biblatex & biber).

If you are doing similar research, Zillow has an excellent post that discusses the merits of the different datasets. There are also good case studies on Washington DC and Philadelphia that employ the same datasets.