Natural increase: the difference between births and deaths
Domestic migration: moves between two points within the United States
Foreign migration: moves between the United States and a US territory or foreign country
Net migration: the difference between in-migration and out-migration (measured separately for domestic and foreign)
NYC: the five counties / boroughs that comprise New York City
NYMA: the New York Metropolitan Area as defined by the Office of Management and Budget in Sept 2018, consists of 10 counties in NY State (including the 5 NYC counties), 12 in New Jersey, and one in Pennsylvania
Population growth in both NYC and the NYMA was driven by positive net foreign migration and natural increase, which offset negative net domestic migration.
Population growth for both NYC and the NYMA was strong over the first half of the decade, but population growth slowed as domestic out-migration increased from 2011 to 2017.
NYC and the NYMA began experiencing population loss from 2017 forward, as both foreign migration and natural increase began to decelerate. Declines in foreign migration are part of a national trend; between 2016 and 2019 net foreign migration for the US fell by 43% (from 1.05 million to 595 thousand).
The city and metro’s experience fit within national trends. Most of the top counties in the US that are home to the largest cities and many of the largest metropolitan areas experienced slower population growth over the decade. In addition to NYC, three counties: Cook (Chicago), Los Angeles, and Santa Clara (San Jose) experienced actual population loss towards the decade’s end. The New York, Los Angeles, and Chicago metro areas also had declining populations by the latter half of the decade.
Most of NYC’s domestic out-migrants moved to suburban counties within the NYMA (representing 38% of outflows and 44% of net out-migration), and to Los Angeles County, Philadelphia County, and counties in Florida. Out-migrants from the NYMA moved to other large metros across the country, as well as smaller, neighboring metros like Poughkeepsie NY, Fairfield CT, and Trenton NJ. Metro Miami and Philadelphia were the largest sources of both in-migrants and out-migrants.
NYC and the NYMA lack any significant relationships with other counties and metro areas where they are net receivers of domestic migrants, receiving more migrants from those places than they send to those places.
NYC and the NYMA are similar to the cities and metros of Los Angeles and Chicago, in that they rely on high levels foreign migration and natural increase to offset high levels of negative domestic migration, and have few substantive relationships where they are net receivers of domestic migrants. Academic research suggests that the absolute largest cities and metros behave this way; attracting both low and high skilled foreign migrants while redistributing middle and working class domestic migrants to suburban areas and smaller metros. This pattern of positive foreign migration offsetting negative domestic migration has characterized population trends in NYC for many decades.
During the 2010s, most of the City and Metro’s foreign migrants came from Latin America and Asia. Compared to the US as a whole, NYC and the NYMA have slightly higher levels of Latin American and European migrants and slightly lower levels of Asian and African migrants.
Given the Census Bureau’s usual residency concept and the overlap in the onset the of COVID-19 pandemic lock down with the 2020 Census, in theory the pandemic should not alter how most New Yorkers identify their usual residence as of April 1, 2020. In practice, the pandemic has been highly disruptive to the census-taking process, which raises the risk of an under count.
The impact of COVID-19 on future domestic migration is difficult to gauge. Many of the pandemic destinations cited in recent cell phone (NYT and WSJ) and mail forwarding (NYT) studies mirror the destinations that New Yorkers have moved to between 2011 and 2018. Foreign migration will undoubtedly decline in the immediate future given pandemic disruptions, border closures, and restrictive immigration policies. The number of COVID-19 deaths will certainly push down natural increase for 2020.
In this post I’ll demonstrate how I created annotated heatmaps (or what I’m calling a rank change grid) showing change in rank over time using Python and Matplotlib’s imshow plots. I was writing a report on population trends and internal migration using the IRS county to county migration dataset, and wanted to depict the top origins and destinations of migrants for New York City and the New York Metropolitan Area and how they changed from year to year.
I hit upon this idea based on an example in the Matplotlib documentation using the imshow plot. Imshow was designed for manipulating and creating images, but since images are composed of rows and columns of pixels you can use this function to create grids (for GIS folks, think of a raster). The rows can indicate rank from 1 to N, while the columns could represent time, which in my case is years. I could label each grid cell with the name of a place (i.e. origin or destination), and if a place changes ranks over time I could assign the cell a color indicating increase or decrease; otherwise I’d assign a neutral color indicating no change. The idea is that you could look at place at a given rank in year 1 and follow it across the chart by looking at the label. If a new place appears in a given position, the color change clues you in, and you can quickly scan to see whether a given place went up or down.
The image below shows change in rank for the top metro area destinations for migrants leaving the NYC metro from 2011 to 2018. You can see that metro Miami was the top destination for several years, up until 2016-17 when it flips positions with metro Philadelphia, which had been the number 2 destination. The sudden switch from a neutral color indicates that the place occupying this rank is new. You can also follow how 3rd ranked Bridgeport falls to 4th place in the 2nd year (displaced by Los Angeles), remains in 4th place for a few years, and then falls to 5th place (again bumped by Los Angeles, which falls from 3rd to 4th as it’s bumped by Poughkeepsie).
I opted for this over a more traditional approach called a bump chart (also referred to a slope chart or graph), with time on the x-axis and ranks on the y-axis, and observations labeled at either the first or last point in time. Each observation is assigned a specific color or symbol, and lines connect each observation to its changing position in rank so you can follow it along the chart. Interpreting these charts can be challenging; if there are frequent changes in rank the whole thing begins to look like spaghetti, and the more observations you have the tougher it gets to interpret. Most of the examples I found depicted a small and finite number of observations. I have hundreds of observations and only want to see the top ten, and if observations fall in and out of the top N ranks you get several discontinuous lines which look odd. Lastly, neither Matplotlib or Pandas have a default function for creating bump charts, although I found a few examples where you could create your own.
Creating the rank change grids was a three-part process that required: taking the existing data and transforming it into an array of the top or bottom N values that you want to show, using that array to generate an array that shows change in ranks over time, and generating a plot using both arrays, one for the value and the other for the labels. I’ll tackle each piece in this post. I’ve embedded the functions at the end of each explanation; you can also look at my GitHub repo that has the Jupyter Notebook I used for the analysis for the paper (to be published in Sept 2020).
Create the Initial Arrays
In the paper I was studying flows between NYC and other counties, and the NYC metro area and other metropolitan statisical areas. I’ll refer just to the metro areas as my example in this post, but my functions were written to handle both types of places, stored in separate dataframes. I began with a large dataframe with every metro that exchanged migrants with the NYC metro. There is a row for each metro where the index is the Census Bureau’s unique FIPS code, and columns that show inflows, outflows, and net flows year by year (see image below). There are some rows that represent aggregates, such as flows to all non-metro areas and the sum of individual metro flows that could not be disclosed due to privacy regulations.
The first step is to create an array that has just the top or bottom N places that I want to depict, just for one flow variable (in, out, or net). Why an array? Arrays are pretty solid structures that allow you to select specific rows and columns, and they mesh nicely with imshow charts as each location in the matrix can correspond with the same location in the chart. Most of the examples I looked at used arrays. It’s possible to use other structures but it’s more tedious; nested Python lists don’t have explicit rows and columns so a lot of looping and slicing is required, and with dataframes there always seems to be some catch with data types, messing with the index versus the values, or something else. I went with NumPy’s array type.
I wrote a function where I pass in the dataframe, the type of variable (in, out, or net flow), the number of places I want, whether they are counties or metro areas, and whether I want the top or bottom N records (true or false). Two arrays are returned: the first shows the FIPS unique ID numbers of each place, while the second returns the labels. You don’t have to do anything to calculate actual ranks, because once the data is sorted the ranks become implicit; each row represents ranks 1 through 10, each column represents a year, and the ID or label for a place that occupies each position indicates its rank for that year.
In my dataframe, the names of the columns are prefixed based on the type of variable (inflow, outflow, or net flow), followed by the year, i.e. inflows_2011_12. In the function, I subset the dataframe by selecting columns that start with the variable I want. I have to deal with different issues based on whether I’m looking at counties or metro areas, and I need to get rid of any IDs that are for summary values like the non-metro areas; these IDS are stored in a list called suppressed, and the ~df.indexisin(suppressed) is pandaesque for taking anything that’s not in this list (the tilde acts as not). Then, I select the top or bottom values for each year, and append them to lists in a nested list (each sub-list represents the top / bottom N places in order for a given year). Next, I get the labels I want by creating a dictionary that relates all ID codes to label names, pull out the labels for the actual N values that I have, and format them before appending them to lists in a nested list. For example, the metro labels are really long and won’t fit in the chart, so I split them and grab just the first piece: Albany-Schenectady-Troy, NY becomes Albany (split using the dash) while Akron, OH becomes Akron (if no dash is present, split at comma). At the end, I use np.array to turn the nested lists into arrays, and transpose (T) them so rows become ranks and years become values. The result is below:
# Create array of top N geographies by flow type, with rows as ranks and columns as years
# Returns 2 arrays with values for geographies (id codes) and place names
# Must specify: number of places to rank, counties or metros, or sort by largest or smallest (True or False)
cols=[c for c in df if c.startswith(flowtype)]
for c in cols:
if largest is True:
elif largest is False:
for row in geogs:
for uid in row:
if fips[uid]=='District of Columbia, DC':
line.append(fips[uid].replace('County, ','\n')) #creates short labels
if '-' in fips[uid]:
line.append(fips[uid].split('-')) #creates short labels
return a_geogs, a_labels
Change in Rank Array
Using the array of geographic ID codes, I can feed this into function number two to create a new array that indicates change in rank over time. It’s better to use the ID code array as we guarantee that the IDs are unique; labels (place names) may not be unique and pose all kinds of formatting issues. All places are assigned a value of 0 for the first year, as there is no previous year to compare them to. Then, for each subsequent year, we look at each value (ID code) and compare it to the value in the same position (rank) in the previous column (year). If the value is the same, that place holds the same rank and is assigned a 0. Otherwise, if it’s different we look at the new value and see what position it was in in the previous year. If it was in a higher position last year, then it has declined and we assign -1. If it was in a lower position last year or was not in the array in that column (i.e. below the top 10 in that year) it has increased and we assign it a value of 1. This result is shown below:
# Create array showing how top N geographies have changed ranks over time, with rows as rank changes and
# columns as years. Returns 1 array with values: 0 (no change), 1 (increased rank), and -1 (descreased rank)
# Create a number of blank lists
changelist = [ for _ in range(rowcount)]
for i in range(colcount):
# Rank change for 1st year is 0, as there is no previous year
for j in range(rowcount):
col=geoarray[:,i] #Get all values in this col
prevcol=geoarray[:,i-1] #Get all values in previous col
for v in col:
array_pos=np.where(col == v) #returns array
current_pos=int(array_pos) #get first array value
array_pos2=np.where(prevcol == v) #returns array
if len(array_pos2)==0: #if array is empty, because place was not in previous year
previous_pos=int(array_pos2) #get first array value
#No change in rank
elif current_posprevious_pos: #Larger value = smaller rank
#Rank has decreased
Create the Plot
Now we can create the actual chart! The rank change array is what will actually be charted, but we will use the labels array to display the names of each place. The values that occupy the positions in each array pertain to the same place. The chart function takes the names of both these arrays as input. I do some fiddling around at the beginning to get the labels for the x and y axis the way I want them. Matplotlib allows you to modify every iota of your plot, which is in equal measures flexible and overwhelming. I wanted to make sure that I showed all the tick labels, and changed the default grid lines to make them thicker and lighter. It took a great deal of fiddling to get these details right, but there were plenty of examples to look at (Matplotlib docs, cookbook, Stack Overflow, and this example in particular). For the legend, shrinking the colorbar was a nice option so it’s not ridiculously huge, and I assign -1, 0, and 1 to specific colors denoting decrease, no change, and increase. I loop over the data values to get their corresponding labels, and depending on the color that’s assigned I can modify whether the text is dark or light (so you can see it against the background of the cell). The result is what you saw at the beginning of this post for outflows (top destinations for migrants leaving the NY metro). The function call is below:
# Create grid plot based on an array that shows change in ranks and an array of cell labels
xlabels=[yr.replace('_','-') for yr in years]
mycolors = colors.ListedColormap(['#de425b','#f7f7f7','#67a9cf'])
fig, ax = plt.subplots(figsize=(10,10))
im = ax.imshow(rank_change, cmap=mycolors)
# Show all ticks...
# ... and label them with the respective list entries
# Create white grid.
ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
cbar = ax.figure.colorbar(im, ax=ax, ticks=[1,0,-1], shrink=0.5)
# Loop over data dimensions and create text annotations.
for i in range(len(ylabels)):
for j in range(len(xlabels)):
if rank_change[i,j] &amp;lt; 0:
text = ax.text(j, i, alabels[i, j],
ha="center", va="center", color="w", fontsize=10)
text = ax.text(j, i, alabels[i, j],
ha="center", va="center", color="k", fontsize=10)
#ax.set_title("Change in Rank Over Time")
Conclusions and Alternatives
I found that this approach worked well for my particular circumstances, where I had a limited number of data points to show and the ranks didn’t fluctuate much from year to year. The charts for ten observations displayed over seven points in time fit easily onto standard letter-sized paper; I could even get away with adding two additional observations and an eighth point in time if I modified the size and placement of the legend. However, beyond that you can begin to run into trouble. I generated charts for the top twenty places so I could see the results for my own analysis, but it was much too large to create a publishable graphic (at least in print). If you decrease the dimensions for the chart or reduce the size of the grid cells, the labels start to become unreadable (print that’s too small or overlapping labels).
There are a number of possibilities for circumventing this. One would be to use shorter labels; if we were working with states or provinces we can use the two-letter postal codes, or ISO country codes in the case of countries. Not an option in my example. Alternatively, we could move the place names to the y-axis (sorted alphabetically or by first or final year rank) and then use the rank as the annotation label. This would be a fundamentally different chart; you could see how one place changes in rank over time, but it would be tougher to discern which places were the most important source / destination for the area you’re studying (you’d have to skim through the whole chart). Or, you could keep ranks on the y-axis and assign each place a unique color in the legend, shade the grid cells using that color, and thus follow the changing colors with your eye. But this flops is you have too many places / colors.
A different caveat is this approach doesn’t work so well if there is a lot of fluctuation in ranks from year to year. In this example, the top inflows and outflows were relatively stable from year to year. There were enough places that held the same rank that you could follow the places that changed positions. We saw the example above for outflows, below is an example for inflows (i.e. the top origins or sources of migrants moving to the NY metro):
In contrast, the ranks for net flows were highly variable. There was so much change that the chart appears as a solid block of colors with few neutral (unchanged) values, making it difficult to see what’s going on. An example of this is below, representing net flows for the NYC metro area. This is the difference between inflows and outflows, and the chart represents metros that receive more migrants from New York than they send (i.e. net receivers of NY migrants). While I didn’t use the net flow charts in my paper, it was still worth generating as it made it clear to me that net flow ranks fluctuate quite a bit, which was a fact I could state in the text.
There are also a few alternatives to using imshow. Matplotlib’s pcolor plot can produce similar effects but with rectangles instead of square grid cells. That could allow for more observations and longer labels. I thought it was less visually pleasing than the equal grid, and early on I found that implementing it was clunkier so I went no further. My other idea was to create a table instead of a chart. Pandas has functions for formatting dataframes in a Jupyter Notebook, and there are options for exporting the results out to HTML. Formatting is the downside – if you create a plot as an image, you export it out and can then embed it into any document format you like. When you’re exporting tables out of a notebook, you’re only exporting the content and not the format. With a table, the content and formatting is separate, and the latter is often tightly bound to the publication format (Word, LaTeX, HTML, etc.) You can design with this in mind if you’re self-publishing a blog post or report, but this is not feasible when you’re submitting something for publication where an editor or designer will be doing the layout.
I really wanted to produce something that I could code and run automatically in many different iterations, and was happy with this solution. It was an interesting experiment, as I grappled with taking something that seemed intuitive to do the old-fashioned way (see below) and reproducing it in a digital, repeatable format.
Each semester I visit several undergraduate classes in public affairs and journalism, to introduce students to census data. They’re researching or reporting on particular issues and trends in neighborhoods in New York City, and they are looking for statistics to either support their work or generate ideas for a story. I usually showcase the NYC Population Factfinder as a starting point, mention the Census Reporter for areas outside the city , and provide background info on the decennial census, American Community Survey, and census geography and subjects. This year I included two new examples toward the beginning of the lecture to spark their interest.
I recently helped reporter Susannah Jacob navigate census data for an article she wrote on hyper-gentrification in the West Village for the New York Review of Books. A perfect example, as it’s what the students are expected to do for their assignment! Like any good journalist (and human geographer), Susannah pounded the pavement of the neighborhood, interviewing residents and small businesses and observing and documenting the urban landscape and how it was changing. But she also wanted to see what the data could tell her, and whether it would corroborate or refute what she was seeing and hearing.
We used the NYC Population Factfinder to assemble census tracts to approximate the neighborhood, and I did a little legwork to pull data from the County / ZIP Code Business Patterns so we could see how the business landscape was changing. The most surprising stat we discovered was that the number of 1-unit detached homes had doubled. This wouldn’t be odd in many rapidly growing places in the US, but it’s unusual for an old, built-out urban neighborhood. A 1-unit detached home is a free-standing single family structure that doesn’t share walls with other buildings. Most homes in Manhattan are either attached (row houses / town houses) or units in multi-unit buildings (apartments / condos / co-ops). How could this be? Uber-wealthy people are buying up adjoining row homes, knocking down the walls, and turning them into urban mansions. Seems extraordinary, but apparently is part of a trend.
We certainly ran up against the limitations of ACS data. The estimates for tracts have large margins of error, and when comparing two short time frames it’s difficult to detect actual change, as differences in estimates are clouded by sampling noise. Even after aggregating several tracts, many of the estimates for change weren’t reliable enough to report. When they were (as in the housing example) you could only say that there has been a relative increase without becoming wedded to a precise number. In this case, from 214 (+/- 127) detached units in 2006-2010 to 627 (+/-227) in 2013-2017, an increase of 386 (+/- 260). Not great estimates, but you can say it’s an increase as the low end for change is still positive at 126 units. Considering the time frame and character of the neighborhood, that’s still noteworthy (bearing in mind we’re working with a 90% confidence interval). In cases where the differences overlap and could represent either an increase or decrease there are few claims you can make, and it’s best to walk away (or look at larger area). I always discuss the margin of error with students and caution them about treating these numbers as counts.
While census data is invaluable for describing and studying individual places, it’s inherent geographic nature also allows us to study places in relation to each other, and to illustrate geographic patterns. For my second example, I zoom out and show them this map of racial-ethnic distribution in the United States:
This is one of a series of six maps by demographer William Frey at the Brookings Institute that highlights the geographic diversity of the United States. In this map, each county is shaded for a particular race / ethnicity if the population of that group in that county is greater than that group’s share of the national population. For example, Hispanics / Latinos represent 18.3% of the total US population, so counties where they represent more than this percentage are shaded.
For the purpose of the class, it helps make the census ‘pop’ and gets the students to think about the statistics as geospatial datasets that they can see and relate to, and that can form the basis for interesting research.
I used the pic at the top of this post as the background for my intro slide. It’s a screenshot of a city from A-Train, a 1992 city-building train simulator that was ported from Japan to the world by Artdink and Maxis, following the success of something called SimCity. It wasn’t nearly as successful, but I always liked the graphics which have now attained a retro-gaming vibe.
Most of the participants were librarians who were interested in learning how to help patrons find and understand census data, but there were also some data analysts in the crowd. We began with an overview of how the census is structured by dataset, geography, and subject categories. I always cover the differences between the decennial census and the ACS, with a focus on how to interpret ACS estimates and gauge their reliability.
For workshops I think it’s best to start with searching for profiles (lots of different data for one place). This gives new users a good overview of the breadth and depth of the types of variables that are available in the census. Since this was a New York City-centric crowd we looked at the City’s excellent NYC Population Factfinder first. The participants formed small groups and searched through the application to answer a series of fact-finding questions that I typically receive. Beyond familiarizing themselves with the applications and data, the exercises also helped to spark additional questions about how the census is structured and organized.
Then we switched over to the Missouri Census Data Center’s profile and trends applications (listed on the right hand side of their homepage) to look up data for other parts of the country, and in doing so we were able to discuss the different census geographies that are available for different places. Everyone appreciated the simple and easy to use interface and the accessible tables and graphics. The MCDC doesn’t have a map-based search, so I did a brief demo of TIGERweb for viewing census geography across the country.
Once everyone had this basic exposure, we hopped into the American Factfinder to search for comparison tables (a few pieces of data for many places). We discussed how census data is structured in tables and what the difference between the profile, summary, and detailed tables are. We used the advanced search and I introduced my tried and true method of filtering by dataset, geography, and topic to find what we need. I mentioned the Census Reporter as good place to go for ACS documentation, and as an alternate source of data. Part of my theme was that there are many tools that are suitable for different needs and skill levels, and you can pick your favorite or determine what’s suitable for a particular purpose.
We took a follow-the-leader approach for the AFF, where I stepped through the website and the process for downloading two tables and importing them into a spreadsheet, high-lighting gotchas along the way. We did some basic formulas for aggregating ACS estimates to create new margins of error, and a VLOOKUP for tying data from two tables together.
We wrapped up the morning with a foreshadowing of what’s to come with the new data.census.gov (which will replace the AFF) and the 2020 census. While there’s still much uncertainty around the citizenship question and fears of an under count, the structure of the dataset won’t be too different from 2010 and the timeline for release should be similar.
We analyzed recent population trends (2010 to 2016) in New York City and the greater metropolitan area using the US Census Bureau’s Population Estimates to study components of population change (births, deaths, domestic and international migration) and the IRS Statistics of Income division’s county to county migration data to study domestic migration flows.
Here are the main findings:
The population of New York City and the New York Metropolitan Area increased significantly between 2010 and 2016, but annually growth has slowed due to greater domestic out-migration.
Compared to other large US cities and metro areas, New York’s population growth depends heavily on foreign immigration and natural increase (the difference between births and deaths) to offset losses from domestic out-migration.
Between 2011 and 2015 the city had few relationships where it was a net receiver of migrants (receiving more migrants than it sends) from other large counties. The New York metro area had no net-receiver relationships with any major metropolitan area.
The city was a net sender (sending more migrants than it received) to all of its surrounding suburban counties and to a number of large urban counties across the US. The metro area was a net sender to metropolitan areas throughout the country.
For the domestic migration portion of the analysis we were interested in seeing the net flows between places. For example, the NYC metro area sends migrants to and receives migrants from the Miami metro. What is the net balance between the two – who receives more versus who sends more?
The answer is: the NYC metro is a net sender to most of the major metropolitan areas in the country, and has no significant net receiver relationships with any other major metropolitan area. For example, for the period from 2011 to 2015 the NYC metro’s largest net sender relationship was with the Miami metro. About 88,000 people left the NYC metro for metro Miami while 58,000 people moved in the opposite direction, resulting in a net gain of 30,000 people for Miami (or in other words, a net loss of 30k people for NYC). The chart below shows the top twenty metros where the NYC metro had a deficit in migration (sending more migrants to these areas than it received). A map of net out-migration from the NYC metro to other metros appears at the top of this post. In contrast, NYC’s largest net receiver relationship (where the NYC metro received more migrants than it sent) was with Ithaca, New York, which lost a mere 300 people to the NYC metro.
For the IRS data we used the county to county migration SQLite database that Janine meticulously constructed over the course of the last year, which is freely available on the Baruch Geoportal. Anastasia employed her Python and Pandas wizardry to create Jupyter notebooks that we used for doing our analysis and generating our charts, all of which are available on github. I used an alternate approach with Python and the SQLite and prettytable modules to generate estimates independently of Anastasia, so we could compare the two and verify our numbers (we were aggregating migration flows across years and geographies from several tables, and calculating net flows between places).
One of our goals for this project was to use modern tools and avoid the clunky use of email. With the Jupyter notebooks, git and github for storing and syncing our work, and ShareLaTeX for writing the paper, we avoided using email for constantly exchanging revised versions of scripts and papers. Ultimately I had to use latex2rtf to convert the paper to a word processing format that the publisher could use. This post helped me figure out which bibliography packages to choose (in order for latex2rtf to interpret citations and references, you need to use the older natbib & bibtex combo and not biblatex & biber).
The staff at the Population Division at NYC City Planning take the limitations of the American Community Survey (ACS) data seriously. Census estimates for tract-level data tend to be unreliable; to counter this, they aggregate tracts into larger Neighborhood Tabulation Areas (NTAs) to produce estimates that have better precision. In their Census Factfinder tool, they display but grey-out variables where the margin of error (MOE) is unacceptably large. If users want to aggregate geographies, the Factfinder does the work of re-computing the margins of error.
Now they’ve released a new tool for census mappers. The Map Reliability Calculator is an Excel spreadsheet for measuring the reliability of classification schemes for making choropleth maps. Because each ACS estimate is published with a MOE, it’s possible that certain estimates may fall outside their designated classification range.
For example, we’re 90% confident that 60.5% plus or minus 1.5% of resident workers 16 years and older in Forest Hills, Queens took public transit to work during 2011-2015. The actual value could be as low as 59% or as high as 62%. Now let’s say we have a classification scheme that has a class with a range from 60% to 80%. Forest Hills would be placed in this class since its estimate is 60.5%, but it’s possible that it could fall into the class below it given the range of the margin of error (as the value could be as low as 59%).
The tool determines how good your classification scheme is by calculating the percent of estimates that could fall outside their assigned class, based on each MOE and the break point of the class. On the left of the sheet you paste your estimates and MOEs, and then type the number of classes you want. On the right, the reliability of classifying that data is calculated for equal intervals (equal range of values in each class) and quantiles (equal number of data points in each class). You can see the reliability of each class and the overall reliability of the scheme. The scheme is classified as reliable if: no individual class has more than 20% of its values identified as possibly falling outside the class, and less than 10% of all the scheme’s values possibly fall outside their classes.
I pasted some 5-year ACS data for NYC PUMAs below (the percentage of workers 16 years and older who take public transit to work in 2011-2015) under STEP 1. In STEP 2 I entered 5 for the number of classes. In the classification schemes on the right, equal intervals is reliable; only 6.6% of the values may fall outside their class. Quantiles was not reliable; 11.9% fell outside. If I reduce the number of classes to 4, reliability improves and both schemes fall under 10%; although unreliability for one of the classes for quantiles is high at 18%, but still below the 20% threshold. Equal intervals should usually perform better than quantiles, as the latter scheme can make rather arbitrary breaks that result in small differences in value ranges between classes (in order to insure that each class has the same number of data points).
You can also enter custom-defined schemes. For example let’s say you use natural breaks (classes determined by gaps in value ranges). There’s a 2-step process here; first you classify the data in GIS and determine what the breaks are, and then you enter them in the spreadsheet. If you’re using QGIS there’s a snag in doing this; QGIS doesn’t show you the “true” breaks of your data based on the actual values, and when you classify data it displays clean breaks that overlap. For example, natural breaks of this data with 5 classes appears like this:
So, does the value for 29.0 fall in the first class or the second? The answer is, the first (test it by selecting that record in the attribute table and see where it is on the map, and what color it is). So you need to adjust the values appropriately, paying attention to the precision and scale of your numbers. In this case I bump the first value of each class up by .1, except for the bottom class which you leave alone:
In the calculator you have to enter the top class value first, and just the first value in the range:
In this case only 7.1% of the total values may fall outside their class so things look good – but my bottom class barely makes the minimum class threshold at 19.4%. I can try dropping the classes down to 4 or I can manually adjust this class to see if I can improve reliability.
If you’re unsure if you made the right adjustments to the classes in translating them from QGIS to the calculator, in QGIS turn on the Show Feature Count option for the layer to see how many data points are in each class, and compare that to the class counts in the calculator. If they don’t match, you need to re-adjust.
This is a great tool for census mappers who want or need to account for issues with ACS reliability. It’s an Excel spreadsheet but I used it in LibreOffice Calc with no problem. In addition to the calculator sheet there’s a second sheet with instructions and background info. Download the Map Reliability Calculator here. You can try it out with this test data, workers who commute with mass transit, 2011-2015 ACS for NYC PUMAs.