I have some news! After 13 1/2 years, January 31, 2021 will be my last day as the Geospatial Data Librarian at Baruch College, City University of New York (CUNY). On February 1, 2021, I will be the new GIS and Data Librarian at Brown University in Providence, Rhode Island!
It’s an exciting opportunity that I’m looking forward to. I will be building geospatial information and data services in the library from the ground up, in concert with many new colleagues. I will be working closely with the Population Studies Training Center (PSTC) and the Spatial Structures in Social Sciences (S4) as well as the Center for Digital Scholarship within the library. Several aspects of the position will be similar, as I will continue to provide research and consultation services, create research guides and tutorials, teach workshops, collect and create datasets, and eventually build and manage a data repository and small lab where we’ll provide services and peer mentor students.
The resources I’ve created at Baruch CUNY will remain accessible, and eventually a new person will take the reins. I have moved the latest materials for the GIS Practicum, my introductory QGIS tutorial and workshop, to this website and I hope to continue updating and maintaining this resource. There are a lot of people throughout CUNY that I’m going to miss, at: the Newman Library, the CUNY Institute for Demographic Research, the Weissman Center for International Business, the Marxe School, Baruch’s Journalism Department, the Geography Department at Lehman College, the Digital Humanities program and the CUNY Mapping Service at the CUNY Graduate Center, and many others.
I will continue writing posts and sharing tips and resources here based on my new adventures at Brown, but may need a little break as I transition… stay tuned!
For this final post of 2020, I was looking back through recent projects for something interesting yet brief; I’ve been writing some encyclopedia-length posts lately and wanted to keep this one on the lighter side. In that vein, I’ve decided to share a short list of free web mapping services that I use as basemaps in QGIS (they’ll work in ArcGIS too). This has been on my mind as I’ve recently stumbled upon the OpenTopoMap, which is an alternate stylized version of the OpenStreetMap that looks pretty sharp.
Select the appropriate web map service type in the browser panel (usually WMS / WMTS or XYZ Tiles), right click, and add new connection.
Give it a meaningful name, paste the appropriate URL into the URL box, click OK.
In the browser panel drill down to see the service, and for WMS / WMTS layers you can drill down further to see specific layers you can add.
Select the layer and drag it into the window, or select, right click, and add the layer to the project.
If the resolution looks off, right click on a blank area of the toolbar and check the Tile Scale Panel. Use this to adjust the zoom for the web map. If the scale bar is greyed out you’ll need to set the map window to the same CRS as the map service: select the layer in the panel, right click, and choose set CRS – set project CRS from layer.
Some web layers may render slowly if you’re zoomed out to the full extent, or even not at all if they contain many features or are super detailed. Conversely, some layers may not render if you’re zoomed too far in, as tiles may not be available at that resolution. Experiment!
If you’re an ArcGIS user see these concise instructions for adding various tile layers. This isn’t something that I’ve ever done, as ArcGIS already has a number of accessible basemaps that you can add.
In the list below, links for the service name take you to either the website version of the service, or to a list of additional layers that you can connect to. The URLs that follow are the actual connections to the service that you’ll use within your GIS package. If you use OSM, OTP, or Stamen in your maps, make sure to cite them (they use Creative Commons Licenses – follow links to their websites for details). The government sources are public domain, but you should still cite them anyway. Happy mapping, and happy holidays!
Current TIGER features:
Current physical features:
In this post I’ll demonstrate how I created annotated heatmaps (or what I’m calling a rank change grid) showing change in rank over time using Python and Matplotlib’s imshow plots. I was writing a report on population trends and internal migration using the IRS county to county migration dataset, and wanted to depict the top origins and destinations of migrants for New York City and the New York Metropolitan Area and how they changed from year to year.
I hit upon this idea based on an example in the Matplotlib documentation using the imshow plot. Imshow was designed for manipulating and creating images, but since images are composed of rows and columns of pixels you can use this function to create grids (for GIS folks, think of a raster). The rows can indicate rank from 1 to N, while the columns could represent time, which in my case is years. I could label each grid cell with the name of a place (i.e. origin or destination), and if a place changes ranks over time I could assign the cell a color indicating increase or decrease; otherwise I’d assign a neutral color indicating no change. The idea is that you could look at place at a given rank in year 1 and follow it across the chart by looking at the label. If a new place appears in a given position, the color change clues you in, and you can quickly scan to see whether a given place went up or down.
The image below shows change in rank for the top metro area destinations for migrants leaving the NYC metro from 2011 to 2018. You can see that metro Miami was the top destination for several years, up until 2016-17 when it flips positions with metro Philadelphia, which had been the number 2 destination. The sudden switch from a neutral color indicates that the place occupying this rank is new. You can also follow how 3rd ranked Bridgeport falls to 4th place in the 2nd year (displaced by Los Angeles), remains in 4th place for a few years, and then falls to 5th place (again bumped by Los Angeles, which falls from 3rd to 4th as it’s bumped by Poughkeepsie).
I opted for this over a more traditional approach called a bump chart (also referred to a slope chart or graph), with time on the x-axis and ranks on the y-axis, and observations labeled at either the first or last point in time. Each observation is assigned a specific color or symbol, and lines connect each observation to its changing position in rank so you can follow it along the chart. Interpreting these charts can be challenging; if there are frequent changes in rank the whole thing begins to look like spaghetti, and the more observations you have the tougher it gets to interpret. Most of the examples I found depicted a small and finite number of observations. I have hundreds of observations and only want to see the top ten, and if observations fall in and out of the top N ranks you get several discontinuous lines which look odd. Lastly, neither Matplotlib or Pandas have a default function for creating bump charts, although I found a few examples where you could create your own.
Creating the rank change grids was a three-part process that required: taking the existing data and transforming it into an array of the top or bottom N values that you want to show, using that array to generate an array that shows change in ranks over time, and generating a plot using both arrays, one for the value and the other for the labels. I’ll tackle each piece in this post. I’ve embedded the functions at the end of each explanation; you can also look at my GitHub repo that has the Jupyter Notebook I used for the analysis for the paper (to be published in Sept 2020).
Create the Initial Arrays
In the paper I was studying flows between NYC and other counties, and the NYC metro area and other metropolitan statisical areas. I’ll refer just to the metro areas as my example in this post, but my functions were written to handle both types of places, stored in separate dataframes. I began with a large dataframe with every metro that exchanged migrants with the NYC metro. There is a row for each metro where the index is the Census Bureau’s unique FIPS code, and columns that show inflows, outflows, and net flows year by year (see image below). There are some rows that represent aggregates, such as flows to all non-metro areas and the sum of individual metro flows that could not be disclosed due to privacy regulations.
The first step is to create an array that has just the top or bottom N places that I want to depict, just for one flow variable (in, out, or net). Why an array? Arrays are pretty solid structures that allow you to select specific rows and columns, and they mesh nicely with imshow charts as each location in the matrix can correspond with the same location in the chart. Most of the examples I looked at used arrays. It’s possible to use other structures but it’s more tedious; nested Python lists don’t have explicit rows and columns so a lot of looping and slicing is required, and with dataframes there always seems to be some catch with data types, messing with the index versus the values, or something else. I went with NumPy’s array type.
I wrote a function where I pass in the dataframe, the type of variable (in, out, or net flow), the number of places I want, whether they are counties or metro areas, and whether I want the top or bottom N records (true or false). Two arrays are returned: the first shows the FIPS unique ID numbers of each place, while the second returns the labels. You don’t have to do anything to calculate actual ranks, because once the data is sorted the ranks become implicit; each row represents ranks 1 through 10, each column represents a year, and the ID or label for a place that occupies each position indicates its rank for that year.
In my dataframe, the names of the columns are prefixed based on the type of variable (inflow, outflow, or net flow), followed by the year, i.e. inflows_2011_12. In the function, I subset the dataframe by selecting columns that start with the variable I want. I have to deal with different issues based on whether I’m looking at counties or metro areas, and I need to get rid of any IDs that are for summary values like the non-metro areas; these IDS are stored in a list called suppressed, and the ~df.indexisin(suppressed) is pandaesque for taking anything that’s not in this list (the tilde acts as not). Then, I select the top or bottom values for each year, and append them to lists in a nested list (each sub-list represents the top / bottom N places in order for a given year). Next, I get the labels I want by creating a dictionary that relates all ID codes to label names, pull out the labels for the actual N values that I have, and format them before appending them to lists in a nested list. For example, the metro labels are really long and won’t fit in the chart, so I split them and grab just the first piece: Albany-Schenectady-Troy, NY becomes Albany (split using the dash) while Akron, OH becomes Akron (if no dash is present, split at comma). At the end, I use np.array to turn the nested lists into arrays, and transpose (T) them so rows become ranks and years become values. The result is below:
# Create array of top N geographies by flow type, with rows as ranks and columns as years
# Returns 2 arrays with values for geographies (id codes) and place names
# Must specify: number of places to rank, counties or metros, or sort by largest or smallest (True or False)
cols=[c for c in df if c.startswith(flowtype)]
for c in cols:
if largest is True:
elif largest is False:
for row in geogs:
for uid in row:
if fips[uid]=='District of Columbia, DC':
line.append(fips[uid].replace('County, ','\n')) #creates short labels
if '-' in fips[uid]:
line.append(fips[uid].split('-')) #creates short labels
return a_geogs, a_labels
Change in Rank Array
Using the array of geographic ID codes, I can feed this into function number two to create a new array that indicates change in rank over time. It’s better to use the ID code array as we guarantee that the IDs are unique; labels (place names) may not be unique and pose all kinds of formatting issues. All places are assigned a value of 0 for the first year, as there is no previous year to compare them to. Then, for each subsequent year, we look at each value (ID code) and compare it to the value in the same position (rank) in the previous column (year). If the value is the same, that place holds the same rank and is assigned a 0. Otherwise, if it’s different we look at the new value and see what position it was in in the previous year. If it was in a higher position last year, then it has declined and we assign -1. If it was in a lower position last year or was not in the array in that column (i.e. below the top 10 in that year) it has increased and we assign it a value of 1. This result is shown below:
# Create array showing how top N geographies have changed ranks over time, with rows as rank changes and
# columns as years. Returns 1 array with values: 0 (no change), 1 (increased rank), and -1 (descreased rank)
# Create a number of blank lists
changelist = [ for _ in range(rowcount)]
for i in range(colcount):
# Rank change for 1st year is 0, as there is no previous year
for j in range(rowcount):
col=geoarray[:,i] #Get all values in this col
prevcol=geoarray[:,i-1] #Get all values in previous col
for v in col:
array_pos=np.where(col == v) #returns array
current_pos=int(array_pos) #get first array value
array_pos2=np.where(prevcol == v) #returns array
if len(array_pos2)==0: #if array is empty, because place was not in previous year
previous_pos=int(array_pos2) #get first array value
#No change in rank
elif current_posprevious_pos: #Larger value = smaller rank
#Rank has decreased
Create the Plot
Now we can create the actual chart! The rank change array is what will actually be charted, but we will use the labels array to display the names of each place. The values that occupy the positions in each array pertain to the same place. The chart function takes the names of both these arrays as input. I do some fiddling around at the beginning to get the labels for the x and y axis the way I want them. Matplotlib allows you to modify every iota of your plot, which is in equal measures flexible and overwhelming. I wanted to make sure that I showed all the tick labels, and changed the default grid lines to make them thicker and lighter. It took a great deal of fiddling to get these details right, but there were plenty of examples to look at (Matplotlib docs, cookbook, Stack Overflow, and this example in particular). For the legend, shrinking the colorbar was a nice option so it’s not ridiculously huge, and I assign -1, 0, and 1 to specific colors denoting decrease, no change, and increase. I loop over the data values to get their corresponding labels, and depending on the color that’s assigned I can modify whether the text is dark or light (so you can see it against the background of the cell). The result is what you saw at the beginning of this post for outflows (top destinations for migrants leaving the NY metro). The function call is below:
# Create grid plot based on an array that shows change in ranks and an array of cell labels
xlabels=[yr.replace('_','-') for yr in years]
mycolors = colors.ListedColormap(['#de425b','#f7f7f7','#67a9cf'])
fig, ax = plt.subplots(figsize=(10,10))
im = ax.imshow(rank_change, cmap=mycolors)
# Show all ticks...
# ... and label them with the respective list entries
# Create white grid.
ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
cbar = ax.figure.colorbar(im, ax=ax, ticks=[1,0,-1], shrink=0.5)
# Loop over data dimensions and create text annotations.
for i in range(len(ylabels)):
for j in range(len(xlabels)):
if rank_change[i,j] &amp;lt; 0:
text = ax.text(j, i, alabels[i, j],
ha="center", va="center", color="w", fontsize=10)
text = ax.text(j, i, alabels[i, j],
ha="center", va="center", color="k", fontsize=10)
#ax.set_title("Change in Rank Over Time")
Conclusions and Alternatives
I found that this approach worked well for my particular circumstances, where I had a limited number of data points to show and the ranks didn’t fluctuate much from year to year. The charts for ten observations displayed over seven points in time fit easily onto standard letter-sized paper; I could even get away with adding two additional observations and an eighth point in time if I modified the size and placement of the legend. However, beyond that you can begin to run into trouble. I generated charts for the top twenty places so I could see the results for my own analysis, but it was much too large to create a publishable graphic (at least in print). If you decrease the dimensions for the chart or reduce the size of the grid cells, the labels start to become unreadable (print that’s too small or overlapping labels).
There are a number of possibilities for circumventing this. One would be to use shorter labels; if we were working with states or provinces we can use the two-letter postal codes, or ISO country codes in the case of countries. Not an option in my example. Alternatively, we could move the place names to the y-axis (sorted alphabetically or by first or final year rank) and then use the rank as the annotation label. This would be a fundamentally different chart; you could see how one place changes in rank over time, but it would be tougher to discern which places were the most important source / destination for the area you’re studying (you’d have to skim through the whole chart). Or, you could keep ranks on the y-axis and assign each place a unique color in the legend, shade the grid cells using that color, and thus follow the changing colors with your eye. But this flops is you have too many places / colors.
A different caveat is this approach doesn’t work so well if there is a lot of fluctuation in ranks from year to year. In this example, the top inflows and outflows were relatively stable from year to year. There were enough places that held the same rank that you could follow the places that changed positions. We saw the example above for outflows, below is an example for inflows (i.e. the top origins or sources of migrants moving to the NY metro):
In contrast, the ranks for net flows were highly variable. There was so much change that the chart appears as a solid block of colors with few neutral (unchanged) values, making it difficult to see what’s going on. An example of this is below, representing net flows for the NYC metro area. This is the difference between inflows and outflows, and the chart represents metros that receive more migrants from New York than they send (i.e. net receivers of NY migrants). While I didn’t use the net flow charts in my paper, it was still worth generating as it made it clear to me that net flow ranks fluctuate quite a bit, which was a fact I could state in the text.
There are also a few alternatives to using imshow. Matplotlib’s pcolor plot can produce similar effects but with rectangles instead of square grid cells. That could allow for more observations and longer labels. I thought it was less visually pleasing than the equal grid, and early on I found that implementing it was clunkier so I went no further. My other idea was to create a table instead of a chart. Pandas has functions for formatting dataframes in a Jupyter Notebook, and there are options for exporting the results out to HTML. Formatting is the downside – if you create a plot as an image, you export it out and can then embed it into any document format you like. When you’re exporting tables out of a notebook, you’re only exporting the content and not the format. With a table, the content and formatting is separate, and the latter is often tightly bound to the publication format (Word, LaTeX, HTML, etc.) You can design with this in mind if you’re self-publishing a blog post or report, but this is not feasible when you’re submitting something for publication where an editor or designer will be doing the layout.
I really wanted to produce something that I could code and run automatically in many different iterations, and was happy with this solution. It was an interesting experiment, as I grappled with taking something that seemed intuitive to do the old-fashioned way (see below) and reproducing it in a digital, repeatable format.
I have recently created an atcoordinates YouTube channel that features a series of how-to videos on finding and accessing US census data using a variety of websites and tools. I explain basic census concepts while demonstrating how to access data. At this point there are four videos:
Exploring US Census Data: Basic Concepts. This is a narrated slide show where I cover the essential choices you need to make and concepts you need to understand in order to access census data, regardless of the tool or platform: data set, time period, subjects or topics, and geography. I discuss the decennial census, American Community Survey, and population estimates. This video is intended as a prerequisite for viewing the others, so I don’t have to explain the same concepts each time and can focus on demonstrating each particular application.
American Community Survey Census Profiles with MCDC Apps. This screencast illustrates how you can quickly and easily access census profiles for any place in the US using the Missouri Census Data Center’s profile applications. It’s also a good introduction to census data in general, if you’re unfamiliar with the scope of data that’s available.
I plan on adding additional videos every month or so. The pandemic lock down and uncertainty over whether classes will be back in session this fall inspired me to do this. While I prefer written tutorials, I find that I’ve been watching YouTube more often for learning how to do certain tasks with particular software, so I thought this would be useful for others. The videos average about 10 to 15 minutes in length, although the introductory one is a bit longer. The length is intentional; I wanted to explain the concepts and describe why you’re making certain choices, instead of simply pointing and clicking without any explanation.
Feel free to spread the news, share and embed the videos in research guides or web pages, and use them in classes or workshops. Of course, for a more in-depth look at US census data, check out my book: Exploring the US Census: Your Guide to America’s Datapublished by SAGE.
I just posted an updated version of my QGIS tutorial / workbook manual, Introduction to GIS Using Open Source Software. This version was written for QGIS 3.10 A Coruña, which recently superseded QGIS 3.4 Madeira as the current Long Term Release (LTR). The LTR is intended to be more stable than the current releases and is supported for at least a year.
The workbook was designed to accompany a day-long introductory workshop that I teach and is divided into five chapters. Chapter 1 is a broad and concise overview of GIS, chapters 2 to 4 are hands-on exercises that cover: the basics of using the interface and the difference between vectors and rasters (chapter 2), a site selection analysis that demonstrates geoprocessing, spatial selection, table joins, coordinate plotting, expressions, and spatial analysis (chapter 3), and a thematic mapping example that illustrates coordinate reference systems (CRS), data classification, and mapping (chapter 4). Chapter 5 summarizes data sources and resources for learning more about GIS. In chapters 2 to 4 the steps for doing the exercises are kept concise with many screenshots, while detailed commentary explaining how everything works follows.
The manual and tutorial data are freely available for personal and classroom use under a Creative Commons license. I’m providing the material for both 3.10 A Coruna and 3.4 Madeira for now, but will take down the latter at the end of the spring semester (late May 2020).
The changes between 3.4 and 3.10 are not dramatic as far as the basic tools and principles that I cover in the book go, but I thought an update was worthwhile as there are just enough changes that could trip up new users (see the 3.10 visual change log for a full list of software updates). In addition to incorporating changes to the interface, I also took the opportunity to tighten up and condense the material. In particular, I consolidated the coordinate reference system (CRS) exercises in chapter 4 from two sections to one, because in practice I found that it was overkill for a one-day session.
Here are a few noteworthy changes to the tutorial and software that impact novice users:
The default setting for the toolbar buttons is rather small, so during the setup phase in chapter 2 I inserted an optional step to make them bigger. Go to: Settings > Options > General tab, and under the Application section change the icon size from 16 to 24.
In 3.10, when new files are generated from geoprocessing operations and added to your project, the layers appear in the layers panel with the name you give them. In 3.4 they were assigned generic aliases like “Clipped” and “Buffer” based on the process you ran.
In 3.10 the “Quantiles” classification scheme has been replaced with “Equal Counts”. Same scheme, different terminology.
There’s now a dedicated north arrow button in the map layout screen. In 3.4 and earlier versions you added an arrow by selecting the add image button.
In 3.10, every time you add a layer with a CRS that doesn’t match the existing CRS of the window, you’re presented with a datum transformation screen to modify the file you’re adding. This is a helpful warning if you already have existing layers in your project that match the window and your new file doesn’t, but it’s annoying when you’re trying to add files to a blank window in a new project. You can turn this feature off under: Options > Settings > CRS tab, under Default Datum Transformations uncheck the box for Ask for datum transformation.
It’s hard to believe that this is the 10th edition I’ve published in the past ten years. QGIS has certainly come a long way during that time. For a trip down memory lane, look at the 1st edition I wrote for QGIS 1.5 Tethys in 2011! Back then I wrote the whole thing in HTML… thankfully I “discovered” LaTeX a year later, and have used it for writing tutorials ever since.
If you wanted to learn GIS in general and QGIS in particular, spend a day with the manual and work through the exercises and you’ll have a good foundation. All the basics are there, as well as best practices and the “gotchas” that tend to trip people up.
In this post I’ll demonstrate some essential data processing steps prior to joining census American Community Survey (ACS) tables downloaded from data.census.gov to TIGER shapefiles, in order to create thematic maps. I thought this would be helpful for students in my university who are now doing GIS-related courses from home, due to COVID-19. I’ll illustrate the following with Excel and QGIS: choosing an appropriate boundary file for making your map, manipulating geographic id codes (GEOIDs) to insure you can match data file to shapefile, prepping your spreadsheet to insure that the join will work, and calculating new summaries and percent totals with ACS formulas. Much of this info is drawn from the chapters in my book that cover census geography (chapter 3), ACS data (chapter 6), and GIS (chapter 10). I’m assuming that you already have some basic spreadsheet, GIS, and US census knowledge.
For readers who are not interested in the technical details, you still may be interested in the map we’ll create in this example: how many children under 18 lack access to a computer with internet access at home? With COVID-19 there’s a sudden expectation that all school children will take classes remotely from home. There are 73.3 million children living in households in the US, and approximately 9.3 million (12.7%) either have no computer at home, or have a computer but no internet access. The remaining children have a computer with either broadband or dial-up at home. Click on the map below to explore the county distribution of the under 18 population who lack internet access at home, or follow this link: https://arcg.is/0TrGTy.
Second, we need a boundary file of counties. You could go to the TIGER Line Shapefiles, which provides precise boundaries of every geographic area. Since we’re using this data to make a thematic map, I suggest using the Cartographic Boundary Files (CBF) instead, which are generalized versions of TIGER. Coastal water has been removed and boundaries have been smoothed to make the file smaller and less detailed. We don’t need all the detail if we’re making a national-scale map of the US that’s going on a small screen or an 8 1/2 by 11 piece of paper. I’m using the medium (5m) generalized county file for 2018. Download the files, put them together in a new folder on your computer, and unzip them.
TIGER Line shapefile
Downloads from data.census.gov include three csv files per table that contain: the actual data (data_with_overlays), metadata (list of variable ids and names), and a description of the table (table_title). There are some caveats when opening csv files with Excel, but they don’t apply to this example (see addendum to this post for details). Open your csv file in Excel, and save it as an Excel workbook (don’t keep it in a csv format).
The first column contains the GEOID, which is a code that uniquely identifies each piece of geography in the US. In my file, 0500000US28151 is the first record. The part before ‘US’ indicates the summary level of the data, i.e. what the geography is and where it falls in the census hierarchy. The 050 indicates this is a county. The part after the ‘US’ is the specific identifier for the geography, known as an ANSI / FIPS code: 28 is the state code for Mississippi, and 151 is the county code for Washington County, MS. You will need to use this code when joining this data to your shapefile, assuming that the shapefile has the same code. Will it?
That depends. There are two conventions for storing these codes; the full code 0500000US28151 can be used, or just the ANSI / FIPS portion, 28151. If your shapefile uses just the latter (find out by adding the shapefile in GIS and opening its attribute table), you won’t have anything to base the join on. The regular 2018 TIGER file uses just the ANSI / FIPS, but the 2018 CBF has both the full GEOID and the ANSI FIPS. So in this case we’re fine, but for the sake of argument if you needed to create the shorter code it’s easy to do using Excel’s RIGHT formula:
The formulas RIGHT, LEFT, and MID are used to return sub-strings of text
The formulas reads X characters from the right side of the value in the cell you reference and returns the result. You just have to count the number of characters up to the “S’ in the “US”. Copy and paste the formula all the way down the column. Then, select the entire column, right click and chose copy, select it again, right click and choose Paste Special and Values (in Excel, the little clipboard image with numbers on top of it). This overwrites all the formulas in the column with the actual result of the formula. You need to do this, as GIS can’t interpret your formulas. Put some labels in the two header spaces, like GEO_ID2 and id2.
Copy a column, and use Paste Special – Values on top of that column to overwrite formulas with values
Subsets and Headers
It’s common that you’ll download census tables that have more variables than you need for your intended purpose. In this example we’re interested in children (people under 18) living in households. We’re not going to use the other estimates for the population 18 to 64 and 65 and over. Delete all the columns you don’t need (if you ever needed them, you’ve got them saved in your csv as a backup).
Notice there are two header rows: one has a variable ID and the other has a label. In ACS tables the variables always come in pairs, where the first is the estimate and the second is the margin of error (MOE). For example, in Washington County, Mississippi there are 46,545 people living in households +/- 169. Columns are arranged and named to reflect how values nest: Estimate!!Total is the total number of people in households, Estimate!!Total!!Under 18 years is the number people under 18 living in households, which is a subset of the total estimate.
The rub here is that we’re not allowed to have two header rows when we join this table to our shapefile – we can only have one. We can’t keep the labels because they’re too long – once joined, the labels will be truncated to 10 characters and will be indistinguishable from each other. We’ll have to delete that row, leaving us with the cryptic variable IDs. We can choose to keep those IDs – remember we have a separate metadata csv file where we can look up the labels – or we can rename them. The latter is feasible if we don’t have too many. If you do rename them, you have to keep them short, no more than 10 characters or they’ll be truncated. You can’t use spaces (underscores are ok), any punctuation, and can’t begin variables names with a number. In this example I’m going to keep the variable IDs.
Two odd gotchas: first, find the District of Columbia in your worksheet and look at the MOE for total persons in households (variable 001M). There is a footnote for this value, five asterisks *****. Replace it with a zero. Keep an eye out for footnotes, as they wreak havoc. If you ever notice that a numeric column gets saved as text in GIS, it’s probably because there’s a footnote somewhere. Second, change the label for the county name from NAME to GEO_NAME (our shapefile already has a column called NAME, and it will cause problems if we have duplicates). If you save your workbook now, it’s ready to go if you want to map the data in it. But in this example we have some more work to do.
Create New ACS Values
We want to map the percentage of children that do not have access to either a computer or the internet at home. In this table these estimates are distinct for children with a computer and no internet (variable 006), and without a computer (variable 007). We’ll need to aggregate these two. For most thematic maps it doesn’t make sense to map whole counts or estimates; naturally places that have more people are going to have more computers. We need to normalize the data by calculating a percent total. We could do this work in the GIS package, but I think it’s easier to use the spreadsheet.
To calculate a new estimate for children with no internet access at home, we simply add the two values together (006_E and 007_E). To calculate a new margin of error, we take the square root of the sum of the squares for the MOEs that we’re combining (006_M and 007_M). We also use the ROUND formula so our result is a whole number. Pretty straightforward:
When summing ACS estimates, take the square root of the sum of the squares for each MOE to calculate a MOE for the new estimate.
To calculate a percent total, divide our new estimate by the number of people under 18 in households (002_E). The formula for calculating a MOE for a percent total is tougher: square the percent total and the MOE for the under 18 population (002_M), multiply them, subtract that result from the MOE for the under 18 population with no internet, take the square root of that result and divide it by the under 18 population (002_E):
The formula for calculating the MOE for a proportion includes: the percentage, MOE for the subset population (numerator), and the estimate and MOE for the total population (denominator)
In Washington County, MS there are 3,626 +/- 724 children that have no internet access at home. This represents 29.4% +/- 5.9% of all children in the county who live in a household. It’s always a good idea to check your math: visit the ACS Calculator at Cornell’s Program for Applied Demographics and punch in some values to insure that your spreadsheet formulas are correct.
You should scan the results for errors. In this example, there is just one division by zero error for Kalawao County in Hawaii. In this case, replace the formula with 0 for both percentage values. In some cases it’s also possible that the MOE proportion formula will fail for certain values. Not a problem in our example, but if it does the solution is to modify the formula for the failed cases to calculate a ratio instead. Replace the percentage in the formula with the ratio (the total population divided by the subset population) AND change the minus sign under the square root to a plus sign.
Some of these MOE’s look quite high relative to the estimate. If you’d like to quantify this, you can calculate a coefficient of variation for the estimate (not the percentage). This formula is straightforward: divide the MOE by 1.645, divide that result by the estimate, and multiply by 100:
A CV can be used to gauge the reliability of an estimate
Generally speaking, a CV value between 0-15 indicates that as estimate is highly reliable, 12-34 is of medium reliability, and 35 and above is low reliability.
That’s it!. Make sure to copy the columns that have the formulas we created, and do a paste-special values over top of them to replace the formulas with the actual values. Some of the CV values have errors because of division by zero. Select the CV column and do a find and replace, to find #DIV/0! and replace it with nothing. Then save and close the workbook.
In QGIS, we select the Data Source Manager button, and in the vector menu add the CBF shapefile. All census shapefiles are in the basic NAD83 system by default, which is not great for making a thematic map. Go to the Vector Menu – Data Management Tools – Reproject Layer. Hit the little globe beside Target CRS. In the search box type ‘US National’, select the US National Atlas Equal Area option in the results, and hit OK. Lastly, we press the little ellipses button beside the Reprojected box, Save to File, and save the file in a good spot. Hit Run to create the file.
In the layers menu, we remove the original counties file, then select the new one (listed as Reprojected), right click, Set CRS, Set Project CRS From Layer. That resets our window to match the map projection of this layer. Now we have a projected counties layer that looks better for a thematic map. If we right click the layer and open its attribute table, we can see that there are two columns we could use for joining: AFFGEOID is the full census code, and GEOID is the shorter ANSI / FIPS.
Hit the Data Source Manager button again, stay under the vector menu, and browse to add the Excel spreadsheet. If our workbook had multiple sheets we’d be prompted to choose which one. Close the menu and we’ll see the table in the layers panel. Open it up to insure it looks ok.
To do a join, select the counties layer, right click, and choose properties. Go to the Joins tab. Hit the green plus symbol at the bottom. Choose the spreadsheet as the join layer, GEO_ID as the join field in the spreadsheet, and AFFGEOID as the target field in the counties file. Go down and check Custom Field Name, and delete what’s in the box. Hit OK, and OK again in the Join properties. Open the attribute table for the shapefile, scroll over and we should see the fields from the spreadsheet at the end (if you don’t, check and verify that you chose the correct IDs in the join menu).
We’re ready to map. Right click the counties and go to the properties. Go to the Symbology tab and flip the dropdown from Single symbol to Graduated. This lets us choose a Column (percentage of children in households with no internet access) and create a thematic map. I’ve chosen Natural Breaks as the Mode and changed the colors to blues. You can artfully manipulate the legend to show the percentages as whole numbers by typing *100 in the Column box beside the column name, and adding a % at the end of the Legend format string. I also prefer to alter the default settings for boundary thickness: click the Change button beside Symbol, select Simple fill, and reduce the width of the boundaries from .26 to .06, and hit OK.
There we have a map! If you right click on the counties in the layers panel and check the Show Feature Count box, you’ll see how many counties fall in each category. Of course, to make a nice finished map with title, legend, and inset maps for AK, HI, and PR, you’d go into the Print Layout Manager. To incorporate information about uncertainty, you can add the county layer to your map a second time, and style it differently – maybe apply crosshatching for all counties that have a CV over 34. Don’t forget to save your project.
Percentage of Children in Households without Internet Access by County 2014-2018
How About that Web Map?
I used my free ArcGIS Online account to create the web map at the top of the page. I followed all the steps I outlined here, and at the end exported the shapefile that had my data table joined to it out as a new shapefile; in doing so the data became fused to the new shapefile. I uploaded the shapefile to ArcGIS online, chose a base map, and re-applied the styling and classification for the county layer. The free account includes a legend editor and expression builder that allowed me to show my percentages as fractions of 100 and to modify the text of the entries. The free account does not allow you to do joins, so you have to do this prep work in desktop GIS. ArcGIS Online is pretty easy to learn if you’re already familiar with GIS. For a brief run through check out the tutorial Ryan and I wrote as part of my lab’s tutorial series.
Addendum – Excel and CSVs
While csv files can be opened in Excel with one click, csv files are NOT Excel files. Excel interprets the csv data (plain text values separated by commas, with records separated by line breaks) and parses it into rows and columns for us. Excel also makes assumptions about whether values represents text or numbers. In the case of ID codes like GEOIDs or ZIP Codes, Excel guesses wrong and stores these codes as numbers. If the IDs have leading zeros, the zeros are dropped and the codes become incorrect. If they’re incorrect, when you join them to a shapefile the join will fail. Since data.census.gov uses the longer GEOID this doesn’t happen, as the letters ‘US’ are embedded in the code, which forces Excel to recognize it as text. But if you ever deal with files that use the shorter ANSI / FIPS you’ll run into trouble.
Instead of clicking on csvs to open them in Excel: launch Excel to a blank workbook, go to the data ribbon and choose import text files, select your csv file from your folder system, indicate that it’s a delimited text file, and select your ID column and specify that it’s text. This will import the csv and save it correctly in Excel.