library services

Brown University on OpenTopoMap

A New Year and a New Start

I have some news! After 13 1/2 years, January 31, 2021 will be my last day as the Geospatial Data Librarian at Baruch College, City University of New York (CUNY). On February 1, 2021, I will be the new GIS and Data Librarian at Brown University in Providence, Rhode Island!

It’s an exciting opportunity that I’m looking forward to. I will be building geospatial information and data services in the library from the ground up, in concert with many new colleagues. I will be working closely with the Population Studies Training Center (PSTC) and the Spatial Structures in Social Sciences (S4) as well as the Center for Digital Scholarship within the library. Several aspects of the position will be similar, as I will continue to provide research and consultation services, create research guides and tutorials, teach workshops, collect and create datasets, and eventually build and manage a data repository and small lab where we’ll provide services and peer mentor students.

The resources I’ve created at Baruch CUNY will remain accessible, and eventually a new person will take the reins. I have moved the latest materials for the GIS Practicum, my introductory QGIS tutorial and workshop, to this website and I hope to continue updating and maintaining this resource. There are a lot of people throughout CUNY that I’m going to miss, at: the Newman Library, the CUNY Institute for Demographic Research, the Weissman Center for International Business, the Marxe School, Baruch’s Journalism Department, the Geography Department at Lehman College, the Digital Humanities program and the CUNY Mapping Service at the CUNY Graduate Center, and many others.

I will continue writing posts and sharing tips and resources here based on my new adventures at Brown, but may need a little break as I transition… stay tuned!

Best – Frank

A-Train Classic

Neighborhood Research and the Census for Undergrads

Each semester I visit several undergraduate classes in public affairs and journalism, to introduce students to census data. They’re researching or reporting on particular issues and trends in neighborhoods in New York City, and they are looking for statistics to either support their work or generate ideas for a story. I usually showcase the NYC Population Factfinder as a starting point, mention the Census Reporter for areas outside the city , and provide background info on the decennial census, American Community Survey, and census geography and subjects. This year I included two new examples toward the beginning of the lecture to spark their interest.

I recently helped reporter Susannah Jacob navigate census data for an article she wrote on hyper-gentrification in the West Village for the New York Review of Books. A perfect example, as it’s what the students are expected to do for their assignment! Like any good journalist (and human geographer), Susannah pounded the pavement of the neighborhood, interviewing residents and small businesses and observing and documenting the urban landscape and how it was changing. But she also wanted to see what the data could tell her, and whether it would corroborate or refute what she was seeing and hearing.

NYRB Article on the West Village

Source: Jacob & Roye, New York Review of Books, Oct 2019.

We used the NYC Population Factfinder to assemble census tracts to approximate the neighborhood, and I did a little legwork to pull data from the County / ZIP Code Business Patterns so we could see how the business landscape was changing. The most surprising stat we discovered was that the number of 1-unit detached homes had doubled. This wouldn’t be odd in many rapidly growing places in the US, but it’s unusual for an old, built-out urban neighborhood. A 1-unit detached home is a free-standing single family structure that doesn’t share walls with other buildings. Most homes in Manhattan are either attached (row houses / town houses) or units in multi-unit buildings (apartments / condos / co-ops). How could this be? Uber-wealthy people are buying up adjoining row homes, knocking down the walls, and turning them into urban mansions. Seems extraordinary, but apparently is part of a trend.

We certainly ran up against the limitations of ACS data. The estimates for tracts have large margins of error, and when comparing two short time frames it’s difficult to detect actual change, as differences in estimates are clouded by sampling noise. Even after aggregating several tracts, many of the estimates for change weren’t reliable enough to report. When they were (as in the housing example) you could only say that there has been a relative increase without becoming wedded to a precise number. In this case, from 214 (+/- 127) detached units in 2006-2010 to 627 (+/-227) in 2013-2017, an increase of 386 (+/- 260). Not great estimates, but you can say it’s an increase as the low end for change is still positive at 126 units. Considering the time frame and character of the neighborhood, that’s still noteworthy (bearing in mind we’re working with a 90% confidence interval). In cases where the differences overlap and could represent either an increase or decrease there are few claims you can make, and it’s best to walk away (or look at larger area). I always discuss the margin of error with students and caution them about treating these numbers as counts.

While census data is invaluable for describing and studying individual places, it’s inherent geographic nature also allows us to study places in relation to each other, and to illustrate geographic patterns. For my second example, I zoom out and show them this map of racial-ethnic distribution in the United States:

Map of US Racial and Ethnic Diversity

Source: William H. Frey analysis of US Census population estimates, 2018.

This is one of a series of six maps by demographer William Frey at the Brookings Institute that highlights the geographic diversity of the United States. In this map, each county is shaded for a particular race / ethnicity if the population of that group in that county is greater than that group’s share of the national population. For example, Hispanics / Latinos represent 18.3% of the total US population, so counties where they represent more than this percentage are shaded.

For the purpose of the class, it helps make the census ‘pop’ and gets the students to think about the statistics as geospatial datasets that they can see and relate to, and that can form the basis for interesting research.

Some footnotes – if you like Frey’s maps, I highly recommend his book Diversity Explosion: How New Racial Demographics are Remaking America. It explores the evolving demographic and geographic landscape of the US with clear, accessible writing and more of these great maps (in color).

I used the pic at the top of this post as the background for my intro slide. It’s a screenshot of a city from A-Train, a 1992 city-building train simulator that was ported from Japan to the world by Artdink and Maxis, following the success of something called SimCity. It wasn’t nearly as successful, but I always liked the graphics which have now attained a retro-gaming vibe.

GIS consultations by status chart

Plotting Library GIS Services with Pandas

With the dawn of a new academic year I usually spend a little time looking back at the previous one. Since I began my position as Geospatial Data Librarian at Baruch College I’ve logged my questions, consultations, course visits, and workshops in a spreadsheet that I’ve used for creating summaries and charts. I spent a good chunk of this summer improving my Pandas skills, and put them to the test by summarizing and plotting my services data in a Jupyter Notebook instead.

Pandas is a data science module for Python that adds so many new components that it’s like a language all by itself. Its big selling point is that it adds a grid-like data structure to Python. In vanilla Python, you typically read data files into a list of lists where the big list represents the file, the individual lists represent rows, and the list elements represent values. There are no columns; to manipulate data you iterate through the sub-lists and elements by their position number. In well-structured datasets, elements in the same position in each sub-list represent attributes that would be stored in the same column.

In contrast, Pandas provides a true row and column structure called a dataframe, where you access each row by its index (a unique id) and columns by name or position. Furthermore, methods and functions that you apply to the data are automatically applied to entire rows and columns, and in some cases even to the entire dataframe, so that looping through data element by element is largely unnecessary. You’re able to treat a dataframe as if it were a spreadsheet or database table, in that you can concatenate dataframes together, merge them on their index numbers, and group records by values.

Using Pandas in concert with a Jupyter Notebook allows for an iterative approach to exploring and manipulating data, and is particularly conducive to creating plots and charts. You can use Python’s tried and true matplotlib module to build your chart bit by bit, or you can use Panda’s own plotting functions, which are wrappers around matplotlib that allow you to quickly create charts with fewer lines of code. Another plotting module called Seaborn offers a third approach.

This cheat sheet has become my indispensable reference for keeping track of the different Pandas functions and methods, and for helping me mentally navigate the different ways of doing things in Pandas versus regular Python. Plotting was a struggle at first, as I tried to figure out when to use Pandas versus matplotlib versus Seaborn. The fact that it’s possible to use all three at once to create the same plot added to my confusion! This visualization flowchart helped me sort things out. For simple stuff, I used the Pandas plot functions, but if the chart required additional customization I used matplotlib to generate the extra pieces, or the whole thing. In essence, use matplotlib for super detailed control over customization, and use Pandas plot functions as shortcuts for writing more concise code.


I’ve stored my notebook and the data file on github (still a work in progress) if you’d like to take a closer look (the notebook is the ipynb file). I’m going to address a portion of what’s in the notebook in this post.

First and foremost you need to import pandas and matplotlib’s pyplot. The %matplotlib inline trick tells the notebook to display all charts that you generate with matplotlib; otherwise it just creates them without displaying them. The lets you apply a global style (chart colors, background, grid lines etc) to all plots in your notebook. This convenient style sheet reference demonstrates what they all look like.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt'seaborn-muted')

Web Stats

I’ll start with the simplest example. My spreadsheet doesn’t contain web stats, so I needed to hard code these into the notebook. To create a dataframe you build it column by column, and add the index last. In a notebook you don’t need to use a print function to see the data, you simply enter the name of the object that you want to display:

geoportal=pd.DataFrame({'Page Views' : [29500, 37254, 40421, 33527],
'Unique Views' : [23052, 29285, 31668, 26418],
'Downloads' : [3561, 6807, 6682, 5208]},

Dataframe in Jupyter Notebook

The plot was pretty darn simple, using Panda’s DataFrame.plot you specify the type of chart (bar in this case), pass in a few arguments, and voila! Pandas automatically uses the index for the x axis (academic years in this case) and will attempt to plot all columns on the y axis. If this isn’t desirable you can set x and y in the arguments. The default legend placement isn’t ideal in this example, but we’ll see how to change it later. plt.savefig() saves the chart as an image file outside the notebook., title='Baruch Geoportal')

Baruch Geoportal Web Stats

Questions and Consultations

The rest of my data is stored in an Excel spreadsheet. You can quickly read spreadsheets into a dataframe by specifying the file and sheet, and the head() command previews the top records.

questions = pd.read_excel('RefLog.xlsx', sheet_name='Questions')

Dataframe of questions

I used the groupby method to summarize the number of questions by semester, indicating what column to use for grouping, and how to aggregate. In this example I use .size() which counts all records (another method called count is similar except it does not count null values). Since this result returns just a single column, Pandas returns a data type called a sequence, which is a single-column dataframe with an index (similar to a dictionary key-value pair in vanilla Python). If I want a new dataframe, I can explicitly feed in the columns, reset the index and set it to the year. You can plot data from either type.

#Summarize as a series

#Summarize as a dataframe

Dataframe of questions summarized

As before, the plot is pretty simple, but in this case when saving the figure I specify bounding box tight so the labels don’t get cut off (I rotated them 45 degrees for legibility)., legend=None, title="GIS Questions")

GIS questions chart

To create a stacked bar chart that shows the number of questions and the status of the person who asked them, I can create a new dataframe where I group by both year and status. One of the initial challenges in learning how to plot data is figuring out what structure is appropriate. After some experimentation, I figured out that each status needs to be as column in order to plot it. I used the following with the unstack method to pivot the data:

.groupby(by=['Year','Status']).size().unstack() questions_status2

Dataframe questions unstacked

questions_status2[['Student','Faculty','Staff','CUNY','Public']]\, rot=45, title="GIS Questions")

GIS questions by status chart

Explicitly stating the columns isn’t necessary, but it allows you to specify the order in which they appear in the chart. I have another worksheet that lists my consultations, that I read in, transform, and plot using the same statements:

GIS consultations by status chart

Questions represent emails or phone calls that I’ve received, while consultations are in- person, one-on-one sessions. Both the questions and consultations are specific to demographic, geospatial, or GIS-related topics. Students, faculty, and staff refer to people affiliated with my college (Baruch), while the CUNY category captures affiliates from all the other schools in the university regardless of their status. Public captures anyone outside the university.

The initial patterns are similar: the number of questions was low for my initial three years, and then began to take off in the 2010-11 academic year. This coincided with my movement out of the library’s Information Services Division and into the Graduate Services Division, where I was able to devote more time to providing my specialized services and less time providing general ones (i.e. the reference desk, visiting freshmen English composition classes). 2010-11 was also the year I introduced my day-long introductory GIS workshops which led to an increase in business, particularly from other CUNY campuses.

Another turning point was 2014-15 but the data diverges; the number of questions dips and hasn’t returned to to the peak I hit in 2013-14, while consultations remain consistently high. This is the year that I moved into the GIS Lab, and was able to provide better on-going in-person support. It was also the year I received tenure and promotion, which immediately resulted in a heavy increase in service commitments, i.e. serving on various college committees that took me away from my work (while I have graduate assistants that help with consultations, questions are sent directly to me). 2017-18 is a big divot on both charts as this was the year I was away on sabbatical to write my book (my grad assistant Janine held down the fort at the lab while I monitored questions from home), but there was a solid rebound in 2018-19.

Course Visits and Workshop Stats

I frequently visit public policy, journalism, and other courses to give lectures on census data and GIS, and for these charts I wanted to show the number of classes I visited and attendance on one chart. After loading my teaching data in, I excluded records that represented my GIS workshops by using the query method. Since I wanted to create two different aggregates – a count and a sum – I applied the .agg method after using groupby:

 classes_yr=classes[['Year','Class','Attendance']].groupby('Year').agg({'Class':'count', 'Attendance':'sum'})

Courses dataframe

As best as I could tell, the Pandas plot function couldn’t handle a line and bar on the same chart with a secondary Y axis, so I used matplotlib instead, building the chart one piece at a time:


ax = classes_yr['Attendance'].plot(secondary_y=True, marker='o', color='orange')
ax = classes_yr['Class'].plot(kind='bar', title='Course Visits', rot=45)


Course Visits chart

The courses I visit are consistently mid-sized with about 20 students a piece, so visits and attendance track pretty closely. The pattern is similar to my questions and consultations, initially low, rising as I gained independence, dropping once I hit tenure and service commitments, then gradually rising until the 2017-18 sabbatical year.

For the GIS workshops (stored in greater detail in a separate worksheet) I wanted to create two charts: a summary of attendance for each year by status, and another showing the schools that participants came from. Since attendance will vary by the number of workshops, I also wanted to incorporate the number of sessions into the first chart. After loading in the data:

Workshops dataframeand creating a grouped summary:

Dataframe workshops summary

I created an independent sequence for the labels using string methods:

Sequence lables

and I used matplotlib so I could set different tick labels and move the legend, as the default placement blocked portions of the bars:

ax=gis_yr[['Undergrad','Grad Stdt','Faculty','Staff','Other']]\
(stacked=True, rot=25, title="GIS Workshops")

ax.set_xlabel('Year (# Sessions)')
plt.legend(loc='upper center', bbox_to_anchor=(1, 1))


GIS workshops chart

For the workshops, the status includes all CUNY members regardless of school, while Other is anyone not affiliated with CUNY. Graduate students have always comprised the largest share of participants. Once again, there is the tenure dip in 2014-15 (fewer sessions) and no sessions during 2017-18 sabbatical. 2016-17 was an exceptional year as one of our sessions was held at the FOSS4G conference, so there are lots of participants from the Other category. The latest year was disappointing, as bad weather impacted attendance at two of the sessions.

I wanted to create a pie chart to show participation by CUNY school, but to make it aesthetically pleasing I needed to remove schools with few participants and add them to an Other CUNY category. Otherwise there would be tiny wedges with unreadable labels. After creating a subset of the workshops dataframe that summed values only for the school columns, I iterated through the schools to sum attendance to a variable, dropped those schools, and added the sum to the other category (see the notebook for details). I used the Pandas plot function to create the pie chart, and used the autopct argument to display percentages in the wedges. I also specified a figure size, which you can do for any chart (and becomes important when you decide to embed them in documents):


gis_schools.plot.pie(legend=False, figsize=(6,6), \
title='Workshop Participants by School \n ({} Participants in Total)'.format(gis_total), autopct='%i%%')

Pie chart showing workshop participation

One-third of participants were from my college, and one-fourth were from the Graduate Center, which is our nearest CUNY neighbor with a large population of master’s and PhD students who are keenly interested in learning GIS. The next biggest contributors are Hunter and Lehman Colleges, which are the two CUNY schools that have geography departments with GIS programs; Hunter is also close to Baruch, and we took a road trip to offer some sessions on Lehman’s campus.

Wrap Up

What I like about this approach is that you can summarize and reconfigure data without messing with the original source, and you can clearly see what your formulas are as they’re not hidden beneath the resulting values. These are both hazards when working directly within spreadsheets. While it takes time to learn these new functions and to grapple with finding work-arounds for exceptions, I don’t think it’s any less difficult than trying to accomplish the same things in a spreadsheet. I’ve always found spreadsheet charting to be rather clumsy, where you’re forced to cycle through numerous windows or to click on minuscule pieces of a chart to access hidden settings that you need.  The Pandas / notebook approach makes a lot of sense for iterative data exploration, summation, and visualization, although I’ll continue to rely on regular Python for projects that fall outside this specific domain.