crosswalk

Datasets from GeoData@SciLi: Libraries as Data Creators

While spending February buried under snow here in Providence, I took the opportunity to update several of the data products we create here at GeoData@SciLi. I’ll provide a summary of what we’re working on in this post. The heading for each project links to its GitHub repo, where you can access the datasets and the scripts we wrote for creating them.

My overall vision has always been that library data services should go beyond simply finding public data and purchasing data for students and faculty; we should actively engage in creating value-added products to meet the research and teaching needs of the university. With a dedication to open data, we also contribute to building a data infrastructure that benefits our local communities, and researchers around world. Creating our own projects keeps our technical skills sharp, gives us more in-depth knowledge about working with particular datasets, and exposes us to the practical processing problems our users face, which makes us better at understanding these issues and thus better able to serve them. To ensure that we can maintain and update our datasets, we automate and script as many of our processes as much as possible. The goal is not to build products, but to build processes to create products.

Ocean State Spatial Database

This is our signature product, a geodatabase of basic Rhode Island GIS and tabular data that folks can use as a foundation for building local projects. The idea is to save mappers the trouble of reinventing the wheel every time they want to do state-based research. I’ve honed this idea over a long period of time; as an graduate student at UW twenty years ago I was creating census databases for the Seattle metropolitan area that we published in WAGDA. I expanded this concept at CUNY, where we created and updated the NYC Geodatabase for many years, which included all forms of mass transit data (which wasn’t readily available at the time). For the Rhode Island version, I pivoted to include layers and attributes that would be of interest at a state-level, and was able to re-use many of the scripts and processes I built previously.

The Census TIGER files are the foundation, and we spent time creating suitably generalized base layers from them. Each layer or object is named with a prefix that categorizes and alphabetizes them in a logical order. “a” layers are areal features that represent land areas (counties, cities and towns, tracts, ZCTAs), “b” features are the actual legal boundaries for these areas (not generalized), “c” features are census data tables that can be joined to the a and b features, and “d” features consist of other points and lines (roads, water bodies, schools, hospitals, etc). The database is published in two formats: a Spatialite version for QGIS, and a file geodatabase for ArcGIS.

Ocean State Spatial Database Layers and Sample Map in QGIS — OSSDB Features and Sample Map in QGIS (Hospitals and the Percentage of Business Establishments that are Health Care Services by ZCTA)

Most of the features are fixed to the 2020 census and don’t change. There are two feature sets that we need to update every year. The first set are tables from the American Community Survey (ACS) and ZIP Code Business Pattern (ZBP). We’ve created tables that consist of a selection of variables that would be of broad interest to many users. We use python notebooks to download the data from the Census Bureau’s API. The ACS variable IDs and labels are stored in a spreadsheet that the script reads in, and checks it against the Census Bureau’s variable list for the demographic profile tables, to see if identifiers and labels have changed compared to the previous year. They often change, so the program flags these and we update the spreadsheet to pull the correct variables. We run the program for a specific geography, and the results are stored in a temporary database. For the ZBP data, we crosswalk and aggregate ZIP Codes to ZCTAs to create ZCTA-level data. I have separate scripts for quality control, where we check number of columns, count of rows, and any given variable to data from last year to see if there are any significant differences that could be errors, and another script for copying the data from the temporary database into the new one.

The other set of features we update are points representing schools, colleges and universities, hospitals, and public libraries. The libraries come from a federal source (IMLS PLS survey), while the others come from state sources (schools and colleges from an educational directory, and hospitals from a licensing directory). We use python to access RIDOT’s geocoding API (their parcel or point-based geocoder) to get coordinates for each feature. There’s a lot of exception handling, to deal with bad or non-matching addresses, some of which creep up every year. I store these in a JSON file; the program runs a preliminary check to see if these addresses have been corrected, and if they’re not the program uses the good address stored in the JSON. For quality control, the Detect Dataset Changes tool in the QGIS Processing toolbox allows us to see if features and attributes have changed, and we do extra work to verify the existence of records that have fallen in or out since last year.

Providence Geocoded Crime Incidents

A few years ago I had an excellent undergraduate fellow in the Data Sciences program who created a process for taking all of the police case logs from the Providence Open Data Portal and creating a GIS dataset out of them. We created this dataset for three reasons: the portal contains just the last 180 days and we wanted to create a historic archive, the records did not have coordinates, and the crimes were not standardized. Geocoding was the biggest challenge, as the location information was listed as one of the following: a street intersection, a block number, or a landmark. The script identifies the type of location, and then employs a different procedure for each. Intersections were easy, as we could pass these to the RIDOT geocoder (their street-interpolation geocoder). For block numbers, the program looks at a local file that contains all addresses in the state’s 911 database, which we filter down to just the City of Providence. It finds the matching street, gets the minimum and maximum address numbers within the given block, and computes the centroid between those addresses. For landmarks like Roger Williams Park or Providence Place Mall, we have a local list of major landmarks with coordinates that the program draws from. All non-matching addresses are written to a separate file, and you have the opportunity to add additional landmarks that didn’t match and rerun them. Crimes are matched to the FBI’s uniform categories for violent and non-violent crime, and there’s also an opportunity to update the list if new incident descriptions appear in the data.

We warn users that the matches are not exact, and this needs to be kept in mind when doing any analysis; for every incident we record the match type so users can assess quality. For all of our projects, we provide detailed documentation that explains exactly how the data was created. At this point we have half the data for 2023, and everything for 2024 and 2025. We run the program a few times each year, to ensure that we capture every incident before 180 days elapses.

Providence Census Geography Crosswalk

I wrote about this project when we released it last year; it is a set of relational tables for taking census data published at the tract, block group, and block level, and apportioning and aggregating it to local Providence geographies that include neighborhoods and wards (there’s also a crosswalk for ZCTAs to local geographies, but it’s rather useless as there is little correspondence). We also published a set of reference maps for showing correspondence or lack thereof between the census and local areas.

Map of Census Tracts and Neighborhoods in Providence — Census Tracts (black outlines) and Neighborhoods in Providence

The newest development is that one of my undergraduates used the crosswalk to generate 2020 census demographic profile summaries for neighborhoods and wards, so that users can simply download a pre-compiled set of data without having to do their own crosswalking. Population and household variables were apportioned using total population as a weight, while housing unit variables were apportioned using total housing units. He also generated percent totals for each variable, which required carefully scrutinizing what the proper numerators and denominators should be based on published census results. Python to the rescue again, he used a notebook that read the census tables in from the Ocean State Spatial Database, which saved us the trouble of using the census API. We publish the data tables in the same GitHub repository as the crosswalk.

UN ICSC Retail Price Indexes

I haven’t updated this one yet, but it’s next on the list. I wrote about this project a few years ago; this is a country-level index that documents variation in the cost of living at different UN duty stations. The UN publishes this data at different intervals throughout the year, in macro-driven Excel files that allow you to pull up data for one country at a time. The trick for this project was looping through hundreds of these files, finding the data hidden by the macro, and turning it into a single time series that includes unique identifiers for place, time, and good / service. This project was born from a research request from a PhD student, and we saw the value of building a process to keep it updated and to publish it for others to use. The scripting was done by the first undergraduate student worker I had at Brown, Ethan McIntosh. Thanks to him, I download the new data each year, run the program, and voila, new data!

Conclusion

I hope you found this summary useful, either because you can use these datasets, or you can learn something from one of our scripts and processes that you can apply to your own work. I hope that more academic libraries will embrace the concept of being data creators, and would incorporate this work into their data service models (along with formally contributing to existing initiatives like the Data Rescue Project or the OpenStreetMap). Feel free to reach out with comments and feedback.

Crosswalking Census Data to Neighborhood Geographies

Last semester we completed a project to create a crosswalk between census geographies and local geographies in Providence, RI. Crosswalks are used for relating two disparate sets of geography, so that you can compile data that’s published for one set of geography in another. Many cities have locally-defined jurisdictions like wards or community districts, as well as informally defined areas like neighborhoods. When you’re working with US Census data, you use small statistical areas that the Bureau defines and publishes data for; blocks, block groups, census tracts, and perhaps ZCTAs and PUMAs. A crosswalk allows you to apportion data that’s published for census areas, to create estimates for local areas (there are also crosswalks that are used for relating census geography that changes over time, such as the IPUMS crosswalks).

How the Crosswalk Works

For example, in the Providence Census Geography Crosswalk we have two crosswalks that allow you to take census tract data, and convert it to either neighborhoods or wards. I’ll refer to the neighborhoods in this post. In the crosswalk table, there is one record for each portion of a tract that overlaps a neighborhood. For each record, there are attribute columns that indicate the count and the percentage of a tract’s population, housing units, land area, and total area that fall within a given neighborhood. If a tract appears just once in the table, that means it is located entirely within one neighborhood. In the image below, we see that tract 1.01 appears in the table once, and its population percentage is 1. That means that it falls entirely within the Washington Park neighborhood, and 100% of its population is in that neighborhood. In contrast, tract 1.02 appears in the table twice, which means it’s split between two neighborhoods. Its pct_pop column indicates that 31.5% of its population is in South Elmwood, while 68.5% is in Washington Park. The population count represents the number of people from that tract that are in that neighborhood.

Looking at the map below, we can see that census tract 1.01 falls entirely within Washington Park, and tract 1.02 is split between Washington Park and South Elmwood. To generate estimates for Washington Park, we would sum data for tract 1.01 and the portion of tract 1.02 that falls within it. Estimates for South Elmwood would be based solely on the portion of tract 1.02 that falls within it. With the crosswalk, “portion” can be defined as the percentage of the tract’s population, housing units, land area, or total area that falls within a neighborhood.

The primary purpose of the crosswalk is to generate census data estimates for neighborhoods. You apportion tract data to neighborhoods using an allocation factor (population, housing units, or area) and aggregate the result. For example, if we have a census tract table from the 2020 census with the population that’s 65 years and older, we can use the crosswalk to generate neighborhood-level estimates of the 65+ population. To do that, we would:

Join the data table to the crosswalk using the tract’s unique ID; the crosswalk has both the long and short form of the GEOIDs used by the Census Bureau. So for each crosswalk record, we would associate the 65+ population for the entire tract with it.
Multiply the 65+ population by one of the allocation columns – the percent population in this example. This would give us an estimate of the 65+ population that live in that tract / neighborhood piece.
Group or aggregate this product by the neighborhood name, to obtain a neighborhood-level table of the 65+ population.
Round decimals to whole numbers.

To do the calculations in a spreadsheet, you would import the appropriate crosswalk sheet into the workbook that contains the census data that you want to apportion, so that they appear as separate sheets in the same workbook. In the crosswalk worksheet, use the VLOOKUP formula and reference the GEOID to “join” the census tract data to the crosswalk. The formula requires: cell containing the ID value you wish to look up, the range of cells in a worksheet that you will search through, the number of the column that contains the value you wish to retrieve (column A is 1, Z is 26, etc.), and the parameter “FALSE” to get an exact match. It is assumed that the look up value in the target table (the matching ID) appears in the first column (A).

The tract data is now repeated for each tract / neighborhood segment. Next, use formulas to multiply the allocation percentage (pct_pop in this example) by the census data value (over 65 pop for the entire tract) to create an allocated estimate for each tract / neighborhood piece.

Then you can generate a pivot table (on the Insert ribbon in Excel) where you group and sum that allocated result by neighborhood (neighborhoods as rows, census data as summed values in columns). Final step is to round the estimates.

This process is okay for small projects where you have a few estimates you want to quickly tabulate, but it doesn’t scale well. I’d use a relational database instead; import the crosswalk and census data table into SQLite, where you can easily do a left join, calculated field, and then a group by statement. Or, use the joining / calculating / aggregating equivalents in Python or R.

I used the percentage of population as the allocation factor in this example. If the census data you’re apportioning pertains to housing units, you could use the housing units percentage instead. In any case, there is an implicit assumption that the data you are apportioning has the same distribution as the allocation factor. In reality this may not be true; the distribution of children, seniors, homeowners, people in poverty etc. may vary from the total population’s distribution. It’s important to bear in mind that you’re creating an estimate. If you are apportioning American Community Survey data this process gets more complicated, as the ACS statistics are fuzzy estimates. You’d also need to apportion the margin of error (MOE) and create a new MOE for the neighborhood-level estimates.

The Providence crosswalk has some additional sheets that allow you to go from tracts, ZCTAs, or blocks to neighborhoods or wards. The tract crosswalk is by far the most useful. The ZCTA crosswalk was an exercise in futility; I created it to demonstrate the complete lack of correlation between ZCTAs and the other geographies, and recommend against using it (we also produced a series of maps to visually demonstrate the relationship between all the geographies). There is a limited amount of data published at the block level, but I included it in the crosswalk for another reason…

Creating the Crosswalk

I used census blocks to create the crosswalk. They are the smallest unit of census geography, and nest within all other census geographies. I used GIS to assign each block to a neighborhood or ward based on the geography the block fell within, and then aggregated the blocks into distinct tract / ward and tract / neighborhood combinations. Then I calculated the allocation factors, the percentage of the tract’s total attributes that fell in a particular neighborhood or ward. This operation was straightforward for the wards; the city constructed them using 2020 census blocks, so the blocks nested or fit perfectly within the wards.

The neighborhoods were more complicated, as these were older boundaries that didn’t correspond to the 2020 blocks, and there were many instances where blocks were split between neighborhoods. My approach was to create a new set of neighborhood boundaries based on the 2020 blocks, and then use those new boundaries to create the crosswalk. I began with a spatial join, assigning each block a neighborhood ID based on where the center of the block fell. Then, I manually inspected the borders between each neighborhood, to determine whether I should manually re-assign a block. In almost all instances, blocks I reassigned were unpopulated and consisted of slivers that contained large highways, or blocks of greenspace or water. I struck a balance between remaining as faithful to the original boundaries as possible, while avoiding the separation of unpopulated blocks from a tract IF the rest of the blocks in that tract fell entirely within one neighborhood. In two cases where I had to assign a populated block, I used satellite imagery to determine that the population of the block lived entirely on one side of a neighborhood boundary, and made the assignment accordingly.

In the example below, 2020 tract boundaries are shown in red, 2020 block boundaries are light grey, original neighborhood boundaries are shown with dotted black lines, and reconstituted neighborhoods using 2020 blocks are shown in different colors. The boundaries of Federal Hill and the West End are shifted west, to incorporate thin unpopulated blocks that contain expressways. These empty blocks are part of tracts (10 and 13) that fall entirely within these neighborhoods; so splitting them off to adjacent Olneyville and Silver Lake didn’t make sense (as there would be no population or homes to apportion). Reassigning them doesn’t change the fact that the true boundary between these neighborhoods is still the expressway. We also see an example between Olneyville and Silver Lake where the old neighborhood boundary was just poorly aligned, and in this case blocks are assigned based on where the center of the block fell.

Creating the crosswalk from the ground up with blocks was the best approach for accounting how population is distributed within larger areas. It was primarily an aggregation-based approach, where I could sum blocks that fell within geographies. This approach allowed me to generate allocation factors for population and housing units, since this data was published with the blocks and could be carried along.

Conversely, in GIS 101 you would learn how to calculate the percentage of an area that falls within another area. You could use that approach to create a tract-level crosswalk based on area, i.e. if a tract’s area is split 50/50 between two neighborhoods, we’ll apportion its population 50/50. While this top down approach is simpler to implement, it’s far less ideal because you often can’t assume that population and area are equally distributed. Reconsider the example we began with: 31.5% of tract 1.02’s population is in South Elmwood, while 68.5% is in Washington Park. In contrast, 75.3% of tract 1.02’s land area is in South Elmwood, versus only 24.7% in Washington Park! If we apportioned our census data by area instead of population, we’d get a dramatically different, and less accurate, result. Roger Williams Park is primarily located in the portion of tract 1.02 that falls within Elmwood; it covers a lot of land but includes zero people.

Why can’t we just simply aggregate block-level census data to neighborhoods and skip the whole apportionment step? The answer is that there isn’t much data published at the block level. There’s a small set of tables that capture basic demographic variables as part of the decennial census, and that’s it. There was a sharp reduction in the number of block-level tables in the 2020 census due to new privacy regulations, and the ACS isn’t published at the block-level at all. While you can use the block-level table in the crosswalk to join and aggregate block data, in most cases you’ll need to work with tract-data and apportion it.

I used spatial SQL to create the crosswalks in Spatialite and QGIS , and if you’re interested in seeing all the gory details you can look at the code and spatial database in source folder of the project’s GitHub repo. I always prefer SQL for spatial join and aggregation operations, as I can write a single block of code instead of running 4 or 5 different of desktop GIS tools in a sequence. I’ll be updating the project this semester to include additional geographies (block groups – the level between blocks and tracts), and perhaps an introductory tutorial for using it (there are some basic docs at present).

Census Time Series Tables from NHGIS

I’m often asked about what the best approaches are for comparing US census data over time, to account for changes in census geography and to limit the amount of data processing you have to do in stitching data from different census years together. Census geography changes significantly each decade, and by and large the Census Bureau does not compile and publish historical comparison tables.

My primary suggestion is to use the National Historical Geographic Information System or NHGIS (I’ll mention some additional suggestions at the end of this post). Maintained by IPUMS at the University of Minnesota, NHGIS is the repository for all historic US census summary data from 1790 to present. While most of the data in the archive is published nominally (the format and structure in which the data was originally published), they do publish a set of Time Series Tables that compile multiple years of census data in one table. These tables come in two formats:

Nominal tables: the data is published “as is”, based on the boundaries that existed at each point in time. If a geography was added or dropped over the course of the years, it falls in or out of the table in the given year that the change occurred. With a few exceptions, the earliest nominal tables begin with the 1970 census and are published for eight geographies: nation, regions, divisions, states, counties, census tracts, county subdivisions, and places.
Standardized tables: the data has been normalized, where a geography for a single time period serves as the basis for all data in the table. The NHGIS is currently using 2010 as the basis, so that data prior and subsequent to 2010 has been modified to fit within the 2010 boundaries. This is achieved by aggregating block or block group data from each period to fit within the 2010 boundaries, and apportioning the data in cases where a block or group is split by a boundary. The earliest standardized tables begin with the 1990 census, and cover the basic 100% count data. Data is published for ten geographies: states, counties, census tracts, block groups, county subdivisions, places, congressional districts (as defined for the 110th-112th Congresses, 2007-2013), core based statistical areas (using 2009 metro area definitions), urban areas, and ZIP Code Tabulation Areas (ZCTAs).

Included in the documentation is a full list of time series tables, and whether they are available in nominal or standardized format. The availability of specific time periods and geographies varies. As of late 2024, the availability of standardized tables that include the 2020 census is currently limited to what was published in the early Public Redistricting Files. This will likely change in the near future to include additional 2020 data, and it’s possible that the standardized geography will eventually switch from 2010 to 2020 geography.

To access the Time Series Tables, you can browse the NHGIS without an account but you’ll need to create one in order to download anything. Once you launch NHGIS click on the Topics filter. In the list of topics, any topic under the Population or Housing category that has a “TS” flag next to it has at least one time series table. In the example below, I’ve used the filters to select census tracts for Geographic Level, 2010 and 2020 for Years, and Housing – Occupancy and Vacancy status as my Topic.

NHGIS Time Series flags in topics filter

In the results at the bottom, the original Source Tables from each census are shown in the first tab. The Time Series Tables can be viewed by selecting the adjacent tab. The first two tables in this example are Housing Units by Occupancy Status. Clicking on the name of the tables reveals the variables that are included, and the source for the statistics. The first table is a nominal one that stretches from 1970 to the most recent ACS. The second table is a standardized one that covers 1990 to 2020. I’ve checked both boxes to add these to my cart.

The third tab in the results are GIS Files. If we want to map standardized data, we would choose just the boundaries for the standardized year, as all of the data in the table has been modified to fit these boundaries. If we were mapping nominal data, we would need to download boundary files for each time period and map them separately (unless they were stable geographies like states that haven’t changed since 1970).

We hit the Continue button in the Cart panel when we’re ready to download. By default the extract will only include years and geographies we have filtered for. To add additional years or geos we can add them on this next screen. I’ve modified my list to get all available decennial years for each table. Note that if you’re going to select 5-year ACS data for nominal tables, choose only a few non-overlapping periods. In most cases you can’t filter geographies (i.e. select tracts within a state), you have to take them all. On the final screen you choose your structure; CSV is usually best, as is Time varies by column. Once you submit your request you’ll be prompted to log in if you haven’t already done so. Wait a bit for the extract to compile, then you can download the table and codebook.

A portion of the nominal table is depicted below. This table includes identifiers and labels for each of the census years. The variables follow, ordered by variable and then by year. In this example, occupied housing units from 1970 to 2020 appear in the first block, and vacant units in the second. All the 1970 census tract values for Autauga County, Alabama are blank (as many rural counties in 1970 were un-tracted). We can see that values for census tract 205 run only from 1980 to 2010, with no value for 2020. The tract was split into three parts in 2020, and we see values for tracts 205.01, .02, and .03 appear in 2020. So in the nominal tables, geographies appear and disappear as they are created or destroyed. However, if geographic boundaries change but the name and designation for the geography do not, that geography persists throughout the time series in spite of the change.

A portion of the standardized table is below. This table only includes identifiers and labels for the 2010 census, as all data was modified to fit the tract geography of that year. The values for each census year except 2010 are published in triplicate: an estimate, and a lower and upper bound for the estimate. If the values in these three columns differ, it indicates that a block (or block group) was split and reapportioned to fit within the tract boundary for 2010 (you may also see decimals, indicating a split occurred). You’d use the estimate in your work, while the bounds provide some indication of the estimate’s accuracy. Note in this table, tract 205 in Autauga County persists from 1990 to 2020, as it existed in 2010. Data from the three 2020 tracts was aggregated to fit the 2010 boundary.

The crosswalk tables that IPUMS used to create the standardized data are available, if you wanted or needed to generate your own normalized data. The best approach is to proceed from the bottom up, aggregating blocks to reformulate the data to the geography you wish to use. Some decennial census data, and all data from the ACS, is not available at the block level, which necessitates using block groups instead.

There are some alternatives for obtaining or creating time series census data, which could fit the bill depending on your use case (esp if you are looking at larger geographies). There’s also reference material that can help you make sense of changes.

The Longitudinal Tract Database at Brown University provides tract-level crosswalks from 1970 to 2020. They also provide some pre-compiled data tables generated from the crosswalk.
For short term comparisons, the ACS includes Comparison Profile Tables for states, counties, places, and metro areas that compare two non-overlapping time periods. For example, here is the 5-year ACS Comparative Demographic Estimates profile for Providence RI in 2022 (compares 2018-2022 with 2013-2017).
Use an interactive mapping tool like the Social Explorer to make side by side comparison maps from two time periods. They also incorporate some of the NHGIS standardized data into their database. (SE is a subscription-based product; if you’re at a university see if your library subscribes).
The Population and Housing Unit Estimates program publishes annual estimates for states, counties, and metro areas in decade by decade spreadsheets. The MCDC has created some easy to use tools for summarizing and charting this data to show annual population change and changes in demographic characteristics.
I had previously written about pulling population and economic time series tables for states, counties, and metro areas from the Bureau of Economic Analysis data portal.
Counties change more often than you think. The Census Bureau has a running list of changes to counties from 1970 to present. Metro areas change frequently too, but since they are built from counties you can aggregate older county data to fit modern metro boundaries. The census provides delineation files that assign counties to metros.

Geospatial Metadata with Dublin Core and GeoBlacklight Standards

NOTE – in mid-2021 the GeoBlacklight metadata standard was refined and renamed as the OpenGeoMetadata Aardvark Metadata Schema. The documentation for the standard was moved from the GB to the OGM GitHub repo.

During the COVID-19 lock down this past spring, I took the opportunity to tackle a project that was simmering on my back burner for quite a while: creating a coherent metadata standard for describing all of the GIS data we create in the lab. For a number of years we had been creating ISO 19115 / 19139 metadata records in XML, which is the official standard. But the process for creating them was cumbersome and I felt it provided little benefit. We don’t have a sophisticated repository where the records are part of a search and retrieval system, and that format wasn’t great for descriptive documentation either, i.e. people weren’t reading or referring to the metadata to learn about the datasets.

We were already creating basic Dublin Core (DC) records for our spatial databases and non-spatial data layers, so I decided to create some in-house guidelines and build processes and scripts for creating all of our metadata in DC, while adhering to guidelines and best practices established by the GeoBlacklight community. I’ll describe my process in this post, and hopefully it should be useful for others who are looking to do similar work. I’ll mention some resources for learning more about geospatial metadata in the conclusion.

All of the metadata documentation and scripts I’ve created are available on GitHub. I’ll add a few snippets of code here and there to illustrate,

Out With the Old

FGDC and ISO 19115 / 19139 have long been the standards for creating geospatial metadata and expressing it in XML. The ISO standard has eclipsed the FGDC standard, but some institutions continue to use FGDC for legacy reasons. Since ISO was the standard, I decided that we should use it for all the layers we create in the lab. So why am I abandoning it?

The ISO standard is enormous and complex. Even though we were using just the required elements in the North American Profile, there were still over 100 fields that we had to enter. Many of them seemed silly, like having to input a name and mailing address for our organization for every instance where we were a publisher, distributor, creator, etc. We were creating the records in the ArcCatalog, which only permits you to create and edit records using the ArcGIS metadata standard. You have to export the results out to ISO, but if you have changes to make you need to edit the original ArcGIS metadata, which forces you to keep two copies of every record. The metadata can be easily viewed in ArcCatalog, but not in other software. Creating your own stylesheet for displaying ISO records is a real pain, because there are dozens of namespaces and lots of nesting in the XML. So we ended up printing the styled records from ArcCatalog as PDFs (ugh), to include alongside the XML. Processing and validating the records to our own standards was equally painful.

In short, metadata creation became a real bottleneck in our data creation process, and the records didn’t serve a useful purpose for either search or description. Each record was about six pages long, and at least two-thirds of the information wasn’t relevant for people who were trying to understand our data.

In with the New

Dublin Core (DC) is a basic, open-ended metadata standard that was created in the mid 1990s, and has been adopted by numerous libraries and archives for documenting collections. There is a set of core elements and an additional set of terms that refine these elements. For example, there is a core Coverage element intended for describing time and place, but then there are more specific DC terms for Temporal and Spatial description. DC hasn’t been widely used for geospatial metadata because it’s not able to express all of the details that are important to those datasets, such as coordinate reference systems, resolution, and scale.

At least that’s what I thought. GeoBlacklight is an open source search and discovery platform for GIS data which relies on metadata records to power the platform. It was developed at Stanford and is used by several universities such as NYU and Cornell. A working group behind the project created and adopted a GeoBlacklight Metadata Schema (GB) which uses a mix of DC elements and GB-specific elements that are necessary for that platform. GB metadata is stored in a JSON format.

The primary aim of the standard is to provide records that aid searching, and there is less of an emphasis on fully describing the datasets for the purpose of providing documentation. But for my purposes, it was fine. I could adopt a subset of the DC elements and terms that I felt were necessary for describing our resources, while keeping an eye on the GB standard and following it wherever I could. If there was a conflict between what we wanted to do and the GB standard, I made sure that I could crosswalk one to the other. I would express our records in XML, and would write a script to transform them to the GB standard in a JSON format. That way, I can share our data and metadata with NYU, which has a GB-backed spatial data repository and re-hosts several of our datasets.

Data Application Profile

The first step was to establish our in-house guidelines. I created a data application profile to specify the DC elements we would use, which are required versus optional, and whether they can repeat or not. For example, we require a Title and Creator element, where there can be only one title but multiple creators (as more than one person may have created the resource). We also use the Contributor element to reference people who had some role in creating the dataset, but not as significant as the creator. This element is optional, as there might not be a contributor.

For some of the elements we require specific formatting, while for others we specify a set of controlled terms that must be used. For the Title, we follow Open Geoportal best practices where the title is written as: Layer Name, Place, Date (OGP is a separate but related geospatial data discovery platform and community). Publication dates are written as YYYY-MM. For the Subject element, we use the ISO 19115 categories that are commonly used for describing geospatial data. These are useful for filtering records by topic, and are much easier to consistently implement than Library of Congress Subject Headings. In some cases I adopted elements and vocabulary that were required by GB, such as the Provenance field which is used by GB to record the name of the institution that holds the dataset.

Snippet of an XML metadata record:

<title>Real Estate Sales, New York NY, 2019</title>
<creator>Frank Donnelly</creator>
<creator>Anastasia Clark</creator>
<subject>economy</subject>
<subject>planningCadastre</subject>
<subject>society</subject>
<subject>structure</subject>

In other cases I went contrary to the GB standard, but kept an eye on how to satisfy it so when records are crosswalked from our DC standard in XML to the GB standard in JSON, the GB records will contain all the required information. I used the DC Medium element to specify how the GIS data was represented (point, line, polygon, raster, etc), which is required as a non-DC element in GB records. I used the Source element to express the original data sources we used for creating our data, which is something that’s important to us. GB doesn’t use the Source element in this way – so when our records are crosswalked this information will be dropped. I used Coverage for expressing place names from Geonames, and Spatial to record the bounding box for the features using the DC Box format. I also used The Spatial field as a way of expressing the coordinate reference system for the layer: the coordinates and system name published in the record are for the system the layer uses. When this information gets crosswalked to a GB record, the data in Coverage will be transferred to Spatial (as GB uses this DC field for designating place names), and the Spatial bounding box data gets migrated and transformed to a GB-specific envelope element (in GB, the bounding box is used for spatial searching and is expressed in WGS 84).

Creating Records and Expressing them in XML

I created a blank XML template that contains all our required and optional elements, with some sample values to indicate how the elements should be populated. If we are creating new records from scratch, we can copy the template and fill it in using a plain text editor. Alternatively we can also use the Dublin Core Metadata Generator, but in using that we’d have to be careful to follow our DAP, selecting just our required elements and omitting ones that we don’t use. All of our elements are expressed under a single root element called “metadata”. While it is common to prefix DC elements with a namespace, we do NOT include these namespaces because it makes validating the records a pain… more about that in the next section.

In most cases, we won’t be creating records from scratch but will be modifying existing ones, and only a handful of the elements would need to be updated in any given year. To this end, I wrote two scripts that make copies of specific records, one for each year or time period (for one feature that repeats over many years – one file to many), and one for each file name (for a series of different layers produced for one time period – one file to one file). That way, we can edit the copies and change just the fields that need updating.

The process of updating the records is going to vary with each series and with each particular iteration. I wrote some Python functions that can be used repeatedly, and import these into scripts that are custom designed for each dataset. I use the ElementTree module to parse the XML and update specific fields. For elements that require updated values, I either hard code them into the script or read them from a JSON file that I construct. For example, to update our real estate sales layers I took the file for the most recent year and made copies of it, named for each year in the past. Then I created a JSON file that’s read into a Python dictionary, and using the year in the file name as the key, elements for: year issued, year published, creator, and contributor in the XML are modified using the dictionary values. In some cases the substring of an element must be changed, such as the year that appears at the end of the Title field. In other cases an entire value can simply be swapped out.

A portion of a json file is below, where year is the key and value is a dictionary that contains element name as key, and value is element text / value to be updated:

{
"2018":{"issued":"2019-05","creator":["Anastasia Clark","Frank Donnelly"],"contributor":["Chris Kim"]},
"2017":{"issued":"2018-10","creator":["Anastasia Clark","Frank Donnelly"],"contributor":["Abigail Andrews"]},
"2016":{"issued":"2017-05","creator":["Anastasia Clark"],"contributor":["Frank Donnelly","Janine Billadello"]},
}

Several files called “real_property_sales_nyc_YEAR.xml” are read in. These files are all copies of data for the most recent year (2019), but YEAR in the file name is substituted with each year of data that we have (2018, 2017, etc). The script reads in the year from the file name, and updates the appropriate elements using that year as the key (in the actual code the functions are imported from a separate file so they can be used in multiple scripts). With ElementTree: parse the xml file to save it as a tree object, then getroot to get the root element of the tree, which allows you to find specific elements (like the title element) and get the text value of that element (the actual title), so you can modify the value:

import xml.etree.ElementTree as ET

def esimple(root,elem,value):
    """Replace the value of an element"""
    e=root.find(elem)
    e.text=value

def esubstring(root,elem,oldstring,newstring):
    """Replace a portion (substring) of the value of an element"""
    e=root.find(elem)
    e.text=e.text.replace(oldstring,newstring)

with open(json_update_file) as jfile:
    updates = json.load(jfile)

for file in os.listdir(infolder):
    if file[-4:]=='.xml':
        xfilein=os.path.join(infolder,file)
        year=file[-8:-4] #assumes year is at end of file name
        tree = ET.parse(xfilein)
        root = tree.getroot()
        esimple(root,'temporal',year)
        esimple(root,'issued',updates[year]['issued'])
        esubstring(root,'title','2019',year)
...

Validating Records

Since I’m already operating within Python, I decided to use a 3rd party Python module called xmlschema to validate our records. It’s simple and works quite nicely, but first we need to have a schema. An XML schema or XSD file is an XML file that contains instructions on how our XML metadata should be structured: all the elements we use, whether they are required or not, whether they can repeat or not, the order in which they must appear, and whether there are controlled vocabularies. I hard-coded shorter vocabs like the ISO categories and many of the GB requirements into the schema, but I didn’t include absolutely everything. Since we’re a small shop that makes a limited amount of metadata, we’ll still be checking the records manually.

The Python validator script checks for well formed-ness first (i.e. every open tag has a closing tag), and then proceeds to check the record against the schema file. If it hits something that’s invalid, it stops. I wrote the script as a batch process, so if one file fails it moves on to the next one. It produces an error report, and we make the corrections and go back and try validating again until everything passes.

In the snippet below, we read in the schema file and a folder of records to be validated. myschema.is_valid(file) tests whether the schema is valid or not. If it’s valid (True) I perform some other check (not shown), and if that turns out OK then we have no problems. Otherwise if my check fails, or the schema returns as not valid (False), record the problem and move on to the next file. Problems are printed to screen, and other status messages are output (not shown):

import xmlschema as xmls
import os
schemafile=os.path.join('bcgis_dc_schema.xsd')
xmlfolder=os.path.join('projects','nyc_real_estate','newfiles')
problems={}
noproblems=[]
my_schema = xmls.XMLSchema(schemafile)

for file in os.listdir(xmlfolder):
    if file[-4:]=='.xml':
        filepath=os.path.join(xmlfolder,file)
        my_xml = xmls.XMLResource(filepath)
        test=my_schema.is_valid(my_xml)
        if test==True:
            msg=spatial_check(my_xml)
            if msg==None:
                noproblems.append(file)
            else:
                problems[file]=msg
        else:
            try:
                my_schema.validate(my_xml)
            except xmls.XMLSchemaException as e:
                problems[file]=e
                continue
if len(problems)&amp;amp;gt;0:
    print('***VALIDATION ERRORS***\n')
    for k, v in problems.items():
        print(k,v)
else:
    pass

Creating the schema was an area where I ran into trouble, because I initially tried to refer to the DC namespaces in my records and kept all of the namespace prefixes with the elements, i.e. dc:title instead of title. This seemed like the “right” thing to do. It turns out that schema creation and validation for XML records with multiple namespaces is a royal pain; you have to create multiple schemas for each namespace, which in this case would mean three: DC elements, DC terms, and my own root metadata element. Then I’d have to write all kinds of stuff to over-ride the default DC schema files, because they were designed to be flexible: the only requirement is that just the listed DC elements and terms can appear. My requirements are much more stringent, as I wanted to use a subset of the DC elements and impose restrictions to insure consistency. I watched countless YouTube videos on XML schemas and read several tutorials that didn’t address this complexity, and what I did find suggested that it was going to be arduous.

Ultimately I decided – why bother? I jettisoned the namespaces from our records, and have yet another script for stripping them out of records (for older records where we had included namespaces). As long as I have a solid DAP, am following best practices consistently, am using vocabularies that others are using, and we validate and check our records, others should be able to interpret them and either ingest or transform them if they need to.

Styling the Records

While XML is fine for search and data exchange, it’s not friendly for humans to read, so it’s a good idea to apply a stylesheet to render the XML in a readable form in a web browser. XLST stylesheets give you lots of control, as you can actually process and transform the underlying XML to display what you want. But to my dismay, this was also needlessly complicated. Searching around, I discovered that many web browsers had deprecated the display of XML via XLST stylesheets stored locally for security reasons, which makes development cumbersome. I also got the sense that this is an atrophying technology; most of the references I found to XSLT were 10 to 20 years old.

So, I took a simpler route and created a CSS stylesheet, which can style XML essentially the same way that it styles HTML. You must display all of the underlying data in the XML file and your control is limited, but there are enough tweaks where you can insert labels, and vary the display for single items versus lists of items. I keep the stylesheet stored on our website, and as part of the metadata creation process a link to it is embedded in each of our records. So when you open the XML file in a browser, it looks to the online CSS file and renders it. The XML snippet at the beginning of this post looks like the image below, when a user clicks on the file to view it:

XML Metadata Record with CSS Styling

You can access this actual metadata record on our webserver via the Baruch Geoportal.

Crosswalk DC XML TO GB JSON

Migrating the DC XML records to GB JSON was actually straightforward, using Python and a combination of the ElementTree and JSON modules. I walked each of our elements over to its corresponding GB element, and made modifications when needed using the ElementTree functions and methods mentioned previously; I find each element in our XML and save its value in a dictionary using the GB element name as the key, and at the end I write the items out to JSON. In some cases elements from our records are dropped if there isn’t a suitable counterpart. The bounding box element was trickiest, as I had to convert the northing and easting coordinate format used by DC box into a standard OGC Envelope structure, and transform the CRS from whatever the layer uses to WGS 84. I used the pyproj module to do both. Some of the elements are left blank, because I’d have to coordinate with my colleagues at NYU for filling those values, like the unique id or “slug” that’s native to their repository.

Conclusion

You should always create metadata and rely on existing standards and vocabulary to the extent possible, but create something that works for you given your circumstances. Figure out whether your metadata serves a discovery function or a documentation function, or some combination of both. Create a clear set of rules that you can implement that will provide useful information to your data users, without becoming a serious impediment to your data creation process. Creating metadata does take time and you need to budget for it accordingly, but don’t waste time (like I was) dealing with clunky metadata creation software and lots of information that will largely go unused, simply because it’s “the right thing to do”. Ideally, whatever you create should be consistent and flexible enough that you can write some code to transfer it to a different standard or format if need be.

For more information about GeoBlacklight metadata, take a look at this Code4Lib article about the GB metadata schema written by the developers a few years back, but consult the current standard (which has changed a bit since the article was written). Another Code4Lib article reviews tools for working with geospatial metadata that include GUI-solutions, XSLT, and Python with ElementTree. My colleague Andrew Battista at NYU has written a post about authoring GB metadata; they use Omeka as a user-friendly front end for creating their records.

The Trouble with ZIP Codes: Solutions for Data Analysis and Mapping

Since the COVID-19 pandemic began, I’ve received several questions about finding census data and boundary files for ZIP Codes (aka US postal codes), as many states are publishing ZIP Code-level data for cases and deaths. ZIP Codes are commonly used for summarizing address data, as it’s easy to do and most Americans are familiar with them. However, there are a number of challenges associated with using ZIP Codes as a unit of analysis that most people are unaware of (until they start using them). In this post I’ll summarize these challenges and provide some solutions.

The short story is: you can get boundary files and census data from the decennial census and 5-year American Community Survey (ACS) for ZIP Code Tabulation Areas (ZCTAs, pronounced zicktas) which are approximations of ZIP Codes that have delivery areas. Use any census data provider to get ZCTA data: data.census.gov, Census Reporter, Missouri Census Data Center, NHGIS, or proprietary library databases like PolicyMap or the Social Explorer. The longer story: if you’re trying to associate ZIP Code-level data with census ZCTA boundary files or demographic data, there are caveats. I’ll cover the following issues in detail:

ZIP Codes are actually not areas with defined boundaries, and there are no official USPS ZIP Code maps. Areas must be derived using address files. The Census Bureau has done this in creating ZIP Code Tabulation Areas (ZCTAs).
The Census Bureau publishes population data by ZCTA and boundary files for them. But ZCTAs are not strictly analogous with ZIP Codes; there isn’t a ZCTA for every ZIP Code, and if you try to associate ZIP data with them some of your records won’t match. You need to crosswalk your ZIP Code data to the ZCTA-level to prevent this.
ZCTAs do not nest or fit within any other census geographies, and the postal city name associated with a ZIP Code does not correlate with actual legal or municipal areas. This can make selecting and downloading ZIP Code data for a given area difficult.
ZIP Codes were designed for delivering mail, not for studying populations. They vary tremendously in size, shape, and population.
Analyzing data at either the ZIP Code or ZCTA level over time is difficult to impossible.
ZIP Code and ZCTA numbers must be saved as text in data files, and not as numbers. Otherwise codes that have leading zeros get truncated, and the code becomes incorrect.

ZIP Codes versus ZCTAs and Boundaries

Contrary to popular belief, ZIP Codes are not areas and the US Postal Service does not delineate boundaries for them. They are simply numbers assigned to ranges of addresses along street segments, and the codes are associated with a specific post office. When we see ZIP Code boundaries (on Google Maps for example), these have been derived by creating areas where most addresses share the same ZIP Code.

The US Census Bureau creates areal approximations for ZIP Codes called ZIP Code Tabulation Areas or ZCTAs. The Bureau assigns census blocks to a ZIP number based on the ZIP that’s used by a majority of the addresses within each block, and aggregates blocks that share the same ZIP to form a ZCTA. After this initial assignment, they make some modifications to aggregate or eliminate orphaned blocks that share the same ZIP number but are not contiguous. ZCTAs are delineated once every ten years in conjunction with the decennial census, and data from the decennial census and the 5-year American Community Survey (ACS) are published at the ZCTA-level. You can download ZCTA boundaries from the TIGER / Line Shapefiles page, and there is also a generalized cartographic boundary file for them.

Crosswalking ZIP Code Data to ZCTAs

There isn’t a ZCTA for every ZIP Code. Some ZIP Codes represent large clusters of Post Office boxes or are assigned to large organizations that process lots of mail. As census blocks are aggregated into ZCTAs based on the predominate ZIP Code for addresses within the block, these non-areal ZIPs fall out of the equation and we’re left with ZCTAs that approximate ZIP Codes for delivery areas.

As a result, if you’re trying to match either your own summarized address data or sources that use ZIP Codes as the summary level (such as the Census Bureau’s Business Patterns and Economic Census datasets), some ZIP Codes will not have a matching ZCTA and will fall out of your dataset.

To prevent this from happening, you can aggregate your ZIP Code data to ZCTAs prior to joining it to boundary files or other datasets. The UDS Mapper project publishes a ZIP Code to ZCTA Crosswalk file that lists every ZIP Code and the ZCTA it is associated with. For the ZIP Codes that don’t have a corresponding area (the PO Box clusters and large organizations), these essentially represent points that fall within ZCTA polygons. Join your ZIP-level data to the ZIP Code ID in the crosswalk file, and then group or summarize the data using the ZCTA number in the crosswalk. Then you can match this ZCTA-summarized data to boundaries or census demographic data at the ZCTA-level.

UDS ZIP Code to ZCTA Crosswalk. ZIP Code 99501 is an areal ZIP Code with a corresponding ZCTA number, 99501. ZIP Code 99520 is a post office or large volume customer that falls inside ZCTA 99501, and thus is assigned to that ZCTA.

Identifying ZIPs and ZCTAs within Other Areas

ZCTAs are built from census blocks and nest within the United States; they do not fit within any other geographies like cities and towns, counties, or even states. The boundaries of a ZCTA will often cross these other boundaries, so for example a ZCTA may fall within two or three different counties. This makes it challenging to select and download census data for all ZCTAs in a given area.

You can get lists of ZIP Codes for places, for example by using the MCDC’s ZIP Code Lookup. The problem is, the postal city that appears in addresses and is affiliated with a ZIP Code does not correspond with cities as actual legal entities, so you can’t count on the name to select all ZIPs within a specific place. For example, my hometown of Claymont, Delaware has its own ZIP Code, even though Claymont is not an incorporated city with formal, legal boundaries. Most of the ZIP Codes around Claymont are affiliated with Wilmington as a place, even though they largely cover suburbs outside the City of Wilmington; the four ZIP Codes that do cover the city cross the city boundary and include outside areas. In short, if you select all the ZIP Codes that have Wilmington, DE as their place name, they actually cover an area that’s much larger than the City of Wilmington. The Census Bureau does not associate ZCTAs with place names.

Lack of correspondence between postal city names and actual city boundaries. Most ZCTAs with the prefix 198 are assigned to Wilmington as a place name, even though many are partially or fully outside the city.

So how can you determine which ZIP Codes fall within a certain area? Or how they do (or don’t) intersect with other areas? You can overlay and eyeball the areas in TIGERweb to get a quick idea. For something more detailed, here are three options:

The Missouri Census Data Center’s Geocorr application lets you calculate overlap between a source geography and a target geography using either total population or land area for any census geographies. So in a given state, if you select ZCTAs as a source, and counties as the target, you’ll get a list that displays every ZCTA that falls wholly or partially within each county. An allocation factor indicates the percentage of the ZCTA (population or land) that’s inside and outside a county, and you can make decisions as to whether to include a given ZCTA in your study area or not. If a ZCTA falls wholly inside one county, there will be only one record with an allocation factor of 1. If it intersects more than one county, there will be a record with an allocation factor for each county.
The US Department of Housing and Urban Development (HUD) publishes a series of ZIP Code crosswalk files that associates ZIP Codes with census tracts, counties, CBSAs (metropolitan areas), and congressional districts. They create these files by geocoding all addresses and calculating the ratio of residential, business, and other addresses that fall within each of these areas and that share the same ZIP Code. The files are updated quarterly. You can use them to select, assign, or apportion ZIP Codes to a given area. There’s a journal article that describes this resource in detail.
Some websites allow you to select all ZCTAs that fall within a given geography when downloading data, essentially by selecting all ZCTAs that are fully or partially within the area. The Census Reporter allows you to do this: search for a profile for an area, click on a table of interest, and then subdivide the areas by smaller areas. You can even look at a map to see what’s been selected. data.census.gov currently does not provide this option; you have to select ZCTAs one by one (or if you’re using the census API, you’ll need to create a list of ZCTAs to retrieve).

Sample output from MCDC Geocorr. ZCTAs 08251 and 08260 fall completely within Cape May County, NJ. ZCTA 08270’s population is split between Cape May (92.4%) and Atlantic (7.6%) counties. The ZCTA names are actually postal place names; these ZCTAs cover areas that are larger than these places.

Do You Really Need to Use ZIP Codes?

ZIP Codes were an excellent mid-20th century solution for efficiently processing and delivering mail that continues to be useful for that purpose. They are less ideal for studying populations or other forms of human activity. They vary tremendously in size, shape, and population which makes them inconsistent as a unit of analysis. They have no legal or administrative meaning or function, other than delivering mail. While all American’s are familiar with them, they do not have any relevant social meaning. They don’t represent neighborhoods, and when you ask someone where they’re from, they won’t say “19703”.

So what are your other options?

If you don’t have to use ZIP Code or ZCTA data for your project, don’t. For the United States as a whole, consider using counties, PUMAs, or metropolitan areas. Within states: counties, PUMAs, and county subdivisions. For smaller areas: municipalities, census tracts, or aggregates of census tracts.
If you have the raw, address-based data, consider geocoding it. Once you geocode an address, you can use GIS to assign it to any type of geography that you have a boundary file for (spatial join), and then you can aggregate it to that geography. Some geocoders even provide geographies like counties or tracts in the match result. If your data is sensitive, strip all the attributes out except for the address and a serial integer to use as an ID, and after geocoding you can associate the results back to your original data using that ID. The Census Geocoder is free, requires no log in, allows you to do batches of 1,000 addresses at a time, and forces you to use these safety precautions. For bigger jobs, there’s an API.
Sometimes you’ll have no choice and must use ZIP Code / ZCTA data, if what you’re interested in studying is only provided in that summary form, or if there are privacy concerns around geocoding the raw address data. You may want to modify the ZCTA geography for your area to aggregate smaller ZCTAs into larger ones surrounding them, for both visual display and statistical analysis. For example, in New York City there are several ZCTAs that cover only one city and census block, as they’re occupied by one large office building that processes a lot of mail (and thus have their own ZIP number). Also, unlike most census geographies, ZCTAs have large holes in them. Any area that does not have streets and thus no addresses isn’t included in a ZCTA. In urban areas, this means large parks and cemeteries. In rural areas, vast tracts of unpopulated forest, desert, or mountain terrain. And large bodies of water in every place.

One-block ZCTAs in Midtown Manhattan, NYC that have either low or zero population.

Analyzing ZIP Code Data Over Time…

In short – forget it. The Census Bureau introduced ZCTAs in the year 2000, and in 2010 they modified their process for creating them. For a variety of reasons, they’re not strictly compatible. ACS data for ZCTAs wasn’t published until 2013. Even the economic datasets don’t go that far back; the ZIP Code Business Patterns didn’t appear until the early 1990s. Use areas that have more longevity and are relatively stable: counties, census tracts.

Why Do my ZIP Codes Look Wrong in Excel?

Regardless of whether you’re using a spreadsheet, database, or scripting language, always make sure to define ZIP / ZCTA columns as strings or text, and not as numeric types. ZIP Codes and ZCTAs begin with zeros in several states. Columns that contain ZIP / ZCTA codes must be saved as text to preserve the 5-digit code. If they’re saved as numbers, the leading zeros are dropped and the numbers are rendered incorrectly. This often happens if you’re working with data in a CSV file and you click on it to open it in Excel. In parsing the CSV, Excel assumes the ZIP / ZCTA field is a number and saves it as a number, which drops the zero and truncates the code. To prevent this from happening: open Excel to a blank project, go to the Data ribbon, click the button to import text data, choose delimited text on the import screen, choose the delimiter (comma or tab, etc), and when prompted you can select the ZIP / ZCTA column and designate it as text so that it imports properly.

To import CSV files in Excel, go to the Data ribbon and under Get External Data select From Text.

Conclusion

That’s all you ever (or maybe never) wanted to know about ZIP Codes and ZCTAs! For more information see the Census Bureau’s page about ZCTAs, a thorough write up by the Missouri Census Data Center, and these informative and fun blog posts from PolicyMap (complete with photos of Mr. ZIP). I wrote an article a few years back that demonstrates how to use some of these resources (the UDS mapper file, Geocorr) to process ZIP data with SQL and python. And of course, check out my book, Exploring the U.S. Census: Your Guide to America’s Data, to explore these concepts and resources in greater detail with hands-on exercises.

At These Coordinates

Dispatches from the Geospatial Data World

crosswalk

Datasets from GeoData@SciLi: Libraries as Data Creators

Ocean State Spatial Database

Providence Geocoded Crime Incidents

Providence Census Geography Crosswalk

UN ICSC Retail Price Indexes

Conclusion

Crosswalking Census Data to Neighborhood Geographies

How the Crosswalk Works

Creating the Crosswalk

Census Time Series Tables from NHGIS

Geospatial Metadata with Dublin Core and GeoBlacklight Standards

Out With the Old

In with the New

Data Application Profile

Creating Records and Expressing them in XML

Validating Records

Styling the Records

Crosswalk DC XML TO GB JSON

Conclusion

The Trouble with ZIP Codes: Solutions for Data Analysis and Mapping

ZIP Codes versus ZCTAs and Boundaries

Crosswalking ZIP Code Data to ZCTAs

Identifying ZIPs and ZCTAs within Other Areas

Do You Really Need to Use ZIP Codes?

Analyzing ZIP Code Data Over Time…

Why Do my ZIP Codes Look Wrong in Excel?

Conclusion