Gazetteers are directories of place names and locations, which are useful for:
- Identifying variations in place names
- Obtaining coordinates
- Locating a place within a hierarchy of places
- Generating lists of types of features
For example, if you’re working with data that’s associated with specific cities, mountains, or bodies of water, and you have the names of these features but not the coordinates or the country or state / province where they’re located, you can use a gazetteer to obtain all three. Or, if you want to create a map of a specific type of feature (i.e. populated places, ruins, mines) or want map labels for features (forests, bodies of water) you can extract and plot the gazetteer data in GIS.
In this post I’ll provide an overview of two major global gazetteers: the GEOnet Names Server and Geonames. Each one provides several different interfaces and services for exploring and accessing data which I’ll briefly mention, but I’ll focus on on the data files that you can download and what’s contained in them. I’ll conclude with a strategy for relating a small to medium place-based data file of your own to the gazetteer to obtain coordinates. If you have a file with hundreds or a few thousand records and were planning to get coordinates by eyeballing Google Maps and clicking one by one, try this instead.
The US National Geospatial-Intelligence Agency (NGA) maintains a vast gazetteer with data for all of the countries in the world (almost) and provides it to the public via the GEOnet Names Server (GNS). The GNS gazetteer does NOT include features in the United States or any of its territories; the US Geological Survey maintains a separate system called the Geographic Names Information System (GNIS) whose structure and organization is different.
The GNS is updated on a weekly basis and is provided through a number of interfaces that include a map-based and a text-based search, and Web Mapping (WMS) and Web Feature (WFS) Services that allow you to display data in a GIS or a web map.
Data files are packaged on a country by country basis. Alternatively you can download one file that has the whole world in it, or an archive with separate files for each country. The data is stored in tab-delimited text files that include a header row (i.e. the column names). ZIP files for each country include a primary file that contains all the country’s features, and a series of files that contain a subset of the primary file based on feature type. So, if you wanted to work with just populated places or with hydrographic features you can work with the specific file instead of having to filter them out of the primary one.
Each record in the GNS represents a name for a feature, as opposed to a feature itself. Thus, if a feature is known by more than one name it will appear multiple times in the file. Each record has a unique feature identifier (UFI) and a unique name identifier (UNI) which are large integers. The UFI number is repeated in the data, while the UNI is unique. The GNS files contain a number of different columns containing several feature names (short names, long ones, with and without diacritics) and a name type column (NT) that indicates whether the record is for a an approved (N), or variant name (V). If you want a list of features without duplicates, you would need to create a subset of the records that only includes the approved name.
Features are classified into nine broad classes (FC), which in turn are subdivided into many different designations (DSG). The nine classes are: administrative region, populated place, vegetation, locality or area, undersea, roads and railroads, hypsographic (terrain), hydrographic (water), and spot (point-based features). Additional columns include codes designating the size of a populated place (PC) and relative importance of the feature (DISPLAY) which is useful when mapping data at varying scales. The GNS does not contain information on actual population or elevation (this was included in the past but is no longer available).
The GNS includes a few geographic references that indicate where the feature is located. There is a global region code (RC) in the first column, a primary country code (cc1) and an administrative division (state or province) code for the primary country, and a secondary country code (cc2). Geographic features like rivers, seas, mountains, and forests may span the boundary of more than one country, so the cc1 and cc2 columns indicate this. Data in these fields may be stored as a comma-separated list or array with the different codes. The GNS uses two-letter FIPS 10-4 country codes created by the US government.
Lastly, longitude and latitude coordinates are provided in separate fields in two formats: decimal degrees (needed for plotting and mapping) and degrees-minutes-seconds. The coordinates are in the WGS 84 CRS (EPSG 4326).
Geonames is the Wikipedia or OpenStreetMap of gazetteers. It’s a collaborative, crowd-sourced project. Many users may contribute a few locations or make a correction or two, but by and large most of the data comes from public or government sources that is loaded into Geonames en masse and subsequently modified. Geonames provides a text and map-based search, and an API that let’s scripters and programmers directly access the data.
Data files are packaged country by country, or globally by certain types (i.e. all countries or the largest cities). The data is stored in tab-delimited text files without a header row, so you need to consult the documentation to identify the columns. All data for each country is packaged in a single file.
Unlike the GNS, each Geonames record represents a specific feature. There is a conventional name (name) and a variant that uses plain ascii characters (asciiname). Some variant names are included in a single list / array column called alternatenames; to get a full list of variants and spellings in different languages you would download a separate alternate names file that you could link to this one. Each feature is assigned a geonameid, which is simply a large unique integer.
Features are divided into the same nine classes that are used in the GNS, and the subdivisions are the same as well. Documentation for the classes and subdivisions is provided. Population and elevation data is provided when available and relevant, but there’s no information on timeliness or source in the data file (but you can view the full edit history for a record in the online interface).
Geonames goes to great lengths to provide the geographic framework or hierarchy for each feature, so you can get instant geographic context. They use two-letter ISO country codes to designate countries (country_code), a list of alternate or secondary countries (cc2), and for the primary country up to four different levels of administrative divisions (i.e. state / province, county, municipality, etc). There’s also a field that indicates what timezone each feature is in.
There is one set of longitude and latitude coordinates in decimal degrees in the WGS84 CRS.
To compare the different files I downloaded data for Belize, since it has a small number of records. The GNS file had 2,801 records for names, but if you look at unique features the record count was 2,180. The Geonames file for Belize has a comparable number of 2,309.
- Free and publicly available
- Tab-delimited text in country-based files
- Longitude and latitude coordinates in decimal degrees in WGS84
- Same feature classification system with nine classes and multiple sub-classes
- A single, official government source
- A file of feature names: must filter out variants to get unique feature records
- File comes with column header
- Files are divided into sub-files for feature classes
- Uses FIPS codes for countries
- Useful fields for ranking features for mapping
- Limited data on geographic hierarchy
- No data on population or elevation
- Lacks data for the United States and territories (obtainable via the USGS GNIS)
- Collaborative project with data from many sources
- A file of features, variant names included in separate column
- Additional alternate names and spellings in most languages available in separate files
- File lacks column header
- Uses ISO codes for countries
- Extensive information on geographic hierarchy
- Has population, elevation, and timezone for certain features
- No ranking columns for map display
1. It’s important to recognize that each source uses different codes for classifying countries: the GNS uses FIPS and Geonames uses ISO. While they appear similar (two-letter abbreviations) they are NOT the same: The FIPS code for Belize is BH and the ISO Code if BZ; in the ISO system BH is for Bahrain while the FIPS system doesn’t use BZ as a code. The CIA World Factbook includes a table comparing different country code systems. The GNS will convert to ISO at some uncertain date in the future.
2. Gazetteer data must be imported using UTF-8 encoding to preserve all the characters from the various alphabets.
3. Each feature in a gazetteer will have longitude and latitude coordinates that represent the geographic center of a feature. That means that a large areal feature like a country, a linear feature like a road, and a small point feature like a monument will have one coordinate pair. The coordinates for the monument will be pretty precise, while the set for the road and country are broad generalizations. Long linear features like roads and rivers may appear in the datasets several times as distinct feature records at different points. While it’s possible to get bounding box coordinates from Geonames, this data is not included in the downloadable country files.
4. A place name may appear multiple times in a gazetteer because names are not unique. Several different places of the same type may have the same name, and several features of different types may have the same name. For example, the Geonames file for Belize has four places name Santa Elena; two are populated places in different parts of the country while the other two are spot features (a camp and an estate) that are located near each of the populated places. The GNS file has even more records for this place, some with the approved name Santa Elena and others with the variant Saint Helena.
For all these reasons, it rarely makes sense to use the files in their entirety for obtaining names and coordinates or plotting places. You’ll want to extract data just for the types of features that you need. If you’re trying to match a list of place names to the gazetteer you’ll need to insure that you’re matching the right name to the right place. You can use the feature classes and the administrative divisions of the country to narrow down the location, and when in doubt use the gazetteer map interfaces to locate a specific place.
Matching Your Own Data to a Gazetteer
Winnow down the gazetteer file to just the features you need. Make sure that all the place names in your own data file are standardized so you don’t have variant spellings for the same place. In your data add a column for a unique identifier at the beginning of the sheet. Locate each place in your file in the gazetteer, then copy the unique ID from that file into your sheet. Then, if you’re using a spreadsheet you can use the VLOOKUP formula to use the ID from your sheet to pull related data from the gazetteer sheet (the longitude and latitude coordinates, codes for the administrative divisions, etc). This saves you a lot of copying and pasting. Similarly, if you were using a relational database you can write a JOIN statement to tie the two tables together using the ID.
This approach saves you the time of manually clicking on Google Maps or OSM to look up coordinates for a place and transcribing them, and you get the added benefit of grabbing any extra useful information the gazetteer provides. If you haven’t started the process of gathering your own data, start with the gazetteer file: winnow it down and append your own data to it as your research progresses.
But what if you had tons of coordinates that you need to retrieve? Because of the ambiguity in place names using a VLOOKUP or JOIN based on the name will be imprecise, because there may be more than one place with the same name and you’ll have no way of knowing if you selected the right one. You could modify your own data and the data in the gazetteer by concatenating administrative codes to the place name (i.e. St. Elena, 02) to make the name more precise and increase the chances of an accurate join. This approach requires you to be familiar with the administrative subdivisions in the areas you’re researching.
If you were trying to identify coordinates for tens of thousands of towns, cities, and larger administrative divisions you could try using a geocoder instead of a gazetteer. Geocoders are designed primarily for obtaining coordinates for addresses, but if an exact match can’t be found many will return coordinates for the smallest possible area that’s part of the address. If you provided a list of cities that also include a state / province and country, you could obtain the coordinates for just the city.
A final alternative where you can get a wider range of features in a geospatial format in bulk is the OpenStreetMap. I’ll return to this in a future post, but there’s an excellent OSM – QGIS tutorial that can help get you started.
Interested in learning more? If you’re in the spatial sciences or digital humanities check out this book: Placing Names: Enriching and Integrating Gazetteers.