Introduction

The places package provides access to the daily data files from the geonames data dump. The aim of the package is to make the geonames dump consisting of over 11 million place names with geographic coordinates available for large scale mapping or text mining projects involving multiple countries or world regions.

places is intended to complement the rOpenSci geonames package by Barry Rowlingson. The geonames package provides access to the geonames API and is recommended for smaller scale projects. The places package is better suited to larger scale projects, such as text mining and mapping place names from the literature, or federating different kinds of data with geoinformation at scale. The package also adds tables that make it easier to select subsets of data by geographic region, subregion or economic status using United Nations and World Bank datasets.

The places package was developed with support from the Research Council of Norway (RCN project number 257631/E10) as part of the Biospolar Project.

The purpose of this walkthrough is to show you how to access the data, what is available, and the decisions taken on issues such as tidying the data.

Installing

places is not on CRAN but can be installed from GitHub with devtools:

devtools::install_github("poldham/places")

The geonames data dump

The geonames dump is a set of daily dump files available from the export home page http://download.geonames.org/export/dump/. This page provides .zip files for individual country data along with other data files (such as alternate names) in text files and links to other directories. The mix of files makes it messy to work with.

The places_table() function downloads and parses the page into a tibble to make it easier to work with.

## # A tibble: 283 x 9
##    name   last_modified size  description file_name file_type url    iso  
##    <chr>  <chr>         <chr> <chr>       <chr>     <chr>     <chr>  <chr>
##  1 AD.zip 2018-07-31 0… 89K   <NA>        AD        zip       http:… AD   
##  2 AE.zip 2018-07-31 0… 236K  <NA>        AE        zip       http:… AE   
##  3 AF.zip 2018-07-31 0… 4.5M  <NA>        AF        zip       http:… AF   
##  4 AG.zip 2018-07-31 0… 26K   <NA>        AG        zip       http:… AG   
##  5 AI.zip 2018-07-31 0… 10K   <NA>        AI        zip       http:… AI   
##  6 AL.zip 2018-07-31 0… 300K  <NA>        AL        zip       http:… AL   
##  7 AM.zip 2018-07-31 0… 1.0M  <NA>        AM        zip       http:… AM   
##  8 AN.zip 2018-07-31 0… 3.8K  <NA>        AN        zip       http:… AN   
##  9 AO.zip 2018-07-31 0… 1.1M  <NA>        AO        zip       http:… AO   
## 10 AQ.zip 2018-07-31 0… 523K  <NA>        AQ        zip       http:… AQ   
## # ... with 273 more rows, and 1 more variable: other <chr>

The original table contained the following columns:

  • name
  • last_modified
  • size
  • description (empty)

To make the table easier to work with, places_table() separates the files into an iso column (for country files) and an other column (for all other files). The file_name, file_type and url columns are added to hold the file paths.
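The column derivation described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the sample file names are hypothetical stand-ins for the page listing, and the rule that a country file is any two-letter .zip name is an assumption.

```r
# Hypothetical sample of file names as they might appear on the dump page.
files <- data.frame(
  name = c("AD.zip", "KE.zip", "allCountries.zip", "featureCodes_en.txt"),
  stringsAsFactors = FALSE
)
base_url <- "http://download.geonames.org/export/dump/"

# Split each name into its stem and extension, and build the full URL.
files$file_name <- tools::file_path_sans_ext(files$name)
files$file_type <- tools::file_ext(files$name)
files$url <- paste0(base_url, files$name)

# Assumption: country files are exactly two capital letters plus ".zip".
is_country <- grepl("^[A-Z]{2}\\.zip$", files$name)
files$iso <- ifelse(is_country, files$file_name, NA_character_)
files$other <- ifelse(is_country, NA_character_, files$file_name)

print(files[, c("name", "file_type", "iso", "other")])
```

Keeping iso and other as separate columns means country files can be selected with a simple `!is.na(files$iso)` filter.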

Looking Up Countries

The geonames data mainly works on two letter country codes (variously called iso and iso2c). Country names can be expressed in all kinds of different ways. If you don’t know the two letter country code you can look it up with places_lookup().

## # A tibble: 1 x 1
##   Kenya
##   <chr>
## 1 KE

We can also look up ambiguous names:

places::places_lookup(c("Peoples republic of china", "Viet Nam", "Vietnam", "Lao", "Laos"))
## # A tibble: 1 x 5
##   `Peoples Republic Of China` `Viet Nam` Vietnam Lao   Laos 
##   <chr>                       <chr>      <chr>   <chr> <chr>
## 1 CN                          VN         VN      LA    LA

Behind the scenes the name matching is handled by the countrycode package by Vincent Arel-Bundock and collaborators, which does a good job of recognising the many ways in which country names can be written. At present the places package only implements country names in English.

We use the countrycode package for lookup because of its flexibility. However, Geonames produces its own countryinfo table that can be imported as follows.

## # A tibble: 252 x 19
##    iso   iso3  isonumeric fips  country   capital area_in_sq_km population
##    <chr> <chr> <chr>      <chr> <chr>     <chr>           <dbl>      <int>
##  1 AD    AND   020        AN    Andorra   Andorr…           468      84000
##  2 AE    ARE   784        AE    United A… Abu Dh…         82880    4975593
##  3 AF    AFG   004        AF    Afghanis… Kabul          647500   29121286
##  4 AG    ATG   028        AC    Antigua … St. Jo…           443      86754
##  5 AI    AIA   660        AV    Anguilla  The Va…           102      13254
##  6 AL    ALB   008        AL    Albania   Tirana          28748    2986952
##  7 AM    ARM   051        AM    Armenia   Yerevan         29800    2968000
##  8 AO    AGO   024        AO    Angola    Luanda        1246700   13068161
##  9 AQ    ATA   010        AY    Antarcti… <NA>         14000000          0
## 10 AR    ARG   032        AR    Argentina Buenos…       2766890   41343201
## # ... with 242 more rows, and 11 more variables: continent <chr>,
## #   tld <chr>, currencycode <chr>, currencyname <chr>, phone <chr>,
## #   postal_code_format <chr>, postal_code_regex <chr>, languages <chr>,
## #   geonameid <int>, neighbours <chr>, equivalentfipscode <chr>

The advantage of this table is that it includes information such as the capital city, the area in square kilometres, the population of the country, the continent and so on. Some of this data is included in the regions table (below). However, for general lookup of country names places_lookup() will be more flexible and forgiving.

Individual Country data

You can download the latest raw data for an individual country using places_download() and import it as a data frame with tidy column names using places_import(). A download_date column is added by default to assist with keeping track of the file history.

Download and import are presently handled in two steps to avoid assigning to the global environment. The pipe %>% is built into places.

## [1] "http://download.geonames.org/export/dump/KE.zip"
## # A tibble: 30,113 x 20
##    geonameid name  asciiname alternatenames latitude longitude
##    <chr>     <chr> <chr>     <chr>          <chr>    <chr>    
##  1 54121     Mata… Mata Arba <NA>           -0.67472 40.97667 
##  2 55628     Kolb… Kolbiyow  Kolbiyow,Lac … -1.25055 41.49477 
##  3 58869     Bur … Bur Gause Bur Gause,Bur… 3.77299  41.75744 
##  4 60856     Did … Did Songa Did Songa      -1.2     41.26667 
##  5 149213    Umba  Umba      Mto Umba,Umba… -4.66354 39.22688 
##  6 149529    Musa… Musangai… Musangairo,Ta… -2.96965 37.6457  
##  7 150092    Sere… Serenget… Serengeti Pla… -3.42386 37.92541 
##  8 150177    Schl… Schlobach Schlobach      -3.35848 37.66424 
##  9 150859    Oldo… Oldoinyo… Ol Doinyo Oro… -2.49465 36.75141 
## 10 151987    Nama… Namanga   Namanga        -2.6785  37.01915 
## # ... with 30,103 more rows, and 14 more variables: feature_class <chr>,
## #   feature_code <chr>, iso <chr>, cc2 <chr>, admin1_code <chr>,
## #   admin2_code <chr>, admin3_code <chr>, admin4_code <chr>,
## #   population <chr>, elevation <chr>, dem <int>, timezone <chr>,
## #   modification_date <date>, download_date <date>

places_download() uses places_lookup() internally, meaning that you can simply enter a country name using the country = argument.

anguilla <- places_download(country = "anguilla") %>% 
  places_import()
## [1] "http://download.geonames.org/export/dump/AI.zip"
## # A tibble: 234 x 20
##    geonameid name  asciiname alternatenames latitude longitude
##    <chr>     <chr> <chr>     <chr>          <chr>    <chr>    
##  1 3573360   Wind… Windward… <NA>           18.27035 -62.9667 
##  2 3573361   Wind… Windward… <NA>           18.27547 -62.96621
##  3 3573362   Whit… White Hi… <NA>           18.25327 -62.99573
##  4 3573363   West… West Poi… <NA>           18.27716 -62.96029
##  5 3573364   West… West End… <NA>           18.17191 -63.14941
##  6 3573365   West… West End… West End Pond… 18.16697 -63.15673
##  7 3573366   West… West End… <NA>           18.17185 -63.15675
##  8 3573367   West… West Cay  <NA>           18.2773  -63.27498
##  9 3573368   Welc… Welches … Welches,Welch… 18.2446  -63.01217
## 10 3573369   Watt… Wattices  <NA>           18.22674 -63.03368
## # ... with 224 more rows, and 14 more variables: feature_class <chr>,
## #   feature_code <chr>, iso <chr>, cc2 <chr>, admin1_code <chr>,
## #   admin2_code <chr>, admin3_code <chr>, admin4_code <chr>,
## #   population <chr>, elevation <chr>, dem <int>, timezone <chr>,
## #   modification_date <date>, download_date <date>

places_download can handle different cases and will fail fast on common errors.

places_download(code = "Kenya")

or

places_download(country = "KE")

If you try and download more than one file at a time things will quickly go wrong.

places_download(code = c("AI", "GB"))

places_download() is not vectorised and only handles one country file at a time. For multiple countries or regions it is easier to work with the allcountries table.
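If only a handful of countries are needed, one option is to process the codes one at a time. The sketch below is illustrative only: dump_url() is a hypothetical helper (not part of places) that builds the per-country URL in the form shown in the output above, and it is applied to each code in turn rather than to the whole vector at once.

```r
# Hypothetical helper: build the dump URL for exactly one two-letter code.
dump_url <- function(code) {
  stopifnot(length(code) == 1L, grepl("^[A-Za-z]{2}$", code))
  paste0("http://download.geonames.org/export/dump/", toupper(code), ".zip")
}

codes <- c("AI", "GB", "ke")

# Apply the single-code helper to each element in turn.
urls <- vapply(codes, dump_url, character(1), USE.NAMES = FALSE)
print(urls)

# Assumption: the same one-at-a-time pattern should carry over to
# places_download(), e.g. lapply(codes, function(x) places_download(code = x)),
# but that call needs a network connection and is not run here.
```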

All Countries

Geonames produces a daily file containing the data for all countries. The allcountries daily file contains over 11 million place names in a compressed file of over 330 MB that is 1.4 GB when uncompressed.

For many purposes you may be happy with a highly compressed archive of the allcountries file. The archive was created on 1 January 2018 as a 257 MB compressed .rda file that can be called with:

This will take a few minutes to download and then to load… so maybe take a break for a cup of tea.

If you would like to retrieve the latest data file, use:

allcountries <- places_download(country = "allcountries") %>% places_import()

You may want to have another cup of tea while waiting for this.

Place names

The geonames tables contain three name fields: name, asciiname and alternatenames. The asciiname field is the most useful for text mining, and you may want to take a look at the alternatenames field for known variants. The built-in Kenya dataset (KE) can be useful for getting to grips with the data.

## # A tibble: 39,009 x 19
##    geonameid name  asciiname alternatenames latitude longitude feature_full
##    <chr>     <chr> <chr>     <chr>          <chr>    <chr>     <chr>       
##  1 54121     Mata… Mata Arba <NA>           -0.67472 40.97667  L.AREA      
##  2 55628     Kolb… Kolbiyow  Kolbiyow       -1.25055 41.49477  H.STMI      
##  3 55628     Kolb… Kolbiyow  Lac Chinoti    -1.25055 41.49477  H.STMI      
##  4 55628     Kolb… Kolbiyow  Lac Colbio     -1.25055 41.49477  H.STMI      
##  5 55628     Kolb… Kolbiyow  Lac Colbìo     -1.25055 41.49477  H.STMI      
##  6 55628     Kolb… Kolbiyow  Lac Gifta Bura -1.25055 41.49477  H.STMI      
##  7 55628     Kolb… Kolbiyow  Lac Kolbio     -1.25055 41.49477  H.STMI      
##  8 55628     Kolb… Kolbiyow  Lach Colbio    -1.25055 41.49477  H.STMI      
##  9 55628     Kolb… Kolbiyow  Lach Gif-ta B… -1.25055 41.49477  H.STMI      
## 10 55628     Kolb… Kolbiyow  Laga Kalbio    -1.25055 41.49477  H.STMI      
## # ... with 38,999 more rows, and 12 more variables: iso <chr>, cc2 <chr>,
## #   admin1_code <chr>, admin2_code <chr>, admin3_code <chr>,
## #   admin4_code <chr>, population <chr>, elevation <chr>, dem <int>,
## #   timezone <chr>, modification_date <date>, download_date <date>
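In the KE output above each alternate name sits on its own row, whereas the raw country files hold them as a single comma-separated string. A minimal base R sketch of that kind of separation is shown below. It is not necessarily how places does it, and the two-row data frame is a small sample built from values shown in the output above.

```r
# Small sample in the raw layout: alternate names as one comma-separated string.
places_df <- data.frame(
  geonameid = c("54121", "55628"),
  asciiname = c("Mata Arba", "Kolbiyow"),
  alternatenames = c(NA, "Kolbiyow,Lac Chinoti,Lac Colbio"),
  stringsAsFactors = FALSE
)

# Split on commas; a missing value stays as a single NA entry.
parts <- strsplit(places_df$alternatenames, ",", fixed = TRUE)

# Repeat the identifying columns once per alternate name.
long <- data.frame(
  geonameid = rep(places_df$geonameid, lengths(parts)),
  asciiname = rep(places_df$asciiname, lengths(parts)),
  alternatenames = unlist(parts),
  stringsAsFactors = FALSE
)
print(long)
```

The long layout makes it possible to match text against each variant with an ordinary join or filter on alternatenames.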

Feature Codes

Geonames uses feature codes to describe the georeferenced data. Feature codes are divided between classes and codes. The classes are as follows:

  • A Administrative Boundary Features
  • H Hydrographic Features
  • L Area Features
  • P Populated Place Features
  • R Road / Railroad Features
  • S Spot Features
  • T Hypsographic Features
  • U Undersea Features
  • V Vegetation Features

Examples of corresponding codes are:

  • H.ANCH anchorage
  • R.OILP oil pipeline

Note that the original featurecode table concatenates the class and feature code, as in the examples above, but in the country and allcountries files they are separated into feature_class and feature_code. This makes joining awkward. To solve this, a new field called feature_full is created at import and the feature_class and feature_code fields are dropped to prevent duplicates when joining. That is simpler than it sounds.
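As a rough illustration of that step, the base R sketch below builds feature_full from the two separate fields and then drops them. It is a sketch of the idea rather than the package's import code, and the three-row data frame is a small sample using values from the Kenya data.

```r
# Sample rows in the layout of a raw country file.
raw <- data.frame(
  name = c("Mata Arba", "Kolbiyow", "Bur Gause"),
  feature_class = c("L", "H", "T"),
  feature_code = c("AREA", "STMI", "MT"),
  stringsAsFactors = FALSE
)

# Concatenate class and code with a dot, matching the featurecode table.
raw$feature_full <- paste(raw$feature_class, raw$feature_code, sep = ".")

# Drop the originals so a later join does not produce duplicate columns.
raw$feature_class <- NULL
raw$feature_code <- NULL

print(raw)
```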

To view the feature codes, print the places::featurecodes table:

## # A tibble: 671 x 7
##    feature_full feature_class feature_code feature_name feature_detail
##    <chr>        <chr>         <chr>        <chr>        <chr>         
##  1 A.ADM1       A             ADM1         first-order… a primary adm…
##  2 A.ADM1H      A             ADM1H        historical … a former firs…
##  3 A.ADM2       A             ADM2         second-orde… a subdivision…
##  4 A.ADM2H      A             ADM2H        historical … a former seco…
##  5 A.ADM3       A             ADM3         third-order… a subdivision…
##  6 A.ADM3H      A             ADM3H        historical … a former thir…
##  7 A.ADM4       A             ADM4         fourth-orde… a subdivision…
##  8 A.ADM4H      A             ADM4H        historical … a former four…
##  9 A.ADM5       A             ADM5         fifth-order… a subdivision…
## 10 A.ADM5H      A             ADM5H        historical … a former fift…
## # ... with 661 more rows, and 2 more variables: feature_clean <chr>,
## #   multi <lgl>

To join the featurecodes table onto a dataset you could use:

KE <- dplyr::left_join(KE, places::featurecodes, by = "feature_full")

KE %>% dplyr::select(feature_full, feature_name, name, longitude, latitude)
## # A tibble: 29,598 x 5
##    feature_full feature_name        name             longitude latitude
##    <chr>        <chr>               <chr>            <chr>     <chr>   
##  1 L.AREA       area                Mata Arba        40.97667  -0.67472
##  2 H.STMI       intermittent stream Kolbiyow         41.49477  -1.25055
##  3 T.MT         mountain            Bur Gause        41.75744  3.77299 
##  4 T.HLL        hill                Did Songa        41.26667  -1.2    
##  5 H.STM        stream              Umba             39.22688  -4.66354
##  6 H.STM        stream              Musangairo       37.6457   -2.96965
##  7 T.PLN        plain(s)            Serengeti Plains 37.92541  -3.42386
##  8 T.HLL        hill                Schlobach        37.66424  -3.35848
##  9 T.MT         mountain            Oldoinyo Orok    36.75141  -2.49465
## 10 H.STMI       intermittent stream Namanga          37.01915  -2.6785 
## # ... with 29,588 more rows

The featurecodes table contains the original feature code table parsed into separate columns. Alongside feature_class and feature_code, these are:

  • feature_full (for joining)
  • feature_name (the short description)
  • feature_detail (a longer description)
  • feature_clean (basic cleaning on the feature name to remove plurals such as forest(s))
  • multi (a logical field indicating whether the feature_clean field contains a multi-word phrase (TRUE))

You may wish to join this table to the main table or to investigate the codes to use as filters. The feature code for a mountain is “MT”, but to get all mountainous features you may need to look up other codes (e.g. MTS and MTU).

## # A tibble: 581 x 25
##    geonameid name  asciiname alternatenames latitude longitude feature_full
##    <chr>     <chr> <chr>     <chr>          <chr>    <chr>     <chr>       
##  1 58869     Bur … Bur Gause Bur Gause,Bur… 3.77299  41.75744  T.MT        
##  2 150859    Oldo… Oldoinyo… Ol Doinyo Oro… -2.49465 36.75141  T.MT        
##  3 156053    Late… Latema    Latema         -3.40372 37.61818  T.MT        
##  4 177903    Zong… Zongoloni Yongalini,Zon… -3.54751 38.41325  T.MT        
##  5 177906    Zombo Zombo     Jombo,Zombo    -4.43559 39.21184  T.MT        
##  6 177934    Zaga… Zagatisi  <NA>           -3.63819 38.52782  T.MT        
##  7 178030    Yama… Yamanyani <NA>           -3.07708 38.46462  T.MT        
##  8 178033    Yama… Yamalu    <NA>           -2.03518 38.33162  T.MT        
##  9 178048    Laka… Lakadema  Lacadema,Laga… 0.56193  38.09528  T.MT        
## 10 178065    Wyisa Wyisa     Wyiga,Wyisa    1.58756  35.3808   T.MT        
## # ... with 571 more rows, and 18 more variables: iso <chr>, cc2 <chr>,
## #   admin1_code <chr>, admin2_code <chr>, admin3_code <chr>,
## #   admin4_code <chr>, population <chr>, elevation <chr>, dem <int>,
## #   timezone <chr>, modification_date <date>, download_date <date>,
## #   feature_class <chr>, feature_code <chr>, feature_name <chr>,
## #   feature_detail <chr>, feature_clean <chr>, multi <lgl>
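Filtering on several related codes at once can be sketched in base R as follows. This is a minimal example under stated assumptions: the data frame is a small sample, the "Example Range" row is hypothetical, and the set of codes is just the three mentioned above rather than a complete list of mountainous features.

```r
# Small sample; "Example Range" is a hypothetical row added for illustration.
places_df <- data.frame(
  name = c("Bur Gause", "Did Songa", "Umba", "Example Range"),
  feature_full = c("T.MT", "T.HLL", "H.STM", "T.MTS"),
  stringsAsFactors = FALSE
)

# Codes of interest, without the class prefix.
mountain_codes <- c("MT", "MTS", "MTU")

# Strip the leading class letter and dot, then keep the matching rows.
code_only <- sub("^[A-Z]\\.", "", places_df$feature_full)
mountains <- places_df[code_only %in% mountain_codes, ]
print(mountains)
```

Matching on the code part alone avoids having to know in advance which class each code belongs to.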

Regions, Sub-Regions, intermediate regions and continents

To aid with multi-country analysis the package includes two add-on tables.

  1. The United Nations regions (M49) from the United Nations Statistical Division
  2. World Bank regional divisions (through the World Development Indicators (WDI) package).

You can call these tables directly:

## # A tibble: 249 x 15
##    un_global_code un_global_name un_region_code un_region_name
##             <int> <chr>                   <int> <chr>         
##  1              1 World                       2 Africa        
##  2              1 World                       2 Africa        
##  3              1 World                       2 Africa        
##  4              1 World                       2 Africa        
##  5              1 World                       2 Africa        
##  6              1 World                       2 Africa        
##  7              1 World                       2 Africa        
##  8              1 World                       2 Africa        
##  9              1 World                       2 Africa        
## 10              1 World                       2 Africa        
## # ... with 239 more rows, and 11 more variables: un_sub_region_code <int>,
## #   un_sub_region_name <chr>, un_intermediate_region_code <int>,
## #   un_intermediate_region_name <chr>, un_country_or_area <chr>,
## #   un_m49_code <int>, iso3 <chr>, un_least_developed_countries_ldc <int>,
## #   un_land_locked_developing_countries_lldc <int>,
## #   un_small_island_developing_states_sids <int>,
## #   un_developed_or_developing_countries <chr>

For the World Bank WDI indicators from the WDI package:

## # A tibble: 304 x 3
##    iso3  wb_region                  wb_income          
##    <chr> <chr>                      <chr>              
##  1 ABW   Latin America & Caribbean  High income        
##  2 AFG   South Asia                 Low income         
##  3 AFR   Aggregates                 Aggregates         
##  4 AGO   Sub-Saharan Africa         Lower middle income
##  5 ALB   Europe & Central Asia      Upper middle income
##  6 AND   Europe & Central Asia      High income        
##  7 ANR   Aggregates                 Aggregates         
##  8 ARB   Aggregates                 Aggregates         
##  9 ARE   Middle East & North Africa High income        
## 10 ARG   Latin America & Caribbean  Upper middle income
## # ... with 294 more rows

Additional regional information is available from the countrycode package and may be incorporated into places in future.

The two regional tables are combined with selected fields from the geonames countryinfo table.

The regions file can be called as follows:

## # A tibble: 249 x 22
##    un_global_code un_global_name un_region_code un_region_name
##             <int> <chr>                   <int> <chr>         
##  1              1 World                       2 Africa        
##  2              1 World                       2 Africa        
##  3              1 World                       2 Africa        
##  4              1 World                       2 Africa        
##  5              1 World                       2 Africa        
##  6              1 World                       2 Africa        
##  7              1 World                       2 Africa        
##  8              1 World                       2 Africa        
##  9              1 World                       2 Africa        
## 10              1 World                       2 Africa        
## # ... with 239 more rows, and 18 more variables: un_sub_region_code <int>,
## #   un_sub_region_name <chr>, un_intermediate_region_code <int>,
## #   un_intermediate_region_name <chr>, un_country_or_area <chr>,
## #   un_m49_code <int>, iso3 <chr>, un_least_developed_countries_ldc <int>,
## #   un_land_locked_developing_countries_lldc <int>,
## #   un_small_island_developing_states_sids <int>,
## #   un_developed_or_developing_countries <chr>, iso <chr>,
## #   iso_numeric <chr>, fips <chr>, country <chr>, geonameid <int>,
## #   wb_region <chr>, wb_income <chr>

To join the regions table with a set of results try:

df <- dplyr::left_join(df, regions, by = "iso")
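The same kind of left join can be sketched without dplyr using base R's merge(). Everything here is a stand-in: regions_sample is a two-row extract using values shown in the tables above, results is a hypothetical set of counts, and "ZZ" is a made-up code included to show what happens to unmatched rows.

```r
# Hypothetical results keyed by two-letter country code.
results <- data.frame(
  iso = c("AO", "AR", "ZZ"),
  n_places = c(3L, 5L, 1L),
  stringsAsFactors = FALSE
)

# Two-row stand-in for the regions table.
regions_sample <- data.frame(
  iso = c("AO", "AR"),
  wb_region = c("Sub-Saharan Africa", "Latin America & Caribbean"),
  wb_income = c("Lower middle income", "Upper middle income"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of results, filling unmatched rows with NA.
# Note that merge() sorts the result by the key column.
joined <- merge(results, regions_sample, by = "iso", all.x = TRUE)
print(joined)
```

Checking for NA in the joined columns is a quick way to spot country codes that did not match.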