The `places` package provides access to the daily data files from the geonames data dump. The aim of the package is to make the geonames dump, consisting of over 11 million place names with geographic coordinates, available for large scale mapping or text mining projects involving multiple countries or world regions.
`places` is intended to complement the rOpenSci `geonames` package by Barry Rowlingson. The geonames package provides access to the geonames API and is recommended for smaller scale projects. The `places` package is better suited to larger scale projects that involve text mining and mapping place names from literature, or that federate different kinds of data with geoinformation at scale. The package also adds tables that make it easier to select subsets of data by geographic region or subregion, or by economic status, using United Nations and World Bank datasets.
The places package was developed with support from the Research Council of Norway (RCN project number 257631/E10) as part of the Biospolar Project.
The purpose of this walkthrough is to show you how to access the data, what is available, and the decisions involved in issues such as tidying the data.
The geonames dump is a set of daily dump files available from the export home page http://download.geonames.org/export/dump/. This page provides .zip files for individual country data along with other data files (such as alternate names) in text files and links to other directories. The mix of files makes it messy to work with.
The `places_table()` function downloads and parses the page into a tibble to make it easier to work with.
## # A tibble: 283 x 9
## name last_modified size description file_name file_type url iso
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 AD.zip 2018-07-31 0… 89K <NA> AD zip http:… AD
## 2 AE.zip 2018-07-31 0… 236K <NA> AE zip http:… AE
## 3 AF.zip 2018-07-31 0… 4.5M <NA> AF zip http:… AF
## 4 AG.zip 2018-07-31 0… 26K <NA> AG zip http:… AG
## 5 AI.zip 2018-07-31 0… 10K <NA> AI zip http:… AI
## 6 AL.zip 2018-07-31 0… 300K <NA> AL zip http:… AL
## 7 AM.zip 2018-07-31 0… 1.0M <NA> AM zip http:… AM
## 8 AN.zip 2018-07-31 0… 3.8K <NA> AN zip http:… AN
## 9 AO.zip 2018-07-31 0… 1.1M <NA> AO zip http:… AO
## 10 AQ.zip 2018-07-31 0… 523K <NA> AQ zip http:… AQ
## # ... with 273 more rows, and 1 more variable: other <chr>
The original table contained the following columns: the name, last modified date, size and description of each file.
To make the table easier to work with, `places_table()` separates the files into `iso` (for country files) and `other` (for other files). The `file_name`, `file_type` and `url` columns are added for file paths.
The geonames data mainly works on two letter country codes (variously called iso and iso2c). Country names can be expressed in all kinds of different ways. If you don’t know the two letter country code you can look it up with `places_lookup()`.
## # A tibble: 1 x 1
## Kenya
## <chr>
## 1 KE
We can also look up ambiguous names:
## # A tibble: 1 x 5
## `Peoples Republic Of China` `Viet Nam` Vietnam Lao Laos
## <chr> <chr> <chr> <chr> <chr>
## 1 CN VN VN LA LA
Behind the scenes the name matching is handled by the countrycode package by Vincent Arel-Bundock and collaborators, which does a good job of recognising the many ways in which country names are written. At present the places package only implements country names in English.
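As a rough illustration of the kind of matching involved, the sketch below calls countrycode directly rather than `places_lookup()`. It assumes the countrycode package is installed (the guard skips the example otherwise) and is not meant to show how places wraps it internally.

```r
# Sketch only: calls countrycode directly, not places_lookup().
if (requireNamespace("countrycode", quietly = TRUE)) {
  country_names <- c("Kenya", "Viet Nam", "Vietnam", "Laos")

  # Convert free-text English country names to two letter ISO codes.
  codes <- countrycode::countrycode(
    country_names,
    origin = "country.name",
    destination = "iso2c"
  )

  print(setNames(codes, country_names))
}
```

Variant spellings such as "Viet Nam" and "Vietnam" should resolve to the same code, which is what makes this approach forgiving for lookup.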
We use the countrycode package for lookup because of its flexibility. However, Geonames produces its own countryinfo table that can be imported as follows.
## # A tibble: 252 x 19
## iso iso3 isonumeric fips country capital area_in_sq_km population
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int>
## 1 AD AND 020 AN Andorra Andorr… 468 84000
## 2 AE ARE 784 AE United A… Abu Dh… 82880 4975593
## 3 AF AFG 004 AF Afghanis… Kabul 647500 29121286
## 4 AG ATG 028 AC Antigua … St. Jo… 443 86754
## 5 AI AIA 660 AV Anguilla The Va… 102 13254
## 6 AL ALB 008 AL Albania Tirana 28748 2986952
## 7 AM ARM 051 AM Armenia Yerevan 29800 2968000
## 8 AO AGO 024 AO Angola Luanda 1246700 13068161
## 9 AQ ATA 010 AY Antarcti… <NA> 14000000 0
## 10 AR ARG 032 AR Argentina Buenos… 2766890 41343201
## # ... with 242 more rows, and 11 more variables: continent <chr>,
## # tld <chr>, currencycode <chr>, currencyname <chr>, phone <chr>,
## # postal_code_format <chr>, postal_code_regex <chr>, languages <chr>,
## # geonameid <int>, neighbours <chr>, equivalentfipscode <chr>
The advantage of this table is that it includes information such as the capital city, the area in square kilometres, the population, the continent and so on. Some of this data is included in the `regions` table (below). However, for general lookup of country names `places_lookup()` will be more flexible and forgiving.
You can download the latest raw data for an individual country using `places_download()` and import it as a data frame with tidy column names using `places_import()`. A `download_date` column is added by default to assist with keeping track of the file history.
Download and import are presently handled in two steps to avoid assigning to the global environment. The pipe `%>%` is built into places.
## [1] "http://download.geonames.org/export/dump/KE.zip"
## # A tibble: 30,113 x 20
## geonameid name asciiname alternatenames latitude longitude
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 54121 Mata… Mata Arba <NA> -0.67472 40.97667
## 2 55628 Kolb… Kolbiyow Kolbiyow,Lac … -1.25055 41.49477
## 3 58869 Bur … Bur Gause Bur Gause,Bur… 3.77299 41.75744
## 4 60856 Did … Did Songa Did Songa -1.2 41.26667
## 5 149213 Umba Umba Mto Umba,Umba… -4.66354 39.22688
## 6 149529 Musa… Musangai… Musangairo,Ta… -2.96965 37.6457
## 7 150092 Sere… Serenget… Serengeti Pla… -3.42386 37.92541
## 8 150177 Schl… Schlobach Schlobach -3.35848 37.66424
## 9 150859 Oldo… Oldoinyo… Ol Doinyo Oro… -2.49465 36.75141
## 10 151987 Nama… Namanga Namanga -2.6785 37.01915
## # ... with 30,103 more rows, and 14 more variables: feature_class <chr>,
## # feature_code <chr>, iso <chr>, cc2 <chr>, admin1_code <chr>,
## # admin2_code <chr>, admin3_code <chr>, admin4_code <chr>,
## # population <chr>, elevation <chr>, dem <int>, timezone <chr>,
## # modification_date <date>, download_date <date>
`places_download` uses `places_lookup` internally, meaning that you can simply enter a country name using the `country =` argument.
## [1] "http://download.geonames.org/export/dump/AI.zip"
## # A tibble: 234 x 20
## geonameid name asciiname alternatenames latitude longitude
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3573360 Wind… Windward… <NA> 18.27035 -62.9667
## 2 3573361 Wind… Windward… <NA> 18.27547 -62.96621
## 3 3573362 Whit… White Hi… <NA> 18.25327 -62.99573
## 4 3573363 West… West Poi… <NA> 18.27716 -62.96029
## 5 3573364 West… West End… <NA> 18.17191 -63.14941
## 6 3573365 West… West End… West End Pond… 18.16697 -63.15673
## 7 3573366 West… West End… <NA> 18.17185 -63.15675
## 8 3573367 West… West Cay <NA> 18.2773 -63.27498
## 9 3573368 Welc… Welches … Welches,Welch… 18.2446 -63.01217
## 10 3573369 Watt… Wattices <NA> 18.22674 -63.03368
## # ... with 224 more rows, and 14 more variables: feature_class <chr>,
## # feature_code <chr>, iso <chr>, cc2 <chr>, admin1_code <chr>,
## # admin2_code <chr>, admin3_code <chr>, admin4_code <chr>,
## # population <chr>, elevation <chr>, dem <int>, timezone <chr>,
## # modification_date <date>, download_date <date>
`places_download` can handle different cases and will fail fast on common errors.
If you try to download more than one file at a time things will quickly go wrong. `places_download()` is not vectorised and only handles one country file at a time. For multiple countries or regions it is easier to work with the `allcountries` table.
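If you do need a handful of individual country files, one option is to process the codes one at a time. The sketch below only builds the download URLs, following the pattern shown in the output above; `dump_url()` is a hypothetical helper written for this illustration and is not part of places.

```r
# Hypothetical helper (not part of places): build one dump URL per country code.
dump_url <- function(iso) {
  stopifnot(is.character(iso), all(nchar(iso) == 2))
  paste0("http://download.geonames.org/export/dump/", toupper(iso), ".zip")
}

urls <- dump_url(c("KE", "ai"))
print(urls)

# Each file would then be fetched separately, for example:
# for (u in urls) {
#   utils::download.file(u, destfile = file.path(tempdir(), basename(u)), mode = "wb")
# }
```

Looping over single files keeps each request within what a one-country-at-a-time function expects; for anything beyond a few countries the `allcountries` table remains the simpler route.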
Geonames produces a daily file containing the data for all countries. The `allcountries` daily file contains over 11 million place names in a compressed file of over 330 MB that is 1.4 GB when uncompressed.
For many purposes you may be happy with a highly compressed archive of the allcountries file. The archive was created on the 1st of January 2018 and is a 257 MB compressed .rda file that can be called with:
This will take a few minutes to download and then to load… so maybe take a break for a cup of tea.
If you would like to retrieve the latest data file, use:
You may want to have another cup of tea while waiting for this.
The geonames tables contain three name fields: name, asciiname and alternatenames. The asciiname field is the most useful for text mining and you may want to take a look at the alternatenames field for known variants. The built in Kenya dataset (KE) can be useful for getting to grips with the data.
## # A tibble: 39,009 x 19
## geonameid name asciiname alternatenames latitude longitude feature_full
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 54121 Mata… Mata Arba <NA> -0.67472 40.97667 L.AREA
## 2 55628 Kolb… Kolbiyow Kolbiyow -1.25055 41.49477 H.STMI
## 3 55628 Kolb… Kolbiyow Lac Chinoti -1.25055 41.49477 H.STMI
## 4 55628 Kolb… Kolbiyow Lac Colbio -1.25055 41.49477 H.STMI
## 5 55628 Kolb… Kolbiyow Lac Colbìo -1.25055 41.49477 H.STMI
## 6 55628 Kolb… Kolbiyow Lac Gifta Bura -1.25055 41.49477 H.STMI
## 7 55628 Kolb… Kolbiyow Lac Kolbio -1.25055 41.49477 H.STMI
## 8 55628 Kolb… Kolbiyow Lach Colbio -1.25055 41.49477 H.STMI
## 9 55628 Kolb… Kolbiyow Lach Gif-ta B… -1.25055 41.49477 H.STMI
## 10 55628 Kolb… Kolbiyow Laga Kalbio -1.25055 41.49477 H.STMI
## # ... with 38,999 more rows, and 12 more variables: iso <chr>, cc2 <chr>,
## # admin1_code <chr>, admin2_code <chr>, admin3_code <chr>,
## # admin4_code <chr>, population <chr>, elevation <chr>, dem <int>,
## # timezone <chr>, modification_date <date>, download_date <date>
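The one-row-per-variant layout shown above can be sketched in base R as follows. The toy data frame copies a few values from the Kenya output and stands in for the real table; the approach is an illustration rather than the method used inside places.

```r
# Toy data for illustration; values are copied from the Kenya output above.
toy <- data.frame(
  geonameid = c("54121", "55628"),
  asciiname = c("Mata Arba", "Kolbiyow"),
  alternatenames = c(NA, "Kolbiyow,Lac Chinoti,Lac Colbio"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string; strsplit() keeps NA as a single NA element.
variants <- strsplit(toy$alternatenames, ",", fixed = TRUE)

# Repeat each source row once per variant, then overwrite the column.
long <- toy[rep(seq_len(nrow(toy)), lengths(variants)), ]
long$alternatenames <- unlist(variants)
rownames(long) <- NULL

print(long)
```

Rows without alternate names are kept with `NA`, so no places are lost when the table is lengthened. In a tidyverse workflow, `tidyr::separate_rows()` with `sep = ","` gives a similar result.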
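Geonames uses feature codes to describe the georeferenced data. Feature codes are divided between classes and codes. The classes are as follows: A (country, state, region), H (stream, lake), L (parks, area), P (city, village), R (road, railroad), S (spot, building, farm), T (mountain, hill, rock), U (undersea) and V (forest, heath).
An example of a corresponding code is A.ADM1, which pairs the class A with the code ADM1.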
Note that the original featurecode table concatenates the class and feature code (for example, A.ADM1), but in the actual country and `allcountries` files they are separated into `feature_class` and `feature_code`. This makes joining awkward. To solve this, a new field called `feature_full` is created at import, and the feature_class and feature_code fields are dropped to prevent duplicates when joining. That is simpler than it sounds.
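A minimal sketch of that step, using a toy data frame with values taken from the Kenya output; the real import may differ in detail:

```r
# Toy stand-in for a raw country file with separate class and code columns.
raw <- data.frame(
  name = c("Mata Arba", "Kolbiyow", "Bur Gause"),
  feature_class = c("L", "H", "T"),
  feature_code = c("AREA", "STMI", "MT"),
  stringsAsFactors = FALSE
)

# Concatenate class and code with a dot to match the featurecodes key.
raw$feature_full <- paste(raw$feature_class, raw$feature_code, sep = ".")

# Drop the two source columns so a later join does not duplicate them.
tidy <- raw[, setdiff(names(raw), c("feature_class", "feature_code"))]

print(tidy)
```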
To view the featurecodes use:
## # A tibble: 671 x 7
## feature_full feature_class feature_code feature_name feature_detail
## <chr> <chr> <chr> <chr> <chr>
## 1 A.ADM1 A ADM1 first-order… a primary adm…
## 2 A.ADM1H A ADM1H historical … a former firs…
## 3 A.ADM2 A ADM2 second-orde… a subdivision…
## 4 A.ADM2H A ADM2H historical … a former seco…
## 5 A.ADM3 A ADM3 third-order… a subdivision…
## 6 A.ADM3H A ADM3H historical … a former thir…
## 7 A.ADM4 A ADM4 fourth-orde… a subdivision…
## 8 A.ADM4H A ADM4H historical … a former four…
## 9 A.ADM5 A ADM5 fifth-order… a subdivision…
## 10 A.ADM5H A ADM5H historical … a former fift…
## # ... with 661 more rows, and 2 more variables: feature_clean <chr>,
## # multi <lgl>
To join the featurecodes table onto a dataset you could use:
KE <- dplyr::left_join(KE, places::featurecodes, by = "feature_full")
KE %>% dplyr::select(feature_full, feature_name, name, longitude, latitude)
## # A tibble: 29,598 x 5
## feature_full feature_name name longitude latitude
## <chr> <chr> <chr> <chr> <chr>
## 1 L.AREA area Mata Arba 40.97667 -0.67472
## 2 H.STMI intermittent stream Kolbiyow 41.49477 -1.25055
## 3 T.MT mountain Bur Gause 41.75744 3.77299
## 4 T.HLL hill Did Songa 41.26667 -1.2
## 5 H.STM stream Umba 39.22688 -4.66354
## 6 H.STM stream Musangairo 37.6457 -2.96965
## 7 T.PLN plain(s) Serengeti Plains 37.92541 -3.42386
## 8 T.HLL hill Schlobach 37.66424 -3.35848
## 9 T.MT mountain Oldoinyo Orok 36.75141 -2.49465
## 10 H.STMI intermittent stream Namanga 37.01915 -2.6785
## # ... with 29,588 more rows
The `featurecodes` table includes the feature code table divided into four columns that were parsed from the original file.
You may wish to join this table to the main table or to investigate the codes to use as filters. The feature code for a mountain is “MT”, but to get all mountainous features you may need to look up other codes (e.g. MTS and MTU).
## # A tibble: 581 x 25
## geonameid name asciiname alternatenames latitude longitude feature_full
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 58869 Bur … Bur Gause Bur Gause,Bur… 3.77299 41.75744 T.MT
## 2 150859 Oldo… Oldoinyo… Ol Doinyo Oro… -2.49465 36.75141 T.MT
## 3 156053 Late… Latema Latema -3.40372 37.61818 T.MT
## 4 177903 Zong… Zongoloni Yongalini,Zon… -3.54751 38.41325 T.MT
## 5 177906 Zombo Zombo Jombo,Zombo -4.43559 39.21184 T.MT
## 6 177934 Zaga… Zagatisi <NA> -3.63819 38.52782 T.MT
## 7 178030 Yama… Yamanyani <NA> -3.07708 38.46462 T.MT
## 8 178033 Yama… Yamalu <NA> -2.03518 38.33162 T.MT
## 9 178048 Laka… Lakadema Lacadema,Laga… 0.56193 38.09528 T.MT
## 10 178065 Wyisa Wyisa Wyiga,Wyisa 1.58756 35.3808 T.MT
## # ... with 571 more rows, and 18 more variables: iso <chr>, cc2 <chr>,
## # admin1_code <chr>, admin2_code <chr>, admin3_code <chr>,
## # admin4_code <chr>, population <chr>, elevation <chr>, dem <int>,
## # timezone <chr>, modification_date <date>, download_date <date>,
## # feature_class <chr>, feature_code <chr>, feature_name <chr>,
## # feature_detail <chr>, feature_clean <chr>, multi <lgl>
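To widen a filter beyond a single code, you can match against a vector of codes. The sketch below uses a toy data frame; "Example Range" is a made-up row added so that a second code appears, and the set of codes you actually need should be checked against the featurecodes table.

```r
# Toy data: three rows from the Kenya output plus one hypothetical T.MTS row.
toy <- data.frame(
  name = c("Bur Gause", "Umba", "Did Songa", "Example Range"),
  feature_full = c("T.MT", "H.STM", "T.HLL", "T.MTS"),
  stringsAsFactors = FALSE
)

# Codes treated as "mountainous" for this illustration.
mountain_codes <- c("T.MT", "T.MTS")

# Keep only rows whose feature_full is one of the chosen codes.
mountains <- toy[toy$feature_full %in% mountain_codes, ]

print(mountains)
```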
To aid with multicountry analysis the package includes two add-on tables.
You can call these tables directly:
## # A tibble: 249 x 15
## un_global_code un_global_name un_region_code un_region_name
## <int> <chr> <int> <chr>
## 1 1 World 2 Africa
## 2 1 World 2 Africa
## 3 1 World 2 Africa
## 4 1 World 2 Africa
## 5 1 World 2 Africa
## 6 1 World 2 Africa
## 7 1 World 2 Africa
## 8 1 World 2 Africa
## 9 1 World 2 Africa
## 10 1 World 2 Africa
## # ... with 239 more rows, and 11 more variables: un_sub_region_code <int>,
## # un_sub_region_name <chr>, un_intermediate_region_code <int>,
## # un_intermediate_region_name <chr>, un_country_or_area <chr>,
## # un_m49_code <int>, iso3 <chr>, un_least_developed_countries_ldc <int>,
## # un_land_locked_developing_countries_lldc <int>,
## # un_small_island_developing_states_sids <int>,
## # un_developed_or_developing_countries <chr>
For the World Bank WDI indicators from the WDI package:
## # A tibble: 304 x 3
## iso3 wb_region wb_income
## <chr> <chr> <chr>
## 1 ABW Latin America & Caribbean High income
## 2 AFG South Asia Low income
## 3 AFR Aggregates Aggregates
## 4 AGO Sub-Saharan Africa Lower middle income
## 5 ALB Europe & Central Asia Upper middle income
## 6 AND Europe & Central Asia High income
## 7 ANR Aggregates Aggregates
## 8 ARB Aggregates Aggregates
## 9 ARE Middle East & North Africa High income
## 10 ARG Latin America & Caribbean Upper middle income
## # ... with 294 more rows
Additional regional information is available from the countrycode package and may be incorporated into places in future.
The two regional tables are combined with selected fields from the geonames countryinfo table.
The regions file can be called as follows:
## # A tibble: 249 x 22
## un_global_code un_global_name un_region_code un_region_name
## <int> <chr> <int> <chr>
## 1 1 World 2 Africa
## 2 1 World 2 Africa
## 3 1 World 2 Africa
## 4 1 World 2 Africa
## 5 1 World 2 Africa
## 6 1 World 2 Africa
## 7 1 World 2 Africa
## 8 1 World 2 Africa
## 9 1 World 2 Africa
## 10 1 World 2 Africa
## # ... with 239 more rows, and 18 more variables: un_sub_region_code <int>,
## # un_sub_region_name <chr>, un_intermediate_region_code <int>,
## # un_intermediate_region_name <chr>, un_country_or_area <chr>,
## # un_m49_code <int>, iso3 <chr>, un_least_developed_countries_ldc <int>,
## # un_land_locked_developing_countries_lldc <int>,
## # un_small_island_developing_states_sids <int>,
## # un_developed_or_developing_countries <chr>, iso <chr>,
## # iso_numeric <chr>, fips <chr>, country <chr>, geonameid <int>,
## # wb_region <chr>, wb_income <chr>
To join the regions table with a set of results try:
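A minimal sketch of such a join, using small toy tables in place of the packaged `regions` table and a real set of results, and assuming both share the two letter `iso` column. The values are taken from the outputs shown earlier.

```r
# Toy stand-in for the regions table (only three of its columns).
regions_toy <- data.frame(
  iso = c("AF", "AO"),
  wb_region = c("South Asia", "Sub-Saharan Africa"),
  wb_income = c("Low income", "Lower middle income"),
  stringsAsFactors = FALSE
)

# Toy stand-in for a set of place name results.
results <- data.frame(
  name = c("Kabul", "Luanda"),
  iso = c("AF", "AO"),
  stringsAsFactors = FALSE
)

# Left join: keep every result and attach regional fields where iso matches.
joined <- merge(results, regions_toy, by = "iso", all.x = TRUE, sort = FALSE)

print(joined)
```

With dplyr loaded, `dplyr::left_join(results, regions_toy, by = "iso")` expresses the same join in the style used for the featurecodes example.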