The goal of the dronesr
R data package is to provide access to datasets of scientific and patent literature on drone technology for use in WIPO training workshops in patent analytics. The datasets are exclusively intended for use in training and are deliberately noisy.
The package consists of a set of datasets for the scientific literature (lit) and a set of datasets for patents (pat) that were downloaded from the open access Lens Scholarly and Patent Databases using the Lens Scholarly and Patent APIs.
The datasets are available in public collections on the Lens where you can interact with the data. We encourage you to use the public collections to explore the data.
- The patent collection https://www.lens.org/lens/search/patent/list?collectionId=199031
- The literature collection https://www.lens.org/lens/search/scholar/list?collectionId=199039
To understand how the dronesr
datasets were created please read this article.
We recommend that you sign up for a free account with the Lens.
If using data in this packages please ensure that you provide a link to the Lens and credit the Lens for the data. Please also preserve the lens_id
field if creating any secondary datasets.
Installation
The dronesr
data package is not on CRAN. You can install dronesr
from GitHub with:
# install.packages("devtools")
devtools::install_github("poldham/dronesr")
Downloading Datasets
All datasets are included in the package in a compressed R format and can be access by simply loading library(dronesr)
. However, if you want to use the data outside R in Python, Tableau or other tools you are likely to want to use csv format versions of the data.
csv copies of the datasets are accessible in csv format through an Open Science Framework project folder at https://osf.io/jr87e/.
For convenience the csv files can also be downloaded directly in the package using the drones_download()
function as described below. However,
Using dronesr
The package consists two core datasets:
-
pat
is a patent dataset with 49417 documents that mention drones. This is the core dataset with all other patent datasets linking to it. -
lit
is a dataset of scientific literature mentioning drones with 45266 records.
The other datasets in the package have been split from the main sets. In the case of the patent (pat) data the main split tables are:
- applicants_cleaned
- inventors_cleaned
- ipc (international patent classification codes or symbols for technology areas)
- patent abstract, patent title and pat title nouns for text analysis
For the literature (lit) set the main split tables at present focus on text analysis:
- lit title
- lit abstract
- lit title nouns
As with the scientific literature patent data operates a citation system. Citations take three forms and are included with the package.
- pat_cited. Patents cited in the core patent set (pat). These are back citations to patent documents constituting prior art that shape what can be claimed in the core set.
- pat_citing. These are later filings that cite the core (pat) set. These forward citations are documents that are shaped by the prior art in the core set.
- npl. NPL stands for non-patent literature which means citations to literature, websites, manuals etc that are not patent documents. In
dronesr
the npl dataset consists of scientific literature that is cited in core patent set.
Other linked datasets will be added in due course.
Using drones_download()
The package includes one function that will download the literature and patent datasets from the Open Science Framework project repository where the data is stored. If you will not be using R then you can download the csv files as follows
- literature https://osf.io/96w8j/. Click on the csv icon and then
Download as zip
. - patents https://osf.io/e6h8x/. Click on the csv icon and then
Download as zip
.
In R the sets will download to a destination that you provide. By default the data will download into your working environment in a folder called dronesr_data
. The folder will be created if it does not already exist. Under the hood the function uses the osfr
package and drones_download()
defaults to overwriting any existing dronesr_data
. See the documentation for other options.
library(dronesr)
# download everything
drones_download(data = "all", dest = "dronesr_data")
# download literature only
drones_download(data = "lit", dest = "dronesr_data")
# download patent data only
drones_download(data = "pat", dest = "dronesr_data")
That’s all for now.