In this chapter we will use RStudio to prepare patent data for visualisation in an infographic using online software tools.

Infographics are a popular way of presenting data so that it is easy for a reader to understand without reading a long report. They are well suited to presenting summaries of data with simple messages about key findings. A good infographic can encourage the audience to read a detailed report and serves as a tool for engaging audiences during presentations of the findings of patent research.

Some patent offices already create infographics as part of their reports to policy makers and other clients. The Instituto Nacional da Propriedade Industrial (INPI) in Brazil produces a regular two-page Technology Radar (Radar Tecnológico) consisting of charts and maps that briefly summarise more detailed research on subjects such as nanotechnology in waste management. WIPO Patent Landscape Reports, which go into depth on patent activity for a particular area, are accompanied by one-page infographics that have proved very popular, such as the infographic accompanying a recent report on assistive devices.

A growing number of companies offer online infographic software services, such as infogr.am, easel.ly, piktochart.com, canva.com or venngage.com, to mention only a selection of the offerings out there. The Cool Infographics website provides a useful overview of available tools.

One feature of many of these services is that they are based on a freemium model. Creating graphics is free, but the ability to export files, and the formats available for exporting your masterpiece (e.g. high resolution or .pdf), often depend on upgrading to a monthly account at varying prices. In this chapter we test-drive infogr.am as a chart-friendly service, albeit one with export options that depend on a paid account.

This chapter is divided into two sections.

  1. In part 1 we focus on using RStudio to prepare patent data for visualisation in infographics software using the dplyr, tidyr and stringr packages. This involves addressing common problems with patent data, such as concatenated fields and white space, and creating counts of data fields.
  2. In part 2 we produce an infographic from the data using infogr.am.

Much of this chapter focuses on preparing data in R. If you have limited time you may want to skip to the infogr.am section and use the ready-made files in the Manual repository, which can be downloaded as a zip file [here].

Getting Started

To start with we need to ensure that RStudio and R for your operating system are installed by following the instructions on the RStudio website here. Do not forget to follow the link to also install R for your operating system.

When working in RStudio it is good practice to work with projects. This will keep all of the files for a project in the same folder. To create a project go to File, New Project and create a project. Call the project something like infographic. Any file you create and save for the project will now be listed under the Files tab in RStudio.

R works using packages (libraries) and, at the time of writing, there are around 7,490 of them covering a whole range of purposes. We will use just a few of them. To install a package, copy and paste the following code into the Console and press Enter.

install.packages("readr") # read in .csv files (use readxl for Excel files)
install.packages("dplyr") # wrangle data
install.packages("tidyr") # tidy data
install.packages("stringr") # work with text strings
install.packages("ggplot2") # for graphing

Packages can also be installed by selecting the Packages tab and typing the name of the package.

To load the package (library) use the following or check the tick box in the Packages pane.

library(readr) 
library(dplyr) 
library(tidyr) 
library(stringr)
library(ggplot2)

We are now ready to go.

Load a .csv file using readr

We will work with the pizza_medium_clean dataset in the online GitHub Manual repository. If downloading a file manually, remember to click on the file name and select Raw to download the actual file.

We can use the easy-to-use read_csv() function from the readr package to quickly read our pizza data directly from the GitHub repository. Note the raw at the beginning of the file address.

pizza <- read_csv("https://raw.githubusercontent.com/poldham/opensource-patent-analytics/master/2_datasets/pizza_medium_clean/pizza.csv")

readr will display a warning for the file arising from its efforts to parse publication dates on import. We will ignore this as we will not be using this field.

As an alternative to importing directly from GitHub, download the file and enter the path in quotes (you must use the full path, e.g. beginning C: on Windows). For additional arguments (controls) look up the help for the function using ?read_csv.

pizza_read <- read_csv("yourfilepath")

readr and readxl (for Excel files) are quite new. For more complex data see the Manual articles on reading .csv files in R and reading Excel files in R.

Viewing Data

We can view data in a variety of ways.

  1. In the console:
pizza
## Source: local data frame [9,996 x 31]
## 
##                                                             applicants_cleaned
##                                                                          (chr)
## 1                                                                           NA
## 2  Ventimeglia Jamie Joseph; Ventimeglia Joel Michael; Ventimeglia Thomas Jose
## 3                                             Cordova Robert; Martinez Eduardo
## 4                                                      Lazarillo De Tormes S L
## 5                                                                           NA
## 6                                                           Depoortere, Thomas
## 7                                                             Frisco Findus Ag
## 8                                                   Bicycle Tools Incorporated
## 9                                                           Castiglioni, Carlo
## 10                                                                          NA
## ..                                                                         ...
## Variables not shown: applicants_cleaned_type (chr),
##   applicants_organisations (chr), applicants_original (chr),
##   inventors_cleaned (chr), inventors_original (chr), ipc_class (chr),
##   ipc_codes (chr), ipc_names (chr), ipc_original (chr), ipc_subclass_codes
##   (chr), ipc_subclass_detail (chr), ipc_subclass_names (chr),
##   priority_country_code (chr), priority_country_code_names (chr),
##   priority_data_original (chr), priority_date (chr),
##   publication_country_code (chr), publication_country_name (chr),
##   publication_date (date), publication_date_original (chr),
##   publication_day (int), publication_month (int), publication_number
##   (chr), publication_number_espacenet_links (chr), publication_year (int),
##   title_cleaned (chr), title_nlp_cleaned (chr),
##   title_nlp_multiword_phrases (chr), title_nlp_raw (chr), title_original
##   (chr)
  2. In the Environment tab, click on the blue arrow next to the object to inspect it. Keep clicking to open a new window with the data.

  3. Use the View() command (for data.frames and tables):

View(pizza)

If possible use the View() command or environment. The difficulty with the console is that large amounts of data will simply stream past.

Identifying Types of Object

We often want to know what type of object we are working with and more details about the object so we know what to do later. Here are some of the most common commands for obtaining information about objects.

class(pizza) ## type of object
names(pizza) ## names of variables
str(pizza) ## structure of object
dim(pizza) ## dimensions of the object

The most useful command in this list is str() because this allows us to access the structure of the object and see its type.

str(pizza, max.level = 1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    9996 obs. of  31 variables:
##  $ applicants_cleaned                : chr  NA "Ventimeglia Jamie Joseph; Ventimeglia Joel Michael; Ventimeglia Thomas Joseph" "Cordova Robert; Martinez Eduardo" "Lazarillo De Tormes S L" ...
##  $ applicants_cleaned_type           : chr  "People" "People" "People" "Corporate" ...
##  $ applicants_organisations          : chr  NA NA NA "Lazarillo De Tormes S L" ...
##  $ applicants_original               : chr  NA "Ventimeglia Jamie Joseph;Ventimeglia Thomas Joseph;Ventimeglia Joel Michael" "Cordova Robert;Martinez Eduardo" "LAZARILLO DE TORMES S L" ...
##  $ inventors_cleaned                 : chr  "Sanchez Zarzoso, Maria Isabel" "Ventimeglia Jamie Joseph; Ventimeglia Joel Michael; Ventimeglia Thomas Joseph" "Cordova Robert; Martinez Eduardo" "Sanchez Zarzoso, Maria Isabel" ...
##  $ inventors_original                : chr  "Sanchez Zarzoso Maria Isabel" "Ventimeglia Jamie Joseph;Ventimeglia Thomas Joseph;Ventimeglia Joel Michael" "Cordova Robert;Martinez Eduardo" "Sanchez Zarzoso Maria Isabel" ...
##  $ ipc_class                         : chr  "A21: Baking; A23: Foods Or Foodstuffs" "A21: Baking" "A21: Baking" "A21: Baking; A23: Foods Or Foodstuffs" ...
##  $ ipc_codes                         : chr  "A21D 13/00; A23L 1/16" "A21B 3/13" "A21C 15/04" "A21D 13/00; A23L 1/16" ...
##  $ ipc_names                         : chr  "A21D 13/00: Finished or partly finished bakery products; A23L 1/16: Foods or foodstuffs; Their preparation or treatment -> cont"| __truncated__ "A21B 3/13: Parts or accessories of ovens -> Baking-tins; Baking forms" "A21C 15/04: Apparatus for handling baked articles -> Cutting or slicing machines or devices specially adapted for baked article"| __truncated__ "A21D 13/00: Finished or partly finished bakery products; A23L 1/16: Foods or foodstuffs; Their preparation or treatment -> cont"| __truncated__ ...
##  $ ipc_original                      : chr  "A21D 13/00;A21D 13/00;A23L 1/16;A23L 1/16" "A21B 3/13" "A21C 15/04" "A21D 13/00;A23L 1/16" ...
##  $ ipc_subclass_codes                : chr  "A21D; A23L" "A21B" "A21C" "A21D; A23L" ...
##  $ ipc_subclass_detail               : chr  "A21D: Treatment, E.G. Preservation, Of Flour Or Dough For Baking, E.G. By Addition Of Materials; A23L: Foods, Foodstuffs, Or No"| __truncated__ "A21B: Bakers' Ovens" "A21C: Machines Or Equipment For Making Or Processing Doughs" "A21D: Treatment, E.G. Preservation, Of Flour Or Dough For Baking, E.G. By Addition Of Materials; A23L: Foods, Foodstuffs, Or No"| __truncated__ ...
##  $ ipc_subclass_names                : chr  "A21D: Baking; Equipment For Making Or Processing Doughs; Doughs For Baking -> Treatment, E.G. Preservation, Of Flour Or Dough F"| __truncated__ "A21B: Baking; Equipment For Making Or Processing Doughs; Doughs For Baking -> Bakers' Ovens; Machines Or Equipment For Baking" "A21C: Baking; Equipment For Making Or Processing Doughs; Doughs For Baking -> Machines Or Equipment For Making Or Processing Do"| __truncated__ "A21D: Baking; Equipment For Making Or Processing Doughs; Doughs For Baking -> Treatment, E.G. Preservation, Of Flour Or Dough F"| __truncated__ ...
##  $ priority_country_code             : chr  "ES" NA NA "ES" ...
##  $ priority_country_code_names       : chr  "Spain" NA NA "Spain" ...
##  $ priority_data_original            : chr  "200402236U 2004-10-01T23:59:59.000Z ES" NA NA "200402236U 2004-10-01T23:59:59.000Z ES, 2005070132 2005-09-23T23:59:59.000Z ES" ...
##  $ priority_date                     : chr  "2004-10-01T23:59:59.000Z" NA NA "2004-10-01T23:59:59.000Z; 2005-09-23T23:59:59.000Z" ...
##  $ publication_country_code          : chr  "US" "US" "US" "EP" ...
##  $ publication_country_name          : chr  "United States of America" "United States of America" "United States of America" "European Patent Office" ...
##  $ publication_date                  : Date, format: "0021-08-09" "0024-01-14" ...
##  $ publication_date_original         : chr  "21.08.2009" "24.01.2014" "20.09.2013" "23.08.2007" ...
##  $ publication_day                   : int  21 24 20 23 7 22 8 5 16 8 ...
##  $ publication_month                 : int  8 1 9 8 2 2 2 7 5 1 ...
##  $ publication_number                : chr  "US20090208610" "US20140020570" "US20130239763" "EP1820402" ...
##  $ publication_number_espacenet_links: chr  "http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=US2009208610" "http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=US2014020570" "http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=US2013239763" "http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=EP1820402" ...
##  $ publication_year                  : int  2009 2014 2013 2007 2003 2002 1992 1995 2008 2010 ...
##  $ title_cleaned                     : chr  "Pizza" "Pizza Pan" "Pizza Cutter" "Improved Pizza" ...
##  $ title_nlp_cleaned                 : chr  "pizza" "pizza Pan" "pizza Cutter" "improved Pizza" ...
##  $ title_nlp_multiword_phrases       : chr  NA "pizza Pan" "pizza Cutter" "improved Pizza" ...
##  $ title_nlp_raw                     : chr  "pizza" "pizza Pan" "pizza Cutter" "improved Pizza" ...
##  $ title_original                    : chr  "PIZZA" "Pizza Pan" "Pizza Cutter" "IMPROVED PIZZA" ...
##  - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 2981 obs. of  4 variables:

str() is particularly useful because it lets us inspect the structure of the object and see the names of the fields (vectors) and their types. Most patent fields are character vectors, with dates and years appearing as integers.

Working with Data

We will often want to select aspects of our data to focus on a specific set of columns or to create a graph. We might also want to add information, notably numeric counts.

The dplyr package provides a set of very handy functions for selecting, adding and counting data. The tidyr and stringr packages are sister packages that contain a range of other useful functions for working with our data. We have covered some of these in other chapters on graphing using R but will go through them quickly and then pull them together into a function that we can use across our dataset.

Select

In this case we will start by using the select() function to limit the data to specific columns. We can refer to columns by name or by numeric position (best for a large number of columns, e.g. 1:31). In dplyr, unlike much of base R, column names do not need to be wrapped in quotes.

pizza_number <- select(pizza, publication_number, publication_year)

We now have a new data.frame that contains two columns: one with the publication number and one with the year. Note that we have created a new object called pizza_number using <- and that inside select() we have named our original data followed by the columns we want. A fundamental feature of select() is that it drops any columns we do not name, so it is best to create a new object with <- if you want to keep your original data for later work.
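The column-dropping behaviour of select() is easy to check on a small example. The toy data frame below is illustrative only (it is not the pizza dataset):

```r
library(dplyr)

# A toy data frame standing in for the patent data (illustrative only)
df <- data.frame(publication_number = c("US1", "US2", "US3"),
                 publication_year = c(2010, 2011, 2011),
                 title_original = c("Pizza Pan", "Pizza Cutter", "Pizza Oven"),
                 stringsAsFactors = FALSE)

df_small <- select(df, publication_number, publication_year)
names(df_small) # the unnamed title_original column has been dropped
```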

Adding data with mutate()

mutate() is a dplyr function that allows us to add new columns based on existing data in our data frame, for example to perform a calculation. Patent data normally lacks a numeric field to use for counts, so we create one by assigning the value 1 to each record in a new column n.

pizza_number <- mutate(pizza_number, n = 1)

When we view pizza_number we now have a value of 1 in the column n for each publication number. Note that in patent data a priority, application, publication or family number may occur multiple times, so we would normally want to check for duplicates and reduce the dataset to distinct records. To count the unique values we could use n_distinct(pizza_number$publication_number) from dplyr or length(unique(pizza_number$publication_number)) in base R. Because the publication numbers in this dataset are unique we can proceed.
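To see how such a duplicate check works, here is a small sketch using a made-up vector of publication numbers (not the real data):

```r
library(dplyr)

# A made-up vector where one publication number repeats
pubs <- c("US1", "US2", "US2", "US3")

length(pubs)          # 4 entries in total
n_distinct(pubs)      # 3 unique numbers (dplyr)
length(unique(pubs))  # 3, the base R equivalent
```

If the two counts differ, the data contains duplicates that would inflate any totals built from the n column.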

Counting data using count()

At the moment we have multiple instances of the same year (one for each patent publication in that year). We now want to calculate how many of our documents were published in each year. To do that we will use the dplyr function count() on publication_year, adding wt = n (for weight) so that the values in n are summed.

pizza_total <- count(pizza_number, publication_year, wt = n) 
pizza_total
## Source: local data frame [58 x 2]
## 
##    publication_year     n
##               (int) (dbl)
## 1              1940     1
## 2              1954     1
## 3              1956     1
## 4              1957     1
## 5              1959     1
## 6              1962     1
## 7              1964     2
## 8              1966     1
## 9              1967     1
## 10             1968     8
## ..              ...   ...

When we now examine pizza_total, we will see the publication year and a summed value for the records in that year.

This raises the question of how we know that R has calculated the count correctly. We already know that there are 9,996 records in the pizza dataset. To check that our count is correct we can simply use sum() and select the column we want to sum using $.

sum(pizza_total$n)
## [1] 9996

So, all is good and we can move on. The $ sign is one of the main ways of subsetting, telling R that we want to work with a specific column (the others are `[` and `[[`).
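The differences between the three subsetting forms can be seen with a small illustrative data frame (base R only, no packages needed):

```r
# A small illustrative data frame (not the pizza data)
totals <- data.frame(pubyear = c(2010, 2011), n = c(5, 7))

totals$n       # the n column as a plain numeric vector
totals[["n"]]  # the same vector, using double brackets
totals["n"]    # a one-column data frame, using single brackets
sum(totals$n)  # 12
```

Note that single brackets return another data frame, while $ and double brackets return the underlying vector, which is what functions like sum() expect.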

Renaming a field using rename()

Next we will use rename() from dplyr to rename the fields. rename() follows the pattern new_name = old_name, and in dplyr neither name needs to be wrapped in quote marks.

pizza_total <- rename(pizza_total, pubyear = publication_year, publications = n)

Make a quickplot with qplot()

Using the qplot() function in ggplot2 we can now draw a quick line graph. Note that qplot() is unusual in R because the data (pizza_total) appears after the x and y values. We specify that we want a line using geom = "line" (if geom is left out the result is a scatter plot). This will give us an idea of what our plot might look like in our infographic and of actions we might want to take on the data.

qplot(x = pubyear, y = publications, data = pizza_total, geom = "line")

The plot reveals a data cliff in recent years. This normally reflects a lack of data for the last 2-3 years as recent documents feed through the system en route to publication.

It is a good idea to remove the data cliff by cutting the data two to three years before the present. In some cases two years is sufficient, but three years is a good rule of thumb.

We also have a long tail with limited data from 1940 until the late 1970s. Depending on the purposes of our analysis we might want to keep this data (for historical analysis) or to focus on a more recent period.

We will limit our data to specific values using the dplyr function filter().

For more details on graphing in R see the qplot and ggplot2 chapters of the Manual.

Filter data using filter()

In contrast with select(), which works with columns, filter() in dplyr works with rows. In this case we need to filter on the values in the pubyear column. To remove the data prior to 1990 we use the greater-than-or-equal-to operator >=, and to remove values after 2012 we use the less-than-or-equal-to operator <=.

One strength of filter() is that it is easy to filter on multiple conditions in the same expression. Filtering will also remove the 30 records where the year is recorded as NA (Not Available).

We will then write this file to disk using the simple write_csv() from readr. To use write_csv() we first name our data (pizza_total) and then provide a file name with a .csv extension. Here, and in the other examples below, we have used a descriptive file name, bearing in mind that Windows systems have limitations on the length and type of characters that can be used in file names.

pizza_total <- filter(pizza_total, pubyear >= 1990, pubyear <= 2012)
write_csv(pizza_total, "pizza_total_1990_2012.csv")
pizza_total
## Source: local data frame [23 x 2]
## 
##    pubyear publications
##      (int)        (dbl)
## 1     1990          139
## 2     1991          154
## 3     1992          212
## 4     1993          201
## 5     1994          162
## 6     1995          173
## 7     1996          180
## 8     1997          186
## 9     1998          212
## 10    1999          290
## ..     ...          ...

When we print pizza_total to the console we will see that the data now covers the period 1990-2012. When using filter() on values in this way it is important to remember to apply this filter to any subsequent operations on the data (such as applicants) so that it matches the same data period.
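As a sketch of what reusing the filter looks like, suppose we later build an applicant table; the made-up data below shows the same year conditions being applied again so both tables cover the same period:

```r
library(dplyr)

# Made-up applicant records (illustrative only, not the pizza data)
apps <- data.frame(applicant = c("Alpha Foods", "Beta Ovens", "Gamma Tools"),
                   pubyear = c(1985, 1995, 2014),
                   stringsAsFactors = FALSE)

# Reapply the same period used for pizza_total so the tables match
apps_1990_2012 <- filter(apps, pubyear >= 1990, pubyear <= 2012)
apps_1990_2012$applicant # only "Beta Ovens" falls inside 1990-2012
```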

To see our .csv file we can head over to the Files tab and, assuming that we have created a project, the file will now appear in the list of project files. Clicking on the file name will display the raw unformatted data in RStudio.

Simplify code using pipes %>%

So far we have handled the code one line at a time. But, one of the great strengths of using a programming language is that we can run multiple lines of code together. There are two basic ways that we can do this.

We will create a new temporary object df to demonstrate this.

  1. The standard way
df <- select(pizza, publication_number, publication_year)
df <- mutate(df, n = 1)
df <- count(df, publication_year, wt = n)
df <- rename(df, pubyear = publication_year, publications = n)
df <- filter(df, pubyear >= 1990, pubyear <= 2012) 
qplot(x = pubyear, y = publications, data = df, geom = "line")