A table in human readable format

OK, time to run new blog have come. So, for the first post, I decided to write about something more meaningful than just ‘Look on what have I found on the Net…’. And first thing that came into my mind was ‘The Table’. It is actually thing that I made recently, to solve some time consuming issue in my work. I needed to merge data from different data structures in R into one table in human readable format. In my opinion there are three reasons why you will enjoy this example:

  1. It is a Real Life Story how to use R to speed up work.
  2. All the Real Life Data you will need to reproduce this example is provided free by well known Regional Plant Protection Organization (European and Mediterranean Plant Protection Organization — EPPO).
  3. To make the table, you will need to deal with SQLite (or any other SQL format provided by EPPO), REST API and web scraping.

Sounds COOL? I know… But there is one more thing that you should know. This is the first working version of code. So, it is not the best code in the world (it’s just a tribute). On the other hand you have an opportunity to train refactoring R code or find better solutions to make table in human readable format.

Since the code is quite long (it took around 390 lines) and it deals with 3 different ways to obtain and manage data, I decided that the solution to the problem will be separated in 3 parts:

  • Background and SQL part
  • API part
  • web scraping and put all together

Background

From time to time, I need to make human readable table with some data on pests. If the report is short (2 or 3 species), there is no problem. It takes maybe 15 minutes to copy/paste data from web and arrange it in columns and rows. The real problem starts when I need to make table for larger number of pests (lats time it was around 60 species). It can take one or two days of horribly dull work for medium-trained chimpanzee. Here’s how table look like (it’s just a sample of all information available):

pest name synonym categorization hosts distribution taxonomy
Xylella fastidosa Grapevine Pierce’s disease agent A1 List Citrus sinensis North America Bacteria

The main obstacle in making this table is, that there is no single data base from which you can extract proper records. Pest names and their synonyms are extracted from SQL database; categorization, hosts and taxonomy is obtained from REST API; to obtain distribution data you need to make some simple web-scraping. In this part I will discus how to import data we need from SQL.

EPPO database

First of all you need to sign in to EPPO database (free of course). After short registration you will be able to download SQL database in format you prefer. Here I use SQLite. You will also obtain your auth code to use with REST API.

EPPO database

EPPO database

Meanwhile we start with code which will allow us to obtain preferred names of pests from SQLite database, you should install following libraries:

  • dplyr and tidyr for data manipulation
  • DBI and RSQLite to connect and extract data from SQL
  • RCurl which will allow us to connect and send queries through REST API
  • jsonlite to deal with json files downloaded from REST API.

In this part you will need to load first four libraries from list above. Than we should define a small sample of species for which we want to make table. For this example, I think that it would be better if we try with small batch of species, preferably one from each taxonomic groups: Xylella fastidiosa, Emerald ash borer, Meloidogyne fallax, Cercospora angolensis, cadang-cadang. Than we will find proper records in EPPO database.

library('dplyr'); library('tidyr'); library('DBI'); library('RSQLite');
library('magrittr')

# define pest list and connect to SQL database
pestTest <- c('Xylella fastidiosa',
             'Emerald ash borer',
             'Meloidogyne fallax',
             'Cercospora angolensis',
             'cadang-cadang')
sqlEppo <- dbConnect(RSQLite::SQLite(),
                     dbname = 'eppocodes.sqlite')

# match names from list with sql DB
sqlEppo %>% 
  dbGetQuery(paste0('SELECT codeid, fullname, codelang, preferred 
                    FROM t_names WHERE fullname LIKE ',
                    paste(paste0("'%", pestTest, "%'"),
                    collapse = " OR fullname LIKE "))) -> pestInDB

OK. So what happened here? The syntax of any kind of SQL is very different of R syntax. You cannot also use simple regex to extract what you want. Even worse, in this case using pure SQLite will result in horribly long query, with each species separated from next one with OR fullname LIKE. Thanks, o Mighty R Core Team for giving us R. With atomic paste0() function we can combine first standard part of query SELECT some_collumns FROM table_name LIKE with vector of pest names surrounded with% signs separated by OR fullname LIKE.

# first use unique codeid variable values, than query the sql DB
# make table of correct names, synonyms and common names in other languages
codeIdDB <- unique(pestInDB$codeid)
 
sqlEppo %>% 
  dbGetQuery(paste0('SELECT codeid, fullname, preferred, codelang, status 
                    FROM t_names WHERE codeid IN (',
                    paste(codeIdDB, collapse = ', '), ')')) %>%
  filter(status == 'A') -> pestCorNames

#query sql for codeid and eppocode
sqlEppo %>% 
  dbGetQuery(paste0('SELECT codeid, eppocode FROM t_codes WHERE codeid IN (',
                    paste0(codeIdDB, collapse = ', '),')')) %>% 
  arrange(eppocode) -> EPPOcodes

When we are exploring pest database we often use names which are not preferred but rather synonyms or common names. We also often use only genus name (for instance Liriomyza) instead of full species name (Liriomyza huidobrensis). In those cases result of our query will be a mess with many synonyms or without preferred species names. However all the synonym and common names have the same unique ID as the preferred species name. Thus we filter our result from first code chunk with unique() function. Than we make similar query as we did in first step, but this time instead of using LIKE and list of names, we use IN and unique IDs. After using filter on status we obtained all the preferred, synonym and common names of pest that we are interested in. In the next query we made a table of EPPO codes and their IDs which we well use in the next chapter.

#whole working table of names
pestCorNames %>% 
  filter(preferred == 1) %>% 
  left_join(pestCorNames[which(pestCorNames$preferred == 0),],
            by = 'codeid') %>%
  select('codeid'         = 1,
         'Preferred_name' = 2,
         'preferred.x'    = 3,
         -matches('codelang.x'),
         -matches('status.x'),
         'Other_names'    = 6,
         'preferred.y'    = 7,
         'language'       = 8,
         -matches('status.y')) %>%
  mutate(Name_type = ifelse(language == 'la', 'Synonym', 'Other languages')) %>%
  mutate(Other_names = replace(Other_names, is.na(Other_names), 'none'),
         language    = replace(language, is.na(language), 'none'),
         Name_type   = replace(Name_type, is.na(Name_type), 'Preferred')) %>%
# using which(is.na()) theoretically should cut time
  arrange(Preferred_name, desc(Name_type), Other_names) -> wholeTB
wholeTB %>% 
  select(codeid, Other_names, Name_type) %>% 
  filter(Name_type != 'Preferred') %>%
  nest(Name_type, Other_names) -> altNamesNest

In this step we made some operations on table structure. First we separated rows with preferred name of our pests. Than we used left_join to match them with their synonyms and common names in other languages - one per row. Next, we piped our result to select() function to reduce unused columns from table. Than we use two properties of mutate() function. First we added new column, in which (using ternary condition) we assigned Synonym value to all Latin names and Other languages to the rest. Second use of mutate() needs some explanation. Since we joined our table with other than preferred names to table with preferred names, we had some rows with values only in Preferred_name column. In other rows we had NAs. Thus we changed this NAs into none and Preferred. To give a little order to what we did so far, we piped everything to arrange() function. Eventually we nest our other names by codeid into tibble, which will allow us to collapse all those names into on continuous string in following step.

# collapse row groups into one string per group
tbIndex <- seq(length(altNamesNest$data))
for (i in 1:length(altNamesNest$data)) {
  altNamesNest$data[[i]] %>% 
    group_by(Name_type) %>% 
    mutate(temp_names = paste(Other_names, collapse = ', ')) %>%
    distinct(temp_names) %>% 
    mutate(temp_names = paste(Name_type, temp_names, sep = ': ')) %>% 
    ungroup() %>%
    select(temp_names) %>%
    transmute(alt_names = paste(temp_names[1], temp_names[2], sep = '; ')) %>%
    distinct() -> tbIndex[i]
}

tbIndex <- gsub('; NA', '', tbIndex)

So, how this simple for loop works? First it makes groups according to Name_type within each codeid nest. Than it creates new column temp_names, in which it pastes all the names from Other_names column, separated with comma. For species that have more than one synonym or common name, there will be repeated rows in temp_names column. We easily get rid of this repetitions using distinct() function. The side effect of using distinct function is that it drops all the columns which are not parsed as arguments, excluding grouping column (in our case Name_type). Next operations in this code chunk are similar to described above, so you will figure out what happens easily. Our work is done for now, however if you want to check how our table in human readable format looks like, you can use below code to obtain table which looks like the one in the bottom.

wholeTB %>%
  distinct(codeid, Preferred_name) %>%
  left_join(altNamesNest %>%
              mutate(data = tbIndex) %>%
              rename(Alternative_names = data), by = 'codeid')
codeid Preferred_name Alternative_names
66001 Agrilus planipennis Synonym: Agrilus feretrius, Agrilus marcopoli, Agrilus marcopoli ulmi; Other languages: agrile du frêne, emerald ash borer
64718 Coconut cadang-cadang viroid Synonym: CCCVd, Coconut cadang-cadang cocadviroid, Palm cadang-cadang viroid; Other languages: cadang cadang, yellow mottling of palms
63576 Meloidogyne fallax Other languages: bedrieglijk maiswortelknobbel nematode, false Columbia root-knot nematode
410 Pseudocercospora angolensis Synonym: Cercospora angolensis, Phaeoramularia angolensis, Pseudophaeoramularia angolensis; Other languages: Blattfleckenkrankheit: Zitrus, cercosporiose des agrumes, fruit spot of citrus, leaf spot of citrus
3243 Xylella fastidiosa Synonym: Grapevine Pierce’s disease agent, Xylella fastidiosa subsp. fastidiosa, Xylella fastidiosa subsp. piercei; Other languages: Anaheim disease, California vine disease, dwarf disease of alfalfa, dwarf disease of lucerne, enanismo de la alfalfa, enfermedad de Pierce, leaf scorch of almond, leaf scorch of maple, maladie de Pierce, Pierce’s disease of grapevine
92965 Xylella fastidiosa subsp. fastidiosa
70536 Xylella fastidiosa subsp. multiplex Other languages: échaudure des feuilles du prunier, leaf scald of plum, leaf scorch of elm, leaf scorch of mulberry, leaf scorch of oak, leaf scorch of plane, peach phony agent, peach virus 4, phony disease of peach, prunus virus 3
70537 Xylella fastidiosa subsp. pauca Other languages: chlorose variégée des agrumes, citrus variegated chlorosis agent, variegated chlorosis of citrus
70538 Xylella fastidiosa subsp. sandyi Other languages: leaf scald of oleander