A table in human readable format PART II

A table in human readable format Part II

Last time i showed you how easily you can extract information from EPPO SQLite database and put it into easily understandable table. Now, I will show you how to expand this table with data obtained from EPPO REST API. But first lets recall one thing we did last time. We defined variable EPPOcodes, which we will use in this part.

#query sql for codeid and eppocode
sqlEppo %>% 
  dbGetQuery(paste0('SELECT codeid, eppocode FROM t_codes WHERE codeid IN (',
                    paste0(codeIdDB, collapse = ', '),')')) %>% 
  arrange(eppocode) -> EPPOcodes

We also need to set some constant variables to build up list of URL links for our REST queries.

#eppo Url's constants
apiEppoUrl <- 'https://data.eppo.int/api/rest/1.0/taxon/'
apiType <- c('/categorization', '/hosts', '/names', '/taxonomy')
#apiToken <- <your token from EPPO site>

We will merge EPPO pest codes with some words to build three vectors of URLs to query the REST API.

library(dplyr);library(tidyr); library(RCurl); library(jsonlite)

#define variables
hostsLists <- paste0(EPPOcodes[,2],'List')
catLists <- paste0(EPPOcodes[,2],'Cat')
taxoLists <- paste0(EPPOcodes[,2],'Taxo')
hostUrls <-paste0(apiEppoUrl, EPPOcodes[,2], apiType[2], apiToken)
catUrls <-paste0(apiEppoUrl, EPPOcodes[,2], apiType[1], apiToken)
taxoUrls <-paste0(apiEppoUrl, EPPOcodes[,2], apiType[4], apiToken)

Next we will use pass this links to for loop and assign() JSON to variables which names are stored in hostsList. This step will be later repeated in the beginning of creating categorization and taxonomy columns. Further, we need to create address table, containing lengths of JSON queries which will reflect how many subtables are in each of the queries.

# hosts download from EPPO api, assign to proper lists/df and create addres df
for (i in 1:length(hostsLists)) {
  assign(hostsLists[i], fromJSON(getURL(hostUrls[i])))
}

EPPOcodes %>% 
  mutate(listLength = sapply(hostsLists, function(x) length(get(x))),
         hostsLists = paste0(eppocode, 'List')) %>%
  filter(listLength != 0) -> addresTable

In following step we will assign result of each JSON query to data frame named as in address table. Than, a bit tricky part starts. We want to have all the subtables of our query in one data frame, so we use bind_rows() which automatically merges them row-wise. We also want to add eppocode variable from addresTable to properly address each of the subtables in new data frame. However, previous version of code worked just fine, now, for some unknown reasons, it started to treat new column with table name as a factor class. Thus we need to add stringAsFactors = F to cbind() function. Last part of this step nests the full_name and labelclass variables.

# collapse lists into df, than merge by row with new collumn = pest eppo code
for (i in 1:length(addresTable$hostsLists)) {
  assign(addresTable$hostsLists[i], 
         bind_rows(cbind(bind_rows(get(addresTable$hostsLists[i])),
                         pestEppocode = addresTable$eppocode[i], stringsAsFactors = F)))  
}

bind_rows(mget(addresTable$hostsLists)) %>% 
  select(labelclass, full_name, eppocode = pestEppocode) %>% 
  nest(labelclass, full_name) -> nestHosts

The following code groups data by labelclass. Than using mutate() we define new column with hosts names concatenated. For each label class in first column, cells in temp_names column are repeated, thus we use distinct() function to filter unique cells from third column. In next step we merge the labelclass with hosts names from temp_names column, than we transmute this column by concatenating all hosts with their label classes into one cell and assigning results to our hostIndex variable.

# collapse row per groups per eppocode
hostIndex <- seq(length(nestHosts$data))
for (i in 1:length(nestHosts$data)) {
  nestHosts$data[[i]] %>% 
    group_by(labelclass) %>% 
    mutate(temp_names = paste(full_name, collapse = ', ')) %>%
    distinct(temp_names) %>% 
    mutate(temp_names = paste(labelclass, temp_names, sep = ': ')) %>% 
    ungroup() %>%
    select(temp_names) %>%
    transmute(alt_names = paste(temp_names, collapse = '; ')) %>%
    distinct() %>% 
    unlist()-> hostIndex[i]
}

To make column with categorization, we first need to download stuff.

# download and assign pest categorization
for (i in 1:length(catLists)) {
  assign(catLists[i], fromJSON(getURL(catUrls[i])))
}

Next step is similar to what we did in hosts part. Executing following snippet will result in creating table with lengths of list of particular query. This time however we add two columns using mutate() function - EPPO and EU both filled with Absent.

EPPOcodes %>% 
  mutate(listLength = sapply(catLists, 
                             function(x) length(get(x)), simplify = 'vector'),
         catLists = paste0(eppocode, 'Cat')) %>%
  filter(listLength != 0) %>% 
  mutate(EPPO = 'Absent', EU = 'Absent') -> RPPOTable

This one might seem a bit complex. In fact, it’s a simple snippet using for loop and conditional if ... else statements to check which of the pest in RPPOTable is on EPPO list and extract proper values for EPPO - list type, years of addition, deletion and transfer. Extracted strings are pasted into EPPO column of our table. The same procedure is repeated for EU.

for (i in 1:length(RPPOTable$EPPO)) {
  if (sum(get(RPPOTable$catLists[[i]])$country == 'EPPO')) {
    
    RPPOTable[i, 5] <- paste(get(RPPOTable$catLists[[i]])[which(get(
                                 RPPOTable$catLists[[i]])$country == 'EPPO'),
                                 c(5:8)],
                             collapse = ' ')
  }
}

for (i in 1:length(RPPOTable$EU)) {
  if (sum(get(RPPOTable$catLists[[i]])$country == 'EU')) {
    
    RPPOTable[i, 6] <- paste(get(RPPOTable$catLists[[i]])[which(get(
      RPPOTable$catLists[[i]])$country == 'EU'),
      c(5:8)],
      collapse = ' ')
  }
}

In the end, we need to bind all the queries into one table, filter out rows containing EU or EPPO value (since they are in separate columns), choose only columns that are of our interest and nest everything by catLists. However, in contrast to what we did previously, we use .id option in bind_rows() function, which automatically adds column with name of data frame identifier. Finally we use sequence of grouping, mutating and transmuting functions to concatenate the strings of country, list type, and years into one cell per each pest.

bind_rows(mget(catLists), .id = 'catLists') %>%
  filter(country != 'EU') %>%
  filter(country != 'EPPO') %>%
  select(-isocode, -qlist, -yr_del, -yr_trans) %>% 
  nest(nomcontinent, country, qlistlabel, yr_add) -> nestCat

catIndex <- seq(length(nestCat$data))
for (i in 1:length(nestCat$data)) {
nestCat$data[[i]] %>% 
  group_by(nomcontinent) %>% 
  mutate(temp_names = paste(country, qlistlabel, yr_add, collapse = '; ')) %>%
  distinct(temp_names) %>% 
  mutate(temp_names = paste(nomcontinent, temp_names, sep = ': ')) %>% 
  ungroup() %>%
  select(temp_names) %>%
  transmute(alt_names = paste(temp_names, collapse = '/ ')) %>%
  distinct()  %>% 
  unlist() -> catIndex[i]
}

The last part of the code is fairly easy. The one thing you should know is that in taxonomy tables in 3rd position we can have Bacteria, Animalia, Viruses and viroids, Fungi. In case particular pest is an animal, we need to go further and extract sub-level - Arthropoda or Nematoda. Code below is very easy, so you will grasp the idea in no time. You can also play around and try to refactor the code and use ifelse statement instead of traditional if ... else.

# download and assign pest taxonom
for (i in 1:length(taxoLists)) {
  assign(taxoLists[i], fromJSON(getURL(taxoUrls[i])))
}

EPPOcodes %>% 
  mutate(listLength = sapply(taxoLists, 
                             function(x) length(get(x)), simplify = 'vector'),
         taxoLists,
         taxonomy = NA) %>%
  filter(listLength != 0) -> taxoTable

taxoMain <- as.data.frame(taxoTable)
for (i in 1:length(taxoTable$taxoLists)) {
  if (get(taxoTable$taxoLists[i])[1,3] != 'Animalia') {
  taxoMain[i,5] <- get(taxoTable$taxoLists[i])[1,3]
  } else {
   taxoMain[i,5] <- get(taxoTable$taxoLists[i])[2,3]
  }
}

This time I won’t show you how the human readable table should look at this point. I will do it in the final post. One good advice for you is to check the code step by step. If some transformations or functions seams to complex or complicated, you can divide it and check what each function does to data. Hope you enjoyed this part.

Cheers!

A table in human readable format PART II

A table in human readable format Part II

Michal J. Czyz

References in LaTeX table

Creating package from dirty code

Generic pest spread models

A table in human readable format III

A table in human readable format PART II

A table in human readable format PART I

Posts