Creating a package from dirty code

Previously I wrote a post on creating nice, human-readable tables from the EPPO database. However, the code was full of fragile parts, the formatting was awful, and it needed some refactoring. As I wanted to push myself, I decided it was a good time to try making my first package. And this means a lot of work. Don’t get me wrong: making a package from what I wrote previously would probably take less than a few hours. What I mean is that making a well-documented, fully tested, robust R package takes time. But it is worth it.

Writing about the whole process and documenting every step would be a bit too much in my opinion, so I decided to describe here only the most important things I’ve learned (or not learned) along the way. This is my first package and some things (like TDD) are quite new to me, so I think others might struggle with similar problems.

If you look at my GitHub page, you will see that the package builds correctly on Travis CI and all the main functions work. Nonetheless, there is still much work to do. So, as always, please feel free to contribute.

TDD in R

First of all, the famous Test-Driven Development. Writing tests in R is really nice. It forces you to think about all the stuff that can break. You need to anticipate the outcome before you even start declaring the function. But sometimes it is hard to figure out what the outcome should be. I had this problem especially with functions that connect to an API. Their output can (and probably will) change over time, so hardcoding those values to test against is pointless. On the other hand, comparing the function’s output with the result of some equivalent workflow might end up as a self-fulfilling prophecy, because the test just repeats the logic it is supposed to check.
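One way around this, sketched below, is to test the structure of the result rather than its values. This is only an illustration under assumptions: fake_eppo_names() is a hypothetical stand-in for a function that calls the API, and its column names are made up. The expectations themselves come from the testthat package.

```r
library(testthat)

# Hypothetical stand-in for a function that queries the API.
# In a real test this would be the package function itself.
fake_eppo_names <- function(codes) {
  data.frame(
    eppocode = codes,
    fullname = paste("name of", codes),
    stringsAsFactors = FALSE
  )
}

test_that("result has the expected structure", {
  res <- fake_eppo_names(c("CODE1", "CODE2"))
  # Check shape and types, not the values, which may change over time.
  expect_s3_class(res, "data.frame")
  expect_named(res, c("eppocode", "fullname"))
  expect_type(res$fullname, "character")
  expect_equal(nrow(res), 2L)
})
```

A test like this keeps passing when the database content changes, and fails only when the shape of the returned table changes.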

Another thing to consider is that even if we only want to check the structure of the output, we still need to somehow encrypt the token we are using. It’s rather reckless to leave your token unencrypted in a public repo… And this is on my TODO list…
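Until proper encryption is in place, a common alternative (not what the package currently does) is to keep the token out of the repository altogether and read it from an environment variable. A minimal sketch, assuming a hypothetical variable name EPPO_TOKEN and hypothetical helpers get_token() and has_token():

```r
# Hypothetical helper: read the API token from an environment variable
# instead of hardcoding it in the repository.
get_token <- function(var = "EPPO_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable '", var, "' is not set.", call. = FALSE)
  }
  token
}

# Helper for tests: TRUE when a token is available, so tests that need
# the API can be skipped on machines (or CI jobs) without one.
has_token <- function(var = "EPPO_TOKEN") {
  nzchar(Sys.getenv(var, unset = ""))
}
```

In a test file, has_token() could guard the API tests, for example with testthat::skip_if_not(has_token()). On Travis CI the variable can be defined in the repository settings, so it never appears in the code.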

Code refactoring

This part is pure joy. When you take code that looks like garbage, full of stupid things like assign or get(someTable[[i]])[which(get(someTable == something))], and turn it into something readable and robust, with nice piping, nice formatting and so on, you feel great. Of course, since this package is still in development, there is still some work to do. Nonetheless, it is starting to look like it was written by a sane person ;)
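To give a flavour of this kind of cleanup, here is a made-up before/after in base R. It is not code from the package, and all names are hypothetical. Instead of creating variables with assign() and fetching them back with get(), the tables live in one named list:

```r
# Toy input: one data frame with a grouping column.
pests <- data.frame(
  type = c("host", "host", "name"),
  value = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Before: variables created and read back by name, hard to follow and debug.
for (type in unique(pests$type)) {
  assign(paste0("table_", type), pests[which(pests$type == type), ])
}
old_hosts <- get("table_host")

# After: a single named list, no assign()/get() and no which().
tables <- split(pests, pests$type)
new_hosts <- tables[["host"]]

# Compare the two results.
all.equal(old_hosts, new_hosts)
```

The list version can also be passed to lapply() or purrr-style functions directly, which is what makes the later piping possible.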

Documentation

In the case of R and packages on CRAN, the documentation is usually nice. If you work solely with R, you do not have the opportunity to compare it with package/function documentation in other languages. Unfortunately, for some time I worked with other languages and programs that had horrible documentation. In my opinion, documentation is crucial for a good package, but writing it is not an easy task. You need to be precise, yet use simple words, so that users with different backgrounds and different skill levels in R understand the why/what/how.

Package development vs. script writing

There are tons of differences between those two approaches. Hadley Wickham wrote a great book on package development which covers the most important ones. There are some other non-obvious things, however. For instance, dplyr and tidyr verbs use something called non-standard evaluation: you refer to columns by bare names. Inside a package, devtools::check() then reports notes about "no visible binding for global variable", because it cannot tell that those names are columns. To avoid the notes you need to somehow declare your columns. The solution is not complicated: import the .data pronoun from the rlang package into your NAMESPACE and add .data$ to the column name, as in the example below:
some_df %>% dplyr::mutate(new_column = .data$old_column)
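Inside a package, this could look roughly like the sketch below. add_copy() and its column names are hypothetical. The roxygen2 tag @importFrom rlang .data is what writes the import into NAMESPACE when the documentation is regenerated, and the sketch assumes that rlang and dplyr are listed under Imports in DESCRIPTION.

```r
#' Copy a column (hypothetical example function)
#'
#' @param df A data frame with a column named `old_column`.
#' @return `df` with an extra column `new_column`.
#' @importFrom rlang .data
#' @export
add_copy <- function(df) {
  # .data$old_column makes it explicit that old_column is a column of df,
  # not an undefined global variable.
  dplyr::mutate(df, new_column = .data$old_column)
}

add_copy(data.frame(old_column = 1:3))
```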

To be continued…

That’s it for now. I hope that I will soon be back to developing this package and learning new things that I can share on this blog.

Cheers!