I have worked on projects that involve integrating various data sources, transforming data, and producing reports. The scripts are mostly in R, but sometimes I resort to other languages.
I have created the following directories: report/, script/, lib/, data/, raw-data/ and doc/. The source code goes in script/ and lib/, the data in data/ and raw-data/, and the reports in report/. The general idea is to create small R scripts that transform the data successively until arriving at formatted data ready to be used in the reports.
In raw-data/ I save data that has been created or obtained manually, usually in .csv files or something similar. Scripts read data from raw-data/ (or from data/), possibly perform some transformations (filtering, grouping, etc.) and create files in data/, usually with the saveRDS function, so that they can be read back quickly with readRDS. Each script is small and usually writes a single .rds file containing one data frame. Functions used in more than one script live in files under lib/ and are loaded with source (using the option chdir=TRUE). The scripts make extensive use of the dplyr package.
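For illustration, a minimal sketch of one such transformation script might look like this (the file and column names are hypothetical, and it assumes the project root is the working directory):

    library(dplyr)

    # Shared helper functions live in lib/; chdir = TRUE so that relative
    # paths inside the sourced file resolve against lib/
    source("lib/helpers.R", chdir = TRUE)

    # Read manually obtained data from raw-data/
    sales <- read.csv("raw-data/sales.csv", stringsAsFactors = FALSE)

    # One small, focused transformation: filter and summarize
    sales_by_month <- sales %>%
      filter(!is.na(amount)) %>%
      group_by(month) %>%
      summarise(total = sum(amount))

    # Write a single data frame to data/ so later scripts can load it
    # quickly with readRDS()
    saveRDS(sales_by_month, "data/sales-by-month.rds")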
In the doc/ folder I try to keep two diagrams up to date: one with the data frames and their columns, and another describing the data transformation pipeline (a diagram with scripts and data files, indicating, for each script, which files it reads and which it writes). The advantage of documenting the pipeline is that when a file changes (for example, because more recent data arrives), it is easy to determine which scripts need to be executed, and in what order, to update the data used in the final reports. I use yEd to create the diagrams.
Some of the scripts in script/ generate reports. They are written to be compiled with knitr::spin and create HTML files in report/, often containing graphics generated with rCharts.
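A report script in this style is just an R script in which #' comments become text and the remaining lines become code chunks; a hypothetical example:

    #' # Monthly sales report
    #' Summary of the formatted data produced by the pipeline.

    sales_by_month <- readRDS("data/sales-by-month.rds")

    #' Total sales per month:
    knitr::kable(sales_by_month)

Compiling it (for example with knitr::spin() followed by knitting the generated document, or with rmarkdown::render(), which also understands spin-style scripts) produces the HTML that goes in report/.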
The project is kept under version control with Git. I avoid keeping the files in data/ under version control, since they can be large and many of them can be regenerated from the scripts and the data in raw-data/. The exception is data files derived from external databases: those I do put under version control, so that people without access to the database can still run the project.
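In practice this can be expressed with ignore rules; a minimal sketch of a .gitignore along these lines (the re-included file name is hypothetical):

    # data/ files can be large and can be regenerated from script/ and raw-data/
    data/*
    # ...except files derived from external databases, which are kept so the
    # project can be run without database access
    !data/issues-from-tracker.rds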
An example project that uses this workflow can be found at https://github.com/rodrigorgs/arch-violations-bugs
The advantage of using several specialized scripts is that if a data source is updated, or if a new column needs to be computed in a data frame, you only need to re-run the scripts that deal with that data frame. This is especially important when the project involves slow data transformations or access to external data sources such as the web or database management systems.
Documenting the pipeline, and even running the scripts in the right order, is a process that could be automated. I haven’t found a ready-made solution for this, but if anyone is interested in developing one, let’s talk.
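As an illustration of what such a tool could do, here is a minimal sketch (with hypothetical script and file names) that re-runs each script whose inputs are newer than its outputs, based on a hand-written description of what each script reads and writes:

    # Hypothetical pipeline description: for each script, its inputs and outputs
    pipeline <- list(
      list(script = "script/01-clean-sales.R",
           reads  = "raw-data/sales.csv",
           writes = "data/sales.rds"),
      list(script = "script/02-summarise-sales.R",
           reads  = "data/sales.rds",
           writes = "data/sales-by-month.rds")
    )

    # Steps are listed in dependency order; a full tool would derive the
    # order from the reads/writes declarations (a topological sort),
    # which is what the diagram in doc/ encodes by hand
    for (step in pipeline) {
      outdated <- !all(file.exists(step$writes)) ||
        isTRUE(max(file.mtime(step$reads)) > min(file.mtime(step$writes)))
      if (outdated) {
        message("Running ", step$script)
        source(step$script, chdir = TRUE)
      }
    }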
– rodrigorgs
Great answer, even more interesting than the ones on SOEN!
– Carlos Cinelli
Don’t you usually put the functions into a package? That’s something I’m finding increasingly worthwhile, even when there are only a few functions!
– Carlos Cinelli
I’ve never stopped to study packaging in R, but one day I need to sit down and look into it. Do you use https://github.com/hadley/devtools ?
– rodrigorgs
I do, but not for everything. I also use Roxygen2 a lot; it’s a great help for documenting functions!
– Carlos Cinelli