I have worked on projects that involve integrating various data sources, transforming data, and producing reports. The scripts are mostly in R, but sometimes I resort to other languages.
I have created the following directories: report/, script/, lib/, data/, raw-data/ and doc/. The source code goes in script/ and lib/, the data in data/ and raw-data/, and the reports in report/. The general idea is to create small R scripts that transform the data successively until arriving at formatted data ready to be used in the reports.
In raw-data/ I save data that has been created or obtained manually, usually in .csv files or something similar. Scripts read data from raw-data/ (or from data/), possibly perform some transformations (filtering, grouping, etc.) and create files in data/, usually with the saveRDS function, so that they can be read back quickly with readRDS. Each script is small and usually writes a single .rds file containing one data frame. Functions used in more than one script live in files under lib/ and are loaded with source (using the option chdir=TRUE). The scripts make extensive use of the dplyr package.
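For illustration, a minimal sketch of one such transformation script might look like this (the file and column names are hypothetical, and it assumes the project root is the working directory):

    library(dplyr)

    # Shared helper functions live in lib/; chdir = TRUE so that relative
    # paths inside the sourced file resolve against lib/
    source("lib/helpers.R", chdir = TRUE)

    # Read manually obtained data from raw-data/
    sales <- read.csv("raw-data/sales.csv", stringsAsFactors = FALSE)

    # One small, focused transformation: filter and summarize
    sales_by_month <- sales %>%
      filter(!is.na(amount)) %>%
      group_by(month) %>%
      summarise(total = sum(amount))

    # Write a single data frame to data/ so later scripts can load it
    # quickly with readRDS()
    saveRDS(sales_by_month, "data/sales-by-month.rds")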
In the doc/ folder I try to keep two diagrams up to date: one with the data frames and their columns, and another describing the data transformation pipeline (a diagram with scripts and data files, indicating, for each script, which files it reads and which it writes). The advantage of documenting the pipeline is that when a file changes (for example, because more recent data arrives), it is easy to determine which scripts need to be executed, and in what order, to update the data used in the final reports. I use yEd to create the diagrams.
Some of the scripts in script/ generate reports. They are written to be compiled with knitr::spin and create HTML files in report/, often containing graphics generated with rCharts.
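A report script in this style is just an R script in which #' comments become text and the remaining lines become code chunks; a hypothetical example:

    #' # Monthly sales report
    #' Summary of the formatted data produced by the pipeline.

    sales_by_month <- readRDS("data/sales-by-month.rds")

    #' Total sales per month:
    knitr::kable(sales_by_month)

Compiling it (for example with knitr::spin() followed by knitting the generated document, or with rmarkdown::render(), which also understands spin-style scripts) produces the HTML that goes in report/.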
The project is kept under version control with Git. I avoid keeping the files in data/ under version control, since they can be large and many of them can be regenerated from the scripts and the data in raw-data/. The exception is data files derived from external databases: those I do put under version control, so that people without access to the database can still run the project.
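In practice this can be expressed with ignore rules; a minimal sketch of a .gitignore along these lines (the re-included file name is hypothetical):

    # data/ files can be large and can be regenerated from script/ and raw-data/
    data/*
    # ...except files derived from external databases, which are kept so the
    # project can be run without database access
    !data/issues-from-tracker.rds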
An example project that uses this workflow can be found at https://github.com/rodrigorgs/arch-violations-bugs
The advantage of using several specialized scripts is that if a data source is updated, or if a new column needs to be computed in a data frame, you only need to re-run the scripts that deal with that data frame. This is especially important when the project involves slow data transformations or access to external data sources such as the web or database management systems.
Documenting the pipeline, and even running the scripts in the right order, is a process that could be automated. I haven’t found a ready-made solution for this, but if anyone is interested in developing one, let’s talk.
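As an illustration of what such a tool could do, here is a minimal sketch (with hypothetical script and file names) that re-runs each script whose inputs are newer than its outputs, based on a hand-written description of what each script reads and writes:

    # Hypothetical pipeline description: for each script, its inputs and outputs
    pipeline <- list(
      list(script = "script/01-clean-sales.R",
           reads  = "raw-data/sales.csv",
           writes = "data/sales.rds"),
      list(script = "script/02-summarise-sales.R",
           reads  = "data/sales.rds",
           writes = "data/sales-by-month.rds")
    )

    # Steps are listed in dependency order; a full tool would derive the
    # order from the reads/writes declarations (a topological sort),
    # which is what the diagram in doc/ encodes by hand
    for (step in pipeline) {
      outdated <- !all(file.exists(step$writes)) ||
        isTRUE(max(file.mtime(step$reads)) > min(file.mtime(step$writes)))
      if (outdated) {
        message("Running ", step$script)
        source(step$script, chdir = TRUE)
      }
    }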
– rodrigorgs
Great answer, even more interesting than the ones on SOEN!
– Carlos Cinelli
Don’t you usually put the functions into a package? That’s something I’m finding increasingly worthwhile, even when there are only a few functions!
– Carlos Cinelli
I’ve never stopped to study packaging in R, but one day I need to sit down and look into it. Do you use https://github.com/hadley/devtools ?
– rodrigorgs
I do, but not for everything. I also use Roxygen2 a lot; it’s a great help for documenting functions!
– Carlos Cinelli