What are the main functions for creating a minimum reproducible example in R?

12

What are the main functions for creating a minimal reproducible example in R?

More specifically, I would like the answers to address the following topics:

  • Which functions ensure that the sample data set can be replicated?
  • Which functions ensure that simulation results can be replicated?
  • Which functions report system information?
  • Which parts of the code should be included to ensure reproducibility?
  • How can I make sure the example will be reproduced correctly on other machines?

Other important functions and programming practices that I may have forgotten to mention in the topics above, and that help in creating a minimal reproducible example, are also welcome.

  • 2

    I think that asking several questions in one goes a little beyond the scope. You asked 6 questions in one (counting the title).

  • 2

    @dvd This question has been discussed here

  • @Danielfalbel wouldn't you like to post an answer explaining how to use the reprex package?

2 answers

6

The basis of a good reproducible question is that your problem¹ must also show up as a problem for whoever tries to understand and solve it.

General guidelines

So that we can reproduce your problem, follow these steps:

  1. Try reproducing your problem on your own machine before posting it to Stack Overflow.
  2. Provide the code that produced (and that should reproduce on someone else's computer) the behavior that prompted the question.
  3. Provide data capable of reproducing the problem.
  4. Provide the expected result for the code provided in step 2.

1. How to reproduce my problem?

Open a new script and a new environment. If you are using RStudio you can start a new session by clicking Session in the top bar and then New Session. If you are using plain R (Rgui, R from the command line, etc.), just open the program one more time.

In this new environment, copy the original script and run it line by line until you come across the problem again. This method isolates the problem down to its essential causes. If you are working on a 200-line script but the error happens on line 53, there is no reason to share the 147 lines that follow the error, and most of the first 53 lines can probably also be left out of the code you share.
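The fresh-session check described above can also be scripted. The sketch below is only one possible approach: it writes a hypothetical two-line script to a temporary file and runs it in a separate R process started with --vanilla, so nothing from your current workspace can leak in. The script contents are an assumption made for illustration; replace them with the code you intend to post.

```r
# Hypothetical script contents; replace with the code you intend to post.
script <- c(
  "x <- c(1:5, '6')",
  "print(class(x))"
)
path <- tempfile(fileext = ".R")
writeLines(script, path)

# Run the script in a separate, clean R process
# (--vanilla: no saved workspace, no profile or site files).
rscript <- file.path(R.home("bin"), "Rscript")
out <- system2(rscript, c("--vanilla", shQuote(path)),
               stdout = TRUE, stderr = TRUE)
out
```

If the problem still appears in `out`, the script is self-contained enough to share.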

Once you have identified the source of the problem, provide that line of code and only the other lines needed to reproduce the problem. Let’s say the error was found in:

sum(x)

In this case we also need to know what x is; that is, you must provide the object(s) x in the state they were in when they entered the function call (see item 3).

2. How to share my code?

The most appropriate way is to copy and paste your code as text. It seems trivial, but it is not the only way people provide code.

If you are encountering an error or warning, please provide the message.

sum(x)
Error in sum(x) : invalid 'type' (character) of argument
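If copying the message from the console is awkward, the error can be captured as an object instead. This is a minimal sketch using base R's tryCatch(); the vector x is the same illustrative object used in this answer.

```r
x <- c(1:5, "6")  # a character vector, so sum() fails

# Capture the error object instead of stopping the script.
res <- tryCatch(sum(x), error = function(e) e)

conditionMessage(res)        # the message text to paste into the question
deparse(conditionCall(res))  # the call that raised the error
```

The exact wording of the message can depend on the language R is running in, so paste it exactly as it appears on your machine.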

3. How to provide data?

As mentioned above, your data should be provided in the state it was in when the error occurred. To do so, when you encounter the error, use the function dput to provide your object exactly as it stands.

dput(x)
c("1", "2", "3", "4", "5", "6")

The function dput allows your object to be recreated on another machine, even if it was obtained from a database, a file or any other source. If your object is too large, use dput(head(objeto, 30)) or some other way of limiting the size of the object.
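As a small illustration of that round trip, here is a sketch with a vector and with a made-up data frame (`big` is a hypothetical stand-in for your own data, not something from the question):

```r
x <- c(1:5, "6")  # the object as it is when the error occurs
dput(x)           # prints R code that recreates x

# For a large object, share only its first rows.
# `big` is a hypothetical stand-in for your own data.
big <- data.frame(id = 1:1000, value = (1:1000) / 10)
dput(head(big, 3))

# Whoever answers pastes the printed code to rebuild the object:
x_rebuilt <- c("1", "2", "3", "4", "5", "6")
identical(x, x_rebuilt)
```

The last line checks that the pasted code really gives back the same object.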

Some people prefer to provide the lines of code that created the object. However, beginners very often modify the object later on, so the state of the object at the line that created it and at the line that raised the error can (I would say usually does) differ. For this reason, using dput ensures greater reproducibility and should be preferred.

This is the case in the error example I’m using here:

x <- 1:5
x <- c(x, '6')
sum(x)
Error in sum(x) : invalid 'type' (character) of argument

If your code involves simulation, use set.seed(1) (or any other number) to ensure that the results will be the same on your machine and on the machines of those who intend to help you.
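A minimal sketch of what the seed buys you: resetting it replays exactly the same random draws. One caveat, to the best of my knowledge: R 3.6.0 changed the default algorithm behind sample(), so the same seed can give different sample() results on older and newer versions of R. Reporting the output of RNGkind() along with your R version helps others spot that kind of mismatch.

```r
set.seed(1)
a <- rnorm(3)

set.seed(1)      # reset the seed ...
b <- rnorm(3)    # ... and the same numbers come out again

identical(a, b)

# The random number generator settings in use; worth reporting when
# results differ between machines.
RNGkind()
```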

4. How to share the expected result?

This can be done in many ways. You can use a link or an image that shows the expected result (in the case of a graph, like this example). It is also possible to describe in words what you expect, as in this case.

EDITED

To obtain system information such as the version of R, the operating system, etc., just call the function sessionInfo() (no arguments needed) and paste the result into your question.

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
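When the full printout is more than the question needs, individual pieces can be pulled out. This is a sketch using base R only; which pieces are worth sharing is a judgment call that depends on the question.

```r
# One-line R version, e.g. for a short note in the question.
R.version.string

# Operating system name.
Sys.info()[["sysname"]]

# sessionInfo() returns a list, so its parts can be inspected separately.
si <- sessionInfo()
si$R.version$version.string  # same text as R.version.string
names(si$otherPkgs)          # attached non-base packages (NULL if none)
```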

1: "Problem" here need not be understood as an error, but simply as the motivation for the question.

2: This answer was originally posted for this question. As a result of this discussion, it is being republished here.

2

As stated in the link in the question, a minimal reproducible example should contain the following:

  1. A small data set;
  2. The smallest code possible that is executable and which reproduces the error in the mentioned small data set;
  3. Information on the version of R and the system on which the code is running, as well as the packages used;
  4. If using random data, ensure that the results are the same;

In this answer I will list some of the main functions in R to fulfil these tasks.

It is worth remembering that the examples in the help pages of R functions can be of great value for getting an idea of the structure of a minimal reproducible example. In general, the code in those examples meets these requirements.

Producing the data set

To use your own data set, the functions dput() and head() can be quite useful. For example, the code below prints the first 10 observations of the iris data set, already in the structure needed to "reassemble" it. Anyone trying to answer your question then only has to copy and paste the resulting structure() call.

dput(head(iris, 10))
#> structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 
#> 5, 4.4, 4.9), Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 
#> 3.4, 2.9, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 
#> 1.4, 1.5, 1.4, 1.5), Petal.Width = c(0.2, 0.2, 0.2, 0.2, 0.2, 
#> 0.4, 0.3, 0.2, 0.2, 0.1), Species = structure(c(1L, 1L, 1L, 1L, 
#> 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("setosa", "versicolor", "virginica"
#> ), class = "factor")), .Names = c("Sepal.Length", "Sepal.Width", 
#> "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, 
#> 10L), class = "data.frame")

Reproducing the data:

dados <- structure(list(Sepal.Length = c(
  5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6,
  5, 4.4, 4.9
), Sepal.Width = c(
  3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4,
  3.4, 2.9, 3.1
), Petal.Length = c(
  1.4, 1.4, 1.3, 1.5, 1.4, 1.7,
  1.4, 1.5, 1.4, 1.5
), Petal.Width = c(
  0.2, 0.2, 0.2, 0.2, 0.2,
  0.4, 0.3, 0.2, 0.2, 0.1
), Species = structure(c(
  1L, 1L, 1L, 1L,
  1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c(
  "Sepal.Length", "Sepal.Width",
  "Petal.Length", "Petal.Width", "Species"
), row.names = c(
  NA,
  10L
), class = "data.frame")

A less ideal solution than this would be to provide the data in text format, for example in the case below:

texto <- "Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa"

In this case, whoever answers your question can reassemble the data set using the function read.table():

dados <- read.table(text=texto)
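A shortened, self-contained sketch of the same idea follows (only three columns and two rows, chosen for brevity). Because the header line has one field fewer than the data lines, read.table() uses the first column as row names; header = TRUE is spelled out here only to make that intent explicit. Note that whether Species comes back as character or factor depends on the R version, which is one reason dput() is the safer option.

```r
texto <- "Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa"

# The header has one field fewer than the rows, so the first
# column of each row is used as the row name.
dados <- read.table(text = texto, header = TRUE)
str(dados)
```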

Another way to produce a data set is to generate random values, for example with the function rnorm() (you can also draw from distributions other than the normal, if relevant) or with the function sample() to draw values from some vector. The built-in constant letters can be useful for generating characters or factors. In this case, be sure to set the seed so that the example is reproducible.

Example:

set.seed(1) # ensure reproducibility
dados <- data.frame(x = rnorm(10), y = sample(letters, 10))
dados
#>             x y
#> 1  -0.6264538 y
#> 2   0.1836433 f
#> 3  -0.8356286 p
#> 4   1.5952808 c
#> 5   0.3295078 z
#> 6  -0.8204684 i
#> 7   0.4874291 a
#> 8   0.7383247 h
#> 9   0.5757814 x
#> 10 -0.3053884 v

Other useful functions here are those of the as.* family, such as as.factor(), as.Date(), etc., which convert the data to the required format.
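For instance, a small made-up data frame (the column names grupo and dia are hypothetical, chosen only for this sketch) can be brought to the required types like this:

```r
# Hypothetical example data: everything starts out as character.
dados <- data.frame(
  grupo = c("a", "b", "a"),
  dia = c("2020-01-01", "2020-01-02", "2020-01-03"),
  stringsAsFactors = FALSE
)

dados$grupo <- as.factor(dados$grupo)  # character -> factor
dados$dia <- as.Date(dados$dia)        # character -> Date (ISO format)
str(dados)
```

Showing str() of the result in the question lets readers confirm the column types at a glance.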

Producing the minimum code

Try to identify the smallest part of your code that is needed to generate the error or the doubt you have. Before posting the code, make sure you have listed the packages it needs in order to be reproducible. For this, it is a good idea to test your code after restarting R, to make sure that everything necessary is there.

Example:

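library(lattice) # the package used
set.seed(1) # the seed
# the data set; stringsAsFactors = TRUE is stated explicitly because
# the default changed to FALSE in R 4.0.0
dados <- data.frame(x = as.character(rnorm(10)), y = sample(letters, 10),
                    stringsAsFactors = TRUE)
densityplot(as.numeric(dados$x))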

as.numeric(dados$x)
#>  [1]  2  5  4 10  6  3  7  9  8  1

This example would correspond to a question such as: "I'm trying to make a density plot with lattice as in the code above. Why, when I convert the data to numeric, do the values become 2, 5, 4 ... instead of remaining the original values from rnorm?"
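For readers wondering about the behavior itself: the values 2, 5, 4 ... are the internal integer codes of a factor, not the numbers it displays. The sketch below uses a tiny hypothetical factor rather than the data above to show the difference between converting the codes and converting the labels.

```r
# A small hypothetical factor whose labels look like numbers.
f <- factor(c("1.5", "-0.3", "2.0"))

as.numeric(f)                # integer codes of the levels, not the values
as.numeric(as.character(f))  # the original numbers
```

Going through as.character() first is a common way to recover the numeric values from such a factor.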

System information

Finally, when necessary, you can provide your system information with sessionInfo(), which gives detailed information about your session. In my case, this information was:

R version 3.0.1 (2013-05-16)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lattice_0.20-15

loaded via a namespace (and not attached):
[1] grid_3.0.1  tools_3.0.1

Reprex package

The reprex package can be quite useful for creating a reproducible example; in fact, the previous examples were generated with it. It is a package designed specifically to help create and run reproducible examples (the name reprex is short for REPRoducible EXample), with output already formatted for sites like GitHub and Stack Overflow.

A simple way to create a reproducible example with the package is to copy your R code to the clipboard. Then just load the package with library(reprex) and run reprex(venue = "so"); the code, with its results as comments and already formatted, will be ready to paste into the chosen venue (in this example "so" stands for Stack Overflow). All generated images are uploaded to Imgur and the link is generated automatically for posting, so you only need to paste the result.

The package has other quite useful features. For example, you can automatically include system information with the argument si = TRUE (as far as I know, recent versions of the package call this argument session_info) and automatically format your code in the style suggested by Hadley with the argument style = TRUE. For more information see the package page.
