Build database

Asked

Viewed 129 times

2

I need to build a database with 200 rows and 4 columns (with the 4 variables I idealized). I devised a study to estimate the presence of snakes in a given city, I have no data. I thought that the presence of snakes is due to these 4 variables: Precipitation, temperature, Presence of the snake (1-yes; 0-no), Seasons of the year (1-spring, 2-summer, 3-autumn, 4-winter). Anyone can help?

  • Ta, but what do you need help with ??... Want to know the most viable database ? How to do it ? What will you need ? Your question is a little vague...

  • want to know how do script to build the data, when I have no data :|

4 answers

4

In Postgresql you can use the function random() to simulate such data without the need for an external language, only with SQL. I imagine it’s something easily transportable to other database systems.

First, I created a table with the specifications given in the question:

create table presenca_de_cobras (
  precipitacao numeric(4,1) not null,
  temperatura int not null,
  presenca boolean not null,
  estacao int not null,
  constraint estacoes check (estacao in (1,2,3,4))
);

Each field should have a different data amplitude, such as precipitation in hundreds of millimeters, I imagine, and the presence of snakes in Boolean "true" or "false".

The function random() returns a random value from 0 to 1, so multiply by the maximum limit of the value in question. In addition to random(), I used the function floor() to round down the random value of the season and the generate_series() to build a table of a hundred records, but I don’t need the data it returns. All this is passed to INSERT that will popularize the table:

-- insere os dados
insert into presenca_de_cobras 
select 
  (random()*200)::numeric(4,1) precipitacao,  -- precipitacao de 0-200 milimetros
  (random()*40)::int temperatura,             -- temperatura de 0-40 graus
  (random() > 0.5) presenca,                  -- presenca TRUE ou FALSE
  floor(random()*4+1)::int estacao            -- estação de 1-4
from generate_series(1,100) as s;

Finally, just consult the table itself:

select * from presenca_de_cobras limit 10;
| precipitacao | temperatura | presenca | estacao |
|--------------|-------------|----------|---------|
|        137.2 |          16 |    false |       4 |
|        164.5 |          38 |    false |       3 |
|          4.3 |          14 |     true |       3 |
|        134.9 |          38 |    false |       2 |
|          3.9 |          18 |    false |       3 |
|         67.3 |          37 |    false |       1 |
|        140.3 |          34 |     true |       3 |
|         34.2 |          35 |     true |       4 |
|         56.3 |          37 |     true |       1 |
|        171.5 |           5 |     true |       4 |

Follow SQL Fiddle with all steps: http://sqlfiddle.com/#! 17/693a3/31/0

  • 1

    Did not return any station 1, will the random is only picking from 2 to 4?

  • 1

    @Filipericardo thanks for pointing out the error! Really, I defined a "range" wrong when multiplying the random()

4

I would solve this problem as follows using the R. I’ll set the number of lines in my database:

n <- 200

I will create n random values for precipitation, temperature and presence of snakes with exponential, normal and binomial distributions, respectively:

set.seed(1234) # para que os resultados sejam reproduziveis
precipitacao <- round(rexp(n, rate=2), digits=1)
temperatura  <- round(rnorm(n, mean=25), digits=0)
presenca     <- rbinom(n, size=1, prob=0.2)

Then I’ll create a vector called estacao, with the seasons defined:

estacao <- rep(c(1, 2, 3, 4), each=n/4)

Finally, I put everything together in one object, called data:

dados <- data.frame(precipitacao, temperatura, presenca, estacao)
head(dados)
##    precipitacao temperatura presenca estacao
## 1           1.3          25        0       1
## 2           0.1          23        0       1
## 3           0.0          26        0       1
## 4           0.9          26        0       1
## 5           0.2          27        0       1
## 6           0.0          26        0       1
## 7           0.4          26        1       1
## 8           0.1          25        0       1
## 9           0.4          25        1       1
## 10          0.4          27        0       1

Note that I did not assume any kind of dependency between the variables. For example, the generation of numbers that indicate the presence or not of snakes will not, in this example, be related to precipitation or temperature recorded. Everything here was generated randomly and independently. Perhaps the data collected in the future does not show this behavior.

  • Marcus Nunes because you put set.Seed (1234) ? Why this number?

  • For results to be reproducible. set.seed determines which seed to use to generate the numbers. I chose 1234 because I like this value.

  • How do I add 4 lines to the database where, alternately, each of the 4 variables has an Missing (NA)?

2

I believe the bank can be created with the following commands:

CREATE DATABASE Estudo;
CREATE TABLE Persons (
id int,
precipitacao varchar(255),
temperatura varchar(255),
estacao varchar(255),
presenca Serpente boolean 
);

Then just popular the table as you like, I hope I’ve helped.

-2

can start with the function

CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
   ....
);

Browser other questions tagged

You are not signed in. Login or sign up in order to post.