Handle values of a data.frame

Asked

Viewed 172 times

5

On the link https://1drv.ms/u/s! Alyjgc4yfdharhtsiuubju12gzho it is possible to find a dataset in csv on solar irradiation in Brazil. The column the_geom_4326 represents the longitudes and latitudes of a polygon, which represents a solar irradiation grid. That is, within this grid, the solar irradiation values in the other columns occur. I would like to extract these values and put in separate columns, forming the total of 8 new columns, with the names

  • Grid1_long
  • Grid1_lat
  • Grid2_long
  • Grid2_lat
  • Grid3_long
  • Grid3_lat
  • Grid4_long
  • Grid4_lat

Any tips on how to handle these values? The problem is that these values do not always have a character number pattern.

  • I can’t access this link because my network is blocked. Can paste some lines from the database?

2 answers

1

Follow a solution in R only.

It’s not as short as the one that uses the terminal tools, but it works. I made the separate steps also to make it clearer.

# Lendo os dados
dat <- read.csv("swera_ghi_inpe_hi_res.csv", stringsAsFactors=FALSE)

# Removendo da coluna tudo que não é informação útil: texto e parêntesis
dat$the_geom_4326 <- gsub("[A-Z]|(\\s\\(\\()|(\\)\\))", "", dat$the_geom_4326)

#Separando os valores pela ocorrência de vírgula e espaço
poly <- strsplit(x = dat$the_geom_4326, ",?\\s")

#Transformando texto em números
poly <- lapply(poly, as.numeric)

#Transformando a lista em matriz depois em data.frame
mat <- as.data.frame(do.call(rbind, poly))

#Colocando os nomes das colunas
colnames(mat) <- paste0(paste0("Grid", rep(1:5, each = 2)), c("_Long", "_Lat"))

#Juntando com os dados iniciais.
dat2 <- cbind(dat, mat)

Just out of curiosity, a form in pipe to do the same thing would be the following:

library(magrittr)
dat <- read.csv("swera_ghi_inpe_hi_res.csv", stringsAsFactors=FALSE) 

dat2 <- dat %>%
  .$the_geom_4326 %>%
  gsub("[A-Z]|(\\s\\(\\()|(\\)\\))", "", .) %>%
  strsplit(",?\\s") %>% 
  lapply(as.numeric) %>%
  do.call(rbind, .) %>% 
  as.data.frame %>%
  set_colnames(paste0(paste0("Grid", rep(1:5, each = 2)), c("_Long", "_Lat"))) %>%
  cbind(dat, .)

1

I couldn’t solve this problem in R. However, it can be solved easily on the Linux or Mac terminal, using tools that are already installed by default:

cat swera_ghi_inpe_hi_res.csv | awk 'BEGIN {FS=","}; {print $3" "$4" "$5" "$6" "$7}' | 
sed s/"\"POLYGON (("//g | sed s/"))\""//g | sed s/"  "/" "/g | tail -n +2 > 
LongLat.dat

The commands I used were:

  • cat to read the contents of the csv file on the computer screen

  • awk to print only the contents of the polygon columns

  • several sed to remove unwanted content, such as strings "POLYGON ((, ))\" and two spaces in sequence

  • tail not to consider the original file header

  • > to save the result in the Longlat.dat file

Now just read the Longlat.dat file inside the R and match the previous data:

dados <- read.csv(file="swera_ghi_inpe_hi_res.csv")

LongLat <- read.table(file="LongLat.dat", sep=" ", header=FALSE)
names(LongLat) <- c("Grid1_Long", "Grid1_Lat", "Grid2_Long", "Grid2_Lat", 
"Grid3_Long", "Grid3_Lat", "Grid4_Long", "Grid4_Lat", "Grid5_Long", "Grid5_Lat")

dados <- cbind(dados, LongLat)

Notice that I ended up finding polygons of 5 sides, and not of 4 as in the original question. You have to see why there.

  • Unfortunately I don’t own Linux or Mac. The last information of the polygon is equal to the first, because polygons in GIS are defined as the last point equal to the first to "close" the polygon.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.