Remove all line breaks from just one column of a csv file on Linux

Viewed 1,499 times

2

I have a CSV file with more than 500k lines and need to efficiently remove the line breaks from just one column, and also delete every link that column contains. A snippet of the file:

"created_at","text","location","name","screen_name","followers_count","friends_count"
"2016-09-17T21:48:10.000Z","MAIS UM CLIENTE REVIVALL #Eleições #Vereador #BH

VÍDEO MARKETING - Solicite uma Proposta Revivall Marketing:... https://t.co/d7qDOguDoA","Belo Horizonte","Objetivo Comunicação","ObjeCom",15,0
"2016-09-17T23:05:01.000Z","Não ia votar no Kalil não. Mas diante de tanta sacanagem nas propagandas políticas meu voto agora é dele. Fazer raiva nas MARIAS.","Belo Horizonte","LEONARDO BARBOSA","LEO_BEAGALO",113,282
"2016-09-17T23:26:10.000Z","RT @brnou3: nossa candidato que legal várias pessoas na rua balançando uma bandeira com seu nome isso conquistou meu voto

disse nenhuma pe…","Belo Horizonte","Júlia","julinhasr",589,306

It should end up like this:

"created_at","text","location","name","screen_name","followers_count","friends_count"
"2016-09-17T21:48:10.000Z","MAIS UM CLIENTE REVIVALL #Eleições #Vereador #BH VÍDEO MARKETING - Solicite uma Proposta Revivall Marketing:... ","Belo Horizonte","Objetivo Comunicação","ObjeCom",15,0
"2016-09-17T23:05:01.000Z","Não ia votar no Kalil não. Mas diante de tanta sacanagem nas propagandas políticas meu voto agora é dele. Fazer raiva nas MARIAS.","Belo Horizonte","LEONARDO BARBOSA","LEO_BEAGALO",113,282
"2016-09-17T23:26:10.000Z","RT @brnou3: nossa candidato que legal várias pessoas na rua balançando uma bandeira com seu nome isso conquistou meu voto disse nenhuma pe…","Belo Horizonte","Júlia","julinhasr",589,306
  • Any specific language? When you say Linux, do you think about using shell scripting or something like that?

  • Use Vim/Vi for this. Search Google for the exact command to strip the Windows line breaks and insert new ones. It is a very common problem... you will even find ready-made scripts that do it...

  • I would first remove all the blank lines with grep -v ^$ arquivo_csv, and then join every line that doesn't start with the " character to the previous line. Surely this can be done with awk, but I haven't managed it.

2 answers

7


I know you asked for help with "awk" and "sed" (judging by the tags used). But, if I may suggest, why don't you do it in Python? It's quite simple, because you handle the file as actual CSV (which also gives you flexibility for any future processing of these fields).

Here's some example code:

import csv
import re

# Deliberately simple regular expression to match HTTP(S) URLs
url_re = r'https?://\S+'

# Open the original file for reading (newline='' lets the csv module handle
# line breaks inside quoted fields itself)
fin = open('teste.csv', 'r', newline='')
# Open the destination file for writing
fout = open('teste2.csv', 'w', newline='')

# Create the CSV reader and writer objects (for the destination file, force
# quoting of every field to stay close to your example; note that this also
# quotes the numeric columns)
reader = csv.DictReader(fin, delimiter=',')
writer = csv.DictWriter(fout, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL,
                              fieldnames=reader.fieldnames)

# Write the header to the destination file
writer.writeheader()

# Read the input file record by record and write each one to the output file;
# apply the required conversions to the 'text' column only
for row in reader:
    # Replace each run of line breaks with a single space
    row['text'] = re.sub(r'[\r\n]+', ' ', row['text'])

    # Replace each URL with a single space
    row['text'] = re.sub(url_re, ' ', row['text'])

    # Write the whole row to the output
    writer.writerow(row)

fin.close()
fout.close()

Anyway, if you prefer to do it in "sed", for example, just use the same idea (and probably the same regular expressions) to do the substitution.
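If you want to try the two substitutions on a single value before running them over the whole file, a minimal sketch could look like the one below. The helper name clean_text and the pattern https?://\S+ are my own assumptions about what counts as a URL here, not something required by the csv module.

```python
import re

# Hypothetical helper, for illustration only: the same two substitutions
# applied to a single string.
URL_RE = re.compile(r'https?://\S+')      # assumed, deliberately simple URL pattern
NEWLINES_RE = re.compile(r'[\r\n]+')      # one or more line breaks in a row


def clean_text(text):
    """Collapse line breaks into one space, then replace URLs with a space."""
    text = NEWLINES_RE.sub(' ', text)
    return URL_RE.sub(' ', text)


sample = 'MAIS UM CLIENTE #BH\n\nVÍDEO MARKETING:... https://t.co/d7qDOguDoA'
print(repr(clean_text(sample)))
```

Printing the repr makes leftover whitespace visible; if the extra spaces bother you, a final re.sub(r' {2,}', ' ', text) would squeeze them.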

  • 1

    Well played! +1 (a CSV-aware tool is clearly the way to go)

  • 1

    Very well thought out, my friend! Thank you very much!

3

With a file this size, using a CSV parser/module is clearly the recommended approach.

However, assuming that the quotes are balanced and that there is no \" inside the quoted fields, we can try some black magic...

1 awk and sed:

(a) Since the first field ("created_at") is complex, it is always enclosed in quotes. Thus, a " at the beginning of a line always marks the start of a new record.

sed 's/^"/@@"/' in.csv |
     awk '{gsub(/(\n|http[^ \n"]*)+/," "); print }' RS="@@"
  • sed 's/^"/@@"/' explicitly marks the record boundaries with "@@"
  • awk '..awk-program..' RS="@@" processes the records separated by "@@" (a multi-character RS is an extension supported by gawk and mawk)
  • ...{gsub(/(\n|http[^ \n"]*)+/," ");print} within each record, replaces \n and URLs with a space
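For anyone who finds the pipeline hard to follow, the same idea can be sketched in Python. This is only an illustration under premise (a); iter_records and clean_record are hypothetical names, and unlike the awk version the sketch also strips the trailing space left where the record's final newline used to be.

```python
import re

# Same pattern as the gsub() call: runs of newlines and/or http... tokens
JUNK_RE = re.compile(r'(?:\n|http[^ \n"]*)+')


def iter_records(lines):
    """Group physical lines into records, assuming that a line starting
    with '"' always begins a new record (premise (a))."""
    record = []
    for line in lines:
        if line.startswith('"') and record:
            yield ''.join(record)
            record = []
        record.append(line)
    if record:
        yield ''.join(record)


def clean_record(record):
    """Replace newlines and URLs with a space anywhere in the record."""
    return JUNK_RE.sub(' ', record).rstrip()


sample = (
    '"created_at","text","location"\n'
    '"2016-09-17T21:48:10.000Z","MAIS UM CLIENTE #BH\n'
    '\n'
    'VÍDEO MARKETING:... https://t.co/d7qDOguDoA","Belo Horizonte"\n'
    '"2016-09-17T23:05:01.000Z","Não ia votar no Kalil não.","Belo Horizonte"\n'
)

cleaned = [clean_record(r) for r in iter_records(sample.splitlines(keepends=True))]
for line in cleaned:
    print(line)
```

Note that, like the awk program, this touches every column of the record, not just "text"; that is the price of not parsing the CSV.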

2 perl: substitute

Under the same premise (a):

perl -p0e 's/\n(?!")/ /g; s/https?:[^\s"]+/ /g' in.csv
  • s/\n(?!")/ /g; replaces \n with a space unless it is immediately followed by a quotation mark
  • s/https?:[^\s"]+/ /g replaces the URLs with a space
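The negative lookahead is the whole trick here, so here is a rough Python rendering of the two substitutions for comparison. It is a sketch only: join_broken_lines is a hypothetical name, and I added \Z to the lookahead so the file's final newline is kept (as far as I can tell, the one-liner turns it into a space).

```python
import re


def join_broken_lines(data):
    """Join lines that were broken inside a quoted field, then drop URLs."""
    # A newline that is not followed by '"' (and is not the last character
    # of the data) sits inside a field under premise (a): make it a space.
    data = re.sub(r'\n(?!"|\Z)', ' ', data)
    # Replace anything that looks like http: or https: up to the next
    # whitespace or quote with a space.
    return re.sub(r'https?:[^\s"]+', ' ', data)


sample = (
    '"created_at","text","location"\n'
    '"2016-09-17T21:48:10.000Z","MAIS UM CLIENTE #BH\n'
    '\n'
    'VÍDEO MARKETING:... https://t.co/d7qDOguDoA","Belo Horizonte"\n'
    '"2016-09-17T23:05:01.000Z","Não ia votar no Kalil não.","Belo Horizonte"\n'
)

print(join_broken_lines(sample), end='')
```

Like the one-liner, this reads the whole file into memory, which is usually fine for 500k short lines but worth knowing.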

3 Perl: substitute with Eval

This next version is a bit more cryptic:

  • replace each string between quotation marks with the result of substituting, within it, the \n and URLs with a space:
perl -p0e 's{(".*?")}{ $1 =~ s!(\n|https?:[^\s"]+)+! !gr }sge' in.csv
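If the nested substitution is hard to read, the same shape can be written in Python with a function as the replacement argument of re.sub. Again just a sketch under the assumption that quotes are balanced; clean_quoted is a hypothetical name.

```python
import re

# A quoted string, allowed to span several lines (re.S makes '.' match '\n')
QUOTED_RE = re.compile(r'".*?"', re.S)
# Runs of newlines and/or URLs inside a quoted string
JUNK_RE = re.compile(r'(?:\n|https?:[^\s"]+)+')


def clean_quoted(data):
    """Inside every quoted string, replace newlines and URLs with a space.
    Text outside the quotes, including the record-ending newlines, is left alone."""
    return QUOTED_RE.sub(lambda m: JUNK_RE.sub(' ', m.group(0)), data)


sample = (
    '"created_at","text","location"\n'
    '"2016-09-17T21:48:10.000Z","MAIS UM CLIENTE #BH\n'
    '\n'
    'VÍDEO MARKETING:... https://t.co/d7qDOguDoA","Belo Horizonte"\n'
    '"2016-09-17T23:05:01.000Z","Não ia votar no Kalil não.","Belo Horizonte"\n'
)

print(clean_quoted(sample), end='')
```

The advantage over the previous version is that it no longer depends on premise (a): only the quote pairing matters, so a continuation line that happens to start with a quote-free word or a number is handled the same way.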
