Extract data from all rows of a file and create a dataframe

Question

Asked 3 years, 9 months ago

Viewed 322 times

I have a .txt file with 2000 lines (a WhatsApp chat) from which I need to extract the date, time and sender of each message into a pandas dataframe. I can do this for a single line with the function below:

import re

def parse(file):
    # despite the name, `file` holds a single line of the chat
    data = re.search(r'\d{2}/\d{2}/\d{4}', file)     # date
    hora = re.search(r'\d{2}:\d{2}', file)           # time
    pessoa = re.search(r'(?<=\-)(.*?)(?=\:)', file)  # sender
    return data.group(0), hora.group(0), pessoa.group(0)

which works perfectly for a line of the type:

    file = ('20/05/2020 20:35 - Rodrigo Toledo:')
    parse(file)

But I want a way to apply the parse function to all lines of the .txt file, and then turn the result into a dataframe.

Could you give an example of an error, or of another type of data that your code should work with?

– Evilmaax

2020/09/22 at 22:12
The code should always work with a txt file whose lines follow the pattern ('20/05/2020 20:35 - Rodrigo Toledo:'). The txt has 2000 lines, so the parse function will need to go through these 2000 lines, saving each processed line in another file that will serve as the basis for creating a dataframe.

– StatsPy

2020/09/22 at 22:27

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2020-09-23T01:20:24+00:00

If the format is always like this, you can use a single regex to extract all the data at once, collect the results in a list, and create the dataframe at the end:

import pandas as pd
import re

r = re.compile(r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+)')
itens = []
with open('dados.txt') as arq:
    for linha in arq: # for each line of the file
        m = r.match(linha)
        if m: # if the regex found a match, add it to the list
            itens.append(m.groups())

# create the dataframe
df = pd.DataFrame(itens, columns=['data', 'hora', 'nome'])

In the regex I put the parts corresponding to the date, time and name. For the date and time, I used the same patterns you were already using: the digit counts and the separators.

For the name, I used [^:]+, which means "one or more characters that are not :". So it takes everything after the hyphen (and the space that follows it) up to, but not including, the next :.
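To see how [^:]+ stops at the first colon, here is a minimal sketch. The message text after the name is something I made up for illustration; only the date, time and name come from the question.

```python
import re

# Sample line: the part after "Rodrigo Toledo:" is an invented message
linha = '20/05/2020 20:35 - Rodrigo Toledo: hello: how are you?'

# " - " anchors the match right after the time; [^:]+ then consumes
# characters until it reaches the first ':'
m = re.search(r' - ([^:]+)', linha)
nome = m.group(1)
print(nome)  # Rodrigo Toledo
```

Colons that appear later in the message do not matter, because the negated character class already stopped at the first one.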

Each of these parts is in parentheses to form a capture group, so I can get them all at once with the groups method, which returns a tuple with all the groups.
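As a quick check of what groups returns, here is a small sketch using the sample line from the question. The second string is a hypothetical line that does not follow the pattern, just to show that match returns None in that case.

```python
import re

r = re.compile(r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+)')

# Line in the format given in the question
m = r.match('20/05/2020 20:35 - Rodrigo Toledo:')
grupos = m.groups()
print(grupos)  # ('20/05/2020', '20:35', 'Rodrigo Toledo')

# The tuple can be unpacked directly
data, hora, nome = grupos

# Hypothetical line that does not start with a date: no match at all
sem_match = r.match('some text without date or sender')
print(sem_match)  # None
```

This is why the loop tests `if m:` before calling groups: lines that do not fit the format are simply skipped.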

At the end of the loop, the list itens will hold several tuples, each containing the date, time and name.

Then just create the dataframe and choose the column names (as an example, I used the creative names "data", "hora" and "nome", that is, date, time and name).
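If you want to try the whole thing without the real file, here is a sketch that swaps the file for an in-memory io.StringIO. The three sample lines, the second sender and the message texts are assumptions of mine, not data from your chat.

```python
import io
import re

import pandas as pd

r = re.compile(r'(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}) - ([^:]+)')

# Hypothetical sample standing in for the real txt file
amostra = io.StringIO(
    '20/05/2020 20:35 - Rodrigo Toledo: first message\n'
    'a continuation line without date or sender\n'
    '20/05/2020 20:36 - Maria Silva: second message\n'
)

itens = []
for linha in amostra:  # iterating a StringIO yields lines, like a file
    m = r.match(linha)
    if m:  # the continuation line does not match and is skipped
        itens.append(m.groups())

df = pd.DataFrame(itens, columns=['data', 'hora', 'nome'])
print(df)
```

With this sample the dataframe ends up with two rows and three columns, since the line without a date is ignored.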