Organizing data from a string according to a pattern

1

Friends, I am working on a scraping project. At some point I capture a table from the screen as one giant string, more or less like this:

lista = '0004434-48.2010 \n UNIÃO \n (30 dias úteis) 03/07/2017 \n 13/07/2017 \n 0008767-77.2013 \n 2017 \n (10 dias úteis) 03/07/2017 \n 13/07/2017'

I handled this by calling split with " \n " as the separator, which turned it into this list:

lista = ['0004434-48.2010', 'UNIÃO', '(30 dias úteis) 03/07/2017', '13/07/2017', '0008767-77.2013', '2017', '(10 dias úteis) 03/07/2017', '13/07/2017']

Now my difficulty is this: the first item in the list is the reference number of a table row. It identifies a particular contract, which runs through the item containing the second date. After that comes ANOTHER row (another contract), and the following items belong to that second contract.

Question: how can I separate them? I still have to process the dates, because contracts will only be "clicked" under certain conditions. I tried to put together a loop like this:

organizaProcessos = []
for x in range(len(lista)):
    if len(lista[x]) == 15: # identify the process number
        organizaProcessos.append(lista[x])

But this only gives me a list of process numbers, without their corresponding items, and when I try to nest another list inside the variable "organizaProcessos", it doesn't work...

2 answers

2


(TL;DR) If I understand what you want to do:

lista = ['0004434-48.2010',
 'UNIÃO',
 '(30 dias úteis) 03/07/2017',
 '13/07/2017',
 '0008767-77.2013',
 '2017',
 '(10 dias úteis) 03/07/2017',
 '13/07/2017']

def chunks(_list, parts):
    # Yield successive slices of `parts` items each
    for i in range(0, len(_list), parts):
        yield _list[i:i+parts]

# One sub-list per contract, kept in a list instead of dynamic variable names
contracts = list(chunks(lista, 4))

print ('Primeira parte: ',contracts[0])
print ('\nSegunda parte: ',contracts[1])

Output:

Primeira parte:  ['0004434-48.2010', 'UNIÃO', '(30 dias úteis) 03/07/2017', '13/07/2017']

Segunda parte:  ['0008767-77.2013', '2017', '(10 dias úteis) 03/07/2017', '13/07/2017']

In other words, you'll have n lists of 4 elements (n depending on how many contracts the table has), each list representing one contract, with the first element being the contract's identification.

See working on repl.it.
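
Since the dates still have to be processed afterwards, each chunk of four can also be unpacked into named fields. This is only a sketch: the `to_record` helper and the field names (`number`, `party`, `term`, `start`, `end`) are hypothetical, and it assumes every row has exactly four items with dates in dd/mm/yyyy format.

```python
from datetime import datetime

lista = ['0004434-48.2010', 'UNIÃO', '(30 dias úteis) 03/07/2017', '13/07/2017',
         '0008767-77.2013', '2017', '(10 dias úteis) 03/07/2017', '13/07/2017']

def chunks(_list, parts):
    # Yield successive slices of `parts` items each
    for i in range(0, len(_list), parts):
        yield _list[i:i+parts]

def to_record(chunk):
    # Hypothetical helper: the field names are illustrative only
    number, party, term_and_start, end = chunk
    # '(30 dias úteis) 03/07/2017' -> '(30 dias úteis)' and '03/07/2017'
    term, start = term_and_start.rsplit(' ', 1)
    return {
        'number': number,
        'party': party,
        'term': term,
        'start': datetime.strptime(start, '%d/%m/%Y').date(),
        'end': datetime.strptime(end, '%d/%m/%Y').date(),
    }

# Index the contracts by their reference number
records = {rec['number']: rec for rec in map(to_record, chunks(lista, 4))}

print(records['0008767-77.2013']['end'])
```

With real `date` objects in each record, conditions such as "only click contracts whose end date has passed" become plain comparisons.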

  • Buddy, I owe you some money now! Hahaha, thanks again!

  • Hahaha! Good to know I'm helping. Consider accepting the answer and upvoting it. :-)

1

Use parse() from dateutil.parser to test whether a string is a date: it raises an exception when the string cannot be parsed as one.

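#!/usr/bin/python
# -*- coding: utf-8 -*-
from dateutil.parser import parse

def chunks(string):
    # True only if the token parses as a date.
    # Bare integers such as '2017' are rejected first.
    try:
        int(string)
        return False
    except ValueError:
        try:
            parse(string)
            return True
        except Exception:
            return False

def split(string, num):
    # Cut the token list right after the num-th date found
    c = 0  # dates seen so far
    i = 0  # index just past the last token examined
    tokens = string.split(' ')
    for token in tokens:
        c += chunks(token)
        i += 1
        if c == num: break
    # tokens[i] is the '\n' that follows the date, so it is skipped
    return tokens[0:i], tokens[i+1:]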

string = '0004434-48.2010 \n UNIÃO \n (30 dias úteis) 03/07/2017 \n 13/07/2017 \n 0008767-77.2013 \n 2017 \n (10 dias úteis) 03/07/2017 \n 13/07/2017'
a,b = split(string,2)
print(a)
print(b)

This is the output:

['0004434-48.2010', '\n', 'UNIÃO', '\n', '(30', 'dias', 'úteis)', '03/07/2017', '\n', '13/07/2017']
['0008767-77.2013', '\n', '2017', '\n', '(10', 'dias', 'úteis)', '03/07/2017', '\n', '13/07/2017']

Note that this also works with a variable number of dates per line. Suppose I want to separate the lines after the third date, rather than after the second.

Just replace

a,b = split(string,2)

with

a,b = split(string,3)

and the result will be

['0004434-48.2010', '\n', 'UNIÃO', '\n', '(30', 'dias', 'úteis)', '03/07/2017', '\n', '13/07/2017', '\n', '0008767-77.2013', '\n', '2017', '\n', '(10', 'dias', 'úteis)', '03/07/2017']
['13/07/2017']
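
The call above cuts the string only once. To get every contract as its own list, the same idea can be applied in a loop: count dates and close a row each time the count reaches `num`. This is a sketch under assumptions rather than the code above: it splits on the newline separators instead of on spaces, and it swaps `dateutil` for `datetime.strptime` from the standard library, assuming every date is in dd/mm/yyyy format and sits at the end of an item. The names `ends_with_date` and `split_rows` are hypothetical.

```python
from datetime import datetime

def ends_with_date(item):
    # True if the last space-separated word of the item is a dd/mm/yyyy date
    words = item.split()
    if not words:
        return False
    try:
        datetime.strptime(words[-1], '%d/%m/%Y')
        return True
    except ValueError:
        return False

def split_rows(string, num):
    # Close a row every time `num` dates have been seen
    rows, current, count = [], [], 0
    for item in string.split('\n'):
        item = item.strip()
        if not item:
            continue
        current.append(item)
        count += ends_with_date(item)
        if count == num:
            rows.append(current)
            current, count = [], 0
    if current:  # leftover items that never reached `num` dates
        rows.append(current)
    return rows

string = '0004434-48.2010 \n UNIÃO \n (30 dias úteis) 03/07/2017 \n 13/07/2017 \n 0008767-77.2013 \n 2017 \n (10 dias úteis) 03/07/2017 \n 13/07/2017'
for row in split_rows(string, 2):
    print(row)
```

Because a bare `'2017'` does not match the dd/mm/yyyy format, it is not counted as a date, which plays the same role as the `int()` check in the original function.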
