python cleaning raw data manually

Question

Asked 5 years, 4 months ago

Viewed 107 times

-1

import pandas as pd
data_r = open('rosalind_gc.txt', 'r')
data_r1 = data_r.readlines()
data_r2 = []
data_r3 = []
# strip the \n from each line
for i in data_r1:
    data_r2.append(i.rstrip())
data_index = []
# collect the indexes of the Rosalind lines --done
for i in data_r2:
    if 'Rosalind' in i:
        data_index.append(data_r2.index(i))
# create dicts keyed by the Rosalind lines
for linha in data_r2:
    linha_index = data_r2.index(linha)
    if linha_index in data_index: # only happens on the Rosalind lines
        out_index = linha_index + 1
        data_r3.append({linha:''})
# assemble the sequence data

I am trying to assemble the data manually into a dictionary that maps each Rosalind species to its respective sequence, but the last key of the dictionary always ends up without a value.

Here is an abbreviated example of the dataset (as it appears in the txt file):

Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC TCCCACTAATAATTCTGAGG Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT ATATCCATTTGAGCAGACACGC Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC TGGGAACCTGCGGGCAGTAGGTGGAAT

2 answers

1

Assuming your input file is something like:

Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC TCCCACTAATAATTCTGAGG
Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT ATATCCATTTGTCAGCAGACACGC
Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC TGGGAACCTGCGGGCAGTAGGTGGAAT

Here is an example of commented code that can solve your problem:

import pandas as pd

registros = []

# Open the text file for reading
with open('rosalind_gc.txt') as arquivo:

    # For each line in the file...
    for linha in arquivo:

        # Remove the end-of-line character(s)
        linha = linha.rstrip()

        # Split the line into two fields,
        # using a space as the separator
        registro = linha.split(' ', 1)

        # Check that the line was split correctly
        if len(registro) == 2:

            # Build a dictionary from the two fields read
            dicionario = {
                'nome_especie': registro[0],
                'sequencia': registro[1]
            }

            # Add the dictionary to the list of records
            registros.append(dicionario)

# Convert the list of dictionaries
# into a data frame
df = pd.DataFrame(registros)

# Display the data frame
print(df)

Output:

    nome_especie                                          sequencia
0  Rosalind_6404  CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGG...
1  Rosalind_5959  CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGC...
2  Rosalind_0808  CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTC...
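
If the file actually keeps each Rosalind identifier on its own line, with the sequence wrapped over the lines that follow (an assumption on my part; the sample in the question can be read either way), the one-record-per-line split above will not pick up the sequence. Here is a minimal sketch for that layout, building the dictionary the question asks for. `parse_records` and `sample` are hypothetical names, and the sample lines are cut from the question's data with the line breaks guessed:

```python
def parse_records(lines):
    """Group sequence lines under the most recent 'Rosalind' header line."""
    sequences = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if 'Rosalind' in line:
            # New record; drop a leading '>' in case the file is FASTA-style
            current = line.lstrip('>')
            sequences[current] = ''
        elif current is not None:
            # Continuation line: append to the record being built
            sequences[current] += line
    return sequences


# Assumed layout: header line, then the sequence wrapped over several lines
sample = [
    'Rosalind_6404\n',
    'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC\n',
    'TCCCACTAATAATTCTGAGG\n',
    'Rosalind_5959\n',
    'CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT\n',
    'ATATCCATTTGAGCAGACACGC\n',
]

records = parse_records(sample)
print(records['Rosalind_6404'])
```

To run it on the real file, pass the open file object instead of `sample`, for example `with open('rosalind_gc.txt') as arquivo: records = parse_records(arquivo)`. Tracking the current header this way also avoids `list.index`, which always returns the first match and so gives the wrong position when two lines are identical.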

by Luciano B. M. • 26 points · Answer 1 · 2020-04-06T22:54:36+00:00

Hi, my friend.

When you use a language, you need to get familiar with the resources it offers.

Try to learn how to use Python and, specifically, the pandas library, which is very powerful. Python was created, among other things, to be highly productive.

Let's get to it. I won't beat around the bush. Your solution is below:

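import pandas as pd

dados_lista = []

# Replace 'SEU_ARQUIVO_AQUI' with the path to your file
with open('SEU_ARQUIVO_AQUI', 'r') as arquivo:
    dados = arquivo.readlines()

for um_dado in dados:
    # Strip the newline, then split into name and sequence at the first space
    campos = um_dado.rstrip().split(' ', 1)

    # Skip blank or malformed lines that lack a second field
    if len(campos) != 2:
        continue
    nome, sequencia_dna = campos
    dados_lista.append({'nome' : nome, 'sequencia_dna' : sequencia_dna})

df = pd.DataFrame(dados_lista)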

df is now a DataFrame that contains all your data, just the way you need it.
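
If what you want in the end is a plain dictionary from name to sequence, as the question describes, rather than a DataFrame, the same split can fill one directly. This is only a sketch: `linhas` and `sequencias` are hypothetical names, the two sample lines are copied from the question's data, and removing the inner space assumes it is a line-wrapping artifact and not part of the sequence.

```python
# Sample lines in the one-record-per-line layout (taken from the question)
linhas = [
    'Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC TCCCACTAATAATTCTGAGG\n',
    'Rosalind_0808 CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC TGGGAACCTGCGGGCAGTAGGTGGAAT\n',
]

sequencias = {}
for linha in linhas:
    # Split on the first run of whitespace: name, then the rest of the line
    campos = linha.split(None, 1)
    if len(campos) != 2:
        continue  # blank or malformed line
    nome, sequencia_dna = campos
    # Assumption: whitespace inside the sequence is a wrapping artifact
    sequencias[nome] = ''.join(sequencia_dna.split())

print(sequencias['Rosalind_0808'])
```

Looking up `sequencias['Rosalind_6404']` then returns the full sequence as one string, with no trailing newline.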

The result for me was as below, using the data you sent in the message.

I hope I’ve helped.

Big hug.