Read FASTA files in Python and skip the first line


I need to read a FASTA file, but I don't know how to skip the first line of the sequence. Example:

>sequence A

ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc

While testing, I noticed that if I add letters to the first line (for example, >sequence aaaA), they are included in the count.

How do I exclude the first line from my letter count?

  • Can you read a whole line? If so, read the first line and throw it away.

  • Can you explain it better? What exactly is in the file? Which line do you want to remove? Have you already managed to read the file with Python?

  • ctagc reminds me of DNA sequencing. Electrophoresis?

  • @Jeffersonquesado I edited the question. Maybe it is easier to follow now :)

  • What do you mean by 'first line'? How many lines does your example have? Can you show part of a real file?

  • @Guilhermenascimento much better! And it really was about bioinformatics, so that was a good guess.

  • @Jeffersonquesado it really was :D


1 answer


Assuming the file has a format similar to this:

>SEQUENCE 1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE 2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

I assume that what you want to remove are the lines in the format >SEQUENCE xxxx (or similar). I should say up front that I know nothing about this format beyond what I read on Wikipedia, but your goal seems simple if it really is just a matter of reading the FASTA file line by line.

arquivo = 'foo.dat'  # Your FASTA file

with open(arquivo, 'r') as f:  # Open for reading; the file is closed automatically
    lines = f.readlines()  # Read all lines into a list

relist = []  # New list that holds only the lines of interest

for line in lines:
    if not line.startswith('>'):  # Skip header lines, which start with >
        relist.append(line)

print(relist)  # Show the list

If what you want is simply to remove the first line, whatever it contains, use .pop(0) on the list of lines:

arquivo = 'foo.dat'

with open(arquivo, 'r') as f:
    lines = f.readlines()  # Read all lines into a list

firstLine = lines.pop(0)  # Remove the first line from the list (not from the file object)

print(lines)

To turn the list into a single string ("text"), use str.join(list). Note that every line still ends with its newline character, so strip the lines first if you only want the letters. For the first example:

''.join(relist)

And for the second:

''.join(lines)
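
Since the original goal was a letter count, here is a minimal sketch that combines the steps above: it skips header lines, strips newlines and spaces, and counts what is left. The function name `count_residues` and the in-memory sample are assumptions for illustration only; with a real file you would pass the object returned by `open(...)` instead.

```python
from collections import Counter
import io


def count_residues(handle):
    """Count sequence characters, ignoring FASTA header lines (hypothetical helper)."""
    counts = Counter()
    for line in handle:
        line = line.strip()  # drop the trailing newline and surrounding spaces
        if not line or line.startswith('>'):
            continue  # skip blank lines and header lines
        counts.update(line.replace(' ', ''))  # ignore spaces inside the sequence
    return counts


# Illustrative in-memory sample; the letters in the header must not be counted.
sample = io.StringIO(">sequence aaaA\nggtaag tcc\nctag\n")
counts = count_residues(sample)
print(counts)
```

Iterating over the file object directly avoids loading the whole file into memory, which can matter for large FASTA files.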
  • From what I read about the format, each sequence is identified by a line starting with >. What if you want the content of a sequence, i.e., what lies between two lines starting with >?

  • @Jeffersonquesado then it would be a multi-dimensional list: each line starting with > creates a new sub-list, whose first item is the header and whose remaining items are the data lines, I believe... It is not difficult to do and I think it would turn out nicely :D, but I'll wait for the OP to say whether that is what they want, otherwise the answer ends up with too many lines xD

  • @Jeffersonquesado In this answer I do something similar to what you described, if I understood correctly.

  • @Andersoncarloswoss his answer is sensational
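
The grouping discussed in the comments above (one entry per > header, followed by its data lines) could be sketched roughly as follows. This is not the code from the answer the comments refer to; the function name `read_fasta` and the sample data are assumptions, and a dictionary is used here in place of the nested list that was suggested.

```python
import io


def read_fasta(handle):
    """Map each header (without '>') to its joined sequence (hypothetical helper)."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            header = line[1:].strip()  # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # data line of the current record
    # Join the collected lines of each record into one string
    return {name: ''.join(parts) for name, parts in records.items()}


# Illustrative in-memory sample with two short records.
sample = io.StringIO(">SEQUENCE 1\nMTEIT\nAAMVK\n>SEQUENCE 2\nSATVS\n")
records = read_fasta(sample)
for name, seq in records.items():
    print(name, len(seq))
```

With this sketch, two records that share the same header would overwrite each other; a list of (header, sequence) pairs would avoid that if duplicate headers are possible.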
