How to count multiple characters from a string using Python

I want to count how many triplets (codons) of a given type, ATC for example, occur in the sequence seq.

What I’ve done so far is:

# NUCLEOTIDE COUNT
# COUNT OF TRIPLETS/CODONS IN A SEQUENCE seq

seq = 'ATC CAA GTC AGC TAG CGT ATC ATC GTC ATG CTC AAA CAC TAC GAT GCT AAT'.replace(" ", "")

conta = contt = contc = contg = contador = contATC = 0

for x in seq:
    if x == 'A':
      conta += 1
    if x == 'T':
      contt += 1
    if x == 'C':
      contc += 1
    if x == 'G':
      contg += 1
    if x == 'ATC':       # ATC triplet
      contATC += 1

    contador += 1  # total

print(f'''The number of "A" nucleotides is {conta},
      of "T" is {contt}, of "C" is {contc}, of "G" is {contg}, of "ATC" is {contATC},
      so the total was {contador}.''')

You can criticize me constructively about other things too, I’m open to learning!

1 answer

If the triplets are separated by spaces, one option is to break the string apart with split and count how many of the pieces are equal to "ATC":

seq = 'ATC CAA GTC AGC TAG CGT ATC ATC GTC ATG CTC AAA CAC TAC GAT GCT AAT'
atc = 0  # number of ATC triplets
for s in seq.split():
    if s == 'ATC':
        atc += 1
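
As a more compact variant of the loop above (a sketch, not a replacement for it), the list returned by split has its own count method, which counts the items that are exactly equal to the argument. The variable name triplets is only illustrative.

```python
seq = 'ATC CAA GTC AGC TAG CGT ATC ATC GTC ATG CTC AAA CAC TAC GAT GCT AAT'

# split() with no arguments breaks the string on whitespace,
# producing a list of 3-letter pieces.
triplets = seq.split()

# list.count compares whole items, so only exact "ATC" pieces are counted.
atc = triplets.count('ATC')

print(atc)  # 3
```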

To count the individual letters, you can use a Counter:

from collections import Counter
c = Counter(seq)

And to get the totals, just read them from the Counter:

print(f'The number of "A" nucleotides is {c["A"]}, of "T" is {c["T"]}, of "C" is {c["C"]}, of "G" is {c["G"]}, of "ATC" is {atc}')

As for the overall total, it is enough to add up the values in the Counter, ignoring the spaces:

total = sum(qtd for s, qtd in c.items() if s in ('A', 'T', 'C', 'G'))
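
Another possibility, shown here only as a sketch, is to remove the spaces before building the Counter, so that no filtering is needed when summing. The same class can also count every distinct piece at once if it is given the list produced by split. The names letters and triplet_counts are illustrative.

```python
from collections import Counter

seq = 'ATC CAA GTC AGC TAG CGT ATC ATC GTC ATG CTC AAA CAC TAC GAT GCT AAT'

# With the spaces removed, every key in the Counter is a nucleotide,
# so the total is simply the sum of all the counts.
letters = Counter(seq.replace(' ', ''))
total = sum(letters.values())

# A Counter over the list of 3-letter pieces counts each distinct one in a single pass.
triplet_counts = Counter(seq.split())

print(total)                  # 51
print(triplet_counts['ATC'])  # 3
```

A key that never appeared, such as triplet_counts['XYZ'], returns 0 instead of raising an error.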

Detail: when you iterate over a string (as in for x in seq), at each iteration x will be one of the characters of the string, so it will never be equal to "ATC". That is why your code does not work.
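
A quick way to check this behaviour, using a shortened string chosen just for illustration:

```python
seq = 'ATCGAT'

# Iterating over a string yields one character at a time.
chars = [x for x in seq]
print(chars)  # ['A', 'T', 'C', 'G', 'A', 'T']

# Every item has length 1, so a comparison with a 3-letter string is always False.
print(any(x == 'ATC' for x in seq))  # False
```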


If the sequence has no spaces, an alternative is to walk through the string and take pieces of 3 characters at a time:

seq = 'ATCGATCTA'
atc = 0
for i in range(0, len(seq), 3):
    if seq[i:i + 3] == 'ATC':
        atc += 1

I use slicing syntax to pick out a particular piece of the string: seq[i:i + 3] takes 3 characters starting at position i, and i goes from zero up to the length of the string in steps of 3. That is, in the example above I get "ATC" first, then "GAT", then "CTA", so the number of "ATC" will be 1.
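
The same idea can be wrapped in a small function. This is only a sketch: count_in_frame is a hypothetical name, and it assumes the reading frame starts at position 0 and that any whitespace in the input should be ignored.

```python
def count_in_frame(seq, triplet):
    """Count `triplet` only at positions 0, 3, 6, ... of `seq` (hypothetical helper)."""
    # Drop any whitespace so the function accepts both spaced and unspaced input.
    seq = ''.join(seq.split())
    # Compare each 3-character slice, stepping 3 positions at a time.
    return sum(1 for i in range(0, len(seq), 3) if seq[i:i + 3] == triplet)


print(count_in_frame('ATCGATCTA', 'ATC'))  # 1
print(count_in_frame('ATC CAA GTC AGC TAG CGT ATC ATC GTC', 'ATC'))  # 3
```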

In the comments you said you used count, but be aware that this can give a different result. For the example above, seq.count('ATC') returns 2, because "ATC" occurs twice in the string:

ATCGATCTA
^^^ ^^^
 |   |
 |   \_ second occurrence
 \_____ first occurrence

However, as I understand it, the second occurrence should not count, because its "AT" belongs to the triplet "GAT" and its "C" belongs to "CTA", so it is not really an occurrence of the triplet "ATC".
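
To make the difference concrete, here is a sketch that compares count with a position-aware search. It uses re.finditer to get the start index of each match and keeps only the matches that begin at a multiple of 3, again assuming the reading frame starts at position 0. The names anywhere, starts and in_frame are illustrative.

```python
import re

seq = 'ATCGATCTA'

# str.count finds every non-overlapping occurrence, regardless of position.
anywhere = seq.count('ATC')

# re.finditer reports where each match starts, so matches that do not
# begin at a multiple of 3 can be filtered out.
starts = [m.start() for m in re.finditer('ATC', seq)]
in_frame = [i for i in starts if i % 3 == 0]

print(anywhere)  # 2
print(starts)    # [0, 4]
print(in_frame)  # [0]
```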

  • Thank you very much!! What if the sequence did not have the spaces? How would I identify an 'ATC' triplet? seq = 'ATCCAAGTCAGCTAGCGTATCATCGTCATGCTCAAACACTACGATGCTAAT'

  • Suppose your original sequence had '... AAT CTC ...'. After removing the spaces (...AATCTC...), would you consider this ATC valid? Please explain the requirements better.

  • Ahhh, I get it, I got it here... I used seq.count('ATC') and that’s enough... Thank you all very much!

  • Don't forget the count method, available for both lists and strings. To count the occurrences of a single letter or sequence, it works fine, but it goes through the entire main sequence for each search, so to count several sequences at the same time another approach may be better. If the triplets do not start at multiples of 3, regular expressions with finditer can help, but still only for matches that do not overlap. If nucleotide overlaps are possible, things may get more complicated.

  • @Welingtonsilvadev If I understand correctly, count is not always going to work. I updated the answer to explain this detail better

  • @jsbueno, that's exactly it... Since I'm still on the basics, I'm assuming the sequence length is a multiple of 3... I can't say anything about nucleotide overlap yet, but if it comes up in a problem and I can't solve it, I'll ask here again. Thanks a lot!

  • @hkotsubo it turned out excellent, very clear, thank you very much!!!!
