How to detect Start Codon and Stop Codon in a nucleotide sequence using Python?

3

I am trying to solve an exercise that asks for a function that counts codons between a start and a stop, where the start is the first occurrence of "ATG" and the stop is the first occurrence of "TAA", "TAG" or "TGA".

With the help of the comments, I was able to adjust the code a little, but the stop is limited to the single codon I pass to s.find(). I'd like help getting it to look for any of the three stop codons named in the exercise.

Here is the statement of the exercise that prompted the question:

Write a Python function contacodon that receives a sequence of letters representing nucleotides (e.g., A, C, G, T), checks whether the sequence is valid (i.e., contains only As, Cs, Gs, and Ts) and returns the number of occurrences of each codon, counting from the start codon "ATG" up to one of the stop codons "TAA", "TAG" or "TGA".

Input example:

contacodon("AGCGATCGAGATGAGCATCGCATCGCGGACTACCGCGCGCGCGCGCGGGAGATGAGCATCGACGACTCGACTAG")

Output for the input above:

{
'ATG': 1,
'AGC': 1,
'ATC': 1,
'GCA': 1,
'TCG': 1,
'CGG': 1,
'ACT': 1,
'ACC': 1,
'GCG': 2,
'CGC': 2,
'GGG': 1,
'AGA': 1,
'TGA': 1
} 

My code:

def verificar(s):
    s = s.upper()
    for ent in s:
        if ent not in "ACGT":
            return False
    return True

while True:
    s = input("Enter the sequence: \n").upper()
    if verificar(s):
        break
    print("Invalid sequence")

count = {}
for i in range(s.find('ATG'), s.rfind('TAA') + 1, 3):
    codon = s[i:i+3]
    if codon in count:
        count[codon] += 1
    else:
        count[codon] = 1
print('\n', count, '\n')
  • Do you have the full statement of the exercise? The explanation is vague.

  • 2

    The sequence given as an example in the statement does not have a length that is a multiple of three. Is that right? If so, the approach you took (stepping through the string three characters at a time) will not work.

  • It is a deliberate error; the sequences that will be tested have lengths that are multiples of 3. The statement only gives an example of an input and of what has to come out, counting from the start "ATG" and stopping at "TAA", "TAG" or "TGA".

  • I'd like to help, but there are a lot of errors in your code. I'm not going to rewrite it from scratch, and I'd have to point them all out.

  • 1

    If it's intentional, Marcelo, then, as I said, the approach of stepping through the string with a range of step three is completely invalid. You'll probably have to think of something else (and in that case we won't do it for you here, as that's not the purpose of the site).

  • Ok! Still thank you for your attention, I will think of some other solution for the exercise.

  • 1

    Use str.find() to locate the index of the first occurrence of ATG.

  • Thanks Augusto Vasques, I will read this document and redo the code.

  • 2

    Your logic is almost correct; only, instead of starting your range at zero, you should start from the index of the first ATG codon (which you can find with the s.find() method of your input string, as @Augustovasques commented). Also, there is no need to calculate the exact end of the sequence for the range with len(s)-len(s)%3, since you take 3 characters at a time with slices, and slices never raise IndexError.

  • 1

    @Jfaccioni he can take the slice s[s.find("ATG"):] and split it into three-character chunks with this function: https://answall.com/a/496160/137387

  • With the tips given I was able to tweak the code a little, using this: for i in range(s.find('ATG'),s.find('TAA')+1,3): It prints the desired range, but the stop can occur in three cases: 'TAA', 'TAG' and 'TGA'. I would like some guidance on how to handle that, either here or by pointing me to documentation.

  • 1

    Luiz Felipe, I reformulated the question.

2 answers

3

With the help of the built-in re module, you can write a regular expression that validates the nucleotide sequence and isolates the portion of the sequence containing the codons. Then use an object of the class collections.Counter to count the codons.

import re
from collections import Counter

pattern = r"^[ACGT]*(?<=ATG)(?P<codons>([ACGT]{3})+?)(?=TAA|TAG|TGA)[ACGT]*$"

# Defines the function contacodon(entrada). The regex parameter is compiled once, when the function is defined, and should not be passed by callers.
def contacodon(entrada, regex=re.compile(pattern)):
    # Check whether the input matches the pattern...
    if m := regex.match(entrada):
        # if it does, extract the group containing the codons and return a dictionary with the codon counts.
        cdns = m["codons"]
        return dict(Counter(cdns[i:i+3] for i in range(0, len(cdns), 3)))
    else:
        # if it does not, return an empty dictionary.
        return dict()

#s = input("Enter the sequence: \n").upper()
s = "GCGATCGAGATGAGCATCGCATCGCGGACTACCGCGCGCGCGCGCGGGAGATGAGCATCGACGACTCGACTAG"

print(contacodon(s))

Test the code on Repl.it

The regular expression ^[ACGT]*(?<=ATG)(?P<codons>([ACGT]{3})+?)(?=TAA|TAG|TGA)[ACGT]*$ can be understood as:

  • ^[ACGT]* the string must start with zero or more characters from A, C, G or T.
  • (?<=ATG) the group (?P<codons>([ACGT]{3})+?) is only captured if it is preceded by the start codon ATG.
  • (?P<codons>([ACGT]{3})+?) defines the capture group codons, which consists of one or more groups of three characters from A, C, G or T, matched lazily so that it ends at the first in-frame stop codon.
  • (?=TAA|TAG|TGA) the group (?P<codons>([ACGT]{3})+?) is only captured if it is followed by one of the stop codons TAA, TAG or TGA.
  • [ACGT]*$ the string must end with zero or more characters from A, C, G or T.
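
As a quick illustration of the breakdown above, here is a minimal sketch that applies the same pattern to a short sequence. The sequence is made up for this example (prefix CC, start codon, two codons, stop codon, suffix GG) and does not come from the exercise.

```python
import re

# Same pattern as in the answer.
pattern = re.compile(
    r"^[ACGT]*(?<=ATG)(?P<codons>([ACGT]{3})+?)(?=TAA|TAG|TGA)[ACGT]*$"
)

# Hypothetical input: CC + ATG + AAA + CCC + TAG + GG
example = "CCATGAAACCCTAGGG"
m = pattern.match(example)
print(m["codons"])  # AAACCC

# A character outside A, C, G, T makes the whole match fail.
invalid = pattern.match("CCATGAAACCCTAGGX")
print(invalid)  # None
```

Note that the captured group holds only what lies between the start and stop codons; neither ATG nor the stop codon is part of it.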

EDIT
As Luiz Felipe pointed out in the comments, the same behavior can also be obtained in the function contacodon() without lookarounds, using the following regular expression pattern:

pattern = r"^[ACGT]*ATG(?P<codons>([ACGT]{3})+?)(TAA|TAG|TGA)[ACGT]*$"

where:

  • ^[ACGT]* the string must start with zero or more characters from A, C, G or T.
  • ATG matches the start codon ATG.
  • (?P<codons>([ACGT]{3})+?) defines the capture group codons, which consists of one or more groups of three characters from A, C, G or T.
  • (TAA|TAG|TGA) matches one of the stop codons TAA, TAG or TGA.
  • [ACGT]*$ the string must end with zero or more characters from A, C, G or T.

EDIT
Another relevant pattern, suggested in the comments by Hkotsubo:

pattern = "^[ACGT]*?ATG(?P<codons>([ACGT]{3})+?)T(?:A[AG]|GA)[ACGT]*$"

where:

  • ^[ACGT]*? the string must start with zero or more characters from A, C, G or T, matched lazily, so the earliest ATG that is followed by an in-frame stop codon is preferred.
  • ATG matches the start codon ATG.
  • (?P<codons>([ACGT]{3})+?) defines the capture group codons, which consists of one or more groups of three characters from A, C, G or T.
  • T(?:A[AG]|GA) matches one of the stop codons TAA, TAG or TGA.
  • [ACGT]*$ the string must end with zero or more characters from A, C, G or T.
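
One practical difference between the patterns in this answer is the prefix: ^[ACGT]* is greedy, while ^[ACGT]*? is lazy. The sketch below, on a made-up sequence containing two complete ATG...stop stretches, suggests that this changes which ATG is used as the start. The variable names are chosen for this example only.

```python
import re

# First pattern from the answer (greedy prefix) and the last one (lazy prefix).
greedy_prefix = re.compile(
    r"^[ACGT]*(?<=ATG)(?P<codons>([ACGT]{3})+?)(?=TAA|TAG|TGA)[ACGT]*$"
)
lazy_prefix = re.compile(
    r"^[ACGT]*?ATG(?P<codons>([ACGT]{3})+?)T(?:A[AG]|GA)[ACGT]*$"
)

# Hypothetical input: ATG AAA TAA followed by ATG CCC TAG.
seq = "ATGAAATAAATGCCCTAG"

greedy_result = greedy_prefix.match(seq)["codons"]
lazy_result = lazy_prefix.match(seq)["codons"]

print(greedy_result)  # CCC (the greedy prefix ends up at the last usable ATG)
print(lazy_result)    # AAA (the lazy prefix stops at the first usable ATG)
```

If the exercise expects counting to begin at the first ATG, the lazy prefix appears to be the safer choice when a sequence may contain more than one start codon.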
  • 2

    Excellent answer! I just wondered whether the lookarounds are really needed there. Do you think an expression like ^[ACGT]*ATG(?P<codons>(?:[ACGT]{3})+)(TAA|TAG|TGA)[ACGT]*$ would be enough?

  • 2

    @Luizfelipe, interesting question. That regular expression pattern behaves the same as the pattern above; I will add it to the answer.

  • 3

    I did some tests here and, at least for the string in the question, the version without the lookarounds is a little faster (or "less slow"; anyway, if you're using regex, it's because speed is not the priority :-D, and for small strings the difference will be negligible). In any case, it can still be improved a little more: https://regex101.com/r/ZGpfgi/1 - Of course, I haven't done many tests, and I don't know if there are any bizarre sequences that cause problems (with regex you never know), but at first I think that...

  • 1

    @hkotsubo I will add your regex to the answer; I will only change the group (?:[ACGT]{3})+?) to (?P<codons>([ACGT]{3})+?) to fit the example.

3


The code works, in the sense that it counts from the ATG codon up to TAA.

The problem is that there are three stop codons and the program only recognizes TAA. The issue lies where you create the range for the iterations:

range(s.find('ATG'), s.find('TAA') + 1, 3)

You are basically creating a range that advances in steps of three, from the index of the first occurrence of the substring "ATG" up to the index of the first occurrence of the substring "TAA".

I see two problems with using the result of s.find('TAA') as the end index:

  1. You risk creating a range that runs over a sequence with an invalid number of codons (there may be only 5 characters between ATG and TAA, for example).
  2. You don't recognize the other two stop codons, "TAG" and "TGA".

To fix both problems, simply do not derive the upper limit of the range from a stop codon. Iterate up to the full length of the string instead.

You can then check for the stop codon inside the loop itself and use break to end the iteration as soon as one is found.

Something like this:

# List of stop codons:
stops = ["TAA", "TAG", "TGA"]

count = {}
# Using len(s) as the limit avoids counting an empty "codon" past the end of the string.
for i in range(s.find('ATG'), len(s), 3):
    codon = s[i: i + 3]
    if codon in count:
        count[codon] += 1
    else:
        count[codon] = 1

    # If a stop codon has been found, stop iterating:
    if codon in stops:
        break
print(count)

I have omitted the rest of the code for brevity. See it working on Ideone.

Note that, although this works with a plain dictionary, Python offers better tools for counting, such as the Counter class, available in the collections module. For example:

from collections import Counter

stops = ["TAA", "TAG", "TGA"]
count = Counter()
for i in range(seq.find('ATG'), len(seq), 3):
    codon = seq[i: i + 3]
    count.update([codon])
    if codon in stops:
        break

print(dict(count))

See it working on Ideone.

Another alternative is defaultdict(int). Though less powerful than Counter, it implicitly initializes missing values to 0, which the first code example does by hand. See more about them on SOen (the English-language Stack Overflow).
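
A minimal sketch of the defaultdict(int) variant could look like the following. The input sequence is made up for illustration and is assumed to be already validated and to contain an ATG.

```python
from collections import defaultdict

stops = ["TAA", "TAG", "TGA"]

# Hypothetical input: CC + ATG + AAA + CCC + AAA + TAG + GG
seq = "CCATGAAACCCAAATAGGG"

# Missing keys are created with the value int() == 0 on first access.
count = defaultdict(int)
for i in range(seq.find("ATG"), len(seq), 3):
    codon = seq[i:i + 3]
    count[codon] += 1
    if codon in stops:
        break

print(dict(count))  # {'ATG': 1, 'AAA': 2, 'CCC': 1, 'TAG': 1}
```

The loop body no longer needs the if/else that initializes each key, which is the only difference from the plain dictionary version.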

Finally, note that both snippets in this answer print the count even if no stop codon is found. Moreover, if the last "codon" of such an unterminated string has only one or two characters (instead of the expected three), it is counted as well. To prevent this, you can add some kind of check. One option is a regular expression, as nicely demonstrated in the answer by Augusto Vasques, which rejects sequences that deviate from the expected pattern. I leave another example on Ideone.
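
As one possible way to add such a check without a regular expression, the sketch below returns an empty dictionary unless the sequence is valid, contains an ATG, and reaches an in-frame stop codon. The function name count_codons_checked and the choice of returning {} on failure are assumptions made for this example, not part of the exercise.

```python
from collections import Counter

STOPS = {"TAA", "TAG", "TGA"}

def count_codons_checked(seq):
    """Count codons from the first ATG to the first in-frame stop codon.

    Returns {} if the sequence is empty or invalid, has no ATG,
    or has no in-frame stop codon after the ATG.
    """
    seq = seq.upper()
    if not seq or set(seq) - set("ACGT"):
        return {}
    start = seq.find("ATG")
    if start == -1:
        return {}
    count = Counter()
    # The limit len(seq) - 2 ensures only complete three-letter codons are read.
    for i in range(start, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        count[codon] += 1
        if codon in STOPS:
            return dict(count)
    return {}  # ran out of sequence before finding a stop codon

print(count_codons_checked("CCATGAAATGAGG"))  # {'ATG': 1, 'AAA': 1, 'TGA': 1}
print(count_codons_checked("ATGAAACC"))       # {}
```

Unlike the regular expression approach, this always starts at the first ATG and never looks for a later one if the first has no in-frame stop codon.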

  • 1

    Luiz Felipe, thanks for the very enlightening explanation. I can only thank you for the help.
