Identify string between two known snippets in a string

3

I would like help with the following:

Given this sequence:

CGC UUC GCU UUG GAA AAU UUG UGU GUU UUU UGU GGC UGC UCG CUG CUC AAA UUG UUC GCU GCU UUU UGU GUC CUG GCU GCU UUU AUU AUU UUA CGC UGC UUG GCG CUG CUY UUA CGC UGC UUG GGC UUG UUG UGG CUU UGG UUG UUU GUU UAU UAY GCU GCU CUU GUU GUU GUU GCU UGU UGU GCC UAU GGC 

I have to write a program that reads this sequence and, when it finds a UAG, keeps all the letters until it finds a UAA.

For example, UAG UGG GAU UUA UAA.

How do I do this?

  • Welcome to the site. I invite you to take the [tour] to learn the basics of how the site works and to read the [Ask] guide. Could you improve your question by describing in more detail what difficulty you ran into? Have you tried anything yet? Did you get an error? Which one? Please use the [Edit] button to add this information.

  • The example string contains no occurrence of `UAG`.

  • Your example data does not contain the sequence UAG UGG GAU UUA UAA, right?

  • What if there are multiple matching sequences? Should the program return all of them, or stop after the first one?

  • No, my example does not have the sequence.

2 answers

4

You can build a finite state machine with only 2 states to solve your problem:

def pesquisar(seq, inicio, fim):
    estado = 0  # 0 = looking for inicio, 1 = collecting until fim
    ret = []    # completed runs
    aux = []    # run currently being collected

    for x in seq:
        if estado == 0:
            if x == inicio:
                aux = [x]
                estado = 1
        elif estado == 1:
            aux.append(x)
            if x == fim:
                ret.append(aux)
                estado = 0

    return ret


sequencia = ['CGC','UUC','GCU','UUG','GAA','AAU','UUG','UGU','GUU','UUU','UGU','GGC','UGC','UCG','CUG','CUC','AAA','UUG','UUC','GCU','GCU','UUU','UGU','GUC','CUG','GCU','GCU','UUU','AUU','AUU','UUA','CGC','UGC','UUG','GCG','CUG','CUY','UUA','CGC','UGC','UUG','GGC','UUG','UUG','UGG','CUU','UGG','UUG','UUU','GUU','UAU','UAY','GCU','GCU','CUU','GUU','GUU','GUU','GCU','UGU','UGU','GCC','UAU','GGC']

print(pesquisar( sequencia, inicio = 'UGU', fim = 'UGC' ))

Output:

[['UGU', 'GUU', 'UUU', 'UGU', 'GGC', 'UGC'],
 ['UGU', 'GUC', 'CUG', 'GCU', 'GCU', 'UUU', 'AUU', 'AUU', 'UUA', 'CGC', 'UGC']]
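
To use the same two-state idea with the markers from the question (UAG and UAA), the raw string can be split into codons first. This is only a minimal sketch, assuming the codons are separated by whitespace. The sample text is made up for illustration, because the sequence in the question contains no UAG:

```python
def pesquisar(seq, inicio, fim):
    # Same two-state machine as above: 0 = searching, 1 = collecting.
    estado = 0
    ret = []
    aux = []
    for x in seq:
        if estado == 0:
            if x == inicio:
                aux = [x]
                estado = 1
        else:
            aux.append(x)
            if x == fim:
                ret.append(aux)
                estado = 0
    return ret


# Hypothetical input: the example run from the question, surrounded by other codons.
texto = "CGC UUC UAG UGG GAU UUA UAA GCU UUG"

# split() with no argument splits on any run of whitespace.
encontrados = pesquisar(texto.split(), inicio="UAG", fim="UAA")
print([" ".join(grupo) for grupo in encontrados])
```

A run that is opened but never closed (a UAG with no later UAA) is silently dropped, which may or may not be what you want.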

EDIT:

State machines can also be written in Python as generators, using yield. Here is an alternative way of solving the problem with even more compact code:

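def pesquisar(seq, inicio, fim):
    ret = []
    for i in seq:
        # Start collecting at inicio, and keep collecting while a run is open.
        if i == inicio or ret:
            ret.append(i)
        # A run is complete once fim is reached.
        if i == fim and ret:
            yield ret
            ret = []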

sequencia = ['CGC','UUC','GCU','UUG','GAA','AAU','UUG','UGU','GUU','UUU','UGU','GGC','UGC','UCG','CUG','CUC','AAA','UUG','UUC','GCU','GCU','UUU','UGU','GUC','CUG','GCU','GCU','UUU','AUU','AUU','UUA','CGC','UGC','UUG','GCG','CUG','CUY','UUA','CGC','UGC','UUG','GGC','UUG','UUG','UGG','CUU','UGG','UUG','UUU','GUU','UAU','UAY','GCU','GCU','CUU','GUU','GUU','GUU','GCU','UGU','UGU','GCC','UAU','GGC']

print(list(pesquisar( sequencia, inicio = 'UGU', fim = 'UGC')))

Output:

[['UGU', 'GUU', 'UUU', 'UGU', 'GGC', 'UGC'],
 ['UGU', 'GUC', 'CUG', 'GCU', 'GCU', 'UUU', 'AUU', 'AUU', 'UUA', 'CGC', 'UGC']]
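
Because this version is a generator, it produces matches lazily, so a caller who only wants the first match can stop early instead of building the whole list. A small sketch of that usage, with a shortened, made-up sequence for illustration:

```python
def pesquisar(seq, inicio, fim):
    # Generator version from above: yields each run from inicio to fim.
    ret = []
    for i in seq:
        if i == inicio or ret:
            ret.append(i)
        if i == fim and ret:
            yield ret
            ret = []


# Shortened, illustrative sequence (not the full data from the question).
sequencia = ["CGC", "UGU", "GUU", "UGC", "AAA", "UGU", "GGC", "UGC"]

# next() pulls only the first match; the second argument is the value
# returned when the generator finishes without yielding anything.
primeiro = next(pesquisar(sequencia, "UGU", "UGC"), None)
print(primeiro)

nenhum = next(pesquisar(sequencia, "UAG", "UAA"), None)
print(nenhum)
```

Wrapping the call in list(...) instead, as in the answer, collects every match.
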

  • Hmm... something Pythonic tells me this could be simplified much further with a list comprehension...

2

I think this might help you.

comeco = "CGC"
fim = "AAA"
string = "CGC UUC GCU UUG GAA AAU UUG UGU GUU UUU UGU GGC UGC UCG CUG CUC AAA UUG UUC GCU GCU UUU UGU GUC CUG GCU GCU UUU AUU AUU UUA CGC UGC UUG GCG CUG CUY UUA CGC UGC UUG GGC UUG UUG UGG CUU UGG UUG UUU GUU UAU UAY GCU GCU CUU GUU GUU GUU GCU UGU UGU GCC UAU GGC"
seqs = string.split()
resp = []
for i, codon in enumerate(seqs):
    if codon == comeco:
        # Collect from the first comeco up to and including the first fim
        # (or to the end of the sequence if fim never appears).
        for c in seqs[i:]:
            resp.append(c)
            if c == fim:
                break
        break

print(" ".join(resp))
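
The loop above stops after the first run. If every run is wanted and the input is a single line of space-separated codons, a regular expression is one possible alternative. This is a sketch rather than part of the original answer: `find_runs` and the sample text are hypothetical, and it assumes the start and stop markers are different.

```python
import re


def find_runs(text, start, stop):
    # \b anchors each marker to a whole codon; .*? is non-greedy, so each
    # match ends at the first stop codon that follows its start codon.
    pattern = rf"\b{re.escape(start)}\b.*?\b{re.escape(stop)}\b"
    return re.findall(pattern, text)


# Made-up sample containing two runs.
text = "CGC UAG UGG GAU UUA UAA GCU UAG CCC UAA"
print(find_runs(text, "UAG", "UAA"))
```

Note that `.` does not match newlines by default, so multi-line input would need to be joined first or matched with re.DOTALL.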
