Identify string between two known snippets in a string


Asked 7 years, 8 months ago


I would like help with the following.

Given this

CGC UUC GCU UUG GAA AAU UUG UGU GUU UUU UGU GGC UGC UCG CUG CUC AAA UUG UUC GCU GCU UUU UGU GUC CUG GCU GCU UUU AUU AUU UUA CGC UGC UUG GCG CUG CUY UUA CGC UGC UUG GGC UUG UUG UGG CUU UGG UUG UUU GUU UAU UAY GCU GCU CUU GUU GUU GUU GCU UGU UGU GCC UAU GGC

I have to write a program that reads this sequence and, when it finds a UAG, keeps all the letters until it finds a UAA.

For example, UAG UGG GAU UUA UAA.

How do I do this?

Welcome to the site. I invite you to take the [tour] to learn the basics of how the site works and to read the [Ask] guide. Could you improve your question by describing in more detail what difficulty you ran into? Have you tried anything yet? Did you get an error? Which one? Please use the [Edit] button to add this information.

– Woss

2017/10/31 at 20:52
This example string contains no occurrence of `UAG`.

– Bruno Camargo

2017/10/31 at 21:11
Your example data does not contain the sequence UAG UGG GAU UUA UAA, right?

– Lacobus

2017/10/31 at 21:18
What if there are multiple sequences? Or should it stop after the first one?

– Miguel

2017/10/31 at 21:21
No, my example does not have the sequence.

– 2583estbarreiro

2017/11/01 at 12:09

2 answers


by Lacobus • **13,510** points · Answer 1 · 2017-10-31T21:24:41+00:00

You can build a finite state machine with only 2 states to solve your problem:

def pesquisar(seq, inicio, fim):
    estado = 0  # 0: looking for inicio; 1: collecting until fim
    ret = []    # completed runs
    aux = []    # run currently being collected

    for x in seq:
        if estado == 0:
            if x == inicio:
                aux = [x]
                estado = 1
        elif estado == 1:
            aux.append(x)
            if x == fim:
                ret.append(aux)
                estado = 0

    return ret


sequencia = ['CGC','UUC','GCU','UUG','GAA','AAU','UUG','UGU','GUU','UUU','UGU','GGC','UGC','UCG','CUG','CUC','AAA','UUG','UUC','GCU','GCU','UUU','UGU','GUC','CUG','GCU','GCU','UUU','AUU','AUU','UUA','CGC','UGC','UUG','GCG','CUG','CUY','UUA','CGC','UGC','UUG','GGC','UUG','UUG','UGG','CUU','UGG','UUG','UUU','GUU','UAU','UAY','GCU','GCU','CUU','GUU','GUU','GUU','GCU','UGU','UGU','GCC','UAU','GGC']

print(pesquisar( sequencia, inicio = 'UGU', fim = 'UGC' ))

Output:

[['UGU', 'GUU', 'UUU', 'UGU', 'GGC', 'UGC'],
 ['UGU', 'GUC', 'CUG', 'GCU', 'GCU', 'UUU', 'AUU', 'AUU', 'UUA', 'CGC', 'UGC']]

EDIT:

State machines can also be built in Python using yield. Here is an alternative, more compact way of solving the problem:

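def pesquisar(seq, inicio, fim):
    ret = []
    for i in seq:
        # Start collecting at inicio, then keep collecting while a run is open
        if i == inicio or ret:
            ret.append(i)
        # Close the run at fim and hand it to the caller
        if i == fim and ret:
            yield ret
            ret = []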

sequencia = ['CGC','UUC','GCU','UUG','GAA','AAU','UUG','UGU','GUU','UUU','UGU','GGC','UGC','UCG','CUG','CUC','AAA','UUG','UUC','GCU','GCU','UUU','UGU','GUC','CUG','GCU','GCU','UUU','AUU','AUU','UUA','CGC','UGC','UUG','GCG','CUG','CUY','UUA','CGC','UGC','UUG','GGC','UUG','UUG','UGG','CUU','UGG','UUG','UUU','GUU','UAU','UAY','GCU','GCU','CUU','GUU','GUU','GUU','GCU','UGU','UGU','GCC','UAU','GGC']

print(list(pesquisar( sequencia, inicio = 'UGU', fim = 'UGC')))

Output:

[['UGU', 'GUU', 'UUU', 'UGU', 'GGC', 'UGC'],
 ['UGU', 'GUC', 'CUG', 'GCU', 'GCU', 'UUU', 'AUU', 'AUU', 'UUA', 'CGC', 'UGC']]
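
Because the sample data contains no UAG, the code above searches for UGU and UGC instead. As a minimal sketch of how the same generator idea might be applied to the markers from the question, the following splits a space-separated string into codons and collects the run from UAG through UAA. The function name `find_between` and the short test string are illustrative assumptions. The string is built around the example run given in the question and is not the asker's data.

```python
def find_between(codons, start, end):
    """Yield each run of codons from `start` through `end`, inclusive."""
    current = []
    for codon in codons:
        # Open a run at `start`, or extend a run that is already open
        if codon == start or current:
            current.append(codon)
        # Close the open run when `end` is reached
        if codon == end and current:
            yield current
            current = []


# Hypothetical input: the question's example run with one codon on each side
text = "CGC UAG UGG GAU UUA UAA GGC"
matches = list(find_between(text.split(), start="UAG", end="UAA"))
print(matches)
# [['UAG', 'UGG', 'GAU', 'UUA', 'UAA']]
```

A run that is opened but never closed is silently dropped, which matches the behaviour of the generator above.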

by Bruno Camargo • **519** points · Answer 2 · 2017-10-31T21:22:18+00:00

I think this might help you.

comeco = "CGC"
fim = "AAA"
string = "CGC UUC GCU UUG GAA AAU UUG UGU GUU UUU UGU GGC UGC UCG CUG CUC AAA UUG UUC GCU GCU UUU UGU GUC CUG GCU GCU UUU AUU AUU UUA CGC UGC UUG GCG CUG CUY UUA CGC UGC UUG GGC UUG UUG UGG CUU UGG UUG UUU GUU UAU UAY GCU GCU CUU GUU GUU GUU GCU UGU UGU GCC UAU GGC"
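seqs = string.split(" ")  # one codon per list item
resp = ""
for i in range(len(seqs)):
    if seqs[i] == comeco:
        resp += seqs[i]
        # Append codons until fim is reached or the input runs out
        while seqs[i] != fim:
            i += 1
            if i == len(seqs):
                break
            resp += " " + seqs[i]
        break  # only the first occurrence of comeco is processed

print(resp)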
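
Note that the loop above stops at the first occurrence of the start codon and, if the end codon never appears, returns everything up to the end of the input. A possible variant, sketched here with a hypothetical helper `first_between`, uses `list.index` to locate both markers and returns `None` when either one is missing. The sample strings are made up for illustration.

```python
def first_between(text, start, end):
    """Return the first run from `start` through `end` as a string, or None."""
    codons = text.split()
    try:
        i = codons.index(start)       # first occurrence of the start codon
        j = codons.index(end, i + 1)  # first end codon after the start
    except ValueError:
        # list.index raises ValueError when the codon is not found
        return None
    return " ".join(codons[i:j + 1])


print(first_between("CGC UUC GCU AAA UUG", "CGC", "AAA"))  # CGC UUC GCU AAA
print(first_between("CGC UUC GCU", "CGC", "AAA"))          # None
```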