Regex to match two string formats at once

I have the following input, which I process line by line.

D      b1308 pspE; thiosulfate sulfurtransferase PspE   K03972 pspE; phage shock protein E
B  09193 Unclassified: signaling and cellular processes-6
C    99977 Transport
D      b2347 yfdC; inner membrane protein YfdC  K21990 yfdC; formate-nitrite transporter family protein
D      b3657 yicJ; putative xyloside transporter YicJ   K03292 TC.GPH; glycoside/pentoside/hexuronide:cation symporter, GPH family
D      b3876 yihO; putative sulfoquinovose transporter  K03292 TC.GPH; glycoside/pentoside/hexuronide:cation symporter, GPH family
D      b0361 insD-1; IS2 element protein    K07497 K07497; putative transposase
D      b1402 insD-2; IS2 insertion element protein InsB K07497 K07497; putative transposase

I am using the following regex on each line to extract the gene name (for example, b2347 yfdC):

[b]\d{4}\s[a-zA-z]{3,4}

But this Regex does not extract the full name in cases like b1402 insD-2.

Is there a single regex that handles both cases?

  • Is the format always 3 or 4 letters, optionally followed by a hyphen and one digit?

  • With or without the hyphen, the format is always 3 or 4 letters, and yes, the suffix is a hyphen and one digit

  • I posted a "generic" answer, but if you are using a specific language/site/tool, you can edit the question and add this information, because each language implements regex in its own way and not everything works the same everywhere

  • I just tested it, it’s perfect, thank you

1 answer


You can use this regex:

b\d{4}\s[a-zA-Z]{3,4}(-\d)?

Detail: the brackets define a character class. For example, [ab] means "the letter a or the letter b" (any of them). But when you only want a single letter, you don’t need the brackets. So, [b] is the same as b, so I removed the brackets around the b.

Another detail is that you were using [a-zA-z] (with a lowercase z at the end of both ranges). This works by coincidence: A-z is the interval between A and z, and if you look at the ASCII table you will see that this range covers all uppercase and lowercase letters. The problem is that the range also contains other characters, such as [, \ and ]. So it’s best to leave them out and use the correct range: [a-zA-Z] (with a capital Z at the end).
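To make the point about the A-z range concrete, here is a small sketch in Python (the question does not say which language is in use, so Python is only an assumption). It lists the ASCII characters that sit between Z and a and checks which of the two character classes accepts them. The variable names are just for illustration.

```python
import re

# ASCII characters located between 'Z' and 'a' (code points 91 to 96).
between = [chr(c) for c in range(ord("Z") + 1, ord("a"))]
print(between)  # ['[', '\\', ']', '^', '_', '`']

loose = re.compile(r"[a-zA-z]")   # range from the question (A-z)
strict = re.compile(r"[a-zA-Z]")  # corrected range (A-Z)

# Which of those non-letter characters does each class accept?
loose_hits = [ch for ch in between if loose.fullmatch(ch)]
strict_hits = [ch for ch in between if strict.fullmatch(ch)]

print(loose_hits)   # every one of them is accepted
print(strict_hits)  # none is accepted
```

With the A-z range, a name such as yf_C would also be accepted, which is probably not what you want.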

Finally, to match both yihO and insD-1, use:

  • [a-zA-Z]{3,4}: to take 3 to 4 letters
  • (-\d)?: a hyphen followed by a digit (\d). Everything is enclosed in parentheses and followed by a ?, which makes the whole group optional

See this regex working on regex101.com.
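As an illustration only, this is how the regex could be applied line by line in Python (again an assumption, since the question does not name a language). The helper name extract_gene is hypothetical, and the sample lines are copied from the question.

```python
import re

# Regex from the answer: b + 4 digits, one whitespace, 3-4 letters, optional "-digit".
GENE = re.compile(r"b\d{4}\s[a-zA-Z]{3,4}(-\d)?")

lines = [
    "D      b2347 yfdC; inner membrane protein YfdC  K21990 yfdC; formate-nitrite transporter family protein",
    "D      b1402 insD-2; IS2 insertion element protein InsB K07497 K07497; putative transposase",
    "C    99977 Transport",
]

def extract_gene(line):
    """Return the first 'bNNNN name' match in the line, or None if there is none."""
    m = GENE.search(line)
    # group(0) is the whole match, including the optional "-digit" suffix.
    return m.group(0) if m else None

genes = [extract_gene(line) for line in lines]
print(genes)  # ['b2347 yfdC', 'b1402 insD-2', None]
```

Note that the sketch uses search and group(0) rather than findall: in Python, findall returns only the contents of the capturing group when the pattern has one, so it would yield "-2" or an empty string instead of the full name. Writing the group as (?:-\d)? would avoid that.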


Just keep in mind that \s usually (in most languages/engines) matches not only the space character but also other whitespace, such as TAB and line breaks (the exact list of characters varies). If you want the regex to accept only a literal space, just switch to:

b\d{4} [a-zA-Z]{3,4}(-\d)?

Notice that there is now a literal space between the \d{4} and the [.
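The difference between the two variants can be checked with a short sketch, here in Python as an assumed language. The two test strings are made up for the comparison: one separates the fields with a space, the other with a TAB.

```python
import re

with_s = re.compile(r"b\d{4}\s[a-zA-Z]{3,4}(-\d)?")     # any whitespace character
with_space = re.compile(r"b\d{4} [a-zA-Z]{3,4}(-\d)?")  # only a literal space

space_sep = "b0361 insD-1"   # separated by a space
tab_sep = "b0361\tinsD-1"    # separated by a TAB

results = {
    "s_space": bool(with_s.search(space_sep)),
    "s_tab": bool(with_s.search(tab_sep)),
    "space_space": bool(with_space.search(space_sep)),
    "space_tab": bool(with_space.search(tab_sep)),
}
print(results)
```

Only the last combination fails: the variant with a literal space rejects the TAB-separated string, while \s accepts both.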
