Regex - Picking text up to a given string


Asked 7 years, 8 months ago

Viewed 24,927 times


I would like to capture the text up to the characters a) and, if possible, also separate the answers using regex.

pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta<br /><br />

pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta<br /><br />

  a) resposta a<br />
  b) resposta b<br />
  c) resposta c<br />
  d) resposta d<br />
  e) resposta e<br />

I’m already getting the answers with this simple rule, http://regexr.com/3hb4p, but the question part is proving difficult. Keep in mind that the question can span several paragraphs.

Do you use any programming language? Do you want to capture just the question, or the question and the answers separately?

– rray

2017/11/28 at 19:08
@Rodneymendonça I already changed it so that they stay on different lines.

– Paz

2017/11/28 at 19:08
It can be just the regular expression anyway. I’ll manage from there.

– Rod

2017/11/28 at 19:09
If I can separate the question from the answers, I can already store them in the database separately.

– Rod

2017/11/28 at 19:10
But will there be only one question in that string, or more?

– Guilherme Nascimento

2017/11/28 at 19:15
I use ASP 3, but I only need the regex. I want to get the question and the answers separately. I believe I will need six regexes: one for the question and five for the answers. For the question, I think the rule would have to pick up the text up to the a) that begins the first answer.

– Rod

2017/11/28 at 19:18
with that I could already get the answers, since the process would be the same.

– Rod

2017/11/28 at 19:20
I’m unable to make the question stop at a).

– Rod

2017/11/28 at 19:21
only one question. I put in two paragraphs to represent the line break. @Guilhermenascimento

– Rod

2017/11/28 at 19:23
Try the following: /(.*?a\))/s https://regex101.com/r/okE9nz/3

– Caique Romero

2017/11/28 at 19:37
Great, @Caiqueromero, perfect. Without wanting to abuse your help, how do I get the answers individually?

– Rod

2017/11/28 at 19:46
Questions: 1. Is this the actual formatting of the file (with these blank lines between the paragraphs and the spaces at the beginning of each answer)? 2. Is this pergunta text just a placeholder for a real question? If so, could the real question contain mathematical content?

– Guilherme Lautert

2017/11/29 at 11:48


4 answers

Use this regex:

((.|\n)*?)(a\).*?)\n*?(b\).*?)\n*?(c\).*?)\n*?(d\).*?)\n*?(e\).*?)$|\n*?

It will separate the text into groups, where:

Group 1 - Contains the text before option a).
Group 2 - An auxiliary group that only wraps the alternation inside group 1; its content can be ignored.
Group 3 - Contains option a) up to the line break (where option b) starts in your example).
Group 4 - Contains option b) up to the line break.
Group 5 - Contains option c) up to the line break.
Group 6 - Contains option d) up to the line break.
Group 7 - Contains option e) up to the line break or the end of the text.

You can see how this regex works here

Explanation of the regex

((.|\n)*?) - Captures any character, including line breaks, until the first delimiter is reached.

(a\).*?)\n*? - a\) matches the literal a) and works as the delimiter, so the first capture group stops at the first occurrence of a). After that, the regex captures all the content up to the first line break.

(b\).*?)\n*? - This capture group works like group 3, except that it starts at b).

(c\).*?)\n*? - This capture group works like group 3, except that it starts at c).

(d\).*?)\n*? - This capture group works like group 3, except that it starts at d).

(e\).*?)$|\n*? - This capture group works like group 3, except that it starts at e) and stops at the end of the text or at a line break, in case you use this regex on a file that has many questions and texts.
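To make the group layout above concrete, here is a minimal sketch in Python rather than ASP (chosen only so the example is easy to run). The pattern is the one from this answer; the sample text is hypothetical and assumes each option starts at the beginning of its line, with no <br /> tags or leading spaces.

```python
import re

# Pattern copied from the answer; only the sample text below is made up.
PATTERN = re.compile(
    r"((.|\n)*?)(a\).*?)\n*?(b\).*?)\n*?(c\).*?)\n*?(d\).*?)\n*?(e\).*?)$|\n*?"
)

sample = (
    "pergunta um\n\n"
    "pergunta dois\n\n"
    "a) resposta a\n"
    "b) resposta b\n"
    "c) resposta c\n"
    "d) resposta d\n"
    "e) resposta e"
)

match = PATTERN.search(sample)
question = match.group(1).strip()                # group 1: text before option a)
answers = [match.group(i) for i in range(3, 8)]  # groups 3 to 7: options a) to e)

print(question)
for answer in answers:
    print(answer)
```

Group 2 is skipped on purpose, since it only holds the last character consumed by group 1.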

I think it is almost there. The question is ending up among the answers. I would like it to capture up to a).

– Rod

2017/11/28 at 19:42
For the answers, if there is a line break, it does not capture that break.

– Rod

2017/11/28 at 19:43
perfect guy. I really appreciate it. I’m gonna take this code and see if I can put together what I want.

– Rod

2017/11/28 at 19:48
@Rod updated to work the way you wanted.

– Paz

2017/12/05 at 18:00
Thank you my friend.

– Rod

2017/12/05 at 18:50
Analyzing this code again, it’s the clearest I’ve seen so far for what I want. Can it be adjusted to strip the <br /> tags just from the end of each group?

– Rod

2018/05/31 at 16:16


by Mariano • **1,098** points · Answer 1 · 2017-11-29T10:10:09+00:00

Matching the question and answers with one expression

If you really want to match anything up to a letter followed by a ), or up to the end of the string, you can use this regex:

([\s\S]+?(?=\b[a-z][)]|$))

[\s\S] is a way to match any character, including line breaks.
(Normally we would use the singleline flag to change the behavior of the dot, but it does not exist in ASP.)
(?=…) is a lookahead (it "peeks" ahead), ensuring the match is followed by a letter and a ), or by the end of the string. Its peculiarity is that it tests the pattern without making it part of the matched fragment.
\b is a word boundary.

However, this expression is not very efficient, especially with long texts, and can give a false positive with cases like: "pergunta (veja o segmento b) pergunta".
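The behavior described above, including the false positive, can be reproduced with a short sketch. It uses Python instead of ASP (an assumption made for runnability), the pattern from this answer, and hypothetical sample strings.

```python
import re

# Pattern from the answer: lazily consume anything until "letter + )" or the end.
PATTERN = re.compile(r"([\s\S]+?(?=\b[a-z][)]|$))")

sample = "pergunta um\n\npergunta dois\n\na) resposta a\nb) resposta b\nc) resposta c"
parts = [piece.strip() for piece in PATTERN.findall(sample)]
print(parts)

# False positive: the "b)" inside the question text cuts the question in two.
tricky = "pergunta (veja o segmento b) pergunta\na) resposta a"
tricky_parts = PATTERN.findall(tricky)
print(tricky_parts)
```

In the first case the list holds the question followed by one entry per answer. In the second, the question is cut right before "b)", which is the weakness that motivates adding a line-based condition.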

Bearing in mind that every answer is preceded by (at least) one line break, we can use that as an additional condition. This way, we match whole lines, as long as an inner line does not start with [a-z][)].

Regex:

[^\r\n]+(?:\r?\n(?!\s*[a-z][)])[^\r\n]*)*

Explanation:

[^\r\n]+ A whole line.
(?:\r?\n(?!\s*[a-z][)])[^\r\n]*)* Non-capturing group, repeating this sub-pattern zero or more times:
- \r?\n A line break.
- (?!\s*[a-z][)]) Negative lookahead, ensuring the line break is not followed by a letter and a ) (with optional whitespace between the start of the line and the letter).
- [^\r\n]* The rest of that line (possibly empty).
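Before the ASP version, the same pattern can be tried in a few lines of Python (used here only as a convenient, runnable stand-in). The sample text is hypothetical: it has indented options, an answer that continues on a second line, and a "b)" in the middle of the question.

```python
import re

# Line-based pattern from the answer: a line, plus any following lines
# that do not start (after optional whitespace) with "letter + )".
PATTERN = re.compile(r"[^\r\n]+(?:\r?\n(?!\s*[a-z][)])[^\r\n]*)*")

sample = (
    "pergunta um\n"
    "\n"
    "pergunta (veja o segmento b) dois\n"
    "\n"
    "  a) resposta a\n"
    "  b) resposta b\n"
    "  segunda linha da resposta b\n"
    "  c) resposta c"
)

parts = [piece.strip() for piece in PATTERN.findall(sample)]
question, answers = parts[0], parts[1:]

print(question)
for answer in answers:
    print("---")
    print(answer)
```

Because the lookahead only inspects the start of each line, the "b)" in the middle of the question no longer splits it, and the continuation line stays attached to answer b).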

Since you are using ASP:

Dim texto
texto = "Os Embargos de Terceiros fazem parte do procedimento especial, previsto no Código de Processo Civil," & _
        " sendo possível sua utilização por quem, não sendo parte no processo, sofre constrição ou sofre" & _
        " ameaça de constrição sobre bens que possua ou sobre os quais tenha direito incompatível com o ato constritivo." & _
        " Sobre o ajuizamento dos embargos, assinale a alternativa INCORRETA." & vbNewline & _
        "" & vbNewline & _
        "Considere o segmento “[...] o Estado só percebe o eco enfraquecido.” (2º§). Pode-se afirmar que" & _
        " a partir do recurso de linguagem utilizado pelo enunciador na escolha da palavra “Estado”, identifica-se 7" & vbNewline & _
        "" & vbNewline & _
        "a) o estabelecimento de uma comparação entre “Estado” e “governantes”." & vbNewline & _
        "b) o emprego de uma palavra redundante objetivando reforçar a ideia expressa." & vbNewline & _
        " questões que irão ter várias quebra de linhas" & vbNewline & _
        "" & vbNewline & _
        "" & vbNewline & _
        " questões que irão ter várias quebra de linhas" & vbNewline & _
        "c) uma transferência de percepções resultando em uma fusão de impressões sensoriais." & vbNewline & _
        "d) a evocação de um termo em lugar de uma palavra, com a qual se acha relacionada não sendo sinônimos. "

Set re = New RegExp
re.Global = true  'match all occurrences
re.Pattern = "[^\r\n]+(?:\r?\n(?!\s*[a-z][)])[^\r\n]*)*"

'run the regex against the text
Set matches = re.Execute(texto)
If (matches.Count) Then
    'the first match is the question <- matches(0)
    Response.Write("Pergunta === " & matches(0))

    'the remaining matches are the answers <- matches(m)
    For m = 1 To matches.Count - 1
        Response.Write(vbNewline & "Resposta " & m & " === ")
        Response.Write(matches(m))
    Next
End If

Set matches = Nothing
Set re = Nothing

Result:

Pergunta === Os Embargos de Terceiros fazem parte do procedimento especial, previsto no Código de Processo Civil, sendo possível sua utilização por quem, não sendo parte no processo, sofre constrição ou sofre ameaça de constrição sobre bens que possua ou sobre os quais tenha direito incompatível com o ato constritivo. Sobre o ajuizamento dos embargos, assinale a alternativa INCORRETA.

Considere o segmento “[...] o Estado só percebe o eco enfraquecido.” (2º§). Pode-se afirmar que a partir do recurso de linguagem utilizado pelo enunciador na escolha da palavra “Estado”, identifica-se 7
Resposta 1 === a) o estabelecimento de uma comparação entre “Estado” e “governantes”.
Resposta 2 === b) o emprego de uma palavra redundante objetivando reforçar a ideia expressa.
 questões que irão ter várias quebra de linhas


 questões que irão ter várias quebra de linhas
Resposta 3 === c) uma transferência de percepções resultando em uma fusão de impressões sensoriais.
Resposta 4 === d) a evocação de um termo em lugar de uma palavra, com a qual se acha relacionada não sendo sinônimos.

I put it up on a free host, in case you want to test it:
http://mariano.somee.com/258904/index.asp

by Caique Romero • **7,039** points · Answer 2 · 2017-11-28T19:49:23+00:00

You can use /(.*?a\))/s to get all characters up to and including a).

You can use /(.*?)a\)/s to get all characters before a).

Explanation:

.*? Matches any characters, as few as possible (with the s flag, line breaks included)

a Matches a literal a (case-sensitive)

\) Matches a literal closing parenthesis
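A minimal Python sketch of the two variants, assuming a hypothetical sample text. Python's re.S (DOTALL) is used as the counterpart of the /s flag; the last line shows what happens when the flag is left out.

```python
import re

sample = "pergunta um\n\npergunta dois\n\na) resposta a\nb) resposta b"

# re.S (DOTALL) plays the role of the /s flag: the dot also matches line breaks.
before = re.search(r"(.*?)a\)", sample, re.S).group(1)
including = re.search(r"(.*?a\))", sample, re.S).group(1)

# Without the flag the dot stops at line breaks, so the question is lost:
# the match starts right at "a)" and the group is empty.
without_flag = re.search(r"(.*?)a\)", sample).group(1)

print(repr(before))
print(repr(including))
print(repr(without_flag))
```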

perfect. thank you.

– Rod

2017/11/28 at 19:58
You can also, instead of changing the greediness of the * operator, match "not-a": /([^a]*a)/ to include the first a, and /([^a]*)a/ to exclude the first a

– Jefferson Quesado

2017/11/29 at 10:27
@Mariano, you’re right, I talked nonsense... I hadn’t noticed that it was up to a). When I read the answer (and the title of the question also led me to this mistake), I understood that it was up to the character a. Building this alternative without .*? takes a little more than just the negated character class. I believe (([^a]|a[^)])*)a\) is the alternative that excludes a) from the selection

– Jefferson Quesado

2017/11/29 at 10:41
@Mariano : https://regex101.com/r/rq8IwI/1

– Jefferson Quesado

2017/11/29 at 10:46
@Jeffersonquesado The problem is that the dot does not match line breaks, so .*? must be changed to [\s\S]*? (you can read my answer)... Your regex has a theoretical problem if a) is preceded by another a... But if you want the best-performing one, it is "unrolling the loop": [^a]*(?:a(?!\))[^a]*)*

– Mariano

2017/11/29 at 11:03
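The two patterns from this comment thread can be compared with a small Python sketch (Python is an assumption made for runnability; the sample strings are hypothetical). It shows the unrolled loop stopping right before a), and the edge case where a) is preceded by another a.

```python
import re

# "Unrolled loop" from the comment above: runs of non-"a" characters,
# interleaved with any "a" that is not followed by ")".
UNROLLED = re.compile(r"[^a]*(?:a(?!\))[^a]*)*")

sample = "pergunta um\n\npergunta dois\n\na) resposta a\nb) resposta b"
question = UNROLLED.match(sample).group(0)

# The earlier alternative consumes "aa" as a pair, so it never sees "a)".
EARLIER = re.compile(r"(([^a]|a[^)])*)a\)")
edge_case = "baa) x"
earlier_result = EARLIER.match(edge_case)
unrolled_result = UNROLLED.match(edge_case).group(0)

print(repr(question))
print(earlier_result)
print(repr(unrolled_result))
```

Since [^a] also matches line breaks, the unrolled version needs no singleline flag.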

by Guilherme Lautert • **15,097** points · Answer 3 · 2017-11-29T12:58:55+00:00

If you are using a library with more features, such as PCRE, you could use:

/(?(?=\s+)|(?(?=\w\))(?<a>\w\).*)|(?<q>.*?\n)))/g

See working on REGEX101

Here I am using conditionals and named groups.

Explanation

(?(?=\s+)|...) - This conditional basically says that whitespace should be ignored: when the first condition is met, the branch taken is empty, that is, "do nothing".
(?(?=\w\)) - This is the actual condition that tells questions from answers: if it finds [A-Za-z0-9_] followed by ), it is an answer; if not, it is a question.

Note

I used \w for convenience; I believe the correct choice would be [[:alpha:]], or simply [a-z] if you use the i modifier.

Problems

This regex ends up producing a lot of garbage (empty matches), because the first branch of the conditional captures nothing.