Regex - Picking text up to a given string

9

I would like to capture the text up to the characters a) and, if possible, also separate the answers using regex.

pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta<br /><br />

pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta pergunta<br /><br />

  a) resposta a<br />
  b) resposta b<br />
  c) resposta c<br />
  d) resposta d<br />
  e) resposta e<br />

I can already get the answers with this simple rule, http://regexr.com/3hb4p, but the question part is proving difficult. Keep in mind that the question can span several paragraphs.

  • Are you using a programming language? Do you want to capture just the question, or the question and the answers separately?

  • @Rodneymendonça I have already changed it so that they are on separate lines.

  • Just the regular expression is fine. I’ll manage from here.

  • If I can separate the question from the answers, I can already store them in the database separately.

  • But will there be only one question in that string, or more?

  • I use ASP 3 but only need the regex. I want to get the question and the answers separately. I believe I will need 6 regexes: one for the question and five for the answers. I think the question needs some rule that captures the text up to the a) that begins the first answer.

  • With that I could already get the answers, since the process would be the same.

  • I’m unable to make the question stop at a)

  • Only one question. I put in two paragraphs to represent the line break. @Guilhermenascimento

  • Try the following: /(.*?a\))/s https://regex101.com/r/okE9nz/3

  • Great, @Caiqueromero, perfect. Without wanting to push my luck, how would I get the answers individually?

  • Questions: 1. Is this the actual formatting of the file (with these blank lines between the question paragraphs and the spaces at the beginning of each answer)? 2. Is "pergunta" just a placeholder for a real question? Could it then contain mathematical content, for example?


4 answers

12

Matching the question and answers with one expression

If you really want to match anything up to a letter followed by a ), or up to the end of the string, you can use this regex:

([\s\S]+?(?=\b[a-z][)]|$))
  • [\s\S] is a way to match any character, including line breaks.
    Normally we would use the singleline flag to change the behavior of the dot, but it does not exist in ASP.

  • (?=) is a lookahead (it "peeks" ahead), ensuring the match is followed by a letter and a ), or by the end of the string. Its peculiarity is that it tests the pattern without making it part of the matched fragment.

  • \b is a word boundary.

However, this expression is not very efficient, especially with long texts, and can give a false positive with cases like: "pergunta (veja o segmento b) pergunta".
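
For anyone who wants to try this outside ASP, here is a minimal sketch in Python (not the asker's ASP 3). The sample text and the variable names are made up for illustration. It applies the expression above to a short text and then to the false-positive case just mentioned.

```python
import re

# The expression from the answer, without the outer capturing group.
PATTERN = re.compile(r"[\s\S]+?(?=\b[a-z][)]|$)")

# Hypothetical sample text, shaped like the one in the question.
sample = "pergunta um\npergunta dois\n\na) resposta a\nb) resposta b\nc) resposta c"
parts = [m.strip() for m in PATTERN.findall(sample)]
print(parts)

# The false positive: "b)" inside the question text is treated as a delimiter.
tricky = "pergunta (veja o segmento b) pergunta\na) resposta a"
tricky_parts = [m.strip() for m in PATTERN.findall(tricky)]
print(tricky_parts)
```

In the second case the question is cut at "segmento", which is exactly the weakness described above.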



Bearing in mind that every answer is preceded by (at least) one line break, we can use that as an additional condition. This way we match whole lines, as long as a following line does not start with [a-z][)].


Regex:   (regex101)

[^\r\n]+(?:\r?\n(?!\s*[a-z][)])[^\r\n]*)*


Explanation:

Debuggex.com

  • [^\r\n]+ A whole line.
  • (?:\r?\n(?!\s*[a-z][)])[^\r\n]*)* Non-capturing group, to repeat this sub-pattern (0 to infinitely many times):
    • \r?\n A line break.
    • (?!\s*[a-z][)]) Negative lookahead, ensuring the break is not followed by a letter and a ) (with optional whitespace between the start of the line and the letter).
    • [^\r\n]* The rest of that line (possibly empty).
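
As a quick check of the line-based expression, here is a small Python sketch (an assumption on my part, since the thread itself uses ASP). The sample text is hypothetical and deliberately contains "b)" in the middle of a question line, plus an answer that continues on the next line.

```python
import re

# Same pattern as above: a line, plus any following lines that do not
# start (after optional whitespace) with a letter and ")".
LINES = re.compile(r"[^\r\n]+(?:\r?\n(?!\s*[a-z][)])[^\r\n]*)*")

# Hypothetical sample text.
sample = (
    "pergunta (veja o segmento b) pergunta\n"
    "\n"
    "segunda linha da pergunta\n"
    "\n"
    "  a) resposta a\n"
    "  b) resposta b\n"
    "     continua na linha seguinte\n"
    "  c) resposta c"
)

blocks = LINES.findall(sample)
question = blocks[0]                      # first match is the question
answers = [b.strip() for b in blocks[1:]]  # the rest are the answers
print(question)
print(answers)
```

Here the "b)" inside the first line no longer splits the question, because a split can only happen at the start of a line.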


Since you are using ASP:

Dim texto
texto = "Os Embargos de Terceiros fazem parte do procedimento especial, previsto no Código de Processo Civil," & _
        " sendo possível sua utilização por quem, não sendo parte no processo, sofre constrição ou sofre" & _
        " ameaça de constrição sobre bens que possua ou sobre os quais tenha direito incompatível com o ato constritivo." & _
        " Sobre o ajuizamento dos embargos, assinale a alternativa INCORRETA." & vbNewline & _
        "" & vbNewline & _
        "Considere o segmento “[...] o Estado só percebe o eco enfraquecido.” (2º§). Pode-se afirmar que" & _
        " a partir do recurso de linguagem utilizado pelo enunciador na escolha da palavra “Estado”, identifica-se 7" & vbNewline & _
        "" & vbNewline & _
        "a) o estabelecimento de uma comparação entre “Estado” e “governantes”." & vbNewline & _
        "b) o emprego de uma palavra redundante objetivando reforçar a ideia expressa." & vbNewline & _
        " questões que irão ter várias quebra de linhas" & vbNewline & _
        "" & vbNewline & _
        "" & vbNewline & _
        " questões que irão ter várias quebra de linhas" & vbNewline & _
        "c) uma transferência de percepções resultando em uma fusão de impressões sensoriais." & vbNewline & _
        "d) a evocação de um termo em lugar de uma palavra, com a qual se acha relacionada não sendo sinônimos. "
Set re = New RegExp
re.Global = True  'match every occurrence
re.Pattern = "[^\r\n]+(?:\r?\n(?!\s*[a-z][)])[^\r\n]*)*"

'run the regex against the text
Set matches = re.Execute(texto)
If (matches.Count) Then
    'the first match is the question <- matches(0)
    Response.Write("Pergunta === " & matches(0))

    'the remaining matches are the answers <- matches(m)
    For m = 1 To matches.Count - 1
        Response.Write(vbNewline & "Resposta " & m & " === ")
        Response.Write(matches(m))
    Next
End If

Set matches = Nothing
Set re = Nothing


Result:

Pergunta === Os Embargos de Terceiros fazem parte do procedimento especial, previsto no Código de Processo Civil, sendo possível sua utilização por quem, não sendo parte no processo, sofre constrição ou sofre ameaça de constrição sobre bens que possua ou sobre os quais tenha direito incompatível com o ato constritivo. Sobre o ajuizamento dos embargos, assinale a alternativa INCORRETA.

Considere o segmento “[...] o Estado só percebe o eco enfraquecido.” (2º§). Pode-se afirmar que a partir do recurso de linguagem utilizado pelo enunciador na escolha da palavra “Estado”, identifica-se 7
Resposta 1 === a) o estabelecimento de uma comparação entre “Estado” e “governantes”.
Resposta 2 === b) o emprego de uma palavra redundante objetivando reforçar a ideia expressa.
 questões que irão ter várias quebra de linhas


 questões que irão ter várias quebra de linhas
Resposta 3 === c) uma transferência de percepções resultando em uma fusão de impressões sensoriais.
Resposta 4 === d) a evocação de um termo em lugar de uma palavra, com a qual se acha relacionada não sendo sinônimos. 


I put it on a free host in case you want to test it:
http://mariano.somee.com/258904/index.asp

9

You can use /(.*?a\))/s to get all characters up to and including a)

You can use /(.*?)a\)/s to get all characters before a)

Explanation:

.*? Matches any characters, as few as possible (lazy)

a Matches a literal a (case-sensitive)

\) Matches a literal closing parenthesis
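
A minimal Python sketch of the two variants, for illustration only (the thread targets ASP, and the sample text is made up). It assumes that re.DOTALL plays the role of the /s flag, and it also shows what happens when that flag is missing.

```python
import re

# Hypothetical sample text.
sample = "pergunta um\npergunta dois\n\na) resposta a\nb) resposta b"

# re.DOTALL stands in for the /s flag: the dot also matches line breaks.
before = re.search(r"(.*?)a\)", sample, re.DOTALL).group(1)
including = re.search(r"(.*?a\))", sample, re.DOTALL).group(1)

# Without the flag the dot stops at line breaks, so the search only
# succeeds at the start of the "a)" line and captures nothing before it.
no_flag = re.search(r"(.*?)a\)", sample).group(1)

print(repr(before), repr(including), repr(no_flag))
```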

  • perfect. thank you.

  • You can also, instead of changing the greediness of the * operator, ask it to match "not a": /([^a]*a)/ to include the first a, and /([^a]*)a/ to exclude the first a

  • @Mariano, you’re right, I was talking nonsense... I hadn’t noticed that it was up to a). When I read the answer (and the title of the question also led me to this mistake) I understood that it was up to the character a. Building this alternative without .*? takes a little more than just a negated class. I believe (([^a]|a[^)])*)a\) is the alternative that excludes a) from the selection

  • @Mariano : https://regex101.com/r/rq8IwI/1

  • @Jeffersonquesado The problem is that the dot does not match line breaks, so .*? must be changed to [\s\S]*? (see my answer)... Your regex has a theoretical problem if a) is preceded by another a... But if you want the best-performing option, it is "unrolling the loop": [^a]*(?:a(?!\))[^a]*)*
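
To make the last two comments concrete, here is a small Python sketch (hypothetical sample strings, not from the thread) that runs the negated-class alternative and the unrolled loop side by side, including the "a) preceded by another a" edge case.

```python
import re

# Hypothetical sample text.
sample = "pergunta um\npergunta dois\n\na) resposta a\nb) resposta b"

# Unrolled loop: runs of non-"a" characters, interleaved with any "a"
# that is not immediately followed by ")".
UNROLLED = re.compile(r"[^a]*(?:a(?!\))[^a]*)*")
# The negated-class alternative from the earlier comment.
NEGATED = re.compile(r"(([^a]|a[^)])*)a\)")

unrolled_question = UNROLLED.match(sample).group(0)
negated_question = NEGATED.match(sample).group(1)

# The theoretical problem: "a)" preceded by another "a".
edge = "aa) x"
unrolled_edge = UNROLLED.match(edge).group(0)  # stops right before "a)"
negated_edge = NEGATED.match(edge)             # "aa" is consumed as a pair, so no match

print(repr(unrolled_question), repr(unrolled_edge), negated_edge)
```

On the ordinary sample both give the same question text; only the edge case tells them apart.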

5


Use this regex:

((.|\n)*?)(a\).*?)\n*?(b\).*?)\n*?(c\).*?)\n*?(d\).*?)\n*?(e\).*?)$|\n*?

It will separate the text into groups where:

  • Group 1 - Contains the text before option a).
  • Group 2 - Captures nothing useful (only the last character matched); it just wraps the alternation that group 1 repeats.
  • Group 3 - Contains option a) up to the line break (where option b) starts in your example).
  • Group 4 - Contains option b) up to the line break.
  • Group 5 - Contains option c) up to the line break.
  • Group 6 - Contains option d) up to the line break.
  • Group 7 - Contains option e) up to the line break or the end of the text.

You can see how this regex works here

Explanation of the regex

((.|\n)*?) - Captures any character, including line breaks, until the first delimiter is reached.

(a\).*?)\n*? - a\) matches a) and is used as the delimiter, so the first capture group stops at the first occurrence of the sequence a). After that, the regex captures everything up to the first line break.

(b\).*?)\n*? - This capture group works like group 3, except that it starts at b).

(c\).*?)\n*? - This capture group works like group 3, except that it starts at c).

(d\).*?)\n*? - This capture group works like group 3, except that it starts at d).

(e\).*?)$|\n*? - This capture group works like group 3, except that it starts at e) and stops at the end of the text or at a line break, in case you use this regex on a file that has many questions and texts.
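
A minimal Python sketch of reading those groups, under a few assumptions: Python instead of ASP, a made-up sample text, answers that start at the beginning of the line, and exactly five options a) to e).

```python
import re

# The regex from this answer, unchanged.
GROUPS = re.compile(
    r"((.|\n)*?)(a\).*?)\n*?(b\).*?)\n*?(c\).*?)\n*?(d\).*?)\n*?(e\).*?)$|\n*?"
)

# Hypothetical sample text: two question paragraphs and five options.
sample = (
    "pergunta linha 1\n\npergunta linha 2\n\n"
    "a) resposta a\nb) resposta b\nc) resposta c\nd) resposta d\ne) resposta e"
)

m = GROUPS.search(sample)
question = m.group(1).strip()               # group 1: text before a)
answers = [m.group(i) for i in range(3, 8)]  # groups 3..7: options a) to e)
print(question)
print(answers)
```

Group 2 is skipped on purpose, since it only wraps the alternation inside group 1.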

  • I think it’s almost there. The question is ending up in the answers. I would like it to capture up to a).

  • In the answers, if there is a line break, it does not capture that break.

  • Perfect, man. I really appreciate it. I’m going to take this code and see if I can put together what I want.

  • @Rod updated it to work the way you wanted.

  • Thank you, my friend.

  • Looking at this code again, it’s the clearest I’ve seen so far for what I want. Is there a way to adjust it to remove the <br /> tags just from the end of each group?


4

If you are using a library with more features, such as PCRE, you could use:

/(?(?=\s+)|(?(?=\w\))(?<a>\w\).*)|(?<q>.*?\n)))/g

See working on REGEX101

Here I am using "conditionals" and "named groups".

Explanation

  • (?(?=\s+)|...) - This conditional basically says to ignore whitespace: if the condition matches, the "yes" branch is empty, i.e. "do nothing".
  • (?(?=\w\)) - This is the actual condition that tells questions and answers apart: if it finds [A-Za-z0-9_] followed by ), it is an answer; if not, it is a question.

Note

  • I used \w for convenience; I believe the correct choice would be [[:alpha:]], or simply [a-z] if you use the i modifier
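
For readers without PCRE: as far as I know, Python's built-in re module does not accept a lookahead as the condition of a conditional group, so the sketch below swaps in a plain alternation with named groups instead of the conditionals. It is not the regex from this answer. The sample text is made up, and using .+ rather than .* for the question branch is my own choice to avoid empty matches.

```python
import re

# Alternation variant: skip whitespace, else an answer line, else a question line.
TOKENS = re.compile(r"\s+|(?P<a>\w\).*)|(?P<q>.+)")

# Hypothetical sample text.
sample = "pergunta um\n\npergunta dois\n\n  a) resposta a\n  b) resposta b"

questions, answers = [], []
for m in TOKENS.finditer(sample):
    if m.group("a") is not None:
        answers.append(m.group("a"))
    elif m.group("q") is not None:
        questions.append(m.group("q"))
    # whitespace-only matches set neither group and are ignored

print(questions)
print(answers)
```

Checking which named group is set is what separates questions from answers here; whitespace matches are simply dropped.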

Problems

  • This regex ends up generating a lot of garbage, because nothing is captured when the first condition applies.
  • Detail: I think you could use .* instead of .*?\n... And is there any benefit to using conditionals instead of \s+|(?<a>\w\).*)|(?<q>.*) ?

  • @Mariano In fact the second conditional is not necessary, but I would still keep the first one so as not to generate captures with all those spaces, which also makes it easier to remove those entries. (?(?=\s+)|(?:(\w\).*)|(.*?\n)))

  • I understand, although it is just a matter of checking whether any group was captured... Or use (*SKIP)(*FAIL) in PCRE ;-)

  • @Mariano Very interesting, I had not seen these verbs yet. I need to read up a little more :D

  • http://www.rexegg.com/backtracking-controlverbs.html is what explains it best
