Find content between 2 strings inside a giant string with Regexp

Viewed 60 times

3

I need to create a function that takes a frame ("Quadro") number and extracts that frame's contents from a .txt file with more than 3000 lines.

Every table has the following layout: it starts with "Quadro X" (frame X) and ends with "Fonte: (some source)":

Quadro 30–Tabela de tipo de demandante
Tabela de Tipo do Demandante
Código
Descrição da categoria
1Operadora
2Prestador de serviço
3Consumidor
4Gestor
5ANS
Fonte: Elaborado pelos autores.

That’s what I’ve been able to do so far:

const getBoardContent = (board) => {
  fs.readFile("return.txt", "utf-8", (err, data) => {
    if (err) console.log(err);
    
    const text = String(data)
    
    const boardContent = text.match(/quadro \d+([\w\s]*)fonte:.*/gim);
    console.log(boardContent)
  });
};

The problem is that it always returns null. However, if I match on /Quadro \d+/gim it finds all the frames, and if I match on /Fonte: /gim it finds all the sources.

1 answer

2

The problem is that after "Quadro" and the number there is an en dash (–). But your regex uses [\w\s], where \w is a shortcut for an alphanumeric character (a letter, digit, or _) and \s matches spaces and line breaks. Neither of them matches the –, so the regex cannot find a match.
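To see the mismatch concretely, here is a minimal sketch. The `sample` string is a shortened, made-up stand-in for the real file, and the check on "ó" is an extra observation not taken from the answer: in JavaScript, \w covers only ASCII letters, digits, and _, so accented letters would stop [\w\s]* as well.

```javascript
// Shortened sample in the same shape as the question's frame (illustrative only).
const sample = [
  "Quadro 30–Tabela de tipo de demandante",
  "1Operadora",
  "Fonte: Elaborado pelos autores.",
].join("\n");

// The en dash is neither a word character nor whitespace.
const dashIsWord = /\w/.test("–");
const dashIsSpace = /\s/.test("–");

// Accented letters are not matched by \w either.
const accentIsWord = /\w/.test("ó");

// The original pattern therefore cannot bridge "Quadro 30" and "Fonte:".
const original = sample.match(/quadro \d+([\w\s]*)fonte:.*/gim);

// Allowing any character between the two markers does match.
const relaxed = sample.match(/quadro \d+[\s\S]*?fonte:.*/gim);

console.log(dashIsWord, dashIsSpace, accentIsWord); // false false false
console.log(original); // null
console.log(relaxed.length); // 1
```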

If the idea is to take "anything", including line breaks, an alternative is:

fs.readFile("return.txt", "utf-8", (err, data) => {
    if (err) { console.log(err); return; } // stop here: data is undefined on error
    const boardContent = data.match(/^quadro \d+[\s\S]*?^fonte:.*$/gim);
    console.log(boardContent);
});

Since I used the m flag, the anchors ^ and $, which normally match only the beginning and end of the string, also match the beginning and end of each line. I did this to ensure that I only match lines that start with "Quadro" and "Fonte".
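A small sketch of the difference the m flag makes, using a made-up two-line string (an assumption for illustration, not the real file):

```javascript
// Made-up text where "Fonte:" appears only at the start of the second line.
const text = "Quadro 1–Exemplo\nFonte: autores";

// Without m, ^ matches only at the very start of the string.
const withoutM = /^fonte:/i.test(text);

// With m, ^ also matches right after each line break.
const withM = /^fonte:/im.test(text);

console.log(withoutM); // false
console.log(withM); // true
```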

Between them I use [\s\S], which combines \s (spaces and line breaks) and \S (everything that is not \s). That is, it matches any character. The lazy quantifier *? ensures that as few characters as possible are matched, so the match stops at the first line that starts with "Fonte".
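The effect of the lazy quantifier is easiest to see with two frames in a row. A minimal sketch, with made-up data:

```javascript
// Two made-up frames in a row (illustrative data, not the real file).
const twoFrames = [
  "Quadro 1–Primeiro",
  "linha A",
  "Fonte: autores",
  "Quadro 2–Segundo",
  "linha B",
  "Fonte: autores",
].join("\n");

// Greedy: [\s\S]* runs to the end and backtracks to the LAST "Fonte:" line,
// so both frames are swallowed into a single match.
const greedy = twoFrames.match(/^quadro \d+[\s\S]*^fonte:.*$/gim);

// Lazy: [\s\S]*? stops at the FIRST "Fonte:" line, giving one match per frame.
const lazy = twoFrames.match(/^quadro \d+[\s\S]*?^fonte:.*$/gim);

console.log(greedy.length); // 1
console.log(lazy.length); // 2
```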


But you said you will receive the frame number and want to extract only that frame's contents. In that case you can capture the number with the regex and add the match to the results only if it is the number you want. For example, if I only want frame 30:

fs.readFile("return.txt", "utf-8", (err, data) => {
    if (err) { console.log(err); return; } // stop here: data is undefined on error
    const boardContent = [];
    for (const match of data.matchAll(/^quadro (\d+)[\s\S]*?^fonte:.*$/gim)) {
        const numeroQuadro = parseInt(match[1], 10);
        if (numeroQuadro === 30) { // I only want frame 30 (put whatever condition you want here)
            boardContent.push(match[0]);
        }
    }
    console.log(boardContent);
});

Now the \d+ is in parentheses to form a capture group, so I can get its contents with match[1]. If it is the number I want, I add the frame to the results using match[0], which contains the entire string matched by the regex.
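Since the question asks for a function that takes the frame number, the loop could be wrapped roughly like this. It is only a sketch: `extractFrames` is a hypothetical name, and it takes the already-loaded text as a parameter so it can be tried without the real file.

```javascript
// Hypothetical helper: returns every frame in `text` whose number equals `wanted`.
function extractFrames(text, wanted) {
  const found = [];
  for (const match of text.matchAll(/^quadro (\d+)[\s\S]*?^fonte:.*$/gim)) {
    // match[1] is the captured number, match[0] is the whole frame.
    if (parseInt(match[1], 10) === wanted) {
      found.push(match[0]);
    }
  }
  return found;
}

// Made-up input with two frames (illustrative only).
const demoText = [
  "Quadro 29–Outro",
  "x",
  "Fonte: autores",
  "Quadro 30–Tabela",
  "1Operadora",
  "Fonte: Elaborado pelos autores.",
].join("\n");

const result = extractFrames(demoText, 30);
console.log(result);
```

With the real file, the same function could be called inside the fs.readFile callback as extractFrames(data, 30).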


But of course you can also do it without regex. Since you implied that the file is large, it might be better to read it one line at a time instead of loading it all into memory at once:

  • if the line starts with "Quadro [frame number]", you start a record
  • you keep concatenating lines until you find one that starts with "Fonte:"

Sort of like this:

const fs = require('fs');
const readline = require('readline');
const lineReader = readline.createInterface({
  input: fs.createReadStream('return.txt', { encoding: 'utf-8' })
});

const contents = [];
let current = '';
const numeroQuadro = 30;
lineReader.on('line', function (line) {
  if (line.startsWith(`Quadro ${numeroQuadro}–`)) {
    current = line; // the frame's content has started
  } else if (current) { // we are in the middle of the frame's content
    if (line.startsWith('Fonte:')) { // check whether it has ended
      // if so, add it to the results array and reset the content
      contents.push(`${current}\n${line}`);
      current = '';
    } else { // otherwise, just append to the current content
      current += `\n${line}`;
    }
  }
});

// after everything has been read, print the content found
lineReader.on('close', function () {
    console.log('Frame found: ', contents);
});
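The same line-by-line logic can be tried without a file by feeding it an array of lines. A minimal sketch, where `collectFrame` is a hypothetical name and the input lines are made up:

```javascript
// Hypothetical helper mirroring the 'line' handler, but over any iterable of lines.
function collectFrame(lines, numeroQuadro) {
  const contents = [];
  let current = "";
  for (const line of lines) {
    if (line.startsWith(`Quadro ${numeroQuadro}–`)) {
      current = line; // a new frame begins
    } else if (current) {
      if (line.startsWith("Fonte:")) {
        contents.push(`${current}\n${line}`); // frame ended: store and reset
        current = "";
      } else {
        current += `\n${line}`; // still inside the frame
      }
    }
  }
  return contents;
}

// Made-up lines (illustrative only).
const lines = [
  "Quadro 3–Outro",
  "a",
  "Fonte: autores",
  "Quadro 30–Tabela",
  "1Operadora",
  "Fonte: Elaborado pelos autores.",
];
const frames = collectFrame(lines, 30);
console.log(frames);
```

Including the en dash in the prefix keeps frame 3 from being confused with frame 30, since "Quadro 3–" is not a prefix of "Quadro 30–".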
