How to extract the column names from a SELECT query with a regular expression?

2

I need to use a JavaScript regular expression to find out which columns are passed in a SELECT query.

For example: given "SELECT nome, sobrenome FROM table", the expression would return "nome,sobrenome". Is there any way to achieve that?

  • Does it have to be with a regular expression? You can do it without one, since there will always be a SELECT and a FROM around the columns

  • In that case, using substr?

  • Yes, but I’m setting up a regex here

  • Matheus, I saw that you unaccepted the answer. Was something missing?

  • In this case, the explanation was quite complete but none of the alternatives worked for me.

  • Well, the alternative with map and trim gives exactly the same result as the other answer. But all right, I’m glad you found the solution :-)


2 answers

2

It depends. If you guarantee that the queries are always valid, and that everything will be on the same line, the simplest is:

let query = "SELECT nome, sobrenome FROM table";
let m = query.match(/select(.+)from/i);
console.log(m[1]); // nome, sobrenome

Basically, the dot matches any character (except line breaks), and the quantifier + means "one or more occurrences". So the regex requires one or more characters between "select" and "from".

The .+ is in parentheses to form a capture group, so I can retrieve it with m[1] (I use 1 because it is the first capture group, since it is the first pair of parentheses in the regex).

The i flag makes the regex case insensitive (it does not differentiate between upper and lower case letters), so it does not matter if the query has "select", "SELECT", "Select", etc. The same goes for "from".

The code above returns the column names exactly as they appear in the query, so if there are multiple spaces between them, they are returned as well. If you want the output separated by commas only, with no spaces, just break everything apart with split and put it back together with join (or use map together with trim, which removes the spaces):

let query = "SELECT nome,     sobrenome   ,    outrocampo FROM table";
let m = query.match(/select(.+)from/i);
console.log(m[1]); // nome,     sobrenome   ,    outrocampo 

console.log(m[1].split(/\s*,\s*/).join(',')); // nome,sobrenome,outrocampo 
console.log(m[1].split(/,/).map(s => s.trim()).join(',')); // nome,sobrenome,outrocampo 

I use the shortcut \s (which matches spaces, TABs, line breaks, etc.; see the full list in the documentation) with the quantifier * (zero or more occurrences), so the split removes the commas along with any spaces before or after them. Finally, in join, I use only the comma to join the column names.

Note that the second option, with map and trim, also removes spaces at the beginning and end of the string. You could also remove those spaces with the regex /select\s+(.+)\s+from/i.
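To put the pieces above together, here is a minimal sketch that returns the columns as an array instead of a string. The name listColumns is just an illustration (it is not in the original code), and it still assumes a simple single-line query:

```javascript
// Hypothetical helper: returns the column list as an array of trimmed names.
// Assumes a single-line query with one "select ... from" pair.
function listColumns(query) {
  const m = query.match(/select\s+(.+)\s+from/i);
  if (!m) return []; // no "select ... from" found
  return m[1].split(",").map(s => s.trim());
}

console.log(listColumns("SELECT nome,     sobrenome   ,    outrocampo FROM table"));
// [ 'nome', 'sobrenome', 'outrocampo' ]
console.log(listColumns("update t set x = 1")); // []
```

If you need the comma-separated string, just call join(',') on the result.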


This regex is quite naive. It does not validate anything, so the query can be "SELECT FROM" or "abcSELECT *** FROMxyz" and it still captures whatever is between "select" and "from":

let r = /select(.+)from/i;
console.log("select from".match(r)[1]); // prints a single blank space
console.log("abcselect *** fromxyz".match(r)[1]); // ***


If you want to validate that there is at least something between "select" and "from", you can extend the regex. For example, for the simplest case, with one or more column names:

function extraiColunas(s) {
    let r = /\bselect\s+([a-z]+(\s*,\s*[a-z]+)*)\s+from\b/i;
    let m = s.match(r);
    if (m) {
        console.log(m[1]);
    } else {
        console.log('query inválida');
    }
}

extraiColunas("select nome, sobrenome from tabela"); // nome, sobrenome
extraiColunas("abcselect nome, sobrenome fromxyz"); // query inválida
extraiColunas("select **** from table"); // query inválida

Now I use the word boundary \b to ensure that there are no other letters (or digits, or underscores) right before "select" or right after "from". I also use the shortcut \s, sometimes with + (one or more), sometimes with * (zero or more).

For the column names, I used the most naive approach: the character class [a-z]+ (one or more letters from a to z). And since I used the i flag, the regex also accepts the capital letters A to Z.

The part (\s*,\s*[a-z]+)* says that the whole sequence "spaces, comma, spaces, one or more letters" can be repeated zero or more times (indicated by the * after the parentheses). That is, I can have just one name (matched by the first [a-z]+, before these parentheses), or several names separated by commas.
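A quick way to see that repetition at work is to try lists with one name, several names, and a misplaced or missing comma. This is only a sketch; matchColumns is a name made up for the example, and it returns the captured text instead of printing it:

```javascript
// Hypothetical helper around the same pattern:
// returns the captured column list, or null when the query does not match.
function matchColumns(s) {
  const m = s.match(/\bselect\s+([a-z]+(\s*,\s*[a-z]+)*)\s+from\b/i);
  return m ? m[1] : null;
}

console.log(matchColumns("select nome from t"));           // nome
console.log(matchColumns("select a ,b,  c from t"));       // a ,b,  c
console.log(matchColumns("select nome, from t"));          // null (comma with no name after it)
console.log(matchColumns("select nome sobrenome from t")); // null (missing comma)
```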


Of course, it does not end there. If the query has something like "select count(*)" or "select nome as primeiro_nome", the regex would have to be adapted to handle those cases, and likewise for "select total1 + total2" and other valid expressions (note that the examples above do not cover "select * from" either, and the regex does not even check whether there is anything after "from").

Then you have to decide whether to further complicate the regex to consider all cases, or whether to use the simplest expression (.+) at the risk of accepting invalid queries.
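To make those limitations concrete, the sketch below runs the stricter pattern against a few queries that are valid SQL (plus one that is not). The names strict and accepts exist only for this illustration:

```javascript
// Same pattern used in extraiColunas, reused here only to probe its limits.
const strict = /\bselect\s+([a-z]+(\s*,\s*[a-z]+)*)\s+from\b/i;

// Hypothetical helper: true when the pattern matches the query.
function accepts(query) {
  return strict.test(query);
}

console.log(accepts("select nome, sobrenome from tabela"));       // true
console.log(accepts("select * from tabela"));                     // false
console.log(accepts("select count(*) from tabela"));              // false
console.log(accepts("select nome as primeiro_nome from tabela")); // false
console.log(accepts("select total1 + total2 from tabela"));       // false
console.log(accepts("select nome from"));                         // true (nothing after "from" is checked)
```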


Another problem with the dot is that it can match too much, for example:

let query = "SELECT nome, sobrenome FROM table_from where x > 1";
let m = query.match(/select(.+)from/i);
console.log(m[1]); // nome, sobrenome FROM table_

Notice that it captured "nome, sobrenome FROM table_". This is because the quantifiers + and * are greedy by default and try to match as many characters as possible. In this case, the dot runs up to the last occurrence of "from" it finds. If there were a subquery later on, for example, the regex would go all the way to the subquery's "from".
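For comparison, a lazy quantifier (.+?) stops at the first "from" instead of the last. This is an alternative that the code above does not use, shown here only as a sketch, and it has its own trap: a column whose name starts with "from" ends the match too early.

```javascript
const sample = "SELECT nome, sobrenome FROM table_from where x > 1";

// Greedy: runs up to the LAST "from" in the string.
const greedy = sample.match(/select(.+)from/i)[1];
// Lazy: stops at the FIRST "from" it can reach.
const lazy = sample.match(/select(.+?)from/i)[1];

console.log(JSON.stringify(greedy)); // " nome, sobrenome FROM table_"
console.log(JSON.stringify(lazy));   // " nome, sobrenome "

// Caveat: the lazy version stops at the "from" inside "from_date".
const trap = "select from_date from t".match(/select(.+?)from/i)[1];
console.log(JSON.stringify(trap));   // " "
```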

For simpler queries like this one, I could solve it with \s, since after "select" and before "from" there must be at least one space:

let query = "SELECT nome, sobrenome FROM table_from where x > 1";
let m = query.match(/select\s+(.+)\s+from/i);
console.log(m[1]); // nome, sobrenome

But still, depending on the queries you will evaluate, there may be other problems. There is no way around it: if you make the regex simpler, the chance of false positives increases, and if you want to handle more cases and reduce the chance of accepting invalid queries, the complexity of the regex increases. It is up to you to decide how far it is worth complicating the regex.


Or just use a SQL parser. Regex is not always the best solution.


PS: the other answer uses lookbehind and lookahead, which also works but is a little more costly. Compare here and here the number of steps executed (for this small query, the difference is almost twice as many steps). Of course, for a few small strings the difference will be milliseconds or even less, but it is important to know the implications of using one thing or the other.

The other answer also uses .*, but since * means "zero or more occurrences", it even accepts the string "selectfrom". As I said before, the simpler the regex, the greater the chance of false positives (not that using .+ is that much better than .*, as explained above).
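A small sketch of that difference (the variable names are only for illustration):

```javascript
const withStar = /select(.*)from/i;
const withPlus = /select(.+)from/i;

// .* accepts zero characters, so "selectfrom" matches with an empty group.
const starMatch = "selectfrom".match(withStar);
console.log(JSON.stringify(starMatch[1])); // ""

// .+ needs at least one character, so "selectfrom" does not match...
const plusMatch = "selectfrom".match(withPlus);
console.log(plusMatch); // null

// ...but a single space is enough, so it is only slightly stricter.
const plusSpace = "select from".match(withPlus);
console.log(JSON.stringify(plusSpace[1])); // " "
```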

1


You could do it with substring, but since you want a regex, I put one together here to get the values between SELECT and FROM.

const texto = "SELECT nome, sobrenome FROM table";

const colunas = texto.match(/(?<=SELECT)(.*)(?=FROM)/i)[0].replace(/\s/g, "").split(",").join();

console.log(colunas);

Explaining the regex

?<= - Positive lookbehind: checks that the text right before the current position matches, without including it in the result. In this case, (?<=SELECT) makes the match start right after SELECT, at nome, sobrenome...

(.*) - Matches any characters (zero or more of them) and captures them

?= - Positive lookahead: the same idea as the lookbehind but in the opposite direction. (?=FROM) requires FROM to come right after the match, without including it, so the match takes everything before FROM.

/i - Makes the regex case insensitive, that is, it does not matter whether the letters are upper or lower case
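To see what the lookarounds change in practice, here is a sketch comparing them with a plain capture group (the variable names are made up for the comparison). With lookbehind and lookahead the wanted text is the whole match, index 0; with a capture group it is index 1:

```javascript
const entrada = "SELECT nome, sobrenome FROM table";

// Lookarounds: SELECT and FROM are checked but not consumed,
// so the whole match (index 0) is already the text between them.
const viaLookaround = entrada.match(/(?<=SELECT)(.*)(?=FROM)/i)[0];

// Capture group: SELECT and FROM are part of the match,
// so the text between them is in group 1.
const viaGroup = entrada.match(/select(.*)from/i)[1];

console.log(JSON.stringify(viaLookaround)); // " nome, sobrenome "
console.log(viaLookaround === viaGroup);    // true

// Removing all whitespace gives the compact list.
console.log(viaGroup.replace(/\s/g, "")); // nome,sobrenome
```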

  • 1

    Refactoring: As you are already using the flag /i, it becomes unnecessary to use SELECT|select.

  • Good point @Valdeirpsr, I'm still writing the explanation of the regex, I'll fix it afterwards
