Add a string before a given word using Regex

2

I’m reading a file with a C code and turning it into another.

I would like to rename a variable that appears before or after an operator character by prefixing its name with "x", but only if it is one of the variables of interest, which in the following example are a, b and c.

This would be the code before applying the substitution.

for (i=0;i<10;i++)
{
    a=b+c;
}

And this would be the code after applying the substitution.

for (i=0;i<10;i++)
{
    xa=xb+xc;
}

It is worth noting that the variable i is not affected as it is not in the list of variables to be replaced. I am using the following Python code with Regex.

import re

# sample input lines
texto = []
texto.append("int a,b,c;\n")
texto.append("{\n")
texto.append("\ta=b+c;\n")
texto.append("}\n")

# variables of interest
var = []
var.append("a")
var.append("b")
var.append("c")

teste = []  # rewritten lines
for line in texto:
    for vari in var:
        # variable followed by an operator
        line = re.sub(vari+"=","L1_structure."+vari+"=",line.rstrip())
        line = re.sub(vari+">","L1_structure."+vari+">",line.rstrip())
        line = re.sub(vari+"<","L1_structure."+vari+"<",line.rstrip())
        line = re.sub(vari+r"\+","L1_structure."+vari+"+",line.rstrip())
        line = re.sub(vari+r"\-","L1_structure."+vari+"-",line.rstrip())
        line = re.sub(vari+r"\*","L1_structure."+vari+"*",line.rstrip())
        line = re.sub(vari+r"\/","L1_structure."+vari+"/",line.rstrip())
        # variable preceded by an operator
        line = re.sub("="+vari,"L1_structure."+vari,line.rstrip())
        line = re.sub(">"+vari,"L1_structure."+vari,line.rstrip())
        line = re.sub("<"+vari,"L1_structure."+vari,line.rstrip())
        line = re.sub(r"\+"+vari,"L1_structure."+vari,line.rstrip())
        line = re.sub(r"\-"+vari,"L1_structure."+vari,line.rstrip())
        line = re.sub(r"\*"+vari,"L1_structure."+vari,line.rstrip())
        line = re.sub(r"\/"+vari,"L1_structure."+vari,line.rstrip())
    teste.append(line)
print(teste)

3 answers

3


Regex may not be the best solution for your case (for reasons detailed below). Still, if you are only dealing with simple lines like the ones in the question, you can do this:


You can take the list of variable names and build a single regex, using alternation for the options and \b to mark a "word boundary" (that is, match the name a only when it stands alone in the text, not in the middle of a longer name such as amarelo):

import re

var = ['a', 'b', 'c']
# build a regex from the variable names: \b(a|b|c)\b
r = re.compile(r'\b({})\b'.format('|'.join(var)))

In this case, the resulting regex is \b(a|b|c)\b (the names a or b or c, demarcated by \b before and after - that is, it avoids cases where the letters are in the middle of other words, such as "avocado").

For single-letter names you could also use a character class (giving \b[abc]\b), but alternation (with |) is more robust, because it also works for names with more than one letter (for example, \b(nome|idade)\b searches for variables called nome or idade).
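As a quick check of that difference, the sketch below compares the two forms. The helper name build_pattern and the sample strings are made up for illustration, and re.escape is added only as a precaution in case a name ever contains a regex metacharacter.

```python
import re

def build_pattern(names):
    # Hypothetical helper: builds a pattern of the form \b(name1|name2|...)\b
    # re.escape is a precaution; plain C identifiers have no metacharacters
    return re.compile(r'\b({})\b'.format('|'.join(map(re.escape, names))))

# Single-letter names: character class and alternation find the same matches
char_class = re.compile(r'\b[abc]\b')
alternation = build_pattern(['a', 'b', 'c'])
line = 'a=b+c; amarelo=abc;'
class_hits = char_class.findall(line)
alt_hits = alternation.findall(line)
print(class_hits, alt_hits)

# Multi-letter names need alternation; "sobrenome" is not matched
# because "nome" is not at a word boundary there
long_names = build_pattern(['nome', 'idade'])
long_hits = long_names.findall('nome=idade+sobrenome;')
print(long_hits)
```

In both cases the longer words (amarelo, abc, sobrenome) are left alone because of the \b on each side.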

Then just make the replacements:

resultado = []
for line in texto:
    resultado.append(r.sub(r'L1_structure.\1', line))

print(''.join(resultado))

In the regex I put the options in parentheses because that forms a capturing group. The variable name is captured and can be used in the substitution through the backreference \1 (since it is the first pair of parentheses in the regex). The substitution is therefore L1_structure. followed by the name of the variable that was captured.

The result is:

int L1_structure.a,L1_structure.b,L1_structure.c;
{
    L1_structure.a=L1_structure.b+L1_structure.c;
}

You can also use a list comprehension, which is much more succinct and Pythonic:

resultado = [r.sub(r'L1_structure.\1', line) for line in texto]
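The question text asks for an "x" prefix rather than L1_structure., and the same regex seems to cover that by changing only the replacement string. A minimal sketch, assuming the sample lines from the question (the names lines and prefixed are invented here):

```python
import re

var = ['a', 'b', 'c']
r = re.compile(r'\b({})\b'.format('|'.join(var)))

# Sample input taken from the question
lines = ["for (i=0;i<10;i++)\n", "{\n", "    a=b+c;\n", "}\n"]

# "x" followed by the captured variable name
prefixed = [r.sub(r'x\1', line) for line in lines]
print(''.join(prefixed))
```

The variable i is untouched because it is not in the list, and the keyword for contains none of the listed names as a whole word.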

Warning: this will not work for arbitrary C code

As stated at the beginning, the above solution works for simpler cases, such as the ones you put in the question.

But if you are going to work with any valid C code, then it is much more complicated to solve only with regex. For example, I can have this:

int f() {
    int a = 1;
    ...
}

int main(void) {
    int a = f();
    ....
}

Notice that there are two variables named a: one inside main and another inside the function f(). Which of the two should be changed? If you use a regex and process the code line by line, both will be changed, but that is not always what you want. Maybe you only want to change the variable a in a specific scope, and in that case the regex will not work, because you would need to analyze the context in which each variable appears (and doing that with regex is quite complicated).

There is also the case where a can be the name of a function:

int a(int x) {
    ...
    return whatever;
}

int f() {
    int a;
    ...
}

int main() {
    int result = a(2);
    ...
}

How do you avoid confusing the function a with the variable a inside the function f? You could specify in the regex that the name must not be followed by (, but that still will not catch the case where the function a is assigned to a function pointer:

functionPtr = &a;

You could also make the regex require that there is no & before the name, but then it would miss the cases where the address of the variable is assigned to a pointer. As you can see, the more cases come up, the more complicated it gets.
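To make the trade-off concrete, here is a rough sketch of those two extra conditions: a negative lookahead that rejects a name followed by ( and a negative lookbehind that rejects a name preceded by &. The sample line is made up, and the patterns are only meant to illustrate how each added rule fixes one case and breaks another.

```python
import re

names = ['a']
alt = '|'.join(map(re.escape, names))

# Rule 1: skip the name when it is followed by "(" (call or definition)
no_call = re.compile(r'\b({})\b(?!\s*\()'.format(alt))
# Rule 2: additionally skip the name when it is preceded by "&"
no_call_no_addr = re.compile(r'(?<!&)\b({})\b(?!\s*\()'.format(alt))

code = 'result = a(2); functionPtr = &a; p = &a; a = 1;'

# The call a(2) is left alone, but every &a is renamed,
# including the one that refers to the function
rule1 = no_call.sub(r'x\1', code)
print(rule1)

# Now no &a is renamed, not even one that takes the address of the variable
rule2 = no_call_no_addr.sub(r'x\1', code)
print(rule2)
```

Neither pattern can tell whether &a refers to the function or to the variable, because that depends on the surrounding scope and not on the characters next to the name.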


Another case is if you have text in a string:

printf("this is a message");

The regex would have to ignore the a above, because it is inside a string. And doing this with regex is not that simple.

The above message could also be in a comment:

/*
This is a message inside a comment.

int a; <- ignore this one too
*/

Again, both a's above should be ignored, and writing a regex that detects such a comment is very complicated. Note that in this case evaluating line by line is no use: the regex would have to span more than one line and evaluate the text as a whole. And you would still have to combine this regex with the previous ones: the one that checks whether the name is inside a string, the one that checks whether it is a function, and so on.


Perhaps your case (renaming variables) is easier to solve with an IDE (most have a feature that renames variables in one or a few clicks). Or, if you really want to do it programmatically, look for a dedicated C parser.

0

With regex you can capture the match with parentheses and use the OR symbol (the character |) to match more than one word.

line = re.sub('(a|b|c)', "L1_structure." + r'\1', line.rstrip())
  • The problem is that words like amarelo would become L1_structure.amL1_structure.arelo. Besides, although it is not clear from the example, the variables are not known to me in advance, which is why I was using a loop.

0

In general, processing C with regular expressions can be complicated... For simple situations we can try something like

text = ''' i=b+c2; a="uma e a outra"; /* ignorando o a,b, */ '''

relev=['a','b','c2']

text =re.sub(r'"[^"]*"|/\*.*?\*/|\b(\w+)\b',
             lambda n: "x"+n[1] if n[1] in relev else n[0],
             text)

In the end the value of text is

i=xb+xc2; xa="uma e a outra"; /* ignorando o a,b, */

In addition to the identifiers, as a proof of concept, this handles simple strings and single-line comments (full C would need more...).
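Building on the same idea, a slightly fuller sketch might also skip multi-line /* */ comments, // comments, strings containing escaped quotes, and character literals. This is still only an approximation under those assumptions (it knows nothing about scopes, macros, or function names), and the names TOKEN and prefix_identifiers are invented for the example.

```python
import re

# Alternatives are tried in order, so strings and comments are consumed
# whole before the identifier branch gets a chance to look inside them
TOKEN = re.compile(
    r'''
      "(?:\\.|[^"\\])*"      # string literal, allowing escaped characters
    | '(?:\\.|[^'\\])*'      # character literal
    | /\*.*?\*/              # block comment, possibly spanning lines
    | //[^\n]*               # line comment
    | \b([A-Za-z_]\w*)\b     # identifier, captured as group 1
    ''',
    re.VERBOSE | re.DOTALL,
)

def prefix_identifiers(source, names, prefix='x'):
    # Hypothetical helper: prefixes the listed identifiers outside
    # strings, character literals and comments
    wanted = set(names)

    def repl(m):
        ident = m.group(1)  # None when a string or comment matched
        return prefix + ident if ident in wanted else m.group(0)

    return TOKEN.sub(repl, source)

source = '''i=b+c2; // keep a here
a="say \\"a\\""; /* ignore
a,b on two lines */ c2='a';
'''
result = prefix_identifiers(source, ['a', 'b', 'c2'])
print(result)
```

Because the whole text is passed in at once, the block comment that spans two lines is skipped as a single match, which a line-by-line loop could not do.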
