Regex to capture dimensions of a product with unit of measure


4

I have a Python function that captures the dimensions of a product in LxCxA format, but I can't make it work when the unit of measure appears between the values. This is the function:

import re

def findDimensions(text):
    # The decimal separator in the inputs is a comma (e.g. 23,6)
    p = re.compile(r'(?P<l>\d+(\,\d+)?)\s*x\s*(?P<w>\d+(\,\d+)?)\s*x\s*(?P<h>\d+(\,\d+)?)')
    m = p.search(text)
    if m:
        return m.group("l"), m.group("w"), m.group("h")
    return None

It works for the 2 cases below:

23,6 x 34 x 17,1

14,5 x 55 x 22

But it doesn’t work for this one for example:

14,5cmx55x22cm

I would like it to work when any number of spaces or letters appears after each value in the groups separated by x. I tried using \w* and \W*, but that does not cover every case, such as this one:

14,5 cmx55 cmx22 cm

Example in regex101: https://regex101.com/r/bFywrT/3


I'm open to suggestions for a cleaner, more compact expression that handles the examples shown.

  • First: do you just want to know how the regex works, or would you accept another suggestion (that is, one without regex) for your problem? After all, you didn't say what you need this for. Maybe the answer already given solves it; on the other hand, I think it's unclear whether you need to keep the units of measure next to the numbers.

  • Wallace, I would like to solve this just by adjusting the expression to handle the case I mentioned, even if it has to be a completely new regex, given the last case mentioned.

  • The point is not whether it's a new regex. The question is: would you accept a solution without regex?

  • No. Thank you, but I want to solve it only by adjusting the expression.

5 answers

4

You can "simply simplify" by using (\s+)? to make the spaces optional. A regex does not have to be very simple, but in your case you can simplify it a little, like this:

(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?

Online example on RegExr: https://regexr.com/3rpmr


Explaining the regex

The first part of the regex would be this:

(\d+(,\d+)?)(\s+)?(cm)?
  • The (,\d+)? optionally matches the digits after the comma

  • The (\s+)? optionally matches one or more spaces

  • The (cm)? optionally matches the unit of measure

After that, just repeat the expression with an x between each copy. Of course you could do it in other ways, but the result would be almost the same; this version is repetitive but more comprehensive.

If the goal is to match one entry at a time, adding \b at the beginning and end should also be enough, for example:

\b(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?\b

Multiple values

If the input contains multiple values, do it this way:

import re

expressao = r'(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?'

entrada = '''
23,6 x 34 x 17,1
14,5 x 55 x 22
14,5cm x 55 x 22cm
14,5cmx55x22cm
14,5 cmx55 cmx22 cm
'''

resultados = re.finditer(expressao, entrada)

for resultado in resultados:
    valores = resultado.groups()
    print("Primeiro:", valores[0])
    print("Segundo:", valores[6])
    print("Terceiro:", valores[12])
    print("\n")

Note that the numbers sit six groups apart in the regex, one for each value between the x separators. That is, each match returns groups like:

('23,6', ',6', ' ', None, None, ' ', '34', None, ' ', None, None, ' ', '17,1', ',1', '\n', None)
('14,5', ',5', ' ', None, None, ' ', '55', None, ' ', None, None, ' ', '22', None, '\n', None)
('14,5', ',5', None, 'cm', ' ', ' ', '55', None, ' ', None, None, ' ', '22', None, None, 'cm')
('14,5', ',5', None, 'cm', None, None, '55', None, None, None, None, None, '22', None, None, 'cm')
('14,5', ',5', ' ', 'cm', None, None, '55', None, ' ', 'cm', None, None, '22', None, ' ', 'cm')

That is why you only use valores[0], valores[6] and valores[12]. Example on repl.it: https://repl.it/@inphinit/regex-python-Extract
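
As a hedged side note, the 0/6/12 indexing comes from every pair of parentheses being a capturing group. A minimal sketch, assuming you only need the three numbers, turns the helper groups into non-capturing ones with (?:...), so each match has exactly three groups. The names PARTE, EXPRESSAO and extrair are illustrative and not part of the expression above.

```python
import re

# Illustrative names (assumptions, not from the answer above).
# (?:...) groups still match, but they are not captured.
PARTE = r'(\d+(?:,\d+)?)\s*(?:cm)?'
EXPRESSAO = re.compile(PARTE + r'\s*x\s*' + PARTE + r'\s*x\s*' + PARTE)

def extrair(texto):
    """Return one (first, second, third) tuple of strings per match."""
    return [m.groups() for m in EXPRESSAO.finditer(texto)]

print(extrair("14,5 cmx55 cmx22 cm\n23,6 x 34 x 17,1"))
# [('14,5', '55', '22'), ('23,6', '34', '17,1')]
```

With this variant the values are simply groups 1, 2 and 3 of each match.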


Using values for mathematical operations

Note that Python does not treat a string containing , as a number, so if you are going to do arithmetic, replace the comma with a period and convert to float, like this:

float('1000,00001'.replace(',', '.'))

It should be something like this:

for resultado in resultados:
    valores = resultado.groups()

    primeiro = float(valores[0].replace(',', '.'))
    segundo = float(valores[6].replace(',', '.'))
    terceiro = float(valores[12].replace(',', '.'))

    print("Primeiro:", primeiro)
    print("Segundo:", segundo)
    print("Terceiro:", terceiro)
    print("Resultado:", primeiro * segundo * terceiro)
    print("\n")
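
If the unit is not always cm, one option is to swap (cm)? for an optional run of letters. This is a minimal sketch, assuming units consist only of ASCII letters; the lazy [a-z]*? leaves the separator x for the next part of the pattern. The names NUMERO, SEPARADOR, EXPRESSAO_UNIDADES and dimensoes are hypothetical.

```python
import re

# Assumption: a unit is any run of ASCII letters (cm, mm, in, ...).
NUMERO = r'(\d+(?:,\d+)?)'
# Optional spaces, optional unit (lazy), optional spaces, then the x separator.
SEPARADOR = r'\s*[a-z]*?\s*x\s*'
EXPRESSAO_UNIDADES = re.compile(
    NUMERO + SEPARADOR + NUMERO + SEPARADOR + NUMERO,
    re.IGNORECASE,
)

def dimensoes(texto):
    """Return the three dimensions as floats, or None if nothing matches."""
    m = EXPRESSAO_UNIDADES.search(texto)
    if m is None:
        return None
    return tuple(float(v.replace(',', '.')) for v in m.groups())

print(dimensoes("14,5 cmx55 cmx22 cm"))    # (14.5, 55.0, 22.0)
print(dimensoes("10 mm x 20 mm x 30 mm"))  # (10.0, 20.0, 30.0)
```

Because of re.IGNORECASE, this sketch also accepts an uppercase X as the separator.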
  • 1

    The numbers I am going to extract are just these (valores[0], valores[6] and valores[12]). That is what I need. This solves the 'cm' case, but if the unit is different, would I have to replace it with any combination of characters?

  • @rodrigorf I added an example of how to use the values in mathematical operations and added the online examples, both for the regex at https://regexr.com/3rpmr and for the Python script at https://repl.it/@inphinit/regex-python-Extract

3

You can do this without using regex. Just "clean" the string by removing the spaces and "cm", then split it into a list on "x":

texto = "4,5cmx55x22cm"
texto = texto.replace('cm', '').replace(' ', '')  # drop the unit and the spaces
partes = texto.split('x')  # split on the separator
print(partes)  # ['4,5', '55', '22']

Check it out at Ideone

By converting the string into a list you have the values separated by index, and you can use them as you like. If you want the result in the format Lcm x Acm x Ccm, you can join the list back into a string with cm x:

texto = "4,5cm x55x 22cm "
partes = texto.replace('cm', '').replace(' ', '').split('x')
texto = 'cm x '.join(partes) + "cm"
print(texto)  # prints 4,5cm x 55cm x 22cm

Regex

(?P<l>[\d,]+)(.*?)x(.*?)(?P<w>[\d,]+)(.*?)x(.*?)(?P<h>[\d,]+)(.*?)

The (.*?) lazily matches any characters, if present, between a number and the x. The [\d,]+ captures digits or commas. Naming the groups lets you retrieve each value by name.

Code:

import re
texto = "4,5cm x55x 22cm "
regex = r"(?P<l>[\d,]+)(.*?)x(.*?)(?P<w>[\d,]+)(.*?)x(.*?)(?P<h>[\d,]+)(.*?)"
resultado = re.match(regex, texto)
print(resultado.groupdict()['l'])  # prints 4,5
print(resultado.groupdict()['w'])  # prints 55
print(resultado.groupdict()['h'])  # prints 22

Check it out at Ideone

  • 3

    Although the question is explicit about it, I also prefer not to use regex where regex is not needed. :-)

  • 1

    In this case the performance difference is negligible, and compared with the chained replace calls and so on, the regex should even perform better. The bigger problem there is readability, and even maintenance.

  • 1

    @jsbueno I get it. It's actually much more readable. But the answer is just a suggestion. :)

  • Thank you for your comments. I'm using regex because I already have several regular-expression routines that capture information from text. I believe there should not be much difference in performance between one approach and the other, but as I put in the description, I want to adjust the expression to handle the last case (with the cm) via regex.

  • @rodrigorf Cool! I'm trying to put together a regex here.

  • Great, @dvd, I thank you in advance for the effort.

  • @rodrigorf but do you want to get only the numbers?

  • @dvd Exactly. Only the numbers.

  • @rodrigorf I added a somewhat simpler regex.


3

Regex can indeed extract the three values "in a single line of code", but realize that this is an illusion: you are at the point where (?P<l>\d+(\,\d+)?)\s*x\s*(?P<w>\d+(\,\d+)?)\s*x\s*(?P<h>\d+(\,\d+)?) is too simple and has to become more complicated. Even someone who works with regexes every day has to read it far more carefully than someone reading 4 or 5 lines of Python code that separate the values one step per line.

But since you explicitly ask for regex, let's see:

The simplest approach, instead of repeating the regex logic 3 times, is to use the "findall" method of Python's regexes, which can already extract all the numbers. So we can use:

In [19]: a = ["23,6 x 34 x 17,1", "14,5 x 55x 22", "14,5cmx55x22cm", "23  cmx 12.1cmx 14,36"]
In [20]: [re.findall(r"([\d,.]+)\s*?(?:cm)?", text) for text in a]
Out[20]: 
[['23,6', '34', '17,1'],
 ['14,5', '55', '22'],
 ['14,5', '55', '22'],
 ['23', '12.1', '14,36']]

What allows the "cm" to be optional is the (?:cm)? part, although this expression does not even need it: it simply extracts all numbers that use either "," or "." as the decimal marker.

It's a much simpler expression than your original, and with findall it recovers the 3 numbers, if present. An "if" in Python can then ignore the data, or raise an exception, if you don't have the 3 numbers.
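
That "if" could look like the following sketch. It reuses the findall expression from this answer and assumes that anything other than exactly three numbers should raise ValueError; the function name tres_dimensoes is hypothetical.

```python
import re

def tres_dimensoes(text):
    """Return the three numbers found in text as strings.

    Assumption: any count other than three is treated as an error.
    """
    numeros = re.findall(r"([\d,.]+)\s*?(?:cm)?", text)
    if len(numeros) != 3:
        raise ValueError("expected 3 dimensions, found %d" % len(numeros))
    return tuple(numeros)

print(tres_dimensoes("14,5cmx55x22cm"))  # ('14,5', '55', '22')
```

To ignore bad rows instead of failing, the caller can catch ValueError and skip the entry.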

Keep in mind that regular expressions are literally a language apart from the language of the program. In this case the expression has become quite simple and reasonable to maintain, although it ignores many corner cases. In plain Python, you could get the same result with:

In [21]: a = ["23,6 x 34 x 17,1", "14,5 x 55x 22", "14,5cmx55x22cm", "23  cmx 12.1cmx 14,36"]


In [22]: [[dimensao.replace("cm", "").strip()  for dimensao in dado.split("x")]   for dado in a]
Out[22]: 
[['23,6', '34', '17,1'],
 ['14,5', '55', '22'],
 ['14,5', '55', '22'],
 ['23', '12.1', '14,36']]

(As in the regex example, the outer comprehension only iterates over all the example dimensions in "a".) That is, in this case you extract the numbers using a list comprehension, without needing more than one line of code.

3

You can use the following expression:

[^0-9,.]

It matches anything other than digits, commas, and periods, so everything else can be replaced.

In a function:

import re

def findDimensions(text):
    s = re.sub('[^0-9,.]', ' ', text ).replace(',', '.').split()
    return tuple([ float(n) for n in s])

Testing:

import re

def findDimensions(text):
    s = re.sub('[^0-9,.]', ' ', text ).replace(',', '.').split()
    return tuple([ float(n) for n in s])

print(findDimensions("14,5 x 55 x 22,0"))
print(findDimensions("14,5 x 55cm x 22"))
print(findDimensions("14,5cm x 55 x 22cm"))
print(findDimensions("14,5cmx55x22cm"))
print(findDimensions("14,5 cmx55 cmx22 cm"))
print(findDimensions("14,5 cm x 55.0 x 22.0 cm"))

Output:

(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
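
One caveat, offered as a hedged note: because [^0-9,.] keeps every comma and period, an input such as "22 cm." leaves a lone "." behind, and float(".") raises ValueError. A minimal alternative sketch, assuming each number is a run of digits with at most one decimal separator, extracts the numbers directly with re.findall instead of replacing everything else. The name find_dimensions_strict is hypothetical.

```python
import re

def find_dimensions_strict(text):
    """Extract numbers written as digits with an optional , or . decimal part."""
    # (?:...) is non-capturing, so findall returns each whole number.
    tokens = re.findall(r'\d+(?:[.,]\d+)?', text)
    return tuple(float(t.replace(',', '.')) for t in tokens)

print(find_dimensions_strict("14,5 cm x 55.0 x 22 cm."))  # (14.5, 55.0, 22.0)
```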

See regular expression working on regex101.com.

See the test code running on Ideone.com

2


Regex

With the regular expression ([\d,]+)[\s\D]* it is possible to capture each value individually.

And with the regular expression ([\d,]+)[\s\D]*([\d,]+)[\s\D]*([\d,]+)[\s\D]* it is possible to obtain the three dimensions at once.

Explanation

The following regular expression can be repeated three times to obtain the dimensions in each capture group.

  • 1st capture group ([\d,]+)

    • Matches one of the characters listed between []
    • \d: Matches a digit from 0 to 9
    • ,: Matches the comma character literally
    • +: Quantifier that matches one or more times, as many times as possible (greedy).
  • Followed by [\s\D]*
    • Matches one of the characters listed between []
    • \s: Matches any whitespace character (equivalent to [\r\n\t\f\v ])
    • \D: Matches any character that is not a digit (equivalent to [^0-9])
    • *: Quantifier that matches zero or more times, as many times as possible (greedy).

Code Dimensions

Here is an example implementation in Python:

import re

regex_pattern= re.compile(r"([\d,]+)[\s\D]*([\d,]+)[\s\D]*([\d,]+)[\s\D]*")
regex_string="""23,6 x 34 x 17,1
14,5 x 55 x 22
14,5cm x 55 x 22cm
14,5cmx55x22cm
14,5 cmx55 cmx22 cm"""

matches = re.finditer(regex_pattern, regex_string)

for submatch in matches:
    if submatch:
        print("L: " + submatch.group(1) + " C: " + submatch.group(2) + " A: " + submatch.group(3))

Output:

L: 23,6 C: 34 A: 17,1
L: 14,5 C: 55 A: 22
L: 14,5 C: 55 A: 22
L: 14,5 C: 55 A: 22
L: 14,5 C: 55 A: 22

Code Each Value

Or the example that captures each value of the string:

import re

regex_pattern= re.compile(r"([\d,]+)[\s\D]*")
regex_string="""23,6 x 34 x 17,1
14,5 x 55 x 22
14,5cm x 55 x 22cm
14,5cmx55x22cm
14,5 cmx55 cmx22 cm"""

matches = re.finditer(regex_pattern, regex_string)

for submatch in matches:
    if submatch:
        print(submatch.groups())

Output:

('23,6',)
('34',)
('17,1',)
('14,5',)
('55',)
('22',)
('14,5',)
('55',)
('22',)
('14,5',)
('55',)
('22',)
('14,5',)
('55',)
('22',)
  • Wow, fantastic solution, Daniel, much leaner and more readable. And it still works for future situations where other characters may appear in the middle.

  • I applied it to the case where I have LxA here and it worked perfectly; I just removed the last repetition of the expression you described. It was not my initial intention, but it killed two birds with one stone. ^^

  • 1

    To make it easier for other readers, since the question has the Python tag, it would be nice to include the full example: which functions are called, what the output is, etc.

  • @jsbueno I will add it for anyone who wants to use it. Thanks for the suggestion.

  • @rodrigorf the reason I didn't use [\d,]+ as in Daniel's answer is that if the input contains a malformed value, for example 1,00 ,00, it matches too, which could cause trouble, so I tried to be stricter when writing the regex
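
The difference described in that last comment can be checked with a short sketch. It compares the loose class [\d,]+ with a stricter \d+(?:,\d+)? (a non-capturing form of the number pattern from the first answer); the variable names are illustrative.

```python
import re

loose = re.compile(r'[\d,]+')          # any mix of digits and commas
strict = re.compile(r'\d+(?:,\d+)?')   # digits, optionally one ",digits" part

for value in ['1,00', ',00', '1,00,00', ',']:
    print(repr(value), bool(loose.fullmatch(value)), bool(strict.fullmatch(value)))
# '1,00' True True
# ',00' True False
# '1,00,00' True False
# ',' True False
```

The loose class accepts every sample, including a bare comma, while the strict pattern accepts only the well-formed number.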
