Regex to capture dimensions of a product with unit of measure

Question

Asked 7 years, 1 month ago

Viewed 286 times

4

I have a Python function that captures the dimensions of a product in LxCxA format, but I can't make it work when a unit of measure appears between the values. This is the function:

import re

def findDimensions(text):
    # The decimal separator in the inputs is a comma, so the pattern matches ",".
    p = re.compile(r'(?P<l>\d+(,\d+)?)\s*x\s*(?P<w>\d+(,\d+)?)\s*x\s*(?P<h>\d+(,\d+)?)')
    m = p.search(text)
    if m:
        return m.group("l"), m.group("w"), m.group("h")
    return None

It works for the 2 cases below:

23,6 x 34 x 17,1

14,5 x 55 x 22

But it doesn’t work for this one for example:

14,5cmx55x22cm

I would like it to work when any number of spaces or letters appears after each of the values separated by x. I tried using \w* and \W*, but that doesn't cover every case, such as this one:

14,5 cmx55 cmx22 cm

Example in regex101: https://regex101.com/r/bFywrT/3

I'm open to suggestions for a cleaner expression that handles the examples shown.

First: do you just want to know how the regex works, or would you accept another kind of suggestion (that is, one without regex) for your problem? After all, you didn't say what you need it for. Maybe the answer already given solves it; on the other hand, I think it's unclear whether you need to keep the units of measure next to the numbers.

– Wallace Maxters

2018/06/29 at 20:24
Wallace, I would like to solve this just by adjusting the expression to handle the case I mentioned, even if that means a completely new regex, keeping the last case I mentioned in mind.

– rodrigorf

2018/06/29 at 20:43
The point is not whether it is a new regex. The question is: would you accept a solution without regex?

– Wallace Maxters

2018/06/29 at 20:46
No. Thank you, but I want to solve this only by adjusting the expression.

– rodrigorf

2018/06/29 at 20:51

5 answers

2

Regex

With the regular expression ([\d,]+)[\s\D]* it is possible to capture each value individually.

And with the regular expression ([\d,]+)[\s\D]*([\d,]+)[\s\D]*([\d,]+)[\s\D]* it is possible to obtain the three dimensions at once.

Explanation

The single-value expression can be repeated three times to obtain each dimension in its own capture group.

1st capture group: ([\d,]+)
- [...]: matches any one character listed between the brackets
- \d: matches a digit from 0 to 9
- ,: matches the comma character literally
- +: quantifier that matches one or more times, as many times as possible (greedy)
Followed by [\s\D]*
- [...]: matches any one character listed between the brackets
- \s: matches any whitespace character (equivalent to [ \r\n\t\f\v])
- \D: matches any character that is not a digit (equivalent to [^0-9])
- *: quantifier that matches zero or more times, as many times as possible (greedy)
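
As a hedged sketch of the breakdown above, the single-value pattern can be written in verbose mode with one comment per piece. The name VALUE is only illustrative, and \D* is used in place of [\s\D]* on the assumption that this is acceptable here: whitespace is already a non-digit, so the two match the same characters.

```python
import re

# Illustrative name, not from the answer: one value followed by any non-digits.
VALUE = re.compile(r"""
    ([\d,]+)   # capture group: one or more digits or commas (greedy)
    \D*        # zero or more non-digits: spaces, units, the separator
    """, re.VERBOSE)

print(VALUE.findall("14,5 cmx55 cmx22 cm"))  # ['14,5', '55', '22']
```

With a single capture group, findall returns just the captured numbers, one per match.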

Code Dimensions

Here is an example implementation in Python:

import re

regex_pattern= re.compile(r"([\d,]+)[\s\D]*([\d,]+)[\s\D]*([\d,]+)[\s\D]*")
regex_string="""23,6 x 34 x 17,1
14,5 x 55 x 22
14,5cm x 55 x 22cm
14,5cmx55x22cm
14,5 cmx55 cmx22 cm"""

matches = re.finditer(regex_pattern, regex_string)

for submatch in matches:
    if submatch:
        print("L: " + submatch.group(1) + " C: " + submatch.group(2) + " A: " + submatch.group(3))

Output:

L: 23,6 C: 34 A: 17,1
L: 14,5 C: 55 A: 22
L: 14,5 C: 55 A: 22
L: 14,5 C: 55 A: 22
L: 14,5 C: 55 A: 22
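
For a single input string, the same three-value pattern could be wrapped in a function shaped like the one in the question. This is only a sketch: the names PATTERN and find_dimensions are illustrative, and the values are returned as strings, still with a comma as the decimal separator.

```python
import re

# Same three-value pattern as in the answer, compiled once.
PATTERN = re.compile(r"([\d,]+)[\s\D]*([\d,]+)[\s\D]*([\d,]+)[\s\D]*")

def find_dimensions(text):
    """Return the three values as a tuple of strings, or None if not found."""
    m = PATTERN.search(text)
    return m.groups() if m else None

print(find_dimensions("14,5 cmx55 cmx22 cm"))  # ('14,5', '55', '22')
print(find_dimensions("no numbers here"))      # None
```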

Code Each Value

Or, to capture each value in the string individually:

import re

regex_pattern= re.compile(r"([\d,]+)[\s\D]*")
regex_string="""23,6 x 34 x 17,1
14,5 x 55 x 22
14,5cm x 55 x 22cm
14,5cmx55x22cm
14,5 cmx55 cmx22 cm"""

matches = re.finditer(regex_pattern, regex_string)

for submatch in matches:
    if submatch:
        print(submatch.groups())

Output:

('23,6',)
('34',)
('17,1',)
('14,5',)
('55',)
('22',)
('14,5',)
('55',)
('22',)
('14,5',)
('55',)
('22',)
('14,5',)
('55',)
('22',)

Wow, fantastic solution, Daniel: much leaner and more readable. It will also keep working in future situations where other characters show up in the middle.

– rodrigorf

2018/06/29 at 21:01
I applied it to the case where I have LxA and it worked perfectly; I just removed the last repetition of the expression, as you described. It was not my initial intention, but that is two birds with one stone. ^^

– rodrigorf

2018/06/29 at 21:03
1

To make it easier for other readers, since the question has the Python tag, it would be nice to include the full example: which functions are called, what the output is, and so on.

– jsbueno

2018/06/29 at 21:03
@jsbueno I will add it there for anyone who wants to use it. Thanks for the suggestion.

– rodrigorf

2018/06/29 at 21:04
@rodrigorf the reason I didn't use [\d,]+ as in Daniel's answer is that malformed input, for example 1,00 ,00, would match too, which could cause trouble, so I tried to be stricter when writing the regex.

– Guilherme Nascimento

2018/06/29 at 21:09
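
The concern in the comment above can be illustrated with a small sketch that compares the two ways of matching a single number. The names loose and strict are illustrative, and fullmatch is used here only to test one value in isolation.

```python
import re

loose = re.compile(r"[\d,]+")          # any run of digits and commas
strict = re.compile(r"\d+(?:,\d+)?")   # digits, optionally one comma plus digits

for sample in ["14,5", "1,00,00", ",,"]:
    print(sample, bool(loose.fullmatch(sample)), bool(strict.fullmatch(sample)))
# 14,5 True True
# 1,00,00 True False
# ,, True False
```

The loose class accepts any mix of digits and commas, while the strict form only accepts one optional decimal part.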

by Guilherme Nascimento • **98,651** points · Answer 1 · 2018-06-29T20:50:09+00:00

Simply "simplify" by using (\s+)? to make the spaces optional. A regex does not always have to be simple, but in your case you can simplify it a little, like this:

(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?

Online example on RegExr: https://regexr.com/3rpmr

Explaining the regex

The first part of the regex would be this:

(\d+(,\d+)?)(\s+)?(cm)?

The (,\d+)? optionally matches the digits after the comma
The (\s+)? optionally matches one or more spaces
The (cm)? optionally matches the unit of measure

After that, just put an x between repetitions of the expression. Of course you can do it in other ways, but the result would be almost the same; this version is repetitive but more comprehensive.

If the goal is to match one entry at a time, adding \b at the beginning and end should also solve it, for example:

\b(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?\b

Multiple values

If the input contains multiple values, do it this way:

import re

expressao = r'(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?(\s+)?x(\s+)?(\d+(,\d+)?)(\s+)?(cm)?'

entrada = '''
23,6 x 34 x 17,1
14,5 x 55 x 22
14,5cm x 55 x 22cm
14,5cmx55x22cm
14,5 cmx55 cmx22 cm
'''

resultados = re.finditer(expressao, entrada)

for resultado in resultados:
    valores = resultado.groups()
    print("Primeiro:", valores[0])
    print("Segundo:", valores[6])
    print("Terceiro:", valores[12])
    print("\n")

Note that the groups holding the numbers are six positions apart, one for each number between the x separators. Each match returns something like:

('23,6', ',6', ' ', None, None, ' ', '34', None, ' ', None, None, ' ', '17,1', ',1', '\n', None)
('14,5', ',5', ' ', None, None, ' ', '55', None, ' ', None, None, ' ', '22', None, '\n', None)
('14,5', ',5', None, 'cm', ' ', ' ', '55', None, ' ', None, None, ' ', '22', None, None, 'cm')
('14,5', ',5', None, 'cm', None, None, '55', None, None, None, None, None, '22', None, None, 'cm')
('14,5', ',5', ' ', 'cm', None, None, '55', None, ' ', 'cm', None, None, '22', None, ' ', 'cm')

That is why you only use valores[0], valores[6] and valores[12]. Example on repl.it: https://repl.it/@inphinit/regex-python-Extract
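
If the index arithmetic is unwelcome, one possible variation (a sketch, not the expression from this answer) is to turn the helper groups into non-capturing (?:...) groups and use \s* for the optional spaces, so that groups() holds only the three numbers. The names NUMBER and SEP are illustrative.

```python
import re

# Illustrative building blocks; the names are not from the answer.
NUMBER = r"(\d+(?:,\d+)?)"     # the only capturing group: digits, optional ,digits
SEP = r"\s*(?:cm)?\s*x\s*"     # optional spaces and unit, then the x separator
expressao = re.compile(NUMBER + SEP + NUMBER + SEP + NUMBER + r"\s*(?:cm)?")

for linha in ["23,6 x 34 x 17,1", "14,5cmx55x22cm", "14,5 cmx55 cmx22 cm"]:
    print(expressao.search(linha).groups())
```

Each printed tuple then has exactly three items, for example ('14,5', '55', '22'), so the values can be unpacked directly.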

Using values for mathematical operations

Note that a value written with , is not a valid number for Python, so if you are going to do arithmetic, convert it to float first by replacing the comma with a period, like this:

float('1000,00001'.replace(',', '.'))

It would be something like this:

for resultado in resultados:
    valores = resultado.groups()

    primeiro = float(valores[0].replace(',', '.'))
    segundo = float(valores[6].replace(',', '.'))
    terceiro = float(valores[12].replace(',', '.'))

    print("Primeiro:", primeiro)
    print("Segundo:", segundo)
    print("Terceiro:", terceiro)
    print("Resultado:", primeiro * segundo * terceiro)
    print("\n")

by Sam • **79,597** points · Answer 2 · 2018-06-29T20:21:20+00:00

You can do this without using regex. Just "clean" the string by removing spaces and "cm", then split it into a list on "x":

s = "4,5cmx55x22cm"
s = s.replace('cm', '').replace(' ', '')
parts = s.split('x')
print(parts)  # ['4,5', '55', '22']

Check it out at Ideone

Converting the string into a list gives you the values separated by index, and you can use them as you like. If you want the result in the format Lcm x Acm x Ccm, you can convert the list back into a string by joining it with "cm x ":

s = "4,5cm x55x 22cm "
parts = s.replace('cm', '').replace(' ', '').split('x')
s = 'cm x '.join(parts) + "cm"
print(s)  # prints 4,5cm x 55cm x 22cm

Regex

(?P<l>[\d,]+)(.*?)x(.*?)(?P<w>[\d,]+)(.*?)x(.*?)(?P<h>[\d,]+)(.*?)

The (.*?) lazily matches whatever characters, if any, sit between a number and the x. The [\d,]+ captures digits or commas (inside a character class a | would be a literal pipe rather than "or", so it is left out). Naming the groups lets you pick each value by name.

Code:

import re
s = "4,5cm x55x 22cm "
regex = r"(?P<l>[\d,]+)(.*?)x(.*?)(?P<w>[\d,]+)(.*?)x(.*?)(?P<h>[\d,]+)(.*?)"
resultado = re.match(regex, s)
print(resultado.groupdict()['l'])  # prints 4,5
print(resultado.groupdict()['w'])  # prints 55
print(resultado.groupdict()['h'])  # prints 22

Check it out at Ideone

by jsbueno • **30,668** points · Answer 3 · 2018-06-29T21:01:29+00:00

Regex can indeed extract the three values "in a single line of code", but realize that this is an illusion. You are at the point where (?P<l>\d+(\,\d+)?)\s*x\s*(?P<w>\d+(\,\d+)?)\s*x\s*(?P<h>\d+(\,\d+)?) is too simple and has to become more complicated, and even someone who works with regexes every day has to read it far more carefully than they would read 4 or 5 lines of Python code that separate the values one step per line.

But since you explicitly ask for regex, let's see:

The simplest approach, instead of repeating the regex logic three times, is to use the "findall" method of Python regexes, which can extract all the numbers at once. So we can use:

In [19]: a = ["23,6 x 34 x 17,1", "14,5 x 55x 22", "14,5cmx55x22cm", "23  cmx 12.1cmx 14,36"]
In [20]: [re.findall(r"([\d,.]+)\s*?(?:cm)?", text) for text in a]
Out[20]: 
[['23,6', '34', '17,1'],
 ['14,5', '55', '22'],
 ['14,5', '55', '22'],
 ['23', '12.1', '14,36']]

What makes the "cm" optional is the (?:cm)? part, although this expression does not even need it: it simply extracts every number that uses either "," or "." as the decimal marker.

It is a much simpler expression than your original, and findall retrieves the 3 numbers, if present. An "if" in Python can then ignore the data, or raise an exception, when there aren't 3 numbers.
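
A minimal sketch of that "if", assuming that raising ValueError is the desired behaviour when the count is wrong. The function name extract_three is hypothetical; the pattern is the one used with findall above.

```python
import re

def extract_three(text):
    """Return the three numbers found in text as strings, or raise ValueError."""
    numbers = re.findall(r"([\d,.]+)\s*?(?:cm)?", text)
    if len(numbers) != 3:
        raise ValueError(f"expected 3 dimensions, found {len(numbers)}: {text!r}")
    return numbers

print(extract_three("14,5cmx55x22cm"))  # ['14,5', '55', '22']
```

A caller that prefers to skip bad rows could catch the ValueError, or the function could return None instead of raising.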

Bear in mind that regular expressions are literally a separate language from the language of the program. In this case the expression has become quite simple and reasonable to maintain, although it ignores many corner cases. In plain Python, you could get the same result with:

In [21]: a = ["23,6 x 34 x 17,1", "14,5 x 55x 22", "14,5cmx55x22cm", "23  cmx 12.1cmx 14,36"]


In [22]: [[dimensao.replace("cm", "").strip() for dimensao in dado.split("x")] for dado in a]
Out[22]: 
[['23,6', '34', '17,1'],
 ['14,5', '55', '22'],
 ['14,5', '55', '22'],
 ['23', '12.1', '14,36']]

(As in the regex example, the outer comprehension only iterates over all the example dimensions in "a".) That is, in this case you extract the numbers with a list comprehension and don't even need more than one line of code.

by Lacobus • **13,510** points · Answer 4 · 2018-06-29T21:36:33+00:00

You can use the following expression:

[^0-9,.]

It matches anything other than digits, commas, and periods, so every such character can be replaced:

In a function:

import re

def findDimensions(text):
    s = re.sub('[^0-9,.]', ' ', text ).replace(',', '.').split()
    return tuple([ float(n) for n in s])

Testing:

import re

def findDimensions(text):
    s = re.sub('[^0-9,.]', ' ', text ).replace(',', '.').split()
    return tuple([ float(n) for n in s])

print(findDimensions("14,5 x 55 x 22,0"))
print(findDimensions("14,5 x 55cm x 22"))
print(findDimensions("14,5cm x 55 x 22cm"))
print(findDimensions("14,5cmx55x22cm"))
print(findDimensions("14,5 cmx55 cmx22 cm"))
print(findDimensions("14,5 cm x 55.0 x 22.0 cm"))

Output:

(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)
(14.5, 55.0, 22.0)

See regular expression working on regex101.com.

See the test code running on Ideone.com
