Regular expressions with grep

Question

Asked 10 years, 3 months ago

I need to extract data from a text and I’m trying to do this using grep. But the way regular expressions are used with this command is quite different from what is usual in Ruby or JavaScript, and I haven’t been able to do what I need. In the following text:

Judicial Notebook of the Regional Labor Court of the 1st Region

ELECTRONIC JOURNAL OF LABOR JUSTICE JUDICIARY

Nº1697/2015

FEDERATIVE REPUBLIC OF BRAZIL

Release date: Wednesday, 01 April 2015.

Regional Labour Court of the 1st Region

I just need to get the number that can be seen on the third line. This number will later be used to make a request to a webservice. I tried with grep as follows:

pdftotext Diario_1697_1_1_4_2015.pdf -f 1 -l 1 - | grep -o /Nº(\d+\/\d+)/

I take the first page of a PDF file, convert it to text, and pipe it to the grep command to extract the information. But that doesn’t work at all. Does anyone know the right way to do this with grep or some other bash command?

1 answer

by hugomg • **8,772** points · Answer 1 · 2015-04-02T22:01:44+00:00

Firstly, grep is a shell command and its arguments are plain strings like any other. Instead of delimiting the regex with /, wrap it in single quotes (or in double quotes, if you are careful about shell variable expansion). Quoting also protects the backslashes: outside quotes the shell strips them, so each one would have to be written as \\, and the bare parentheses would be a shell syntax error.
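To see what grep actually receives under each quoting style, a small sketch along these lines may help. The helper name show_args is made up for this illustration; it simply prints its arguments one per line.

```shell
# Hypothetical helper: print each argument exactly as the command receives it.
show_args() { printf '%s\n' "$@"; }

unquoted=$(show_args \d+)    # the shell removes the backslash before grep would see it
single=$(show_args '\d+')    # single quotes pass the text through verbatim
double=$(show_args "\d+")    # kept as well, because \d is not a special escape in double quotes

printf 'unquoted: %s\nsingle:   %s\ndouble:   %s\n' "$unquoted" "$single" "$double"
```

Only the unquoted form loses its backslash, which is why a quoted pattern needs no extra escaping.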

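Second, grep's default regex syntax (POSIX basic regular expressions) is slightly different and quite limited. For example, it treats a plain + as a literal character rather than a quantifier, although * works as expected, and it does not recognize \d. With GNU grep you can switch to Perl syntax with the -P flag: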

grep -P -o 'Nº\d+/\d+'

or use the POSIX extended syntax with grep -E (egrep is an older name for the same thing), writing the digits as a bracket expression:

grep -E -o 'Nº[[:digit:]]+/[[:digit:]]+'
grep -E -o 'Nº[0-9]+/[0-9]+'
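Note that -o prints the whole match, including the Nº prefix, while the question asks for the number alone. One possible way to drop the prefix is a second grep over the first match. This is only a sketch: the sample variable stands in for the pdftotext output and reuses lines from the question, and the names sample and number are illustrative.

```shell
# Stand-in for the output of `pdftotext ... -` (lines copied from the question).
sample='ELECTRONIC JOURNAL OF LABOR JUSTICE JUDICIARY
Nº1697/2015
FEDERATIVE REPUBLIC OF BRAZIL'

# The first grep isolates the "Nº<digits>/<digits>" token;
# the second keeps only the digits and the slash.
number=$(printf '%s\n' "$sample" \
  | grep -E -o 'Nº[0-9]+/[0-9]+' \
  | grep -E -o '[0-9]+/[0-9]+')

printf '%s\n' "$number"   # prints 1697/2015
```

In the real pipeline, the printf line would be replaced by the pdftotext command from the question, and "$number" could then be passed on to the web service request.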