Use of Uniq


I have been researching, without much success, something that I believe should be quite simple, but I have not found the right command.

I have a LOG file with a lot of information. Certain entries repeat, but only in one column, and I want every line that repeats in that column to be deleted, keeping just one. Example:

  6; Mar 21 03:18; 182.69.170.145;  unknown;  <[email protected]>;  Get much more positive aspects out of your work out; HIGH
  3; Mar 21 03:20; 182.69.170.145;  unknown;  <[email protected]>;  Eating healthful is not assisting you lose weight; HIGH
  2; Mar 21 03:18; 182.69.170.145;  unknown;  <[email protected]>;  Asian infused diet program pill makes it was West; MEDIUM
  2; Mar 21 13:50; 201.53.117.127;  unknown;  <[email protected]>;  want to see me?; MEDIUM
  3; Mar 21 12:28; 179.208.77.183;  unknown;  <[email protected]>;  how do you like it here?; HIGH
  3; Mar 21 13:49; 201.53.117.127;  unknown;  <[email protected]>;  Good Evening How are things? I m Yana; HIGH

Note that the e-mail field repeats but the rest of the line does not, so SORT with UNIQ would not solve my problem, since they only sort and eliminate lines that are exactly identical.

Is there a command, or even these two (SORT and UNIQ) with some specific parameter, that does this?

Thanks.

  • I don't quite understand what you want. Can you give an example of the data as you want it to end up?

  • Using the data above, I need only the column containing the email <Alfredo.xxx> to be used by Uniq, eliminating the repeated ones. Note that every row differs except in the email column, so applying Uniq to that column would leave only one row after the filter. If I run Uniq on the file containing the data above, it only deletes lines that are 100% identical, and the lines here never are.

  • Could you specify the desired output for this example?

  • In the output above we have 6 LOG lines; what I want is for the output to have only one LOG line per email, because I don't need 6 lines for the email <Alfredo.xxx@>. If a LOG file has a thousand lines, and among those thousand lines 100 are for the email <Alfredo.xxx@>, then just as UNIQ removes repeats, I want all the repeats in the email column removed, leaving only one. If I run UNIQ without pointing it at the column, it deletes only lines that are exactly identical, and in the 6 lines above only the email account is the same; the rest is not. I want to use the email field as the key.

  • Do you want to group only the Alfredo.xxx emails, or all repeated emails?

  • For the example above, I would like to keep only one complete line per email, with date/time, IP, etc., and have the rest deleted. In the real LOG I will have 100 thousand lines with many repeated emails, while the rest (IP, date/time, subject) does not repeat. From that mass of data I want to eliminate only the lines whose email repeats, keeping one of each.

  • @user54154 try this command: sort -u -t ';' -k5,5 nome-do-arquivo — it worked well here. NOTE: replace nome-do-arquivo with the name of your LOG file.

  • It worked :). Can you explain the parameters, so I know exactly what it does and can learn? Thanks.

  • I will post an answer explaining each part.


2 answers

You can use this command:

sort -u -t ';' -k5,5 nome-do-arquivo
  • -u (unique) outputs only the first line of each group of lines whose sort key compares equal, discarding the duplicates.

  • -t ';' sets the column separator (which in the case of your file is ;).

  • -k5,5 sets the sort key to the column you want to work on (in your case column 5, which is the email, and only the email).

You can read more about the sort command here.
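A quick sketch of the behavior. The sample below mimics the question's format, with made-up example.com addresses standing in for the redacted emails:

```shell
# Build a small log in the same ';'-separated format as the question.
cat > /tmp/sample.log <<'EOF'
6; Mar 21 03:18; 182.69.170.145; unknown; <alfredo@example.com>; subject A; HIGH
3; Mar 21 03:20; 182.69.170.145; unknown; <alfredo@example.com>; subject B; HIGH
2; Mar 21 13:50; 201.53.117.127; unknown; <irene@example.com>; subject C; MEDIUM
EOF

# Keep one line per distinct value of the 5th field (the email).
# Note: the output comes out sorted by that field, and which of the
# duplicate lines survives is not guaranteed.
sort -u -t ';' -k5,5 /tmp/sample.log
```

Here the two alfredo@example.com lines collapse into one, so only two lines remain.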

  • And which parameter eliminates the repeated ones? Because "u" only groups. Thank you very much.

  • It is -u itself that removes the repeats. I just tested it here. Thank you so much for your help!

  • Exactly, @user54154: when it groups them, it eliminates everything but one.


By the way, if it is important to keep the order of the original file, you can use:

awk -F';' '++n[$5] == 1' nome
  • -F';' -- sets the field separator
  • n[$5] -- counts the number of occurrences of each value of field 5 (the email); the array n has string-typed indexes (an associative array)
  • ++n[$5] -- increments the counter for that specific email
  • ++n[$5] == 1 -- true only on the first occurrence of that email (the default action is to print the line)
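A quick sketch of the difference (made-up example.com addresses stand in for the redacted emails): unlike sort -u, this prints the first occurrence of each email and leaves the lines in their original order.

```shell
# Build a small log in the same ';'-separated format as the question.
cat > /tmp/sample.log <<'EOF'
6; Mar 21 03:18; 182.69.170.145; unknown; <alfredo@example.com>; subject A; HIGH
3; Mar 21 03:20; 182.69.170.145; unknown; <alfredo@example.com>; subject B; HIGH
2; Mar 21 13:50; 201.53.117.127; unknown; <irene@example.com>; subject C; MEDIUM
EOF

# Print a line only the first time its 5th field (the email) is seen;
# input order is preserved.
awk -F';' '++n[$5] == 1' /tmp/sample.log
# Prints lines 1 and 3; the "subject B" duplicate is dropped.
```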
  • Very good. Can you explain the parameters too? I believe -F is the delimiter, the == 1 must mean returning only one row, and $5 the 5th column, but I didn't understand the ++n. Thanks.
