Format columns - select specific information

Dear users, I have a large file with the following columns:

chr10_46938     EXON=28/28      STRAND=-1       ENSP=ENSGALP00000004070 SIFT=tolerated(0.38) 
chr10_46966     EXON=28/28      STRAND=-1       DOMAINS=Low_complexity_(Seg):Seg        SIFT=tolerated(0.66)    ENSP=ENSGALP00000004070   
chr10_46987     EXON=28/28      STRAND=-1       SIFT=tolerated(0.93)    ENSP=ENSGALP00000004070
chr10_47071     ENSP=ENSGALP00000004070 SIFT=tolerated(0.97)    EXON=28/28      STRAND=-1
chr10_47164     EXON=28/28      STRAND=-1       DOMAINS=Low_complexity_(Seg):Seg        SIFT=tolerated(0.37)    ENSP=ENSGALP00000004070
chr10_47466     ENSP=ENSGALP00000004070 SIFT=tolerated(0.11)    STRAND=-1       EXON=28/28    DOMAINS=PROSITE_profiles:PS50196,Pfam_domain:SSF50729

I want to select only the first column and the SIFT=tolerated(..) entry, but that entry does not sit in a fixed column (it is not always column 2, for example). How can I select only this information? For example, I would like the following output:

chr10_46938     SIFT=tolerated(0.38)  
chr10_46966     SIFT=tolerated(0.66)   
chr10_46987     SIFT=tolerated(0.93)  
chr10_47071     SIFT=tolerated(0.97)  
chr10_47094     SIFT=tolerated(1)            
chr10_47164     SIFT=tolerated(0.37)    
chr10_47466     SIFT=tolerated(0.11)

Which command can I use on UNIX to get this list?

  • You can use awk or cut.

  • I tried several commands and none of them worked. Could you be more specific?

2 answers

1

You can extract this information in several ways, for example with cut, with awk, or with the glorious Perl.

Here is an example using awk:

$ awk 'match($0, /SIFT=tolerated\([0-9.]+\)/) { print $1 "\t" substr($0, RSTART, RLENGTH) }' foo.txt

Where:

  • match: the function that searches for the pattern SIFT=tolerated\([0-9.]+\), that is, the text SIFT=tolerated followed by digits or a dot (.) between parentheses. It returns the index at which the matching substring starts (0 if there is no match) and sets the variables RSTART and RLENGTH.
  • substr: returns a substring. Here RSTART is the index where the matched substring starts and RLENGTH is its length.
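
To see what match leaves in RSTART and RLENGTH, here is a small sketch that runs the same pattern on a single line taken from the question. It assumes the fields are separated by tabs, which the question does not confirm; the variable names line and result are only illustrative.

```shell
# One sample line from the question; tab-separated fields are an assumption.
line=$(printf 'chr10_47071\tENSP=ENSGALP00000004070\tSIFT=tolerated(0.97)\tEXON=28/28\tSTRAND=-1')

# Print the start index, the length, and the matched text itself.
result=$(printf '%s\n' "$line" |
  awk 'match($0, /SIFT=tolerated\([0-9.]+\)/) {
         print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
       }')
printf '%s\n' "$result"
```

The third value printed is exactly the text that the answer's command places next to the first column.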

Result:

$ awk 'match($0, /SIFT=tolerated\([0-9.]+\)/) { print $1 "\t" substr($0, RSTART, RLENGTH) }' foo.txt
chr10_46938     SIFT=tolerated(0.38)
chr10_46966     SIFT=tolerated(0.66)
chr10_46987     SIFT=tolerated(0.93)
chr10_47071     SIFT=tolerated(0.97)
chr10_47164     SIFT=tolerated(0.37)
chr10_47466     SIFT=tolerated(0.11)
$ 

On other systems the syntax may differ slightly, but it should be easy to adapt.
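
If match behaves differently in your awk, one possible alternative is to loop over the fields and print the first one that starts with SIFT=. This is only a sketch, not part of the answer above: it assumes whitespace-separated fields, and the temporary file and the variable names sample and out are hypothetical.

```shell
# Hypothetical sample file holding two lines from the question (tab-separated).
sample=$(mktemp)
printf 'chr10_46938\tEXON=28/28\tSTRAND=-1\tENSP=ENSGALP00000004070\tSIFT=tolerated(0.38)\n' > "$sample"
printf 'chr10_47071\tENSP=ENSGALP00000004070\tSIFT=tolerated(0.97)\tEXON=28/28\tSTRAND=-1\n' >> "$sample"

# Scan fields 2..NF and print column 1 plus the first field starting with SIFT=.
out=$(awk '{
  for (i = 2; i <= NF; i++)
    if ($i ~ /^SIFT=/) { print $1 "\t" $i; break }
}' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

Unlike the match version, this one keeps any SIFT= value, not only tolerated(...); tighten the test to /^SIFT=tolerated\(/ if that matters.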

  • Thank you very much Qmechanic73, it worked perfectly!!! :)

  • @Clarissa Great. If possible, mark the answer as accepted by clicking the check mark below the arrows; its color will change from gray to green. :)

1

@Qmechanic73: ... glorious Perl

perl -nE 'say "$1\t$2" if /^(\S+).*?\s(SIFT=\S+)/' foo.txt

And, for a change, with sed (GNU sed, since \S is a GNU extension):

sed -r 's!(\S+).*(SIFT=\S+).*!\1 \2!' foo.txt
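
Note that the sed command above prints lines without a SIFT= entry unchanged. A possible variant, sketched here under the assumption that such lines should be dropped, uses -n together with the p flag and POSIX character classes instead of \S. The line chr10_99999 is made up for the demonstration and does not come from the question.

```shell
# Two input lines: one real line from the question (shortened) and one
# hypothetical line without a SIFT= entry.
out=$(printf 'chr10_46938\tEXON=28/28\tSIFT=tolerated(0.38)\nchr10_99999\tEXON=1/2\tSTRAND=1\n' |
  sed -nE 's/^([^[:space:]]+).*(SIFT=[^[:space:]]+).*/\1 \2/p')
printf '%s\n' "$out"
```

Only the line that contains SIFT= is printed; the other is suppressed by -n.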
