How to get different outputs (fasta) files with Unix or python

Asked

Viewed 283 times

4

I have a fasta file that has several sequences of genes, like:

>gene1 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
>gene2 C.irapeanum 5.8S rRNA gene
AATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gene3 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
>gene4
TTTTTTTTCCCTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

I also have several files called ids.txt with genes separated ids on each line.

For example I have the file named id1.txt:

gene1
gene2

And many others such as id2.txt:

gene4

I want to run a script that generates separate fasta files depending on my file id and how to input my large fasta file with sequences (just one file). For example for id2.txt, generate an id2.fasta file:

>gene4
TTTTTTTTCCCTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

And so on...I do not know how to do this, I tried to do with python, but I am very beginner and I could only do for one file at a time, but I have 500 different files for the ids...

Any suggestions? Thank you

1 answer

4


File organization

First, let’s define the organization of our files and the content in each one.

 fasta/
   all.fasta
   id1.txt
   id2.txt
 fasta.py

all fasta. is the main file, which contains the complete information of all genes.

>gene1 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
>gene2 C.irapeanum 5.8S rRNA gene
AATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gene3 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
>gene4
TTTTTTTTCCCTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

id1.txt contains a list of gene identifiers.

gene1
gene2

id2.txt also has a list of genes.

gene4

fasta py. is the Python file that we will create to run the process.

Getting the data

The first thing to do in the Python file is to read the contents of fasta/all.fasta and interpret, obtaining the data for each gene separately.

# -*- coding: utf-8 -*-

with open("fasta/all.fasta", 'r') as f:

    genes = {}

    fasta_content = f.read()
    genes_list = [_ for _ in fasta_content.split('>') if _]

    for gene in genes_list:
        gene_id = (gene.split('\n')[0]).split(' ')[0]
        genes[gene_id] = gene

We start opening the file all.fasta as reading, with the function open. We read all its contents, with the function read and store it in the variable fasta_content. After, we divide the content based on the character > to obtain the data for each gene separately. Later, we iterate on this list with the instruction for, where we first take the gene identifier, separating the first word from the first line (gene1, gene2, gene3, etc.) and with this we populate a variable genes.

Until this part of the code, we will have the variable genes, of the dictionary type, in the following format::

{
    "gene1": "gene1 C.irapeanum 5.8S rRNA gene\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n",
    "gene2": "gene2 C.irapeanum 5.8S rRNA gene\nAATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\nAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG\n",
    "gene3": "gene3 C.irapeanum 5.8S rRNA gene\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n",
    "gene4": "gene4\nTTTTTTTTCCCTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
}

Getting the gene lists

Now we will iterate over the named text files with the pattern idX.txt inside the briefcase fasta in order to obtain the list of genes for each one and subsequently create the respective archive idX.fasta. For this, we use the function glob. First, we import the library at the beginning of the archive.

import glob

With this, we get the list of files:

for id_file in glob.glob("fasta/id*.txt"):
    print(id_file)

The exit of this stretch must be something like:

fasta/id1.txt
fasta/id2.txt

Generating the fasta files

Instead of printing the file name, let’s open it for reading.

for id_file in glob.glob("fasta/id*.txt"):
    with open(id_file, 'r') as idf:
        id_list = idf.readlines()
        print(id_list)

Right now, the exit should be something like:

["gene1", "gene2"]
["gene4"]

The lists of genes are defined in each file. From it, we can create the file idX.fasta and write the contents of each gene on the list, as a subsequent.

for id_file in glob.glob("fasta/id*.txt"):
    with open(id_file, 'r') as idf:
        id_list = idf.readlines()
        idX_name = "{}.fasta".format(id_file.split('.')[0])

        with open(idX_name, "w+") as idX_file:
            for id in id_list:
                idX_file.write(genes[id])

That is, for each file of the type fasta/id*.txt, open as read, read all lines and store in id_list, resume the file name with the extension fasta (here the variable idX_name will have a value fasta/id*.fasta). After, open the fasta file in written mode and for each id in id_list, write the respective content of the gene identified by id.

This way, you should have already generated all the respective files with the correct content.

Full fasta.py file

# -*- coding: utf-8 -*-

import glob

with open("fasta/all.fasta", 'r') as f:

    genes = {}

    fasta_content = f.read()
    genes_list = [_ for _ in fasta_content.split('>') if _]

    for gene in genes_list:
        gene_id = (gene.split('\n')[0]).split(' ')[0]
        genes[gene_id] = gene

    for id_file in glob.glob("fasta/id*.txt"):
        with open(id_file, 'r') as idf:
            id_list = idf.readlines()
            idX_name = "{}.fasta".format(id_file.split('.')[0])

            with open(idX_name, "w+") as idX_file:
                for id in id_list:
                    idX_file.write('>' + genes[id[:-1]])

Upshot

Running the code, two files are created: fasta/id1.fasta and fasta/id2.fasta.

fasta/id1.fasta

>gene1 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
>gene2 C.irapeanum 5.8S rRNA gene
AATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG

fasta/id2.fasta

>gene4
TTTTTTTTCCCTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

That I believe was the desired.

  • Thank you very much Andreson for your explanations and time. I tried to run but gave error: "File". /input_fasta.py", line 23, in <module> idX_file.write(genes[id])". I don’t know what might be.

  • There were two errors, because I wrote all the code right here and ended up not even testing in practice. The two are in the last line: as the fasta format starts with the character >, need to write it in the file as well. The other error was in the index id. When you read the indexes with readlines, the character \n comes together, so we need to remove it, through [:-1]. I edited the answer with the solutions.

  • Sensational. Thank you so much for your help! It worked perfectly. And with your explanations you can understand each part and try to learn. I’m starting to learn python.

  • If I may ask, what is the purpose of the code? I was unaware of this pattern until today.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.