File organization
First, let’s define the organization of our files and the content in each one.
fasta/
all.fasta
id1.txt
id2.txt
fasta.py
all fasta. is the main file, which contains the complete information of all genes.
>gene1 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
>gene2 C.irapeanum 5.8S rRNA gene
AATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gene3 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
>gene4
TTTTTTTTCCCTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
id1.txt contains a list of gene identifiers.
gene1
gene2
id2.txt also has a list of genes.
gene4
fasta py. is the Python file that we will create to run the process.
Getting the data
The first thing to do in the Python file is to read the contents of fasta/all.fasta
and interpret, obtaining the data for each gene separately.
# -*- coding: utf-8 -*-
with open("fasta/all.fasta", 'r') as f:
genes = {}
fasta_content = f.read()
genes_list = [_ for _ in fasta_content.split('>') if _]
for gene in genes_list:
gene_id = (gene.split('\n')[0]).split(' ')[0]
genes[gene_id] = gene
We start opening the file all.fasta
as reading, with the function open
. We read all its contents, with the function read
and store it in the variable fasta_content
. After, we divide the content based on the character >
to obtain the data for each gene separately. Later, we iterate on this list with the instruction for
, where we first take the gene identifier, separating the first word from the first line (gene1, gene2, gene3, etc.) and with this we populate a variable genes
.
Until this part of the code, we will have the variable genes
, of the dictionary type, in the following format::
{
"gene1": "gene1 C.irapeanum 5.8S rRNA gene\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n",
"gene2": "gene2 C.irapeanum 5.8S rRNA gene\nAATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\nAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG\n",
"gene3": "gene3 C.irapeanum 5.8S rRNA gene\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n",
"gene4": "gene4\nTTTTTTTTCCCTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
}
Getting the gene lists
Now we will iterate over the named text files with the pattern idX.txt
inside the briefcase fasta
in order to obtain the list of genes for each one and subsequently create the respective archive idX.fasta
. For this, we use the function glob
. First, we import the library at the beginning of the archive.
import glob
With this, we get the list of files:
for id_file in glob.glob("fasta/id*.txt"):
print(id_file)
The exit of this stretch must be something like:
fasta/id1.txt
fasta/id2.txt
Generating the fasta files
Instead of printing the file name, let’s open it for reading.
for id_file in glob.glob("fasta/id*.txt"):
with open(id_file, 'r') as idf:
id_list = idf.readlines()
print(id_list)
Right now, the exit should be something like:
["gene1", "gene2"]
["gene4"]
The lists of genes are defined in each file. From it, we can create the file idX.fasta
and write the contents of each gene on the list, as a subsequent.
for id_file in glob.glob("fasta/id*.txt"):
with open(id_file, 'r') as idf:
id_list = idf.readlines()
idX_name = "{}.fasta".format(id_file.split('.')[0])
with open(idX_name, "w+") as idX_file:
for id in id_list:
idX_file.write(genes[id])
That is, for each file of the type fasta/id*.txt
, open as read, read all lines and store in id_list
, resume the file name with the extension fasta (here the variable idX_name
will have a value fasta/id*.fasta
). After, open the fasta file in written mode and for each id in id_list
, write the respective content of the gene identified by id
.
This way, you should have already generated all the respective files with the correct content.
Full fasta.py file
# -*- coding: utf-8 -*-
import glob
with open("fasta/all.fasta", 'r') as f:
genes = {}
fasta_content = f.read()
genes_list = [_ for _ in fasta_content.split('>') if _]
for gene in genes_list:
gene_id = (gene.split('\n')[0]).split(' ')[0]
genes[gene_id] = gene
for id_file in glob.glob("fasta/id*.txt"):
with open(id_file, 'r') as idf:
id_list = idf.readlines()
idX_name = "{}.fasta".format(id_file.split('.')[0])
with open(idX_name, "w+") as idX_file:
for id in id_list:
idX_file.write('>' + genes[id[:-1]])
Upshot
Running the code, two files are created: fasta/id1.fasta
and fasta/id2.fasta
.
fasta/id1.fasta
>gene1 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
>gene2 C.irapeanum 5.8S rRNA gene
AATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
fasta/id2.fasta
>gene4
TTTTTTTTCCCTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
That I believe was the desired.
Thank you very much Andreson for your explanations and time. I tried to run but gave error: "File". /input_fasta.py", line 23, in <module> idX_file.write(genes[id])". I don’t know what might be.
– Alex
There were two errors, because I wrote all the code right here and ended up not even testing in practice. The two are in the last line: as the fasta format starts with the character
>
, need to write it in the file as well. The other error was in the indexid
. When you read the indexes withreadlines
, the character\n
comes together, so we need to remove it, through[:-1]
. I edited the answer with the solutions.– Woss
Sensational. Thank you so much for your help! It worked perfectly. And with your explanations you can understand each part and try to learn. I’m starting to learn python.
– Alex
If I may ask, what is the purpose of the code? I was unaware of this pattern until today.
– Woss