scikit-allel
"This package provides utilities for exploratory analysis of large-scale genetic variation data. It is based on numpy, scipy and other general-purpose Python scientific libraries."
$ pip install scikit-allel
Let’s consider the file sample.vcf with the following content:
##fileformat=VCFv4.3
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS DP=14;AF=0.5;DB GT:DP 0/0:1 0/1:8 1/1:5
20 17330 . T A 3 q10 DP=11;AF=0.017 GT:DP 0/0:3 0/1:5 0/0:41
20 1110696 rs6040355 A G,T 67 PASS DP=10;AF=0.333,0.667;DB GT:DP 0/2:6 1/2:0 2/2:4
20 1230237 . T . 47 PASS DP=13 GT:DP 0/0:7 0/0:4 ./.:.
20 1234567 microsat1 GTC G,GTCT 50 PASS DP=9 GT:DP 0/1:4 0/2:2 1/1:3
Below are two examples of extracting data from the file. In the first, we use the function read_vcf() to extract the data into numpy arrays:
import allel
callset = allel.read_vcf('sample.vcf')
print(callset.keys())
Output:
dict_keys(['samples', 'calldata/GT', 'variants/ALT', 'variants/CHROM', 'variants/FILTER_PASS', 'variants/ID', 'variants/POS', 'variants/QUAL', 'variants/REF'])
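To make the 'calldata/GT' key more concrete: each sample column in the file (for example 0/1:8) holds the values named in the FORMAT column (GT:DP), and the genotype part is a pair of allele indices. The following is only a rough sketch using the standard library, not how scikit-allel parses the file internally; the helper names parse_sample_field and parse_genotype are hypothetical, and the use of -1 for a missing allele is chosen here to mirror the convention scikit-allel uses in its genotype arrays.

```python
def parse_sample_field(field, format_keys=("GT", "DP")):
    """Split one sample column (e.g. '0/1:8') into a dict keyed by FORMAT id."""
    return dict(zip(format_keys, field.split(":")))


def parse_genotype(gt):
    """Turn '0/1' or '1|2' into a list of allele indices; '.' becomes -1."""
    alleles = gt.replace("|", "/").split("/")
    return [-1 if allele == "." else int(allele) for allele in alleles]


# Sample columns of the third record in sample.vcf (POS 1110696).
row = ["0/2:6", "1/2:0", "2/2:4"]
genotypes = [parse_genotype(parse_sample_field(sample)["GT"]) for sample in row]
print(genotypes)
```

Stacking one such list per variant gives the variants x samples x ploidy layout that 'calldata/GT' stores as a numpy array.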
In the second example, we use the function vcf_to_dataframe() to read the file into a pandas DataFrame:
df = allel.vcf_to_dataframe('sample.vcf')
print(df)
Output:
CHROM POS ID REF ALT_1 ALT_2 ALT_3 QUAL FILTER_PASS
0 20 14370 rs6054257 G A NaN NaN 29.0 True
1 20 17330 . T A NaN NaN 3.0 False
2 20 1110696 rs6040355 A G T NaN 67.0 True
3 20 1230237 . T NaN NaN NaN 47.0 True
4 20 1234567 microsat1 GTC G GTCT NaN 50.0 True
See more in this excellent post.
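One caveat: the .vcf extension is also used by vCard contact files (the kind an address book exports), and those are not Variant Call Format files, so read_vcf() cannot parse them. Below is a minimal sketch, using only the standard library, for checking which kind of file you have before passing it to scikit-allel. The helper name sniff_vcf_kind is hypothetical and not part of scikit-allel; the check assumes a Variant Call Format file starts with a ##fileformat=VCF line and a vCard starts with BEGIN:VCARD.

```python
def sniff_vcf_kind(path):
    """Return 'variant-call', 'vcard' or 'unknown' based on the first non-empty line."""
    # utf-8-sig drops a leading byte-order mark if the exporter wrote one.
    with open(path, encoding="utf-8-sig", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith("##fileformat=VCF"):
                return "variant-call"
            if line.upper().startswith("BEGIN:VCARD"):
                return "vcard"
            # Only the first non-empty line is inspected.
            return "unknown"
    return "unknown"
```

For example, sniff_vcf_kind('sample.vcf') should return 'variant-call' for the file shown above, while a contact export would be reported as 'vcard' and is better handled by exporting it as CSV instead.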
The file I’m trying to read is my contact book, which I exported from Google to a file called Contacts.vcf, and I get the following error: headers = _read_vcf_headers(stream) File "/home/Yan/.local/lib/python3.5/site-packages/allel/io/vcf_read.py", line 1772, in _read_vcf_headers raise RuntimeError('VCF file is missing mandatory header line ("#CHROM...")') RuntimeError: VCF file is missing mandatory header line ("#CHROM...")
– Yan Luiz
Are you trying this with the library I suggested?
– Sidon
You said the data comes from your Google contacts. In that case, if I’m not mistaken, that .vcf is a different type of file (iOS). Wouldn’t it be simpler for you to export it as CSV?
– Sidon
I exported it as CSV and managed to process the data. Thank you very much.
– Yan Luiz