PyVCF library problem: extracting data from VCF files

I get the error below when trying to read a .vcf file. I could not find a solution, or another library that handles VCF files. Any suggestions? I tried with both versions of Python (2 and 3).

Traceback (most recent call last):
  File "./csvBasic.py", line 6, in <module>
    record = next(vcf_reader)
  File "/home/yan/.local/lib/python2.7/site-packages/vcf/parser.py", line 551, in next
    pos = int(row[1])
IndexError: list index out of range

1 answer


scikit-allel
"This package provides utilities for exploratory analysis of large scale genetic variation data. It is based on numpy, scipy and other general-purpose Python scientific libraries."

$ pip install scikit-allel

Let’s consider the file sample.vcf with the following content:

##fileformat=VCFv4.3
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20      14370   rs6054257       G       A       29      PASS    DP=14;AF=0.5;DB GT:DP   0/0:1   0/1:8   1/1:5
20      17330   .       T       A       3       q10     DP=11;AF=0.017  GT:DP   0/0:3   0/1:5   0/0:41
20      1110696 rs6040355       A       G,T     67      PASS    DP=10;AF=0.333,0.667;DB GT:DP   0/2:6   1/2:0   2/2:4
20      1230237 .       T       .       47      PASS    DP=13   GT:DP   0/0:7   0/0:4   ./.:.
20      1234567 microsat1       GTC     G,GTCT  50      PASS    DP=9    GT:DP   0/1:4   0/2:2   1/1:3

Below are two examples of extracting data from the file. The first uses the function read_vcf() to load the data into NumPy arrays:

import allel

callset = allel.read_vcf('sample.vcf')
print(callset.keys())

Output:

dict_keys(['samples', 'calldata/GT', 'variants/ALT', 'variants/CHROM', 'variants/FILTER_PASS', 'variants/ID', 'variants/POS', 'variants/QUAL', 'variants/REF'])

The second example uses the function vcf_to_dataframe() to read the file into a pandas DataFrame:

df = allel.vcf_to_dataframe('sample.vcf')
print(df)

Output:

    CHROM      POS         ID   REF ALT_1   ALT_2   ALT_3   QUAL  FILTER_PASS
0      20     14370 rs6054257     G     A     NaN     NaN   29.0         True
1      20     17330         .     T     A     NaN     NaN    3.0        False
2      20   1110696 rs6040355     A     G       T     NaN   67.0         True
3      20   1230237         .     T   NaN     NaN     NaN   47.0         True
4      20   1234567 microsat1   GTC     G    GTCT     NaN   50.0         True
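
For reference, the column layout that both functions rely on can also be inspected with the standard library alone. The following is only a minimal sketch, not how scikit-allel or PyVCF actually parse files: the helper name read_variant_rows is hypothetical, the columns are assumed to be tab-separated, and the data is an abbreviated copy of the sample above (the per-sample columns are left out).

```python
import io


def read_variant_rows(stream):
    """Yield one dict per data line, keyed by the names on the #CHROM line.

    Hypothetical helper for illustration; assumes tab-separated columns.
    """
    columns = None
    for line in stream:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            columns = line.lstrip("#").split("\t")
            continue
        if columns is None:
            # A data line appeared before any "#CHROM" header line.
            raise ValueError('missing "#CHROM" header line')
        yield dict(zip(columns, line.split("\t")))


# Abbreviated copy of sample.vcf (first two records, no sample columns).
sample = io.StringIO(
    "##fileformat=VCFv4.3\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14;AF=0.5;DB\n"
    "20\t17330\t.\tT\tA\t3\tq10\tDP=11;AF=0.017\n"
)
rows = list(read_variant_rows(sample))
print(rows[0]["POS"], rows[1]["FILTER"])

# A file without the "#CHROM" line is rejected by this sketch.
try:
    list(read_variant_rows(io.StringIO("BEGIN:VCARD\nVERSION:3.0\n")))
except ValueError as error:
    print("not a variant file:", error)
```

All values come back as strings here. Converting POS or QUAL to numbers, as the DataFrame above does, is left out to keep the sketch short.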

See more in this excellent post.

  • The file I'm trying to read is my contact book, which I exported from Google as a file called Contacts.vcf, and I get the following error: headers = _read_vcf_headers(stream) File "/home/yan/.local/lib/python3.5/site-packages/allel/io/vcf_read.py", line 1772, in _read_vcf_headers raise RuntimeError('VCF file is missing mandatory header line ("#CHROM...")') RuntimeError: VCF file is missing mandatory header line ("#CHROM...")

  • Are you trying this with the library I suggested?

  • You said the data comes from your Google contact book. In that case, if I am not mistaken, .vcf is a different type of file there (vCard, as also used on iOS), not the Variant Call Format that these libraries read. Wouldn't it be simpler to export the contacts as CSV?

  • I exported to CSV and managed to process the data. Thank you very much!
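
Since the .vcf extension is shared by two unrelated formats, a quick check of the first line can tell them apart before any genetics library is involved. The sketch below rests on two facts: vCard files open with BEGIN:VCARD, and Variant Call Format files open with ##fileformat=VCF. The function name sniff_vcf_kind and the temporary file contents are hypothetical and only serve the demonstration.

```python
import os
import tempfile


def sniff_vcf_kind(path):
    """Return 'vcard', 'variant-call', or 'unknown' from the first non-blank line.

    Hypothetical helper for illustration only.
    """
    # utf-8-sig drops a leading byte order mark if the exporter wrote one.
    with open(path, encoding="utf-8-sig", errors="replace") as handle:
        for line in handle:
            first = line.strip()
            if not first:
                continue  # ignore leading blank lines
            if first.upper().startswith("BEGIN:VCARD"):
                return "vcard"
            if first.startswith("##fileformat=VCF"):
                return "variant-call"
            return "unknown"
    return "unknown"  # empty file


# Demonstration with two throwaway files (contents are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    contacts = os.path.join(tmp, "Contacts.vcf")
    variants = os.path.join(tmp, "sample.vcf")
    with open(contacts, "w", encoding="utf-8") as handle:
        handle.write("BEGIN:VCARD\nVERSION:3.0\nFN:Example Person\nEND:VCARD\n")
    with open(variants, "w", encoding="utf-8") as handle:
        handle.write(
            "##fileformat=VCFv4.3\n"
            "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
        )
    kinds = (sniff_vcf_kind(contacts), sniff_vcf_kind(variants))

print(kinds)
```

A result of 'vcard' means the file holds contacts, so PyVCF and scikit-allel are the wrong tools for it, and a CSV export (as done above) or a vCard parser is the way to go.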
