g2gtools is a new suite of tools that creates personal diploid genomes. It allows us to
easily extract regions from personal genomes so we can create individualized alignment indexes for
next-generation sequencing reads. g2gtools can liftover alignments on personal genome
coordinates back to that of reference so we can compare alignments from among samples in a population.
Unlike other liftover tools, g2gtools does not throw away alignments that land on indel
regions.
We highly recommend using Anaconda distribution of python to install all the dependencies without issues, although g2gtools is also available at PyPI for ‘pip install’ or ‘easy_install’. The most recent version is also available on Anaconda Cloud, so add the following channels if you have not already.
$ conda config --add channels r
$ conda config --add channels bioconda
To avoid conflicts among dependencies, we highly recommend using conda virtual environment:
$ conda create -n g2gtools jupyter ipykernel
$ source activate g2gtools
Once g2gtools virtual environment is created and activated, your shell prompt will show ‘(g2gtools)’ at the beginning to specify what virtual environment you are currently in. Now type the following and install g2gtools:
(g2gtools) $ conda install -c kbchoi g2gtools
That’s all! You can go out from g2gtools virtual environment anytime by deactivating it:
(g2gtools) $ source deactivate
In this section, we will show core features of g2gtools.
$ g2gtools -h
___ _ _
|__ \ | | | |
__ _ ) |__ _| |_ ___ ___ | |___
/ _` | / // _` | __/ _ \ / _ \| / __|
| (_| |/ /| (_| | || (_) | (_) | \__ \
\__, |____\__, |\__\___/ \___/|_|___/ v0.2.0
__/ | __/ |
|___/ |___/
The most commonly used commands are:
vcf2vci Creates VCI file from VCF file(s)
patch Replaces bases in a fasta file from SNPs specified in a VCI file
transform Incorporates indels specified in a VCI file
convert Lifts over coordinates
extract Gets subsequences from fasta file
Other commands:
gtf2db Creates db file from GTF annotations
compiles variant call information from a VCF file into a G2G VCI format file for later (sample-specific) lookup. The input VCF file should contain both snps and indels.
$ g2gtools vcf2vci -h
Creates VCI file from VCF file(s)
Usage: g2gtools vcf2vci [-i <indel VCF file>]* -s <strain> -o <output file> [options]
Required Parameters:
-i, --vcf <vcf_file> VCF file name
-s, --strain <Strain> Name of strain (column in VCF file)
-o, --output <Output file> VCI file name to create
Optional Parameters:
-p, --num-processes <number> The number of processes to use, defaults to the number of cores
--diploid Create diploid VCI
--keep Keep track of VCF lines that could not be converted to VCI file
--pass Use only VCF lines that have a PASS for the filter value
--quality Filter on quality, FI=PASS
--no-bgzip DO NOT compress and index output
Help Parameters:
-h, --help Print the help and exit
-d, --debug Turn debugging mode on (list multiple times for more messages)
replaces bases in the input (reference) sequence using SNPs in the sample-specific VCI file. This command supports patching of local subsequence on the fly.
$ g2gtools patch
Patch SNPs onto the reference sequence
Usage: g2gtools patch -i <Fasta file> -c <VCI file> [options]
Required Parameters:
-i, --input <Fasta file> Reference fasta file to extract and patch on
-c, --vci <VCI file> VCI file
Optional Parameters:
-p, --num-processes <number> The number of processes to use, defaults to the number of cores
-o, --output <Output file> Name of output fasta file that contains SNP-patched sequences
-r, --region <seqid:start-end> Region to extract (cannot be used with -b)
-b, --bed <BED file> BED file (cannot be used with -r)
--bgzip Compress and index output (can only be used with -o)
Help Parameters:
-h, --help Print the help and exit
-d, --debug Turn debugging mode on (list multiple times for more messages)
Note:
--bgzip can be potentially slow depending on the size of the file
incorporates indels specified in a sample's VCI file into the input (reference) sequence. This command also supports the transformation of local subsequence on the fly.
$ g2gtools transform -h
Incorporate indels onto the input sequence
Usage: g2gtools transform -i <Fasta file> -c <VCI file> [options]
Required Parameters:
-i, --input <Fasta file> Reference fasta file to extract and transform
-c, --vci <VCI file> VCI file
Optional Parameters:
-p, --num-processes <number> The number of processes to use, defaults to the number of cores
-o, --output <Output file> Name of output fasta file
-r, --region <seqid:start-end> Region to extract (cannot be used with -b)
-b, --bed <BED file> BED file (cannot be used with -r)
--bgzip Compress and index output fasta file (can only be used with -o)
Help Parameters:
-h, --help Print the help and exit
-d, --debug Turn debugging mode on (list multiple times for more messages)
Note:
--bgzip can be potentially slow depending on the size of the file
lifts over coordinates back and forth between reference and custom personal genomes. It supports BAM/SAM, GTF, and BED formats. We implemented g2gtools since every available liftover tools discards alignments that land on indel locations in BAM/SAM files with no justifiable reason. We wanted to avoid removing such alignments since they are possibly enriched with allele-specific information.
$ g2gtools convert -h
Liftover coordinates of bam|sam|gtf|bed files
Usage: g2gtools convert -c <VCI file> -i <input file> [options]
Required Parameters:
-i, --input <input file> Input file to liftover
-c, --vci <VCI file> VCI file
Optional Parameters:
-f, --format <bam|sam|gtf|bed> Input file format
-o, --output <output file> Name of output file
--reverse Reverse the direction of the conversion (Custom -> Ref)
Help Parameters:
-h, --help Print the help and exit
-d, --debug Turn debugging mode on (list multiple times for more messages)
Note:
If no output file is specified [-o], the converted information will be redirected
to standard out.
creates G2G DataBase (DB) file using gene annotation (GTF format). G2G DB file is used for 'extract'ing annotated elements -- generally exons, transcripts, and gene regions -- from the input genomes.
$ g2gtools gtf2db -h
Convert a GTF file to a G2G DB file
Usage: g2gtools gtf2db -i <GTF file> -o <G2G DB file> [options]
Required Parameters:
-i, --input <GTF file> GTF file to parse
-o, --output <G2G DB file> G2G DB file to create
Help Parameters:
-h, --help Print the help and exit
-d, --debug Turn debugging mode on (list multiple times for more messages)
excerpts genomic regions from the input genome. Useful for creating FASTA file of exons, transcripts, and genes when used with G2G DB (gene annotation) file. Or we can specify putative peak regions for ChIP-Seq or ATAC-Seq in a BED file to create a "peakome".
$ g2gtools extract -h
Extract subsequence from a fasta file given a region
Usage: g2gtools extract -i <Fasta file> [-r <region> | -b <BED file> | -db <Database> | -id <seqid>] [options]
Required Parameters:
-i, --input <Fasta file> Fasta file to extract subsequence from, BGZIP files supported
-b, --bed <BED file> BED file -OR-
-id, --identifier <identifier> Fasta identifier -OR-
-r, --region <seqid:start-end> Region to extract -OR-
-db, --database <DB file> Database file
--transcripts For use with -db, extracts transcripts (default)
--exons For use with -db, extracts exons
--genes For use with -db, extracts genes
Optional Parameters:
-c, --vci <VCI file> Input VCI file, matching input Fasta file
-R, --vci-reverse Reverse the direction of liftover
--complement Complement the extracted sequence
--reverse Reverse the extracted sequence
--reverse-complement Reverse complement the extracted sequence
--raw Just shows the extracted sequence
Help Parameters:
-h, --help Print the help and exit
-d, --debug Turn debugging mode on (list multiple times for more messages)
Note:
Locations specified on command line are 1-based coordinates.
Locations specified via BED file are 0-based coordinates.
We show a typical workflow for creating diploid genome, exome, and transcriptome using a human example: NA19670 from 1000 Genomes (phase 3). You can download our bash script here.
$ g2gtools vcf2vci -o NA19670.vci -s NA19670 --diploid -p 8 \
-i ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr4.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr5.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr8.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr9.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr11.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr13.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr14.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr16.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr17.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr19.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz \
-i ALL.chrY.phase3_integrated_v1b.20130502.genotypes.vcf.gz \
-i ALL.chrMT.phase3_callmom.20130502.genotypes.vcf.gz
$ g2gtools patch -i hs37d5.fa -c NA19670.vci.gz -o NA19670.fa -p 8
$ g2gtools transform -i NA19670.patched.fa -c NA19670.vci.gz -o NA19670.fa -p 8
$ g2gtools convert -i Homo_sapiens.GRCh37.75.gtf -c NA19670.vci.gz -o NA19670.gtf
$ g2gtools gtf2db -i NA19670.gtf -o NA19670.db
$ g2gtools extract -i NA19670.fa -db NA19670.db --exons > NA19670.exons.fa
$ g2gtools extract -i NA19670.fa -db NA19670.db --transcripts > NA19670.transcripts.fa
$ g2gtools extract -i NA19670.fa -db NA19670.db --genes > NA19670.genes.fa