g2gtools

g2gtools is a new suite of tools that creates personal diploid genomes. It allows us to easily extract regions from personal genomes so we can create individualized alignment indexes for next-generation sequencing reads. g2gtools can liftover alignments on personal genome coordinates back to that of reference so we can compare alignments from among samples in a population. Unlike other liftover tools, g2gtools does not throw away alignments that land on indel regions.

View on GitHub Download .zip Download .tar.gz

Installation

We highly recommend using Anaconda distribution of python to install all the dependencies without issues, although g2gtools is also available at PyPI for ‘pip install’ or ‘easy_install’. The most recent version is also available on Anaconda Cloud, so add the following channels if you have not already.


$ conda config --add channels r
$ conda config --add channels bioconda

To avoid conflicts among dependencies, we highly recommend using conda virtual environment:


$ conda create -n g2gtools jupyter ipykernel
$ source activate g2gtools

Once g2gtools virtual environment is created and activated, your shell prompt will show ‘(g2gtools)’ at the beginning to specify what virtual environment you are currently in. Now type the following and install g2gtools:


(g2gtools) $ conda install -c kbchoi g2gtools

That’s all! You can go out from g2gtools virtual environment anytime by deactivating it:


(g2gtools) $ source deactivate

Features

In this section, we will show core features of g2gtools.


$ g2gtools -h


        ___       _              _
       |__ \     | |            | |
   __ _   ) |__ _| |_ ___   ___ | |___
  / _` | / // _` | __/ _ \ / _ \| / __|
 | (_| |/ /| (_| | || (_) | (_) | \__ \
  \__, |____\__, |\__\___/ \___/|_|___/   v0.2.0
   __/ |     __/ |
  |___/     |___/




    The most commonly used commands are:
       vcf2vci       Creates VCI file from VCF file(s)
       patch         Replaces bases in a fasta file from SNPs specified in a VCI file
       transform     Incorporates indels specified in a VCI file
       convert       Lifts over coordinates
       extract       Gets subsequences from fasta file

    Other commands:
       gtf2db        Creates db file from GTF annotations

vcf2vci

compiles variant call information from a VCF file into a G2G VCI format file for later (sample-specific) lookup. The input VCF file should contain both snps and indels.


$ g2gtools vcf2vci -h

Creates VCI file from VCF file(s)

Usage: g2gtools vcf2vci [-i <indel VCF file>]* -s <strain> -o <output file> [options]

Required Parameters:
    -i, --vcf <vcf_file>             VCF file name
    -s, --strain <Strain>            Name of strain (column in VCF file)
    -o, --output <Output file>       VCI file name to create

Optional Parameters:
    -p, --num-processes <number>     The number of processes to use, defaults to the number of cores
    --diploid                        Create diploid VCI
    --keep                           Keep track of VCF lines that could not be converted to VCI file
    --pass                           Use only VCF lines that have a PASS for the filter value
    --quality                        Filter on quality, FI=PASS
    --no-bgzip                       DO NOT compress and index output

Help Parameters:
    -h, --help                       Print the help and exit
    -d, --debug                      Turn debugging mode on (list multiple times for more messages)

patch

replaces bases in the input (reference) sequence using SNPs in the sample-specific VCI file. This command supports patching of local subsequence on the fly.


$ g2gtools patch

Patch SNPs onto the reference sequence

Usage: g2gtools patch -i <Fasta file> -c <VCI file> [options]

Required Parameters:
    -i, --input <Fasta file>         Reference fasta file to extract and patch on
    -c, --vci <VCI file>             VCI file

Optional Parameters:
    -p, --num-processes <number>     The number of processes to use, defaults to the number of cores
    -o, --output <Output file>       Name of output fasta file that contains SNP-patched sequences
    -r, --region <seqid:start-end>   Region to extract (cannot be used with -b)
    -b, --bed <BED file>             BED file (cannot be used with -r)
    --bgzip                          Compress and index output (can only be used with -o)

Help Parameters:
    -h, --help                       Print the help and exit
    -d, --debug                      Turn debugging mode on (list multiple times for more messages)

Note:
    --bgzip can be potentially slow depending on the size of the file

transform

incorporates indels specified in a sample's VCI file into the input (reference) sequence. This command also supports the transformation of local subsequence on the fly.


$ g2gtools transform -h

Incorporate indels onto the input sequence

Usage: g2gtools transform -i <Fasta file> -c <VCI file> [options]

Required Parameters:
    -i, --input <Fasta file>         Reference fasta file to extract and transform
    -c, --vci <VCI file>             VCI file

Optional Parameters:
    -p, --num-processes <number>     The number of processes to use, defaults to the number of cores
    -o, --output <Output file>       Name of output fasta file
    -r, --region <seqid:start-end>   Region to extract (cannot be used with -b)
    -b, --bed <BED file>             BED file (cannot be used with -r)
    --bgzip                          Compress and index output fasta file (can only be used with -o)

Help Parameters:
    -h, --help                       Print the help and exit
    -d, --debug                      Turn debugging mode on (list multiple times for more messages)

Note:
    --bgzip can be potentially slow depending on the size of the file

convert

lifts over coordinates back and forth between reference and custom personal genomes. It supports BAM/SAM, GTF, and BED formats. We implemented g2gtools since every available liftover tools discards alignments that land on indel locations in BAM/SAM files with no justifiable reason. We wanted to avoid removing such alignments since they are possibly enriched with allele-specific information.


$ g2gtools convert -h

Liftover coordinates of bam|sam|gtf|bed files

Usage: g2gtools convert -c <VCI file> -i <input file> [options]

Required Parameters:
    -i, --input <input file>         Input file to liftover
    -c, --vci <VCI file>             VCI file

Optional Parameters:
    -f, --format <bam|sam|gtf|bed>   Input file format
    -o, --output <output file>       Name of output file
    --reverse                        Reverse the direction of the conversion (Custom -> Ref)

Help Parameters:
    -h, --help                       Print the help and exit
    -d, --debug                      Turn debugging mode on (list multiple times for more messages)


Note:
    If no output file is specified [-o], the converted information will be redirected
    to standard out.

gtf2db

creates G2G DataBase (DB) file using gene annotation (GTF format). G2G DB file is used for 'extract'ing annotated elements -- generally exons, transcripts, and gene regions -- from the input genomes.


$ g2gtools gtf2db -h

Convert a GTF file to a G2G DB file

Usage: g2gtools gtf2db -i <GTF file> -o <G2G DB file> [options]

Required Parameters:
    -i, --input <GTF file>           GTF file to parse
    -o, --output <G2G DB file>       G2G DB file to create

Help Parameters:
    -h, --help                       Print the help and exit
    -d, --debug                      Turn debugging mode on (list multiple times for more messages)

extract

excerpts genomic regions from the input genome. Useful for creating FASTA file of exons, transcripts, and genes when used with G2G DB (gene annotation) file. Or we can specify putative peak regions for ChIP-Seq or ATAC-Seq in a BED file to create a "peakome".


$ g2gtools extract -h

Extract subsequence from a fasta file given a region

Usage: g2gtools extract -i <Fasta file> [-r <region> | -b <BED file> | -db <Database> | -id <seqid>] [options]

Required Parameters:
    -i, --input <Fasta file>         Fasta file to extract subsequence from, BGZIP files supported
    -b, --bed <BED file>             BED file -OR-
    -id, --identifier <identifier>   Fasta identifier -OR-
    -r, --region <seqid:start-end>   Region to extract -OR-
    -db, --database <DB file>        Database file
         --transcripts                 For use with -db, extracts transcripts (default)
         --exons                       For use with -db, extracts exons
         --genes                       For use with -db, extracts genes

Optional Parameters:
    -c, --vci <VCI file>             Input VCI file, matching input Fasta file
    -R, --vci-reverse                Reverse the direction of liftover
    --complement                     Complement the extracted sequence
    --reverse                        Reverse the extracted sequence
    --reverse-complement             Reverse complement the extracted sequence
    --raw                            Just shows the extracted sequence

Help Parameters:
    -h, --help                       Print the help and exit
    -d, --debug                      Turn debugging mode on (list multiple times for more messages)

Note:
    Locations specified on command line are 1-based coordinates.
    Locations specified via BED file are 0-based coordinates.

Workflow

We show a typical workflow for creating diploid genome, exome, and transcriptome using a human example: NA19670 from 1000 Genomes (phase 3). You can download our bash script here.

1. Create VCI file of the sample


$ g2gtools vcf2vci -o NA19670.vci -s NA19670 --diploid -p 8 \
-i ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr3.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr4.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr5.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr8.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr9.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr11.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr13.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr14.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr15.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr16.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr17.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr18.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr19.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
-i ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz \
-i ALL.chrY.phase3_integrated_v1b.20130502.genotypes.vcf.gz \
-i ALL.chrMT.phase3_callmom.20130502.genotypes.vcf.gz

2. Incorporate SNPs into the reference genome.


$ g2gtools patch -i hs37d5.fa -c NA19670.vci.gz -o NA19670.fa -p 8

3. Incorporate indels into the reference genome and get custom diploid genome for the sample.


$ g2gtools transform -i NA19670.patched.fa -c NA19670.vci.gz -o NA19670.fa -p 8

4. Liftover gene annotation onto sample coordinates.


$ g2gtools convert -i Homo_sapiens.GRCh37.75.gtf -c NA19670.vci.gz -o NA19670.gtf

5. Parse custom gene annotation into a G2G database file.


$ g2gtools gtf2db -i NA19670.gtf -o NA19670.db

6. Extract exons, transcripts, and gene regions from the custom sample genome.


$ g2gtools extract -i NA19670.fa -db NA19670.db --exons > NA19670.exons.fa
$ g2gtools extract -i NA19670.fa -db NA19670.db --transcripts > NA19670.transcripts.fa
$ g2gtools extract -i NA19670.fa -db NA19670.db --genes > NA19670.genes.fa