EMASE-Zero is a C++ implementation of EMASE, Expectation-Maximization (EM) algorithm for Allele Specific Expression (ASE). Unlike other approaches, we take into account the nested relationship of isoforms and alleles under genes when we disambiguate multiply aligning reads, which improves downstream analyses performed at the gene level. In combination with alntools (an alignment file processor that generates compressed three-dimensional incidence matrix), Zero estimates the expected read counts fast, over 10 times faster than RSEM. Zero also generalizes our fast hierarchical EM to any decent alignment strategies. The original python implementation is available at https://github.com/churchill-lab/emase.
- Free software: GPLv3 license
Building and installing EMASE-Zero requires a C++ compiler. After unzipping or extracting the downloaded file, please do the following:
$ cd src $ make
This will create an executable called emase-zero in the src directory. You can copy this executable into your desired bin directory (such as /usr/local/bin).
EMASE-Zero takes in an alignment incidence matrix file (a binary format) generated by alntools using the diploid transcriptome alignment BAM file and two text files containing transcript lengths and gene-to-transcript mapping information. We need to specify which EMASE Model to use for disambiguating the origins of multiply-aligning reads.
$ emase-zero -h EMASE-Zero Version x.y.z Help ----------------------------- USAGE: emase-zero [options]
OPTIONS --bin (-b): Alignment incident matrix file (Binary format) prepared with alntools (https://churchill-lab.github.io/alntools) --outbase (-o) : Specify basename for output file (default="emase-zero.quantified") --model (-m) : Specify EMASE model (can be 1-4, default=4) --transcript-lengths (-l) : Filename for transcript length file. Tab delimited file where the first field is the transcript ID, and the second is transcript length --gene-mappings (-g) : Filename containing transcript to gene mapping. Tab delimited file where the first field is the gene ID, all other fields are the transcript IDs that belong to that gene --max-iterations (-i) : Specify the maximum number of EM iterations (Default=9999) --tolerance (-t) : Specify the convergence threshold. EM will terminate when the sum of the aboslute value of differences in the total TPM from one iteration to the next is lower than this value if performing length adjustment and multipled by #reads if not (Default=0.000001) --verbose (-v): Run in verbose mode --version: Print the version and exit --help (-h): Print this message and exit
--bin: Specify the name of binary equivalence class (EC) file. A binary EC file is generated from a bam file using alntools. EMASE-Zero does not require any specific aligner to use, but we recommend bowtie Please see example below for the bowtie options we use for RNA-Seq analyses.
--model: Specify EMASE model for disambiguating multireads. According to our benchmarking, Model 2 worked best for ASE while Model 4 were the best for total gene exression.
--transcript-lengths: Specify transcript length file name. We assume the file has the format of "Transcript_ID[tab]Length". For example, assuming 'M' and 'P' stand for maternal and paternal alleles respectively,
ENSMUST00000000001_M 3262 ... ENSMUST00000098609_M 2653 ... ENSMUST00000000001_P 3264 ... ENSMUST00000098609_P 2659 ...
--gene-mappings: Specify gene-to-transcripts mapping file name. The file should be formatted in Gene_ID[tab]Transcript_ID_1[tab]Transcript_ID_2[tab]…. These mappings are among reference IDs, not among allele IDs, so no haplotype code at the end like _M or _P. For example,
ENSMUSG00000000001 ENSMUST00000000001 ENSMUSG00000000003 ENSMUST00000000003 ENSMUST00000114041 ENSMUSG00000000028 ENSMUST00000000028 ENSMUST00000096990 ENSMUST00000115585 ...
Here's an example pipeline of commands:
$ bowtie -q -a --best --strata --sam -v 3 -m 100 bowtie.transcriptome my_sample.fastq | samtools view -bS - > my_sample.bam $ alntools bam2ec -t transcripts.lengths.tsv -c 8 my_sample.bam my_sample.bin $ emase-zero -m 2 -b my_sample.bin -o emase-zero.quantified -l transcript.lengths.tsv -g gene2transcripts.tsv
Note: We highly recommend adding poly-A tail to the alignment target sequences (transcripts) when you build bowtie index. We usually add (R-1)-long poly-A tails where R is the read length.