scBASE is a python implementation of “soft” zero-and-one inflated beta-binomial (ZOIBB) model for estimating cellular allelic proportions in which we tackle data sparsity and variability by harnessing information derived from cell's population context.
- Free software: MIT license
Installation
First, clone the github scBASE repository under your source directory.
$ git clone https://github.com/churchill-lab/scBASE.git scBASE
Or download zip
or tar.gz
file from the link above and extract it in your source directory. Go to scBASE source folder. You will see the following files.
$ cd scBASE
$ ls
AUTHORS.rst LICENSE README.rst scbase setup.py
CONTRIBUTING.rst MANIFEST.in docs scripts tests
HISTORY.rst Makefile requirements_dev.txt setup.cfg tox.ini
Note: We highly recommend using Anaconda distribution of python to install all the dependencies without issues. Add the following channels if you have not already:
$ conda config --add channels conda-forge
$ conda config --add channels bioconda
To avoid conflicts among dependencies, we also highly recommend using conda virtual environment:
$ conda create -n scbase python=3
$ source activate scbase
Once scBASE virtual environment is created and activated, your shell prompt will show ‘(scbase)’ at the beginning to specify what virtual environment you are currently in. Now please type the following and install the dependencies for scBASE first:
(scbase) $ conda install pystan click future scipy
(scbase) $ pip install loompy
It is now ready to install scBASE. Note compiling stan code takes a while (several minutes depending on system resource).
(scbase) $ python setup.py install
That's all! Now scBASE is available. The whole installation process took 9m54s on a macbook pro with 2.7GHz Intel Core i7 processor and 16GB 1600MHz DDR3 RAM.
Once you are done using scBASE, you can go out from scBASE virtual environment anytime by deactivating it:
(scbase) $ source deactivate
Usage
We assume that users have allele-specific read counts in a loom format file (where total read counts of genes are stored in the default data layer and allele-specific counts are in each allele layer using allele codes as layer keys, 'Mat' and 'Pat' for example). Or scBASE can create a loom file using allele-specific read count file(s), tab-delimited text in the following format. Users can also create count files using the results from RSEM, kallisto, or even from counting uniquely-mapping reads. The count file can contain multiple samples (or cells) as long as each sample ID is specified using a header #sample_id:
.
#target_id <maternal_allele_code> <paternal_allele_code> total
#sample_id: <Cell_ID_1>
<Gene_ID_1> <maternal_allele_count> <paternal_allele_count> <total_count>
<Gene_ID_2> <maternal_allele_count> <paternal_allele_count> <total_count>
...
#sample_id: <Cell_ID_2>
<Gene_ID_1> <maternal_allele_count> <paternal_allele_count> <total_count>
<Gene_ID_2> <maternal_allele_count> <paternal_allele_count> <total_count>
...
For example, the following file was created by emase-zero using reads from F1 hybrid mouse of C57BL/6J male and CAST/EiJ female (parental haplotypes denoted as B
and F
respectively). It has four columns: the first column is for alignment target ID's, followed by the read counts of alleles, and then, the total.
#target_id B F total
#sample_id: SRR1041728
ENSMUSG00000000001 1394.25216017 334.74783983 1729.0
ENSMUSG00000000003 0.0 0.0 0.0
ENSMUSG00000000028 843.0 2.85744524983e-63 843.0
...
#sample_id: SRR1041729
ENSMUSG00000000001 30.9238234732 0.0761765268472 31.0
ENSMUSG00000000003 0.0 0.0 0.0
ENSMUSG00000000028 554.935510723 1053.06448928 1608.0
...
subcommand: collate
We can collate the allele-specific count information and store it in a loom file using collate
subcommand. scBASE can process multiple count files if the name is specified using wildcard characters (*
or ?
).
Usage: scbase collate [OPTIONS] <indir> <loomfile>
Collates count or parameter files and store in a loom file
(http://loompy.org)
Options:
--counts If you are collating count files
--params If you are collating count files
--name TEXT File name (default: *genes*counts)
-t, --tidfile <tidfile> Name of target ID file
-m, --model <ase_model> <tgx_model>
This is a developer option used only when
other models in addition to ones provided by
default are available.
-v, --verbose '-v' is Level 1 and '-vv' is Level 2
--help Show this message and exit.
For example, the following command will collate allele-specific counts from all the files whose name has the pattern of *.gene.counts
in the folder /fastscratch/myproject1
and create a loom file named proj1.loom
.
$ scbase collate --counts -t /fastscratch/myproject1/geneID.txt --name '*.gene.counts' /fastscratch/myproject1 proj1.loom
We require target ID file in case the count files do not contain genes that are not expressed. For example, geneID.txt
is a text file that contains all the Ensembl gene ID's:
ENSMUSG00000000001
ENSMUSG00000000003
ENSMUSG00000000028
ENSMUSG00000000031
ENSMUSG00000000037
...
subcommand: select
We can select genes for fitting with ZOIBB model according to the minumum numbers of cells that satisfy the minimum number of reads expressed from a gene using select
subcommand.
Usage: scbase select [OPTIONS] <loomfile>
Select genes if they are expressed at least <min_read_count> in at least
<min_cell_count> cells
Options:
--min-read-count <min_read_count> Min read count required
--min-cell-count <min_cell_count> Min cell count required
--layer <layer_key> The layer to consider selection
-v, --verbose '-v' is Level 1 and '-vv' is Level 2
--help Show this message and exit.
For example, the following command will filter out genes that were not expressed in at least 4 reads over in at least 16 cells, and then, mark selected genes in a row attribute (ra.Selected
) of the input loom file so the next procedure run-mcmc
will only run on those selected genes.
$ scbase select --min-read-count 4 --min-cell-count 16 proj1.loom
subcommand: run-mcmc
Once input data is ready in a loom file, we can fit ZOIBB model using run-mcmc
subcommand.
Usage: scbase run-mcmc [OPTIONS] <loomfile>
Runs MCMC (using loom file)
Options:
--hapcode <mat_hapcode> <pat_hapcode>
Haplotype code for maternal and paternal
alleles (default: M P)
-m, --model <ase_model> <tgx_model>
This is a developer option used only when
other models in addition to ones provided by
default are available.
-s, --start <gidx_start> Starting gene (row index)
-e, --end <gidx_end> Ending gene (row index)
-o, --outfile <outfile> Name of output file (default:
scbase.<start>-<end>.param.npz)
-v, --verbose '-v' is Level 1 and '-vv' is Level 2
--help Show this message and exit.
For example, the following command will fit the model for the first six genes in the loom file, proj1.loom
and save the results in a file named scbase.0-6.param.npz
. Haplotype code should be given in the order of maternal and paternal allele.
$ scbase run-mcmc proj1.loom -s 0 -e 6 --hapcode F B
subcommand: submit
Running MCMC for tens of thousands of genes would be time consuming but we can parallelize it using HPC cluster systems. We can submit jobs automatically using submit
subcommand.
Usage: scbase submit [OPTIONS] <loomfile>
Submits scBASE fitting jobs to HPC clusters
Options:
--hapcode <mat_hapcode> <pat_hapcode>
Haplotype code for maternal and paternal
alleles (default: M P)
-m, --model <ase_model> <tgx_model>
This is a developer option used only when
other models in addition to ones provided by
default are available.
-c, --chunk <chunk_size> Number of genes in each chunk
-o, --outdir <outdir> Folder name to store parameter files
--systype <systype> Type of HPC cluster system (default: pbs)
--email <email> Notification E-mail
--queue <queue> Queue name
--mem <mem> Memory in GB (default: 16GB)
--walltime <walltime> Walltime in hours (default: 24h)
--dryrun Use this when you want to rehearse your
submit commands
-v, --verbose '-v' is Level 1 and '-vv' is Level 2
--help Show this message and exit.
For example, the following comand will submit jobs to PBS cluster, each fits 50 genes in proj1.loom
file and save the results in files named scbase.{gidx_start}-{gidx_end}.param.npz
in /fastscratch/myproject1
folder. This will also generate temporary data files named _chunk.{gidx_start}-{gidx_end}.npz
, which you can delete once the jobs are all done.
$ scbase submit -o /fastscratch/myproject1 --systype pbs --chunk 50 --hapcode F B proj1.loom
Now we collate the model parameters and store in the input loom file, proj1.loom
so that we can now have all the information in a single file.
$ scbase collate --params /fastscratch/myproject1 proj1.loom
The proj1.loom
file contains the following.
In [1]: import loompy
In [2]: ds = loompy.connect('proj1.loom')
In [3]: ds.layers.keys()
Out[3]:
['',
'B',
'F',
'lambda_k',
'p_k',
'pi_bk',
'pi_mk',
'pi_pk']
In [4]: ds.ra.keys()
Out[4]:
['GeneID',
'Rhat_ase',
'Rhat_tgx',
'alpha_ase1',
'alpha_ase2',
'alpha_mono',
'alpha_tgx1',
'alpha_tgx2',
'pi_b',
'pi_m',
'pi_p',
'Selected',
'Symbol',
'Valid']
In [5]: ds.ca.keys()
Out[5]: ['CellID', 'CellType', 'Selected', 'Size', 'Valid']
In [6]: ds.close()
layers['lambda_k']
is "normalized" gene expression count per cell and layers['p_k']
contains the final "partial pooled" allelic proportion per cell. (Note: Each layer in a loom file has a dimensionality of |num_genes|x|num_cells|.) The proportion of paternal monoallelic, bi-allelic, and maternal monoallelic cells are in ra['pi_p']
, ra['pi_b']
, and ra['pi_m']
. Cell level probability of allelic expression states are available at the layers of layers['pi_pk']
, layers['pi_bk']
, and layers['pi_mk']
.
Announcements
- 25 Oct 2018 » Welcome to scBASE!