scBASE is a python implementation of “soft” zero-and-one inflated beta-binomial (ZOIBB) model for estimating cellular allelic proportions in which we tackle data sparsity and variability by harnessing information derived from cell's population context.

Installation

First, clone the github scBASE repository under your source directory.

$ git clone https://github.com/churchill-lab/scBASE.git scBASE

Or download zip or tar.gz file from the link above and extract it in your source directory. Go to scBASE source folder. You will see the following files.

$ cd scBASE
$ ls
AUTHORS.rst          LICENSE              README.rst           scbase               setup.py
CONTRIBUTING.rst     MANIFEST.in          docs                 scripts              tests
HISTORY.rst          Makefile             requirements_dev.txt setup.cfg            tox.ini

Note: We highly recommend using Anaconda distribution of python to install all the dependencies without issues. Add the following channels if you have not already:

$ conda config --add channels conda-forge
$ conda config --add channels bioconda

To avoid conflicts among dependencies, we also highly recommend using conda virtual environment:

$ conda create -n scbase python=3
$ source activate scbase

Once scBASE virtual environment is created and activated, your shell prompt will show ‘(scbase)’ at the beginning to specify what virtual environment you are currently in. Now please type the following and install the dependencies for scBASE first:

(scbase) $ conda install pystan click future scipy
(scbase) $ pip install loompy

It is now ready to install scBASE. Note compiling stan code takes a while (several minutes depending on system resource).

(scbase) $ python setup.py install

That's all! Now scBASE is available. The whole installation process took 9m54s on a macbook pro with 2.7GHz Intel Core i7 processor and 16GB 1600MHz DDR3 RAM.

Once you are done using scBASE, you can go out from scBASE virtual environment anytime by deactivating it:

(scbase) $ source deactivate

Usage

We assume that users have allele-specific read counts in a loom format file (where total read counts of genes are stored in the default data layer and allele-specific counts are in each allele layer using allele codes as layer keys, 'Mat' and 'Pat' for example). Or scBASE can create a loom file using allele-specific read count file(s), tab-delimited text in the following format. Users can also create count files using the results from RSEM, kallisto, or even from counting uniquely-mapping reads. The count file can contain multiple samples (or cells) as long as each sample ID is specified using a header #sample_id:.

#target_id	<maternal_allele_code>	<paternal_allele_code>	total
#sample_id: <Cell_ID_1>
<Gene_ID_1>	<maternal_allele_count>	<paternal_allele_count>	<total_count>
<Gene_ID_2>	<maternal_allele_count>	<paternal_allele_count>	<total_count>
...
#sample_id: <Cell_ID_2>
<Gene_ID_1>	<maternal_allele_count>	<paternal_allele_count>	<total_count>
<Gene_ID_2>	<maternal_allele_count>	<paternal_allele_count>	<total_count>
...

For example, the following file was created by emase-zero using reads from F1 hybrid mouse of C57BL/6J male and CAST/EiJ female (parental haplotypes denoted as B and F respectively). It has four columns: the first column is for alignment target ID's, followed by the read counts of alleles, and then, the total.

#target_id	B	F	total
#sample_id: SRR1041728
ENSMUSG00000000001	1394.25216017	334.74783983	1729.0
ENSMUSG00000000003	0.0	0.0	0.0
ENSMUSG00000000028	843.0	2.85744524983e-63	843.0
...
#sample_id: SRR1041729
ENSMUSG00000000001	30.9238234732	0.0761765268472	31.0
ENSMUSG00000000003	0.0	0.0	0.0
ENSMUSG00000000028	554.935510723	1053.06448928	1608.0
...

subcommand: collate

We can collate the allele-specific count information and store it in a loom file using collate subcommand. scBASE can process multiple count files if the name is specified using wildcard characters (* or ?).

Usage: scbase collate [OPTIONS] <indir> <loomfile>

  Collates count or parameter files and store in a loom file
  (http://loompy.org)

Options:
  --counts                        If you are collating count files
  --params                        If you are collating count files
  --name TEXT                     File name (default: *genes*counts)
  -t, --tidfile <tidfile>         Name of target ID file
  -m, --model <ase_model> <tgx_model>
                                  This is a developer option used only when
                                  other models in addition to ones provided by
                                  default are available.
  -v, --verbose                   '-v' is Level 1 and '-vv' is Level 2
  --help                          Show this message and exit.

For example, the following command will collate allele-specific counts from all the files whose name has the pattern of *.gene.counts in the folder /fastscratch/myproject1 and create a loom file named proj1.loom.

$ scbase collate --counts -t /fastscratch/myproject1/geneID.txt --name '*.gene.counts' /fastscratch/myproject1 proj1.loom

We require target ID file in case the count files do not contain genes that are not expressed. For example, geneID.txt is a text file that contains all the Ensembl gene ID's:

ENSMUSG00000000001
ENSMUSG00000000003
ENSMUSG00000000028
ENSMUSG00000000031
ENSMUSG00000000037
...

subcommand: select

We can select genes for fitting with ZOIBB model according to the minumum numbers of cells that satisfy the minimum number of reads expressed from a gene using select subcommand.

Usage: scbase select [OPTIONS] <loomfile>

  Select genes if they are expressed at least <min_read_count> in at least
  <min_cell_count> cells

Options:
  --min-read-count <min_read_count> Min read count required
  --min-cell-count <min_cell_count> Min cell count required
  --layer <layer_key>               The layer to consider selection
  -v, --verbose                     '-v' is Level 1 and '-vv' is Level 2
  --help                            Show this message and exit.

For example, the following command will filter out genes that were not expressed in at least 4 reads over in at least 16 cells, and then, mark selected genes in a row attribute (ra.Selected) of the input loom file so the next procedure run-mcmc will only run on those selected genes.

$ scbase select --min-read-count 4 --min-cell-count 16 proj1.loom

subcommand: run-mcmc

Once input data is ready in a loom file, we can fit ZOIBB model using run-mcmc subcommand.

Usage: scbase run-mcmc [OPTIONS] <loomfile>

  Runs MCMC (using loom file)

Options:
  --hapcode <mat_hapcode> <pat_hapcode>
                                Haplotype code for maternal and paternal
                                alleles (default: M P)
  -m, --model <ase_model> <tgx_model>
                                This is a developer option used only when
                                other models in addition to ones provided by
                                default are available.
  -s, --start <gidx_start>      Starting gene (row index)
  -e, --end <gidx_end>          Ending gene (row index)
  -o, --outfile <outfile>       Name of output file (default:
                                scbase.<start>-<end>.param.npz)
  -v, --verbose                 '-v' is Level 1 and '-vv' is Level 2
  --help                        Show this message and exit.

For example, the following command will fit the model for the first six genes in the loom file, proj1.loom and save the results in a file named scbase.0-6.param.npz. Haplotype code should be given in the order of maternal and paternal allele.

$ scbase run-mcmc proj1.loom -s 0 -e 6 --hapcode F B

subcommand: submit

Running MCMC for tens of thousands of genes would be time consuming but we can parallelize it using HPC cluster systems. We can submit jobs automatically using submit subcommand.

Usage: scbase submit [OPTIONS] <loomfile>
  
  Submits scBASE fitting jobs to HPC clusters
  
Options:
  --hapcode <mat_hapcode> <pat_hapcode>
                                  Haplotype code for maternal and paternal
                                  alleles (default: M P)
  -m, --model <ase_model> <tgx_model>
                                  This is a developer option used only when
                                  other models in addition to ones provided by
                                  default are available.
  -c, --chunk <chunk_size>        Number of genes in each chunk
  -o, --outdir <outdir>           Folder name to store parameter files
  --systype <systype>             Type of HPC cluster system (default: pbs)
  --email <email>                 Notification E-mail
  --queue <queue>                 Queue name
  --mem <mem>                     Memory in GB (default: 16GB)
  --walltime <walltime>           Walltime in hours (default: 24h)
  --dryrun                        Use this when you want to rehearse your
                                  submit commands
  -v, --verbose                   '-v' is Level 1 and '-vv' is Level 2
  --help                          Show this message and exit.

For example, the following comand will submit jobs to PBS cluster, each fits 50 genes in proj1.loom file and save the results in files named scbase.{gidx_start}-{gidx_end}.param.npz in /fastscratch/myproject1 folder. This will also generate temporary data files named _chunk.{gidx_start}-{gidx_end}.npz, which you can delete once the jobs are all done.

$ scbase submit -o /fastscratch/myproject1 --systype pbs --chunk 50 --hapcode F B proj1.loom

Now we collate the model parameters and store in the input loom file, proj1.loom so that we can now have all the information in a single file.

$ scbase collate --params /fastscratch/myproject1 proj1.loom

The proj1.loom file contains the following.

In [1]: import loompy

In [2]: ds = loompy.connect('proj1.loom')

In [3]: ds.layers.keys()
Out[3]:
['',
 'B',
 'F',
 'lambda_k',
 'p_k',
 'pi_bk',
 'pi_mk',
 'pi_pk']

In [4]: ds.ra.keys()
Out[4]:
['GeneID',
 'Rhat_ase',
 'Rhat_tgx',
 'alpha_ase1',
 'alpha_ase2',
 'alpha_mono',
 'alpha_tgx1',
 'alpha_tgx2',
 'pi_b',
 'pi_m',
 'pi_p',
 'Selected',
 'Symbol',
 'Valid']

In [5]: ds.ca.keys()
Out[5]: ['CellID', 'CellType', 'Selected', 'Size', 'Valid']

In [6]: ds.close()

layers['lambda_k'] is "normalized" gene expression count per cell and layers['p_k'] contains the final "partial pooled" allelic proportion per cell. (Note: Each layer in a loom file has a dimensionality of |num_genes|x|num_cells|.) The proportion of paternal monoallelic, bi-allelic, and maternal monoallelic cells are in ra['pi_p'], ra['pi_b'], and ra['pi_m']. Cell level probability of allelic expression states are available at the layers of layers['pi_pk'], layers['pi_bk'], and layers['pi_mk'].

Announcements