scCODA - Compositional analysis of single-cell data¶
This notebook serves as a tutorial for using the scCODA package (Büttner, Ostner et al., 2021) to analyze changes in cell composition data.
The package is intended to be used with cell composition from single-cell RNA-seq experiments, however there are no technical restrictions that prevent the use of data from other sources.
The data we use in the following example comes from Haber et al., 2017. It contains samples from the small intestinal epithelium of mice with different conditions.
This tutorial is designed to be executed on a standard computer (any operating system) in a Python environment with scCODA, Jupyter notebook and all their dependencies installed. Running the tutorial takes about 1.5 minutes on a 2020 Apple MacBook Pro (16GB RAM).
[12]:
# Setup
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import pickle as pkl
import matplotlib.pyplot as plt
from sccoda.util import comp_ana as mod
from sccoda.util import cell_composition_data as dat
from sccoda.util import data_visualization as viz
import sccoda.datasets as scd
Data preparation¶
[2]:
# Load data
cell_counts = scd.haber()
print(cell_counts)
Mouse Endocrine Enterocyte Enterocyte.Progenitor Goblet Stem \
0 Control_1 36 59 136 36 239
1 Control_2 5 46 23 20 50
2 Control_3 45 98 188 124 250
3 Control_4 26 221 198 36 131
4 H.poly.Day10_1 42 71 203 147 271
5 H.poly.Day10_2 40 57 383 170 321
6 H.poly.Day3_1 52 75 347 66 323
7 H.poly.Day3_2 65 126 115 33 65
8 Salm_1 37 332 113 59 90
9 Salm_2 32 373 116 67 117
TA TA.Early Tuft
0 125 191 18
1 11 40 5
2 155 365 33
3 130 196 4
4 109 180 146
5 244 256 71
6 263 313 51
7 39 129 59
8 47 132 10
9 65 168 12
Looking at the data, we see that we have 4 control samples, and 3 conditions with 2 samples each. To use the models in scCODA, we first have to convert the data into an anndata object. This can be done easily with the sccoda.util.cell_composition_data
module. The resulting object separates our data components: Cell counts are stored in data.X
, covariates in data.obs
.
[3]:
# Convert data to anndata object
data_all = dat.from_pandas(cell_counts, covariate_columns=["Mouse"])
# Extract condition from mouse name and add it as an extra column to the covariates
data_all.obs["Condition"] = data_all.obs["Mouse"].str.replace(r"_[0-9]", "", regex=True)
print(data_all)
AnnData object with n_obs × n_vars = 10 × 8
obs: 'Mouse', 'Condition'
For our first example, we want to look at how the Salmonella infection influences the cell composition. Therefore, we subset our data.
[4]:
# Select control and salmonella data
data_salm = data_all[data_all.obs["Condition"].isin(["Control", "Salm"])]
print(data_salm.obs)
Mouse Condition
0 Control_1 Control
1 Control_2 Control
2 Control_3 Control
3 Control_4 Control
8 Salm_1 Salm
9 Salm_2 Salm
Plotting the data, we can see that there is a large increase of Enterocytes in the infected sampes, while most other cell types slightly decrease. Since scRNA-seq experiments are limited in the number of cells per sample, the count data is compositional, which leads to negative correlations between the cell types. Thus, the slight decreases in many cell types might be fully caused by the increase in Enterocytes.
[14]:
viz.boxplots(data_salm, feature_name="Condition")
plt.show()
Note that the use of anndata in scCODA is different from the use in scRNA-seq pipelines, e.g. scanpy. To convert scanpy objects to a scCODA dataset, have a look at `dat.from_scanpy.
Model setup and inference¶
We can now create the model and run inference on it. Creating a sccoda.util.comp_ana.CompositionalAnalysis
class object sets up the compositional model and prepares everxthing for parameter inference. It needs these informations:
The data object from above.
The
formula
parameter. It specifies how the covariates are used in the model. It can process R-style formulas via the patsy package, e.g.formula="Cov1 + Cov2 + Cov3"
. Here, we simply use the “Condition” covariate of our datasetThe
reference_cell_type
parameter is used to specify a cell type that is believed to be unchanged by the covariates informula
. This is necessary, because compositional analysis must always be performed relative to a reference (See Büttner, Ostner et al., 2021 for a more thorough explanation). If no knowledge about such a cell type exists prior to the analysis, taking a cell type that has a nearly constant relative abundance over all samples is often a good choice. It is also possible to let scCODA find a suited reference cell type by usingreference_cell_type="automatic"
. Here, we take Goblet cells as the reference.
[6]:
model_salm = mod.CompositionalAnalysis(data_salm, formula="Condition", reference_cell_type="Goblet")
HMC sampling is then initiated by calling model.sample_hmc()
, which produces a sccoda.util.result_classes.CAResult
object.
[7]:
# Run MCMC
sim_results = model_salm.sample_hmc()
WARNING:tensorflow:From /Users/johannes.ostner/opt/anaconda3/envs/scCODA_3/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
100%|██████████| 20000/20000 [01:01<00:00, 327.72it/s]
MCMC sampling finished. (79.348 sec)
Acceptance rate: 51.9%
Result interpretation¶
Calling summary()
on the results object, we can see the most relevant information for further analysis:
[8]:
sim_results.summary()
Compositional Analysis summary:
Data: 6 samples, 8 cell types
Reference index: 3
Formula: Condition
Intercepts:
Final Parameter Expected Sample
Cell Type
Endocrine 1.102 34.068199
Enterocyte 2.328 116.089840
Enterocyte.Progenitor 2.523 141.085258
Goblet 1.753 65.324318
Stem 2.705 169.247878
TA 2.113 93.631267
TA.Early 2.861 197.821355
Tuft 0.449 17.731884
Effects:
Final Parameter Expected Sample \
Covariate Cell Type
Condition[T.Salm] Endocrine 0.0000 24.315528
Enterocyte 1.3571 321.891569
Enterocyte.Progenitor 0.0000 100.696915
Goblet 0.0000 46.623988
Stem 0.0000 120.797449
TA 0.0000 66.827533
TA.Early 0.0000 141.191224
Tuft 0.0000 12.655794
log2-fold change
Covariate Cell Type
Condition[T.Salm] Endocrine -0.486548
Enterocyte 1.471333
Enterocyte.Progenitor -0.486548
Goblet -0.486548
Stem -0.486548
TA -0.486548
TA.Early -0.486548
Tuft -0.486548
Model properties
First, the summary shows an overview over the model properties: * Number of samples/cell types * The reference cell type. * The formula used
The model has two types of parameters that are relevant for analysis - intercepts and effects. These can be interpreted like in a standard regression model: Intercepts show how the cell types are distributed without any active covariates, effects show ho the covariates influence the cell types.
Intercepts
The first column of the intercept summary shows the parameters determined by the MCMC inference.
The “Expected sample” column gives some context to the numerical values. If we had a new sample (with no active covariates) with a total number of cells equal to the mean sampling depth of the dataset, then this distribution over the cell types would be most likely.
Effects
For the effect summary, the first column again shows the inferred parameters for all combinations of covariates and cell types. Most important is the distinctions between zero and non-zero entries A value of zero means that no statistically credible effect was detected. For a value other than zero, a credible change was detected. A positive sign indicates an increase, a negative sign a decrease in abundance.
Since the numerical values of the “Final parameter” columns are not straightforward to interpret, the “Expected sample” and “log2-fold change” columns give us an idea on the magnitude of the change. The expected sample is calculated for each covariate separately (covariate value = 1, all other covariates = 0), with the same method as for the intercepts. The log-fold change is then calculated between this expected sample and the expected sample with no active covariates from the intercept section. Since the data is compositional, cell types for which no credible change was detected, are will change in abundance as well, as soon as a credible effect is detected on another cell type due to the sum-to-one constraint. If there are no credible effects for a covariate, its expected sample will be identical to the intercept sample, therefore the log2-fold change is 0.
Interpretation
In the salmonella case, we see only a credible increase of Enterocytes, while all other cell types are unaffected by the disease. The log-fold change of Enterocytes between control and infected samples with the same total cell count lies at about 1.54.
We can also easily filter out all credible effects:
[9]:
print(sim_results.credible_effects())
Covariate Cell Type
Condition[T.Salm] Endocrine False
Enterocyte True
Enterocyte.Progenitor False
Goblet False
Stem False
TA False
TA.Early False
Tuft False
Name: Final Parameter, dtype: bool
Adjusting the False discovery rate¶
scCODA selects credible effects based on their inclusion probability. The cutoff between credible and non-credible effects depends on the desired false discovery rate (FDR). A smaller FDR value will produce more conservative results, but might miss some effects, while a larger FDR value selects more effects at the cost of a larger number of false discoveries.
The desired FDR level can be easily set after inference via sim_results.set_fdr()
. Per default, the value is 0.05, but we recommend to increase it if no effects are found at a more conservative level.
In our example, setting a desired FDR of 0.4 reveals effects on Endocrine and Enterocyte cells.
[16]:
sim_results.set_fdr(est_fdr=0.4)
sim_results.summary()
Compositional Analysis summary (extended):
Data: 6 samples, 8 cell types
Reference index: 3
Formula: Condition
Spike-and-slab threshold: 0.434
MCMC Sampling: Sampled 20000 chain states (5000 burnin samples) in 79.348 sec. Acceptance rate: 51.9%
Intercepts:
Final Parameter HDI 3% HDI 97% SD \
Cell Type
Endocrine 1.102 0.363 1.740 0.369
Enterocyte 2.328 1.694 2.871 0.314
Enterocyte.Progenitor 2.523 1.904 3.088 0.320
Goblet 1.753 1.130 2.346 0.330
Stem 2.705 2.109 3.285 0.318
TA 2.113 1.459 2.689 0.332
TA.Early 2.861 2.225 3.378 0.307
Tuft 0.449 -0.248 1.207 0.394
Expected Sample
Cell Type
Endocrine 34.068199
Enterocyte 116.089840
Enterocyte.Progenitor 141.085258
Goblet 65.324318
Stem 169.247878
TA 93.631267
TA.Early 197.821355
Tuft 17.731884
Effects:
Final Parameter HDI 3% HDI 97% \
Covariate Cell Type
Condition[T.Salm] Endocrine 0.327533 -0.506 1.087
Enterocyte 1.357100 0.886 1.872
Enterocyte.Progenitor 0.000000 -0.395 0.612
Goblet 0.000000 0.000 0.000
Stem -0.240268 -0.827 0.168
TA 0.000000 -0.873 0.252
TA.Early 0.000000 -0.464 0.486
Tuft 0.000000 -1.003 0.961
SD Inclusion probability \
Covariate Cell Type
Condition[T.Salm] Endocrine 0.338 0.457133
Enterocyte 0.276 0.998400
Enterocyte.Progenitor 0.163 0.338200
Goblet 0.000 0.000000
Stem 0.219 0.434800
TA 0.220 0.364000
TA.Early 0.128 0.284733
Tuft 0.319 0.392533
Expected Sample log2-fold change
Covariate Cell Type
Condition[T.Salm] Endocrine 34.413767 0.014560
Enterocyte 328.331183 1.499910
Enterocyte.Progenitor 102.711411 -0.457971
Goblet 47.556726 -0.457971
Stem 96.897648 -0.804604
TA 68.164454 -0.457971
TA.Early 144.015830 -0.457971
Tuft 12.908980 -0.457971
Saving results¶
The compositional analysis results can be saved as a pickle object via results.save(<path_to_file>)
.
[11]:
# saving
path = "test"
sim_results.save(path)
# loading
with open(path, "rb") as f:
sim_results_2 = pkl.load(f)
sim_results_2.summary()
Compositional Analysis summary:
Data: 6 samples, 8 cell types
Reference index: 3
Formula: Condition
Intercepts:
Final Parameter Expected Sample
Cell Type
Endocrine 1.102 34.068199
Enterocyte 2.328 116.089840
Enterocyte.Progenitor 2.523 141.085258
Goblet 1.753 65.324318
Stem 2.705 169.247878
TA 2.113 93.631267
TA.Early 2.861 197.821355
Tuft 0.449 17.731884
Effects:
Final Parameter Expected Sample \
Covariate Cell Type
Condition[T.Salm] Endocrine 0.327533 33.362298
Enterocyte 1.357100 318.299445
Enterocyte.Progenitor 0.000000 99.573196
Goblet 0.000000 46.103691
Stem 0.000000 119.449419
TA 0.000000 66.081776
TA.Early 0.000000 139.615611
Tuft 0.000000 12.514563
log2-fold change
Covariate Cell Type
Condition[T.Salm] Endocrine -0.030207
Enterocyte 1.455143
Enterocyte.Progenitor -0.502738
Goblet -0.502738
Stem -0.502738
TA -0.502738
TA.Early -0.502738
Tuft -0.502738
[11]: