Data import and visualization in scCODA

When performing compositional data analysis with scCODA, data can come from various sources. Also, it is often helpful to gain some insight on the data via exploratory plotting before running the scCODA model. In this tutorial notebook, we will

  • convert data from various sources for use with scCODA via sccoda.util.cell_composition_data

  • visualize the data with sccoda.util.data_visualization

[1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import anndata as ad
import warnings

from sccoda.util import cell_composition_data as dat
from sccoda.util import data_visualization as viz

import sccoda.datasets as scd

warnings.filterwarnings("ignore")

Data import

scCODA uses the anndata format to handle cell count datasets and covariates in one object. Here, adata.X contains the cell counts as a numpy array with the dimensions \(sample \times celltype\), while adata.obs stores the covariate information for each sample as a pandas DataFrame. adata.var can contain additional information on the cell types.

Besides manually creating the data object, sccoda.util.cell_composition_data gives methods to import data directly from a pandas DataFrame or from gene-based anndata objects like in scanpy.

Import from pandas

To import data from a pandas DataFrame (with each row representing a sample), it is sufficient to specify the names of the metadata (covariate columns). As an example, we will use a dataset produced by *Haber et al. [2017]*, describing pathogen infection in the small intestinal epithelium of the mouse.

[2]:
# Read data into pandas from csv

cell_counts = scd.haber()

print(cell_counts)
            Mouse  Endocrine  Enterocyte  Enterocyte.Progenitor  Goblet  Stem  \
0       Control_1         36          59                    136      36   239
1       Control_2          5          46                     23      20    50
2       Control_3         45          98                    188     124   250
3       Control_4         26         221                    198      36   131
4  H.poly.Day10_1         42          71                    203     147   271
5  H.poly.Day10_2         40          57                    383     170   321
6   H.poly.Day3_1         52          75                    347      66   323
7   H.poly.Day3_2         65         126                    115      33    65
8          Salm_1         37         332                    113      59    90
9          Salm_2         32         373                    116      67   117

    TA  TA.Early  Tuft
0  125       191    18
1   11        40     5
2  155       365    33
3  130       196     4
4  109       180   146
5  244       256    71
6  263       313    51
7   39       129    59
8   47       132    10
9   65       168    12
[3]:
data_mouse = dat.from_pandas(cell_counts, covariate_columns=["Mouse"])

# Extract condition from mouse name and add it as an extra column to the covariates
data_mouse.obs["Condition"] = data_mouse.obs["Mouse"].str.replace(r"_[0-9]", "", regex=True)
print(data_mouse.X)
print(data_mouse.obs)
[[ 36.  59. 136.  36. 239. 125. 191.  18.]
 [  5.  46.  23.  20.  50.  11.  40.   5.]
 [ 45.  98. 188. 124. 250. 155. 365.  33.]
 [ 26. 221. 198.  36. 131. 130. 196.   4.]
 [ 42.  71. 203. 147. 271. 109. 180. 146.]
 [ 40.  57. 383. 170. 321. 244. 256.  71.]
 [ 52.  75. 347.  66. 323. 263. 313.  51.]
 [ 65. 126. 115.  33.  65.  39. 129.  59.]
 [ 37. 332. 113.  59.  90.  47. 132.  10.]
 [ 32. 373. 116.  67. 117.  65. 168.  12.]]
            Mouse     Condition
0       Control_1       Control
1       Control_2       Control
2       Control_3       Control
3       Control_4       Control
4  H.poly.Day10_1  H.poly.Day10
5  H.poly.Day10_2  H.poly.Day10
6   H.poly.Day3_1   H.poly.Day3
7   H.poly.Day3_2   H.poly.Day3
8          Salm_1          Salm
9          Salm_2          Salm

Import from scanpy

Cell type assignments are usually gained by clustering methods such as Louvain clustering. The scanpy platform provides a way of performing a variety of such assignment methods. Even though scanpy also uses the anndata format, it does so on a level of \(cell \times gene\). To allow for quick transformation into scCODA’s data format, some converter functions are available.

We will explore them on the Preprocessing and clustering 3k PBMCs tutorial dataset of scanpy. Since scCODA requires a distinction into samples, we will copy the dataset three times and add a new column sample to adata.obs. Also, we need some covariate metadata, which we also simulate for this tutorial.

[4]:
# read data
adata = ad.read_h5ad("../data/10x_pbmc68k_reduced.h5ad")
print(adata)

AnnData object with n_obs × n_vars = 700 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain'
    var: 'n_counts', 'means', 'dispersions', 'dispersions_norm', 'highly_variable'
    uns: 'bulk_labels_colors', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
[5]:
# make three copys
adata_1 = adata.copy()
adata_1.obs["sample"] = 1

adata_2 = adata.copy()
adata_2.obs["sample"] = 2

adata_3 = adata.copy()
adata_3.obs["sample"] = 3

# join them together again
adata_all = ad.concat([adata_1, adata_2, adata_3])
print(adata_all)
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
AnnData object with n_obs × n_vars = 2100 × 765
    obs: 'bulk_labels', 'n_genes', 'percent_mito', 'n_counts', 'S_score', 'G2M_score', 'phase', 'louvain', 'sample'
    obsm: 'X_pca', 'X_umap'
[6]:
# make covariate DataFrame

cov_df = pd.DataFrame({"Cond": ["A", "B", "A"]}, index=[1,2,3])
print(cov_df)
  Cond
1    A
2    B
3    A

If we have one anndata object that contains the gene expression data for all samples, we can use dat.from_scanpy. We just need to specify the columns in adata.obs that contain the sample information and the clustering results, as well as the covariate data frame.

[7]:
data_scanpy_1 = dat.from_scanpy(
    adata_all,
    cell_type_identifier="louvain",
    sample_identifier="sample",
    covariate_df=cov_df
)
print(data_scanpy_1)
AnnData object with n_obs × n_vars = 3 × 11
    obs: 'Cond'
    var: 'n_cells'

If there is one anndata object per sample, we can use dat.from_scanpy_list in the same way. Obviously, we do not need to specify a sample column now.

[8]:
data_scanpy_2 = dat.from_scanpy_list(
    [adata_1, adata_2, adata_3],
    cell_type_identifier="louvain",
    covariate_df=cov_df
)
print(data_scanpy_2)
AnnData object with n_obs × n_vars = 3 × 11
    obs: 'Cond'
    var: 'n_cells'

Compositional data visualization

Analyzing compositional data is not straightforward. scCODA provides some ways of visualizing the properties of a compositional dataset before analysis. We will showcase these functions on the data on pathogen infection of mice from *Haber et al. [2017]*.

Stacked barplot

The most common representation of cell count data in genomic research are stacked barplots. In scCODA, these can be produced easily via viz.stacked_barplot. The function can either plot one stacked bar for each level of a covariate, or one stacked bar for each sample.

[9]:
# Stacked barplot for each sample
viz.stacked_barplot(data_mouse, feature_name="samples")
plt.show()

# Stacked barplot for the levels of "Condition"
viz.stacked_barplot(data_mouse, feature_name="Condition")
plt.show()
_images/Data_import_and_visualization_14_0.png
_images/Data_import_and_visualization_14_1.png

Grouped boxplots

While stached barplots are a nice and compact way of representing compositional data, they omit crucial information about the data variance. Also, comparing the abundance of cell types that are not on the top or bottom of the bars can be hard.

To combat this, scCODA also provides a visualization via boxplots crouped by cell type and condition via viz.boxplots. The function provides a lot of customization options, like log-scale plotting, representation as a faceted plot, or the addition of dots for each sample.

[10]:
# Grouped boxplots. No facets, relative abundance, no dots.
viz.boxplots(
    data_mouse,
    feature_name="Condition",
    plot_facets=False,
    y_scale="relative",
    add_dots=False,
)
plt.show()

# Grouped boxplots. Facets, log scale, added dots and custom color palette.
viz.boxplots(
    data_mouse,
    feature_name="Condition",
    plot_facets=True,
    y_scale="log",
    add_dots=True,
    cmap="Reds",
)
plt.show()
_images/Data_import_and_visualization_16_0.png
_images/Data_import_and_visualization_16_1.png

Finding a reference cell type

The scCODA model requires a cell type to be set as the reference category. However, choosing this cell type is often difficult. A good first choice is a referenece cell type that closely preserves the changes in relative abundance during the compositional analysis.

For this, it is important that the reference cell type is not rare, to avoid large relative changes being caused by small absolute changes. Also, the relative abundance of the reference should vary as little as possible across all samples.

The visualization viz.rel_abundance_dispersion_plot shows the presence (share of non-zero samples) over all samples for each cell type versus its dispersion in relative abundance. Cell types that have a higher presence than a certain threshold (default 0.9) are suitable candidates for the reference and thus colored.

[11]:
viz.rel_abundance_dispersion_plot(
    data=data_mouse,
    abundant_threshold=0.9
)
plt.show()
_images/Data_import_and_visualization_18_0.png