An atlas of chromatin accessibility in mouse at single cell resolution

Study Design

What is sci-ATAC-seq

To generate these data we used sci-ATAC-seq, a technique we developed (Cusanovich et al., Science 2015). sci-ATAC-seq relies on a paradigm called combinatorial indexing, in which nucleic acids from each cell are labeled with a unique combination of barcodes through multiple rounds of split-pool barcoding.

Mouse sci-ATAC-seq Atlas

Using sci-ATAC-seq, we report single-cell measurements of chromatin accessibility for 17 samples spanning 13 different tissues in 8-week-old mice.

Downloads

ATAC Matrices

Similar to scRNA-seq, sci-ATAC-seq data are typically analyzed as sparse peak (row) by cell (column) matrices. The first set of matrices we provide contains binarized counts. The second set has rare peaks filtered out and is then normalized with TF-IDF so that it can be used as input to PCA or t-SNE, for example. Note that only cells in our final QC-filtered set are included. See the tutorials for examples of how to read these formats into R or Python, along with documentation on other downstream analyses.
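
As a rough illustration of the three steps described above (binarize, filter rare peaks, TF-IDF normalize), here is a minimal pure-Python sketch on a toy peak-by-cell table. The function names, the `min_cells` threshold, and the particular TF-IDF variant (term frequency as 1 / accessible peaks per cell, inverse document frequency as log(1 + cells / cells with the peak)) are assumptions made for illustration; the exact filtering and weighting we used are described in the manuscript and may differ.

```python
import math
from collections import defaultdict


def binarize(counts):
    """Map {(peak, cell): count} to the set of (peak, cell) pairs with count > 0."""
    return {key for key, n in counts.items() if n > 0}


def filter_rare_peaks(entries, min_cells):
    """Keep only peaks accessible in at least `min_cells` cells (hypothetical threshold)."""
    cells_per_peak = defaultdict(set)
    for peak, cell in entries:
        cells_per_peak[peak].add(cell)
    keep = {p for p, cells in cells_per_peak.items() if len(cells) >= min_cells}
    return {(p, c) for p, c in entries if p in keep}


def tfidf(entries):
    """Weight each accessible (peak, cell) entry by TF * IDF (assumed variant)."""
    peaks_per_cell = defaultdict(int)
    cells_per_peak = defaultdict(int)
    for peak, cell in entries:
        peaks_per_cell[cell] += 1
        cells_per_peak[peak] += 1
    n_cells = len(peaks_per_cell)
    return {
        (peak, cell): (1.0 / peaks_per_cell[cell])
        * math.log(1.0 + n_cells / cells_per_peak[peak])
        for peak, cell in entries
    }


# Toy data: raw counts per (peak, cell); names are made up.
toy_counts = {
    ("peak1", "cellA"): 2,
    ("peak1", "cellB"): 1,
    ("peak2", "cellA"): 1,
    ("peak2", "cellB"): 3,
    ("peak3", "cellB"): 0,
    ("peak4", "cellA"): 1,
}

binary = binarize(toy_counts)
filtered = filter_rare_peaks(binary, min_cells=2)
weights = tfidf(filtered)
```

The sparse dictionary keyed by (peak, cell) mirrors the peak (row) by cell (column) orientation; for matrices of realistic size a dedicated sparse-matrix library would be the more practical choice.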

Activity Score Matrices

We also report gene activity scores, where a single number is calculated for each gene from a weighted combination of its proximal and distal sites (see manuscript for details; both quantitative and binarized calculations are provided below). Unlike the ATAC matrices above, these are in gene (row) by cell (column) format.

Metadata

For all cells and peaks used in our QC filtered set, we report tables of metadata including information about tissue source, cell type assignment, TSNE coordinates, cluster assignments, etc. for cells, and intersections with genes (TSS only) for peaks.
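
To show how the cell metadata table might be used, the sketch below parses a toy stand-in for cell_metadata.txt and groups cells by their tissue.replicate ID. The column names cell, tissue.replicate, cell_label, and id are the ones referred to on this page; the tab delimiter, the example values, and the helper names are assumptions for illustration only, and the real file contains additional columns.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for cell_metadata.txt; all values below are made up.
toy_metadata = (
    "cell\ttissue.replicate\tcell_label\tid\n"
    "AAACGT\tTissueA.rep1\tExample label 1\t1\n"
    "CCGTTA\tTissueA.rep1\tExample label 2\t2\n"
    "GGTTAC\tTissueB.rep1\tExample label 1\t1\n"
)


def load_cells(handle):
    """Read a tab-delimited metadata table into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def cells_by_replicate(rows):
    """Group cell IDs by the value in the tissue.replicate column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["tissue.replicate"]].append(row["cell"])
    return dict(groups)


rows = load_cells(io.StringIO(toy_metadata))
groups = cells_by_replicate(rows)
```

With the real file, `io.StringIO(toy_metadata)` would be replaced by an open file handle.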

Differential Accessibility

We report results from differential accessibility (DA) tests performed between each cluster of cells (final iterative clusters) and a set of 2K sampled cells. See manuscript for details. These are reported using both the binarized ATAC matrix (contains all peaks) and the binarized gene activity score matrix (contains a single entry per gene).

Specificity Scores

We report specificity scores to rank elements by their restricted accessibility within each of our clusters (see manuscript for details). Only sites that had significant specificity scores at our empirically determined false discovery rate threshold are reported. These are provided in both Excel format and text format to allow browsing of results.

Comparisons to scRNA-seq Datasets

In our manuscript we also examine the similarity between our sci-ATAC-seq dataset and several scRNA-seq datasets. We do this using a cluster-level correlation-based approach and a cell-by-cell KNN-based approach. Both use activity scores calculated by Cicero as input (see above). Here we provide the cluster-level correlations for each dataset/tissue and the cell-by-cell KNN results for each dataset we compared against.
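
The general idea behind a cluster-level correlation comparison can be sketched as follows: summarize each sci-ATAC-seq cluster by its per-gene activity scores, summarize each scRNA-seq cluster by its per-gene expression, and correlate the two profiles over shared genes. The code below is a generic Pearson-correlation sketch on made-up numbers; it is not the procedure from the manuscript, and the function names and input layout are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant profile")
    return cov / math.sqrt(vx * vy)


def cluster_correlations(atac, rna):
    """Correlate every ATAC cluster with every RNA cluster over shared genes.

    Both inputs map cluster name -> {gene: mean score}.
    """
    result = {}
    for a_name, a_profile in atac.items():
        for r_name, r_profile in rna.items():
            genes = sorted(set(a_profile) & set(r_profile))
            result[(a_name, r_name)] = pearson(
                [a_profile[g] for g in genes],
                [r_profile[g] for g in genes],
            )
    return result


def best_matches(correlations):
    """For each ATAC cluster, return the RNA cluster with the highest correlation."""
    best = {}
    for (a_name, r_name), r in correlations.items():
        if a_name not in best or r > best[a_name][1]:
            best[a_name] = (r_name, r)
    return {a_name: r_name for a_name, (r_name, _) in best.items()}


# Toy cluster-level profiles; cluster and gene names are made up.
atac_clusters = {
    "A1": {"g1": 1.0, "g2": 2.0, "g3": 3.0},
    "A2": {"g1": 3.0, "g2": 2.0, "g3": 1.0},
}
rna_clusters = {
    "R1": {"g1": 10.0, "g2": 20.0, "g3": 30.0},
    "R2": {"g1": 5.0, "g2": 1.0, "g3": 0.0},
}

corr = cluster_correlations(atac_clusters, rna_clusters)
best = best_matches(corr)
```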

Cicero Maps

We have also run Cicero (Pliner et al.), which connects regulatory elements to their target genes using co-accessibility, as measured by sci-ATAC-seq, as a measure of connectedness. We have generated Cicero maps for each cluster in the dataset. Maps and peak sets are combined into single files, with columns indicating the cluster and subset_cluster that each entry corresponds to.

Basset Results

We have also trained convolutional neural network (CNN) models with Basset (GitHub; Kelley et al.) to find motifs that distinguish our clusters from one another. Here we provide results relevant to the interpretation of these motifs, as well as the actual models generated by Basset so that they may be used in downstream analyses. To interpret the filters in the first layer of the CNN model, we use common tools for interpreting PWMs, such as TomTom and MEME from the MEME suite, in conjunction with the Hocomoco PWM database.

GWAS h2 Enrichments

As described in the manuscript, we report enrichments in heritability (h2) in DA peaks with positive betas for each cluster across many human traits as measured by GWAS. These enrichments are calculated using partitioned LD score regression (LDSC; Finucane et al.; GitHub). We also report the trained LDSC models and the baseline model, which could be used to calculate enrichments for any other trait given the appropriate summary statistics.

BAM Files

While we provide some raw data on GEO (GSE111586), we also provide BAM files of the reads aligned to mm9 here, in case users would like to use them for their own pipelines or methods development. Below we provide one file per tissue replicate, named by the tissue.replicate ID as specified in the tissue.replicate column of the cell_metadata.txt file in the Metadata section above. This means there are two files for tissues where we performed a replicate and a single file for all other tissues (in addition to BAM index files). Each read is assigned to a cell ID (the sequence specified in the cell column of the same metadata file).

The cell ID is encoded in the read name as cellid:otherinfo, so the sequence before the colon is the corrected cell barcode sequence for the read.
Reads are already deduplicated.
There will be cell IDs that do not appear in our final set of cells, as the BAM data are a superset of what ultimately passes our QC steps.
Files may not download correctly in Chrome (and other web browsers), but they can easily be downloaded with wget or curl after right-clicking the link and copying its address.
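
Because the cell ID is the part of the read name before the colon, recovering it is a simple string split. The sketch below works on plain read-name strings (in practice these would come from whatever BAM reader you use) and optionally restricts the tally to cells present in the QC-filtered metadata, since the BAM files also contain cells that did not pass QC. The function names and the example read names are hypothetical.

```python
from collections import Counter


def cell_id_from_read_name(read_name):
    """Return the cell barcode, i.e. the text before the first colon."""
    cell_id, sep, _rest = read_name.partition(":")
    if not sep or not cell_id:
        raise ValueError(f"read name lacks a 'cellid:' prefix: {read_name!r}")
    return cell_id


def reads_per_cell(read_names, qc_cells=None):
    """Count reads per cell, optionally keeping only cells in `qc_cells`."""
    counts = Counter()
    for name in read_names:
        cell = cell_id_from_read_name(name)
        if qc_cells is None or cell in qc_cells:
            counts[cell] += 1
    return counts


# Toy read names following the cellid:otherinfo pattern; values are made up.
toy_read_names = [
    "AAACGT:read1",
    "AAACGT:read2",
    "GGTTAC:read1",
    "TTTTTT:read1",  # a cell that is absent from the QC-filtered set
]
qc_cells = {"AAACGT", "GGTTAC"}

all_counts = reads_per_cell(toy_read_names)
qc_counts = reads_per_cell(toy_read_names, qc_cells)
```

Splitting on the first colon only means that any further colons inside the otherinfo part do not affect the extracted barcode.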

UCSC Trackhub and BigWigs

We also provide bigWig files and a UCSC trackhub to visualize aggregated pseudo-bulk ATAC-seq profiles for the cells from each cluster. Note that for the smallest clusters, the data will appear fairly sparse even in aggregate at any single locus. In general, we prefer methods for assessing differential accessibility or specificity computationally over visual inspection, although viewing tracks is often useful to get a sense for the data at a given locus. You may access our UCSC trackhub here. By default the hub will contain a track at the top called _All_Peak_Calls, which annotates regions that we called within LSI clusters for each tissue (see Methods). These peaks were used as our features for all downstream analysis. The trackhub will also contain a track for each cluster in the dataset named according to the convention cell_label-id, where cell_label and id are defined in the same way as they are in our cell_metadata.txt file above. Spaces and periods in cell labels have been removed or replaced as necessary.

BigWig Files

In case you would like access to the files used to make the trackhub above, we provide them for download below. Each file is named in the same manner as described above with a .bw or .bb extension.