Similar to scRNA-seq data, sci-ATAC-seq data are typically analyzed as sparse peak (row) by cell (column) matrices. The first set we provide contains binarized counts. The second set has rare peaks filtered out and is then normalized with TF-IDF so that it can be used as input to PCA or t-SNE, for example. Note that only cells in our final QC-filtered set are included. See the tutorials for examples of how to read these formats into R or Python, along with documentation on many other downstream analyses.
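To make the relationship between the two matrix sets concrete, here is a minimal sketch of binarizing counts, dropping rare peaks, and applying TF-IDF. It uses small dense lists rather than the sparse formats provided for download, and the TF-IDF variant and the `min_cells` cutoff are assumptions chosen for illustration; the exact normalization and filtering threshold used for the provided files are described in the manuscript and tutorials, not here.

```python
import math

def binarize(counts):
    """Turn a peak-by-cell count matrix (list of rows) into 0/1 values."""
    return [[1 if value > 0 else 0 for value in row] for row in counts]

def filter_rare_peaks(matrix, min_cells):
    """Keep only peaks (rows) accessible in at least min_cells cells.

    min_cells is a hypothetical cutoff for this example.
    """
    return [row for row in matrix if sum(row) >= min_cells]

def tfidf(matrix):
    """Apply one common TF-IDF variant (an assumption, not the authors' code).

    TF  = entry / number of accessible peaks in that cell (column sum)
    IDF = log(1 + n_cells / number of cells where the peak is accessible)
    """
    n_cells = len(matrix[0])
    col_sums = [sum(row[j] for row in matrix) for j in range(n_cells)]
    result = []
    for row in matrix:
        row_sum = sum(row)
        if row_sum == 0:
            # A peak seen in no cell carries no signal; leave it as zeros.
            result.append([0.0] * n_cells)
            continue
        idf = math.log(1 + n_cells / row_sum)
        result.append([
            (row[j] / col_sums[j]) * idf if col_sums[j] else 0.0
            for j in range(n_cells)
        ])
    return result

# Toy data: 3 peaks (rows) by 4 cells (columns).
counts = [
    [2, 0, 1, 0],
    [0, 0, 0, 3],
    [1, 1, 0, 1],
]
binary = binarize(counts)
common = filter_rare_peaks(binary, min_cells=2)
normalized = tfidf(common)
```

In this toy example the second peak is accessible in only one cell, so it is removed before normalization. For the real matrices a sparse representation would be needed, since a dense peak-by-cell table at this scale would not fit comfortably in memory.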
Activity Score Matrices
We also report gene activity scores, where a single number is calculated based on a weighted combination of proximal and distal sites for each gene (see manuscript for details; both quantitative and binarized calculations are provided below). Unlike the ATAC matrices above, these are in gene (row) by cell (column) format.
For all cells and peaks used in our QC-filtered set, we report tables of metadata. For cells, these include tissue source, cell type assignment, t-SNE coordinates, cluster assignments, etc. For peaks, they include intersections with genes (TSS only).
We report results from differential accessibility (DA) tests performed between each cluster of cells (final iterative clusters) and a set of 2K sampled cells. See manuscript for details. These are reported using both the binarized ATAC matrix (contains all peaks) and the binarized gene activity score matrix (contains a single entry per gene).
We report specificity scores to rank elements by their restricted accessibility within each of our clusters (see manuscript for details). Only sites that had significant specificity scores at our empirically determined false discovery rate threshold are reported. These are provided in both Excel format and text format to allow browsing of results.
Comparisons to scRNA-seq Datasets
In our manuscript we also examine the similarity between our sci-ATAC-seq dataset and several scRNA-seq datasets. We do this using a cluster-level correlation-based approach and a cell-by-cell KNN-based approach. Both use activity scores calculated by Cicero as input (see above). Here we provide the cluster-level correlations for each dataset/tissue and the cell-by-cell KNN results for each dataset we have compared against.
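As a rough illustration of what a cluster-level correlation comparison involves, the sketch below correlates per-cluster average activity scores with per-cluster average expression over a shared, identically ordered gene list and reports the best-matching RNA cluster for each ATAC cluster. This is a generic example and not the procedure from the manuscript: the use of Pearson correlation, the cluster names, the helper names, and the toy values are all assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def best_matches(atac_clusters, rna_clusters):
    """Map each ATAC cluster to its most correlated RNA cluster and the r value."""
    matches = {}
    for atac_name, atac_profile in atac_clusters.items():
        scored = [
            (pearson(atac_profile, rna_profile), rna_name)
            for rna_name, rna_profile in rna_clusters.items()
        ]
        r, rna_name = max(scored)
        matches[atac_name] = (rna_name, r)
    return matches

# Hypothetical cluster-averaged values for three genes, in the same gene order.
atac_clusters = {
    "cluster_A": [0.9, 0.1, 0.2],  # mean gene activity scores
    "cluster_B": [0.1, 0.8, 0.7],
}
rna_clusters = {
    "type_1": [10.0, 1.0, 2.0],    # mean expression values
    "type_2": [0.5, 6.0, 5.0],
}
matches = best_matches(atac_clusters, rna_clusters)
```

With these toy profiles, `cluster_A` pairs with `type_1` and `cluster_B` with `type_2`. The correlation files provided here already contain the values computed for the manuscript, so a sketch like this is only useful for understanding what they represent or for comparing against additional datasets.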
We have also run Cicero (Pliner et al.), which connects regulatory elements to their target genes using co-accessibility, as measured by sci-ATAC-seq, as a measure of connectedness. We have generated Cicero maps for each cluster in the dataset. Maps and peak sets are combined into single files, with columns indicating the cluster and subset_cluster that each entry corresponds to.
We have also trained convolutional neural network (CNN) models with Basset (Github; Kelley et al.) to find motifs that distinguish our clusters from one another. Here we provide results relevant to interpretation of these motifs as well as the actual models generated by Basset so they may be used in downstream analyses. To interpret the filters in the first layer of the CNN model, we utilize common tools for interpreting PWMs such as TomTom and MEME from the MEME suite in conjunction with the Hocomoco PWM database.
To aid in interpretation, we have found it helpful to calculate gene set enrichments using peaks that are DA and have positive betas (are open). We report these enrichments for a number of different gene sets.
GWAS h2 Enrichments
As described in the manuscript, we report enrichments in heritability (h2) in DA peaks with positive betas for each cluster across many human traits as measured by GWAS. These enrichments are calculated using a tool called partitioned LD score regression (LDSC; Finucane et al.; Github). We also report the trained LDSC models and baseline model which could be used to calculate enrichments for any other trait given the appropriate summary statistics.
While we provide some raw data on GEO (GSE111586), we also provide BAM files of the sequences aligned to mm9 here in case users would like to use them for their own pipelines or methods development. Below we provide one file per tissue, named by its tissue.replicate ID as specified in the tissue.replicate column of the cell_metadata.txt file in the metadata section above. This means there will be two files for tissues where we performed a replicate and a single file for all other tissues (in addition to BAM index files). Each read is assigned to a cell ID (the sequence specified in the cell column of the same metadata file). This is encoded in the read name as cellid:otherinfo, so the sequence before the colon is the corrected cell barcode sequence for the read. Reads are already deduplicated. There will be cell IDs that do not appear in our final set of cells, as the data are a superset of what ultimately passes our QC steps. Files may not download correctly in Chrome (and other web browsers), but they can easily be downloaded with wget or curl after right-clicking the link and copying its address.
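Because the cell barcode is the part of the read name before the colon, assigning reads to cells only requires splitting the name. The sketch below shows this for a list of read names and optionally restricts the tally to cells that passed QC (for example, the values in the cell column of cell_metadata.txt). The barcodes, read-name suffixes, and function names are hypothetical; in practice the read names would come from whichever BAM reader you prefer.

```python
from collections import Counter

def cell_id_from_read_name(read_name):
    """Return the text before the first colon, i.e. the cell barcode."""
    cell_id, _, _ = read_name.partition(":")
    return cell_id

def reads_per_cell(read_names, qc_cells=None):
    """Count reads per cell ID, optionally keeping only cells in qc_cells."""
    counts = Counter()
    for name in read_names:
        cell_id = cell_id_from_read_name(name)
        if qc_cells is None or cell_id in qc_cells:
            counts[cell_id] += 1
    return counts

# Hypothetical read names following the cellid:otherinfo pattern.
example_reads = [
    "AAACGT:read1",
    "AAACGT:read2",
    "TTGCAA:read3",
    "GGGTTT:read4",  # barcode that is not in the QC-passed set below
]
qc_cells = {"AAACGT", "TTGCAA"}
counts = reads_per_cell(example_reads, qc_cells)
```

Filtering against the metadata file matters because the BAM files contain reads from barcodes that did not pass QC; without the filter, those barcodes would appear as additional cells.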
UCSC Trackhub and Bigwigs
We also provide bigWig files and a UCSC trackhub to visualize aggregated pseudo-bulk ATAC-seq profiles for the cells from each cluster. Note that for the smallest clusters, the data will appear fairly sparse even in aggregate at any single locus. In general, we prefer methods for assessing differential accessibility or specificity computationally over visual inspection, although viewing tracks is often useful to get a sense for the data at a given locus. You may access our UCSC trackhub here. By default the hub will contain a track at the top called _All_Peak_Calls, which annotates regions that we called within LSI clusters for each tissue (see Methods). These peaks were used as our features for all downstream analysis. The trackhub will also contain a track for each cluster in the dataset named according to the convention cell_label-id, where cell_label and id are defined in the same way as they are in our cell_metadata.txt file above. Spaces and periods in cell labels have been removed or replaced as necessary.
In case you would like access to the files used to make the trackhub above, we provide them for download below. Each file is named in the same manner as described above with a .bw or .bb extension.