Datasets Overview

Most models in the Passports have undergone extensive characterisation, including sequencing, copy number, methylation, gene expression and drug screening. These datasets enable researchers to identify and understand the underlying molecular causes of cancer. Systematic comparisons have demonstrated that cancer cell lines and organoids effectively represent clinical tumour samples.

These large-scale genomic and functional datasets are available through the website as processed downloads and via the API. In addition, links to the raw data enable users to analyse each dataset independently.

Table of datasets and key information.

| Models | Dataset | Type | Details | Data/Link | Publication |
|---|---|---|---|---|---|
| Cell Lines | Whole Exome Sequencing | BAM | Illumina HiSeq 2000 | EGAS00001000978 | 1 |
| Cell Lines | Copy Number Variation | | Affymetrix SNP6 | EGAS00001000978 | 1 |
| Cell Lines | Expression | CEL | Affymetrix Human Genome U219 Array | E-MTAB-3610 | 1 |
| Cell Lines | RNASeq | BAM | Illumina HiSeq 2000 | EGAS00001000828 | 2 |
| Cell Lines | Methylation | TAR (of IDAT) | Illumina Human Methylation 450 BeadChip | GSE68379 | 1 |
| Organoids | Targeted Sequencing | CRAM | Illumina HiSeq 4000 | EGAS00001002221 | |
| Organoids | Whole Genome Sequencing | CRAM | Illumina HiSeq 4000 | EGAS00001002222 | |
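
The raw-data accessions in the table follow the naming schemes of three public repositories. The sketch below is one possible way to recognise which repository an accession belongs to before looking it up; the function name `repository_for` and the returned labels are illustrative and not part of the Passports.

```python
import re

# Accession patterns for the three repositories referenced in the table.
# The helper name and the returned labels are illustrative only.
_PATTERNS = [
    (re.compile(r"EGAS\d{11}"), "European Genome-phenome Archive (EGA) study"),
    (re.compile(r"E-MTAB-\d+"), "ArrayExpress experiment"),
    (re.compile(r"GSE\d+"), "Gene Expression Omnibus (GEO) series"),
]


def repository_for(accession):
    """Return a repository label for an accession, or None if unrecognised."""
    for pattern, label in _PATTERNS:
        if pattern.fullmatch(accession):
            return label
    return None


if __name__ == "__main__":
    for acc in ["EGAS00001000978", "E-MTAB-3610", "GSE68379"]:
        print(acc, "->", repository_for(acc))
```

Because EGA study accessions have a fixed width (`EGAS` followed by 11 digits), a check like this also catches an accession that has picked up a stray trailing character.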

Publications:


Gene Annotation & Mapping

All genes have an internal ID, allowing mapping to current and previous HGNC gene symbols, Ensembl Gene IDs (v91) and other external gene identifiers. All genes with an HGNC-approved symbol as of April 2018 are currently included in the Passports, including those without a protein product. Any dataset values that were mapped to genes without an official gene symbol have been discarded from processing, but continue to be available in raw data downloads.
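
As a minimal sketch of the mapping described above, the following shows how dataset values keyed by gene symbol could be resolved to internal IDs through current and previous symbols, with unmapped values set aside. The `Gene` class, the function names and the internal ID format (`G0001`) are assumptions made for illustration; they do not reflect the Passports' actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Gene:
    internal_id: str                      # hypothetical internal ID format
    symbol: str                           # current HGNC-approved symbol
    previous_symbols: List[str] = field(default_factory=list)
    ensembl_id: Optional[str] = None


def build_symbol_index(genes: List[Gene]) -> Dict[str, str]:
    """Map current and previous symbols to internal gene IDs."""
    index: Dict[str, str] = {}
    # Previous symbols first, so that a current symbol always wins on conflict.
    for gene in genes:
        for old in gene.previous_symbols:
            index.setdefault(old, gene.internal_id)
    for gene in genes:
        index[gene.symbol] = gene.internal_id
    return index


def map_values(values: Dict[str, float],
               index: Dict[str, str]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split values into those mapped to an internal ID and those discarded."""
    mapped: Dict[str, float] = {}
    discarded: Dict[str, float] = {}
    for symbol, value in values.items():
        gene_id = index.get(symbol)
        if gene_id is None:
            discarded[symbol] = value     # no official symbol: not processed
        else:
            mapped[gene_id] = value
    return mapped, discarded


if __name__ == "__main__":
    genes = [
        Gene("G0001", "TP53", [], "ENSG00000141510"),
        Gene("G0002", "KMT2A", ["MLL"], "ENSG00000118058"),
    ]
    index = build_symbol_index(genes)
    print(map_values({"TP53": 1.0, "MLL": 2.0, "NOT_A_GENE": 3.0}, index))
```

Here a value reported under the previous symbol `MLL` resolves to the same internal ID as `KMT2A`, while `NOT_A_GENE` ends up in the discarded set.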


Cancer Driver List

Cancer drivers are annotated using the list of cancer driver variants from the Cell paper cited above (Table S2C), excluding any fusion genes.

Variants

Only mutations listed in this table can receive cancer-related annotations. Mutations not found in this table are considered technical artefacts or non-oncogenic (passenger) mutations. From this table, mutations in genes that pass the 'Recurrence Filter' and are present in one of the annotated driver gene lists are marked as cancer mutations.
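
The rule above can be sketched as a simple lookup. This is an illustration under stated assumptions, not the Passports' implementation: the field names (`recurrence_filter`, `driver_lists`), the returned labels and the example rows are all hypothetical, and the real layout of the table may differ.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Tuple


class DriverVariant(NamedTuple):
    gene: str
    mutation: str                  # e.g. a protein-level change
    recurrence_filter: bool        # assumed boolean form of the 'Recurrence Filter'
    driver_lists: FrozenSet[str]   # driver gene lists that contain the gene


def build_table(rows: Iterable[DriverVariant]) -> Dict[Tuple[str, str], DriverVariant]:
    """Index the variant table by (gene, mutation)."""
    return {(row.gene, row.mutation): row for row in rows}


def annotate_mutation(gene: str, mutation: str,
                      table: Dict[Tuple[str, str], DriverVariant]) -> str:
    """Return an illustrative label for one observed mutation."""
    entry = table.get((gene, mutation))
    if entry is None:
        # Not in the table: treated as an artefact or passenger mutation.
        return "not_listed"
    if entry.recurrence_filter and entry.driver_lists:
        return "cancer_mutation"
    # Listed, but fails the filter or is absent from every driver gene list.
    return "listed_only"


if __name__ == "__main__":
    table = build_table([
        DriverVariant("BRAF", "p.V600E", True, frozenset({"list_A"})),
        DriverVariant("GENE_X", "p.A1B", False, frozenset()),
    ])
    print(annotate_mutation("BRAF", "p.V600E", table))
    print(annotate_mutation("BRAF", "p.K601E", table))
```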


Oncogene / Tumour Suppressor Gene annotation

From the list of cancer genes (those that pass the recurrence filter), all genes with 'Truncating' mutations are annotated as tumour suppressor genes, while genes without such mutations are designated oncogenes. This results in the Driver Gene List, which can be found on the Datasets & Downloads page.
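
A minimal sketch of this classification rule follows. It assumes the input is a list of `(gene, effect)` pairs for cancer genes that have already passed the recurrence filter; the function name and the non-truncating effect label (`Missense`) are illustrative.

```python
from typing import Dict, Iterable, Tuple


def classify_driver_genes(mutations: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Label each gene by whether it carries any 'Truncating' mutation."""
    all_genes = set()
    truncated = set()
    for gene, effect in mutations:
        all_genes.add(gene)
        if effect == "Truncating":
            truncated.add(gene)
    return {
        gene: "tumour suppressor gene" if gene in truncated else "oncogene"
        for gene in sorted(all_genes)
    }


if __name__ == "__main__":
    example = [
        ("TP53", "Truncating"),
        ("TP53", "Missense"),
        ("KRAS", "Missense"),
    ]
    print(classify_driver_genes(example))
```

Note that under this rule a single truncating mutation is enough to label a gene a tumour suppressor, regardless of how many other mutation types it carries.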


Gene Fusions

Fusions are annotated from RNASeq data as detailed in Picco et al., 2019. Specifically, the fusion events, validation information and patient annotation are obtained from Supplementary Table 2 of that paper. The COSMIC fusion list is obtained from the COSMIC Fusions page and matched by gene symbol to the Cell Model Passports.
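
Matching by gene symbol could look roughly like the sketch below. It assumes that a fusion counts as a COSMIC match only when both partner symbols agree in the same 5'/3' order; the source does not say whether orientation is considered, so treat that as an assumption. The field names `gene_5prime`, `gene_3prime` and `in_cosmic` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def flag_cosmic_fusions(fusions: Iterable[Dict[str, str]],
                        cosmic_pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Return a copy of each fusion with an added 'in_cosmic' flag."""
    # Normalise case so that symbol matching is not sensitive to capitalisation.
    cosmic = {(five.upper(), three.upper()) for five, three in cosmic_pairs}
    flagged = []
    for fusion in fusions:
        key = (fusion["gene_5prime"].upper(), fusion["gene_3prime"].upper())
        flagged.append({**fusion, "in_cosmic": key in cosmic})
    return flagged


if __name__ == "__main__":
    observed = [
        {"gene_5prime": "BCR", "gene_3prime": "ABL1"},
        {"gene_5prime": "ABL1", "gene_3prime": "BCR"},
    ]
    for row in flag_cosmic_fusions(observed, [("BCR", "ABL1"), ("EML4", "ALK")]):
        print(row)
```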


Model Authentication for Datasets

For published datasets, we recommend that users refer to the original publication for details of the model authentication within that dataset. During dataset integration, names and identifiers have been cross-referenced to ensure that the data is attributed to the correct model.

Newly generated organoid sequencing data available through the Passports has been authenticated against primary tumour samples obtained from clinical sites, using a panel of 95 SNPs assayed on the Fluidigm 96.96 Dynamic Array IFC.
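
SNP-based authentication of this kind typically comes down to comparing genotype calls between a model and its matched tumour. The sketch below computes a simple concordance fraction over SNPs called in both samples; the function name, the genotype encoding (two-letter strings, `None` for a failed call) and the SNP identifiers are assumptions, and no pass threshold is implied because the source does not state one.

```python
from typing import Dict, Optional


def snp_concordance(model_calls: Dict[str, Optional[str]],
                    tumour_calls: Dict[str, Optional[str]]) -> float:
    """Fraction of SNPs, called in both samples, with identical genotypes."""
    shared = [
        snp for snp, call in model_calls.items()
        if call is not None and tumour_calls.get(snp) is not None
    ]
    if not shared:
        raise ValueError("no SNPs were called in both samples")
    # Sort the alleles so that 'AG' and 'GA' count as the same genotype.
    matches = sum(
        sorted(model_calls[snp]) == sorted(tumour_calls[snp]) for snp in shared
    )
    return matches / len(shared)


if __name__ == "__main__":
    organoid = {"snp1": "AG", "snp2": "CC", "snp3": "TT", "snp4": None}
    tumour = {"snp1": "GA", "snp2": "CC", "snp3": "CT", "snp4": "AA"}
    print(round(snp_concordance(organoid, tumour), 3))
```

SNPs with a failed call in either sample are excluded from the denominator rather than counted as mismatches, which is one of several reasonable design choices.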