Instructions for reproducing the database:

  1. Create a directory called Adeptus2
  2. Create a subdirectory called rscripts and put all the R scripts in it. See our GitHub repository for the code.
  3. Create a subdirectory called data and unzip the data into it
  4. Run the desired MAIN_ script
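
The steps above can be sketched in R roughly as follows. This is only an illustration: the archive name adeptus2_data.zip is a hypothetical placeholder for the downloaded zip file, and the sketch assumes the MAIN_ scripts have already been copied from the repository into rscripts.

```r
# Steps 1-3: create the directory layout described above.
base_dir    <- "Adeptus2"
scripts_dir <- file.path(base_dir, "rscripts")
data_dir    <- file.path(base_dir, "data")

for (d in c(scripts_dir, data_dir)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Hypothetical archive name; replace it with the path of the downloaded zip file.
data_zip <- "adeptus2_data.zip"
if (file.exists(data_zip)) {
  unzip(data_zip, exdir = data_dir)
}

# Step 4: list the MAIN_ scripts found in rscripts (empty until the code is copied in).
main_scripts <- list.files(scripts_dir, pattern = "^MAIN_", full.names = TRUE)
print(main_scripts)

# To run one of them (assumption: the script is meant to be sourced directly):
# source(main_scripts[1])
```

The working directory each MAIN_ script expects is not stated here, so check the top of the script before running it.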

Files (file name, description, size):

All data, including raw expression data
  Size: 4,468MB

Supervised_data_atleast_10000_genes_max_percet_na_10.RData
  The final preprocessed supervised database, can be easily loaded into an R session. Contains:
  • x – the expression profiles: a matrix of 37782 samples vs. 10081 genes
  • y – the matrix of all labels: a binary matrix of 37337 samples vs. 216 labels (all samples appear as rows in x)
  • sample2study, sample2terms (y as list of labels), sample2tissue_slim

sample2study.RData
  A mapping of samples to their studies.
  Size: 158KB

classification_performance_scores_summary.RData
  Results of the leave-study-out SVM cross-validation. Contains a list called classifier_scores_matrices. It has several score matrices. In each one the rows are labels and the first column is the score of the SVM-based classifier.
  The sublist classifier2selected_diseases contains (as the first entry) the list of 68 well-classified labels (13 tissue controls and 55 disease-related).

classification_results_40k_samples.RData
  An R object with the leave-study-out SVM cross-validation results (the actual predictions).
  Size: 242MB

gene_dataset_p_matrices.RData
  A list with an entry for each label. Each entry has a p-value matrix of genes vs labels. Each p-value is a result of comparing the label’s samples to the other samples in that study.
  Size: 228MB

gene_pb_roc_scores.RData
  PB-ROC scores: a matrix of genes vs labels.
  Size: 17,980KB

gene_pn_roc_scores.RData
  PN-ROC scores: a matrix of genes vs labels.
  Size: 14,501KB

gene_edge_based_son2rocs.RData
  Results of the edge-based analysis: a list with an entry for each disease label. Each entry is a matrix of genes vs the parents of the label (typically a single parent).
  Size: 14,971KB

selected_genes_adeptus2.RData
  A list with the selected genes for each label.
  Size: 20,004KB

gpl_mappings_to_entrez.RData
  A list that maps the probes in each microarray platform into Entrez gene ids.
  Size: 9,383KB
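
As an illustration of loading the supervised database in an R session, the sketch below reads an .RData file into a separate environment and checks that every sample in y also appears as a row of x. The helper names load_rdata and check_labels_in_expression are hypothetical, and the check assumes that x and y carry sample identifiers as row names and that the file sits directly under Adeptus2/data; neither is confirmed here.

```r
# Load an .RData file into its own environment so that its objects
# do not overwrite anything in the current workspace.
load_rdata <- function(path) {
  env <- new.env()
  load(path, envir = env)
  as.list(env)
}

# TRUE if every row (sample) of y is also a row of x.
# Assumption: both matrices use sample identifiers as row names.
check_labels_in_expression <- function(x, y) {
  stopifnot(!is.null(rownames(x)), !is.null(rownames(y)))
  all(rownames(y) %in% rownames(x))
}

# Assumed location of the file after unzipping the data.
db_file <- file.path("Adeptus2", "data",
                     "Supervised_data_atleast_10000_genes_max_percet_na_10.RData")

if (file.exists(db_file)) {
  db <- load_rdata(db_file)
  print(dim(db$x))  # samples vs. genes
  print(dim(db$y))  # samples vs. labels
  print(check_labels_in_expression(db$x, db$y))
}
```

The same load_rdata helper can be pointed at any of the other .RData files to inspect the names of the objects they contain, for example with names(load_rdata(path)).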

See our GitHub repository for the code.

The R code requires the following packages:

  • R dependencies: e1071, PRROC, ROCR, pROC, hash, LiblineaR, gplots, VennDiagram
  • R dependencies (Bioconductor): CMA, preprocessCore, DO.db, limma, IRanges, RBGL, BiocGenerics, gRbase, gRain, S4Vectors, GEOquery
  • Optional R dependencies (for additional analyses that are implemented): randomForest, ranger, bnlearn
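
One possible way to check which of the required packages are missing is sketched below. The helper missing_pkgs is a hypothetical name, and the installation commands are left commented out because they download software; BiocManager is assumed as the installer for the Bioconductor group.

```r
# Required packages, grouped as in the list above.
cran_pkgs <- c("e1071", "PRROC", "ROCR", "pROC", "hash",
               "LiblineaR", "gplots", "VennDiagram")
bioc_pkgs <- c("CMA", "preprocessCore", "DO.db", "limma", "IRanges", "RBGL",
               "BiocGenerics", "gRbase", "gRain", "S4Vectors", "GEOquery")

# Return the subset of `pkgs` that is not installed in the current library paths.
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing_cran <- missing_pkgs(cran_pkgs)
missing_bioc <- missing_pkgs(bioc_pkgs)
print(missing_cran)
print(missing_bioc)

# Uncomment to install whatever is missing:
# if (length(missing_cran) > 0) install.packages(missing_cran)
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# if (length(missing_bioc) > 0) BiocManager::install(missing_bioc)
```

The optional packages can be checked the same way by passing their names to missing_pkgs.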