Tutorial: Simple DeepProg model

The principle of DeepProg can be summarized as follows (a condensed code outline of these steps is given after the list):

  • Loading of multiple samples x OMIC matrices

  • Preprocessing, normalisation, and sub-sampling of the input matrices

  • Matrix transformation using autoencoder

  • Detection of survival features

  • Survival feature agglomeration and clustering

  • Creation of supervised models to predict the output of new samples
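
In practice, these steps map onto a handful of SimDeep calls, all of which are detailed in the rest of this tutorial. The condensed outline below assumes that LoadData and SimDeep fall back to the dummy example defaults from ./simdeep/config.py when called without arguments:

# Condensed outline of the DeepProg workflow (each call is explained below)
from simdeep.simdeep_analysis import SimDeep
from simdeep.extract_data import LoadData
from simdeep.config import TEST_TSV, SURVIVAL_TSV_TEST

dataset = LoadData()                   # load the omic matrices + survival file
simDeep = SimDeep(dataset=dataset)     # preprocessing, autoencoders, survival features, clustering
simDeep.load_training_dataset()
simDeep.fit()                          # fit the model on the training samples

simDeep.load_new_test_dataset(TEST_TSV, fname_key='dummy',
                              path_survival_file=SURVIVAL_TSV_TEST)
simDeep.predict_labels_on_test_dataset()   # supervised prediction for new samples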

Input parameters

All the default parameters are defined in the config file ./simdeep/config.py but can also be passed dynamically (a snippet to inspect them is given after this list). Three types of parameters must be defined:

  • The training dataset (omics + survival input files)

    • In addition, the parameters of the test set, i.e. the omic dataset and the survival file

  • The parameters of the autoencoder (the default parameters work, but they can be fine-tuned)

  • The parameters of the classification procedures (the defaults are usually sufficient)
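
Before overriding anything, it can be useful to inspect what the defaults actually are. The snippet below is a small convenience sketch (not part of the DeepProg API): it simply lists the uppercase module-level attributes of simdeep.config, which hold the default values:

# List the default settings defined in ./simdeep/config.py
# (by convention they are uppercase module-level attributes)
import simdeep.config as config

for name in sorted(vars(config)):
    if name.isupper():
        print('{} = {!r}'.format(name, getattr(config, name)))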

Input matrices

As examples, we included two datasets:

  • A dummy example dataset in the examples/data/ folder:

examples
├── data
│   ├── meth_dummy.tsv
│   ├── mir_dummy.tsv
│   ├── rna_dummy.tsv
│   ├── rna_test_dummy.tsv
│   ├── survival_dummy.tsv
│   └── survival_test_dummy.tsv

  • And a real dataset in the data folder. This dataset derives from the TCGA HCC (hepatocellular carcinoma) dataset and needs to be decompressed before processing (see the snippet after the file tree):

data
├── meth.tsv.gz
├── mir.tsv.gz
├── rna.tsv.gz
└── survival.tsv
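
Any standard tool (e.g. gunzip) can decompress these files; for completeness, a Python-only sketch using the standard library is shown below, assuming the archives live in ./data/:

# Decompress ./data/*.tsv.gz in place using only the standard library
import glob
import gzip
import shutil

for gz_path in glob.glob("./data/*.tsv.gz"):
    with gzip.open(gz_path, "rb") as f_in, open(gz_path[:-len(".gz")], "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)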

An input matrix file should follow this format:

head mir_dummy.tsv

Samples        dummy_mir_0     dummy_mir_1     dummy_mir_2     dummy_mir_3 ...
sample_test_0  0.469656032287  0.347987447237  0.706633335508  0.440068758445 ...
sample_test_1  0.0453108219657 0.0234642968791 0.593393816691  0.981872970341 ...
sample_test_2  0.908784043793  0.854397550009  0.575879144667  0.553333958713 ...
...

Also, if multiple matrices are used as input, they must keep the same sample order (a sanity check for this is sketched after the excerpt below). For example:

head rna_dummy.tsv

Samples        dummy_gene_0     dummy_gene_1     dummy_gene_2     dummy_gene_3 ...
sample_test_0  0.69656032287  0.47987447237  0.06633335508  0.40068758445 ...
sample_test_1  0.53108219657 0.234642968791 0.93393816691  0.81872970341 ...
sample_test_2  0.8784043793  0.54397550009  0.75879144667  0.53333958713 ...
...
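
Because every training matrix has to list the samples in the same order, a quick consistency check can save debugging time. The sketch below assumes pandas is installed and that the dummy matrices sit in ./examples/data/:

# Check that all training matrices share the same sample order
import pandas as pd

path_data = "./examples/data/"
matrices = ["rna_dummy.tsv", "mir_dummy.tsv", "meth_dummy.tsv"]

orders = [pd.read_csv(path_data + name, sep="\t", index_col=0).index.tolist()
          for name in matrices]

assert all(order == orders[0] for order in orders), \
    "The input matrices do not share the same sample order"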

The arguments training_tsv and path_data from the extract_data module are used to define the input matrices.

# The keys/values of this dict represent the name of the omic and the corresponding input matrix
training_tsv = {
    'GE': 'rna_dummy.tsv',
    'MIR': 'mir_dummy.tsv',
    'METH': 'meth_dummy.tsv',
}
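
For instance, the dictionary above can be combined with path_data when instantiating LoadData, pointing it at the folder that contains these files (the same pattern is used with the config defaults later in this tutorial):

# Load the three dummy omic matrices and the matching survival file
from simdeep.extract_data import LoadData

dataset = LoadData(training_tsv=training_tsv,
                   survival_tsv='survival_dummy.tsv',
                   path_data='./examples/data/')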

A survival file must have this format:

head survival_dummy.tsv

barcode        days recurrence
sample_test_0  134  1
sample_test_1  291  0
sample_test_2  125  1
sample_test_3  43   0
...

In addition, the fields corresponding to the patient IDs, the survival time, and the event should be defined using the survival_flag argument:

# Default value
survival_flag = {'patient_id': 'barcode',
                 'survival': 'days',
                 'event': 'recurrence'}
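
If a survival file uses different column names, adapt the dictionary accordingly. The sketch below uses hypothetical column names and assumes the dictionary can be passed to LoadData through a survival_flag argument (check help(LoadData) to confirm the argument name in your version):

# Hypothetical survival file with custom column names
custom_survival_flag = {
    'patient_id': 'Samples',           # column holding the sample/patient IDs
    'survival': 'days_to_follow_up',   # column holding the survival time
    'event': 'vital_status',           # column holding the event indicator (0/1)
}

# Assumed usage (argument name to be confirmed with help(LoadData)):
# dataset = LoadData(training_tsv=training_tsv,
#                    survival_tsv='my_survival.tsv',
#                    survival_flag=custom_survival_flag,
#                    path_data='./examples/data/')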

Creating a simple DeepProg model with one autoencoder for each omic

First, we will build a model using the example dataset from ./examples/data/ (these example files are set as the defaults in the config.py file). We will use them to show how to construct a single DeepProg model inferring an autoencoder for each omic.


# SimDeep class can be used to build one model with one autoencoder for each omic
from simdeep.simdeep_analysis import SimDeep
from simdeep.extract_data import LoadData

help(SimDeep) # to see all the functions
help(LoadData) # to see all the functions related to loading datasets

# Defining training datasets
from simdeep.config import TRAINING_TSV
from simdeep.config import SURVIVAL_TSV
# Location of the input matrices and survival file
from simdeep.config import PATH_DATA

dataset = LoadData(training_tsv=TRAINING_TSV,
                   survival_tsv=SURVIVAL_TSV,
                   path_data=PATH_DATA)

# Defining the result path in which the output folder will be created
PATH_RESULTS = "./TEST_DUMMY/"

# instantiate the model with the dummy example training dataset defined in the config file
simDeep = SimDeep(
        dataset=dataset,
        path_results=PATH_RESULTS,
        path_to_save_model=PATH_RESULTS, # This result path can be used to save the autoencoder
        )

simDeep.load_training_dataset() # load the training dataset
simDeep.fit() # fit the model

At that point, the model is fitted and some output files are available in the output folder:

TEST_DUMMY
├── test_dummy_dataset_KM_plot_training_dataset.png
└── test_dummy_dataset_training_set_labels.tsv

The tsv file contains the label and the label probability for each sample:

sample_test_0   1       7.22678272919e-12
sample_test_1   1       4.48594196888e-09
sample_test_4   1       1.53363205571e-06
sample_test_5   1       6.72170409655e-08
sample_test_6   0       0.9996581662
sample_test_7   1       3.38139255666e-08
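
The labels file can also be read back programmatically, for example with pandas. The sketch below assumes the file is tab-separated and has no header row, as in the excerpt above:

# Read the per-sample labels and label probabilities produced during training
import pandas as pd

labels = pd.read_csv("./TEST_DUMMY/test_dummy_dataset_training_set_labels.tsv",
                     sep="\t", header=None,
                     names=["sample", "label", "label_proba"])
print(labels.head())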

We also obtain the visualisation of a Kaplan-Meier curve:

(Figure: Kaplan-Meier plot of the training dataset, file test_dummy_dataset_KM_plot_training_dataset.png)

Now we are ready to use a test dataset and to infer the class labels of the test samples. The test dataset does not need to contain the same omic matrices as the training dataset, nor even the same features for a given omic. However, it needs to have at least some features in common.

# Defining test datasets
from simdeep.config import TEST_TSV
from simdeep.config import SURVIVAL_TSV_TEST

simDeep.load_new_test_dataset(
    TEST_TSV,
    fname_key='dummy',
    path_survival_file=SURVIVAL_TSV_TEST,  # [OPTIONAL] test survival file, useful to compute the accuracy on the test dataset
)

# The test set is a dummy RNA expression dataset (generated randomly)
print(simDeep.dataset.test_tsv) # Defined in the config file
# The data type of the test set is also defined to match an existing type
print(simDeep.dataset.data_type) # Defined in the config file
simDeep.predict_labels_on_test_dataset() # Perform the classification analysis and label the test dataset

print(simDeep.test_labels)
print(simDeep.test_labels_proba)

The assigned class and class probabilities for the test samples are now available in the output folder:

TEST_DUMMY
├── test_dummy_dataset_dummy_KM_plot_test.png
├── test_dummy_dataset_dummy_test_labels.tsv
├── test_dummy_dataset_KM_plot_training_dataset.png
└── test_dummy_dataset_training_set_labels.tsv

head test_dummy_dataset_dummy_test_labels.tsv

A KM plot is also constructed using the test labels:

(Figure: Kaplan-Meier plot of the test dataset, file test_dummy_dataset_dummy_KM_plot_test.png)

Finally, it is possible to save the Keras autoencoder models:

simDeep.save_encoders('dummy_encoder.h5')
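
To reuse the saved autoencoders in a later session, a corresponding loading step is needed. The sketch below assumes a load_encoders method mirroring save_encoders; this is an assumption, so confirm the exact method name and signature with help(SimDeep) in your installed version:

# Later session: rebuild the model object and reload the saved autoencoders
# NOTE: load_encoders is assumed here; confirm with help(SimDeep)
simDeep_restored = SimDeep(dataset=dataset,
                           path_results=PATH_RESULTS,
                           path_to_save_model=PATH_RESULTS)
simDeep_restored.load_training_dataset()
simDeep_restored.load_encoders('dummy_encoder.h5')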