Camila Duitama González

Postdoc researcher | Institut Pasteur | ML for Genomics

Master thesis

I did my master thesis under the supervision of Dr. Holger Fröhlich and Dr. Ashar Ahmad at the University of Bonn in Germany.

I presented the results of my master thesis at the following venues:

Background:

The background of this master thesis is the Survival Based Bayesian Clustering (SBC) model developed by Ahmad and Fröhlich (2017), a model that infers clinically relevant cancer subtypes by jointly clustering molecular data and survival data in a semi-supervised manner. A graphical representation of the model is shown below:

[Figure: graphical representation of the SBC model]

SBC’s main features are:

Abstract:

The SBC infers clinically relevant cancer subtypes by jointly clustering molecular data and survival data. Originally, the model was tested on a Breast Cancer (Van De Vijver et al., 2002) and a Glioblastoma Multiforme (GBM) (Verhaak et al., 2010) data set, without any further external validation. The objective of this master thesis was to perform an external validation of the SBC, a goal that entailed two major tasks: a rigorous feature engineering and selection process that improved the known predictive ability of the model, and the characterisation of the obtained clusters and their corresponding signature using other types of clinical and omics data, such as Copy Number Variation and miRNA.

The TCGA-GBM data set was retrieved using the Bioconductor package RTCGAToolbox. After data preprocessing, appropriate normalisation and correction for sample selection bias, a combined patient cohort of 421 samples was obtained (160 patients for the training set and 261 patients for the validation set). Various feature engineering and selection techniques were explored, and every SBC model was fitted using Gibbs sampling. The best feature engineering and selection approaches were the Block HSIC-Lasso (Climente-González et al., 2019) model for mRNA-based selection and a Penalized Accelerated Failure Time (PAFT) model on a collection of oncogenic gene sets for pathway-based selection. Both approaches improved on the initial Predictive C-Index (Block HSIC-Lasso: +1.5%, PAFT: +27.6%) and Recovery C-Index (Block HSIC-Lasso: +8.7%, PAFT: +5.0%).
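For readers unfamiliar with the C-Index, the sketch below shows how a generic concordance index (Harrell's C) can be computed for right-censored survival data. It is only an illustration: the exact definitions of the Predictive and Recovery C-Index used in the thesis are not given on this page, and the function name `concordance_index`, its arguments, and the assumption that the model outputs predicted survival times are all hypothetical choices made for this example.

```python
from itertools import combinations


def concordance_index(times, events, predicted_times):
    """Generic Harrell's C-index for right-censored data (illustrative sketch).

    times           : observed follow-up times
    events          : 1 if the event (e.g. death) was observed, 0 if censored
    predicted_times : model-predicted survival times (assumed; longer = better prognosis)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # ties in observed time are skipped for simplicity
        # 'first' is the patient with the shorter observed time
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if predicted_times[first] < predicted_times[second]:
            concordant += 1.0  # predicted ordering agrees with observed ordering
        elif predicted_times[first] == predicted_times[second]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Toy example with four patients; the second one is censored.
observed = [5, 10, 15, 20]
event_flags = [1, 0, 1, 1]
predicted = [6, 9, 20, 14]
print(concordance_index(observed, event_flags, predicted))  # 0.75
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of patients by survival, so the relative gains reported above are movements along that scale.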

The work done in this master thesis is a step forward in the validation of the SBC model on an external data set such as the TCGA-GBM patient cohort.

Written document:

If you would like to read my master thesis, click here.