Data-driven Biomarker and Drug Discovery using Network-based Approach

An increasing body of large-scale genomic profiling data is being generated on many diseases, including cancers, and on a number of drugs and compounds. The exploration of such big data has led to data-driven biomedical research. Data-driven studies include exploring disease subtypes with distinct molecular patterns, uncovering novel diagnostic biomarkers or treatments, and discovering new indications of drugs along with novel mechanisms of drug action, among others. However, challenges remain in integrating, interpreting and converting big biomedical data into informative knowledge or therapeutic discoveries, and these challenges demand sophisticated computational approaches for big data analysis. Here, we review network-based approaches as a promising strategy for novel biomarker and drug discovery in cancer through the integration of big and diverse genomic data.

Genomic profiling is also conducted to unveil molecular signatures of drugs and mechanisms of drug action. In the Connectivity Map (CMAP), for example, the transcriptome signatures of 1302 small compound drugs are uncovered based on four cell lines by comparing the mRNA expression profiles before and after drug treatment [9]. The CMAP data have been used to reposition drugs for new indications by matching drug signatures against reversed disease gene signatures. The CMAP datasets are further expanded by including more drugs or compounds and genetic perturbations (e.g., RNAi) in the Library of Integrated Network-based Cellular Signatures (LINCS) program (http://www.lincsproject.org/). This leads to a database of gene signatures of over 5,000 compounds and 10,000 genetic perturbations on tens of cell lines. Similarly, the Cancer Cell Line Encyclopedia (CCLE) [10] and Genomics of Drug Sensitivity in Cancer (GDSC) [11] conduct extensive systematic investigations of the genomic basis of drug responses across nearly all tumor cell lines, with the databases publicly available.
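The signature-reversal idea behind drug repositioning can be illustrated with a minimal sketch. It assumes that the disease signature and each drug signature are available as dictionaries mapping a gene to its log fold change, and it scores each drug by the Spearman rank correlation with the disease signature, so that the most negative score suggests the strongest reversal. This correlation is a simplified stand-in, not the connectivity score used by CMAP itself, and all gene names, drug names and values below are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def reversal_scores(disease_signature, drug_signatures):
    """Rank drugs from most reversing (most negative score) to most mimicking."""
    scores = {}
    for drug, signature in drug_signatures.items():
        shared = sorted(set(disease_signature) & set(signature))
        if len(shared) < 2:
            continue  # too few shared genes to compute a correlation
        scores[drug] = spearman(
            [disease_signature[g] for g in shared],
            [signature[g] for g in shared],
        )
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical toy data: log fold changes for four genes.
disease = {"G1": 2.0, "G2": 1.0, "G3": -1.0, "G4": -2.0}
drugs = {
    "drug_A": {"G1": -1.8, "G2": -0.5, "G3": 0.7, "G4": 1.9},
    "drug_B": {"G1": 1.5, "G2": 0.9, "G3": -0.4, "G4": -1.1},
}
for name, score in reversal_scores(disease, drugs):
    print(name, round(score, 3))
```

In this toy example drug_A is ranked first because its signature is the rank-wise opposite of the disease signature, which is the pattern a repositioning screen would look for.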
More recently, multi-omics profiling has been conducted at the single-cell level on a number of cancer types. For example, single-cell genomic sequencing has been conducted on bladder cancer [12] and breast cancer [13]. Single-cell exome sequencing has been conducted on kidney cancer [14], myeloproliferative neoplasm [15] and muscle-invasive bladder cancer [16]. Single-cell mRNA sequencing has been conducted on melanoma [17] and prostate cancer [18]. Single-cell genomic sequencing uncovers the general landscape of mutations in the tumor genome, such as single nucleotide variations, insertions and deletions. Single-cell transcriptomic sequencing reveals information about transcriptomic alterations, including those related to mRNAs, microRNAs, retained introns, alternative splicing, long noncoding RNAs and fusion genes, with a much higher detection rate. Single-cell profiling allows the exploration of tumor heterogeneity, which is largely responsible for drug sensitivity, resistance and relapse in cancer therapy [19,20]. This further allows the development of long-lasting therapeutic regimens for cancer by targeting tumor heterogeneity [20].
The multi-omics data of cancer are diverse and complex, and drug responses are highly heterogeneous [21,22]. Thus it is challenging to

Big Data of Genomics Profiling Cancer Subtypes and Drug Response
Along with advances in new technologies, particularly next generation sequencing (NGS), large-scale genomic profiles of cancer samples are increasingly generated. With more and more large-scale genomic data becoming available, biomedical research is becoming increasingly data driven or data intensive. For example, the Cancer Genome Atlas (TCGA) program [1], supported by the National Institutes of Health (NIH), has profiled over 11,000 cancer patients across 30 tumor types and subtypes. Each patient sample is profiled for mRNA (using microarray and RNAseq), miRNA and protein expression, DNA aberrations (using DNAseq and SNP array), and epigenomics (DNA methylation and histone modification). Integrative analyses of these data have uncovered novel tumor subtypes and the underlying complex molecular mechanisms in various types of cancer, such as breast cancer [2], squamous cell lung cancer [3,4] and uterine cancer [5]. The International Cancer Genome Consortium (ICGC), on the other hand, provides more comprehensive genomic profiles of cancer patients on a global scale, with the goal of uncovering the genomic, transcriptomic and epigenomic changes of about 50 tumor types or subtypes [6]. All data

ISSN: 2378-3648. Li and Zhan. J Genet Genome Res 2015, 2:2

network of protein-protein interactions associated with metastatic prostate cancer. Specifically, 40 genes associated with metastatic prostate cancer were obtained through a DisGeNET [38,39] online search, and the genes were then used as input to ReactomeFIPlugIn (a Cytoscape plugin) [40,41] to generate one sub-network (24 nodes and 46 edges; upper panel) in which all nodes (drawn from the 40 prostate cancer genes) are directly linked according to the Reactome database, and one sub-network (44 nodes, of which 9 red nodes are linker genes and 35 green nodes are prostate cancer genes, and 147 edges; lower panel). Summaries of these signaling networks and of network-based target discovery were reported in [7,42-44]. In addition, the network constraint has been used to classify cancer subtypes based on gene mutation data [45]. Individual patients often have distinct gene mutations, and it is often difficult to use the mutation data to classify patients into subtypes because the data are sparse (a given gene carries no mutation in many patients). To solve this problem, the mutation information of individual genes can be diffused over the protein-protein interaction network, and clustering analysis is then conducted on the diffused mutation data to obtain meaningful cancer subtypes [45]. In addition to being applied to a single type of genomic data, the network constraint is also used to integrate multiple types of genomic data, e.g., mRNA, miRNA and DNA copy number data in [46]. In brief, signaling pathways are represented as a factor graph, and the status of each protein is determined by the gene copy number and mRNA expression level, as well as by the neighboring nodes on the factor graph [46]. It is expected that more data-driven computational methods will be developed based on the integration of genomic variation data with the network constraint, contributing to the robust identification of disease-related genes or signaling networks as biomarkers.
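The diffusion of mutation information over an interaction network can be sketched as a random walk with restart. The example below is only an illustration in the spirit of the approach described above, not the published implementation of [45]: the update rule, the restart weight alpha, the number of iterations and the four-gene toy network are all assumptions made for the sketch.

```python
def propagate(adjacency, mutated, alpha=0.7, n_iter=50):
    """Smooth a binary mutation profile over an undirected network.

    adjacency: dict mapping each gene to the set of its neighbours
               (assumed symmetric: if B is in adjacency[A], A is in adjacency[B]).
    mutated:   set of genes mutated in one patient.
    alpha:     fraction of the score that flows along edges at each step;
               the remaining (1 - alpha) restarts at the mutated genes.
    """
    genes = sorted(adjacency)
    initial = {g: (1.0 if g in mutated else 0.0) for g in genes}
    score = dict(initial)
    for _ in range(n_iter):
        updated = {}
        for g in genes:
            # Each neighbour shares its score equally among its own neighbours.
            inflow = sum(score[n] / len(adjacency[n]) for n in adjacency[g])
            updated[g] = alpha * inflow + (1 - alpha) * initial[g]
        score = updated
    return score


# Hypothetical toy network: a chain A - B - C - D, with only gene A mutated.
network = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
smoothed = propagate(network, mutated={"A"})
for gene in sorted(smoothed):
    print(gene, round(smoothed[gene], 3))
```

After propagation, genes close to the mutated gene receive non-zero scores that decay with network distance, so two patients with mutations in different but neighbouring genes obtain similar profiles that a clustering method can then group.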

Network-Based Approaches for Drug Discovery
Network medicine is regarded as the next paradigm of drug discovery [7,8,47]. Network-based drug discovery often employs graph theory or topological analysis of networks to visualize and identify drugs or drug combinations. For example, figure 2 shows the sub-networks of FDA approved drugs and their target proteins has

integrate and interpret such big data for inferring knowledge or mechanisms and for therapeutic drug and target discovery. Among the various computational approaches developed so far, the network-based approach appears to be highly promising for genomic big data analysis.

Network-Based Approaches for Biomarker Discovery
Diseases are often regulated by a set of genes that coordinate and interact with one another in a network to maintain or regulate biological processes within a cell. This provides a basis for the network-based approach to data-driven disease biomarker discovery. For example, the human disease network and the disease-gene network were investigated in [23,24]. The disease network showed the common and distinct gene functional modules of different diseases. In addition, the metabolic disease network was reconstructed in [25]. Human diseases are linked if the mutated enzymes associated with them catalyze adjacent metabolic reactions [25], and the network analysis showed that diseases with more connections to other diseases have higher prevalence and mortality rates.
Mathematically, the network constraint can be viewed as a conditional random field (CRF), and interpreted as follows: genes that are functionally connected should be selected, or not selected, as biomarkers together. There are several widely used interactome databases [26-28] of protein-protein interactions or signaling networks, such as STRING [29,30], IntAct [31], MINT [32,33], BioGRID [34], Biocarta [35], Reactome [28], HPRD [26], and KEGG [27]. Such network information can serve as a constraint for selecting genes as robust biomarkers. For example, the average gene expression difference of connected genes can be used to select sub-network biomarkers for classifying breast cancer metastasis from normal samples [36]. The discovered network biomarkers have increased reproducibility across data sets and increased classification accuracy and robustness. Moreover, gene biomarkers can be selected by the network constraint, many of which are potential disease-causal genes that regulate differentially expressed genes [37]. For example, figure 1 shows a sub-

shown that etiological drugs targeting disease genes have higher odds of being effective [48]. The topological analysis of the drug-target network indicates that many drugs act on the same set of targets (Figure 2). To expand this work, a diffusion process on the drug-target network can be applied to predict unknown targets of given drugs [49]. Nearby targets that are not directly connected to a drug in the drug-target network are potential off-targets of that drug. Moreover, the STITCH database provides a comprehensive collection of drug-target interactions based on both known drug-target interactions and evidence reported in the literature [50].
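The idea of finding nearby, not yet connected targets through diffusion on the drug-target network can be sketched with a two-step resource-spreading heuristic on the bipartite graph. This is a simplified stand-in chosen for illustration and is not claimed to be the method of [49]; the function name candidate_targets and all drug and target names are hypothetical.

```python
from collections import defaultdict


def candidate_targets(drug_targets, drug):
    """Rank targets not yet linked to `drug` by a two-step diffusion score.

    drug_targets: dict mapping each drug to the set of its known targets.
    Returns a list of (target, score) pairs, highest score first.
    """
    # Invert the mapping to obtain the drugs that act on each target.
    target_drugs = defaultdict(set)
    for d, targets in drug_targets.items():
        for t in targets:
            target_drugs[t].add(d)

    # Step 1: each known target of the query drug holds one unit of resource
    # and splits it equally among all drugs that act on it.
    drug_score = defaultdict(float)
    for t in drug_targets[drug]:
        for d in target_drugs[t]:
            drug_score[d] += 1.0 / len(target_drugs[t])

    # Step 2: each drug splits what it received equally among its own targets.
    target_score = defaultdict(float)
    for d, s in drug_score.items():
        for t in drug_targets[d]:
            target_score[t] += s / len(drug_targets[d])

    # Keep only targets that are not already linked to the query drug.
    known = drug_targets[drug]
    candidates = [(t, s) for t, s in target_score.items() if t not in known]
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy drug-target network.
network = {
    "drug_A": {"T1", "T2"},
    "drug_B": {"T1", "T2", "T3"},
    "drug_C": {"T2", "T4"},
    "drug_D": {"T5"},
}
for target, score in candidate_targets(network, "drug_A"):
    print(target, round(score, 3))
```

In the toy network, T3 is ranked above T4 for drug_A because drug_B, which shares two targets with drug_A, also binds T3, whereas T5 receives no score at all because drug_D shares no target with drug_A.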
In addition to the drug-target network, the CMAP genomic profiling data of drugs have been used to reconstruct the drug-drug interaction network for drug discovery [51]. For example, similarity scores of drugs are first estimated based on the genomic data of multiple cell lines before and after drug treatment, and the drug-drug network is then constructed by linking drug pairs whose similarity scores are higher than a given threshold [51]. The drug-drug network is then partitioned into sub drug-drug networks (or modules), in which drugs are believed to share similar mechanisms of action. Based on the known targets and clinical indications of some drugs, the targets and mechanisms of other drugs in the same sub-network can be predicted. The sub drug-drug networks can also be used to discover synergistic drug combinations [50,52,53]. For example, drugs that come from different sub-networks with distinct modes of action and act on different parts of the disease signaling network are identified and considered to have a higher degree of synergy.
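The construction of a drug-drug network by thresholding similarity scores, followed by partitioning into modules, can be sketched as follows. The sketch assumes Pearson correlation of expression-change vectors as the similarity score and uses connected components as the simplest possible partition; both are illustrative choices rather than the procedure of [51], and the threshold, drug names and values are hypothetical.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def drug_modules(signatures, threshold=0.8):
    """Link drugs whose similarity exceeds the threshold, then return modules.

    signatures: dict mapping each drug to a list of expression changes,
                measured over the same ordered set of genes.
    Returns a list of modules, each a sorted list of drug names.
    """
    drugs = sorted(signatures)
    neighbours = {d: set() for d in drugs}
    for a, b in combinations(drugs, 2):
        if pearson(signatures[a], signatures[b]) > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Partition the network into connected components by depth-first search.
    modules, seen = [], set()
    for d in drugs:
        if d in seen:
            continue
        stack, component = [d], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        modules.append(sorted(component))
    return modules


# Hypothetical expression changes of four genes after treatment with four drugs.
profiles = {
    "drug_A": [1.0, 2.0, -1.0, -2.0],
    "drug_B": [0.9, 2.1, -1.2, -1.8],
    "drug_C": [-1.0, -2.0, 1.5, 2.0],
    "drug_D": [-0.8, -2.2, 1.0, 2.1],
}
print(drug_modules(profiles))
```

Drugs that fall into the same module would be candidates for sharing a mechanism of action, while pairs drawn from different modules would be the starting point for exploring combinations with distinct modes of action.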

Summary
Diseases are often regulated by complex signaling networks, and multiple drugs and multiple targets are often associated to form a large drug-target network. The network-based approach is thus a logical choice for data-driven therapeutic discovery. The topological structure of networks provides constraints for selecting more robust and causal network biomarkers, and associates drugs with their targets through the information diffusion process. An increasing number of network-based approaches have been developed for biomarker and drug discovery. This development has particularly benefited from the availability of many well-established bionetwork analysis and visualization tools (e.g., igraph in R (http://igraph.org/r/), NetworkX in Python (https://networkx.github.io/), Cytoscape (http://www.cytoscape.org/)). Yet, challenges remain in discovering informative biomarkers and effective drugs. More sophisticated or comprehensive network-based approaches to therapeutic discovery are expected in the new era of pharmacogenomics research.

Figure 1: Examples of sub protein-protein interaction networks associated with metastatic prostate cancer. Forty genes associated with metastatic prostate cancer were obtained through a DisGeNET online search and then used as input to the Reactome FI PlugIn (a Cytoscape plugin) to generate one sub-network (24 nodes and 46 edges; upper panel) in which all nodes (drawn from the 40 prostate cancer genes) are directly linked according to the Reactome database, and one sub-network (44 nodes, of which 9 red nodes are linker genes and 35 green nodes are prostate cancer genes, and 147 edges; lower panel).