Over the last decade, the maturation of high-throughput biotechnological platforms such as massively parallel sequencing has spurred a data-driven revolution in biomedical research. The year 2003 marked a major milestone when the Human Genome Project released the first consensus sequence of a complete human genome. Since then, there has been a tremendous acceleration in large-scale functional genomics, which uses high-throughput experimental techniques to measure the genome-wide activity of thousands of cellular moieties. The goal is to decipher the functions of different parts of the genomic sequence and unravel the dynamic control systems that regulate all cellular processes. Large, publicly funded, coordinated consortia are generating massive compendia of heterogeneous functional genomic data. Now, more than ever, biologists are looking to engineering and computational techniques to help organize, analyze and interpret the deluge of data. My long-term research interest is the development of statistical and machine learning methods for integrative analysis of heterogeneous, high-throughput biological data to obtain a systems-level view of cellular mechanisms and complex diseases.
A biological primer
The genomic DNA sequence of a living organism encodes a myriad of functional elements. Genes are genomic elements whose DNA sequences are transcribed to intermediate RNA molecules that are further translated to proteins. Proteins perform a variety of critical cellular functions including the transcription of genes and the regulation of gene expression i.e. controlling the timing and amount of synthesis of RNA and proteins. A living cell adapts and responds to changes in its internal and external environment primarily by regulating gene expression. The regulatory mechanisms that control gene expression have static and dynamic components. All genes have their respective regulatory domains in the genome that are composed of a variety of short DNA sequence motifs to which specific regulatory proteins bind to activate, enhance or repress gene expression. This combinatorial grammar of regulatory sequence elements is statically encoded in the genome. However, the regulatory activity of these elements is also dynamically controlled by altering the synthesis of the regulatory proteins that bind these elements. There is another layer of dynamic control in the form of a plethora of chemical modifications that can be made to DNA and to structural proteins that the genomic DNA is wrapped around (collectively known as chromatin). These epigenomic modifications can completely override or complement the regulatory information encoded in the genomic sequence by affecting various properties of chromatin (e.g. compactness) thereby controlling the accessibility of regulatory elements by regulatory proteins. In a multi-cellular organism, the tremendous morphological and physiological diversity of different cell types which share the same genomic DNA sequence is also largely a consequence of dramatic differences in epigenomic activation or silencing of genomic domains. 
Complex diseases are often a consequence of subtle disruptions to different combinations of functional elements, which collectively disrupt complex regulatory pathways. Hence, deciphering the entire regulatory parts-list of the cell, understanding its static and dynamic components and unraveling the complex network of regulatory interactions that control gene expression are key steps to obtaining a systems-level understanding of cellular response and disease.
Past, current and future research goals
My past and current research spans three major areas each critical to understanding gene regulatory mechanisms.
1. A comprehensive functional annotation of the human genome
Only about 1.5% of the human genome codes for proteins. The remaining non-coding DNA contains a bewildering number of undiscovered and poorly understood regulatory elements. In order to reverse engineer the complex regulatory pathways that affect gene expression, we first need to obtain a comprehensive collection of regulatory elements. The ENCODE (Encyclopedia of DNA Elements) project was launched in 2003 to pick up where the Human Genome Project left off, with this exact goal of systematically identifying all functional elements in the human genome. I served as the lead data coordinator and one of the primary computational analysts for the ENCODE consortium from 2008 to 2012. I developed novel computational methods and collaborated extensively with experimental and computational research groups to reveal the regulatory architecture of the largest collection of human functional elements to date. I co-authored 10 publications in 2012 in the prestigious journals Nature, Genome Research and Genome Biology. Several independent scientific reviews rated ENCODE as one of the top research advances of 2012. My main research contributions were as follows:
Discovering regulatory elements by extracting biological signal from noisy experiments: The ENCODE experimental groups generated over 2000 diverse functional genomic datasets using a variety of protocols, over a span of more than 4 years with rapidly evolving biotechnological platforms. These high-throughput experiments are riddled with various types of noise, artifacts and systematic biases, and the first step to successful data integration is effective noise filtering and normalization of data. For example, genome-wide binding maps of hundreds of regulatory proteins were generated in a large number of normal and cancerous human cell-lines. These binding maps consist of noisy, continuous, genome-wide signals roughly representing the probability of binding of specific regulatory proteins at each location in the genome.
These raw data need to be processed in order to identify high-confidence binding sites as potential regulatory elements. To handle the immense scale and diversity of datasets, I developed a novel signal processing pipeline for robust, automated discovery of regulatory elements (collaboration with Prof. Qunhua Li, Penn State and Prof. Peter Bickel, UC-Berkeley). We leveraged a powerful, probabilistic framework (based on copula-mixture models) known as the Irreproducible Discovery Rate (IDR) that uses the rank-based reproducibility of binding signal across replicate datasets of an experiment in order to identify a critical threshold indicating a transition from signal to noise [1,2]. I used the IDR framework to uniformly process all the ENCODE datasets and obtain a comprehensive collection of high-confidence regulatory elements. This method and other novel statistical quality control measures that I developed are key components of the official ENCODE data standards, which serve as state-of-the-art guidelines to the scientific community for analyzing functional genomics experiments [1,3,4].
Deciphering higher-order organization and functional heterogeneity of regulatory elements: A regulatory protein can perform multiple functions by interacting with and co-binding DNA with different combinations of other regulatory proteins. The comprehensive maps of discovered regulatory elements provided the unique opportunity to explore their higher-order organization in the genome and the functional heterogeneity of regulatory proteins. I developed a novel supervised learning formulation based on rule-based ensembles and regularized regression that was able to sort through the combinatorial complexity of possible regulatory interactions and learn statistically significant (non-random) item-sets of co-binding events at an unprecedented level of detail. I discovered a large number of novel pairwise and higher-order interactions, several of which were validated.
We showed extensive evidence that regulatory proteins could switch co-binding partners at different sets of regulatory domains within a single cell-type and across different cell-types. More importantly, these context-specific sets of regulatory proteins were found to be in close proximity to (and potentially regulate) distinct functional categories of genes. Epigenomic chromatin modifications also play a key role in defining functions of regulatory domains. ENCODE generated genome-wide signal maps of several key chromatin modifications, which allowed me to answer two key questions: Do all binding sites of a regulatory protein share a similar chromatin environment? What are the similarities and differences of the chromatin environment at binding sites of different regulatory proteins? To tease out this functional heterogeneity, I developed a novel clustering algorithm to discern distinct patterns of chromatin modification signals in the vicinity of binding sites of hundreds of regulatory proteins. We revealed, for the first time, extensive diversity and asymmetry (hidden directionality) of chromatin signal patterns. The functional significance of these patterns was evident from the fact that different classes of regulatory proteins (activators and repressors) showed different types of chromatin patterns. Also, regulatory elements showing distinct chromatin signal patterns were associated with differences in expression levels of nearby genes, in local sequence content and in the composition of co-binding regulatory proteins.
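The replicate-based thresholding idea described earlier in this section can be illustrated with a much simpler stand-in than the copula-mixture model behind IDR. The sketch below is not the IDR method. It only ranks peaks in two replicates and reports, for each candidate cutoff N, the fraction of the top N peaks in replicate 1 that are also among the top N in replicate 2. The function name, peak identifiers and scores are hypothetical and chosen purely for illustration.

```python
def top_n_overlap(rep1_scores, rep2_scores):
    """For each N, fraction of replicate 1's top-N peaks also in replicate 2's top-N.

    rep1_scores / rep2_scores: dicts mapping peak id -> signal score.
    Only peaks present in both replicates are considered.
    """
    shared = sorted(set(rep1_scores) & set(rep2_scores))
    # Rank shared peaks from strongest to weakest signal in each replicate.
    order1 = sorted(shared, key=lambda peak: -rep1_scores[peak])
    order2 = sorted(shared, key=lambda peak: -rep2_scores[peak])
    overlaps = []
    for n in range(1, len(shared) + 1):
        common = set(order1[:n]) & set(order2[:n])
        overlaps.append(len(common) / n)
    return overlaps


# Toy data (made up): three strong peaks that rank consistently in both
# replicates, followed by three weak peaks whose ranks are scrambled.
toy_rep1 = {"a": 90, "b": 80, "c": 70, "d": 12, "e": 11, "f": 10}
toy_rep2 = {"a": 95, "b": 88, "c": 75, "d": 9, "e": 13, "f": 11}
toy_overlaps = top_n_overlap(toy_rep1, toy_rep2)
```

In this toy case the overlap stays perfect through the three strong peaks and drops once the weak, inconsistently ranked peaks enter the list. That drop is the kind of signal-to-noise transition that a principled model such as IDR estimates statistically rather than by eye.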
2. Epigenomic regulation in diverse human cell-types
Hundreds of different chromatin modifications have been identified to date. Different combinations of modifications (chromatin states) have been found to define different types of functional domains (e.g. activating, repressed, transcribed) in the genome. However, only a small number of chromatin states have been well characterized so far. Deciphering biologically meaningful sets of modifications from the enormous number of possibilities is a daunting task. Moreover, the chromatin states of genomic domains are modified during cellular differentiation, giving rise to different cell-types, and these states are often disrupted in different diseases. Hence, elucidating the dynamics of chromatin states across different cell-types is a critical step to understanding epigenomic mechanisms of regulation. Since March 2012, I have been working with the Roadmap Epigenomics consortium, which has generated the largest collection of genome-wide signal maps of dozens of chromatin modifications in hundreds of human primary cell-types and tissues. I have trained multivariate hidden Markov models on these signal maps jointly across all the cell-types in order to learn a limited repertoire of hidden chromatin states and automatically segment the human genome into cell-type specific regions annotated with different chromatin state labels. These dynamic chromatin-state maps are not only revealing a staggering number of novel regulatory domains but are also allowing us to infer detailed similarities and differences of epigenomic regulation between the different cell-types. By correlating the dynamic transitions of chromatin state labels of regulatory elements with transcriptional activity of genes across cell-types, using novel probabilistic models, we have been able to infer long-range regulatory interactions between distal regulatory elements and their target genes.
We are currently exploring new large-scale learning methods to reveal the regulatory sequence grammars underlying ~2.3 million dynamic regulatory elements discovered in the human genome.
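To make the genome segmentation step above more concrete, here is a minimal sketch of how a multivariate hidden Markov model assigns a chromatin state label to each genomic bin. It covers only decoding (the Viterbi algorithm) with fixed parameters and does not cover training. It assumes binarized presence/absence calls for each mark and independent marks given the state. The two states, two marks and all probabilities are invented for illustration and are not taken from the models described here.

```python
import math


def viterbi(observations, start, trans, emit):
    """Most likely hidden-state path for binarized multi-mark observations.

    observations: list of tuples of 0/1, one value per chromatin mark per bin.
    start[s]: initial probability of state s.
    trans[s][t]: probability of moving from state s to state t.
    emit[s][m]: probability that mark m is present in state s
                (marks are treated as independent given the state).
    """
    n_states = len(start)

    def log_emission(state, obs):
        total = 0.0
        for mark, present in enumerate(obs):
            p = emit[state][mark]
            total += math.log(p if present else 1.0 - p)
        return total

    # score[s] = best log-probability of any path ending in state s.
    score = [math.log(start[s]) + log_emission(s, observations[0])
             for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for t in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda s: score[s] + math.log(trans[s][t]))
            pointers.append(best_prev)
            new_score.append(score[best_prev] + math.log(trans[best_prev][t])
                             + log_emission(t, obs))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical parameters: state 0 favors mark 0, state 1 favors mark 1,
# and both states tend to persist from one bin to the next.
toy_start = [0.5, 0.5]
toy_trans = [[0.9, 0.1], [0.1, 0.9]]
toy_emit = [[0.9, 0.1], [0.1, 0.9]]
toy_bins = [(1, 0), (1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]
toy_path = viterbi(toy_bins, toy_start, toy_trans, toy_emit)
```

Working in log space avoids numerical underflow when the same recursion is run over millions of genomic bins.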
3. Predictive, causal, unified models of gene regulation
High-throughput experimental methods are allowing us to accurately measure gene expression levels of all genes in the genome in a variety of contexts such as different cell types or under various perturbations (e.g. drugs, stress and disease). From a systems standpoint, one can consider this global transcriptional response of the cell as the output of a complex regulatory network. Our goal is to reverse engineer the regulatory network (consisting of static and dynamic regulatory elements) by linking regulatory element dynamics with gene expression dynamics. I have developed a novel machine learning framework based on Boosting algorithms (an ensemble machine learning method) to learn predictive models of gene regulation from heterogeneous types of functional genomic features [6,7,8,9]. Each gene and every experimental context is represented by heterogeneous sets of millions of complementary and potentially interacting features such as DNA sequences of gene-specific regulatory domains, context-specific expression levels of regulatory proteins and activity profiles of diverse types of regulatory elements. Groups of genes that are co-regulated in specific groups of experimental contexts will tend to show correlated patterns of expression and also share similar sets of regulatory features. By jointly analyzing the dynamics of the transcriptional output of the regulatory system across a large number of contexts with these heterogeneous feature sets, the learning algorithm automatically selects a sparse set of informative features and also learns the statistical interactions between them to build a model (an ensemble of decision/regression trees) that can accurately and quantitatively predict context-specific gene expression. I have built novel learning modules (weak-learners) to automatically discover probabilistic representations of DNA sequence motifs from raw regulatory sequence data. 
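As a rough illustration of the ensemble idea described above, the following sketch fits a small additive ensemble of one-split regression trees (stumps) by gradient boosting on squared error. It is a generic stand-in and not the framework from the cited work: the feature layout (a motif-presence flag and a regulator expression level), the toy targets and all function names are assumptions made for this example. Stumps cannot represent feature interactions, so deeper trees would be needed to capture the interactions discussed in the text.

```python
def fit_stump(rows, residuals):
    """Find the single-feature threshold split that best fits the residuals."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("no valid split: every feature is constant")
    return best[1:]


def predict(model, row):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    total = base
    for j, threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if row[j] <= threshold else right_mean)
    return total


def boost(rows, targets, n_rounds=20, learning_rate=0.5):
    """Repeatedly fit a stump to the current residuals and add it to the model."""
    base = sum(targets) / len(targets)
    model = (base, learning_rate, [])
    for _ in range(n_rounds):
        residuals = [y - predict(model, row) for row, y in zip(rows, targets)]
        model[2].append(fit_stump(rows, residuals))
    return model


def mean_squared_error(model, rows, targets):
    return sum((y - predict(model, row)) ** 2
               for row, y in zip(rows, targets)) / len(targets)


# Toy data (made up): each row is (motif present?, regulator expression level),
# and the target is a gene expression value that depends additively on both.
toy_rows = [(0, 0.1), (0, 0.9), (1, 0.1), (1, 0.9)]
toy_targets = [0.0, 1.0, 1.0, 2.0]
toy_model = boost(toy_rows, toy_targets)
toy_baseline = (sum(toy_targets) / len(toy_targets), 0.5, [])
```

Each round selects a single feature and threshold, so the features that appear in the final ensemble form a sparse, inspectable subset of the inputs.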
I have used this framework to learn regulatory networks controlling stress responses in yeast and embryonic development in worms. I generated novel hypotheses about the activating or repressing roles of complex sub-networks of interacting regulatory elements, several of which were validated in follow-up experiments.
Most regulatory network modeling approaches, including my own, have been very successful in model organisms such as yeast and worm that have relatively simpler and better understood regulatory mechanisms, more complete functional annotations of their genomes and tremendous amounts of supporting data to learn from. Analogous efforts in humans have been less successful because the common tractable approach to handling the layered complexity of human regulation has been to analyze separate pieces of the regulatory system in isolation. ENCODE and other similar efforts have laid the foundations to build more complete and comprehensive models of regulation in humans. We have started extending some of my previous work on learning regulatory networks to model diverse regulatory mechanisms within unified machine learning frameworks. For example, in humans there are a large number of long-range regulatory connections between regulatory elements and target genes that are millions of bases (letters) apart on the linear genomic scale. The three-dimensional folding structure of DNA brings these distal regulatory elements into physical proximity with their target genes. Learning these long-range regulatory contact maps will be extremely important for building accurate predictive regulatory models. In order to effectively learn from vast and diverse feature spaces and massive amounts of data, significant improvements will be needed to the learning algorithms and model representations. So far, batch learning algorithms dominate in computational genomics.
We are exploring efficient online learning and other large-scale, distributed learning algorithms in order to rapidly update our models with the ever increasing amounts of data. We are also exploring deep learning architectures that are particularly good at capturing highly combinatorial functions by learning hierarchical feature representations that are well suited to capture the complexity of regulatory interactions. In order to learn causal statistical models, we are developing machine learning methods to integrate dynamic temporal and perturbation datasets specifically in the context of cellular differentiation (in skin) and reprogramming (using heterokaryons).
Finally, in computational biology, the interpretability of learned models is of paramount importance. Learning sparse sets of informative features is especially difficult due to highly correlated, redundant features (an inherent property of biological systems conferring robustness) that are often equally important in terms of their biological function. I am keen to explore novel regularization and feature selection methods to overcome these problems.
4. Regulatory variation in personal genomes and non-coding genetic variation in complex diseases
Homo sapiens is a single species. However, no two human beings are exactly identical due to variations (mutations) in the personal genome sequence of each individual. Alongside the functional genomics revolution, there have been large-scale efforts to systematically map the genetic diversity and variation of genomic sequence across diverse human populations and individuals. I intend to focus on two key questions that remain to be answered: How does the activity of regulatory elements vary across individuals, and how does the underlying personal genetic variation affect regulatory mechanisms? We have already begun work in the area of cell-type specific regulatory variation by exploring differences in chromatin state and gene expression in lymphoblastoid cell-lines from several individuals of diverse ancestry. We are working on statistical and machine learning frameworks to uncover complex regulatory mechanisms such as buffering, redundancy and compensation. Research efforts focused on human diseases are using massive case-control studies (known as genome-wide association studies) involving whole-genome sequencing of personal genomes to identify genetic sequence variants (mutations) that are significantly associated with disease. However, there is a lack of tools and methods to interpret the functional effects of these disease-associated variants, especially those lying outside protein-coding regions in the genome (known as non-coding variants). This is especially relevant since a majority of disease-associated variants are non-coding. We are developing new computational methods to link disease-associated genetic variants with regulatory elements and decipher downstream deleterious effects on regulatory pathways and gene expression dynamics.
We are actively collaborating with several experts in various diseases and medical doctors to uncover genetic and regulatory mechanisms underlying coronary heart disease, Alzheimer's disease, ageing and colorectal cancers.
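A basic building block for linking disease-associated variants to regulatory elements, as discussed above, is a positional lookup: given a variant's coordinate, find the annotated element (if any) that contains it. The sketch below shows one simple way to do this with a sorted interval list and binary search. It assumes half-open intervals that do not overlap within a chromosome, and all identifiers and coordinates are hypothetical. It illustrates only the overlap step and none of the downstream statistical modeling.

```python
from bisect import bisect_right


def annotate_variants(variants, elements):
    """Map each variant to the regulatory element whose interval contains it.

    variants: list of (variant_id, chrom, position).
    elements: list of (element_id, chrom, start, end) with half-open
              intervals [start, end), assumed non-overlapping per chromosome.
    Returns a dict of variant_id -> element_id, or None when nothing overlaps.
    """
    by_chrom = {}
    for element_id, chrom, start, end in elements:
        by_chrom.setdefault(chrom, []).append((start, end, element_id))
    starts_by_chrom = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts_by_chrom[chrom] = [start for start, _, _ in intervals]

    hits = {}
    for variant_id, chrom, position in variants:
        hits[variant_id] = None
        starts = starts_by_chrom.get(chrom, [])
        # The only candidate is the element with the largest start <= position.
        index = bisect_right(starts, position) - 1
        if index >= 0:
            start, end, element_id = by_chrom[chrom][index]
            if position < end:
                hits[variant_id] = element_id
    return hits


# Hypothetical annotations and variants, for illustration only.
toy_elements = [
    ("enhancer_A", "chr1", 100, 200),
    ("promoter_B", "chr1", 500, 600),
    ("enhancer_C", "chr2", 50, 80),
]
toy_variants = [
    ("var1", "chr1", 150),
    ("var2", "chr1", 300),
    ("var3", "chr2", 79),
    ("var4", "chr3", 10),
]
toy_hits = annotate_variants(toy_variants, toy_elements)
```

Sorting once per chromosome keeps each lookup logarithmic in the number of elements, which matters when millions of elements are queried against large variant sets.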
Over the years, I have had the privilege of learning from and collaborating with a wide range of experts in genomics, computational biology and machine learning on some of the most exciting projects of this decade. In the process, I have developed a deep understanding of the underlying biology and mechanisms of gene regulation as well as the multitude of evolving biotechnological platforms. At the same time, I have significant expertise in formulating biological questions as computational problems and have developed a variety of statistical and machine learning approaches for massive integrative analysis of genomic data. We hope that our research efforts will open up new horizons in functional and disease genomics and ultimately improve human health!
1. Dunham I, Kundaje A, et al. ENCODE Project Consortium. An integrated Encyclopedia of DNA Elements in the human genome. Nature. 2012. PMID: 22955616
2. Li Q et al. Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics. 2011
3. Gerstein MB, Kundaje A, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012. PMID: 22955619
4. Landt SG et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012. PMID: 22955991
5. Kundaje A et al. Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements. Genome Res. 2012. PMID: 22955985
6. Middendorf M, Kundaje A, et al. Motif Discovery through predictive modeling of gene regulation. Proceeding of RECOMB. 2005
7. Kundaje A et al. Learning regulatory programs that accurately predict differential expression with MEDUSA. Ann NY Acad Sci. 2007. PMID: 17934055
8. Kundaje A et al. A predictive model of the oxygen and heme regulatory network in yeast. PLoS Comput Biol. 2008. PMID: 19008939
9. Lianoglou S, Kundaje A, et al. Inferring active signaling pathways by boosting the phosphorylation network. Proceeding of the NIPS Workshop on Computational Biology. 2006
10. Kasowski M et al. Extensive variation of chromatin states across humans. Science. 2014
11. Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M. Linking disease associations with regulatory information in the human genome. Genome Res. 2012. PMID: 22955986
Anshul Kundaje is an Assistant Professor of Genetics and Computer Science at Stanford University.