Title: Surveying Mount Improbable: Computational Challenges Facing Evolution
Abstract: [In my defense, I will talk about chapters 4 and 5 of my thesis.] Since its origins on Earth, life has overcome challenges that appear to be improbable. For life to have emerged, sets of molecules needed to cooperate and maintain this cooperation to build evolvable populations. There are significant obstacles to the emergence and maintenance of cooperation in early life. In the first two chapters of this thesis, we propose and analyze mechanisms that may have reduced the probabilistic burden of building evolvable systems. We present theoretical work on the role of population structure in early evolution. In chapter 1, we suggest that merging compartments can improve the odds of finding minimal evolvable cells. In chapter 2, we identify a simple mechanism for elongation of sequences within compartments that results in surprising asymmetric selection and cooperative dynamics. In the second half of this thesis, through models and experiments, we study the structure of fitness landscapes of proteins. A fitness landscape is a map between DNA or protein sequences and their functionality (“fitness”) in a particular context. The structure of the fitness landscape determines how hard it would be for evolution to find particular functionalities or optimize them. Hence, a thorough understanding of this structure can help us predict the likelihood of evolutionary outcomes and exploit it to design proteins with specific functionalities. Building accurate fitness landscape models presents a major challenge for three reasons: the space of all possible sequences is large, even the densest experimental data available are sparse relative to that space, and the effects of multiple mutations are non-additive when taken together (known as epistasis).
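As a toy illustration of epistasis (not taken from the thesis), the sketch below assumes a hypothetical fitness table over a two-site sequence and measures how far the double mutant departs from the additive expectation. The fitness values, the sequence labels, and the function name `epistasis` are all invented for illustration.

```python
# Hypothetical fitness values; these numbers are invented for illustration only.
fitness = {
    "AA": 1.0,   # wild type
    "BA": 1.5,   # mutation at site 1 only
    "AB": 1.25,  # mutation at site 2 only
    "BB": 0.5,   # both mutations together
}


def epistasis(f, wt, m1, m2, m12):
    """Return the deviation of the double mutant from the additive expectation."""
    # Additive expectation: wild-type fitness plus the two single-mutation effects.
    additive = f[wt] + (f[m1] - f[wt]) + (f[m2] - f[wt])
    return f[m12] - additive


eps = epistasis(fitness, "AA", "BA", "AB", "BB")
print(eps)  # a nonzero value means the two mutations do not combine additively
```

In this made-up table each mutation is beneficial on its own, yet the pair is deleterious, which is the kind of interaction an additive model cannot capture.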
In chapter 3, we introduce a machine learning technique, the variational autoencoder, that can learn the effects of mutations on sequences (the “local landscape”) by training on the evolutionary sequence record. In chapter 4, we expand the scope of our investigation of fitness landscapes by experimentally assaying approximately 80,000 variants of adeno-associated virus (AAV) capsid proteins. This model organism was chosen for two reasons: (i) its complexity limits the application of currently available computational models, and (ii) AAVs are promising gene therapy vectors, so an understanding of their fitness landscape has high translational value. We use the data collected in this chapter to train tens of machine learning models and subsequently assay nearly half a million mutants of AAV2. In doing so, we develop state-of-the-art machine learning approaches and propose best practices in this domain. We find that simple logistic regression, trained on randomly sampled data, is competitive with more complex neural network approaches in its ability to detect functional variants far from the wild type; however, other models are better at finding diverse sets of functional sequences and are more robust to training dataset composition. We further improve the proportion of functional variants in our library by performing a greedy model-guided search towards more promising sequences. By experimentally testing these designed sequences, we demonstrate that our approach can efficiently discover thousands of functional variants, orders of magnitude more than what is possible with rational design or random mutagenesis. In chapter 5, we show that with the aid of computational models, fitness landscapes can be explored through experimental batches in a manner that is more efficient than evolution. We develop a new exploration strategy that complements the models better than approaches that simply pick the best-scored variants greedily.
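The idea of a greedy model-guided search can be sketched in a few lines. This is a minimal toy, not the procedure used in the thesis: the reduced alphabet, the `toy_score` function (a stand-in for a trained model), the hypothetical target sequence, and the single-mutation neighborhood are all assumptions made for illustration.

```python
ALPHABET = "ACDE"  # reduced alphabet, assumed for illustration


def toy_score(seq):
    """Stand-in for a trained model: count matches to a hypothetical target."""
    target = "CADE"
    return sum(a == b for a, b in zip(seq, target))


def greedy_search(seed, score, steps):
    """Repeatedly move to the best-scored single mutant until no mutant improves."""
    current = seed
    for _ in range(steps):
        # Enumerate every sequence that differs from `current` at exactly one position.
        neighbors = [
            current[:i] + aa + current[i + 1:]
            for i in range(len(current))
            for aa in ALPHABET
            if aa != current[i]
        ]
        best = max(neighbors, key=score)
        if score(best) <= score(current):
            break  # local optimum under the model
        current = best
    return current


result = greedy_search("AAAA", toy_score, steps=10)
print(result, toy_score(result))
```

A purely greedy walk like this stops at the first local optimum of the model, which is one reason an exploration strategy might do better than always picking the best-scored variants.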
The results in this thesis are relevant to evolutionary biologists, machine learning scientists, protein engineers, and gene therapy researchers.
Committee: Martin Nowak (Advisor), George Church (Co-Advisor), Michael Desai, Pardis Sabeti