Climate Science Hackathon Winner: Team Pluripotent (Best in Show)
Dustin Maurer (Mathematical, Computational, and Systems Biology):
I’m a second year PhD student studying biological networks on a molecular scale, from development to cellular differentiation. I use techniques inspired by recent progress in deep learning, especially in recurrent neural networks, to model gene regulatory networks. I am particularly interested in network inference and feature extraction. My background is in mathematics and statistics, with a BS in Math and an MS in Statistics (both from Kansas State University) and a few years of professional experience as an actuarial modeler.
Kerrigan Blake (Mathematical, Computational, and Systems Biology; Center for Complex Biological Sciences):
I’m a third year PhD student in the MCSB program studying metastatic breast cancer in the Lawson lab. My thesis research aims to identify the genetic markers of metastatic potential by applying statistics and machine learning to large, sparse, single-cell datasets. Before coming to UCI, I received my BS in Mathematics from the University of Kansas where I did research in computational biochemistry and systems biology. In general, my academic interests lie in combining wet-lab biology with computational methods to address important issues in human health.
Srikiran Chandrasekaran (Ayala School of Biological Sciences; Center for Complex Biological Sciences):
I am a second year Ph.D. student advised by Prof. Sunny Jiang in Civil and Environmental Engineering. I use mathematical modeling to answer questions on the safety of water reuse. Specifically, I am developing a transport model to track the flow of viruses from secondary effluent used for irrigation, to the final produce. Before UCI, I obtained my B. Tech in Biotechnology at the Indian Institute of Technology – Madras.
We took a multi-pronged approach to this data science hackathon. First, we ordered the three data sets from least to most complex, figuring we would work our way up in complexity as time permitted. This meant that we started with the California reservoir data set. Noticing that most of the reservoirs followed the same seasonal pattern of filling in the early spring and emptying in the late summer, we felt the data might best be understood through a dimensionality reduction such as principal component analysis (PCA). The reservoirs turned out to be so strongly correlated that about 95% of the variance was captured by the first principal component (PC), and the remainder fell almost entirely in the second. Plotting the reservoirs in PC space by year revealed their long-term behavior, while plotting them by month displayed their yearly cycle. Finally, we made an animated GIF to show the whole data set at a weekly resolution. Feeling that this exercise had given us a good understanding of the reservoir data, we moved on to the next set.
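The principal component step described above can be illustrated with a short sketch. This is not our hackathon code: the data below are synthetic (a shared seasonal cycle plus a weak slow drift, both invented for illustration), and the function name `pca_explained_variance` and all parameter values are assumptions. It only shows how the share of variance per component can be read off a singular value decomposition of the centered data.

```python
import numpy as np

def pca_explained_variance(levels):
    """PCA via SVD. `levels` has shape (n_timepoints, n_reservoirs).

    Returns the PC scores, the principal directions (rows), and the
    fraction of total variance explained by each component.
    """
    centered = levels - levels.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)   # singular values come sorted, largest first
    scores = u * s                    # coordinates of each time point in PC space
    return scores, vt, explained

# Synthetic stand-in for the reservoir table (assumption, not the real data):
# ten reservoirs sharing one seasonal cycle, plus a small slow drift and noise.
rng = np.random.default_rng(0)
months = np.arange(120)
seasonal = np.sin(2 * np.pi * months / 12)
drift = np.linspace(0.5, -0.5, months.size)
n_reservoirs = 10
levels = (
    np.outer(seasonal, rng.uniform(0.8, 1.2, n_reservoirs))
    + np.outer(drift, rng.uniform(-0.3, 0.3, n_reservoirs))
    + 0.02 * rng.normal(size=(months.size, n_reservoirs))
)

scores, components, explained = pca_explained_variance(levels)
print("variance explained by PC1 and PC2:", explained[:2])
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by year or by month, gives the kind of PC-space view described above.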
We began our work on the forest fire data set by attempting to answer the question presented along with the raw numbers: how can we determine which observations belong to the same larger fire and which represent distinct fires? To address this question, we developed a simple Euclidean distance metric based on latitude, longitude, and time in days. We then fit a Gaussian mixture model, guided by the Bayesian information criterion (BIC), to establish a cutoff separating observations close enough to each other to be considered part of the same fire from those far enough apart to be considered separate fires.
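One way such a cutoff could be derived is sketched below. It is an illustration under several assumptions, not our actual code: the two "fires" are synthetic, the coordinates are left unscaled, the mixture is fit to the pairwise distances with a small hand-written one-dimensional EM routine, and the cutoff is taken as the largest distance assigned to the lowest-mean component. All function names are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all pairs of rows (lat, lon, day)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(points), k=1)
    return dist[i, j]

def log_joint(x, weights, means, variances):
    """log(weight_k * Normal(x_i | mean_k, var_k)), shape (n, k)."""
    return (np.log(weights) - 0.5 * np.log(2 * np.pi * variances)
            - 0.5 * (x[:, None] - means) ** 2 / variances)

def fit_gmm_1d(x, k, n_iter=200):
    """Plain EM for a 1-D Gaussian mixture; returns parameters and BIC."""
    means = np.quantile(x, (np.arange(k) + 0.5) / k)  # deterministic start
    variances = np.full(k, x.var() / k + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        lj = log_joint(x, weights, means, variances)
        # E-step: responsibilities of each component for each distance
        resp = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))
        # M-step: update weights, means, variances (with a small floor)
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / x.size
        means = resp.T @ x / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    log_lik = np.logaddexp.reduce(
        log_joint(x, weights, means, variances), axis=1).sum()
    bic = (3 * k - 1) * np.log(x.size) - 2 * log_lik  # 3k - 1 free parameters
    return weights, means, variances, bic

# Synthetic observations (assumption): two fires far apart in space and time.
rng = np.random.default_rng(0)
fire_a = rng.normal([38.0, -122.0, 10.0], [0.05, 0.05, 1.0], size=(15, 3))
fire_b = rng.normal([34.0, -118.0, 200.0], [0.05, 0.05, 1.0], size=(15, 3))
distances = pairwise_distances(np.vstack([fire_a, fire_b]))

# Choose the number of components by the lowest BIC.
fits = {k: fit_gmm_1d(distances, k) for k in (1, 2, 3)}
best_k = min(fits, key=lambda k: fits[k][3])
weights, means, variances, _ = fits[best_k]

# Assign each distance to its most likely component; the cutoff is the largest
# distance in the lowest-mean component that actually received assignments.
assignment = log_joint(distances, weights, means, variances).argmax(axis=1)
used = np.unique(assignment)
lowest = used[means[used].argmin()]
cutoff = distances[assignment == lowest].max()
print("components chosen by BIC:", best_k, "cutoff:", cutoff)
```

If BIC prefers a single component, the distances do not separate into groups and the resulting cutoff is not meaningful. Mixing degrees and days in one unscaled metric is also a simplification; rescaling the axes would change the cutoff.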
We proceeded by performing clustering and visualization in t-SNE space to help us understand what made fire events distinct and which variables might be related. We also visualized fire location and intensity in a month-by-year facet plot to help understand yearly cycles and trends in these two dimensions. Additionally, coloring the clustered points by time of year, vapor pressure deficit (VPD), humidity, and temperature led us to build a linear model relating these variables to the fire radiative power (FRP). Our model served to order and quantify the effect of the most important factors in determining the intensity of a forest fire event. Selecting the model with the best Akaike and Bayesian information criteria, we found that VPD, humidity, and temperature are positively associated with FRP, in that order by strength of association. In addition, we were able to explain the seasonality of fires with our model. This analysis also helped generate a testable hypothesis: that these variables interact in determining FRP. By applying a wide variety of techniques, our team attempted to digest the complexities of the two data sets to help ourselves and the judges understand the ebb and flow of California reservoirs and the relationships among time, atmospheric variables, and fire intensity.
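The model selection step for the FRP linear model can be sketched as follows. This is a minimal illustration, not our analysis: the data are synthetic, the coefficients used to generate them are arbitrary values chosen only so the example has a known ordering, the `unrelated` predictor is invented, and the function names are hypothetical. Predictors are drawn on a common scale so that coefficient sizes are comparable.

```python
import itertools
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients, AIC, BIC."""
    n = y.size
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    p = design.shape[1] + 1  # regression coefficients plus the noise variance
    # Gaussian log-likelihood at the maximum-likelihood variance rss / n
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    aic = 2 * p - 2 * log_lik
    bic = p * np.log(n) - 2 * log_lik
    return beta, aic, bic

def best_subset(features, y):
    """Fit every non-empty subset of predictors and keep the lowest-BIC model."""
    names = list(features)
    results = []
    for size in range(1, len(names) + 1):
        for subset in itertools.combinations(names, size):
            X = np.column_stack([features[name] for name in subset])
            beta, aic, bic = fit_ols(X, y)
            results.append((bic, aic, subset, beta[1:]))  # drop the intercept
    bic, aic, subset, coefs = min(results, key=lambda r: r[0])
    return subset, dict(zip(subset, coefs)), aic, bic

# Synthetic data (assumption): standardized predictors with made-up effects.
rng = np.random.default_rng(1)
n = 500
features = {name: rng.normal(size=n)
            for name in ("vpd", "humidity", "temperature", "unrelated")}
frp = (2.0 * features["vpd"] + 1.0 * features["humidity"]
       + 0.5 * features["temperature"] + rng.normal(size=n))

subset, coefs, aic, bic = best_subset(features, frp)
print("selected predictors:", subset)
print("coefficients:", coefs)
```

An exhaustive search like this is only practical for a handful of predictors. To probe the interaction hypothesis, product columns such as VPD times temperature could be added to `features` and compared under the same criteria.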