The UCI Data Science Initiative sponsored Graduate Summer Fellows to support UCI PhD students involved in new interdisciplinary research projects that have a strong data science/data analysis component.

Fellows received over $6,000 plus benefits to develop summer research projects that explore promising new topics.

2015 Graduate Student Fellows

On September 17, 2015, the Graduate Student Fellows will present their research at the Data Science End-of-Summer Event. Below are project descriptions for some of the Graduate Student Fellows presenting.


Conformation Space Graphs of Macromolecules: from Network Inference to Dynamics

Gianmarc Grazioli, Chemistry

Figure 1


The research I conducted this summer as a 2015 Data Science Initiative Fellow focused on developing algorithms for calculating the rates of transitions among spatial configurations of complex biomolecules in molecular dynamics computer simulations. Many chemical properties of biomolecules, such as DNA and proteins, are due to their configurations, i.e., the relative spatial orientation of their atoms. The first project I completed (Figure 1) was the development of an algorithm for calculating time correlation functions from Milestoning calculations. Milestoning is a preexisting method for calculating transition rates between configurations by measuring how long it takes the system to transition between hyperplane interfaces in a space of possible configurations. In this case, time correlation functions quantify how long it takes a molecule to “forget” the configuration in which it started, a key insight for comparing configurational data from molecular dynamics simulations with experimental data produced using NMR spectroscopy, one of the best experimental methods for measuring the rates of configurational transitions. The algorithm is based on carrying out random walks along a discrete configuration space graph and generating, from those walks, the time-dependent probability distribution of being at each vertex. This is the first algorithm devised for calculating time correlation functions from Milestoning data.
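The random-walk idea can be illustrated with a minimal Python sketch. It is not the author's implementation: the transition table is a hypothetical toy graph, and every step is treated as one unit of time, whereas a real Milestoning calculation would supply its own transition probabilities and lifetimes.

```python
import random

def simulate_walks(transitions, start, n_steps, n_walks, seed=0):
    """Estimate P(vertex, t) from random walks on a discrete graph.

    transitions[v] is a list of (neighbor, probability) pairs. The toy
    table used with this function is an assumption for illustration,
    not output from a Milestoning calculation.
    """
    rng = random.Random(seed)
    vertices = list(transitions)
    # counts[t][v] = number of walkers found at vertex v after t steps
    counts = [{v: 0 for v in vertices} for _ in range(n_steps + 1)]
    for _ in range(n_walks):
        v = start
        counts[0][v] += 1
        for t in range(1, n_steps + 1):
            neighbors, probs = zip(*transitions[v])
            v = rng.choices(neighbors, weights=probs)[0]
            counts[t][v] += 1
    # Normalize counts into a probability distribution at each time step
    return [{v: c[v] / n_walks for v in vertices} for c in counts]

# Hypothetical three-vertex configuration space graph
toy_graph = {
    "A": [("A", 0.9), ("B", 0.1)],
    "B": [("A", 0.2), ("B", 0.7), ("C", 0.1)],
    "C": [("B", 0.3), ("C", 0.7)],
}
distributions = simulate_walks(toy_graph, "A", n_steps=20, n_walks=500)
# Probability of still occupying the starting vertex at each time step,
# a crude stand-in for how quickly the system "forgets" where it began
memory_of_start = [p["A"] for p in distributions]
```

Tracking the occupancy of the starting vertex over time gives a simple picture of the decay that a time correlation function measures.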

Figure 2


The second method is a fully automated algorithm for both finding the physically important regions of a molecule’s configuration space and defining the milestone hyperplanes that best bound these regions. It is often the case that only one configuration of a biomolecule of interest, typically obtained from x-ray crystallography experiments, is known. This known configuration can be used as a starting point for molecular dynamics simulations in order to determine other possible configurations of the molecule, but the molecular motions from these simulations are often too complex for a human to recognize the important configurational changes, let alone define the milestone hyperplanes in the space that best capture the transitions.

Motivated by this, an algorithm was devised that consists of two subroutines: first, a search of the configuration space (Figure 2), where mutually repulsive clones of the system explore the space, and second, a milestone designation step (Figure 3), where the superior pattern recognition capabilities of machine learning are harnessed to define clusters of the computer-generated configurations and then define hyperplane interfaces between these clusters to be used as milestones. Figure 2 shows the configuration search step, where the configuration space is modeled as a two-dimensional potential. This is akin to randomly kicking a ball around a landscape, where raising the temperature is analogous to increasing the strength of the kicks. The mutually repulsive clone exploration step outperformed both the unassisted sampling and the sampling run at an artificially high temperature by exploring the entire space without blurring out the distinct configurational subspaces of similar configuration (latitude and longitude) and energy (elevation). Figure 3 shows the milestone designation step on both our two-dimensional model system and our molecular system, alanine dipeptide. While a clustering algorithm was able to distinguish the 10 different configurational subspaces (top left) that define the configuration space graph shown top right, the elevated-temperature configuration data yielded just three subspaces (not shown). For the molecular system, a three-dimensional configuration space, composed of the distances between the carbon atoms shown in blue, green, and red, was devised (bottom left). The molecular dynamics governing the configurational fluctuations of this system were then simulated, and these pairwise distances were output. Shown bottom right is the result of using a clustering algorithm to group the points into distinct sets, the interfaces between which could then be used in Milestoning simulations.
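As a rough illustration of the milestone designation step, the Python sketch below places a hyperplane between two clusters of configurations. The text does not say how the hyperplanes are chosen, so the perpendicular bisector of the two cluster centroids is used here purely as an assumption, and the cluster points are hypothetical.

```python
def centroid(points):
    """Mean position of a cluster of points (each a list of coordinates)."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def bisecting_hyperplane(c1, c2):
    """Return (normal, offset) such that normal . x == offset on the plane.

    The plane is the perpendicular bisector of the segment c1 -> c2.
    This choice of interface is an illustrative assumption.
    """
    normal = [b - a for a, b in zip(c1, c2)]
    midpoint = [(a + b) / 2 for a, b in zip(c1, c2)]
    offset = sum(n * m for n, m in zip(normal, midpoint))
    return normal, offset

def side(point, normal, offset):
    """Negative on the c1 side, positive on the c2 side, zero on the plane."""
    return sum(n * x for n, x in zip(normal, point)) - offset

# Hypothetical clusters in a three-dimensional space of pairwise distances
cluster_a = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
cluster_b = [[4.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
normal, offset = bisecting_hyperplane(centroid(cluster_a), centroid(cluster_b))
```

A trajectory crosses this candidate milestone whenever the sign returned by `side` changes between consecutive frames.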

This project has laid some theoretical groundwork and provided a proof of concept for a fully automated, machine learning-based method for calculating configurational kinetics from molecular dynamics simulations. The next step is to develop a full software implementation suitable for large-scale simulations of complex biomolecules composed of thousands of atoms. I am incredibly grateful to Prof. Andricioaei and Prof. Butts for their guidance on this project, and to the Data Science Initiative for providing me the opportunity to begin this pathway of scientific inquiry, which I plan to pursue further in the near future.

Reconstructing Southeast Asian Monsoon Variability Using Speleothems

Jessica Wang, Earth System Science

The Southeast Asian Monsoon (SEAM) is a significant driver of regional and global energy fluxes. Seasonal precipitation from the SEAM drives food production, which directly affects the economy and the livelihoods of millions of people across Southeast Asia. While the mechanisms behind the hydrologic variability of Southeast Asia are well known over millennial timescales, major uncertainties remain about the behavior of precipitation regimes on shorter timescales (e.g., the past millennium).

The goal of my summer project was to evaluate the variability in the strength of the Southeast Asian monsoon over the past 2000 years using speleothems (cave deposits) collected from Tham Mai Cave in northern Laos. To achieve this goal, I had to create composite speleothem records from two speleothems collected in 2013 (TM17 and TM19) and determine the ages of these samples. Uranium-series dating is commonly used to obtain the ages of speleothem samples and is based on the decay of uranium-234 (234U) to thorium-230 (230Th). Robust age-depth models for TM17 and TM19 were developed from the U-Th dates using the “StalAge” algorithm and linear interpolation.
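The linear interpolation part of an age-depth model can be sketched in a few lines of Python. This is only an illustration of interpolating between dated depths; it does not reproduce StalAge, and the depths and ages below are hypothetical values, not TM17 or TM19 measurements.

```python
def interpolate_age(depth, dated_depths, dated_ages):
    """Linearly interpolate an age at `depth` between dated horizons.

    dated_depths must be strictly increasing, with dated_ages giving the
    age at each dated depth. All values here are hypothetical.
    """
    if not dated_depths[0] <= depth <= dated_depths[-1]:
        raise ValueError("depth lies outside the dated interval")
    for i in range(1, len(dated_depths)):
        d0, d1 = dated_depths[i - 1], dated_depths[i]
        if depth <= d1:
            a0, a1 = dated_ages[i - 1], dated_ages[i]
            return a0 + (a1 - a0) * (depth - d0) / (d1 - d0)

# Hypothetical dated horizons: depth in mm from the top, age in years
depths_mm = [0.0, 10.0, 20.0]
ages_yr = [0.0, 500.0, 2000.0]
age_at_5mm = interpolate_age(5.0, depths_mm, ages_yr)
age_at_15mm = interpolate_age(15.0, depths_mm, ages_yr)
```

Each micromilled sample depth can be assigned an age this way, so that the isotope measurements can be plotted against time.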

In addition, I micromilled over 900 TM17 samples and 300 TM19 samples at a 3-year and 7-year resolution, respectively, for oxygen stable isotope analyses. Because speleothems grow layer by layer from the cave drip water, the oxygen isotopic signature of cave drip water is representative of precipitation that falls near the cave site. Therefore, I can use the results from the stable isotope analyses to determine the monsoon intensity over the past 2000 years.


Schools and Neighborhood Crime:
The Effects of Dropout Rates, Graduation Rates, and Test Scores on Youth Crime

Julie Gerlinger, Criminology, Law & Society

Education is an important and well-known predictor of future economic success. Low-performing students and students who do not complete high school are associated with a host of negative outcomes, including increased delinquency and contact with the juvenile and criminal justice system. From a social disorganization perspective, it is possible that schools with high dropout rates or low test scores are contributing to neighborhood disorganization, thereby increasing crime.

In the education literature, we see significant achievement gaps between white and black students that contribute heavily to social reproduction, even though education is supposed to be a means of social advancement. Thus, we ask the following research questions: How do test scores, dropouts, and graduation rates affect neighborhood crime? Do these effects vary by student race/ethnicity?

We use longitudinal crime and demographic data for small geographic units in cities in Orange County, CA, along with school data from the California Department of Education (CDE) and Common Core of Data (CCD) to assess how high school characteristics impact youth crime in the surrounding area. We create spatial buffers to test these processes at an appropriate geographic unit, and we examine how these effects vary by student racial groups and the demographic characteristics of the spatial area.
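One way to picture a spatial buffer is to count the incidents that fall within a fixed distance of a school. The Python sketch below assumes circular buffers and great-circle distances; the text does not specify the buffer shape or radius, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def count_in_buffer(school, incidents, radius_km):
    """Count incidents within radius_km of a school given as (lat, lon)."""
    return sum(
        1 for lat, lon in incidents
        if haversine_km(school[0], school[1], lat, lon) <= radius_km
    )

# Hypothetical school location and incident coordinates
school = (33.70, -117.80)
incidents = [(33.70, -117.80), (33.705, -117.80), (33.80, -117.80)]
nearby = count_in_buffer(school, incidents, radius_km=1.0)
```

Repeating the count for several radii is a simple way to check how sensitive a result is to the choice of geographic unit.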

Markov Chain Monte Carlo for Uncertainty Quantification in the Ocean Carbon Cycle

Gregory Britten, Earth System Science

The ocean plays a major role in the global carbon cycle by taking up 1/3 of anthropogenic carbon dioxide emissions per year, and by containing approximately 50 times more carbon than the atmosphere. In this project, I am utilizing an ocean inverse modeling framework developed at UCI to model biological carbon uptake and storage in the global ocean.

As a Data Science Summer Fellow, I have helped to extend the inverse framework to model a new chemical tracer, oxygen, which helps constrain biological carbon flux and storage. Over the summer, I implemented an optimization algorithm to statistically infer carbon flux from the globally observed oceanic oxygen distribution. This was a first step toward a fully probabilistic, sampling-based solution. I have also begun prototyping Markov Chain Monte Carlo (MCMC) algorithms to perform probabilistic uncertainty quantification on our inferred quantities. Probabilistic inference is a critically important component of the project and represents a novel advancement in ocean carbon cycle modeling that will aid researchers and policy-makers in constraining global carbon budgets and understanding the fate of anthropogenic emissions. Some research challenges remain, however.

The computational efficiency of both the carbon cycle model and the MCMC algorithm must be improved before we are fully able to explore the uncertainty in our model solutions. This work is ongoing, and collaboration between the Earth System Science and Statistics departments is planned to continue into the future. I would like to extend my sincere thanks to the Data Science Initiative for catalyzing this collaboration that has now become invaluable to the project.
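For readers unfamiliar with MCMC, the Python sketch below shows a random-walk Metropolis sampler, one of the simplest algorithms in this family. It is not the project's code: the target is a toy one-dimensional Gaussian standing in for a posterior over a single inferred quantity, and the step size and sample counts are arbitrary assumptions.

```python
import math
import random

def metropolis(log_target, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_target returns the log of an unnormalized density. Each
    iteration proposes a Gaussian perturbation and accepts it with
    probability min(1, target(proposal) / target(current)).
    """
    rng = random.Random(seed)
    x = start
    logp = log_target(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        if logp_new >= logp or rng.random() < math.exp(logp_new - logp):
            x, logp = proposal, logp_new
        samples.append(x)
    return samples

# Toy posterior: Gaussian with mean 2 and unit variance (an assumption)
def toy_log_posterior(x):
    return -0.5 * (x - 2.0) ** 2

chain = metropolis(toy_log_posterior, start=0.0, n_samples=20000)
kept = chain[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
```

Every iteration needs one evaluation of the target, which in the real problem means a run of the carbon cycle model. That is why the efficiency of both the model and the sampler matters.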

Predictive Modeling in Massive Open Online Courses (MOOCs)

Forough Arabshahi, Electrical Engineering and Computer Science

In this project our goal is to predict the performance of students attending a Massive Open Online Course (MOOC).

We propose two models for this prediction task:

1) A Conditional Latent Tree Model that predicts students’ performance by finding underlying hidden groupings among the students. This method was validated on a Psychology MOOC offered in Spring 2013 on Coursera, and the results will be published in the proceedings of IEEE’s International Conference on Data Mining (ICDM) in Nov. 2015.

2) A model that uses topic modeling and mixed membership topic modeling to cluster students and predict their grades according to their video-watching behavior and the results of class quizzes. The idea is further extended to Hidden Markov Models on top of topic modeling for prediction of students’ performance.

We have been collaborating with Mark Warschauer’s group in UCI’s School of Education on data collection and cleaning for validation of our second proposed method. The datasets consist of all UCI courses offered on Coursera and edX and are provided to us by UCI Extension. We have worked on UCI’s pre-calculus course offered on Coursera, a 10-week course with about 19,500 participating students. The results for this proposed framework will be submitted to Sigmetrics 2016.

Mussel Physiology and Distribution in a Changing Climate

Lauren Pandori, Ecology and Evolutionary Biology

This summer, through a collaboration with Dr. Cascade Sorte (Ecology and Evolutionary Biology) and Dr. Kristen Davis (Civil and Environmental Engineering), I worked to understand how wind patterns along the Oregon Coast influence the distribution of the California mussel (Mytilus californianus), a foundation species that provides habitat in coastal systems.

In the face of climate change, species that can no longer tolerate conditions must relocate or risk extinction. For species whose larvae travel via ocean currents, the ability to redistribute can be helped or hindered by the direction of prevailing winds, which drive ocean currents. Along the Oregon coast, dominant currents flow toward the equator. However, during brief periods, current directions can reverse, which might provide a critical window for larval transport to cooler, more hospitable environments. We integrated ecological field studies and oceanographic modeling to predict dispersal patterns of mussel larvae, and we related model predictions to observed numbers of recruiting mussels at our study site.

We discovered that during periods of current reversals, more mussels with southern origins were predicted to recruit to the study site. These data support our hypothesis that current reversals can provide an opportunity for northward movement by mussels. A scientific paper reporting the full results of this study will be submitted for publication by the end of this year.

Defining a New Brain State for Use in Robotic Therapy

Sumner Norman, Mechanical and Aerospace Engineering

After neurological trauma such as a stroke or spinal cord injury, robotically assisted movement therapy has been shown to match or better the results obtainable with conventional rehabilitation. However, robotic assistance can cause “slacking” in the patient, an automatic and subconscious reduction in effort, which allows the robot to take over.

In recent years, brain-computer interface (BCI) research has aimed to use signals acquired from electrodes on the patient’s scalp (electroencephalography) to detect brain states associated with movement intention and to reward the patient by triggering robotic assistance. However, in a recent study, we found that robotically imposed movements alone (passive participant) can produce brain states that are typically associated with actual movement by the participant. This phenomenon has the potential to cause false-positive movement of the robot, allowing the patient to slack.

Using data-driven techniques, we aimed to find a new brain state robust to robotic influences. We applied feature extraction and machine learning techniques, such as classwise principal component analysis and information discriminant analysis, to a data set collected during an upper extremity therapy task with and without robotic assistance (N=12 participants). We were successful in identifying brain states that can discriminate active movement intention and are robust to the influences of the robotic environment. A primary physiological finding was that although movement areas of the brain are important in all cases, a large prefrontal cortex decision-making process appears before movement when the participant is engaging in the task. With these techniques and insights, our BCI was able to predict patient engagement in the task with 75.9% accuracy on an individual trial basis, with accuracy increasing as more trials were used.

We hope to use and improve on these new data techniques in upcoming clinical studies to assess their efficacy in improving patient outcome after therapy.

Introduction of the invasive Sahara mustard in North America

Daniel E. Winkler, Ecology and Evolutionary Biology

Human activities are changing Earth’s climate and facilitating the spread of invasive species at rapid rates. An essential step toward controlling invasions is understanding where non-native introductions occur and what mechanisms can halt them.

Our research uses genetic, physiological, and ecological techniques to understand the spread of the highly invasive Sahara mustard across North America. Using hundreds of historic and contemporary herbaria specimens, we identified distribution patterns of Sahara mustard, collected tissue from more than 2,000 plants across North America, and sequenced their DNA using next-generation techniques. We identified single nucleotide polymorphisms that will be used to reconstruct the invasion routes of Sahara mustard into and across North America, evaluate patterns of diversity within and between groups, and identify the evolutionary mechanisms promoting the successful invasion of Sahara mustard.

More project information can be found at


Smartphone Response to Cosmic Rays

John Sandy, Physics & Astronomy

The Cosmic Rays Found In Smartphones (CRAYFIS) collaboration is seeking to turn the existing network of personally owned smartphones into a global particle detector for high-energy cosmic rays. As a member of CRAYFIS and a Data Science Fellow, I spent this summer working toward that goal.

The aim of this project was to design, code, and implement a standalone simulation of the response of a generic smartphone camera’s CMOS device to a high-energy cosmic ray. To this end, the C++ toolkit GEANT4 was used to create a 3D model of the CMOS device, starting with a single pixel and then extending that to an array of pixels. Once the array was created, simulated muons, which are common cosmic rays, were fired at the array at different angles and energies to measure the response of each pixel to the different scenarios. In this manner, it has been possible to calculate the efficiency with which a smartphone camera detects and records the passage of a muon. In addition to this calculated efficiency, the simulated data will be compared with real data to verify expected results and investigate interesting phenomena.
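A detection efficiency of this kind is, at its simplest, the fraction of simulated muons that leave a recorded signal. The Python sketch below computes that fraction with a binomial standard error; it is an illustrative assumption about the bookkeeping, not the CRAYFIS analysis, and the counts are hypothetical.

```python
import math

def detection_efficiency(n_detected, n_fired):
    """Return (efficiency, standard error) for a counting experiment.

    Efficiency is detected / fired. The error is the binomial standard
    error sqrt(e * (1 - e) / n), which is only a rough guide when the
    efficiency is very close to 0 or 1.
    """
    if n_fired <= 0:
        raise ValueError("n_fired must be positive")
    eff = n_detected / n_fired
    err = math.sqrt(eff * (1.0 - eff) / n_fired)
    return eff, err

# Hypothetical counts for one angle and energy setting
efficiency, uncertainty = detection_efficiency(n_detected=250, n_fired=1000)
```

Computing this pair separately for each simulated angle and energy would give an efficiency map that can be compared against real data.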


Out of the Valley:
A multi-disciplinary approach to forecasting valley fever dispersal in the southwestern U.S.

Linh Anh Cat, Ecology and Evolutionary Biology

Morgan Gorris, Earth System Science

In the southwestern U.S., the pathogenic fungus Coccidioides is found in both the soil and air and is subject to the effects of climate change on its growth and dispersal. Inhaling the fungal spores causes the disease called valley fever in humans and mammals, eliciting flu-like symptoms and rashes in healthy individuals and body-wide infection or death in predisposed groups. Incidence rates throughout the southwestern U.S. have increased twenty-four fold in the last 20 years, possibly due to climate change. There is currently little data on the environmental habitat of Coccidioides or how fluctuations in valley fever incidence rates are connected to its environment.

To address this knowledge gap, we collected 123 soil samples and 61 air samples across five states (CA, AZ, NM, UT, NV), which were then tested for the presence or absence of Coccidioides. Physical characteristics of the soil (e.g., salinity, pH, density, texture, C:N content) are currently being analyzed, as are spore counts from the air samples. In addition, we are improving upon the existing public health surveillance data by creating a valley fever incidence database with higher spatial and temporal resolution (cases at the county level and at monthly or quarterly increments). We plan to examine this database in conjunction with meteorology and climatology data to determine relationships between environmental conditions and valley fever incidence rates.

Regional Cocci Map

Prevalence of Vaccine Declination in Ethnic Enclaves

John Schomberg, Epidemiology

Over the summer, we tracked and recorded the tweets of a group of Spanish-speaking parents of potential HPV vaccine recipients. An application was submitted to the IRB to survey this group regarding their attitudes toward HPV vaccination. Using Twitter data alone, we were able to infer race, gender, and ethnicity. Once IRB approval is granted, we may move forward with our direct-messaging survey of vaccination attitudes and demographic variables in this population.

We hope this study can be used to validate the use of Twitter to assess health behaviors in hard-to-reach ethnic enclaves.

Establishing a National Archive of Narrative Data on Moral Choice

Sarah Bach, Social Ecology

Gabriel Anderson, Political Science

Unique among ethics centers in the USA, the UCI Ethics Center takes a scientific and empirical approach, not a philosophical or religious one, to examine topics that reflect critically on the moral implications of the new frontiers in science. One of the Center’s most important enterprises has been to collect data that lend insight into how the human mind thinks about ethical issues as people make moral choices. For the last 15 years, faculty and interns at the Ethics Center have been conducting extensive narrative interpretive interviews that reveal the cognitive/psychological processes surrounding moral choice. The initial analyses of these narrative data have been surprising and are reshaping the literature on how decisions concerning moral choice actually occur. (For example, analysis of interviews with rescuers of Jews during the Holocaust revealed that their moral acts emanated from their sense of self in relation to others, and not from the traditional agonistic model of ethical decision making that dominates the literature on decision theory and philosophical works in the Anglo-American Kantian tradition. This work was presented in three award-winning books.) The narrative data being collected at UCI are thus changing the way scholars think about how the human mind works when it comes to ethical issues. This data set is already being utilized by scholars in neuroscience and other disciplines.

The substantive topics into which these data fall include the following: (1) philanthropy and altruism; (2) mustering moral courage during war; (3) maintaining humanity and surviving the stress of wars and genocides; (4) cognition stretching; (5) how our mind thinks about “differences” and why some differences (race, religion, ethnicity, gender, age) become the subject of prejudice when others (mathematical ability, artistic talent) do not; and (6) aging and the types of moral choices individuals reflect on when looking back over their lives.

The interview data are in the form of narratives and thus provide a more nuanced understanding of the psychological aspects of human behavior than can surveys, experimental data, or more traditional large data sets. Some of the interviews are filmed, thus lending information on body language and even subtler forms of interaction. (For example, the body language of one bystander during the Holocaust spoke volumes about her discomfort and denial of the possibility that she could have helped save anyone from the Nazis.) This unique data set is now being made public to help put UCI on the intellectual map as a place where important scientific work is being done on how the human mind processes data to create a narrative that leads people to particular types of moral choices. We have spent the summer editing the entire corpus of interviews (omitting any personal information that could violate privacy) and are in the process of putting these data into a public archive so they can be made available to scholars at other universities throughout the world. In doing so, we have systematically organized a professional archive available to any scholar who is interested in doing scientific analysis of narrative interpretive data. An equally important outcome, however, is a series of books presenting some of the archival data.

Work in cognitive psychology (by scholars such as Nobel laureate Daniel Kahneman and Amos Tversky) notes that vibrant data, often in the form of stories, carry greater weight in the human mind than more pallid statistical data. Certainly empirical data are needed as we learn more about how the human mind processes the myriad bits of information it receives every day and weaves that information into a narrative that helps us make sense of the reality around us. This means that stories become increasingly important for scientific work that needs actual data to analyze as we construct more scientific models in fields such as neuroscience and cognitive and political psychology. We will take some of the data and publish them in the form of stories, as we did in “A Darkling Plain”, which contained stories of how people kept their humanity in wars ranging from World War II to Vietnam and the current wars in Afghanistan and Iraq. We plan to prepare books dealing with stories of discrimination (against women and against elders) and how people can combat it successfully, as well as stories of altruism. Finally, the papers from the international mini-conference on narratives (how to conduct them, the ethical issues they raise, and how to analyze them), which the Ethics Center ran this past summer at UCI in conjunction with the International Society of Political Psychology (ISPP), will be published in a volume; some of the chapters in this book will come from the stories collected in the UCI Ethics Center Vaughen Archives.

Grandchildren and Grandparents’ Labor Force Attachment

Brian J. Asquith, Economics

David Hirschberg, Computer Science

Over the summer, we focused on developing our schema for organizing the familial relationships by generation. The Census data has millions of observations for each survey wave, making it critical that as much information as possible is contained in the fewest number of variables.

Access to the main dataset is still pending, so we also wanted to develop a comparative result to better understand what kind of results we should expect and what kind of complications we might run into. The Panel Study of Income Dynamics (PSID) has been following approximately 9,000 families since 1968 in annual and biennial surveys. Using this smaller, more manageable dataset, we tested the feasibility of our main idea and generated initial results for grandfathers and grandmothers separately. Interestingly, the PSID results suggest that grandchildren will influence grandparents to retire earlier, and that the effect will be stronger for grandfathers.

Food Scraps, Households, and the City: Testing Social-Psychological Theories for Future Policy Tools

Sally Geislar, Department of Planning, Policy, and Design

Food waste is the single largest material stream entering landfills, generating 20% of the country’s methane, a potent greenhouse gas (GHG). In response to statewide mandates for landfill diversion and GHG reductions, cities have begun to collect food waste at the curb to generate compost or biogas. These approaches tend to rely on physical infrastructure improvements. While large-scale facilities and curbside bins may be necessary to manage the waste of millions of households, these systems still rely on, and must work together with, the everyday behavior of households. Moreover, data on this sort of household behavior, unlike its counterparts in water and energy, are scant, and policies to address organic waste tend to be formed with little insight from the wealth of social-psychological research on pro-environmental behavior change.

My work seeks to measure and analyze data on this important aspect of urban life. This research will develop understanding of, and bring evidence to bear on, the human and social aspects of waste management policies designed to reduce the environmental impact of our cities. With the support of the Data Science Initiative Summer Fellowship, my faculty advisor, Prof. David Feldman, and collaborators in Informatics, I have continued to develop data collection, communication, and analysis tools to understand household food waste, reduce the landfilling of this valuable material stream, and ultimately contribute to evidence-based social and environmental policy development. The Costa Mesa Sanitary District (CMSD) is the first in the county to offer curbside collection of household food waste and other organic materials. I am partnering with CMSD to conduct this community-based experiment with over 1,000 residents to understand and improve household participation in organic waste management policy implementation.

In a two-part experiment, this study 1) tests the effect of beginning to separate food scraps on the pro-environmental attitudes, beliefs, and other pro-environmental behaviors of participants and 2) tests the effects of norm communication on participation rates and accuracy. Communicating the actual behavior of others has been shown in social psychology and social marketing research to improve energy conservation and recycling behavior more than financial incentives and information alone. This is the first systematic study extending norm communication tools from energy conservation and recycling behavior to the domain of food waste. If effective, these norm communication tools can be used by cities implementing curbside organics collection programs as a cost-effective way to improve policy compliance.

Both the reliance on volunteer hours and the mailing costs of norm communication constitute significant barriers to scaling up new tools that improve the achievement of organic waste policy goals and the reduction of greenhouse gases. City-wide use of norm communication demands more accessible, real-time, and interactive tools. To this end, I have been working with students under Prof. Bill Tomlinson in Informatics to develop a mobile application for data collection that will reduce participant burden and allow streamlined two-way communication: participants report their food waste behavior, and group norms are communicated back to them. Overall, this research seeks to improve environmental policy implementation by testing theories from social psychology, deployed for the first time in food waste, with new mobile technologies suited to widespread adoption by these municipal programs.