I am a Clare Boothe Luce Assistant Professor in the Department of Applied and Computational Mathematics and Statistics (ACMS) at the University of Notre Dame. My research interests are primarily problem-driven and lie broadly at the intersection of applied statistics/data science and medicine. Currently, my research focuses on (1) developing interpretable statistical machine learning methods to extract actionable and reliable insights from real-world data, (2) ensuring the transparent and responsible use of AI in healthcare, in close collaboration with domain scientists, and (3) creating open-source tools and software to facilitate community-wide use and adoption of reliable data science in practice.
Previously, I was a postdoctoral researcher with Ji Zhu and Liza Levina in the University of Michigan Statistics Department, and I received my PhD in Statistics from UC Berkeley, where I was advised by Bin Yu.
Research Overview
Interpretability
In many high-impact applications, it is crucial to not only achieve high prediction accuracy, but also to identify the most important features involved in the real-world phenomena under study. We develop tools to extract stable feature importances as well as feature interactions under challenging scenarios with low-signal, highly-correlated, and high-dimensional data.
Data Integration/Fusion
Data integration, or the strategic analysis of multiple sources of data simultaneously, can often lead to discoveries that would remain hidden in separate analyses of each individual data source. To facilitate such integrative analyses, we develop practical tools to perform dimension reduction, pattern recognition, and feature selection for integrated data (also called multi-view or multi-modal data).
Scientific Reproducibility
Numerous human judgment calls are inevitably made throughout any data analysis. This includes choices like how to preprocess the data, which models to fit, how to evaluate the performance, and more. If not carefully chosen, these decisions may inadvertently result in spurious downstream conclusions. To mitigate this possibility, we provide tools and stability-driven protocols to facilitate scientific reproducibility and transparent substantive research.
Cardiovascular Genomics
Cardiovascular disease is the leading cause of death globally and in the US. We expand upon our current understanding of cardiac structure and function through the lens of epistasis, that is, non-additive gene-gene interactions. Through close interdisciplinary collaboration, we combine machine learning and novel experimental techniques to study the effects of these gene-gene interactions on cardiomyocyte cell sizes.
Precision Cancer Medicine
We leverage the rapid advancement of new biomedical technologies and the expansion of ‑omics data (e.g., genomics, epigenomics, proteomics, metabolomics) to solve various problems in precision cancer medicine. This includes collaborative work on the early detection of pancreatic cancer, drug response prediction, and gene regulatory networks for ovarian cancer.
COVID-19
To support the community-wide fight against COVID-19, we curated a large open-source corpus of COVID-19-related data from 20+ sources. Using this data, we created an ensemble to forecast the short-term trajectory of COVID-19-related recorded deaths. These forecasts were used by the non-profit organization Response4Life to determine medical supply needs and to distribute PPE accordingly.
An R package for tidy, high-quality simulation studies with efficient distributed computation, caching, and automated documentation and visualization of results.
A Python package for fitting interpretable machine learning models.
An open-source data repository with COVID-19-related information from over 20 sources. Includes data on COVID-19 cases and death counts, demographics, socioeconomic characteristics, health risk factors, social mobility, and more.
An R package for seamless documentation of data analyses via “lab notebooks” to encourage transparent and reliable data science (in early development).
A collection of utility functions and modern themes for ggplot2 plots and R Markdown documents.