Tiffany Tang

PhD Student

Department of Statistics

University of California, Berkeley

I am a fourth-year PhD student in the UC Berkeley Statistics Department, advised by Bin Yu. My research interests are primarily problem-driven and lie broadly at the intersection of applied statistics/data science and medicine. I am grateful to be supported by the NSF Graduate Research Fellowship. Previously, I studied mathematics and statistics at Rice University, where I was advised by Genevera Allen. I have also spent summers at Genentech and Baylor College of Medicine.


Interests

  • Statistical Machine Learning
  • Applied Statistics
  • Data Integration
  • Biomedical Data Science
  • Genomics


Education

  • PhD in Statistics, 2018-present

    University of California, Berkeley

  • BS in Mathematics, 2018

    Rice University

  • BA in Statistics, 2018

    Rice University

Research Overview

With the growing volume and complexity of data in today’s society, I am excited by the opportunity to work closely with scientists and doctors to extract data-driven, reproducible, and actionable insights from the craziness that is data to improve human health.

Feature Selection for Multi-View Data

Block Randomized Adaptive Iterative Lasso (B-RAIL) is a practical tool for selecting important features in high-dimensional multi-view data with mixed data types (e.g., continuous, binary, count-valued). B-RAIL serves as a versatile data integration method for both sparse regression and graph selection problems. In our ovarian cancer case study, B-RAIL successfully identifies well-known biomarkers and hints at novel candidates for future ovarian cancer research.

The Fight Against COVID-19

To support the community-wide fight against COVID-19, we are continuously curating a large open-source corpus of COVID-19-related data from 20+ sources. Using this data, we create an ensemble to forecast the short-term trajectory of COVID-19-related recorded deaths. These forecasts are being used by the non-profit organization Response4Life to determine the medical supply needs of individual hospitals and have directly contributed to the distribution of medical supplies across the country.

Integrated Principal Components Analysis

Integrated Principal Components Analysis (iPCA) generalizes classical PCA to the integrated data setting, where we want to analyze multiple related data sets simultaneously. iPCA can be used for dimension reduction and exploratory data analysis to find and visualize common patterns that occur across multiple data sets. We use iPCA to study the genomic basis of Alzheimer’s Disease (AD) and the genes that contribute to dominant expression patterns in AD.

Publications & Preprints

(2021). The Future will be Different than Today: Model Evaluation Considerations when Developing Translational Clinical Biomarker. KDD Health Day - DSHealth Workshop.

(2021). imodels: a python package for fitting interpretable models. Journal of Open Source Software.

(2020). Feature Selection for Data Integration with Mixed Multi-view Data. Annals of Applied Statistics.


(2020). A stability-driven protocol for drug response interpretable prediction (staDRIP). ML4H: Machine Learning for Health - Extended Abstract (NeurIPS Workshop).

(2020). Curating a COVID-19 data repository and forecasting county-level death counts in the United States. Harvard Data Science Review.

Awards & Honors

National Science Foundation Graduate Research Fellowship

National Defense Science and Engineering Graduate Fellowship

David P. Byar Young Investigator Award

WNAR Best Student Paper


Teaching

UC Berkeley Graduate Student Instructor

Rice University Teaching Assistant (2015-2018)

  • STAT 613 Statistical Machine Learning (graduate course)
  • MATH 354 Honors Linear Algebra
  • MATH 355 Linear Algebra
  • MATH 212 Multivariable Calculus
  • MATH 211 Ordinary Differential Equations


  • UC Berkeley Outstanding GSI Award (2019-2020)


  • Evans Hall 331