Education
Relevant Coursework: High Performance Machine Learning, Theoretical Foundations for LLMs, High Dimensional Statistics for Biomedical Data, Advanced Programming, Data Structures, Calculus III, Linear Algebra, Machine Learning - Stanford (Coursera), Deep Learning Specialization - DeepLearning.AI (Coursera)
Publications
Adams E, Bai L, Lee M, Yu Y, AlQuraishi M. From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models. bioRxiv, Feb 2025.
Tang Z, Somia N, Yu Y, Koo PK. Evaluating the representational power of pre-trained DNA language models for regulatory genomics. bioRxiv, Sep 2024.
Yu Y, Muthukumar S, Koo PK. EvoAug-TF: extending evolution-inspired data augmentations for genomic deep learning to TensorFlow. Bioinformatics, Volume 40, Issue 3, March 2024.
Work Experience
- Developing a multimodal transformer model to predict binding affinities between proteins and small molecules.
- Optimized training and inference pipelines for protein language models by up to 30% by reimplementing the models with FlashAttention (attention sketch below), and provided a validation method for evaluating data from wet-lab experiments.
- Designing algorithms to select a maximally diverse set of proteins for molecular dynamics simulations (selection sketch below), generating data to support subsequent model development for predicting protein conformation trajectories.
- Developing methods to extract protein conformational ensembles from AlphaFold2 through latent-space exploration, and building a systematic benchmark library to compare against existing methods.
- Established fine-tuning pipelines with Low-Rank Adaptation (LoRA) and Supervised Fine-Tuning (SFT) for four pre-trained DNA language models to evaluate their representational power for regulatory genomics (LoRA sketch below).
- Developed and implemented evolution-inspired data augmentations (EvoAug-TF) in TensorFlow for genomic deep neural networks and demonstrated improvements in generalization and interpretability (augmentation sketch below).
- Designed and evaluated more than 100 deep-learning models for predicting DNA promoter expression levels using Python, TensorFlow, and WandB in the 2022 DREAM Challenge; placed 7th on the final leaderboard.
- Developed a continuous individual crisis aid alert system (CICaidA), manufactured prototypes, and tested the hardware backend interface.
- Guided ESE123 students through lab experiments and lab-report writing; hosted weekly office hours.
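
The FlashAttention change referenced above amounts to swapping a naive attention implementation for a fused kernel. A minimal sketch in PyTorch, assuming (batch, heads, length, head_dim) tensors; the actual protein models and shapes are not shown here:

    import torch
    import torch.nn.functional as F

    def naive_attention(q, k, v):
        # Materializes the full (L x L) attention matrix: O(L^2) memory.
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    def fused_attention(q, k, v):
        # On supported GPUs, PyTorch dispatches this call to a FlashAttention
        # kernel that computes the same result blockwise, never building the
        # L x L matrix.
        return F.scaled_dot_product_attention(q, k, v)

    q = k = v = torch.randn(2, 8, 512, 64)  # (batch, heads, length, head_dim)
    assert torch.allclose(naive_attention(q, k, v), fused_attention(q, k, v), atol=1e-5)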
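The diversity-selection bullet does not specify the criterion; one standard approach is greedy max-min (farthest-point) sampling over feature vectors. A sketch over hypothetical per-protein embeddings:

    import numpy as np

    def farthest_point_selection(embeddings, k, seed=0):
        # embeddings: (n, d) array, e.g. one feature vector per protein.
        # Greedily adds the candidate whose minimum distance to the
        # already-selected set is largest (max-min diversity).
        n = embeddings.shape[0]
        rng = np.random.default_rng(seed)
        selected = [int(rng.integers(n))]
        # Minimum distance from each point to the selected set so far.
        dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
        for _ in range(k - 1):
            nxt = int(np.argmax(dists))
            selected.append(nxt)
            dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
        return selected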
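For the LoRA fine-tuning pipelines, the Hugging Face peft library is one common way to wire this up. A sketch; the checkpoint name is hypothetical, and target_modules depends on the layer names of the specific DNA language model being adapted:

    # pip install transformers peft
    from transformers import AutoModelForSequenceClassification
    from peft import LoraConfig, get_peft_model

    # Hypothetical checkpoint name standing in for a pre-trained DNA LM.
    model = AutoModelForSequenceClassification.from_pretrained(
        "some-org/dna-lm-base", num_labels=2
    )
    lora_config = LoraConfig(
        r=8,                # rank of the low-rank update matrices
        lora_alpha=16,      # scaling factor applied to the update
        lora_dropout=0.05,
        target_modules=["query", "value"],  # attention projections to adapt
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the small LoRA adapters train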
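One of the EvoAug-TF augmentations, random reverse-complement, is compact enough to sketch. This is a simplified rendering, assuming one-hot DNA with channel order A, C, G, T; the published package implements this and several other augmentations:

    import tensorflow as tf

    def random_reverse_complement(x, p=0.5):
        # x: (batch, length, 4) one-hot DNA, channel order A, C, G, T (assumed).
        # Reversing the length axis reverses the sequence; reversing the
        # channel axis maps A<->T and C<->G, so reversing both gives the
        # reverse complement.
        rc = tf.reverse(x, axis=[1, 2])
        flip = tf.random.uniform([tf.shape(x)[0], 1, 1]) < p
        return tf.where(flip, rc, x)  # apply to each sequence with prob. p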
Projects
- Developed deep learning models to predict small molecule-protein interactions on the Big Encoded Library for Chemical Assessment (BELKA) dataset. Implemented over 40 types of DL models, including CNNs, GNNs, Transformers, RNNs, and GBDT models, converging on a robust final solution after using all 480 allowed submissions (baseline sketch below).
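
As an illustration of the simpler end of those 40+ model types, a 1D-CNN baseline over tokenized SMILES in Keras; the vocabulary size, sequence length, and three-output head are assumptions standing in for the actual competition setup:

    import tensorflow as tf
    from tensorflow.keras import layers

    VOCAB, MAX_LEN = 64, 160  # hypothetical SMILES token vocabulary and length

    def build_cnn_baseline():
        # Embeds SMILES tokens, stacks Conv1D blocks, and predicts a
        # binding probability per protein target.
        inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
        x = layers.Embedding(VOCAB, 64)(inp)
        for filters in (64, 128, 256):
            x = layers.Conv1D(filters, 5, padding="same", activation="relu")(x)
            x = layers.MaxPooling1D(2)(x)
        x = layers.GlobalMaxPooling1D()(x)
        x = layers.Dense(256, activation="relu")(x)
        out = layers.Dense(3, activation="sigmoid")(x)  # one output per target
        model = tf.keras.Model(inp, out)
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=[tf.keras.metrics.AUC()])
        return model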
Leadership Experience
- Facilitating mentor connections and providing programmatic support for cohorts of 6-8 early-stage startups in Columbia's intensive 8-week Almaworks accelerator, helping teams prepare for the culminating Demo Day investor pitch.
- Hosted biweekly hands-on machine learning and Python workshops to cultivate members' interest in AI.
- Collaborated with professors and graduate students to provide research opportunities for undergraduates.
Selected Honors
- 2024 Kaggle Competitions Master (3x Gold, 4x Silver); current global ranking: 133 / 203,363 (top 0.065%)
- 2024 HackMIT InterSystems Challenge, 1st Place ($2,000 prize)
- 2024 MayPro Special Prize at Columbia Healthcare Hackathon
- 2024 Gilbert Family Scholarship from Columbia Engineering
- 2021 British Physics Olympiad Senior Physics Challenge, Gold Prize (global top 5%)