Shijie Tang

I am a Master’s student in Computational Biology at Carnegie Mellon University (expected 2026), working with Prof. Carl Kingsford and Prof. Wei Wu. My research lies at the intersection of machine learning, computational genomics, and sequence design, with a focus on developing AI-driven methods to address fundamental challenges in molecular biology.

I received my B.S. in Bioinformatics from the ZJU–University of Edinburgh Joint Institute in 2024. I have collaborated with Prof. Christine Orengo at University College London on protein design, and my cancer genomics research at ZJU-Edinburgh resulted in a co-first-author publication in Gut (2024, IF: 24.5).

News

  • [Aug 2025] Started a new research project on shortcut learning in NLP with Prof. Carl Kingsford at CMU
  • [Jun–Aug 2025] Software Engineering Intern at Google: built an AI-powered accessibility validation system using Gemini, achieving >0.95 recall
  • [May 2025] Contributed to the ARCADE paper on controllable mRNA sequence design
  • [Aug 2024] Co-first-author paper published in Gut: MED12 loss sensitizes pancreatic cancer to immunotherapy

Research Interests

My research focuses on building computational methods and AI models that decode biological complexity:

  • Sequence design: Developing controllable generative models for mRNA and protein sequences with optimized properties
  • Foundation models for biology: Leveraging large language models and representation learning for genomics and proteomics
  • Cancer genomics: Integrative multi-omics analysis to uncover mechanisms of tumor immune evasion
  • Shortcut learning in NLP: Peer learning approaches to improve robustness of large language models

Selected Publications

Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation
Li J, Tang S, Kaynar G, Du S, Kingsford C. arXiv:2604.12277, 2026. [Paper]

CodonRL: Multi-Objective Codon Sequence Optimization Using Demonstration-Guided Reinforcement Learning
Du S, Kaynar G, Li J, You Z, Tang S, Kingsford C. bioRxiv, 2026. [Paper]

ARCADE: Controllable Codon Design from Foundation Models via Activation Engineering
Li J, Lai HS, Liang L, Du S, Tang S, Kingsford C. bioRxiv, 2025. [Paper]

MED12 loss activates endogenous retroelements to sensitise immunotherapy in pancreatic cancer
Tang Y*, Tang S*, Yang W, et al. Gut, 2024. [DOI] (*co-first author)

Research Experience

Carnegie Mellon University — Research Assistant (Aug 2025 – present)
Advisors: Prof. Carl Kingsford & Prof. Wei Wu
Developing LLM-based methods to mitigate shortcut learning in NLP models, improving inference accuracy from 0.80 to 0.94.

Carnegie Mellon University — Research Assistant (Dec 2024 – May 2025)
Advisor: Prof. Carl Kingsford
Contributed to ARCADE, a controllable mRNA sequence design framework. Implemented parallel computing for RNA minimum free energy (MFE) and Codon Adaptation Index (CAI) optimization across codon databases for multiple species, and developed MFE predictors that leverage secondary structure features.

University College London — Research Assistant (Jun – Sep 2023)
Advisor: Prof. Christine Orengo
Applied deep learning models, including ESM protein language models and AlphaFold2, to enzyme design and protein function prediction. Incorporated the Sharpness-Aware Minimization (SAM) optimizer and performed statistical analysis of training dynamics.

ZJU–Edinburgh Joint Institute — Research Assistant (Jun 2022 – Jun 2023)
Advisors: Dr. Chaochen Wang & Dr. Jing Xue
Performed integrative multi-omics analysis characterizing MED12’s role in histone regulation and immune evasion in pancreatic cancer, resulting in a co-first-author publication in Gut.

Personal Interests

Outside of research, I enjoy astrophotography and travel photography with my Canon G7X Mark III. I recently captured Comet C/2023 A3 from Pittsburgh.