πŸ‘‹ About Me

I am a first-year Ph.D. student (Fall 2025 intake) in Artificial Intelligence at the School of Computing and Data Science (CDS), The University of Hong Kong (HKU), advised by Prof. Difan Zou. Previously, I earned my Bachelor’s degree from the University of Electronic Science and Technology of China (UESTC) in 2023 and my Master’s degree from the Faculty of Science, The University of Hong Kong, in 2025.

My research focuses on AI Interpretability, Trustworthiness, and Safety, aiming to better understand Large Language Models (LLMs) and to design more capable and safer AI systems. In particular, I am actively exploring Circuit Analysis and Sparse Autoencoders (SAEs), with the long-term goals of (1) understanding how LLMs work internally, (2) improving their performance, and (3) building models that are safer and more controllable.

If you would like to get in touch, share a passion for LLM interpretability and safety, or explore potential collaboration in this direction, feel free to reach out via email: sunny615@connect.hku.hk!


πŸ—žοΈ News


πŸ“„ Publications

Part 1: Conference Publications


Part 2: Preprints on arXiv


πŸ”¬ Experience

  • HKU Research Assistant, Department of Statistics and Actuarial Science (05/2024 – 08/2024)
    Research Direction: LLM applications in healthcare.

  • CUHK (Shenzhen) Research Assistant, School of Data Science (09/2024 – 08/2025)
    Research Direction: LLM mechanistic interpretability and AI safety.


🧩 Services (Conference Reviewer)

  • Reviewer for ICLR
  • Reviewer for EMNLP

🧭 Future Plans

  • πŸ” Continue exploring LLM interpretability, following the work of Anthropic’s Interpretability team
  • 🧠 Leverage SAEs, circuit analysis, and related methods to uncover the internal mechanisms of LLMs, delivering improved foundational SAEs and features to the community
  • πŸ›‘οΈ Continue exploring AI safety, focusing on data security and training robustness in LLMs
  • 🌐 Combine mechanistic interpretability with inference and reasoning: identify ways to integrate inference-time scaling and reinforcement learning (RL) theory into mechanistic interpretability research