👋 About Me
I am currently a first-year Ph.D. student (Fall 2025 intake) in Artificial Intelligence at the School of Computing and Data Science (CDS), The University of Hong Kong (HKU), advised by Prof. Difan Zou. Previously, I earned my Bachelor's degree in 2023 from the University of Electronic Science and Technology of China (UESTC) and my Master's degree in 2025 from the Faculty of Science, The University of Hong Kong.
My research focuses on AI interpretability, trustworthiness, and safety, aiming to better understand Large Language Models (LLMs) and to design more capable and safer AI. In particular, I am actively exploring Circuit Analysis and Sparse Autoencoders (SAEs), with the long-term goals of (1) understanding how LLMs work internally, (2) improving their performance, and (3) building models that are safer and more controllable.
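For readers new to this area, here is a minimal sketch of the standard SAE forward pass and training objective (toy dimensions and random weights for illustration only; this is not code from any of the papers below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; real SAEs use much wider dictionaries
# (d_sae is typically 8-64x the model's hidden size).
d_model, d_sae = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct it."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU yields non-negative, sparse features
    x_hat = f @ W_dec + b_dec                # linear decoder reconstructs the activation
    return f, x_hat

x = rng.normal(size=d_model)                 # stand-in for an LLM residual-stream activation
f, x_hat = sae_forward(x)

# Training minimizes reconstruction error plus an L1 penalty that encourages sparsity.
loss = np.sum((x - x_hat) ** 2) + 1e-3 * np.sum(np.abs(f))
print(f.shape, x_hat.shape)
```

Interpretability work then studies the individual features `f` (which directions fire on which inputs) rather than the dense activation `x` itself.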
If you share a passion for LLM interpretability and safety, or are looking for potential collaboration in this direction, feel free to reach out via email: sunny615@connect.hku.hk!
🗞️ News
- 🎉 [2026.05] Our technical report Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models is released on arXiv! All weights are available on ModelScope and Hugging Face!
- 🎉 [2026.02] Paper DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders accepted to ICML 2026, and to the ICLR 2026 Workshop on Trustworthy AI as an oral presentation!
- 🎉 [2026.01] Paper Does Higher Interpretability Imply Better Utility? A Pairwise Analysis on Sparse Autoencoders accepted to ICLR 2026!
- 🎉 [2025.10] Paper Does Higher Interpretability Imply Better Utility? A Pairwise Analysis on Sparse Autoencoders accepted to the NeurIPS 2025 Workshop on ResponsibleFM as an oral presentation, where it won the Outstanding Paper Award!
- 🎉 [2025.08] Two papers, Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models and Model Unlearning via Sparse Autoencoder Subspace Guided Projections, accepted at EMNLP 2025!
- 🎉 [2025.05] Paper Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis accepted at ICML 2025!
📚 Publications
Part 1: Conference Publications

- Xu Wang, Bingqing Jiang, Yu Wan, Baosong Yang, Lingpeng Kong, Difan Zou. DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders. ICML 2026 (accepted)
- Xu Wang, Yan Hu, Benyou Wang, Difan Zou. Does Higher Interpretability Imply Better Utility? A Pairwise Analysis on Sparse Autoencoders. ICLR 2026 (accepted)
- Xu Wang, Z. Li, B. Wang, Y. Hu, D. Zou. Model Unlearning via Sparse Autoencoder Subspace Guided Projections. EMNLP 2025 (accepted)
- Z. Li, Xu Wang, Y. Yang, Z. Yao, H. Xiong, M. Du. Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models. EMNLP 2025 (accepted)
- Xu Wang, et al. Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis. ICML 2025 (accepted)
Part 2: Preprints on arXiv
- Xu Wang. Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models. Qwen Team Technical Report (core contributor)
🔬 Experience
- Research Assistant, Department of Statistics and Actuarial Science, HKU (05/2024 – 08/2024)
  Research direction: LLM applications in healthcare.
- Research Assistant, School of Data Science, CUHK (Shenzhen) (09/2024 – 08/2025)
  Research direction: LLM mechanistic interpretability and AI safety.
🧩 Services (Conference Reviewer)
- Reviewer for ICML
- Reviewer for NeurIPS
- Reviewer for ICLR
- Reviewer for EMNLP
🧠 Future Plan
- 🔍 Continue exploring LLM interpretability, following the work of the Anthropic Interpretability team
- 🧠 Leverage SAEs, circuit analysis, and related methods to uncover the internal mechanisms of LLMs, delivering improved foundational SAEs and features to the community
- 🛡️ Continue exploring AI safety, focusing on data security and training robustness in LLMs
- 🔗 Combine mechanistic interpretability with inference and reasoning: explore ways to integrate inference-time scaling and reinforcement learning (RL) theory into mechanistic interpretability research