πŸ‘‹ About Me

I am a first-year Ph.D. student (Fall 2025 intake) in Artificial Intelligence at the School of Computing and Data Science (CDS), The University of Hong Kong (HKU), advised by Prof. Difan Zou. Previously, I earned my Bachelor’s degree from the University of Electronic Science and Technology of China (UESTC) in 2023 and my Master’s degree from the Faculty of Science, The University of Hong Kong, in 2025.

My research focuses on AI Interpretability, Trustworthiness, and Safety, aiming to better understand Large Language Models (LLMs) and to design more capable and safer AI systems. In particular, I am actively exploring Circuit Analysis and Sparse Autoencoders (SAEs), with the long-term goals of (1) understanding how LLMs work internally, (2) improving their performance, and (3) building models that are safer and more controllable.

If you would like to get in touch, share a passion for LLM interpretability and safety, or explore potential collaboration in this direction, feel free to reach out via email: sunny615@connect.hku.hk!


πŸ—žοΈ News


πŸ“„ Publications

Part 1: Conference Publications


Part 2: Preprints on arXiv


πŸ”¬ Experience

  • HKU Research Assistant, Department of Statistics and Actuarial Science (05/2024 – 08/2024)
    Research Direction: LLM applications in healthcare.

  • CUHK (Shenzhen) Research Assistant, School of Data Science (09/2024 – 08/2025)
    Research Direction: LLM mechanistic interpretability and AI safety.


🧩 Services (Conference Reviewer)

  • Reviewer for ICLR
  • Reviewer for EMNLP

🧭 Future Plans

  • πŸ” Continue exploring LLM interpretability, following the work of Anthropic’s Interpretability team
  • 🧠 Leverage SAEs, circuit analysis, and related methods to uncover the internal mechanisms of LLMs, delivering improved foundational SAEs and features to the community
  • πŸ›‘οΈ Continue exploring AI safety, focusing on data security and training robustness in LLMs
  • 🌐 Combine mechanistic interpretability with inference and reasoning: identify ways to integrate inference-time scaling and reinforcement learning (RL) theory into mechanistic interpretability research