Qiang HE

Bochum, Germany

I’m currently a ~~first~~ ~~second~~ ~~third~~ fourth-year PhD student at Ruhr-University Bochum, supervised by Prof. Dr. Setareh Maghsudi. I earned my Master’s degree in Theory and Method of Artificial Intelligence from the Institute of Automation, Chinese Academy of Sciences.

I am currently on the job market. Please feel free to contact me.

Research Interests: Large Language Models, RLHF, Human-AI Alignment, Game AI, Reinforcement Learning

I'm broadly interested in large language models, human-AI alignment, RLHF, and AI security. Currently, my research aims to i) develop controllable AI in both training and inference/adaptation; ii) advance the theory and real-world application of human-AI alignment; and iii) understand the structural information of LLMs, RLHF, and RL and how to leverage it to improve agent performance. And yes, we are developing these methods for RL and LLMs.

Contact information

Email: qianghe97 AT gmail DOT com, Qiang DOT He AT rub DOT de.
WeChat ID: pposac

Professional Service

Reviewer for ICLR, NeurIPS, ICML, AAAI, DMLR, ICPR.

news

May 28, 2025	Our paper Pareto Multi-Objective Alignment for Language Models is accepted to ECML/PKDD 2025 research track (acceptance rate: 24%)! Looking forward to seeing you in Porto, Portugal!
May 6, 2024	I am attending ICLR’24. Feel free to chat with me! Check out the poster.
May 3, 2024	I am attending AISTATS’24. Feel free to chat with me!
May 2, 2024	Our paper “Advancing DRL Agents in Commercial Fighting Games: Training, Integration, and Agent-Human Alignment” is accepted by ICML 2024! We introduce Shūkai, a game agent trained in the game Naruto Mobile. This work is the first example of a deep RL agent deployed in a commercial fighting game, and has been deployed for a year.
Jan 17, 2024	Two papers accepted to ICLR 2024, one of them as a spotlight. Thanks to my supervisor and collaborators for their help!

selected publications

ECML’25
Pareto Multi-Objective Alignment for Language Models

Qiang He and Setareh Maghsudi

European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2025

LLMs are increasingly deployed in real-world applications requiring careful balancing of multiple, often conflicting, objectives—such as informativeness versus conciseness, or helpfulness versus creativity. However, current alignment methods, primarily based on RLHF, optimize LLMs toward a single reward function, resulting in rigid behavior that fails to capture the complexity and diversity of human preferences. This limitation hinders the adaptability of LLMs to practical scenarios, making multi-objective alignment (MOA) a critical yet underexplored area. To bridge this gap, we propose PAreto Multi-Objective Alignment (PAMA), a principled and computationally efficient algorithm designed explicitly for MOA in LLMs. In contrast to computationally prohibitive gradient-based multi-objective optimization (MOO) methods, PAMA transforms multi-objective RLHF into a convex problem with a closed-form solution, significantly enhancing scalability. Gradient-based MOO approaches suffer from prohibitive O(dn^2) complexity, where d represents the number of model parameters—typically in the billions for LLMs—rendering direct optimization infeasible. PAMA reduces this complexity to O(n) where n is the number of objectives, enabling optimization to be completed within milliseconds. We provide theoretical guarantees showing that PAMA converges to a Pareto stationary point, ensuring that improvements in one objective cannot be achieved without sacrificing others. Extensive experiments across language models ranging from 125M to 7B parameters demonstrate PAMA’s robust and effective multi-objective alignment capabilities, consistently outperforming baseline methods, aligning with its theoretical advantages. By transforming a previously intractable optimization problem into a computationally efficient framework, PAMA offers a practical and theoretically grounded approach to aligning LLMs with diverse human values, paving the way for versatile and adaptable real-world AI deployments.
@article{ecml25,
  author = {He, Qiang and Maghsudi, Setareh},
  title = {Pareto Multi-Objective Alignment for Language Models},
  journal = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year = {2025},
  url = {https://arxiv.org/abs/2508.07768}
}
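
To make the notion of a Pareto stationary point from the abstract above concrete, here is a minimal Python sketch for the two-objective case. It is not PAMA: it uses the classic min-norm convex combination of two gradients (the MGDA-style construction), whose closed form for two objectives is well known. The function names and the toy gradients are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_weight(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Return alpha in [0, 1] minimizing ||alpha*g1 + (1-alpha)*g2||^2."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: every alpha gives the same vector
    # Unconstrained minimizer of the 1-D quadratic, then clipped to [0, 1].
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    return min(1.0, max(0.0, alpha))


def combine(g1: Sequence[float], g2: Sequence[float]) -> List[float]:
    """Min-norm convex combination of the two objective gradients."""
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


def is_pareto_stationary(g1: Sequence[float], g2: Sequence[float],
                         tol: float = 1e-9) -> bool:
    """True when some convex combination of the gradients is (numerically) zero."""
    d = combine(g1, g2)
    return dot(d, d) ** 0.5 <= tol


if __name__ == "__main__":
    # Orthogonal toy gradients: a common update direction still exists.
    print(combine([1.0, 0.0], [0.0, 1.0]))
    # Directly opposing toy gradients: no direction improves both objectives.
    print(is_pareto_stationary([1.0, 0.0], [-2.0, 0.0]))
```

When the combined vector is nonzero, it has a non-negative inner product with both gradients, so a small step along it does not worsen either objective to first order. When it is zero, the point is Pareto stationary in the sense used above.
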
ICML’24

Advancing DRL Agents in Commercial Fighting Games: Training, Integration, and Agent-Human Alignment

Chen Zhang, Qiang He, Yuan Zhou, and 4 more authors

In Forty-first International Conference on Machine Learning, 2024

Deep Reinforcement Learning (DRL) agents have demonstrated impressive success in a wide range of game genres. However, existing research primarily focuses on optimizing DRL competence rather than addressing the challenge of prolonged player interaction. In this paper, we propose a practical DRL agent system for fighting games named Shūkai, which has been successfully deployed to Naruto Mobile, a popular fighting game with over 100 million registered users. Shūkai quantifies the state to enhance generalizability, introducing Heterogeneous League Training (HELT) to achieve balanced competence, generalizability, and training efficiency. Furthermore, Shūkai implements specific rewards to align the agent’s behavior with human expectations. Shūkai’s ability to generalize is demonstrated by its consistent competence across all characters, even though it was trained on only 15% of them. Additionally, HELT exhibits a remarkable improvement in sample efficiency. Shūkai serves as a valuable training partner for players in Naruto Mobile, enabling them to enhance their abilities and skills.

ICLR’24
Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation

Qiang He, Tianyi Zhou, Meng Fang, and 1 more author

Twelfth International Conference on Learning Representations, 2024

Representation rank is an important concept for understanding the role of Neural Networks (NNs) in Deep Reinforcement Learning (DRL), which measures the expressive capacity of value networks. Existing studies focus on unboundedly maximizing this rank; nevertheless, that approach would introduce overly complex models in the learning, thus undermining performance. Hence, fine-tuning representation rank presents a challenging and crucial optimization problem. To address this issue, we find a guiding principle for adaptive control of the representation rank. We employ the Bellman equation as a theoretical foundation and derive an upper bound on the cosine similarity of consecutive state-action pair representations of value networks. We then leverage this upper bound to propose a novel regularizer, namely BEllman Equation-based automatic rank Regularizer (BEER). This regularizer adaptively regularizes the representation rank, thus improving the DRL agent’s performance. We first validate the effectiveness of automatic control of rank on illustrative experiments. Then, we scale up BEER to complex continuous control tasks by combining it with the deterministic policy gradient method. Among 12 challenging DeepMind control tasks, BEER outperforms the baselines by a large margin. Besides, BEER demonstrates significant advantages in Q-value approximation. Our anonymous code is available at https://anonymous.4open.science/r/BEER-3C4B.
@article{ICLR2024-BEER,
  author = {He, Qiang and Zhou, Tianyi and Fang, Meng and Maghsudi, Setareh},
  title = {Adaptive Regularization of Representation Rank as an Implicit Constraint of Bellman Equation},
  journal = {Twelfth International Conference on Learning Representations},
  year = {2024},
}
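
The abstract above describes turning an upper bound on the cosine similarity of consecutive state-action representations into a regularizer. The following is a rough Python sketch of that general idea only, not the BEER implementation. The bound is passed in as a plain number (`bound` is a hypothetical parameter; the paper derives its bound from the Bellman equation), and the hinge form of the penalty is an assumption made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two nonzero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def similarity_penalty(phi: Sequence[float], phi_next: Sequence[float],
                       bound: float) -> float:
    """Hinge penalty (an assumed form): positive only when the similarity of
    consecutive representations exceeds the supplied bound."""
    return max(0.0, cosine_similarity(phi, phi_next) - bound)


if __name__ == "__main__":
    # Hypothetical representations of (s, a) and (s', a').
    print(similarity_penalty([1.0, 0.0], [2.0, 0.0], bound=0.8))  # above the bound
    print(similarity_penalty([1.0, 0.0], [0.0, 1.0], bound=0.8))  # below the bound
```

A penalty of this shape is inactive whenever the representations already respect the bound, which is one simple way to regularize adaptively rather than pushing the similarity down without limit.
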

NeurIPS’23
Keep Various Trajectories: Promoting Exploration of Ensemble Policies in Continuous Control

Chao Li, Chen Gong, Qiang He, and 1 more author

Thirty-seventh Conference on Neural Information Processing Systems, 2023

The combination of deep reinforcement learning (DRL) with ensemble methods has been proved to be highly effective in addressing complex sequential decision-making problems. This success can be primarily attributed to the utilization of multiple models, which enhances both the robustness of the policy and the accuracy of value function estimation. However, there has been limited analysis of the empirical success of current ensemble RL methods thus far. Our new analysis reveals that the sample efficiency of previous ensemble DRL algorithms may be limited by sub-policies that are not as diverse as they could be. Motivated by these findings, our study introduces a new ensemble RL algorithm, termed Trajectories-awarE Ensemble exploratioN (TEEN). The primary goal of TEEN is to maximize the expected return while promoting more diverse trajectories. Through extensive experiments, we demonstrate that TEEN not only enhances the sample diversity of the ensemble policy compared to using sub-policies alone but also improves the performance over ensemble RL algorithms. On average, TEEN outperforms the baseline ensemble DRL algorithms by 41% in performance on the tested representative environments.
@article{NIPS23,
  author = {Li, Chao and Gong, Chen and He, Qiang and Hou, Xinwen},
  title = {Keep Various Trajectories: Promoting Exploration of Ensemble Policies in Continuous Control},
  journal = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
}

ECML’23
Eigensubspace of Temporal-Difference Dynamics and How It Improves Value Approximation in Reinforcement Learning

Qiang He, Meng Fang, Tianyi Zhou, and 1 more author

European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023

We propose a novel value approximation method, namely Eigensubspace Regularized Critic (ERC) for deep reinforcement learning (RL). ERC is motivated by an analysis of the dynamics of Q-value approximation error in the Temporal-Difference (TD) method, which follows a path defined by the 1-eigensubspace of the transition kernel associated with the Markov Decision Process (MDP). It reveals a fundamental property of TD learning that has remained unused in previous deep RL approaches. In ERC, we propose a regularizer that guides the approximation error tending towards the 1-eigensubspace, resulting in a more efficient and stable path of value approximation. Moreover, we theoretically prove the convergence of the ERC method. Besides, theoretical analysis and experiments demonstrate that ERC effectively reduces the variance of value functions. Among 26 tasks in the DMControl benchmark, ERC outperforms state-of-the-art methods for 20. Besides, it shows significant advantages in Q-value approximation and variance reduction. Our code is available at this https URL.
@article{ecml23,
  author = {He, Qiang and Fang, Meng and Zhou, Tianyi and Maghsudi, Setareh},
  title = {Eigensubspace of Temporal-Difference Dynamics and How It Improves Value Approximation in Reinforcement Learning},
  journal = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  volume = {abs/2306.16750},
  year = {2023},
  url = {https://arxiv.org/abs/2306.16750},
  doi = {10.48550/arXiv.2306.16750},
  eprinttype = {arXiv},
  eprint = {2306.16750},
}
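
The 1-eigensubspace mentioned in the abstract above is the set of vectors that the transition kernel leaves unchanged. For any row-stochastic matrix this subspace contains the constant vectors, because each row sums to one. The Python sketch below illustrates only this linear-algebra fact on a hypothetical two-state chain (`P_EXAMPLE` and the starting vector are made up); it is not the ERC algorithm.

```python
from typing import List, Sequence

# Hypothetical 2-state transition kernel; each row sums to 1.
P_EXAMPLE: List[List[float]] = [
    [0.9, 0.1],
    [0.2, 0.8],
]


def apply_kernel(P: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Matrix-vector product P @ x for a small dense kernel."""
    return [sum(p * xj for p, xj in zip(row, x)) for row in P]


def iterate(P: Sequence[Sequence[float]], x: Sequence[float], steps: int) -> List[float]:
    """Apply the kernel `steps` times to the vector x."""
    out = list(x)
    for _ in range(steps):
        out = apply_kernel(P, out)
    return out


def spread(x: Sequence[float]) -> float:
    """Distance from being a constant vector (max minus min entry)."""
    return max(x) - min(x)


if __name__ == "__main__":
    # The all-ones vector is an eigenvector with eigenvalue 1.
    print(apply_kernel(P_EXAMPLE, [1.0, 1.0]))
    # Repeated application pulls an arbitrary vector toward a constant vector,
    # i.e. toward the 1-eigensubspace, for this ergodic chain.
    x = iterate(P_EXAMPLE, [3.0, 0.0], steps=200)
    print(x, spread(x))
```

For this particular chain the second eigenvalue is 0.7, so the non-constant component shrinks geometrically and only the component in the 1-eigensubspace survives.
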
CVPR’23
Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning

Qiang He, Huangyuan Su, Jieyu Zhang, and 1 more author

The Thirty-Fourth IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Deep reinforcement learning (DRL) promises that an agent can learn a good policy from high-dimensional information, whereas representation learning removes irrelevant and redundant information and retains pertinent information. In this work, we demonstrate that the learned representation of the Q-network and its target Q-network should, in theory, satisfy a favorable distinguishable representation property. Specifically, there exists an upper bound on the representation similarity of the value functions of two adjacent time steps in a typical DRL setting. However, through illustrative experiments, we show that the learned DRL agent may violate this property and lead to a sub-optimal policy. Therefore, we propose a simple yet effective regularizer called Policy Evaluation with Easy Regularization on Representation (PEER), which aims to maintain the distinguishable representation property via explicit regularization on internal representations. We also provide a convergence rate guarantee for PEER. Implementing PEER requires only one line of code. Our experiments demonstrate that incorporating PEER into DRL can significantly improve performance and sample efficiency. Comprehensive experiments show that PEER achieves state-of-the-art performance on all 4 environments on PyBullet, 9 out of 12 tasks on DMControl, and 19 out of 26 games on Atari. To the best of our knowledge, PEER is the first work to study the inherent representation property of Q-network and its target. Our code is available at this https URL.
@article{cvpr2023,
  author = {He, Qiang and Su, Huangyuan and Zhang, Jieyu and Hou, Xinwen},
  title = {Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning},
  journal = {The Thirty-Fourth IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  volume = {abs/2205.14557},
  year = {2023},
  url = {https://openaccess.thecvf.com/content/CVPR2023/papers/He_Frustratingly_Easy_Regularization_on_Representation_Can_Boost_Deep_Reinforcement_Learning_CVPR_2023_paper.pdf},
  doi = {10.48550/arXiv.2205.14557},
  eprinttype = {arXiv},
  eprint = {2205.14557},
}
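
The abstract above says the regularizer acts explicitly on internal representations and fits in one line of code. One plausible reading, sketched below in Python, is a penalty on the batch-mean inner product between the Q-network's representation of (s, a) and the target network's representation of (s', a'), added to the usual TD loss. The abstract does not state the exact formula, so this form, the function names, and the weight `beta` are all assumptions; the released code is the authoritative reference.

```python
from typing import Sequence

Vector = Sequence[float]


def representation_penalty(phi: Sequence[Vector], phi_target_next: Sequence[Vector]) -> float:
    """Batch-mean inner product between current and target-network representations
    (assumed form of the regularizer)."""
    if len(phi) != len(phi_target_next) or len(phi) == 0:
        raise ValueError("batches must be non-empty and of equal size")
    total = sum(
        sum(a * b for a, b in zip(u, v))
        for u, v in zip(phi, phi_target_next)
    )
    return total / len(phi)


def regularized_loss(td_loss: float, phi: Sequence[Vector],
                     phi_target_next: Sequence[Vector], beta: float) -> float:
    """TD loss plus the weighted penalty; the 'one line' is the added term."""
    return td_loss + beta * representation_penalty(phi, phi_target_next)


if __name__ == "__main__":
    # Hypothetical 2-D representations for a batch of two transitions.
    phi = [[1.0, 0.0], [0.0, 2.0]]
    phi_next = [[1.0, 0.0], [0.0, 1.0]]
    print(representation_penalty(phi, phi_next))
    print(regularized_loss(0.5, phi, phi_next, beta=0.1))
```

In a deep learning framework the same term would be computed on the penultimate-layer activations, with gradients flowing only through the online network.
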