Formulating recommender system with reinforcement learning (RL) frameworks has attracted increasing attention from both academic and industry communities. While many promising results have been achieved, existing models mostly simulate the environment reward with a unified value, which may hinder the understanding of users’ complex preferences and limit the model performance. In this paper, we consider how to model user multi-aspect preferences in the context of RL-based recommender system. More specifically, we base our model on the framework of deterministic policy gradient (DPG), which is effective in dealing with large action spaces. A major challenge for modeling user multi- aspect preferences lies in the fact that they may contradict with each other. To solve this problem, we introduce Pareto optimization into the DPG framework. We assign each aspect with a tailored critic, and all the critics share the same actor model. The Pareto optimization is realized by a gradient-based method, which can be easily integrated into the actor and critic learning process. Based on our designed model, we theoretically analyze its gradient bias in the optimization process, and we design a weight-reuse mechanism to lower the upper bound of this bias, which is shown effective for improving the model performance. We conduct extensive experiments on different real-world datasets to demonstrate our model’s superiorities.

The Web Conference is announcing latest news and developments biweekly or on a monthly basis. We respect The General Data Protection Regulation 2016/679.