HARPO: Hierarchical Agentic Reasoning for User-Aligned Conversational Recommendation

Overview

The Problem & Solution

Conversational recommender systems are often trained and evaluated using proxy metrics (Recall@K, BLEU) that weakly reflect true user-aligned recommendation quality. HARPO reframes conversational recommendation as a structured decision-making problem, explicitly optimizing for user satisfaction, relevance, diversity, and engagement.

Abstract: Conversational recommender systems (CRSs) operate under incremental preference revelation, requiring systems to make recommendation decisions under uncertainty. While recent approaches, particularly those built on large language models, achieve strong performance on standard proxy metrics such as Recall@K and BLEU, they often fail to deliver high-quality, user-aligned recommendations in practice. This gap arises because existing methods primarily optimize for intermediate objectives like retrieval accuracy, fluent generation, or tool invocation, rather than recommendation quality itself.

HARPO integrates: (i) CHARM — hierarchical preference learning that decomposes recommendation quality into interpretable dimensions (relevance, diversity, predicted user satisfaction, and engagement) and learns context-dependent weights; (ii) STAR — deliberative tree-search reasoning guided by a learned value network; (iii) BRIDGE — domain-agnostic reasoning abstractions enabling cross-domain transfer; and (iv) MAVEN — multi-agent refinement through collaborative critique.

Figure 2: Overall architecture of the HARPO framework. The model integrates four components: STAR for structured agentic reasoning, CHARM for hierarchical preference optimization, BRIDGE for cross-domain transfer, and MAVEN for multi-agent refinement, all built on a shared language model backbone.

Architecture

Framework 4 Components

HARPO integrates four tightly coupled modules built on a shared language model backbone (DeepSeek-R1-Distill-Qwen-7B).

🌳

STAR

Structured Tree-of-Thought Agentic Reasoning
Beam search over structured reasoning states guided by a learned value network that predicts multi-dimensional recommendation quality rather than task completion.
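As a rough illustration of value-guided beam search over reasoning states, the sketch below keeps the highest-scoring partial reasoning paths at each depth. It is not the HARPO implementation: the action set, the `toy_expand` and `toy_value` functions, and the scalar score (standing in for the multi-dimensional quality prediction of a learned value network) are all hypothetical.

```python
from typing import Callable, List, Tuple

# Assumption: a reasoning state is simply the sequence of steps taken so far.
State = Tuple[str, ...]


def beam_search(
    root: State,
    expand: Callable[[State], List[State]],
    value: Callable[[State], float],
    beam_width: int = 3,
    depth: int = 2,
) -> State:
    """Keep the `beam_width` highest-value states at each depth; return the best."""
    beam = [root]
    for _ in range(depth):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break  # nothing left to expand
        candidates.sort(key=value, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=value)


# Hypothetical reasoning actions, for illustration only.
ACTIONS = ("clarify", "retrieve", "recommend")


def toy_expand(state: State) -> List[State]:
    return [state + (action,) for action in ACTIONS]


def toy_value(state: State) -> float:
    # Hand-written stand-in for a learned value network:
    # prefer paths that retrieve evidence and then end with a recommendation.
    score = 0.0
    if "retrieve" in state:
        score += 0.5
        if state[-1] == "recommend":
            score += 0.5
    return score


best = beam_search((), toy_expand, toy_value, beam_width=2, depth=2)
print(best)
```

In a real system the value function would score recommendation quality rather than task completion, which is the distinction the description above draws.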

🎯

CHARM

Contrastive Hierarchical Alignment with Reward Marginalization
Decomposes recommendation quality into four reward dimensions with context-dependent meta-learned weights and margin-based preference optimization.
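One way to picture the combination of context-dependent weights and a margin-based preference objective is the sketch below. It assumes softmax-normalized weights and a hinge-style margin loss; both are illustrative choices, not the formulation confirmed by the source, and all reward values are made up.

```python
import math
from typing import Dict, List, Sequence

DIMENSIONS = ("relevance", "diversity", "satisfaction", "engagement")


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_reward(rewards: Dict[str, float], weight_logits: Sequence[float]) -> float:
    """Weighted sum of per-dimension rewards.

    Assumption: `weight_logits` would come from a context encoder; here they
    are passed in directly.
    """
    weights = softmax(weight_logits)
    return sum(w * rewards[dim] for w, dim in zip(weights, DIMENSIONS))


def margin_preference_loss(r_chosen: float, r_rejected: float, margin: float = 0.1) -> float:
    """Hinge loss: zero once the chosen response beats the rejected one by `margin`."""
    return max(0.0, margin - (r_chosen - r_rejected))


# Illustrative numbers only.
chosen = {"relevance": 0.8, "diversity": 0.6, "satisfaction": 0.7, "engagement": 0.5}
rejected = {"relevance": 0.4, "diversity": 0.6, "satisfaction": 0.3, "engagement": 0.3}
uniform_logits = [0.0, 0.0, 0.0, 0.0]  # equal weights of 0.25 each

r_chosen = aggregate_reward(chosen, uniform_logits)
r_rejected = aggregate_reward(rejected, uniform_logits)
loss = margin_preference_loss(r_chosen, r_rejected)
```

With equal weights the chosen response scores 0.65 and the rejected one 0.40, so the gap exceeds the margin and the loss is zero. Changing the logits shifts which dimensions dominate the comparison.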

🌉

BRIDGE

Cross-Domain Transfer
Adversarial domain adaptation with learnable domain gates — preserves domain-invariant reasoning patterns while retaining domain-specific information.
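The gating idea can be sketched as an element-wise mix of a domain-invariant vector and a domain-specific one. This is only a guess at the general shape: the sigmoid gate, the `gated_fusion` name, and the toy vectors are assumptions, and the adversarial training component is omitted entirely.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    invariant: Sequence[float],
    specific: Sequence[float],
    gate_logits: Sequence[float],
) -> List[float]:
    """Mix two feature vectors element-wise.

    Assumption: `gate_logits` are learnable per-dimension parameters. A gate
    near 1 keeps the domain-invariant feature; a gate near 0 keeps the
    domain-specific one.
    """
    gates = [sigmoid(g) for g in gate_logits]
    return [g * a + (1.0 - g) * b for g, a, b in zip(gates, invariant, specific)]


# Logits of 0 give a gate of 0.5, i.e. an even mix of both sources.
fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
```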

🤝

MAVEN

Multi-Agent Refinement
Three specialized agents (Recommender, Critic, Explainer) collaborate through shared representations with an agreement loss promoting coherent consensus.
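An agreement loss of this kind could, for example, penalize the distance between the agents' representations. The sketch below uses the mean pairwise squared Euclidean distance, which is an assumed form chosen for illustration; the source does not specify the actual loss.

```python
from itertools import combinations
from typing import Sequence


def agreement_loss(representations: Sequence[Sequence[float]]) -> float:
    """Mean squared Euclidean distance over all pairs of agent vectors.

    The loss is zero when every agent produces the same representation.
    """
    pairs = list(combinations(representations, 2))
    if not pairs:
        raise ValueError("need at least two agents")
    total = sum(sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in pairs)
    return total / len(pairs)


# Hypothetical 2-d representations for the three agents.
recommender = [1.0, 0.0]
critic = [1.0, 0.0]
explainer = [0.0, 0.0]

loss = agreement_loss([recommender, critic, explainer])
```

Here the Recommender and Critic agree exactly while the Explainer differs, so two of the three pairs contribute a squared distance of 1 and the loss is 2/3.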

Evaluation Metrics

Evaluation Metrics User-Aligned

HARPO introduces a quality-centric evaluation perspective separating user-aligned measures from standard proxy metrics.

Primary · User-Aligned

User Satisfaction

CHARM reward score for predicted user satisfaction, validated via Pearson correlation with human judgments (r=0.73).
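For readers who want to reproduce this kind of validation on their own data, a Pearson correlation between predicted scores and human ratings can be computed as follows. The score lists are placeholders, not HARPO data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values: model-predicted satisfaction vs. human 1-5 ratings.
predicted = [0.2, 0.4, 0.5, 0.7, 0.9]
human = [1, 2, 3, 4, 5]
r = pearson_r(predicted, human)
```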

Primary · User-Aligned

Engagement

Predicts follow-up rate and continued interaction. Pearson r=0.64 with human follow-up behavior.

Primary · User-Aligned

Diversity-adj. Relevance

Combines the CHARM relevance reward (r=0.71) and diversity reward (r=0.68), so that breadth is measured alongside precision.

Primary · Human

Human Preference

Expert annotator Overall score (1–5 Likert, 200 samples/dataset, Fleiss' κ > 0.72), averaged across Rec.Q and Exp.Q.
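Fleiss' κ measures agreement among a fixed number of annotators beyond what chance would produce. The sketch below shows the standard computation on a count matrix; the number of annotators and the toy counts are assumptions, since the source reports only the resulting κ.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] = number of annotators who gave sample i the category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every sample needs the same number (>= 2) of ratings")

    # Per-sample agreement: fraction of annotator pairs that agree.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return 1.0  # all ratings fall in one category
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: 2 samples, 3 annotators, 2 categories, perfect agreement.
kappa_perfect = fleiss_kappa([[3, 0], [0, 3]])
```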

Secondary · Proxy

Recall@K

Standard retrieval metric over 100 candidates (99 negatives). Reported for K ∈ {1, 10, 50}.

Secondary · Proxy

NDCG / MRR

Ranking quality metrics at K=10. Reported alongside Recall as complementary proxy signals.
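The proxy metrics are straightforward to compute per conversation. The sketch below assumes a single ground-truth item per ranked candidate list, which matches the 1-positive, 99-negative setup described above; the item IDs are invented. Corpus-level numbers would be the average over conversations.

```python
import math
from typing import Optional, Sequence


def rank_of(ranked: Sequence[int], target: int) -> Optional[int]:
    """1-based rank of `target` in `ranked`, or None if it is absent."""
    return ranked.index(target) + 1 if target in ranked else None


def recall_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    return 1.0 if target in ranked[:k] else 0.0


def mrr_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    rank = rank_of(ranked, target)
    return 1.0 / rank if rank is not None and rank <= k else 0.0


def ndcg_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    # With one relevant item the ideal DCG is 1, so NDCG equals the DCG.
    rank = rank_of(ranked, target)
    return 1.0 / math.log2(rank + 1) if rank is not None and rank <= k else 0.0


ranked_items = [5, 3, 9, 1]  # hypothetical item IDs, best first
target_item = 9              # ground-truth item, ranked third

r1 = recall_at_k(ranked_items, target_item, 1)
r10 = recall_at_k(ranked_items, target_item, 10)
mrr10 = mrr_at_k(ranked_items, target_item, 10)
ndcg10 = ndcg_at_k(ranked_items, target_item, 10)
```

For a target at rank 3, Recall@1 is 0, Recall@10 is 1, MRR@10 is 1/3, and NDCG@10 is 1/log2(4) = 0.5.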

Benchmark

Leaderboard

HARPO Benchmark

ReDial (Movies)

INSPIRED (Movies)

MUSE (Fashion)

Rankings across three datasets on user-aligned metrics. Higher is better for all metrics.

#	Model	Satisfaction	Engagement	Div.-Adj. Relevance	Human Pref.	Overall Score

† Text-only adaptation. ‡ Fine-tuned following Wang et al. (2025). Scores are normalized to [0, 1]. Human Pref. is the Overall score from Table 8 (1–5 scale, normalized).

Submit

Submit Your System

Paste your results JSON below. Submissions are evaluated against the official HARPO benchmark API at github.com/harpo-bench/harpo.

📋 Paste Results JSON

All fields required. Results are verified server-side using the HARPO evaluation API.

Full JSON schema:

{
  "method_name": string,
  "team": string,
  "dataset": "redial"|"inspired"|"muse",
  "predictions": [{ "conv_id": string, "recommended_items": number[] }],
  "paper_url": string | null,
  "code_url": string | null,
  "description": string  // ≤ 200 chars
}
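Before submitting, it may help to check a payload locally against the schema above. The validator below is an unofficial sketch that mirrors only the field names, types, and length limit shown in the schema; `validate_submission` is a hypothetical helper, not part of the HARPO API, and the server may apply additional checks.

```python
from typing import Any, Dict, List

DATASETS = {"redial", "inspired", "muse"}


def validate_submission(payload: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the shape is valid."""
    errors: List[str] = []

    for key in ("method_name", "team", "description"):
        if not isinstance(payload.get(key), str):
            errors.append(f"{key}: must be a string")
    description = payload.get("description")
    if isinstance(description, str) and len(description) > 200:
        errors.append("description: must be at most 200 characters")

    if payload.get("dataset") not in DATASETS:
        errors.append("dataset: must be one of redial, inspired, muse")

    for key in ("paper_url", "code_url"):
        if key not in payload:
            errors.append(f"{key}: field is required (use null if unavailable)")
        elif payload[key] is not None and not isinstance(payload[key], str):
            errors.append(f"{key}: must be a string or null")

    predictions = payload.get("predictions")
    if not isinstance(predictions, list):
        errors.append("predictions: must be a list")
        return errors
    for i, pred in enumerate(predictions):
        if not isinstance(pred, dict):
            errors.append(f"predictions[{i}]: must be an object")
            continue
        if not isinstance(pred.get("conv_id"), str):
            errors.append(f"predictions[{i}].conv_id: must be a string")
        items = pred.get("recommended_items")
        is_number_list = isinstance(items, list) and all(
            isinstance(x, (int, float)) and not isinstance(x, bool) for x in items
        )
        if not is_number_list:
            errors.append(f"predictions[{i}].recommended_items: must be a list of numbers")
    return errors


# Made-up example payload.
example = {
    "method_name": "my-crs",
    "team": "example-team",
    "dataset": "redial",
    "predictions": [{"conv_id": "c-001", "recommended_items": [12, 7, 93]}],
    "paper_url": None,
    "code_url": None,
    "description": "Toy submission used to exercise the validator.",
}
problems = validate_submission(example)
```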

Evaluation

Benchmark Results

HARPO improves consistently over the baselines on three conversational recommendation benchmarks (ReDial, INSPIRED, MUSE), with particularly strong gains on user-aligned metrics. All improvements are significant at p < 0.01 (paired t-test with Bonferroni correction).

ReDial

Method	Type	R@1	R@10	R@50	MRR@10	NDCG@10	User Sat.	Engage.
KBRD	Open-source	2.9±0.2	16.7±0.4	36.2±0.7	7.4±0.2	10.2±0.3	0.42±0.02	0.38±0.02
KGSF	Open-source	3.8±0.2	18.1±0.5	37.4±0.7	8.4±0.3	11.6±0.4	0.45±0.02	0.41±0.02
BARCOR	Open-source	3.0±0.2	16.8±0.4	36.8±0.6	7.8±0.2	10.8±0.3	0.44±0.02	0.40±0.02
LLaMA-2-7B	Open-source	2.2±0.3	13.6±0.6	33.4±0.9	6.2±0.3	8.6±0.4	0.38±0.02	0.34±0.02
LLaMA-2-13B	Open-source	2.8±0.3	15.4±0.6	35.6±1.0	7.2±0.4	9.9±0.5	0.43±0.02	0.39±0.02
UniCRS	Open-source	4.8±0.3	21.2±0.5	40.8±0.8	10.1±0.3	13.8±0.4	0.51±0.02	0.47±0.02
DCRS	Agent	7.5±0.3	25.1±0.6	43.6±0.9	12.2±0.4	15.2±0.5	0.56±0.02	0.52±0.02
ChatGPT	GPT	3.3±0.4	17.0±0.7	37.8±1.1	8.0±0.4	11.0±0.5	0.49±0.03	0.45±0.03
GPT-4	GPT	4.5±0.4	19.4±0.8	40.2±1.2	9.6±0.5	13.2±0.6	0.55±0.03	0.51±0.03
RecMind	Agent	5.8±0.3	22.6±0.6	42.2±0.9	11.2±0.4	15.3±0.5	0.54±0.02	0.50±0.02
InteRecAgent	Agent	5.2±0.3	21.4±0.6	41.0±0.8	10.4±0.4	14.3±0.5	0.52±0.02	0.48±0.02
HARPO	Ours	9.1±0.3	29.8±0.7	50.2±1.0	15.6±0.5	21.2±0.6	0.68±0.02	0.64±0.02

INSPIRED

Method	Type	R@1	R@10	R@50	MRR@10	NDCG@10	User Sat.	Engage.
KGSF	Open-source	2.4±0.3	13.8±0.6	31.6±1.0	6.4±0.3	8.8±0.4	0.40±0.03	0.36±0.02
UniCRS	Open-source	3.8±0.3	17.6±0.7	37.2±1.2	8.6±0.4	11.8±0.5	0.48±0.03	0.44±0.03
GPT-4	GPT	4.2±0.5	18.8±0.9	39.4±1.5	9.4±0.5	12.9±0.6	0.53±0.03	0.49±0.03
RecMind	Agent	4.8±0.4	20.4±0.8	41.2±1.3	10.2±0.5	14.0±0.6	0.52±0.03	0.48±0.03
HARPO	Ours	7.2±0.4	27.4±0.9	48.8±1.4	14.2±0.6	19.4±0.7	0.66±0.03	0.62±0.03

MUSE (Multimodal)

Method	Type	R@1	R@10	R@50	MRR@10	NDCG@10	User Sat.	Engage.
UniCRS†	Open-source	1.6±0.3	11.8±0.6	27.4±1.1	5.1±0.3	7.2±0.4	0.36±0.03	0.32±0.02
LLaVA-Next-8B‡	Open-source	5.2±0.4	25.4±0.7	44.2±1.2	12.0±0.4	16.2±0.5	0.52±0.03	0.48±0.03
GPT-4V	GPT	4.4±0.5	23.2±0.9	42.6±1.4	10.8±0.5	14.8±0.6	0.54±0.03	0.50±0.03
Qwen2-VL-7B‡	Open-source	8.4±0.4	34.2±0.8	52.8±1.3	17.2±0.4	23.1±0.5	0.61±0.03	0.57±0.03
HARPO	Ours	10.2±0.4	38.6±0.9	58.4±1.3	19.8±0.5	26.4±0.6	0.72±0.03	0.68±0.03

† Text-only adaptation. ‡ Fine-tuned following Wang et al. (2025).

Cite This Work