arxiv:2603.04743

DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Published on Mar 5 · Submitted by Maojun SUN on Mar 6 · #3 Paper of the day

Abstract

AI-generated summary: A lightweight retrieval model called DARE incorporates data distribution information into function representations to improve R package retrieval, achieving superior performance over existing embedding models while enabling more reliable statistical analysis through an R-oriented LLM agent.

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation, together with a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
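For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG at a cutoff of 10 is commonly computed for a single query. It is not the paper's evaluation code: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, and the binary relevance labels in the usage lines are an assumption, since this page does not describe the exact evaluation protocol.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items of a ranked list.

    `relevances` holds one relevance grade per ranked item, best-ranked first.
    The item at 1-based rank r contributes rel / log2(r + 1).
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    Assumption: `relevances` covers the full ranked candidate list, so sorting
    it in descending order gives the ideal ranking for this query.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item exists for this query
    return dcg_at_k(relevances, k) / ideal_dcg


# Illustrative labels only (1 = the correct R function, 0 = anything else).
print(ndcg_at_k([1, 0, 0]))            # correct function ranked first -> 1.0
print(round(ndcg_at_k([0, 1, 0]), 4))  # correct function ranked second -> 0.6309
print(ndcg_at_k([0] * 10 + [1]))       # correct function at rank 11 -> 0.0
```

A reported score such as 93.47% would then typically be the mean of this per-query value over all evaluation queries, expressed as a percentage.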

Community

We introduce DARE, an embedding model that improves LLM agents on R package retrieval and downstream statistical analysis tasks. DARE outperforms open-source embedding models on R retrieval, with higher efficiency and accuracy.

Paper: https://arxiv.org/abs/2603.04743
Website: https://ama-cmfai.github.io/DARE_webpage/
Model: https://huggingface.co/Stephen-SMJ/DARE-R-Retriever
Database: https://huggingface.co/datasets/Stephen-SMJ/RPKB

arXivLens breakdown of this paper: https://arxivlens.com/PaperView/Details/dare-aligning-llm-agents-with-the-r-statistical-ecosystem-via-distribution-aware-retrieval-1339-e232a0b2

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications
