arxiv:2603.04743

DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Published on Mar 5

Submitted by Maojun SUN on Mar 6

#3 Paper of the day

The Hong Kong Polytechnic University

Abstract

A lightweight retrieval model called DARE incorporates data distribution information into function representations to improve R package retrieval, achieving superior performance over existing embedding models while enabling more reliable statistical analysis through an R-oriented LLM agent.

AI-generated summary

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation, together with a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
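For readers unfamiliar with the headline metric, the following is a minimal sketch of how NDCG at a cutoff of 10 is commonly computed for a single query. It is not the paper's evaluation code. The function names, the linear-gain formulation, and the assumption that the input list holds relevance labels for every candidate in ranked order are illustrative choices made here.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(index + 2).
    return sum(rel / math.log2(index + 2)
               for index, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    Assumption: `ranked_relevances` lists the relevance label of every
    candidate in the order the retriever ranked them, so the ideal
    ordering can be obtained by sorting the same labels.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidate exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical query with one correct function among three candidates.
perfect = ndcg_at_k([1, 0, 0])   # correct function ranked first
second = ndcg_at_k([0, 1, 0])    # correct function ranked second
```

Under these assumptions, ranking the correct function first scores 1.0, while ranking it second scores 1 / log2(3), about 0.63. A benchmark figure such as the one reported above is typically the mean of this per-query score over all test queries.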

Community

Stephen-SMJ

Paper submitter about 20 hours ago

We introduce DARE, an embedding model that improves LLM agents on R package retrieval and downstream statistical analysis tasks. DARE outperforms open-source embedding models on R retrieval in both efficiency and accuracy.

Paper: https://arxiv.org/abs/2603.04743
Website: https://ama-cmfai.github.io/DARE_webpage/
Model: https://huggingface.co/Stephen-SMJ/DARE-R-Retriever
Database: https://huggingface.co/datasets/Stephen-SMJ/RPKB