
MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)

Aaron Steiner, R. Peeters, Christian Bizer
arXiv.org | 2025
This paper provides the first controlled, head-to-head comparison of four web agent interaction paradigms—HTML browsing, RAG, MCP, and NLWeb—across identical e-commerce tasks, demonstrating that structured interfaces significantly outperform raw HTML in both effectiveness and efficiency.

Problem Statement

LLM-based web agents are increasingly deployed for tasks like product search and checkout, but the HTML, RAG, MCP, and NLWeb interaction paradigms have so far been studied largely in isolation, without a unified benchmark. This lack of direct comparison makes it impossible to objectively assess trade-offs in accuracy, token consumption, latency, and cost, and leaves practitioners with no empirical basis for choosing the right interface architecture for their specific use case.

Key Novelty

  • First unified testbed with four simulated e-shops exposing identical products via HTML, MCP, and NLWeb interfaces simultaneously, enabling apples-to-apples comparison
  • Comprehensive evaluation spanning four interface types (HTML, RAG, MCP, NLWeb) across task complexity levels from simple search to multi-step checkout with four different LLM backends
  • Quantified cost-performance trade-off analysis showing RAG with GPT 5 mini to be a strong compromise between API cost and accuracy, providing actionable guidance for practitioners

Evaluation Highlights

  • F1 score rises from 0.67 (HTML) to 0.75–0.77 (RAG/MCP/NLWeb), with the best configuration, RAG with GPT 5, achieving F1 = 0.87 and a completion rate of 0.79; token usage drops from ~241k (HTML) to 47k–140k per task
  • Runtime per task drops from 291 seconds for HTML to 50–62 seconds for structured interfaces, representing a ~5–6x speedup that directly impacts user-facing latency and API costs

Breakthrough Assessment

5/10. This is a solid, well-executed empirical contribution that fills a clear gap in the web agent literature with practical value for practitioners, but it is fundamentally a benchmarking/comparison study rather than a novel algorithmic or architectural advance, placing it in the solid-contribution category.

Methodology

  1. Construct a controlled testbed of four simulated e-commerce shops, each exposing its product catalog through three server-side interfaces (HTML pages, MCP API endpoints, NLWeb natural-language query interface) to ensure identical underlying data across all conditions
  2. Develop four specialized agent architectures—HTML browser agent, RAG agent (pre-crawled vector index), MCP API agent, and NLWeb query agent—each optimized for its respective interface, and define a task suite ranging from simple lookups to complex complementary/substitute product queries and multi-step checkout
  3. Evaluate all agent-interface combinations using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as LLM backends, measuring F1 score, task completion rate, token usage, runtime, and estimated API cost to produce a multi-dimensional comparison
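
The per-task F1 reported in step 3 can be read as set overlap between the products an agent returns and the gold products defined for each task. The paper's scoring code is not reproduced in this summary, so the following is a minimal sketch of how such per-task F1 and its macro-average might be computed; the TaskResult schema and field names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Gold vs. predicted product IDs for one benchmark task (hypothetical schema)."""
    expected: set[str]   # gold product IDs defined with the task
    returned: set[str]   # product IDs the agent actually returned

def task_f1(result: TaskResult) -> float:
    """Standard set-based F1: harmonic mean of precision and recall."""
    if not result.returned or not result.expected:
        return 0.0
    true_positives = len(result.expected & result.returned)
    precision = true_positives / len(result.returned)
    recall = true_positives / len(result.expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(results: list[TaskResult]) -> float:
    """Average of per-task F1 scores, matching the aggregate numbers reported."""
    return sum(task_f1(r) for r in results) / len(results) if results else 0.0
```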

System Components

Simulated E-Shop Testbed

Four synthetic online stores each offering identical product data through HTML, MCP, and NLWeb interfaces, enabling controlled cross-interface comparisons
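
The shop implementation itself is not included in this summary. As a rough sketch of the core idea, the same catalog record can be rendered both as an HTML fragment for the browser agent and as the structured JSON an API-style interface might return; the Product fields and rendering helpers below are invented for illustration.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Product:
    # Hypothetical catalog record shared by all interfaces of one shop.
    sku: str
    name: str
    price_eur: float
    category: str

CATALOG = [
    Product("A-100", "Trail Running Shoes", 89.90, "footwear"),
    Product("A-101", "Running Socks (3-pack)", 12.50, "footwear"),
]

def render_html(p: Product) -> str:
    """HTML view consumed by the browser agent."""
    return (f'<div class="product" data-sku="{p.sku}">'
            f"<h2>{p.name}</h2><span class='price'>{p.price_eur:.2f} EUR</span></div>")

def render_json(p: Product) -> str:
    """Structured view an API-style interface (e.g. an MCP tool) could return."""
    return json.dumps(asdict(p))

if __name__ == "__main__":
    for product in CATALOG:
        print(render_html(product))
        print(render_json(product))
```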

HTML Browser Agent

Baseline agent that navigates raw HTML pages via DOM parsing and link following, mimicking traditional web automation
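
To illustrate what raw-HTML browsing involves (not the authors' implementation), a stripped-down observe-and-decide step might fetch a page, reduce it to visible text plus candidate links, and let the LLM pick the next action. The choose_next_action helper below is a trivial placeholder for that decision step.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def observe_page(url: str) -> tuple[str, list[str]]:
    """Fetch a shop page and reduce it to text plus candidate links.

    Even this reduced observation is typically far larger than a structured
    API response, which is where the HTML agent's token overhead comes from.
    """
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return text, links

def choose_next_action(goal: str, page_text: str, links: list[str]) -> str:
    """Placeholder for the LLM decision step: pick a link or emit an answer.

    A real agent would prompt the backend model with the goal and observation.
    """
    return links[0] if links else page_text[:200]

def browse_step(url: str, goal: str) -> str:
    """One observe-decide iteration of the browsing loop."""
    text, links = observe_page(url)
    return choose_next_action(goal=goal, page_text=text, links=links)
```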

RAG Agent

Agent that queries a pre-crawled vector index of shop content using dense retrieval, augmenting LLM context with retrieved passages
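
The embedding model and index used in the paper are not specified in this summary, so the sketch below shows generic dense retrieval with cosine similarity over an in-memory index; the embed function is a deterministic stand-in for a real embedding model or API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model call; replace with a real encoder in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

class VectorIndex:
    """In-memory index over pre-crawled shop passages."""
    def __init__(self, passages: list[str]):
        self.passages = passages
        self.matrix = np.stack([embed(p) for p in passages])

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        scores = self.matrix @ embed(query)          # cosine similarity (unit vectors)
        top = np.argsort(scores)[::-1][:k]
        return [self.passages[i] for i in top]

# The retrieved passages are then placed in the LLM prompt alongside the task.
index = VectorIndex(["Trail Running Shoes, 89.90 EUR", "Running Socks 3-pack, 12.50 EUR"])
context = index.retrieve("cheapest running shoes")
```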

MCP Agent

Agent that communicates with shops through Web API calls using the Model Context Protocol (MCP), receiving structured JSON responses
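
On the client side, an MCP interaction could look roughly like the following, assuming the official mcp Python SDK and a locally launched shop server; the server command and the search_products tool name are assumptions rather than the paper's actual tool schema.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def search_shop(query: str) -> str:
    # Hypothetical shop server exposing its catalog as MCP tools.
    server = StdioServerParameters(command="python", args=["shop_mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            await session.list_tools()               # discover the shop's available tools
            result = await session.call_tool(
                "search_products",                   # assumed tool name
                arguments={"query": query},
            )
            return str(result.content)

if __name__ == "__main__":
    print(asyncio.run(search_shop("trail running shoes under 100 EUR")))
```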

NLWeb Agent

Agent that issues natural-language queries directly to a shop's NLWeb endpoint, receiving semantically processed natural-language responses
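
The exact request format used in the paper is not reproduced here; the sketch below simply sends one natural-language question to an assumed /ask endpoint of a simulated shop and parses the response, with the URL and parameter name as assumptions.

```python
import requests

SHOP_NLWEB_URL = "http://localhost:8000/ask"   # hypothetical endpoint of one simulated shop

def ask_shop(question: str) -> dict:
    """Send one natural-language query and return the parsed response body."""
    response = requests.get(SHOP_NLWEB_URL, params={"query": question}, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    answer = ask_shop("Which products would go well with the trail running shoes?")
    print(answer)
```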

Multi-LLM Evaluation Framework

Harness for running all agent-interface combinations across GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4, recording F1, completion rate, tokens, runtime, and cost
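
The harness itself is not released with this summary. As a rough sketch, a run matrix over agents and backend models could look like the following, where each agent exposes a common run(task, model) call and the harness records completion, token usage, and wall-clock runtime; the interfaces and field names are invented for illustration.

```python
import time
from dataclasses import dataclass
from typing import Protocol

@dataclass
class RunRecord:
    agent: str
    model: str
    task_id: str
    completed: bool
    tokens_used: int
    runtime_s: float

class Agent(Protocol):
    name: str
    def run(self, task: dict, model: str) -> tuple[bool, int]:
        """Return (task completed?, tokens used) -- hypothetical interface."""
        ...

def evaluate(agents: list[Agent], models: list[str], tasks: list[dict]) -> list[RunRecord]:
    records = []
    for agent in agents:
        for model in models:             # e.g. GPT 4.1, GPT 5, GPT 5 mini, Claude Sonnet 4
            for task in tasks:
                start = time.perf_counter()
                completed, tokens = agent.run(task, model)
                records.append(RunRecord(agent.name, model, task["id"], completed,
                                          tokens, time.perf_counter() - start))
    return records
```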

Results

Metric                | HTML (Baseline) | Best Non-HTML Interface                                | Delta
F1 Score (avg)        | 0.67            | 0.75–0.77 (RAG/MCP/NLWeb avg); 0.87 (RAG + GPT 5 best) | +0.08 avg; +0.20 best
Task Completion Rate  | Baseline        | 0.79 (RAG + GPT 5 best)                                | Substantial improvement
Token Usage per Task  | ~241,000        | 47,000–140,000                                         | ~1.7x–5x reduction
Runtime per Task (s)  | 291             | 50–62                                                  | ~5–6x speedup

Key Takeaways

  • Avoid raw HTML browsing for production LLM web agents whenever possible—RAG, MCP, or NLWeb interfaces deliver 5–6x lower latency, up to 5x fewer tokens, and meaningfully higher F1 with minimal additional infrastructure investment
  • For maximum accuracy, pair a RAG interface with GPT 5 (F1=0.87, completion=0.79); for cost-sensitive deployments, RAG with GPT 5 mini offers the best performance-per-dollar trade-off among all tested configurations
  • The choice of interaction interface is at least as impactful as the choice of underlying LLM, meaning teams building web agents should prioritize interface architecture decisions early rather than defaulting to HTML and relying solely on model upgrades to improve performance

Abstract

Large language model agents are increasingly used to automate web tasks such as product search, offer comparison, and checkout. Current research explores different interfaces through which these agents interact with websites, including traditional HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through the NLWeb interface. However, no prior work has compared these four architectures within a single controlled environment using identical tasks. To address this gap, we introduce a testbed consisting of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each interface (HTML, RAG, MCP, and NLWeb) we develop specialized agents that perform the same sets of tasks, ranging from simple product searches and price comparisons to complex queries for complementary or substitute products and checkout processes. We evaluate the agents using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as underlying LLM. Our evaluation shows that the RAG, MCP and NLWeb agents outperform HTML on both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77 for the other agents. Token usage falls from about 241k for HTML to between 47k and 140k per task. The runtime per task drops from 291 seconds to between 50 and 62 seconds. The best overall configuration is RAG with GPT 5 achieving an F1 score of 0.87 and a completion rate of 0.79. Also taking cost into consideration, RAG with GPT 5 mini offers a good compromise between API usage fees and performance. Our experiments show the choice of the interaction interface has a substantial impact on both the effectiveness and efficiency of LLM-based web agents.
