April 18, 2026 · 4 min read · Engineering · AI

Why we chose Claude for the heavy lifting

Atlas runs on Claude Vision for PDF parsing, Claude for RAG over your document vault, and Claude for the AI concierge that answers employee (EE) benefit questions. Here's the three-way comparison that got us there.

The Atlas team

Published April 18, 2026

We get asked a lot about which LLM stack sits under Atlas. The short version: Claude does the heavy lifting, OpenAI fills in narrow surfaces (Whisper transcription, one embedding workload), and Gemini is on the bench as a backup. This post is the comparison that got us there — not for positioning, for future-us when we have to pick again.

The workloads we care about

Atlas leans on LLMs in four places:

SBC + commission-statement PDF parsing. Turn carrier PDFs (often scanned, sometimes 80 pages) into structured rows with 99%+ field-level accuracy. This is the commercial keystone of Broker Comp and Decision Support.
RAG over the document vault. “Which plans cover GLP-1s without prior auth?” answered against the broker’s actual SBCs, SPDs, and BAAs with page citations.
AI concierge for end-users. Employees ask HSA / FSA / COBRA questions via the EE portal; the answer has to be sourced from the client’s plan, not a generic web search.
Agent tool-use across the CRM. “Draft a renewal summary for Cornerstone, compare last year’s rate with this year’s, include the disruption analysis.” Needs reliable multi-step tool invocation.

The bake-off

We ran each workload head-to-head across Claude Sonnet 4.6, GPT-4o / GPT-5 previews, and Gemini 2.5 Pro through the first quarter of 2026. Ground truth came from 4,200 carrier statements and 600 SBCs that we hand-extracted first.

PDF parsing: Claude Vision won decisively on multi-page, multi-column carrier statements. Accuracy was 99.1% for Claude versus 96.4% for GPT and 95.2% for Gemini. The gap widened on scanned (not digitally generated) statements, where Claude’s document-layout understanding was meaningfully better. Per-page cost was roughly flat across the three at our usage levels.
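For readers who want to score their own extractions, here is a minimal sketch of how a field-level accuracy figure could be computed against hand-extracted ground truth. It assumes exact match after light normalization and rows aligned by position. The function names, field names, and sample rows are hypothetical, and the matching rule is an assumption rather than a description of the harness behind the numbers above.

```python
from typing import Mapping, Sequence


def _norm(value: object) -> str:
    """Normalize a cell for comparison: stringify, collapse whitespace, lowercase."""
    return " ".join(str(value).split()).lower()


def field_level_accuracy(
    predicted: Sequence[Mapping[str, object]],
    truth: Sequence[Mapping[str, object]],
) -> float:
    """Share of ground-truth fields that the extraction matched exactly.

    Assumption: rows are aligned by position; a missing predicted row or
    field counts as wrong.
    """
    total = 0
    correct = 0
    for i, true_row in enumerate(truth):
        pred_row = predicted[i] if i < len(predicted) else {}
        for name, expected in true_row.items():
            total += 1
            if name in pred_row and _norm(pred_row[name]) == _norm(expected):
                correct += 1
    return correct / total if total else 0.0


# Hypothetical rows, for illustration only.
truth = [
    {"carrier": "Acme Health", "premium": "1200.00"},
    {"carrier": "Acme Health", "premium": "980.50"},
]
predicted = [
    {"carrier": "acme  health", "premium": "1200.00"},
    {"carrier": "Acme Health", "premium": "980.00"},  # one wrong field
]
print(field_level_accuracy(predicted, truth))  # 0.75
```

Scoring per field rather than per document keeps one bad cell on an 80-page statement from zeroing out an otherwise clean extraction, which is why a per-field target is the more informative number.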

RAG accuracy: a close race. Claude and GPT tied on answer quality when the source docs were available in-context; GPT pulled slightly ahead on long-context SPDs (80+ pages), but Claude’s 1M-context window eventually closed the gap. We picked Claude for consistency with the PDF workload: one model, one audit log.

AI concierge tone: Claude won on instruction following for refusal cases (“if you’re not sure, say so and route to a human”). This mattered more to us than raw accuracy, because wrong medical-adjacent answers to an employee are a liability, not a feature.
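A rule like “if you’re not sure, say so and route to a human” can also be backed by a small gate outside the prompt. The sketch below assumes the model returns a structured answer with a self-reported confidence flag and a list of plan-document citations; the `ConciergeAnswer` type, its fields, and the `route` function are all hypothetical, not a description of how Atlas is built.

```python
from dataclasses import dataclass, field


@dataclass
class ConciergeAnswer:
    """Hypothetical structured answer parsed from the model's reply."""
    text: str
    confident: bool                      # model's own "I'm sure" flag
    citations: list = field(default_factory=list)  # e.g. ["SPD p. 14"]


def route(answer: ConciergeAnswer) -> str:
    """Return "answer" only when the reply is confident AND sourced.

    Anything else goes to a human, so an unsourced or hedged reply is
    never shown to the employee as if it were authoritative.
    """
    if answer.confident and answer.citations:
        return "answer"
    return "human"


sourced = ConciergeAnswer("Your plan's HSA details are on the cited page.", True, ["SPD p. 14"])
unsourced = ConciergeAnswer("Probably covered.", True)
unsure = ConciergeAnswer("I'm not sure.", False, ["SPD p. 14"])

print(route(sourced), route(unsourced), route(unsure))  # answer human human
```

The design choice here is to fail closed: a missing citation is treated the same as an explicit "not sure", which matches the idea that a wrong answer costs more than a deferred one.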

Tool use: Claude’s tool-use reliability was the clincher. Multi-step agent flows stayed on the rails more consistently, and error recovery when a tool call returned an unexpected shape was visibly better.
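To make "a tool call returned an unexpected shape" concrete, here is a minimal sketch of the kind of check an agent loop can run before handing a tool result back to the model. Instead of raising, it returns a structured error the model can read and recover from. The schema, tool name, and sample values are hypothetical, and no provider SDK is involved.

```python
from typing import Any, Mapping

# Hypothetical schema for a rate-lookup tool: field name -> expected type.
RATE_LOOKUP_SCHEMA: Mapping[str, type] = {"client": str, "year": int, "rate": float}


def check_tool_result(result: Any, schema: Mapping[str, type]) -> list:
    """Return a list of human-readable problems; empty means the shape is fine."""
    if not isinstance(result, dict):
        return [f"expected an object, got {type(result).__name__}"]
    problems = []
    for key, expected_type in schema.items():
        if key not in result:
            problems.append(f"missing field '{key}'")
        elif not isinstance(result[key], expected_type):
            problems.append(
                f"field '{key}' should be {expected_type.__name__}, "
                f"got {type(result[key]).__name__}"
            )
    return problems


def to_model_feedback(tool_name: str, result: Any, schema: Mapping[str, type]) -> dict:
    """Wrap a tool result so the model sees either data or a described error."""
    problems = check_tool_result(result, schema)
    if problems:
        return {"tool": tool_name, "ok": False, "error": "; ".join(problems)}
    return {"tool": tool_name, "ok": True, "data": result}


# Illustrative values only.
good = to_model_feedback(
    "rate_lookup", {"client": "Cornerstone", "year": 2026, "rate": 412.5}, RATE_LOOKUP_SCHEMA
)
bad = to_model_feedback(
    "rate_lookup", {"client": "Cornerstone", "rate": "412.50"}, RATE_LOOKUP_SCHEMA
)
print(good["ok"], bad["ok"])
print(bad["error"])
```

How well a model recovers from the `ok: False` case (retrying with corrected arguments versus pressing on with bad data) is the behavior the comparison above is describing.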

What we kept OpenAI for

Whisper for voice transcription on calls: it’s still the best option we’ve found on latency and accuracy for US-English telephony. We also kept one embedding workload on OpenAI, where the embedding model is materially cheaper at our scale.

Where Gemini sits

On the bench. Competitive per-token pricing and improving instruction-following, but we didn’t see a workload where it won outright. We’ll re-run the bake-off each quarter.

The boring answer

“Claude does the heavy lifting” is partly about model quality, but mostly about consistency. One provider, one audit-log format, one BAA in place, one rate-limit regime. We didn’t want Atlas to be a portfolio manager of three LLM stacks with three outage profiles. We wanted one stack we could reason about end-to-end, and that’s Anthropic today.
