How to size, scope and price LLM-driven Salesforce work when the agent (not a script) is making the calls.
Agentic Salesforce isn't deterministic. The same user intent can cost 2 calls or 47, depending on your MCP design. Move cost from runtime (token + API burn) to design-time (skills + a searchable schema index + preflight). Naive MCP = pay-per-discovery. Stainless-style MCP + skills = pay-per-intent.
If you're writing a deterministic integration (MuleSoft, a Node script, a workflow), estimating Salesforce API calls is arithmetic. Count the operations, multiply by volume, add a fudge factor.
The moment an LLM is the one calling Salesforce, that arithmetic breaks. The model decides what to do. Missing fields, validation rules, picklist mismatches, ambiguous lookups: every one of those becomes a tool call the agent makes to recover. And the agent doesn't know what it doesn't know until it tries.
So you're not estimating a script. You're estimating a probability distribution of tool-call trajectories. The goal of good design is to collapse the variance.
| Layer | What it measures | Where it bleeds |
|---|---|---|
| 1. Intent → Tool Calls | How many turns the agent takes to figure out what to do | LLM tokens, latency, user patience |
| 2. Tool Calls → API Calls | How many Salesforce calls each tool invocation actually makes | Daily API limit, governor limits |
| 3. API Calls → Outcomes | How much server-side automation each call triggers | Hidden callouts, CPU, async queue |

Most teams only think about layer 2. The expensive failure mode is layer 1 → layer 2 amplification: the agent gets confused, takes 9 tool calls to do a 1-tool-call job, and each of those tool calls fans out to 1-3 SF API calls.
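To make the amplification concrete, here is a small sketch that multiplies the two layers together. The function name is illustrative; the 9 tool calls and the 1-3 fan-out are the figures from the paragraph above.

```python
def sf_call_range(tool_calls: int, fanout_min: int, fanout_max: int) -> tuple[int, int]:
    """Return the (low, high) Salesforce API calls implied by a tool-call count."""
    return tool_calls * fanout_min, tool_calls * fanout_max


# A 1-tool-call job done cleanly vs. the confused 9-tool-call trajectory,
# each tool call fanning out to between 1 and 3 Salesforce API calls.
clean = sf_call_range(1, 1, 3)
confused = sf_call_range(9, 1, 3)
print("clean:", clean, "confused:", confused)
```

The point of the range is that both factors vary per request, so the spread between the low and high ends grows with every extra tool call.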
This is your floor β what a well-designed deterministic call would consume.
| Approach | API calls | Notes |
|---|---|---|
| Single POST to /sobjects/Opportunity | 1 | Just the Opp, no children |
| Opp + 3 OpportunityLineItems, naively | 4 | 1 per record. Anti-pattern. |
| Opp + 3 OLIs via /composite/tree | 1 | Up to 200 records w/ parent-child refs in one call |
| quote.generate via Apex REST | 1 | Long-running (5-15s typical); counts against the API limit as normal |

Naive: one call per record, plus lookups. Easy to write, easy to blow limits.
Composite: a single /composite/tree payload with parent-child references resolved server-side.
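As a rough illustration of the composite shape, the sketch below builds an sObject Tree request body for one Opportunity with three nested line items. The helper name, field values, and the pricebook entry placeholder are assumptions for illustration; nothing is sent over the network.

```python
import json


def build_opportunity_tree(opp: dict, line_items: list[dict]) -> dict:
    """Build a /composite/tree body: one Opportunity with nested line items."""
    children = [
        # Each record needs a type and a referenceId that is unique in the request.
        {"attributes": {"type": "OpportunityLineItem", "referenceId": f"oli{i}"}, **item}
        for i, item in enumerate(line_items, start=1)
    ]
    record = {
        "attributes": {"type": "Opportunity", "referenceId": "opp1"},
        **opp,
        # Children hang off the parent under the child relationship name.
        "OpportunityLineItems": {"records": children},
    }
    return {"records": [record]}


payload = build_opportunity_tree(
    {"Name": "Acme - Q3", "StageName": "Prospecting", "CloseDate": "2026-09-30"},
    # Placeholder ID: a real request needs a valid PricebookEntryId.
    [{"PricebookEntryId": "<pricebook-entry-id>", "Quantity": 1, "UnitPrice": 100.0}] * 3,
)
# POSTing this body to /services/data/<version>/composite/tree/Opportunity
# creates the parent and all three children in a single API call.
print(json.dumps(payload, indent=2))
```

Four records go in, one API call is consumed, and the server resolves the parent-child links.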
If you aren't using /composite/tree or /composite/sobjects for any multi-record headless write, you're doing it wrong.

A "real" headless use case is rarely 1 call. Let's walk a credible one end-to-end and see where the calls actually go.
| # | Step | API calls | Notes |
|---|---|---|---|
| 1 | OAuth token refresh | ~0.1 | Cached for ~1h, amortized across all txns |
| 2 | Lookup Account by external ID | 1 | SOQL via /query |
| 3 | Create Opp + Quote + 5 LIs | 1 | Composite tree |
| 4 | Trigger pricing | 1 | CPQ / Revenue Cloud |
| 5 | Generate PDF | 1 | Apex REST |
| 6 | Email send | 1 | Connect API / Messaging |
| 7 | Update status | 1 | PATCH on Quote |
| | Total per transaction | ~6 | Deterministic, well-designed |

Now: 6 calls × N transactions/day = your daily burn. Multiply by 22 working days for the monthly figure. Compare the daily number to your org's daily limit.
| Edition / Licenses | Daily API limit |
|---|---|
| Enterprise · 100 users | 115,000 |
| Enterprise · 1,000 users | 1,015,000 |
| Anything over 5M | Capped at 5M (need API Bundles) |
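The burn arithmetic can be sketched in a few lines. The 6 calls per transaction, the 22 working days, and the 115,000 limit come from the text and table above; the 5,000 transactions/day volume and the function names are illustrative assumptions.

```python
def daily_burn(calls_per_txn: float, txns_per_day: int) -> float:
    """API calls consumed per day at a given transaction volume."""
    return calls_per_txn * txns_per_day


def limit_usage(calls_per_day: float, daily_limit: int) -> float:
    """Fraction of the org's daily API limit consumed."""
    return calls_per_day / daily_limit


daily = daily_burn(6, 5_000)           # assumed volume: 5,000 txns/day
monthly = daily * 22                   # 22 working days
usage = limit_usage(daily, 115_000)    # Enterprise · 100 users row
print(f"{daily:,.0f}/day, {monthly:,.0f}/month, {usage:.0%} of the daily limit")
```

This is only the deterministic floor. The agentic multipliers discussed later sit on top of it.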
This is where most estimates go wrong. The baseline assumes a perfect world. The real world has these taxes:
| Multiplier | Cost | Mitigation |
|---|---|---|
| Bulk vs REST | REST = 1 call/record. Bulk = 1 call per 10k. | Use Bulk API 2.0 for any >200-record batch. |
| Triggers / Flows with callouts | +1-4 hidden calls per record write | Audit org for callouts in automation. Move to async. |
| Polling | 1 call × poll-frequency × hours = huge | Use Platform Events or CDC. CDC delivery doesn't count. |
| OAuth refresh | ~0.1 calls/txn if cached, 1 if not | Cache tokens for their full lifetime. |
| Retry logic | +10-30% for 429s, timeouts | Exponential backoff, idempotency keys. |
| UI API / Connect API | Same daily bucket as REST | Don't assume "UI API" is free. It isn't. |
| Metadata API | Separate limit but expensive ops | Don't use it in transactional paths. |
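For the retry row, a minimal exponential-backoff sketch might look like the following. `RateLimited`, `call_with_backoff`, and the `flaky` stand-in are hypothetical names; the real request function and the way it signals a 429 or timeout depend on your client library.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller's request function on a 429 or timeout (hypothetical)."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call request(); on RateLimited, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Stand-in request that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, calls["n"], delays)
```

Every retry is still an API call, so backoff limits the damage rather than removing the tax. Retrying a write is only safe if the write is idempotent, for example an upsert keyed on an external ID.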
Everything above assumes a deterministic caller. Now imagine the caller is an LLM tool-calling against an MCP server. The same user intent ("create an opp for Acme for $50k") can play out very differently:
Best case: the agent has full context and sends one composite create. Done.
Worst case: describe, query, fail validation, re-describe, retry picklist, fail FLS, retry…
With a naive MCP that exposes one tool per endpoint (the anti-pattern most teams ship first):
User: "Create an opp for Acme for $50k"
Agent trace:
1. describe_Opportunity() → 1 SF call, ~3k tokens back
2. query_Account("Acme") → 1 SF call, returns 4 matches
3. ask_user("which Acme?") → stalls, costs a round-trip
4. create_Opportunity({...}) → FAIL: StageName required
5. describe_picklist_Stage() → 1 SF call
6. create_Opportunity({...}) → FAIL: CloseDate required
7. create_Opportunity({...}) → FAIL: validation rule "Segment__c required for Amount > $10k"
8. query_CustomField_Segment() → 1 SF call
9. create_Opportunity({...}) → finally succeeds
Total: 9 tool calls · ~6 SF API calls · ~40k tokens · 1 frustrated user
That's real. I've seen it in production traces. And it's not the model being dumb β it's the tool surface forcing it to discover the org one failure at a time.
The fix is to stop exposing Salesforce as "one tool per endpoint" and instead expose it as two tools: one to discover, one to act. We built this for bugs-sf-stainless, and the variance collapse is dramatic.
search(query)
Fuses 5 layers: SDK catalog, cookbook skills, RAG over 6,400 SF doc chunks, live web, and live org introspection. Cohere-reranked. Returns the schema, existing examples, validation rules, and patterns relevant to what the agent is about to do.

execute(python_code)
Runs Python in a 25s sandbox against a pre-authenticated sf SDK: sf.query, sf.create, sf.tooling.*, sf.metadata.deploy, sf.apex(code). The agent can batch a whole transaction in one block.
User: "Create an opp for Acme for $50k"
Agent trace:
1. search("create opportunity required fields validation rules")
   → Returns: required fields, active VRs, picklist values,
     similar Opps in the org, a working code example.
     (1 search call, no SF API call yet)
2. execute("""
   acmes = sf.query("SELECT Id,Name FROM Account WHERE Name LIKE 'Acme%'")
   # agent shows list to user, picks one
   opp = sf.create('Opportunity', {
       'Name': 'Acme - Q3', 'AccountId': acmes[0]['Id'],
       'Amount': 50000, 'CloseDate': '2026-09-30',
       'StageName': 'Prospecting', 'Segment__c': 'Enterprise'
   })
   """)
   → 2 SF API calls (1 SOQL, 1 create)
Total: 2 tool calls · 2 SF API calls · ~8k tokens · 0 retries
Skills are playbooks for known patterns. They're the third leg of the stool alongside search and execute. They cut variance hard on:
| Good for | Not so good for |
|---|---|
| Repeatable workflows: "Create Opp," "Quote-to-Cash," "Convert Lead," encoded as a checklist of required fields, recommended composite payload, common pitfalls | Novel / exploratory intents: the user asks something the skill doesn't cover, so the skill provides no lift |
| Org-specific quirks: "This org requires Segment__c and uses RecordType 'Enterprise Sale' for >$50k" is pre-loaded, so the agent doesn't have to discover it | Fast-moving metadata: a skill written 6 months ago doesn't know about the VR added last week |
| Multi-step compound use cases: the skill encodes the order, the composite shape, the cleanup | As a replacement for live introspection: skills go stale. They guide, they don't replace fresh state. |
| Anti-patterns to avoid: "Don't query LIs one at a time, use a subquery" | |
search('current Opportunity required fields and active validation rules in this org')." The skill gives the pattern; search gives the fresh state. Same shape, current data.

If you remember one thing from this page, remember this:
Failure-driven discovery is the worst possible API consumption pattern. Every validation rule the agent learns about by triggering it is a tax you pay forever.
The architectural fix is a preflight tool:
preflight_create('Opportunity', {Name:'Acme', Amount:50000})
→ {
    missing_required: ['CloseDate','StageName'],
    validation_rules: ['Segment__c required when Amount > 10000'],
    fls_issues: [],  // for running user
    picklist_values: {StageName:[...], Type:[...]},
    record_type_hint: 'Enterprise Sale',
    composite_payload: {...}  // ready-to-POST shape
  }
One call, before any write. Agent now has everything it needs to either ask the user once for all missing fields or auto-fill from context. Failure trajectory: 2 calls instead of 9.
This belongs in your MCP layer, not in every skill.
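A minimal sketch of the preflight idea follows, assuming the MCP server keeps a cached schema snapshot per object. The snapshot layout, the predicate-style validation rules, and the `invalid_picklists` and `ok` fields are assumptions for illustration. Real validation rules are formulas evaluated server-side, so a production preflight would have to translate or approximate them.

```python
# Hypothetical cached snapshot; in practice it would be built from describe
# calls and validation-rule metadata and refreshed when the org changes.
OPPORTUNITY_SCHEMA = {
    "required": ["Name", "CloseDate", "StageName"],
    "picklists": {"StageName": ["Prospecting", "Qualification", "Closed Won"]},
    # (message, predicate returning True when the rule would reject the record)
    "validation_rules": [
        (
            "Segment__c required when Amount > 10000",
            lambda r: r.get("Amount", 0) > 10000 and not r.get("Segment__c"),
        ),
    ],
}


def preflight_create(schema: dict, record: dict) -> dict:
    """Report everything that would block a create, without calling Salesforce."""
    missing = [f for f in schema["required"] if not record.get(f)]
    bad_picklists = {
        field: allowed
        for field, allowed in schema["picklists"].items()
        if field in record and record[field] not in allowed
    }
    fired = [msg for msg, would_fire in schema["validation_rules"] if would_fire(record)]
    return {
        "missing_required": missing,
        "validation_rules": fired,
        "invalid_picklists": bad_picklists,
        "ok": not (missing or bad_picklists or fired),
    }


report = preflight_create(OPPORTUNITY_SCHEMA, {"Name": "Acme", "Amount": 50000})
print(report)
```

Because the check runs against the snapshot, it costs no Salesforce API call. The trade-off is staleness, which is why the snapshot has to be refreshed from live introspection.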
What a perfect deterministic script would do. Use the per-operation table from section 3.
| Architecture | Multiplier | Why |
|---|---|---|
| Stainless-style search + execute + skills + preflight | 1.2×-1.5× | One discovery pass, one batched write, few retries. |
| Multi-tool MCP + skills, no preflight | 3×-5× | Each fail = a new tool call. Skills help but don't replace introspection. |
| Naive multi-tool MCP, no skills, no preflight | 8×-15× | Pay-per-discovery. Worst-case the agent loops on validation rules. |
Use case: 5,000 Quote-to-Cash transactions/day from a customer portal, driven by Claude + MCP.
Same use case, naive multi-tool MCP: 6 × 10 × 1.2 × 1.3 = ~94 calls/txn → 470,000/day → over the limit. Org gets throttled by 11 AM.
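The estimate above can be reproduced with a short sketch. The function name is illustrative. The second calculation applies the same 1.2 and 1.3 factors to the upper end of the 1.2×-1.5× row, which is an assumption made here for comparison rather than a figure from the text.

```python
from math import prod


def calls_per_txn(baseline: float, *multipliers: float) -> float:
    """Baseline deterministic calls scaled by architecture and other multipliers."""
    return baseline * prod(multipliers)


TXNS_PER_DAY = 5_000

# Naive multi-tool MCP: the 6 × 10 × 1.2 × 1.3 figure from the text.
naive = calls_per_txn(6, 10, 1.2, 1.3)
naive_daily = naive * TXNS_PER_DAY

# Assumed comparison: top of the search + execute + skills + preflight range.
designed = calls_per_txn(6, 1.5, 1.2, 1.3)
designed_daily = designed * TXNS_PER_DAY

print(f"naive:    {naive:.1f} calls/txn, {naive_daily:,.0f}/day")
print(f"designed: {designed:.1f} calls/txn, {designed_daily:,.0f}/day")
```

Under these assumptions the architecture multiplier is the only term that changes between the two runs, and it dominates the result.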
A preflight_create call before a write saves 5+ retry calls.