The Platform Decision Nobody’s Making Correctly
Your model question isn't "which one is best." It's three decisions your team is collapsing into one.
You’ve diagnosed why your organization is stuck. The four inertia forces from Article 9 “Why Your Organization Is Stuck” – compliance walls, bureaucracy traps, mental model problems, verification bottlenecks – are real. But here’s what diagnosis alone doesn’t tell you: the next decision your organization makes about AI platforms will either begin unwinding that inertia or compound it for years.
Same model. Same task. Different infrastructure around it. 78% accuracy in one harness. 42% in another.
That’s what the CORE benchmark found when testing Claude Opus 4.5 across different agent frameworks – results documented at the AI Engineer Summit and analyzed by ML6. The model was identical. The harness – the infrastructure that determines how a model receives instructions, manages context, and delivers results – made the difference between passing and failing.
And it’s not an isolated finding. When SWE-bench upgraded its evaluation scaffold in February 2026, the same models on the same tasks saw performance jump significantly. The models hadn’t improved. The infrastructure around them had. Sierra’s τ-bench found that even the best-performing model (GPT-4o) couldn’t break 50% average success across real-world agent tasks – and that was the best result out of 12 frontier models tested.
The implication is uncomfortable: the vendor comparison spreadsheet your team assembled last quarter was probably measuring the wrong variable. You were comparing models. The performance gap lives in the harness.
If that doesn’t reframe how you’re thinking about your AI investments, nothing in this article will. But if it does, then what follows is the diagnostic process I use when organizations ask me which model they should choose.
Because the answer is always the same: wrong question. The model question isn’t one decision. It’s three. And most enterprises are collapsing all three into a single vendor evaluation that was over before it started.
The Accidental Strategy
For nine articles, this series has been building the foundation: what specification means, why it matters, how to know when yours is good enough, why your organization can’t absorb what’s available. Now the terrain shifts from how to think about AI to how to build with AI.
The first architecture question every enterprise faces is: which platform?
It sounds like a technology evaluation. It isn’t. It’s three sequential decisions, each operating at a different altitude, each requiring different organizational maturity.
Decision 1: Philosophy – What theory of human-AI collaboration fits your organization?
Decision 2: Fit – Which tasks belong on which models, and what does sovereignty require?
Decision 3: Lock-in – How fast are your switching costs compounding, and do you know where?
Get the sequence wrong and you end up where most enterprises are right now: locked into a vendor philosophy they chose by accident, running tasks on models that don’t match their difficulty profile, accumulating switching costs nobody’s tracking – and, for European organizations, potentially building critical infrastructure on platforms that can’t meet their regulatory obligations.
Here’s the pattern I keep seeing: whoever set up the first pilot picked a model. The team built around it. Workflows hardened. Context infrastructure accumulated. Three months later, someone suggested evaluating an alternative. The evaluation took a week. The migration would take a quarter. There was never a decision – there was a default that hardened into strategy.
Decision 1: The Philosophy Test
The frontier has split. Not on capability – the top models increasingly match each other on standard evals, and the gap narrows with each release cycle. The split is on philosophy. On fundamentally different theories about how humans and AI should work together.
Two paradigms are useful as a diagnostic lens – not a clean binary, but a spectrum that reveals how your organization thinks about control and delegation.
The Delegation End. OpenAI’s Codex embodies this. You define the task. The system disappears into a sealed environment. It comes back with finished work. You review the output, not the process. Think of it as sliding a brief under a door and getting a deliverable back. The system operates in isolated containers with sandbox and approval controls.
The Coordination End. Anthropic’s Claude Code represents this. The system works alongside you. It has access to your environment, your files, your context. You can intervene, redirect, course-correct in real time. Think of it as a collaborator at your desk, working across your local files with full session context.
These aren’t rigid categories – both platforms are evolving, and most real workflows land somewhere between the poles. But the spectrum reveals something about your organization that matters more than any benchmark: where does your team’s judgment add the most value?
If your processes are well-documented, your success criteria are checkable, and your team’s value is in evaluation rather than improvisation – you lean delegation. If your work involves ambiguity that resolves through iteration, if your best people add value by recognizing patterns mid-process – you lean coordination.
Most enterprises need both, depending on the task. Inbal Shani, formerly of GitHub, put it well: “We will live in a hybrid world where you still have specific AI models solving very specific problems... eventually we will find ourselves in the world of hybrid models and multi-models where several LLM models come together because each one of them will have their own benefit.” Aerospace, automotive, financial services – any domain with high safety regulation – will need specialized models alongside general-purpose ones.
This is where the specification thread from Articles 2, 5, and 8 connects directly. Delegation requires specification maturity upfront – before the system starts. If your team can’t pass the Specificity Audit on a given task, delegation will produce confident-looking output that misses the mark. Coordination requires specification maturity distributed throughout the process – the ability to recognize misalignment as it emerges and correct it precisely.
The philosophy decision isn’t about which model is smarter. It’s about where your organization’s specification capability is strongest.
The Philosophy Diagnostic
Score each question 1-5 for your top AI use cases:
How well-documented are our processes for this task? (1 = tribal knowledge, 5 = checkable specification)
Can we define “done” before the work starts? (1 = “I’ll know it when I see it,” 5 = a stranger could verify)
Where does human judgment add the most value? (1 = during execution, 5 = in the review)
How predictable is the output format? (1 = every output is different, 5 = highly structured)
What’s the cost of a wrong-but-plausible output? (1 = catastrophic, 5 = easily caught and corrected)
Score 20-25: Delegation-ready. Your specification maturity supports sealed-environment execution.
Score 12-19: Hybrid zone. Some tasks delegate, others need coordination. Map per task.
Score 5-11: Coordination-dependent. Your team needs to steer work in real time. Invest in specification maturity before delegating.
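The band mapping above is mechanical enough to encode. Here is a minimal sketch, assuming the five questions and three bands exactly as given; the dictionary keys are illustrative names I've chosen, not anything the diagnostic prescribes:

```python
# Sketch of the Philosophy Diagnostic rollup. Question keys are illustrative.
QUESTIONS = [
    "process_documentation",   # 1 = tribal knowledge, 5 = checkable specification
    "definition_of_done",      # 1 = "know it when I see it", 5 = stranger-verifiable
    "judgment_locus",          # 1 = during execution, 5 = in the review
    "output_predictability",   # 1 = every output different, 5 = highly structured
    "plausible_error_cost",    # 1 = catastrophic, 5 = easily caught and corrected
]

def philosophy_band(scores: dict[str, int]) -> str:
    """Map five 1-5 scores to the article's delegation/coordination bands."""
    if set(scores) != set(QUESTIONS):
        raise ValueError(f"expected scores for exactly: {QUESTIONS}")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("each score must be between 1 and 5")
    total = sum(scores.values())
    if total >= 20:
        return "delegation-ready"
    if total >= 12:
        return "hybrid"
    return "coordination-dependent"

# Illustrative use: a tightly specified documentation workflow.
eng_docs = {
    "process_documentation": 5,
    "definition_of_done": 4,
    "judgment_locus": 4,
    "output_predictability": 5,
    "plausible_error_cost": 3,
}
print(philosophy_band(eng_docs))  # delegation-ready (total 21)
```

Run it per workflow, not per organization — the point of the diagnostic is that different workflows in the same company land in different bands.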
Decision 2: Specification-Model Fit
Once you know your philosophy profile – likely a mix across different workflows – the second decision is granular: which tasks go on which models, and what constraints does your operating environment impose?
This is where the harness research becomes operational. The model matters less than most teams think. The infrastructure around it matters more than most teams realize. But what matters most is whether the task’s difficulty profile matches the model’s strengths – and whether the model can legally and practically run in your environment.
The Difficulty Profile
Not every task is hard in the same way. Practitioners like Nate B. Jones have mapped six axes along which tasks vary in difficulty: reasoning complexity, effort required, coordination needs, emotional intelligence, domain expertise, and tolerance for ambiguity. A contract review and a customer email are both “AI tasks,” but they’re hard along completely different dimensions.
The mistake enterprises keep making: treating model selection as a capability question (“Which model is smartest?”) instead of a fit question (“Which model handles this kind of difficulty best in our operating environment?”).
The Four-Lane Architecture
The current landscape, as of March 2026, isn’t a two-horse race. It’s a four-lane architecture, and enterprises that don’t see all four lanes are making the platform decision with incomplete information.
Lane 1: US Frontier Cloud (Anthropic, OpenAI). The highest-capability models for complex reasoning, agentic workflows, and multi-step coordination. In current deployments, Anthropic’s Claude tends to lead on agentic coordination tasks – navigating complex environments and maintaining context across multi-step workflows – while OpenAI’s Codex and GPT models tend to excel at structured infrastructure and systematic execution. These positions shift with every major release, which is precisely the point: if your architecture depends on a specific model’s current strengths, you’re building on sand. These are the models most enterprises evaluate first – and often the only ones they evaluate. The limitation: data leaves your environment, processing happens on US-controlled infrastructure, and you’re subject to US jurisdiction.
Lane 2: Google/Workspace Ecosystem. Google’s Gemini models offer strong reasoning at aggressive price points, making them viable for high-volume tasks where cost-per-query determines feasibility. But the strategic play isn’t raw model capability – it’s deep integration with the Google Workspace ecosystem that many enterprises already run. For organizations whose workflows live in Google Docs, Sheets, and Gmail, the switching cost to non-Google AI may be higher than the capability gap.
Lane 3: EU-Sovereign (Mistral, Aleph Alpha). This lane didn’t exist eighteen months ago. It does now. Mistral is on track to exceed €1 billion in revenue by end of 2026, valued at €11.7 billion after a €1.7 billion Series C led by ASML. Their Forge platform – launched at NVIDIA GTC in March 2026 – lets enterprises build custom AI models trained exclusively on proprietary data, while the separately announced Mistral Compute initiative is building Europe’s largest AI infrastructure with 18,000 NVIDIA Grace Blackwell chips independent of US cloud providers. Partnerships with Accenture, Reply, Ericsson, and the European Space Agency signal where this is heading. For European enterprises in regulated industries – financial services, healthcare, defense, public administration – Mistral and peers offer something US frontier models cannot: data sovereignty by architecture, not by contract.
Lane 4: Self-Hosted Open Source. Llama, Mixtral, and other open-weight models running on your own infrastructure. The trade-off is stark: full control over data residency and processing, lower per-query costs at scale, but significantly lower ceiling on complex reasoning tasks and the full operational burden of hosting, updating, and securing the infrastructure. For specific, well-defined tasks with tight specifications – document classification, structured extraction, routine summarization – this lane often delivers the best cost-performance ratio.
Why This Matters for European Enterprises
This isn’t an abstract architecture discussion. Gartner’s 2025 CIO survey – covering over 200 Western European CIOs – found that 61% plan to increase reliance on local cloud and AI providers in response to sovereignty and regulatory pressure. The direction is unambiguous: European enterprises are actively reevaluating non-European cloud dependencies.
The EU AI Act, fully applicable by August 2026, classifies general-purpose AI models by risk tier and imposes transparency, documentation, and compliance obligations that flow through to every enterprise deploying them. If your model provider can’t demonstrate compliance, the liability sits with you.
For German enterprises the picture is sharper still. The vast majority of Mittelstand firms have not implemented AI in operational practice – and those that have invested are spending significantly less than the broader market. The capability gap described in Article 9 is compounding. And when these organizations do move, they’ll need architectures that satisfy German data protection authorities, C5 certification requirements, and industry-specific regulation from day one. Building on US-only infrastructure and retrofitting sovereignty later is the most expensive path available.
The principle holds regardless of lane: your specification determines your model, not the other way around. A task that requires high reasoning and low coordination has a different optimal model (and lane) than a task requiring domain expertise in a regulated context. Teams running multi-model routing – directing different tasks to different models based on characteristics – report significant cost reductions. The RouteLLM framework, published at ICLR 2025, demonstrated that trained routers deliver up to 85% cost reduction while maintaining 95% of GPT-4-level performance. The multi-model future isn’t a prediction. It’s already operational for teams with the specification maturity to route intelligently.
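The routing idea is simpler than it sounds. This is a toy sketch of the general pattern — estimate difficulty, route above a threshold to the expensive model — and not RouteLLM's actual API; the heuristic scorer and model names are placeholders where a trained router would use a classifier learned from preference data:

```python
# Generic threshold-based model routing sketch. Not RouteLLM's API;
# `estimate_difficulty` is a toy stand-in for a trained router.

def estimate_difficulty(prompt: str) -> float:
    """Toy proxy: long or analysis-heavy prompts score as harder.
    A production router replaces this with a trained classifier."""
    score = min(len(prompt) / 2000, 1.0)
    if "step by step" in prompt.lower() or "analyze" in prompt.lower():
        score = max(score, 0.7)
    return score

def route(prompt: str, threshold: float = 0.5) -> str:
    """Send easy queries to the cheap model, hard ones to the frontier model."""
    if estimate_difficulty(prompt) >= threshold:
        return "frontier-model"
    return "budget-model"

print(route("Summarize this memo in two sentences."))          # budget-model
print(route("Analyze the contract clauses step by step."))     # frontier-model
```

The cost savings come entirely from how often the threshold keeps traffic off the frontier model — which is why routing quality is a specification problem before it is a model problem.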
The practical question: does your team have the specification maturity to articulate why a task is hard and what constraints govern where it can run? If you can’t specify both, you’ll default to your most expensive option for everything – which is how most enterprises are operating right now.
The Fit Diagnostic
For each high-value AI workflow, answer:
What makes this task hard? (Map to the six difficulty axes: reasoning, effort, coordination, EQ, domain expertise, ambiguity tolerance)
What data does this task touch? (Public, internal-confidential, personal data, regulated)
What’s the volume and cost sensitivity? (One-off vs. thousands per day; budget per query)
What’s the required quality floor? (Must be right every time vs. 80% is useful)
What regulatory regime applies? (None, GDPR, EU AI Act high-risk, sector-specific)
Map the answers to lanes. High reasoning + non-sensitive data → Lane 1. High volume + cost-sensitive + moderate quality → Lane 2 or 4. Regulated data + domain expertise → Lane 3 or 4. Most enterprises will operate across all four lanes. That’s the architecture. The question is whether you designed it or stumbled into it.
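The lane-mapping rules above can be sketched as a small decision function. This is an illustrative skeleton, not a complete policy — the `TaskProfile` fields, thresholds, and lane strings are my assumptions layered on the article's mapping (regulated data first, then reasoning, then volume/cost):

```python
# Hedged sketch of the Fit Diagnostic lane mapping. Field names and
# thresholds are illustrative; a real policy needs all five questions.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    reasoning: int        # 1-5, from the difficulty axes
    data_class: str       # "public" | "internal" | "regulated"
    daily_volume: int
    cost_sensitive: bool

def suggest_lanes(t: TaskProfile) -> list[str]:
    """Return candidate lanes, most preferred first."""
    if t.data_class == "regulated":
        # Sovereignty by architecture, not by contract.
        return ["Lane 3: EU-sovereign", "Lane 4: self-hosted"]
    if t.reasoning >= 4:
        return ["Lane 1: US frontier cloud"]
    if t.cost_sensitive and t.daily_volume > 1000:
        return ["Lane 2: Google ecosystem", "Lane 4: self-hosted"]
    return ["Lane 1: US frontier cloud", "Lane 2: Google ecosystem"]

# Illustrative use: a regulated extraction task routes away from Lane 1.
task = TaskProfile(reasoning=2, data_class="regulated",
                   daily_volume=500, cost_sensitive=True)
print(suggest_lanes(task)[0])  # Lane 3: EU-sovereign
```

Note the ordering: the regulatory check fires before any capability check, because a lane that is illegal for the data is not a candidate at any capability level.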
Decision 3: The Lock-In Audit
This is the decision nobody’s making at all.
Every week your team works within a model ecosystem, they’re building dependencies. Every custom instruction file, every configured workflow, every automated pipeline – these aren’t just tools. They’re commitments. And they compound.
The pattern I call The Compounding Window works like this: switching costs in AI tooling don’t accumulate linearly. They compound. The first month, migration is trivial. By month six, it’s a project. By month eighteen, it’s a strategic initiative that needs executive sponsorship and a dedicated team. A 2026 Forrester survey found that 74% of CIOs regret at least one major AI vendor selection made in the past 18 months. The switching costs were invisible while accumulating. They became visible only when someone tried to move – or when they realized the platform they’d committed to couldn’t meet requirements that had shifted since the original decision.
The Lock-In Audit makes this visible before it becomes irreversible. Six dimensions:
1. Execution Patterns. How deeply has your team adapted their working style to a specific tool’s interaction model? If your developers think in “Claude projects” or “Codex tasks,” the tool’s mental model has become their mental model. Score: 1 (tool-agnostic habits) to 5 (deeply platform-specific).
2. Context Infrastructure. How much institutional knowledge lives in tool-specific formats? Custom instruction files, system prompts, configured rules, CLAUDE.md files, Codex linter configurations – these are specifications that live inside a vendor’s ecosystem. They represent real intellectual work. Can you export them? Score: 1 (portable, documented externally) to 5 (locked in vendor format, no export path).
3. Integration Depth. How many workflows depend on tool-specific APIs, plugins, or automation? Every integration is a switching cost. Score: 1 (API-agnostic abstractions) to 5 (deep proprietary integration).
4. Skill Specialization. Has your team developed skills specific to one platform? Prompt patterns that work for one model but not others? Debugging intuitions that assume specific failure modes? Human capital lock-in is the hardest to audit and the most expensive to unwind. Score: 1 (transferable skills) to 5 (platform-specific expertise).
5. Evaluation Infrastructure. Do your quality metrics and review processes assume a specific model’s output patterns? If your review process was built around one model’s strengths and blind spots, it won’t transfer cleanly. Score: 1 (model-agnostic evaluation criteria) to 5 (model-specific quality assumptions).
6. Regulatory Exposure. Does your current platform meet the regulatory requirements for all the data flowing through it? If you’ve been sending regulated data through a US-hosted model because nobody checked – that’s not just lock-in, it’s liability compounding alongside switching cost. Score: 1 (fully compliant, documented) to 5 (regulatory gaps unaudited).
Total score 6-12: Green. Your architecture is portable. You chose your platform; you can choose differently.
Total score 13-21: Yellow. Migration effort required. Map the specific dimensions scoring highest and build portability into your next quarter’s work.
Total score 22-30: Red. Significant rewrite territory. Your switching window is closing. That doesn’t mean you should switch. It means you should know you’re choosing not to – and you should stop telling leadership you have a “multi-model strategy” when you have a single-vendor dependency.
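The audit rollup above can be encoded directly. A minimal sketch, assuming the six dimensions and bands exactly as listed — the dimension keys and the example scores are illustrative:

```python
# Sketch of the Lock-In Audit rollup across the six dimensions above.
DIMENSIONS = [
    "execution_patterns",
    "context_infrastructure",
    "integration_depth",
    "skill_specialization",
    "evaluation_infrastructure",
    "regulatory_exposure",
]

def lockin_status(scores: dict[str, int]) -> str:
    """Map six 1-5 dimension scores to green/yellow/red."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        # An unscored dimension is itself a finding: you don't know your exposure.
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    total = sum(scores[d] for d in DIMENSIONS)
    if total <= 12:
        return "green"
    if total <= 21:
        return "yellow"
    return "red"

# Illustrative scores for a single-vendor deployment.
scores = {
    "execution_patterns": 4,
    "context_infrastructure": 4,
    "integration_depth": 2,
    "skill_specialization": 3,
    "evaluation_infrastructure": 4,
    "regulatory_exposure": 5,
}
print(lockin_status(scores))  # red (total 22)
```

Running this quarterly and plotting the total over time is the cheapest way to watch the Compounding Window close — or to prove it isn't.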
A Worked Example: The Mittelstand Manufacturer
Consider a composite example drawn from the patterns I see across European mid-market firms.
A German manufacturing company – 2,000 employees, €400M revenue, automotive supply chain – ran their first AI pilot eighteen months ago. An enthusiastic IT lead signed up for an enterprise ChatGPT license. The team built prompt libraries for quality inspection reports. They configured custom instructions for their internal documentation style. Three departments adopted it for different tasks: engineering for technical documentation, sales for proposal generation, HR for job descriptions.
Eighteen months later, the CISO raises a question: customer data from the automotive OEMs flows through these prompts. The data leaves the EU. The OEM contracts require C5-certified infrastructure. Nobody checked.
Running the Three Decisions retroactively:
Philosophy Test: Engineering documentation scores 21 (delegation-ready – specifications are tight, outputs are structured). Sales proposals score 13 (hybrid – each proposal requires mid-process judgment). HR job descriptions score 22 (delegation-ready). The organization needs both philosophies, but they’ve been running everything through the same tool with the same approach.
Fit Diagnostic: Engineering docs touch regulated OEM data → needs Lane 3 or 4 (EU-sovereign or self-hosted). Sales proposals handle commercially sensitive pricing but no regulated personal data → Lane 1 or 2 is viable. HR job descriptions are internal, low-sensitivity → any lane works, optimize for cost. The firm has been running all three through Lane 1, paying frontier prices for HR tasks that could run on a self-hosted model at a fraction of the cost, while simultaneously exposing regulated data that shouldn’t leave their infrastructure.
Lock-In Audit: Execution patterns score 4 (the team thinks in ChatGPT workflows). Context infrastructure scores 4 (prompt libraries, custom instructions, department-specific configurations – all in OpenAI’s format). Integration depth scores 2 (API connections are light). Skill specialization scores 3 (the team has learned ChatGPT’s patterns, not general prompting). Evaluation infrastructure scores 4 (review habits tuned to one model’s output style). Regulatory exposure scores 5 (OEM data flowing through US infrastructure, unaudited). Total: 22 – red zone. Not just a switching cost problem. A compliance problem compounding alongside it.
The prescription: Move engineering documentation to Mistral or a self-hosted Mixtral instance on C5-certified infrastructure. Keep sales proposals on a frontier model with proper data handling agreements. Route HR tasks to the cheapest model that meets the quality floor. Build the prompt libraries in a portable format. Run the Lock-In Audit quarterly.
That’s three decisions, made deliberately, producing an architecture that matches task to capability, respects sovereignty, and preserves the ability to change direction. It took a diagnostic framework. It could have been the default from day one.
The Specification Thread
There’s a line running through this series that this article makes explicit.
Article 2 asked: why do AI tools disappoint? Because you haven’t specified what “good” means. Article 5 said: specification is the skill that determines success. Article 8 showed why that skill is harder than it sounds and gave you the Specificity Audit. Article 9 diagnosed why your organization can’t absorb capabilities that already exist.
This article is where specification meets architecture. Your specification determines your model. Your model determines your harness. Your harness compounds your lock-in. And for European enterprises, your regulatory environment determines which lanes are even available. The chain only works if the specification is solid. Everything upstream flows downstream.
The enterprises that invested in specification maturity – that can articulate what “good” looks like at the task level, that can characterize why a task is hard, that can pass the Specificity Audit on their critical workflows – those enterprises can run the Three-Decision Framework. They can match philosophy to workflow, model to task, and track lock-in deliberately.
The enterprises that skipped the specification work are choosing vendors based on demos and benchmarks. They’re making the model decision upstream of specification – the exact inversion of what works.
The platform decision reveals your specification maturity. And right now, most enterprises are revealing more than they’d like.
If your team is evaluating models right now – or if you’ve already chosen and you’re wondering whether you chose for the right reasons – I’d like to hear what your decision process looked like. What were you actually comparing? And did anyone ask where the data goes?

