Llms.txt Was Step One. Here’s The Architecture That Comes Next via @sejournal, @DuaneForrester

Brands must move beyond llms.txt toward structured APIs, entity graphs, and provenance to earn accurate AI citations.


The conversation around llms.txt is real and worth continuing. I covered it in a previous article, and the core instinct behind the proposal is correct: AI systems need clean, structured, authoritative access to your brand’s information, and your current website architecture was not built with that in mind. Where I want to push further is on the architecture itself. llms.txt is, at its core, a table of contents pointing to Markdown files. That is a starting point, not a destination, and the evidence suggests the destination needs to be considerably more sophisticated.

Before we get into architecture, I want to be clear about something: I am not arguing that every brand should sprint to build everything described in this article by next quarter. The standards landscape is still forming. No major AI platform has formally committed to consuming llms.txt, and an audit of CDN logs across 1,000 Adobe Experience Manager domains found that LLM-specific bots were essentially absent from llms.txt requests, while Google’s own crawler accounted for the vast majority of file fetches. What I am arguing is that the question itself, specifically how AI systems gain structured, authoritative access to brand information, deserves serious architectural thinking right now, because the teams that think it through early will define the patterns that become standards. That is not a hype argument. That is just how this industry has worked every other time a new retrieval paradigm arrived.

Where Llms.txt Runs Out Of Road

The proposal’s honest value is legibility: it gives AI agents a clean, low-noise path into your most important content by flattening it into Markdown and organizing it in a single directory. For developer documentation, API references, and technical content where prose and code are already relatively structured, this has real utility. For enterprise brands with complex product sets, relationship-heavy content, and facts that change on a rolling basis, it is a different story.

The structural problem is that llms.txt has no relationship model. It tells an AI system “here is a list of things we publish,” but it cannot express that Product A belongs to Product Family B, that Feature X was deprecated in Version 3.2 and replaced by Feature Y, or that Person Z is the authoritative spokesperson for Topic Q. It is a flat list with no graph. When an AI agent is doing a comparison query, weighting multiple sources against each other, and trying to resolve contradictions, a flat list with no provenance metadata is exactly the kind of input that produces confident-sounding but inaccurate outputs. Your brand pays the reputational cost of that hallucination.

There is also a maintenance burden question that the proposal does not fully address. One of the strongest practical objections to llms.txt is the ongoing upkeep it demands: every strategic change, pricing update, new case study, or product refresh requires updating both the live site and the file. For a small developer tool, that is manageable. For an enterprise with hundreds of product pages and a distributed content team, it is an operational liability. The better approach is an architecture that draws from your authoritative data sources programmatically rather than creating a second content layer to maintain manually.
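The “draw from the source of truth” idea can be sketched in a few lines. This is an illustrative example, not a prescribed implementation: the pricing record, tier names, and field names are all hypothetical. The point is that the human-facing page and the machine-facing markup are both rendered from one record, so neither can drift out of date independently.

```python
import json

# Hypothetical single source of truth. In practice this would be your
# CMS, pricing database, or product catalog, not a literal dict.
PRICING = {"Pro": {"price": "49.00", "currency": "USD", "billing": "monthly"}}

def render_pricing_html(tier: str) -> str:
    """Human-facing view, generated from the shared record."""
    p = PRICING[tier]
    return f"<p>{tier}: {p['price']} {p['currency']} / {p['billing']}</p>"

def render_pricing_jsonld(tier: str) -> str:
    """Machine-facing view (a schema.org Offer), generated from the same record."""
    p = PRICING[tier]
    offer = {
        "@context": "https://schema.org",
        "@type": "Offer",
        "name": tier,
        "price": p["price"],
        "priceCurrency": p["currency"],
    }
    return json.dumps(offer)
```

When the price changes in the one record, both renderings update together, which is exactly the property a manually maintained llms.txt file lacks.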

The Machine-Readable Content Stack

Think of what I am proposing not as an alternative to llms.txt, but as what comes after it, just as XML sitemaps and structured data came after robots.txt. There are four distinct layers, and you do not have to build all of them at once.

Layer one is structured fact sheets using JSON-LD. When an AI agent evaluates a brand for a vendor comparison, it reads Organization, Service, and Review schema, and in 2026, that means reading it with considerably more precision than Google did in 2019. This is the foundation. Pages with valid structured data are 2.3x more likely to appear in Google AI Overviews compared to equivalent pages without markup, and the Princeton GEO research found content with clear structural signals saw up to 40% higher visibility in AI-generated responses. JSON-LD is not new, but the difference now is that you should treat it not as a rich-snippet play but as a machine-facing fact layer, which means being far more precise about product attributes, pricing states, feature availability, and organizational relationships than most implementations currently are.
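As a minimal sketch of what a machine-facing fact layer looks like, here is a JSON-LD graph built in Python. The organization name, URLs, and price are placeholders; the detail to notice is the @id cross-reference, which lets a consuming system resolve the product back to its manufacturer rather than inferring the relationship from prose.

```python
import json

# Sketch of a fact layer: Organization and Product nodes interlinked
# via @id so a consumer can resolve relationships explicitly.
# All names, URLs, and values are illustrative placeholders.
fact_layer = {
    "@context": "https://schema.org",
    "@graph": [
        {
            "@type": "Organization",
            "@id": "https://example.com/#org",
            "name": "Example Corp",
        },
        {
            "@type": "Product",
            "@id": "https://example.com/products/widget#product",
            "name": "Widget Pro",
            # Reference, not a copy: points at the Organization node above.
            "manufacturer": {"@id": "https://example.com/#org"},
            "offers": {
                "@type": "Offer",
                "price": "99.00",
                "priceCurrency": "USD",
                "availability": "https://schema.org/InStock",
            },
        },
    ],
}

print(json.dumps(fact_layer, indent=2))
```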

Layer two is entity relationship mapping. This is where you express the graph, not just the nodes. Your products relate to your categories, your categories map to your industry solutions, your solutions connect to the use cases you support, and all of it links back to the authoritative source. This can be implemented as a lightweight JSON-LD graph extension or as a dedicated endpoint in a headless CMS, but the point is that a consuming AI system should be able to traverse your content architecture the way a human analyst would review a well-organized product catalog, with relationship context preserved at every step.
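One way to picture that traversal is a toy graph keyed by identifier, with typed edges between entities. Everything below is illustrative, not a formal vocabulary; the point is that a query about a product can be resolved to its solution category by following edges, without parsing any prose.

```python
# Toy entity graph: nodes keyed by identifier, edges as typed relationships.
# Identifiers and edge names are illustrative, not a standard schema.
GRAPH = {
    "product:a": {"type": "Product", "name": "Product A", "memberOf": "family:b"},
    "family:b": {"type": "ProductFamily", "name": "Family B", "servesSolution": "solution:pm"},
    "solution:pm": {"type": "Solution", "name": "Project Management"},
}

def resolve_solution(product_id: str) -> str:
    """Walk product -> family -> solution and return the solution name."""
    family_id = GRAPH[product_id]["memberOf"]
    solution_id = GRAPH[family_id]["servesSolution"]
    return GRAPH[solution_id]["name"]

print(resolve_solution("product:a"))  # → Project Management
```

A real implementation would serve this as a JSON-LD graph or a CMS endpoint, but the traversal logic a consumer performs is the same shape.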

Layer three is content API endpoints: programmatic and versioned access to your FAQs, documentation, case studies, and product specifications. This is where the architecture moves beyond passive markup and into active infrastructure. An endpoint at /api/brand/faqs?topic=pricing&format=json that returns structured, timestamped, attributed responses is a categorically different signal to an AI agent than a Markdown file that may or may not reflect current pricing. The Model Context Protocol, introduced by Anthropic in late 2024 and subsequently adopted by OpenAI, Google DeepMind, and the Linux Foundation, provides exactly this kind of standardized framework for integrating AI systems with external data sources. You do not need to implement MCP today, but AI-to-brand data exchange is clearly heading toward structured, authenticated, real-time interfaces, and your architecture should be building in that direction. I have been saying for years that we are moving toward plugged-in systems for the real-time exchange and understanding of a business's data. This is what ends crawling, and the cost it imposes on platforms along with it.
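To make the endpoint idea concrete, here is a sketch of what such a response body might contain. The field names, version tag, and FAQ content are hypothetical; the article's example path is the inspiration, not a published spec.

```python
import json
from datetime import datetime, timezone

# Hypothetical FAQ store, keyed by topic. In production this would be
# populated from the CMS rather than hardcoded.
FAQS = {
    "pricing": [
        {"q": "Does the Pro tier include SSO?", "a": "Yes, on annual plans."},
    ]
}

def faq_response(topic: str) -> str:
    """Build the JSON body for a request like /api/brand/faqs?topic=pricing."""
    body = {
        "topic": topic,
        "version": "2026-01",  # illustrative version tag
        "updated": datetime.now(timezone.utc).isoformat(),
        "source": "https://example.com/pricing",  # authoritative source URL
        "faqs": FAQS.get(topic, []),
    }
    return json.dumps(body)
```

Note that the timestamp, version, and source fields are doing the signaling work: they tell the consuming agent this is current, attributed data rather than text of unknown age.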

Layer four is verification and provenance metadata: timestamps, authorship, update history, and source chains attached to every fact you expose. This is the layer that transforms your content from “something the AI read somewhere” into “something the AI can verify and cite with confidence.” When a RAG system is deciding which of several conflicting facts to surface in a response, provenance metadata is the tiebreaker. A fact with a clear update timestamp, an attributed author, and a traceable source chain will outperform an undated, unattributed claim every single time, because the retrieval system is trained to prefer it.
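A minimal sketch of attaching provenance to a fact follows, with illustrative field names rather than any formal standard; the idea is simply that no fact leaves your system without an author, a source, a version, and a timestamp.

```python
from datetime import datetime, timezone

def with_provenance(fact: dict, author: str, source_url: str, version: str) -> dict:
    """Wrap a fact with provenance fields a retrieval system can use
    as a tiebreaker against undated, unattributed claims."""
    return {
        **fact,
        "provenance": {
            "author": author,
            "source": source_url,
            "version": version,
            "updated": datetime.now(timezone.utc).isoformat(),
        },
    }

# Illustrative usage: values are placeholders.
fact = with_provenance(
    {"claim": "Enterprise tier includes audit logs"},
    author="Product Marketing",
    source_url="https://example.com/features",
    version="3.2",
)
```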

What This Looks Like In Practice

Take a mid-market SaaS company, a project management platform doing around $50 million ARR and selling to both SMBs and enterprise accounts. They have three product tiers, an integration marketplace with 150 connectors, and a sales cycle where competitive comparisons happen in AI-assisted research before a human sales rep ever enters the picture.

Right now, their website is excellent for human buyers but opaque to AI agents. Their pricing page is dynamically rendered JavaScript. Their feature comparison table lives in a PDF that the AI cannot parse reliably. Their case studies are long-form HTML with no structured attribution. When an AI agent evaluates them against a competitor for a procurement comparison, it is working from whatever it can infer from crawled text, which means it is probably wrong on pricing, probably wrong on enterprise feature availability, and almost certainly unable to surface the specific integration the prospect needs.

A machine-readable content architecture changes this. At the fact-sheet layer, they publish JSON-LD Organization and Product schemas that accurately describe each pricing tier, its feature set, and its target use case, updated programmatically from the same source of truth that drives their pricing page. At the entity relationship layer, they define how their integrations cluster into solution categories, so an AI agent can accurately answer a compound capability question without having to parse 150 separate integration pages. At the content API layer, they expose a structured, versioned comparison endpoint, something a sales engineer currently produces manually on request. At the provenance layer, every fact carries a timestamp, a data owner, and a version number.

When an AI agent now processes a product comparison query, the retrieval system finds structured, attributed, current facts rather than inferred text. The AI does not hallucinate their pricing. It correctly represents their enterprise features. It surfaces the right integrations because the entity graph connected them to the correct solution categories. The marketing VP who reads a competitive loss report six months later does not find “AI cited incorrect pricing” as the root cause.

This Is The Infrastructure Behind Verified Source Packs

In the previous article on Verified Source Packs, I described how brands can position themselves as preferred sources in AI-assisted research. The machine-readable content API is the technical architecture that makes VSPs viable at scale. A VSP without this infrastructure is a positioning statement. A VSP with it is a machine-validated fact layer that AI systems can cite with confidence. The VSP is the output visible to your audience; the content API is the plumbing that makes the output trustworthy. Clean structured data also directly improves your vector index hygiene, the discipline I introduced in an earlier article, because a RAG system building representations from well-structured, relationship-mapped, timestamped content produces sharper embeddings than one working from undifferentiated prose.

Build Vs. Wait: The Real Timing Question

The legitimate objection is that the standards are not settled, and that is true. MCP has real momentum, with 97 million monthly SDK downloads by 2026 and adoption from OpenAI, Google, and Microsoft, but enterprise content API standards are still emerging. JSON-LD is mature, but entity relationship mapping at the brand level has no formal specification yet.

History, however, suggests the objection cuts the other way. The brands that implemented Schema.org structured data in 2012, shortly after Google, Bing, and Yahoo launched it and before anyone was sure how broadly it would be used, shaped how Google consumed structured data across the next decade. They did not wait for a guarantee; they built to the principle and let the standard form around their use case. The specific mechanism matters less than the underlying principle: content must be structured for machine understanding while remaining valuable for humans. That will be true regardless of which protocol wins.

The minimum viable implementation, one you can ship this quarter without betting the architecture on a standard that may shift, is three things. First, a JSON-LD audit and upgrade of your core commercial pages, Organization, Product, Service, and FAQPage schemas, properly interlinked using the @id graph pattern, so your fact layer is accurate and machine-readable today. Second, a single structured content endpoint for your most frequently compared information, which, for most brands, is pricing and core features, generated programmatically from your CMS so it stays current without manual maintenance. Third, provenance metadata on every public-facing fact you care about: a timestamp, an attributed author or team, and a version reference.

That is not an llms.txt file. It is not a Markdown copy of your website. It is durable infrastructure that serves both current AI retrieval systems and whatever standard formalizes next, because it is built on the principle that machines need clean, attributed, relationship-mapped facts. The brands asking “should we build this?” are already behind the ones asking “how do we scale it?” Start with the minimum. Ship something this quarter that you can measure. The architecture will tell you where to go next.

Duane Forrester has nearly 30 years of digital marketing and SEO experience, including a decade at Microsoft running SEO for MSN, building Bing Webmaster Tools, and launching Schema.org. His new book about staying trusted and relevant in the AI era (The Machine Layer) is available now on Amazon.

More Resources:

What The Latest Web Almanac Report Reveals About Bots, CMS Influence, And llms.txt
Why Your Website Needs To Speak To Machines
LLMs.txt Shows No Clear Effect On AI Citations, Based On 300,000 Domains

This post was originally published on Duane Forrester Decodes.


Featured Image: mim.girl/Shutterstock; Paulo Bobita/Search Engine Journal