Is your ERP data “AI-ready”? Wrong question.

TL;DR

There is no fixed definition of “AI-ready data,” and the bundled cleanup program a vendor wants to sell you before any AI ships is the slow, expensive path. Clean the slice that is legally or financially binding. For everything else, build a thin layer of meaning over your existing ERP (enterprise resource planning) data and let your AI read that. You can model a business domain in weeks. You can never finish cleaning the whole system.

Your ERP data is not “AI-ready,” and the proposal on your desk to make it so – the one that bundles every table in the building into a single readiness program before you ship a single agent – is the most expensive way to get the slowest result. Here is the part nobody selling that program says out loud: you do not clean the mess to use it. You build a thin layer of meaning over the mess, and your AI reads the meaning. This is for the CIO who has been handed an “AI-ready data” quote and a straight face.

To be fair to the other side: disciplined teams do scope data work narrowly, domain by domain, and that is reasonable. The target here is the blanket version – “we must make all of it pristine first” – because that is the framing that stalls AI for a year.

What does “AI-ready data” actually mean?

There is no single industry-standard definition. Real frameworks for data quality exist (DAMA-DMBOK, ISO 8000), and they are useful. But “AI-ready” as a product pitch is not one of them – vendors use it to mean whatever scopes the engagement they are selling, which is convenient when nobody can point to the finish line. Real AI systems do not need pristine data anyway. They need a structure that tells them what the data means.

That structure has a plain name: a semantic layer. A semantic layer is a translation map that sits between your messy data and the software that reads it – it says “this field is a Supplier, that one is a Work Order, and here is the rule when two systems disagree.” Almost every mid-market manufacturer runs an ERP, and almost none of that data is documented or clean. That is a large pile of mess – and a large market for anyone selling to “fix” it before AI can touch it.

Should you clean your ERP before deploying AI?

Clean the slice that is binding. Model the rest. The reason is simple: cleaning is open-ended and modeling is bounded. A cleanup project has no natural stopping point – there is always another table, another orphaned record, another field someone repurposed in 2014. A semantic model has a clear target: represent the handful of business concepts your agents actually need. You finish the second one. You never finish the first.

Some data genuinely must be clean, and this is the strongest version of the other side. Financial close, tax filings, audited inventory that flows into cost of goods, FDA lot traceability – anything where a wrong number is a legal or financial event – needs real governance, full stop. That is non-negotiable. But that is a small, named slice, not your whole ERP. The mistake is treating the entire system as if it were the regulated part.

Here is the decision, in order:

  1. Is this data legally or financially binding? If the number flows into a tax filing, an audit, or a regulated report, govern it. If it informs a decision but is not itself reported, it is not in this bucket.
  2. Is the source system being replaced or retired soon? If it is going away within months, do not invest in cleaning or modeling. If it has a year or more left, a thin model still pays in that window – and the work doubles as discovery for the migration.
  3. Does an AI agent or analyst just need to understand it? Model it with a semantic layer.
  4. Everything left over: leave it alone until it earns attention.

Most ERP data lands in step 3 or 4. Almost none of it justifies the step-1 treatment applied to the whole system.
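The four steps above can be sketched as a small triage function. This is an illustration under stated assumptions, not a prescribed tool: the Dataset fields, the example dataset names, and the 12-month cutoff (one reading of "within months" versus "a year or more") are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Treatment(Enum):
    GOVERN = "govern"            # step 1: real cleanup and governance
    SKIP = "skip"                # step 2: source is going away within months
    MODEL = "model"              # step 3: describe it in the semantic layer
    LEAVE_ALONE = "leave alone"  # step 4: wait until it earns attention


@dataclass
class Dataset:
    name: str
    is_binding: bool                        # flows into a tax filing, audit, or regulated report
    months_until_retirement: Optional[int]  # None means no retirement is planned
    needed_by_agent_or_analyst: bool


# Assumption: anything retiring in under a year counts as "going away soon".
RETIREMENT_CUTOFF_MONTHS = 12


def triage(ds: Dataset) -> Treatment:
    """Apply the four questions in order; the first match wins."""
    if ds.is_binding:
        return Treatment.GOVERN
    if (ds.months_until_retirement is not None
            and ds.months_until_retirement < RETIREMENT_CUTOFF_MONTHS):
        return Treatment.SKIP
    if ds.needed_by_agent_or_analyst:
        return Treatment.MODEL
    return Treatment.LEAVE_ALONE


# Hypothetical datasets, one per outcome.
examples = [
    Dataset("audited_inventory", True, None, True),
    Dataset("legacy_crm_notes", False, 4, True),
    Dataset("work_orders", False, None, True),
    Dataset("old_print_queue_log", False, None, False),
]

for ds in examples:
    print(f"{ds.name} -> {triage(ds).value}")
```

The order of the checks is the point: a binding dataset is governed even if its system is being retired, because the first question is asked first.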

What do you build instead? The Three-Layer Semantic Stack.

You build three layers over the raw ERP and leave the source data exactly where it sits. Call it the Three-Layer Semantic Stack. In plain terms: three lightweight pieces that let your AI read your data without touching it. The raw data stays messy. The meaning lives above it. Your agents read the meaning, not the mess.


Layer | What it is | What it does | Who builds it
--- | --- | --- | ---
1. Raw probe | A read-only scan of the actual ERP schema – tables, columns, keys, real values | Discovers what is truly there, including the undocumented parts | Automated discovery, not a human reading a data dictionary that lies
2. Typed semantic model | A written map of your business concepts (a Work Order, a Supplier, a Lot) tied to the raw fields that represent them, with the kind of value each field holds spelled out | Gives meaning to messy column names and fixes ambiguity once, in one place | Your team plus a data architect for the first pass; maintainable in-house after
3. Agent interface | The layer your AI agents and apps actually query | Lets agents ask business questions without knowing the schema’s scars | Built on top; swappable as agents change

That middle layer is the heart of the semantic layer – the meaning made explicit, with the kind of value in each field spelled out so software stops guessing. It is the same pattern Palantir proved at enterprise scale and calls an ontology; you are building the same idea over your own ERP, without the nine-figure price tag. The point of the whole stack: meaning is a layer you add, not a property you scrub the data into having.
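To make "a written map with the kind of value spelled out" concrete, here is a minimal sketch of what one entry in a typed semantic model might look like, plus a thin slice of an agent interface on top. Every name in it (WO_HDR, WONUM, USR_DT3, STAT_CD, the FieldType values) is invented for illustration and does not come from any particular ERP or product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class FieldType(Enum):
    IDENTIFIER = "identifier"
    DATE = "date"
    TEXT = "text"


@dataclass(frozen=True)
class FieldMapping:
    business_name: str   # what the business calls it
    raw_column: str      # what the ERP calls it
    field_type: FieldType
    note: str = ""       # why the mapping exists, written down once


@dataclass(frozen=True)
class Concept:
    name: str
    raw_table: str       # assumption: this toy concept lives in a single table
    fields: Tuple[FieldMapping, ...]

    def select_sql(self) -> str:
        """Agent-facing view: raw columns renamed to business names."""
        cols = ", ".join(f"{f.raw_column} AS {f.business_name}" for f in self.fields)
        return f"SELECT {cols} FROM {self.raw_table}"


# Hypothetical Work Order concept over an undocumented header table.
WORK_ORDER = Concept(
    name="WorkOrder",
    raw_table="WO_HDR",
    fields=(
        FieldMapping("work_order_id", "WONUM", FieldType.IDENTIFIER),
        FieldMapping("due_date", "USR_DT3", FieldType.DATE,
                     note="user-defined date field repurposed as the due date"),
        FieldMapping("status", "STAT_CD", FieldType.TEXT),
    ),
)

print(WORK_ORDER.select_sql())
```

The source table is never altered. If the mapping for due_date turns out to be wrong, the fix is one line in the model rather than an update to the underlying rows.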

One honest boundary. The stack resolves structural ambiguity – messy field names, three formats for a part number, an undocumented schema. It does not resolve value-level contradictions for free. If two systems each call something the canonical supplier record with different payment terms, somebody still has to decide which one wins. That decision is governance, and it looks like cleanup, because it is. The layer makes the conflict visible and forces the rule to be written down once. It does not make the choice for you.
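A small sketch of that boundary, assuming two hypothetical source systems ("erp" and "procurement_portal") that disagree on a supplier's payment terms. The precedence list stands in for the governance decision; the code only applies the rule and reports the conflict, it does not decide who should win.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The written rule: which system wins when payment terms disagree.
# Assumption: someone with authority chose this order; the code cannot.
SOURCE_PRECEDENCE: List[str] = ["erp", "procurement_portal"]


@dataclass
class Resolution:
    value: Optional[str]        # the value the rule selects, if any
    source: Optional[str]       # which system it came from
    conflict: bool              # True when the systems disagree
    candidates: Dict[str, str]  # every value seen, kept for audit


def resolve_payment_terms(candidates: Dict[str, str]) -> Resolution:
    """Apply the precedence rule and keep the disagreement visible."""
    conflict = len(set(candidates.values())) > 1
    for source in SOURCE_PRECEDENCE:
        if source in candidates:
            return Resolution(candidates[source], source, conflict, dict(candidates))
    # No system covered by the rule supplied a value.
    return Resolution(None, None, conflict, dict(candidates))


result = resolve_payment_terms({"procurement_portal": "NET60", "erp": "NET30"})
print(result.value, result.source, result.conflict)
```

Reordering SOURCE_PRECEDENCE changes the answer, which is exactly why that line belongs to a named owner and not to whoever wrote the query.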

How fast can you actually model a messy ERP?

Weeks for a first-pass model of one business domain – not the whole ERP, and not the year a full cleanup assumes. The work is discovery, not repair: an automated probe reads the raw schema, infers relationships from the actual data, and produces a draft a human then corrects. The slow part is not finding the structure. It is deciding what the business calls things.
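As a rough illustration of what "reads the raw schema and infers relationships from the actual data" can mean, the sketch below probes a toy SQLite database with Python's standard sqlite3 module. The tables (SUPP, WO_HDR) and the inference heuristic are assumptions made for the example: a real ERP would sit on a different database, a real probe would sample rather than load every value, and every inferred link is only a candidate for a human to confirm.

```python
import sqlite3


def probe_schema(conn):
    """Scan tables using only SELECT and PRAGMA: column types and non-null values."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        values = {}
        for col in cols:
            name = col[1]
            rows = conn.execute(
                f'SELECT "{name}" FROM "{table}" WHERE "{name}" IS NOT NULL'
            ).fetchall()
            values[name] = [r[0] for r in rows]
        schema[table] = {"types": {c[1]: c[2] for c in cols}, "values": values}
    return schema


def candidate_links(schema):
    """Heuristic: a column whose values all appear in a unique column of
    another table is a candidate foreign key, even if none is declared."""
    links = []
    for child, child_info in schema.items():
        for child_col, child_vals in child_info["values"].items():
            if not child_vals:
                continue
            for parent, parent_info in schema.items():
                if parent == child:
                    continue
                for parent_col, parent_vals in parent_info["values"].items():
                    is_unique = len(set(parent_vals)) == len(parent_vals)
                    if is_unique and set(child_vals) <= set(parent_vals):
                        links.append((child, child_col, parent, parent_col))
    return links


# Toy stand-in for an undocumented ERP: no keys declared anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SUPP (SUPP_ID TEXT, NM TEXT);
    CREATE TABLE WO_HDR (WONUM TEXT, VEND_REF TEXT);
    INSERT INTO SUPP VALUES ('S1', 'Acme'), ('S2', 'Bolt Co');
    INSERT INTO WO_HDR VALUES ('W100', 'S1'), ('W101', 'S2'), ('W102', 'S1');
""")

schema = probe_schema(conn)
links = candidate_links(schema)
print(links)
```

On this toy data the probe proposes that WO_HDR.VEND_REF points at SUPP.SUPP_ID. Whether VEND_REF should be called "supplier" or "vendor" in the model is the part no probe can answer.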

Across our engagements the pattern is consistent: on a typical undocumented ERP – a couple hundred tables, no current documentation – automated discovery produces a usable first-pass model of one domain in about two to six weeks, and the argument over what to name things often takes as long as the discovery. That is our experience, not a published benchmark, and it moves with schema size, data access, and how fast your people agree on terms. The general lesson holds: when you stop trying to change the source data and start describing it, the timeline collapses.

Won’t a semantic layer just hide bad data instead of fixing it?

No – it does the opposite. The semantic layer makes bad data visible and explicit. When three systems disagree on what “complete” means, the layer forces you to write the rule down once, where everyone can see it, instead of burying it in one analyst’s spreadsheet. Hiding is when the rule lives in a single person’s head. Modeling is the opposite of that.
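One way to picture "write the rule down once" is a single lookup that states, per system, which raw status codes count as complete, and that refuses to guess when it meets a code nobody has ruled on. The system names and status codes here are made up for the sketch.

```python
# The one written rule: (system, raw status code) -> does it count as complete?
# All systems and codes below are hypothetical.
STATUS_RULES = {
    ("erp", "CLS"): True,
    ("erp", "OPN"): False,
    ("mes", "DONE"): True,
    ("mes", "RUN"): False,
    ("shipping", "SHIPPED"): True,
    ("shipping", "STAGED"): False,
}


class UnmappedStatus(Exception):
    """Raised when a status code has no written rule yet."""


def is_complete(system: str, raw_status: str) -> bool:
    """Look up the rule; surface unknown codes instead of guessing."""
    key = (system, raw_status.strip().upper())
    if key not in STATUS_RULES:
        raise UnmappedStatus(f"no rule for status {raw_status!r} in system {system!r}")
    return STATUS_RULES[key]


print(is_complete("mes", " done "))     # whitespace and case are normalized first
try:
    is_complete("erp", "HLD")
except UnmappedStatus as exc:
    print("needs a decision:", exc)
```

The exception is the visibility: an unmapped code stops the agent and lands on someone's desk instead of being silently counted one way or the other.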

A cleanup project, by contrast, often does hide the mess: it overwrites the messy history with a “clean” version, and you lose the evidence of how things actually ran. The semantic stack keeps the source of truth intact and adds an interpretation on top. If your interpretation is wrong, you fix the layer, not four million rows.

What does this mean for the budget you were handed?

The bundled “AI-ready data” line item front-loads cost and back-loads value – you pay for twelve months and see an agent in month thirteen, maybe. The semantic stack inverts that: a working model of your first domain in weeks, agents reading it shortly after, and cleanup reserved for the binding slice that genuinely needs it.

You are not being asked to tolerate bad data forever. You are being asked to stop treating “make all data perfect” as the prerequisite to “do anything useful.” Those were always two separate projects. One of them ships a working first domain this quarter. The other shows you an agent in month thirteen, maybe.

Conclusion

  • “AI-ready data” has no fixed definition; the bundled cleanup-everything-first program sells time, not outcomes.
  • You do not clean the mess to use it. You build a three-layer semantic stack over it: raw probe, typed semantic model, agent interface.
  • Clean only the legally or financially binding slice. Model everything else.
  • Discovery is fast; the slow part is naming things. That is a decision problem, not a data problem.
  • The semantic layer exposes bad data and forces one written rule. It does not silently fix contradictions – that still takes a governance call.