
The Compliance Brain: What Happens When AI Meets Evidence

Building an AI system that understands compliance context — not just rules, but the reasoning behind them. Three agents, structured citations, and the design decision to let the system say 'I don't know'.

This is part of a series on rethinking ISO 27001 compliance from first principles. Earlier articles described what continuous, structured evidence looks like and the questions auditors actually ask. This one asks: what powers a system that can answer those questions — and what happens when the same AI system is available to answer any auditor question, at any moment, without preparation?

An auditor asks: “How does your organisation ensure that risk assessments produce consistent, valid and comparable results?”

You know the answer. It’s in the risk methodology document — the one updated eighteen months ago, the one three people have read. You know the intent of the control, the relevant clause reference, and roughly where the evidence sits. What you don’t have is the ability to assemble the answer in sixty seconds with structured citations pointing to current evidence.

That gap — between knowing the answer exists and being able to produce it on demand — is the problem this article addresses.


The question bank

I’ve been compiling auditor questions for months. Not as an academic exercise — as a test harness.

The compilation now stands at 788 questions spanning every clause of ISO 27001:2022 and every Annex A control. Each question is classified by difficulty (routine, probing, or challenging), tagged with the specific clauses and controls it tests, and annotated with the expected evidence pattern — what a satisfactory answer should contain.
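To make that structure concrete, here is a minimal sketch of what one record in such a bank might look like. The field names, the example identifier, and the topic list are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Difficulty(Enum):
    ROUTINE = "routine"
    PROBING = "probing"
    CHALLENGING = "challenging"

@dataclass
class AuditQuestion:
    """One entry in the question bank (illustrative schema, not the real one)."""
    question_id: str
    text: str
    difficulty: Difficulty
    clause_refs: list[str] = field(default_factory=list)       # e.g. ["6.1.2"]
    control_refs: list[str] = field(default_factory=list)      # e.g. ["A.8.1"]
    expected_pattern: list[str] = field(default_factory=list)  # topics a good answer covers

# Hypothetical entry matching the question that opened this article
q = AuditQuestion(
    question_id="Q-0412",
    text="How does your organisation ensure that risk assessments "
         "produce consistent, valid and comparable results?",
    difficulty=Difficulty.CHALLENGING,
    clause_refs=["6.1.2"],
    expected_pattern=["risk criteria", "methodology document",
                      "repeatability", "comparability"],
)
```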

The original purpose was preparation: if you can answer the challenging questions, the routine ones take care of themselves. But the compilation revealed something more interesting. The questions, taken together, form a specification. They define what a compliant ISMS must be able to demonstrate — not in the abstract language of the standard, but in the concrete language of “show me.”

A specification can be tested against. And a system that produces structured evidence daily can, in principle, be queried against that specification. The question stopped being “can a person answer this?” and became “can a system answer this with evidence citations?”


Three agents, one pipeline

The system I built to answer auditor questions uses a three-stage pipeline. Each stage has a distinct role, a defined input and output, and a quality bar.

Stage 1: Triage. The question arrives as natural language. The triage agent classifies it: which clause or control does this relate to? What’s the intent — is this about a specific Annex A control, a management system clause, or something out of scope? How confident is the classification?

If confidence is low, the system asks for clarification rather than guessing. If the question is out of scope — “What’s the weather like?” — it says so. This sounds trivial, but it’s critical: a system that confidently answers the wrong question is worse than one that admits uncertainty.
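As a sketch, the routing decision might look like the code below. The confidence threshold of 0.7 and the category names are assumptions; the behaviour they encode (clarify when unsure, decline when out of scope) is the point.

```python
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    ANNEX_A_CONTROL = "annex_a_control"
    MANAGEMENT_CLAUSE = "management_clause"
    OUT_OF_SCOPE = "out_of_scope"

@dataclass
class TriageResult:
    scope: Scope
    target: str | None   # e.g. "A.8.1" or "9.3"; None when out of scope
    confidence: float    # classifier's self-reported confidence, 0.0 to 1.0

CONFIDENCE_FLOOR = 0.7   # assumed threshold; below it, ask rather than guess

def route(result: TriageResult) -> str:
    """Decide the next step: decline, clarify, or proceed to retrieval."""
    if result.scope is Scope.OUT_OF_SCOPE:
        return "decline: question is outside the ISMS"
    if result.confidence < CONFIDENCE_FLOOR:
        return "clarify: ask the user to restate or narrow the question"
    return f"retrieve: load evidence for {result.target}"
```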

Stage 2: Evidence retrieval. For classified questions, the system retrieves the relevant structured evidence. Not a keyword search — a targeted retrieval that understands which evidence corpus maps to which control. For Annex A controls, it queries the evidence index. For management system clauses, it loads the relevant context: stakeholder analysis for Clause 4.2, asset registry for Clause 4.1, clause reference text for all clauses.

The distinction matters. An Annex A question about A.8.1 (endpoint devices) needs the latest evidence collection data — device counts, compliance percentages, rule results. A Clause 9.3 question about management review needs governance artefacts — review minutes, action logs, decision records. The retrieval strategy differs by question type.
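In code terms, the dispatch might look like the sketch below. The store names and their contents are placeholders standing in for whatever the real evidence backend exposes.

```python
# Placeholder stores standing in for the real evidence backend (assumptions)
EVIDENCE_INDEX = {"A.8.1": {"compliant_pct": 97.3, "rules": ["A.8.1-R1"]}}
CLAUSE_CONTEXT = {
    "4.1": {"asset_registry": "assets.json"},
    "4.2": {"stakeholder_analysis": "stakeholders.json"},
}

def retrieve_evidence(scope: str, target: str) -> dict:
    """Pick a retrieval strategy by question type (illustrative only)."""
    if scope == "annex_a_control":
        # Latest collection run: device counts, compliance %, rule results
        return EVIDENCE_INDEX.get(target, {})
    if scope == "management_clause":
        # Governance context keyed by clause, plus the clause reference text
        context = dict(CLAUSE_CONTEXT.get(target, {}))
        context["clause_text"] = f"<reference text for clause {target}>"
        return context
    raise ValueError("out-of-scope questions never reach retrieval")
```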

Stage 3: Compliance reasoning. The reasoning agent receives the question, the retrieved evidence, and the relevant policy context. It produces an answer with tiered citations:

  • Tier 1 citations reference specific evidence data — “As of the latest collection run, 97.3% of managed devices are compliant with baseline policies (Rule A.8.1-R1, threshold ≥95%).”
  • Tier 2 citations reference policy and procedural context — “Per the A.8.1 User Endpoint Devices Policy, Section 4.2, devices that fail compliance checks for more than 72 hours are flagged for remediation.”

The answer includes a confidence indicator. Green means the evidence is current and the answer is well-supported. Amber means the evidence exists but may be stale or the question requires interpretation. Red means no evidence was found — and critically, the system says so explicitly rather than fabricating an answer.
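Put together, the answer object might look something like the sketch below. The enum values mirror the traffic-light scheme just described; the field names and the example content are invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum

class Confidence(Enum):
    GREEN = "green"   # evidence current, answer well-supported
    AMBER = "amber"   # evidence stale or interpretation required
    RED = "red"       # no evidence found; stated explicitly

@dataclass
class Citation:
    tier: int      # 1 = evidence data, 2 = policy/procedural context
    source: str    # e.g. a rule ID or a policy section
    excerpt: str

@dataclass
class ComplianceAnswer:
    question_id: str
    answer_text: str
    confidence: Confidence
    citations: list[Citation] = field(default_factory=list)

# A hypothetical green answer to an A.8.1 question
answer = ComplianceAnswer(
    question_id="Q-0098",
    answer_text="97.3% of managed devices meet the baseline (threshold ≥95%).",
    confidence=Confidence.GREEN,
    citations=[
        Citation(1, "Rule A.8.1-R1", "97.3% compliant as of latest collection run"),
        Citation(2, "A.8.1 Policy, Section 4.2", "devices failing >72h flagged for remediation"),
    ],
)
```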


Why “I don’t know” is the right answer

The most important design decision in the system is what happens when there’s no evidence.

The instinct — and the default behaviour of most AI systems — is to generate a plausible answer regardless. The system has enough general knowledge about ISO 27001 to produce something that sounds correct. But an answer that sounds correct without evidence backing is precisely the kind of compliance theatre this series has been arguing against.

When the system encounters a control for which no evidence has been collected — because the script hasn’t been written yet, or the collection failed, or the control is manual-only — it responds with an explicit “No Evidence Found” indicator. This is not a failure. It’s the correct answer.

An auditor who asks about a control and is told “we don’t currently have automated evidence for this control” receives an honest answer. An auditor who receives a fabricated answer based on general knowledge receives a liability. The system’s willingness to say “I don’t know” is, paradoxically, the strongest evidence that its positive answers can be trusted.

This distinction required careful engineering. The quality evaluation framework tests for it specifically: a well-formed “no evidence” response is scored as a partial result, not a failure. A vague response that avoids admitting the gap is scored as a genuine failure. The system is trained to prefer honesty over completeness.
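A sketch of how that scoring rule might be expressed. The 0.5 partial score and the substring check are assumptions, chosen only to show the asymmetry between an honest gap and a hidden one.

```python
def score_no_evidence_case(answer_text: str, confidence: str) -> float:
    """Score a response when evidence is missing (illustrative weights).

    A well-formed 'No Evidence Found' is a partial result; a vague answer
    that avoids admitting the gap is a genuine failure.
    """
    if confidence != "red":
        return 1.0   # evidence-backed answers are scored on substance elsewhere
    if "no evidence" in answer_text.lower():
        return 0.5   # partial: honest and well-formed
    return 0.0       # failure: evidence missing *and* not admitted

assert score_no_evidence_case("No Evidence Found for control A.5.7.", "red") == 0.5
```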


Testing the brain

A system that answers auditor questions must itself be auditable. How do you know the answers are correct? How do you know they’re consistent? How do you know the system handles edge cases?

The quality evaluation framework tests every question at two levels.

Structural evaluation checks the mechanics: Does the response include a confidence indicator? Does it cite evidence at both tiers? Is it the right length — substantive enough to be useful, concise enough to be actionable? Does it avoid leaking internal routing information into the response?

Pattern coverage checks the substance: Given the expected evidence pattern for a question — the topics a satisfactory answer should cover — what percentage of those topics appear in the response? The threshold differs by context: for questions tested against real evidence, coverage must exceed 60%. For questions tested against mocked data, a lower threshold applies because the mocked responses are artificially constrained.
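The coverage check itself is simple enough to sketch. The 60% threshold comes from the text above; the mock threshold and the naive substring matching are simplifications (a real check would presumably match topics more robustly).

```python
def pattern_coverage(response: str, expected_topics: list[str]) -> float:
    """Fraction of expected topics that appear in the response (naive match)."""
    text = response.lower()
    hits = sum(1 for topic in expected_topics if topic.lower() in text)
    return hits / len(expected_topics) if expected_topics else 1.0

LIVE_THRESHOLD = 0.60   # from the text: real-evidence runs must exceed 60%
MOCK_THRESHOLD = 0.40   # assumed lower bar for artificially constrained mocks

def passes_coverage(response: str, topics: list[str], live: bool) -> bool:
    threshold = LIVE_THRESHOLD if live else MOCK_THRESHOLD
    return pattern_coverage(response, topics) > threshold
```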

There’s a third level for live evaluation: an independent LLM judges whether the response satisfies the question. This catches cases where structural checks pass but the answer is substantively wrong — or where structural checks fail but the answer is actually correct despite not following the expected format.

The three levels don’t always agree. That disagreement is itself informative. When the structural evaluator says “pass” but the judge says “fail,” the question’s expected pattern may be wrong. When the judge says “pass” but the structural evaluator says “fail,” the system may be producing good answers in an unexpected format. Both cases trigger investigation.
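The reconciliation logic is small enough to sketch directly. The verdict strings are invented, but the four-way split follows the cases just described.

```python
def reconcile(structural_pass: bool, judge_pass: bool) -> str:
    """Combine the structural evaluator and the LLM judge; disagreement triggers review."""
    if structural_pass and judge_pass:
        return "pass"
    if not structural_pass and not judge_pass:
        return "fail"
    if structural_pass and not judge_pass:
        # Mechanically sound but substantively wrong: suspect the expected pattern
        return "investigate: review the question's expected evidence pattern"
    # Judge satisfied but structure failed: good answer in an unexpected format
    return "investigate: review the structural rules against this response"
```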

788 questions, three evaluation levels, regular test runs. The system audits itself on a schedule, producing a quality report that shows pass rates, failure categories, and trends over time. If the AI pipeline degrades — because a prompt changed, an evidence format shifted, or an API response structure updated — the quality framework catches it before an auditor does.


From answers to artefacts

The original design treated the AI system as a query engine: ask a question, get an answer. Stateless. Each interaction independent. The auditor (or team member) would ask, receive an evidence-backed response, and move on.

In practice, this wasn’t enough. Audit conversations aren’t isolated questions — they’re threads. An auditor probing A.8.5 (Secure Authentication) doesn’t ask one question. They ask ten, each building on the last, each follow-up informed by the previous answer. A stateless system that forgets the previous exchange forces the auditor to re-establish context with every question.

The system now maintains conversation persistence. Each session preserves the full thread — questions, evidence retrievals, reasoning, and citations — across multiple interactions. An auditor can start a review, leave it overnight, and return the next day with full context preserved. The conversation becomes a working document, not a series of disconnected queries.

But the more significant evolution was what happens to the conversation after it ends. In the audit programme, each per-control conversation produces a structured assessment: conformity classification, confidence score, justification with evidence citations, and audit methods used. That assessment becomes a draft observation — automatically populated from the conversation, linked to the session that produced it, and ready for the auditor to review and confirm.
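One way to picture the artefact is as a record that carries its own provenance. Everything below is an illustrative shape, not the actual data model; the point is that the observation links back to the session and to every exchange within it.

```python
from dataclasses import dataclass, field

@dataclass
class Exchange:
    """One question/answer turn in a per-control conversation."""
    question: str
    evidence_refs: list[str]
    answer: str

@dataclass
class DraftObservation:
    """Assessment artefact produced from a conversation (illustrative shape)."""
    control: str               # e.g. "A.8.5"
    conformity: str            # classification, e.g. "conforms"
    confidence: float          # assessment confidence score
    justification: str         # reasoning with evidence citations
    audit_methods: list[str]   # e.g. ["evidence review", "system query"]
    session_id: str            # links back to the conversation that produced it
    provenance: list[Exchange] = field(default_factory=list)
    status: str = "draft"      # awaits human review and confirmation
```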

The AI system doesn’t just answer questions anymore. It produces audit artefacts. The conversation is the input; the observation is the output. And the observation carries a complete provenance chain: every question that was asked, every piece of evidence that was retrieved, every citation that supported the assessment. When the external auditor asks “how did you arrive at this observation?”, the answer is the conversation itself.

This is a subtle but important shift. A query engine helps people find information. An artefact-producing system helps people do work. The AI went from being a reference tool to being an audit instrument.


The institutional memory problem

Every organisation has a person — sometimes two — who carries the ISMS in their head. They know why the exception was granted. They remember the management review where the CTO accepted the residual risk. They can explain the policy structure because they designed it.

When that person leaves, the institutional memory leaves with them. The new person inherits documents but not context. They can read the risk register but can’t explain why RM-021 is scored C:5, I:2, A:2. They can find the evidence files but can’t explain why certain devices are excluded from the denominator.

An AI system trained on structured evidence, annotated policies, and a comprehensive question bank doesn’t replace that person. But it preserves the reasoning. When an auditor asks “why did you structure your risk assessments this way?”, the system can cite the first-principles analysis that justified the approach. When a new team member asks “what’s the exception process for service accounts?”, the system can point to the exception group, the access review schedule, and the policy section that governs it.

This is institutional memory as a system capability, not a human dependency. The knowledge is versioned, searchable, and testable. When it becomes outdated — because a policy changes or a risk is reclassified — the content governance system detects the change and flags it for review.

The ISMS Roles Register illustrates this concretely: 33 named roles across five organisational tiers, each with defined responsibilities, RACI assignments, and communication requirements. When an auditor asks “Who is responsible for reviewing access rights?” (Clause 5.3), the system doesn’t just cite a policy paragraph — it names the specific role, the individual assigned to it, and the review cadence defined in the register. That level of specificity is what institutional memory looks like when it’s structural rather than personal.
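A sketch of what such a lookup could return. The register entry, the role name, the assignee, and the cadence below are all invented for illustration; the shape is what matters.

```python
# Illustrative register entry; the role, person, and cadence are invented
ROLES_REGISTER = {
    "access-review-owner": {
        "role": "Access Review Owner",
        "tier": "operational",
        "assigned_to": "J. Example",               # hypothetical individual
        "raci": {"access rights review": "R"},     # Responsible
        "cadence": "quarterly",
    },
}

def who_is_responsible(responsibility: str) -> dict | None:
    """Return the register entry holding RACI 'R' for a responsibility."""
    for entry in ROLES_REGISTER.values():
        if entry["raci"].get(responsibility) == "R":
            return entry
    return None

owner = who_is_responsible("access rights review")
# The answer names the role, the individual, and the review cadence,
# not just a policy paragraph.
```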

The human expert is still essential. But they’re essential for judgment, not for recall.


What the AI doesn’t do

Let me be explicit about the boundaries.

The system does not make compliance decisions. It doesn’t decide whether a control is adequate, whether a risk should be accepted, or whether an exception is justified. Those are management decisions that require human judgment, organisational context, and accountability.

The system does not replace the auditor. The auditor’s role is to exercise professional scepticism — to challenge the evidence, test the processes, and form an independent opinion. An AI that produces evidence is a tool for the auditor, not a substitute for one.

The system does not guarantee compliance. It produces structured evidence and answers questions about that evidence. Whether the evidence demonstrates effective control operation is a judgment call that belongs to the auditor and, ultimately, to the organisation’s management.

These boundaries are not limitations. They’re design decisions that reflect the standard’s own architecture: management is responsible for the ISMS (Clause 5.1), the organisation determines what’s appropriate (Clause 6.1.3), and the auditor provides independent assessment (Clause 9.2). An AI that tried to do all three would be overstepping its role — and would be less trustworthy as a result.


The question I’ll leave you with

If you could ask your ISMS any question — any clause, any control, any difficulty level — and receive an evidence-backed answer in under a minute, what would you ask first?

Not the easy questions. Not “do we have a policy?” You know the answer to those.

The hard ones. “How do we know our risk methodology produces consistent results?” “What changed after the last management review?” “If our Defender configuration was wrong for three months, how would we know?”

The value of an AI-powered compliance system isn’t the answers it gives to questions you already know. It’s the confidence to face the questions you’ve been avoiding.


JJ Milner is a Microsoft MVP and the founder of Global Micro Solutions, a managed services provider operating across 1,200+ Microsoft 365 tenants. He writes about rethinking compliance from first principles.
