Originally published on Substack. Republished here for readers who prefer the GMS context; Substack remains the canonical location.
This is the fourth article in a series on rethinking ISO 27001 compliance from first principles. The previous article showed how three VMs broke a compliance score. This one asks a bigger question: what would compliance evidence look like if you stopped treating it as a periodic report and started treating it as a continuous signal?
Your external auditor walks in tomorrow, unannounced. No prep time. No frantic emails to the MSP. No two-week scramble to re-export Conditional Access policies, re-screenshot Intune dashboards, and re-generate the report you generated last quarter but never actually checked.
Just: “Show me your endpoint device compliance, right now.”
I used to think of this as a thought experiment. It’s not. Across the tenants I manage, we can answer that question in under sixty seconds — with the rules that were evaluated, the thresholds that were applied, the weights that determined what matters, and evidence that is collected daily, automatically, and sealed with a cryptographic hash at the point of collection.
This article describes what that system looks like and what we learned building it. Not theory. Architecture.
The temporality problem
Most compliance evidence has a fundamental defect: it’s a snapshot.
Someone ran a report. Someone exported a spreadsheet. Someone took a screenshot of an admin portal on a Tuesday afternoon in January. That artefact was placed in a folder — maybe SharePoint, maybe a shared drive, maybe an email attachment — and labelled “A.8.1 Evidence Q4 2025.”
Three problems with this:
First, the evidence is stale the moment it’s collected. Device compliance changes hourly. Users are onboarded and offboarded. Conditional Access policies are modified. The screenshot from January tells you what was true in January. It tells you nothing about today.
Second, the evidence is disconnected from the control it’s supposed to demonstrate. A screenshot of an Intune dashboard doesn’t align with any specific requirement. It’s a picture. The auditor has to mentally trace from the control text (“information stored on, processed by or accessible via user endpoint devices shall be protected”) to whatever is on screen and decide whether it’s sufficient. That interpretation happens in the auditor’s head, not in the evidence.
Third, there’s no integrity chain. Nobody can prove the screenshot wasn’t taken from a test environment, or that the exported CSV wasn’t edited before it was uploaded. The evidence is trusted because we trust the person who collected it. That’s not a control. That’s faith.
What evidence actually needs
I’ve been thinking about what compliance evidence would look like if you designed it from first principles rather than from habit. Not “what does the auditor expect to see” but “what would actually demonstrate that a control is operating effectively, right now?”
Three things emerged.
Evidence needs structure. Not a flat export, but rules — each one mapping to a specific aspect of the control. Take A.8.1, User Endpoint Devices. The control requirement is broad: protect information on endpoint devices. But “protected” isn’t a single measurement. It decomposes into distinct questions:
- Are devices compliant with your baseline policies? (coverage percentage)
- Are they encrypted? (BitLocker, FileVault — remembering the proxy problem from last time)
- Are they onboarded to endpoint detection? (Windows, macOS, Mobile)
- Is Conditional Access actually assigned to all users? (scope coverage)
Evidence needs weight, but it also needs gatekeepers: kill switches that override the score. This is a critical distinction I missed early on. In a dashboard, you weight things so you can see trends. But in an audit, certain controls are binary.
If encryption drops to 50%, you don’t get a “partial pass.” You get a Major Non-Conformity.
Therefore, a robust evidence model needs two layers: a Weighted Score for internal prioritisation (e.g., “We are at 85% health, let’s fix the low-priority items later”) and a Gatekeeper Rule for the audit (e.g., “If Encryption < 99%, the status is FAILED, regardless of the weighted score”).
Evidence needs a pass/fail line. Not 100%. The standard doesn’t require perfection, and anyone claiming 100% compliance across all controls is either lying or measuring too narrowly. A more honest threshold might be: if you pass 95% of the weighted rules, the control is operating effectively. But for those Gatekeeper controls, the tolerance is near zero.
A worked example
Let me make this concrete. Across the tenants I manage, here’s how I’ve been experimenting with decomposing A.8.1 into structured evidence. Note the new “Gatekeeper” column:
| Rule | What it measures | Type | Threshold | Weight | Gatekeeper? |
|---|---|---|---|---|---|
| R1 | Device compliance coverage | Threshold | ≥95% | 30 | Yes |
| R2 | Encryption coverage | Threshold | 100% | 25 | Yes |
| R3 | Windows EDR onboarding | Threshold | ≥95% | 15 | No |
| R4 | Conditional Access Assignment | Threshold | 100% of Users | 15 | Yes |
| R5 | Mobile App Protection (MAM) | Threshold | ≥95% | 15 | No |
Total weight: 100. Pass threshold: ≥80%.
Now imagine R3 (EDR) fails: Windows EDR onboarding has dropped to 91%, four points below its threshold. The overall score slips to 96. The control is still compliant. The failure is flagged, but it hasn’t brought the house down. This reflects reality: a few missing sensors are a problem, not a disaster.
Now imagine R2 (Encryption) fails: coverage drops to 90%. Points come off the weighted score, sure; the dashboard might still show something like 85%. But more importantly, it trips the gatekeeper rule, and the audit status flips instantly to FAILED.
This mirrors how auditors actually think. They can forgive a few missing EDR agents if the trend is good. They cannot forgive unencrypted laptops in the wild. A weighted model without gatekeepers is dangerous — it hides critical risks behind good averages.
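To make the mechanics concrete, here is a minimal sketch of how the two-layer evaluation could work. The rule IDs, thresholds and weights come straight from the table above; the proportional partial-credit formula and the Python shape of it are illustrative assumptions, not the exact scoring used in production.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    rule_id: str
    description: str
    threshold: float   # required coverage, as a fraction (0.95 = 95%)
    weight: int        # contribution to the weighted score
    gatekeeper: bool   # if True, failure overrides the weighted score

# A.8.1 rules from the table above
RULES = [
    Rule("R1", "Device compliance coverage",    0.95, 30, True),
    Rule("R2", "Encryption coverage",           1.00, 25, True),
    Rule("R3", "Windows EDR onboarding",        0.95, 15, False),
    Rule("R4", "Conditional Access assignment", 1.00, 15, True),
    Rule("R5", "Mobile App Protection (MAM)",   0.95, 15, False),
]

PASS_THRESHOLD = 80  # weighted score needed for the control to pass

def evaluate(measurements: dict[str, float]) -> dict:
    """Weighted score for internal prioritisation, gatekeeper check for the audit verdict."""
    score = 0.0
    failed_gatekeepers = []
    for rule in RULES:
        actual = measurements[rule.rule_id]
        if actual >= rule.threshold:
            score += rule.weight                                  # full credit
        else:
            score += rule.weight * (actual / rule.threshold)      # partial credit for the shortfall
            if rule.gatekeeper:
                failed_gatekeepers.append(rule.rule_id)
    if failed_gatekeepers:
        audit_status = "FAILED"
    else:
        audit_status = "PASS" if score >= PASS_THRESHOLD else "FAILED"
    return {"weighted_score": round(score, 1),
            "audit_status": audit_status,
            "failed_gatekeepers": failed_gatekeepers}

# Scenario 1: EDR onboarding slips to 91%. The score dips; the control still passes.
print(evaluate({"R1": 0.97, "R2": 1.00, "R3": 0.91, "R4": 1.00, "R5": 0.96}))

# Scenario 2: encryption slips to 90%. The gatekeeper flips the audit status to FAILED.
print(evaluate({"R1": 0.97, "R2": 0.90, "R3": 0.96, "R4": 1.00, "R5": 0.96}))
```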
A note from later in the journey: when I implemented this model at scale, I found that the explicit gatekeeper mechanism was unnecessary complexity. High-weight threshold rules achieve the same effect more naturally — a rule weighted at 20 or 30 with a 95% threshold functions as a de facto gatekeeper, because when it fails significantly, the weighted score drops below the pass threshold on its own. The weighted model, when calibrated properly, already prioritises the right things.
The freshness problem
Evidence staleness is the dirty secret of compliance. How old is the evidence your organisation would present if audited today?
ISO 27001 Clause 9.1 requires “monitoring, measurement, analysis, and evaluation.” Most organisations do the monitoring (collecting logs) but fail the evaluation (looking at them).
Automated evidence addresses monitoring, but it introduces a new risk: “Set and Forget.” A dashboard that has been green for six months might be green because the data feed broke six months ago.
Freshness isn’t just about “is the data new?” It’s about “is the collection mechanism still working?”
If your evidence file is older than 24 hours for a technical control, such as device compliance, it is not evidence. It is history.
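As a sketch of what that check could look like: the snippet below assumes one JSON evidence file per control carrying an ISO 8601 timestamp (the same shape as the sealed evidence example later in this article). The file path and the 24-hour limit are illustrative, not a prescription.

```python
import json
from datetime import datetime, timezone, timedelta
from pathlib import Path

MAX_AGE = timedelta(hours=24)  # for technical controls, anything older is history

def check_freshness(evidence_path: Path) -> str:
    """Return PASS, STALE, or MISSING for a single evidence file."""
    if not evidence_path.exists():
        return "MISSING"   # the collector never ran: the most dangerous case
    evidence = json.loads(evidence_path.read_text())
    collected = datetime.fromisoformat(evidence["timestamp"].replace("Z", "+00:00"))
    age = datetime.now(timezone.utc) - collected
    return "PASS" if age <= MAX_AGE else "STALE"

# Hypothetical layout: one evidence file per control
print(check_freshness(Path("evidence/A.8.1.json")))
```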
Evidence integrity and ISO 27037
Here’s a question that almost nobody in the compliance space discusses: how do you prove your evidence hasn’t been modified?
The entire chain of trust in traditional compliance evidence rests on the person who collected it. They ran the export, they uploaded it, and we trust that they didn’t edit the CSV.
There is actually a standard for this: ISO/IEC 27037 (Guidelines for identification, collection, acquisition and preservation of digital evidence). It demands that the integrity of evidence be verifiable.
What if evidence was sealed at the point of collection? A cryptographic hash of the evidence content, signed by an independent service, timestamped, and embedded in the evidence file itself.
Instead of a screenshot, the auditor gets a JSON object like this:
```json
{
  "control_id": "A.8.1",
  "timestamp": "2026-02-14T10:00:00Z",
  "evidence_hash": "a1b2c3d4e5f6...",
  "signature": "signed_by_compliance_bot_v1",
  "data": {
    "total_devices": 450,
    "encrypted": 450,
    "compliant": 448
  }
}
```
Any modification — even a single cell change — invalidates the hash. The auditor can verify integrity without trusting the person who collected it. This isn’t science fiction; it’s basic cryptography applied to GRC.
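A minimal sketch of how sealing and verification could work, assuming a canonical JSON serialisation of the payload and an HMAC signature over its hash. In a real deployment the signing key would be held by an independent service, not hard-coded next to the collector.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

# Simplified for the sketch: in production this key lives with an independent signing service.
SIGNING_KEY = b"replace-with-managed-secret"

def seal(control_id: str, data: dict) -> dict:
    """Seal evidence at the point of collection."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":")).encode()
    evidence_hash = hashlib.sha256(canonical).hexdigest()
    signature = hmac.new(SIGNING_KEY, evidence_hash.encode(), hashlib.sha256).hexdigest()
    return {
        "control_id": control_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "evidence_hash": evidence_hash,
        "signature": signature,
        "data": data,
    }

def verify(evidence: dict) -> bool:
    """Recompute the hash from the embedded data and check the signature."""
    canonical = json.dumps(evidence["data"], sort_keys=True, separators=(",", ":")).encode()
    if hashlib.sha256(canonical).hexdigest() != evidence["evidence_hash"]:
        return False   # payload was modified after sealing
    expected = hmac.new(SIGNING_KEY, evidence["evidence_hash"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, evidence["signature"])

sealed = seal("A.8.1", {"total_devices": 450, "encrypted": 450, "compliant": 448})
assert verify(sealed)
sealed["data"]["encrypted"] = 449   # a single-cell change...
assert not verify(sealed)           # ...invalidates the seal
```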
What we learned when the theory became real
I described the architecture above as a set of principles. When we implemented it — across 116 collection scripts covering all 93 Annex A controls and 23 management system subclauses — the principles held, but reality added lessons that theory didn’t predict.
API drift is real and constant. The macOS encryption status format in Microsoft Graph changed twice during development. Each time, a rule that had been passing suddenly started returning incorrect data — not failing visibly, but silently producing wrong results. Evidence integrity isn’t just about tampering. It’s about the pipeline itself degrading without warning. The response: treat evidence collection like production code, with automated testing, version control, and continuous validation.
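As an illustration of what “treat it like production code” means, here is a tiny pytest-style schema check of the kind that catches drift before it silently corrupts evidence. The field names are stand-ins for whatever properties a collector actually depends on, not a guaranteed Microsoft Graph contract.

```python
# Fields the collector depends on, with the types it expects (illustrative names).
REQUIRED_FIELDS = {"id": str, "operatingSystem": str, "isEncrypted": bool,
                   "complianceState": str}

def validate_device_record(record: dict) -> list[str]:
    """Return a list of schema problems for one device record."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

def test_sample_device_still_matches_schema():
    # In CI this sample would come from a recorded API response fixture.
    sample = {"id": "0001", "operatingSystem": "macOS",
              "isEncrypted": True, "complianceState": "compliant"}
    assert validate_device_record(sample) == []
```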
Freshness creates a new failure mode. Daily automated collection solves the staleness problem. But it introduces the “green dashboard” problem: a dashboard that’s been showing compliant for weeks might be green because the collection mechanism broke three weeks ago and nobody noticed. The solution requires monitoring the monitor — verifying that evidence collection actually ran, that it returned data, and that the data is structurally valid. A missing evidence file is more dangerous than a failing evidence rule, because the failure is invisible.
The “set and forget” trap. Automated evidence collection creates a false sense of security if nobody reviews the trends. A slow degradation — compliance drifting from 97% to 94% to 91% over three months — doesn’t trigger any single alert, but it represents a systemic problem. Daily snapshots only become useful when they’re aggregated into trends and the trends are reviewed. Collection without analysis is monitoring without intelligence.
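A sketch of what “aggregated into trends” can mean in practice: compare a recent window of daily scores against an earlier baseline and flag the drift. The window size and tolerance here are arbitrary illustrations.

```python
from statistics import mean

def detect_drift(daily_scores: list[float], window: int = 7, max_drop: float = 3.0) -> bool:
    """Flag slow degradation: compare the most recent week's average score
    against the first week in the retained history."""
    if len(daily_scores) < 2 * window:
        return False   # not enough history to judge a trend
    baseline = mean(daily_scores[:window])
    recent = mean(daily_scores[-window:])
    return (baseline - recent) > max_drop

# Ninety days of compliance scores drifting slowly from ~97 down to ~91
history = [97 - (d * 0.07) for d in range(90)]
print(detect_drift(history))   # True: no single day alerts, but the trend does
```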
Scale revealed architectural decisions. With one control, you can make ad hoc decisions. With 116 collection scripts, you need a framework. Metadata standardisation, consistent naming, shared helper functions, cross-reference tracking, exception group management — these aren’t features. They’re survival requirements for a system that produces evidence at scale. The architecture of the collection system matters as much as the evidence it produces.
Why automation isn’t enough
You might ask: “Isn’t this what the new wave of SaaS compliance vendors like Drata or Vanta solve?”
Not exactly. Those platforms are excellent at orchestration — they provide the pipes to collect data and the dashboard to view it. But they operate at a different layer. They check whether settings are enabled; they don’t evaluate whether the control is effective within the context of your specific environment. A SaaS tool might tell you “MFA is enabled” because a checkbox in an API is ticked. It won’t tell you that three service accounts were excluded, that the exclusion was reviewed last quarter, and that the exclusion is documented in exception group sg-ISO27001-Exceptions-ServiceAccounts with an annual access review attached.
Architectural compliance goes deeper. It asks: “Is MFA enabled for everyone except the three service accounts we forgot about, and is that exception documented as a risk in our ISMS?” The “Unicorn” tools give you the what; this series is about the how and the why that survives a probing auditor who asks to see the raw telemetry behind the dashboard.
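As an illustration (not the actual check we run), a rule at this layer compares a policy’s exclusions against the documented exception group. The group name comes from the paragraph above; the policy structure below is a simplified stand-in for what a tenant actually returns, not the real Graph schema.

```python
DOCUMENTED_EXCEPTION_GROUP = "sg-ISO27001-Exceptions-ServiceAccounts"

def undocumented_exclusions(policy: dict, exception_members: set[str]) -> set[str]:
    """Accounts excluded from the policy but absent from the documented exception group."""
    excluded = set(policy["conditions"]["users"].get("excludeUsers", []))
    return excluded - exception_members

# Simplified, illustrative inputs: a CA policy export and the exception group's membership
policy = {
    "displayName": "Require MFA for all users",
    "conditions": {"users": {"includeUsers": ["All"],
                             "excludeUsers": ["svc-backup", "svc-scanner", "svc-legacy"]}},
}
exception_members = {"svc-backup", "svc-scanner"}

print(undocumented_exclusions(policy, exception_members))
# {'svc-legacy'}: an exclusion that is not documented as a risk in the ISMS
```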
The gap isn’t data
Microsoft 365 already has the telemetry. Every device compliance state, every Conditional Access evaluation, every Defender detection, every DLP policy match — it’s all there, in real time, via API. The data exists.
The gap is between that telemetry and what the auditor needs. The auditor needs structured evidence that maps to specific controls, with rules that encode the organisation’s compliance thresholds (and gatekeepers), weights that reflect risk priorities, freshness that demonstrates ongoing operation, and integrity that proves the evidence hasn’t been altered.
That gap — between raw telemetry and auditable evidence — is where the entire compliance-preparation industry operates. And it’s a gap that, in principle, shouldn’t exist.
I’m not claiming this is easy. Mapping 93 Annex A controls to specific API calls, defining appropriate thresholds, calibrating weights, handling exceptions transparently — that’s real work. But it’s work that, once done, produces evidence continuously rather than periodically. And continuous evidence means the audit isn’t an event you prepare for. It’s a status you can demonstrate at any moment.
The question I’ll leave you with
How old is your oldest piece of compliance evidence?
If it’s more than a month old for a technical control, ask yourself: would you stake your certification on it? Would you show it to an auditor with confidence that it reflects your current security posture?
If the answer is “probably, nothing’s really changed” — how do you know? What evidence do you have that nothing’s changed?
That’s the evidence gap. Not a shortage of data. A shortage of evidence that’s structured, weighted, fresh, and sealed.
JJ Milner is a Microsoft MVP and the founder of Global Micro Solutions, a managed services provider operating across 1,200+ Microsoft 365 tenants. He writes about rethinking compliance from first principles.