Originally published on Substack. Republished here for readers who prefer the GMS context; Substack remains the canonical location.
This is the third article in a series on rethinking ISO 27001 compliance from first principles. The previous article examined what auditors actually test — processes, not documents. This one tells the story of three virtual machines that broke our compliance score, and what fixing them taught me about measurement.
Three virtual machines were failing our device compliance checks.
Their names were avmsmcw201uat01, avmsmcw202uat01, and shpsmcsrvuat01. Azure VMs. Transient DevOps testing infrastructure spun up for CI/CD pipeline validation. They had never touched production data. They didn’t have users. They existed to test deployments and were torn down afterwards.
But Azure Policy deploys Microsoft Defender for Endpoint to all Azure VMs. Defender shares a control plane with Intune. So these ephemeral test machines appeared in Entra ID, showed up in Intune device inventory, and promptly failed every compliance check we threw at them: compliance state unknown, not encrypted, not onboarded to EDR.
The result: our device compliance coverage dropped from above 95% to 89.22%. Encryption coverage fell to 87.06%. EDR onboarding to 85.71%. Three machines used for testing — machines that had no business being measured — dropped three separate compliance rules below their thresholds.
This is the compliance equivalent of measuring your cholesterol after eating pizza. The measurement is technically accurate. It’s also completely misleading.
The denominator problem
I remember sitting in a review meeting where our compliance dashboard showed a red “89.22%” for device compliance. The CIO looked at me and asked, “Who are the 11%? Are they executives? Are they laptops in a coffee shop?”
The truth was less dramatic but more frustrating. They were three headless VMs in a Dev environment. But because our “compliance tool” was a blunt instrument, it couldn’t tell the difference between a production workstation and a transient test box. We spent three hours of senior engineering time — time that should have been spent on threat hunting — debating the status of three machines that would be deleted by Friday. This is the “tax” of a poorly defined denominator: it creates noise that masks real risk.
Here’s what I’ve come to believe after years of working with compliance measurements: most compliance failures are classification failures, not security failures.
The three VMs weren’t a security risk. They were a measurement error. The underlying security posture hadn’t changed. The devices that actually mattered — laptops with users, workstations accessing production data, mobile devices with corporate email — were fine. But the denominator was wrong, and a wrong denominator makes every percentage meaningless.
In Intune reporting, devices that don’t live long enough to report status sit in a state called “Not Evaluated.” Crucially, the reporting logic includes “Not Evaluated” devices in the total count. They act as “denominator draggers,” actively shrinking your compliance score just by existing.
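A minimal sketch of the arithmetic (the fleet below is invented for illustration; only the three VM names come from the incident):

```python
# Illustrative only: the fleet size is made up, but the mechanics match the story.
# Devices that never report a status still count in the denominator.
fleet = [{"name": f"laptop-{i:03d}", "state": "Compliant"} for i in range(1, 26)]
fleet += [
    {"name": "avmsmcw201uat01", "state": "Not Evaluated"},  # ephemeral DevOps VM
    {"name": "avmsmcw202uat01", "state": "Not Evaluated"},  # ephemeral DevOps VM
    {"name": "shpsmcsrvuat01",  "state": "Not Evaluated"},  # ephemeral DevOps VM
]

def coverage(devices):
    compliant = sum(1 for d in devices if d["state"] == "Compliant")
    return 100 * compliant / len(devices)

print(f"Raw coverage:      {coverage(fleet):.2f}%")  # the VMs drag the denominator

in_scope = [d for d in fleet if d["state"] != "Not Evaluated"]
print(f"In-scope coverage: {coverage(in_scope):.2f}%")  # denominator reflects real scope
```

The security posture is identical in both calculations; only the denominator changes.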
This isn’t an edge case. It’s a structural problem. Compliance measurement systems collect everything they can see, because the alternative — deciding what to exclude — requires judgement. And judgement is hard to automate, hard to audit, and hard to defend.
What the compliance industry actually measures
Let’s be honest about what the major compliance measurement systems actually do:
Microsoft Secure Score measures policy existence. Is the setting configured? It doesn’t measure whether the feature is working or whether it’s protecting the things that matter.
Intune Compliance Policies measure configuration proxies. For example, the “Require Encryption” policy checks for BitLocker status. But Azure VMs are protected by Server-Side Encryption (SSE) at the storage layer. The data is encrypted, but because the OS doesn’t report “BitLocker On,” Intune marks it as non-compliant. The objective (security) was met; the proxy (BitLocker) failed.
CIS Benchmarks measure settings against a hardening standard. They’re excellent at what they do. But they measure configuration state, not runtime behaviour.
None of these is wrong. They’re all useful. But they’re all proxy measurements — they measure something correlated with security, not security itself.
And here’s the problem with proxy measurements: when the proxy diverges from the thing it’s measuring, you don’t notice. You keep optimising the score while the actual security posture drifts.
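To make the divergence concrete, here's a minimal sketch of checking the objective ("is the data encrypted?") rather than a single proxy ("is BitLocker reporting on?"). The field names are illustrative; in practice they would come from Intune and Azure inventory exports.

```python
# Hedged sketch: evaluate the encryption objective instead of one proxy signal.
def encryption_objective_met(device: dict) -> bool:
    # User endpoints: BitLocker is the right evidence.
    if device.get("bitlocker_on"):
        return True
    # Azure VMs: platform Server-Side Encryption protects the data even though
    # the OS never reports "BitLocker On".
    if device.get("is_azure_vm") and device.get("sse_enabled"):
        return True
    return False

print(encryption_objective_met({"name": "laptop-001", "bitlocker_on": True}))   # True
print(encryption_objective_met({"name": "avmsmcw201uat01", "is_azure_vm": True,
                                "sse_enabled": True, "bitlocker_on": False}))    # True
```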
The precision trap
There’s a counterintuitive dynamic at play: the more precisely you measure the wrong thing, the worse your decisions get.
Our compliance framework uses weighted rules. Device compliance coverage carries a weight of 30 out of 100. Encryption coverage carries 20. EDR onboarding carries 10. These weights reflect operational importance.
When those three VMs appeared in the denominator, the compliance score dropped across all three rules. The system dutifully calculated the new percentages with four-decimal precision. It produced a report with a RAG status of amber instead of green. If we’d had remediation ticketing connected, it would have generated tickets for non-compliant devices — tickets that would have sent a technician to investigate machines that existed for a few hours of pipeline testing.
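For the record, the scoring mechanics look roughly like this. It's a sketch: the weights and percentages are the ones above, but the RAG thresholds are assumptions rather than our production configuration, and only the three rules named here are included.

```python
# Hedged sketch of a weighted compliance score with RAG banding.
rules = {
    # rule name: (coverage %, weight out of 100 in the full framework)
    "device_compliance_coverage": (89.22, 30),
    "encryption_coverage":        (87.06, 20),
    "edr_onboarding":             (85.71, 10),
}

def weighted_score(rules):
    total_weight = sum(w for _, w in rules.values())
    return sum(pct * w for pct, w in rules.values()) / total_weight

def rag(score, green=95.0, amber=85.0):
    # Illustrative thresholds: >= green -> Green, >= amber -> Amber, else Red.
    if score >= green:
        return "Green"
    if score >= amber:
        return "Amber"
    return "Red"

score = weighted_score(rules)
print(f"Weighted score: {score:.2f}% -> {rag(score)}")
```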
Precision made the problem worse, not better. The measurement system was too accurate for its own good, because it was precisely measuring the wrong population.
Exception management as a first-class concern
The standard actually anticipates this. ISO 27001 doesn’t say “measure everything.” It says “determine controls that are appropriate.” The Statement of Applicability exists precisely so organisations can document what’s in scope and what isn’t — with justification.
But in practice, exception management is an afterthought. It’s the thing you deal with when the numbers look bad, not something you design into the measurement system from the start.
After the DevOps VM incident, I started treating exceptions as a first-class architectural concern. The solution required a shift in tooling because standard tools were too slow.
First, classification via Filters, not Groups.
Initially, I used Dynamic Entra ID Groups. But Dynamic Groups have a processing latency — sometimes hours. Ephemeral DevOps VMs were being created, tested, and destroyed before the Dynamic Group rule could even process them. They lived their entire life in a “non-compliant” state because the exclusion was too slow.
The fix was Intune Filters. Filters evaluate at assignment time, in real time. I created a filter rule that matches device.deviceName -startsWith "avms". This ensured that transient assets were excluded from the compliance calculation the instant they were born.
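If you want to manage that filter as code rather than clicks, a sketch like the following could work. The Graph beta endpoint, payload shape, and token handling are assumptions to verify against current Microsoft Graph documentation, not a confirmed recipe.

```python
# Hedged sketch: create an Intune assignment filter matching transient DevOps
# VMs by name prefix. Endpoint path and payload are assumptions based on the
# Graph beta API for assignment filters; verify before relying on this.
import requests

GRAPH = "https://graph.microsoft.com/beta"
token = "<access token acquired via your usual auth flow>"  # placeholder

filter_body = {
    "displayName": "Exclude transient DevOps VMs",
    "description": "Ephemeral CI/CD test machines; out of endpoint compliance scope",
    "platform": "windows10AndLater",
    "rule": '(device.deviceName -startsWith "avms")',  # add further prefixes as needed
}

resp = requests.post(
    f"{GRAPH}/deviceManagement/assignmentFilters",
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json=filter_body,
    timeout=30,
)
resp.raise_for_status()
print("Created filter:", resp.json().get("id"))
```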
Second, transparency.
Every exclusion appears in the evidence output. There’s a dedicated ExcludedDevices tab in the compliance report that lists every excluded device, the reason for exclusion, and the rules it was excluded from. The auditor can see exactly what was removed from the denominator.
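As a rough sketch of that evidence artefact: the device rows below are from the incident, but the column names, workbook layout, and filename are illustrative (the real report contains far more than this one tab).

```python
# Hedged sketch of the ExcludedDevices evidence tab: every exclusion is listed
# with a reason and the rules it was removed from, so the auditor can see
# exactly what left the denominator.
from openpyxl import Workbook

excluded = [
    ("avmsmcw201uat01", "Ephemeral DevOps test VM", "DeviceCompliance; Encryption; EDR"),
    ("avmsmcw202uat01", "Ephemeral DevOps test VM", "DeviceCompliance; Encryption; EDR"),
    ("shpsmcsrvuat01",  "Ephemeral DevOps test VM", "DeviceCompliance; Encryption; EDR"),
]

wb = Workbook()
ws = wb.active
ws.title = "ExcludedDevices"
ws.append(["DeviceName", "ExclusionReason", "RulesExcludedFrom"])
for row in excluded:
    ws.append(row)
wb.save("compliance-evidence.xlsx")  # illustrative filename
```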
Third, review.
Annual access reviews of the exception logic. Not because I think DevOps VMs will suddenly start accessing production data, but because naming conventions change.
Exception management at scale
The DevOps VM fix wasn’t an isolated patch. It became a pattern. Across the tenants I manage, the same denominator problem appeared in different forms: service accounts inflating MFA compliance denominators, kiosk devices failing endpoint compliance checks, shared mailboxes distorting access review coverage, break-glass accounts flagging privileged access thresholds, deprecated applications triggering OAuth reviews, and personal-use accounts skewing policy acknowledgement metrics.
Each one required the same approach: classify the exception, exclude it from the measurement with an auditable justification, and review the exclusion periodically. What started as a one-off fix for three VMs became seven distinct exception categories, each implemented as an Entra ID security group with defined membership rules, documented scope (which rules the exception applies to), and a defined review cadence — quarterly for break-glass accounts, annual for the rest.
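A hedged sketch of how those categories can be modelled as data; the group names, rule identifiers, and the selection of entries are placeholders, not our actual configuration.

```python
# Hedged sketch: each exception category maps to an Entra ID security group,
# the compliance rules the exclusion applies to, and a review cadence.
from dataclasses import dataclass

@dataclass
class ExceptionCategory:
    name: str
    entra_group: str              # Entra ID security group holding the members
    applies_to_rules: list[str]   # compliance rules the exclusion applies to
    review_cadence: str           # how often membership is re-justified

EXCEPTIONS = [
    ExceptionCategory("DevOps test VMs", "sg-exc-devops-vms",
                      ["DeviceCompliance", "Encryption", "EDR"], "annual"),
    ExceptionCategory("Break-glass accounts", "sg-exc-breakglass",
                      ["MFA", "PrivilegedAccess"], "quarterly"),
    ExceptionCategory("Service accounts", "sg-exc-service-accounts",
                      ["MFA"], "annual"),
    # ... the remaining categories follow the same shape ...
]
```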
The key insight: exception management isn’t a workaround for measurement failures. It’s the mechanism by which you translate the standard’s “determine controls that are appropriate” into a denominator that actually reflects your environment.
What first-principles thinking looks like
First principles means starting from the question the standard actually asks, not from the measurement that’s easy to collect.
Control A.8.1 (User Endpoint Devices) asks: “Information stored on, processed by or accessible via user endpoint devices shall be protected.”
This is where the classification failure happened. Servers and DevOps agents are not User Endpoint Devices. They are infrastructure. Including them in an A.8.1 report is a category error.
When I approach a control from first principles, I ask:
- What is the standard actually trying to protect? For A.8.1, it’s user laptops and mobiles in unmanaged environments (coffee shops, airports).
- Where do these VMs belong? They belong in Control A.8.9 (Configuration Management) or Control A.8.31 (Separation of Development, Test and Production Environments).
- What evidence demonstrates protection? For A.8.1, it’s BitLocker and MDM. For A.8.31, it’s evidence that the test environment is isolated from production.
- What’s in the denominator? This is the question nobody asks. If I move these VMs to the A.8.9 report, my A.8.1 score returns to 95% because the denominator is now accurate (only user devices).
This approach takes longer than copying a template. It requires understanding both the standard and the technology. But it produces measurements that actually mean something. And it produces evidence that survives the auditor’s follow-up questions — because when the auditor asks “why is this VM excluded from your Endpoint report?”, the answer is “Because it’s not an endpoint; here is its configuration report under Control A.8.9.”
The question I’ll leave you with
What assumptions are baked into your compliance measurements that you’ve never questioned?
Which devices are in your denominator? Are you using “BitLocker status” as a proxy for “Encrypted” on systems that use Server-Side Encryption? Are you using Dynamic Groups for assets that live shorter lives than the group processing time?
If the answer to any of these is “I don’t know” or “whatever the tool collects by default,” you might have a precisely wrong compliance score. And precisely wrong is worse than approximately right — because it gives you confidence in a number that doesn’t reflect reality.
The standard doesn’t ask for perfect measurements. It asks for appropriate ones. Getting from “everything the tool collects” to “everything that matters” requires judgement, documentation, and the willingness to question your own denominators.
JJ Milner is a Microsoft MVP and the founder of Global Micro Solutions, a managed services provider operating across 1,200+ Microsoft 365 tenants. He writes about rethinking compliance from first principles.