LLMs Are Proving That It Is Impossible to Automate Away the Human Experience

Celeste Aronow

April 15, 2026

The Gap Is Real

If your accessibility process ends with a passing scan, your users with disabilities are likely encountering barriers you do not know about.

Automated tools are useful. LLM-assisted auditing has raised the ceiling. Neither closes the gap between what a tool can verify and what a person actually experiences when navigating with a screen reader. The gap is real, it is measurable, and it has organizational consequences.

I have spent years running every category of automated accessibility tool the industry has produced: rule-based scanners like axe-core and WAVE, visual analyzers like Lighthouse, browser extensions with built-in screen reader simulations. When LLMs entered the picture, the promise was compelling: tools that could finally understand context, interpret user flows, and reason about accessibility the way a human specialist does.

I wanted that to be true. It isn't.

 

What Automation Does Well

Let me be fair to the tools. Automated accessibility scanners like axe-core, Lighthouse, and WAVE are genuinely good at what they do. They catch heading hierarchy violations quickly and reliably. They flag color contrast failures with precision. They identify missing alt attributes on images in many cases.

These are real wins. If you have run these tools on your product, you have addressed real issues, and the engineers who did that work should feel good about it. Automated tooling is a force multiplier. It speeds up the work and amplifies how much ground a team can cover.

The problem is not what these tools do. It is what they cannot do, and what organizations assume they have done.

 

What Automation Invents

The tools I praised above (axe-core, Lighthouse, and WAVE), along with extensions like SilkTide, are genuinely valuable parts of a real accessibility workflow. They speed up discovery, give teams a shared vocabulary, and help non-specialists see problems that would otherwise be invisible. They also generate a lot of output. A single scan on a moderately complex page can return dozens or hundreds of flagged items, spread across multiple categories, severity levels, and rule references. Parsing that into a useful action plan requires judgment and domain knowledge most teams do not have. And buried in that volume is a specific and costly problem: violations that are not there.

Before we talk about what tools miss, we need to talk about what they make up. This is the part that surprises most teams.

I want to be clear about something first: the SilkTide accessibility extension has been genuinely valuable in my work. Its visual screen reader simulation is one of the best tools I have found for helping stakeholders and engineers understand what the screen reader experience feels like. When you need to show a sighted team why a particular layout or element ordering matters, SilkTide makes that visible in a way that is immediately intuitive. I recommend it regularly and I mean that sincerely.

A visual approximation of the screen reader experience is not the screen reader experience. And when it is treated as one, things go wrong.

The Reality of Simulation

Not long ago, I was working with the SilkTide extension on a project. The tool flagged what appeared to be serious accessibility problems, the kind that would keep a screen reader user from navigating a page effectively. Hours were spent investigating. Engineers dug into the markup, tried different fixes, and tested and retested.

Then I opened VoiceOver, the real screen reader built into macOS that millions of actual users depend on, and tested the same page. The problems were not there.

The tool was wrong. The human experience was fine. The time was gone.

This is not an edge case. Visual screen reader simulations flatten the real experience into a rough approximation. Real screen readers like NVDA, VoiceOver, and TalkBack each have distinct announcement patterns, voice characteristics, and behaviors. A simulation does not match any of them. When you test against a simulation, you are testing against something no real user will ever encounter.

False positives are particularly insidious because they look like diligence. Your team can spend enormous engineering effort chasing ghosts while the actual user experience problems sit unexamined. The energy goes somewhere, it just does not go where it matters.

LLMs compound this problem in a way that is qualitatively different. Rule-based tools at least apply consistent rules, so their false positives tend to be repeatable and, once understood, ignorable. LLMs hallucinate. They will flag violations that not only do not exist, but that they have invented wholesale, citing WCAG criteria that do not apply, describing user impact that is not possible given the markup, or referencing accessibility attributes that do not exist in any specification. Because the output is fluent and confident, it reads like expert analysis. A team without deep accessibility knowledge has no easy way to distinguish a hallucinated violation from a real one. The false positive problem with rule-based tools is a noise problem. With LLMs, it is a credibility problem.

 

What Automation Misses

Now for the other side: the violations that tools do not catch at all.

According to the Intelligence Community Design System, standard automated scanners typically miss 50% to 70% of the actual accessibility barriers in a modern web application. A 2025 peer-reviewed study from the International Web for All Conference (W4A '25) put this to the test directly, evaluating three leading LLMs (GPT-4o, Gemini, and Llama 3) on their ability to detect accessibility violations in mobile apps.

The best-performing model detected just 38% of the violations that a traditional automated scanner found. The other two caught 30% or less. LLMs are not closing the gap; they are detecting even fewer issues than the tools that were already insufficient. Stack those numbers: if a scanner catches only 30% to 50% of real barriers, and an LLM detects at best 38% of what the scanner finds, an LLM-only workflow may be surfacing as little as one in ten of the actual accessibility issues your users face (0.30 × 0.38 ≈ 11%).

Current Evolution

It is worth noting that LLM capabilities have expanded considerably since this study was conducted. Skill plug-ins, purpose-built agents that extend LLMs with accessibility-specific knowledge and tool integrations, have raised the ceiling again. Community Access, an impressive open-source project, has built 57 such agents covering WCAG 2.2 AA compliance across web, documents, and code environments, with contributions from blind and low vision developers. These tools represent a meaningful step forward. The structural problems this study identified have not gone away: plug-ins can improve detection and generate better fix suggestions, but they still cannot tell you what it feels like to encounter a barrier, and they still require an expert who knows when to trust the output and when to override it.

Automated tools cannot judge whether ARIA labels are being used appropriately, only whether they are present. An element with role="button" will pass a scan, but if it lacks the JavaScript to handle keyboard focus and activation, it is completely broken for a keyboard user. The tool sees the role. The user hits a wall.
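A minimal sketch of that pattern (the element, labels, and save() handler here are illustrative, not taken from any specific engagement):

```html
<!-- Passes a scan: the role and accessible name are present. -->
<!-- Fails a keyboard user: a div is not focusable and does not
     respond to Enter or Space unless scripting adds both. -->
<div role="button" aria-label="Save draft">Save draft</div>

<!-- What the div would need to behave like a button
     (save() is a hypothetical handler): -->
<div role="button" tabindex="0" aria-label="Save draft"
     onclick="save()"
     onkeydown="if (event.key === 'Enter' || event.key === ' ') save()">
  Save draft
</div>

<!-- Or skip the problem entirely with the native element: -->
<button type="button" onclick="save()">Save draft</button>
```

A scanner sees the same role attribute in all three cases; only hands-on keyboard testing distinguishes them.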

Tools cannot evaluate focus management or keyboard operability across multi-step interactions. They take a snapshot of the page in its initial load state and miss everything that happens after a user starts navigating. A user could get trapped in a modal dialog, skip from the header to the footer when tabbing, or encounter the same control twice in a row, and the scan would still return green.

And tools cannot interpret the WCAG criteria that require understanding context and user flow. They cannot tell if a heading structure logically reflects the content it organizes, or if an error message actually helps someone fix a form field, or if instructions that say "click the red button" make any sense to someone who cannot see color.

These are not obscure technical footnotes. These are the barriers that real users with disabilities encounter every day.

A Button Inside a Link

Here is a concrete example from my own work.

During an engagement, I encountered a button nested inside a link: semantically invalid HTML that creates a confusing experience for assistive technology users. The automated tools we ran reported it as fine. The auditor on the engagement, a capable QA analyst who was relatively new to accessibility work, also missed it during his review.

I caught it through screen reader testing.

When I tabbed through the page with a screen reader, focus landed on that element twice in a row. The experience was disorienting: a moment of "Wait, am I going backward? Did something break?" before I realized the same control had just been announced twice because of the invalid nesting.
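The pattern looked roughly like this (simplified and anonymized; the real markup was more involved):

```html
<!-- Invalid HTML: the content model of <a> excludes interactive
     content, so a <button> may not be nested inside a link. -->
<a href="/pricing">
  <button type="button">See pricing</button>
</a>
<!-- Both the link and the button are tab stops, so keyboard focus
     lands on what sounds like the same control twice in a row. -->
```

Nothing here is missing an attribute a scanner could flag; the failure only surfaces when someone actually tabs through it.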

This example matters because it has two layers. It is not just that automation missed it. A human auditor missed it too. The argument is not "humans over automation." It is that the gap requires the right human, someone who tests with the tools real users depend on and who recognizes the experiential failures that neither scanners nor inexperienced reviewers will catch.

 

Why the Gap Is Unfixable Without Lived Experience

LLMs understand WCAG criteria at a rule level, and they understand them well. They can parse the guideline text, identify which success criteria apply to a given element, and suggest fixes that are often technically correct.

They do not have human nervous systems.

They cannot feel what it is like to navigate a page with NVDA and hit a confusing landmark. They cannot feel the disorientation of losing track of focus, or the frustration of a flow that breaks down in a way that is hard to articulate but immediately apparent when you are living it.

The misses are not just "wrong rule applied." They are subtler: a fix that is technically correct but practically disorienting, or a violation that the LLM does not flag because the guideline text does not capture the experiential failure.

The same W4A '25 study confirmed this in practice. When the LLMs did detect a violation and attempted to fix it, the results were striking: GPT, the best-performing model for fix quality, produced syntactically valid fix code for fewer than four in ten of the violations it detected. The worst, Llama 3, which had the highest detection rate, generated zero compilable fixes out of 126 detected issues. The researchers concluded plainly: "the evaluated LLMs are not yet ready to be employed in fully automated approaches for detecting and fixing accessibility issues. Even in cases where the models generated code, manual intervention was required."

These results predate the current generation of accessibility skill plug-ins discussed earlier. The ceiling has moved. The conclusion has not.

LLMs are predictive tools. They generate the most likely output based on their training data, which includes a vast amount of inaccessible code from the public web. They even hallucinate ARIA attributes that do not exist, or suggest changes like role="application" that actively destroy the screen reader experience. I have encountered this myself, cross-referencing ARIA attributes my LLM coding assistant confidently recommended, only to find they do not exist in any specification.
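As an illustration of both failure modes (the aria-focusable attribute below is deliberately fake, the kind of invention an assistant can produce):

```html
<!-- Plausible-looking but wrong: "aria-focusable" exists in no ARIA
     specification, and role="application" disables most of a screen
     reader's built-in document navigation for everything inside it. -->
<div role="application" aria-focusable="true">
  <p>Ordinary content that a screen reader user can no longer browse
     with their usual reading commands.</p>
</div>

<!-- A real state on a real widget role, checkable against WAI-ARIA 1.2: -->
<div role="button" tabindex="0" aria-pressed="false">Mute</div>
```

Every attribute an LLM suggests can be checked against the WAI-ARIA specification; a fluent suggestion is not evidence that the attribute exists.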

True WCAG compliance requires validating what it actually feels like to use a product: with a screen reader, with keyboard-only navigation, with high contrast mode, and with the full range of assistive technology real users depend on. No algorithm, no matter how sophisticated, can perform that validation.

 

What a Real Specialist Brings

I am a disabled military veteran and an accessibility specialist. My service left me with disabilities that make some websites genuinely inaccessible to me, not as a test case, but as a real barrier I encounter. I have lived experience of how the WCAG Guidelines protect me. I am the thing automation is trying to replace. It can't.

A specialist who combines technical training with lived disability experience does what no tool can.

They validate real user flows with actual screen readers (NVDA, JAWS, VoiceOver), not simulations, not approximations. They interpret ambiguous WCAG cases that require contextual judgment, where the rule text does not cover every situation. They work with engineers to implement fixes that improve the actual experience, not just satisfy a rule check. And they identify experiential failures that have no corresponding rule: the disorientation, the lost context, the flow that quietly breaks down in a way a checklist will never surface.

These are not separable qualities. Technical training and lived experience reinforce each other. Trained engineers know the guidelines and can navigate tooling. Lived experience means you feel the failures that tools cannot detect.

Your automated tools are a valuable first layer. They are not a last layer. The organizations that treat accessibility as a genuine commitment to their users, not just a compliance checkbox, are the ones that close the gap between a dashboard full of green lights and an experience that actually works for everyone.

That gap is where specialists live. And it is exactly the kind of work we do at 8th Light.

 

Accessibility Initiatives in the Real World

Not all corporate environments are created equal. Sometimes, it helps to have an independent third party validate the progress you've made. 

Want an evaluation of where accessibility stands for your digital experiences?

Spark a Conversation >