Rethinking Web Accessibility On Windows

For the past 18 years, screen readers on Windows have had a particular way of presenting web content to their users. This model no longer scales well in light of modern web applications and dynamically changing web sites. It therefore needs to be reimagined and modernized.

A little bit of history

In 1999, the two major screen reader vendors on the market, JAWS and Window-Eyes, released new versions of their software that introduced a new way of web browsing for blind and visually impaired users of Microsoft Internet Explorer. Web content was represented in a document-like form, much like what users already knew from word processors such as Microsoft Word. Semantic information like links, form controls, and later headings, lists, and tables, was inlined and spoken accordingly.

To achieve this, the screen readers interfaced directly with IE and slurped in all the HTML of the currently loaded web page. They then interpreted the HTML themselves to build up the virtual document, sometimes also called a virtual buffer. Virtual, because it was not a real document, but a close approximation of one that used navigation paradigms very similar to those of a real document.

The advantage of this approach was that the screen reader always knew everything there was to know about the mostly static web pages of the time. It could also perform operations such as listing all links on a page, or all form fields. Later, the vendors also introduced so-called quick navigation keys. By pressing a single letter, h for example, the user can move to the next item of a particular type. That way, users could quickly jump to the next heading, list, table, form field, visited or unvisited link, etc.

However, as the web became more dynamic, first in the mid-2000s through mechanisms such as AJAX, and later through full-fledged JavaScript frameworks, the virtual buffers began showing signs of inadequacy. Because they needed to be kept in sync with the actually loaded document, a lot of communication had to take place between the browser and screen reader while the user was operating the page, and content could change under the user’s nose, often dramatically altering the shape of the virtual document.

New browsers, new APIs

When Mozilla began producing Firefox in the early 2000s, it became apparent that it would not only run on Windows, but on Linux and Mac as well. On Linux, the requirement was that a full-fledged accessibility programming interface would have to be used, and the assistive technology was not supposed to do any HTML parsing of its own. The saying went: the browser already knows everything, so why should the assistive technology have to reinterpret it all? The browser should, instead, tell the AT what it had through a set of standard programming interfaces.

And because a project called NVDA, a free and open-source screen reader, was being started on Windows, the same requirements were formulated for this platform as well. Instead of devising a whole new API, it was decided to enhance an existing accessibility programming interface called Microsoft Active Accessibility (MSAA), and add to it all the pieces it was lacking to support full-fledged web content. The result was named IAccessible2, to indicate that IAccessible, the official interface name for MSAA, had been enhanced.

Around the same time, Apple started implementing accessibility support in WebKit, the engine powering the Safari web browser on OS X, now called MacOS. It, too, chose to define everything VoiceOver, Apple’s screen reader, needed to know through a set of standardized interfaces, instead of having VoiceOver interpret the HTML itself.

And when Microsoft started its work on Edge, the successor to Internet Explorer, it took the same approach: it no longer allowed screen readers access to the raw HTML, and instead exposed web content through its successor to MSAA, called UI Automation, or UIA.

Differences in architecture

On Windows, there are two ways to communicate between applications when using MSAA. One is called in-process, the other out-of-process. In the in-process case, one application injects part of its code into the other application’s running instance. It essentially becomes part of that other application. In our case, the screen reader injects part of its code into the browser, in order to get at the information for building the virtual document fast.

The other method, where the screen reader communicates with the browser across application boundaries through a set of protocols defined by the operating system, is between 12 and 50 times slower than the process-injection method, depending on the actual queries being made. So depending on the size of the loaded web page, loading it into a virtual buffer could take under one second with in-process methods, versus 12 to over 40 seconds when doing it across process boundaries. The advantage of in-process methods is therefore very clear.

The disadvantage, however, is that either component can bring the other down with it when crashing. If the screen reader module has a problem and crashes, it can take the browser down with it, since it is now part of it. Likewise, if the browser crashes, it can render parts of the screen reader inoperable. There are also security implications, and rumor has it that process injection might not be possible for much longer.

On other platforms such as Linux and MacOS, there is no such thing as injecting parts of one process into another. Inter-process communication, such as between an assistive technology and a browser, has to always happen across process boundaries and through standardized protocols. Additionally, Microsoft no longer allows process injection when using UI Automation to communicate between applications. Therefore, all communication between Microsoft Edge and a screen reader has to happen out of process, or in some cases through a specialized UIA bridge between applications.

Retire virtual buffers

As a consequence, neither Orca, the dominant screen reader on Linux, nor VoiceOver on MacOS and iOS has any concept of a virtual buffer as known from Windows screen readers. Moreover, due to the above-mentioned limitations, NVDA also no longer uses a virtual buffer to represent web content when browsing with Microsoft Edge.

So, to make screen readers on Windows future-proof for the next 10 to 20 years of web content, it is time to retire the virtual buffer concept and work with the notion that a screen reader no longer knows everything about a whole web document at any given time, but only the parts of it the user is currently working with. Once process injection is no longer possible on Windows, even when using MSAA or IAccessible2, trying to maintain a full virtual buffer and keep it dynamically updated will no longer be feasible in a performant manner.

Browsers must take more responsibility

To achieve parity with existing screen reader functionality, and therefore keep their users competitive with sighted peers, browsers must take on several of the tasks the screen reader previously performed inside its virtual buffer. Some of these are:

  • Provide a means to fetch all items of a certain type on the currently loaded web page, for example all links, all headings, etc.
  • Allow a means to jump from the current position to the next heading, table, list, or other desired element without the screen reader having to crawl through all accessible objects in between. The browser knows much more about the loaded web content than the screen reader, so it should leverage that and help with this type of navigation by providing the next relevant object directly.
  • Fetch a given range of text with all its attributes and other semantic information, so the screen reader can give the user a proper representation of what is at the current position.
  • And other things that will need to be determined as this transition happens.
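To make the first two bullets a little more concrete, here is a sketch of what such a browser-side query interface might look like in TypeScript. All names here are hypothetical, invented for illustration only; a tiny in-memory stand-in takes the place of the real browser so the sketch is runnable.

```typescript
// Hypothetical browser-side query interface; every name is illustrative.
type ElementType = "heading" | "link" | "list" | "table" | "formField";

interface AccessibleNode {
  id: number;
  type: ElementType | "text";
  name: string;
}

interface BrowserAccessibilityService {
  // Bullet 1: fetch all items of a given type on the current page.
  fetchAll(type: ElementType): AccessibleNode[];
  // Bullet 2: jump from a position to the next element of a type,
  // without the AT crawling all the objects in between.
  nextOfType(fromId: number, type: ElementType): AccessibleNode | null;
}

// Minimal in-memory stand-in for the browser.
class MockService implements BrowserAccessibilityService {
  constructor(private nodes: AccessibleNode[]) {}
  fetchAll(type: ElementType): AccessibleNode[] {
    return this.nodes.filter(n => n.type === type);
  }
  nextOfType(fromId: number, type: ElementType): AccessibleNode | null {
    const start = this.nodes.findIndex(n => n.id === fromId);
    return this.nodes.slice(start + 1).find(n => n.type === type) ?? null;
  }
}

const page = new MockService([
  { id: 1, type: "heading", name: "News" },
  { id: 2, type: "text", name: "Intro paragraph" },
  { id: 3, type: "link", name: "Read more" },
  { id: 4, type: "heading", name: "Sports" },
]);

console.log(page.fetchAll("heading").map(n => n.name)); // ["News", "Sports"]
console.log(page.nextOfType(2, "heading")?.name);       // "Sports"
```

The key design point is that the screen reader issues one query and gets one answer back, instead of pulling the whole tree across the process boundary.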

Parts of the above already exist, for example in the AT-SPI collection interfaces on Linux, and in the APIs introduced with OS X 10.11 El Capitan into Safari and Chrome to make navigation and fetching particular lists of items with VoiceOver much more responsive. Especially the Linux case, though, shows that there is still room for improvement. AT-SPI collections can do some things, but not others that would help performance greatly when quickly navigating by headings or lists, for example.


The following two examples demonstrate how certain types of interaction work now, and how they should work in the future on Windows.

Moving to next heading

A very common task is to move to the next heading. The way screen readers work now, they suck in all the web content at once, so it is always at the ready when the user wants to do quick navigation. The user invokes the Next Heading command. The screen reader crawls through its virtual document and places the virtual cursor on the next heading. The browser, on the other hand, is totally unaware that the screen reader user has changed reading context. The screen reader might or might not ask the browser to focus the new element if possible.

In the future, the screen reader should ask the browser for the next heading relative to the current reading position. The browser then does the searching, which it can do much faster than the screen reader can across process boundaries, and provides the new location back to the screen reader, which in turn reads out the information. Both browser and screen reader are now aware of the new reading position.
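The future flow just described can be sketched as follows. This is not any shipping API; all names are made up for illustration. The point to notice is that the browser does the searching and both sides end up agreeing on the reading position.

```typescript
// Hypothetical sketch of the future "next heading" flow.
interface AccNode { id: number; role: string; name: string; }

class Browser {
  caret = 0; // browser-side reading position (index into the tree)
  constructor(private tree: AccNode[]) {}
  // The browser searches its own tree: far cheaper than the screen
  // reader crawling node by node across process boundaries.
  nextHeading(): AccNode | null {
    for (let i = this.caret + 1; i < this.tree.length; i++) {
      if (this.tree[i].role === "heading") {
        this.caret = i; // the browser updates its own position too
        return this.tree[i];
      }
    }
    return null;
  }
}

class ScreenReader {
  spoken: string[] = [];
  constructor(private browser: Browser) {}
  pressNextHeadingKey(): void {
    const h = this.browser.nextHeading(); // one call, one answer
    this.spoken.push(h ? `Heading: ${h.name}` : "No next heading");
  }
}

const browser = new Browser([
  { id: 0, role: "heading", name: "Top" },
  { id: 1, role: "paragraph", name: "Some text" },
  { id: 2, role: "heading", name: "Details" },
]);
const sr = new ScreenReader(browser);
sr.pressNextHeadingKey();
console.log(sr.spoken);      // ["Heading: Details"]
console.log(browser.caret);  // 2 — browser and screen reader agree
```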

Inserting new content dynamically

When using Twitter in the so-called forms, application, or focus mode, in which the user interacts with the browser directly, the user can use Twitter’s keyboard shortcuts. But because the user might switch back to virtual mode at any time, the screen reader needs to keep its buffer up to date at all times, regardless of whether the buffer is actually being used.

So while the user is merrily reading tweets, they are suddenly notified that new tweets were added at the top, and that they can press the Period key to load them. By default, Twitter loads 50 tweets at once, and loads more once the user reads the last two or three currently loaded tweets. Now the user has maybe reached tweet 70 or 80, and 30 new ones are waiting to be inserted.

The user presses Period. Twitter loads the new tweets and puts focus at the top. The screen reader gets notified that the list of previously 70 items has now expanded to 100. But because the whole list layout has changed, the most efficient way to cope is to simply replace the old list in the buffer with the new one. So the screen reader removes the old list, crawls through the hierarchy to fetch the new 100 list items, each of which may consist of 6 to 10 or more accessible objects, and rerenders it. This takes some time, even when done in-process.

The future way of doing this is far less processing-intensive. The new tweets come in, the user presses Period, the browser adds the items, and Twitter moves focus to the newest item and tells the screen reader. The screen reader now only needs to ask the parent list container how many items it has, so it can tell the user. It does not need to crawl through the whole list, because there is no buffer to maintain.
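Under this model, the screen reader’s reaction to the focus change shrinks to a single query, roughly like this sketch (again with hypothetical names; a plain object stands in for Twitter’s tweet list):

```typescript
// Hypothetical sketch: reacting to newly inserted tweets without a buffer.
interface ListContainer {
  childCount: number;
}

class TweetList implements ListContainer {
  childCount = 70;
  addTweets(n: number) { this.childCount += n; }
}

class ScreenReader {
  announcements: string[] = [];
  // Fired by the browser when focus moves to the newest item.
  onFocusChanged(focusedName: string, parent: ListContainer) {
    // One query to the parent container; no crawl over all 100 items.
    this.announcements.push(`${focusedName}, list with ${parent.childCount} items`);
  }
}

const list = new TweetList();
const sr = new ScreenReader();
list.addTweets(30);                      // Twitter inserts 30 new tweets
sr.onFocusChanged("Newest tweet", list); // browser reports the focus move
console.log(sr.announcements[0]);        // "Newest tweet, list with 100 items"
```

Compare the cost: one property read here, versus fetching 100 list items of 6 to 10 accessible objects each to rebuild a buffer.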

This is why, for example, using Twitter on MacOS in Safari or Chrome is so much smoother than on Windows with, for example, Firefox or Chrome and NVDA.

In conclusion

At Mozilla, we want to start a project to enhance current APIs to allow screen readers to move away from the virtual buffer concept. The ideas are already floating around among team members. And because NVDA is also open source, we can use an experimental branch to actually prototype approaches and check the theory against the practicality of day-to-day use. Sometimes a theory sounds good, but then falls on its face in practice.

We expect to fail at some first attempts, learn from the experience, and iterate. The end goal is an experience that is as good as, if not better than, what we have now on Windows when browsing the web: being ready for dynamic web applications built in frameworks such as Angular or React, and working more efficiently in complex web applications such as Gmail or Slack.

And we certainly hope that our friends at other browser accessibility teams such as Chrome or Edge, as well as commercial assistive technology vendors, will take an interest in what we do, and in a collaborative fashion, we’ll all strive to give our users better experiences.

Web accessibility on Windows needs to grow up now that it’s 18, and we want to try and make that happen!

17 thoughts on “Rethinking Web Accessibility On Windows”

  1. I would add, to the list of virtual buffer features to support from the browser side, the text find feature. I often recommend it to users as a way to find something of interest when the type of the sought item is not known. For example, if I’m sitting at an office workstation and a coworker says “Click Invoices,” “Invoices” may be implemented as a button, a link, a tab, a clickable item without a properly defined type, etc.


      1. Hmm yes, but this raises the question of whether it would be better to teach users to use the built-in search in browsers. Which currently doesn’t work due to the virtual buffer concept. Also, I wonder if some features like landmark navigation would be beneficial for keyboard users in general and might also be better placed in browsers.


    1. I just wanted to say a huge “amen” to what Doug has posted below, re the importance of a fast and reliable text “find” feature, as we have now in virtual buffer land. Also, I would add that some kind of “place marker” implementation is another feature I absolutely would not want to lose, since when they work properly, place markers can be invaluable, I feel, in speeding up general browsing navigation.


  2. Hi Marco, you make some very interesting points. Thanks for the write-up of your talk. If you do need any testing of anything in an experimental branch of NVDA, I would be glad to help. I get really fed up with the lag when using Google Drive or web-based Twitter, and I will not even start about using Facebook! 😀


  3. This is a fantastic post and we are at an important time for defining good APIs.

    I do think you should consider expanding the scope a little to consider other AT types as well. For example, when playing with Jambu I found issues with the AT activating nodes, and a lack of definition of event sequences for, say, a pop-up menu. However, that was for Firefox and Chrome, but web apps now have just as much dynamic activity. I hadn’t realised it was all so new at the time. If I had, I would have spent more time with these issues.

    I wonder if the Edge team’s work on the accessibility tree could inform the new APIs? Perhaps acting as the underlying browser-side model.

    We should aim for harmonisation across platforms as far as possible. That’s what IAccessible2 attempted by mirroring Linux/GNOME AT-SPI. Though of course, dropping virtual buffers and in-process access is doing just that!

    I’m excited to watch how the newly expanded Mozilla a11y team progresses this.


  4. The concept of there being focus on elements of a web page (such as a paragraph of text) is really hard to get across to the sighted people I work with. The cursor analogy is a good one, of course, and as blind people we come to rely on things like the text find feature, as Doug mentioned. Interestingly, when JAWS and a big page were having speed issues for me recently, I turned off the JAWS virtual cursor and used Firefox’s find-on-page to move focus, and it worked reasonably well. But the big drawback to all of this, of course, is that the non-actionable stuff is hard to get at. I wonder if we could take some lessons from XPath in terms of finding content?


  5. I think it is also important to separate out the technical details here from the user experience and interaction model. Whether the assistive technology knows about the full document or not is, to me, to some degree independent of how the user interacts with that content.

    For any web experience with a sizable amount of text that a user wants to read that isn’t otherwise actionable, i.e. not a link, button, or other control, we still need an efficient way to move through all that text.

    Today this is far too complex, requiring that a user understand different modes, along with a range of techniques by web authors and assistive technology makers to try and manage this behind the scenes. These fall short in many cases.

    That said, the interaction model of up, down, left, and right mixed with a range of navigation keys has proven effective. You can abstract this a bit and think of it as previous and next, with the ability to move at various levels of detail and with various definitions of what previous and next are. On Windows this is often done, as I say, through the up and down arrows, but the same concept exists on other platforms even if the user interactions differ. The screen reader can still get to all the content, even when it isn’t directly focusable otherwise.

    It would be great if more of the heavy lifting of this came from the browsers and we saw basics such as link lists, headings and more move into the browser for sure.


    1. I believe the scan mode implemented in Narrator for Windows 10 is a great attempt to simplify this interaction for the end user.


  6. Another thing I would like to throw into the mix here is just how incredibly important *I* feel it still is to retain mouse (JAWS cursor) direct access to text in a browser. I can’t tell you the number of times where achieving actionability in a webpage has simply *had* to be done by stumbling around with the mouse emulation keys and clicking on the text in question! Or, as in the case of playing book samples from for I don’t know how many years, actually clicking something like 40 pixels to the right of certain text, on a totally blank area of the screen, to activate the play button!!! Crazy but true! Though I’m aware this “screen-scraping” methodology (as I believe it’s dubbed) seems to be becoming harder and harder to achieve as security becomes tighter, I just so strongly wanted to put my hand up here to say that I believe giving us the ability to read the screen in browsers with the mouse, and thus be able to click on things, as is of course the primary method in which everything is designed to work, is something *I* think would be incredibly helpful to maintain wherever possible. Firefox seemed to break this type of access some time back, then a post came around showing how to tweak things under the hood to repair it, but subsequent updates seem to have broken this again, and I’m now unaware of a method to restore it in current Firefox versions to match what is still possible in IE11, at least under Windows 7.


    1. Sorry, but this ship has sailed for good I’m afraid. There’s no way this is coming back in Firefox, especially not under Windows 10. And as Microsoft obsoletes more of Windows 7 and 8, those hacks will be less and less likely to work anywhere.


      1. Does this just affect JAWS? I think I can manage to click on something with Windows 7 and NVDA most of the time, although sometimes I need to physically move the mouse, as the mouse position is not routed properly.


    1. There is unfortunately nothing significant to report. And as I am currently on an extended sick leave (and this is still, after all, my personal blog), I unfortunately won’t be able to provide something to that effect in the foreseeable future, either.

