
Text Citations and footnotes using ReadAloud can be disruptive to comprehension #72

Open
GeorgeKerscher opened this issue Feb 13, 2024 · 18 comments

Comments

@GeorgeKerscher

Description

When using the ReadAloud function in a Reading System, or when a screen reader is being used, text citations can be disruptive to reading comprehension. The same disruption occurs if a footnote is read where it occurs. The concept of skippability and escapability has been discussed for SMIL and media overlays, but it has not yet been addressed for ReadAloud or screen readers.

This feature request originated in the EPUB Reading Systems accessibility testing, but it is not accessibility specific. We are requesting that the Publishing Community Group take up this issue. It relates to best practices for markup and having the feature in Reading Systems and with screen readers.

@rickj

rickj commented Feb 13, 2024

Here is what we do (user selection on specifics):

[Screenshot: the user-facing settings for selecting which content types to skip]

EPUB Text Finder
This section describes the algorithm that the TTS engine uses to find the text to read, and how it breaks it up into paragraphs. (In the code they’re called “Utterances” - terminology borrowed from Apple’s speech APIs.)

The contents of the following tags are always ignored regardless of the filter settings:
script, noscript, style, object, noframes

The following list of tags are considered to be paragraph separators:
br, hr, tr, td, p, blockquote, title, ul, ol, li, table, pre, div, h1, h2, h3, h4, h5, h6, article, section, figcaption, figure, dl, dt, dd, aside, address, header, nav, footer, hgroup, caption

In addition to locating the text, filtering types can be added to each utterance. The application can filter on any of these flags (for example, exclude figures and altText). Some of these filters are very simple and are applied to certain tags (for example, figures). Alt text is more complex, as the alt text utterances are extracted from a variety of tags and attributes. The accessibility text for math is similarly complex - we attempt to extract the accessibility description of the math and read that if possible (tagged with the math filter).
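
A minimal sketch of this walk in TypeScript, assuming a browser DOM; the names (extractUtterances, Utterance) and the filter plumbing are illustrative, not the actual implementation:

```typescript
// Minimal sketch of the text-finder walk described above. The tag sets are
// taken from the lists in this comment; everything else is illustrative.
const IGNORED = new Set(["SCRIPT", "NOSCRIPT", "STYLE", "OBJECT", "NOFRAMES"]);
const SEPARATORS = new Set([
  "BR", "HR", "TR", "TD", "P", "BLOCKQUOTE", "TITLE", "UL", "OL", "LI",
  "TABLE", "PRE", "DIV", "H1", "H2", "H3", "H4", "H5", "H6", "ARTICLE",
  "SECTION", "FIGCAPTION", "FIGURE", "DL", "DT", "DD", "ASIDE", "ADDRESS",
  "HEADER", "NAV", "FOOTER", "HGROUP", "CAPTION",
]);

interface Utterance {
  text: string;
  filters: Set<string>; // e.g. "figure", "table", "citation", "altText", "math"
}

function extractUtterances(root: Element): Utterance[] {
  const utterances: Utterance[] = [];
  let buffer = "";
  let bufferFilters = new Set<string>();

  const flush = () => {
    const text = buffer.trim();
    if (text) utterances.push({ text, filters: bufferFilters });
    buffer = "";
    bufferFilters = new Set();
  };

  const walk = (node: Node, filters: Set<string>) => {
    if (node.nodeType === Node.TEXT_NODE) {
      buffer += node.textContent ?? "";
      filters.forEach((f) => bufferFilters.add(f));
      return;
    }
    if (!(node instanceof Element) || IGNORED.has(node.tagName)) return;

    // Container-based filter flags (figure/table/citation) would be added to
    // a copy of `filters` here; see the per-filter sketches below.
    const isSeparator = SEPARATORS.has(node.tagName);
    if (isSeparator) flush(); // close the utterance before a block boundary
    node.childNodes.forEach((child) => walk(child, filters));
    if (isSeparator) flush(); // and again after it
  };

  walk(root, new Set());
  flush(); // emit any trailing text
  return utterances;
}
```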

The ‘altText’ Filter
This filter reads (or excludes) image alt text. Alt text is extracted from the sources below; a short sketch follows the list.

  1. The alt attribute of an img element.
  2. The aria-label attribute of any element where role == “img”.
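
A one-function sketch of those two rules (the function name is hypothetical):

```typescript
// Illustrative sketch of the two alt-text sources listed above.
function altTextFor(el: Element): string | null {
  if (el.tagName === "IMG") return el.getAttribute("alt");
  if (el.getAttribute("role") === "img") return el.getAttribute("aria-label");
  return null;
}
```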

The ‘figure’ Filter
A paragraph has the figure filter if it is contained in an element where any one of these is true:

  1. The element is figure
  2. The epub:type attribute is figure

The ‘table’ Filter
A paragraph has the table filter if it is contained in an element where any one of these is true:

  1. The element is table
  2. The epub:type attribute is table

The ‘citation’ Filter
A paragraph has the citation filter if it is contained in an element where the epub:type is one of the following (a combined sketch of these container filters follows the list):

  • endnote
  • endnotes
  • footnote
  • footnotes
  • bibliography
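
The figure, table, and citation filters share the same container test; here is a combined sketch, assuming epub:type is reachable via getAttribute (it is for HTML-parsed content) and ignoring multi-valued epub:type attributes:

```typescript
// Illustrative sketch: walk the ancestor chain and collect the container
// filters described above. Splitting a multi-valued epub:type on whitespace
// is omitted for brevity.
const CITATION_TYPES = new Set([
  "endnote", "endnotes", "footnote", "footnotes", "bibliography",
]);

function containerFilters(el: Element): Set<string> {
  const filters = new Set<string>();
  for (let node: Element | null = el; node !== null; node = node.parentElement) {
    const epubType = node.getAttribute("epub:type") ?? "";
    if (node.tagName === "FIGURE" || epubType === "figure") filters.add("figure");
    if (node.tagName === "TABLE" || epubType === "table") filters.add("table");
    if (CITATION_TYPES.has(epubType)) filters.add("citation");
  }
  return filters;
}
```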

The ‘uriLink’ Filter
The purpose of this filter is to edit out links where the text of the link is the linked URL. (As opposed to links where the text of the link is just text - these we always still want read; it’s just a part of the text with a link applied.)

To detect this, we look for a elements and extract both the element’s text and the href attribute; a sketch follows the steps below.

  1. If the href begins with “mailto:”, then it is an email link. Compare the rest of the href (after removing “mailto:”) with the text; if they match, apply the uriLink filter.
  2. If the href begins with “http:” or “https:”, then:
  • Trim the link text, removing all leading and trailing whitespace.
  • See if the resulting string has at least one internal period and no internal whitespace. If so, it looks URL-ish. Then see if this string is a substring of the href; if it is, apply the uriLink filter.
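
A sketch of that heuristic (hypothetical function name; the internal-period test is approximated with a regex):

```typescript
// Illustrative sketch of the uriLink detection steps above.
function isUriLink(a: HTMLAnchorElement): boolean {
  const href = a.getAttribute("href") ?? "";
  const text = (a.textContent ?? "").trim(); // trimmed link text

  // 1. mailto links: filter when the text repeats the address.
  if (href.startsWith("mailto:")) {
    return href.slice("mailto:".length) === text;
  }

  // 2. http(s) links: filter when the text looks URL-ish (at least one
  //    internal period, no internal whitespace) and is a substring of href.
  if (href.startsWith("http:") || href.startsWith("https:")) {
    const urlish = /.+\..+/.test(text) && !/\s/.test(text);
    return urlish && href.includes(text);
  }
  return false;
}
```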

The ‘math’ Filter
Math is one of the trickiest filters, because we extract the accessibility description from several different places. In place of the math element, we include the accessibility description and apply the “math” filter to it, so it can be excluded or included. A condensed sketch follows the list below.

  1. If the element name is math and it contains an attribute named alttext (the MathML standard attribute for math alternative text), then the utterance returned is the alttext value and it is marked with the math filter.
  2. If the element name is img and it has a role attribute where role == “math” and the img contains an alt attribute, then the utterance returned is the alt attribute value, and it is marked with the math filter.
  3. If the element name is NOT img and it has a role attribute where role == “math” and the element contains an aria-label attribute, then the utterance returned is the aria-label value, and it is marked with the math filter.
  4. (Note: This is specific to our mathjax preprocessor code, when using chtml output): If the element is a span or div and its class contains “mjx-chtml”, then look for an aria-label on the parent of the element. The utterance returned is the aria-label value, and it is marked with the math filter.
  5. (Note: This is specific to our mathjax preprocessor code, when using svg output): If the element is a span or div and its class contains “vst-math-wrapper” - we look for a child of the element with the svg tag and look for the svg’s aria-label attribute. The utterance returned is the aria-label value, and it is marked with the math filter.
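
The five cases condensed into a sketch (the mjx-chtml and vst-math-wrapper branches are specific to the MathJax preprocessing described above; the function name is hypothetical):

```typescript
// Illustrative sketch of the math accessibility-text lookup above.
function mathUtterance(el: Element): string | null {
  const tag = el.tagName.toLowerCase();

  // 1. <math alttext="...">, MathML's standard alternative-text attribute.
  if (tag === "math") return el.getAttribute("alttext");

  // 2./3. role == "math": alt on img, aria-label on anything else.
  if (el.getAttribute("role") === "math") {
    return tag === "img" ? el.getAttribute("alt") : el.getAttribute("aria-label");
  }

  if (tag === "span" || tag === "div") {
    // 4. MathJax CHTML output: the aria-label lives on the parent element.
    if (el.classList.contains("mjx-chtml")) {
      return el.parentElement?.getAttribute("aria-label") ?? null;
    }
    // 5. MathJax SVG output: the aria-label lives on the svg child.
    if (el.classList.contains("vst-math-wrapper")) {
      return el.querySelector("svg")?.getAttribute("aria-label") ?? null;
    }
  }
  return null;
}
```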

PDF Filtering
PDF has similar logic to the above when the PDF is a tagged PDF. Paragraph separators are described by the standard set of block-level PDF tags.
The tags that define the PDF filters are described as follows:

The PDF ‘altText’ Filter
This filter detects if the PDF object is an image that has alt text defined. If so, the utterance is the alt text.

The PDF ‘figure’ Filter
This filter detects if an utterance is contained in the Figure tag.

The PDF ‘table’ Filter
This filter detects if an utterance is contained in the Table tag.

The PDF ‘caption’ Filter
This filter detects if an utterance is contained in the Caption tag.

The PDF ‘math’ Filter
This filter detects if an utterance is contained in the Formula tag.

The PDF ‘citation’ Filter
This filter detects if an utterance is contained in the Note or FENote tag.

@wareid
Contributor

wareid commented Feb 15, 2024

This is directly related to what is proposed in #69 and I wonder if we should just combine these two issues.

With "read aloud" are we referring to text-to-speech, screen reader output, media overlays, or a combination of the three?

@rickj

rickj commented Feb 15, 2024

With "read aloud" are we referring to text-to-speech, screen reader output, media overlays, or a combination of the three?

Those are three different domains that cannot share a common solution:

  1. Media Overlays - dependent on the intersection of a properly marked up EPUB and a Reading System that supports media overlays
  2. Screen Reader Output - Assistive technology is a 'black box', outside of the control of a reading system. Changes here would need to be targeted differently
  3. TTS (Read Aloud) - This is inside the control of a reading system... and we should come up with a recommended approach (like the above!)

@GeorgeKerscher
Author

I believe that for the RS to give the reader the option to avoid unwanted spoken information, the content would need to be marked up. In particular, where the author is referencing a formal citation, if this were marked up, the RS ReadAloud function could give the option to skip it.

The reading of footnotes could also be made skippable as an option. Here, doc-footnote could be used to identify them.

I believe this skipping could also be implemented by screen readers.

In the case of SMIL, if the content was marked up, then this could be identified in the SMIL markup.

@GeorgeKerscher
Author

Yes, this issue is directly related to #69, but it is much simpler to implement.

If we create a best practice for marking citations, I think there is enough general markup to resolve this issue.

Reading systems could simply add the option of what to skip in their ReadAloud.

For example, in ReadAloud skip:
citations
footnotes
alt text
page numbers

These could be toggled. In textbooks I would want page numbers read, but in a novel it would be disruptive. People should be able to choose.

@wareid
Contributor

wareid commented Feb 20, 2024

With "read aloud" are we referring to text-to-speech, screen reader output, media overlays, or a combination of the three?

Those are three different domains that cannot share a common solution:

  1. Media Overlays - dependent on the intersection of a properly marked up EPUB and a Reading System that supports media overlays
  2. Screen Reader Output - Assistive technology is a 'black box', outside of the control of a reading system. Changes here would need to be targeted differently
  3. TTS (Read Aloud) - This is inside the control of a reading system... and we should come up with a recommended approach (like the above!)

So the problem I'm having here is that I have never seen TTS referred to as read aloud. I've actually seen this terminology from publishers in the context of media overlays, in places like the description of the book, or in the context of a specific learning style (there is a lot of content out there on "Read Aloud" practices that includes media overlays or teaches parents how to read aloud).

In user discussions, we have also only heard users refer to either TTS or "reader mode", not "read aloud". I want to make sure we're using accurate and precise language, and unify on it, so we avoid confusion on both the publisher side (where I currently see a lot of confusion between the methods), and the user side.

@sueneu

sueneu commented Feb 20, 2024

Publishers that I work with use "Read Aloud" to mean text-to-speech. Perhaps we define our terms in any resulting spec.

Media Overlays: Audio or video files embedded in an ebook
Text-to-Speech (TTS): Audio generated from text by an ebook reading system.
Screen Reader Output: Audio generated from text by user-selected assistive technology that is separate from the ebook reading system.

@clapierre
Contributor

Hi @sueneu, in the definitions you provided, I have heard folks use "Read Aloud" in place of what you have defined as "Text-to-Speech (TTS)", because in reading systems the button you press is "Read Aloud", not "TTS".

@sueneu

sueneu commented Feb 20, 2024

@clapierre well, that explains some of the confusion!

I've been told by developers that "Read Aloud" refers only to synchronized media overlays. So there is some variation within the industry. For that reason, we should be careful to define terms in documentation.

Could we define "Read Aloud" as any audio expression of the text, no matter what technology (i.e., TTS, media overlays) is used? You could easily make the argument that the user needn't be aware of how the audio is generated.

Documentation for publishers, producers, and reading systems could further define the underlying tech.

@mattgarrish
Member

You might want to refer to the guide @GeorgeKerscher wrote: https://www.w3.org/publishing/a11y/audio-playback/

It gets into the confusion around the Read Aloud v. Read Now naming.

@wareid
Contributor

wareid commented Feb 26, 2024

But @mattgarrish, the document you're referring to explicitly distinguishes between media overlays "Read Aloud" and TTS as separate features.

There's a big gulf between the two features, and in many cases completely different sub-features between the two. Most SMIL implementations don't allow you to adjust reading speed, for instance, and SMIL allows the publisher to customize text highlighting, but TTS implementations do not. Not to mention the different audio: SMIL is most often a human narrator, whereas TTS is computer generated. I think it's really important to be clear about what the user is going to experience, especially in cases where both options might be available for a title.

EDIT: I also think it's important to point out that the two features have completely different origins, one is publisher-driven and provided, the other is reading-system driven.

@mattgarrish
Member

the document you're referring to explicitly distinguishes between media overlays "Read Aloud" and TTS as separate features

It's defining "full audio" publications as those that use media overlays and TTS for the reading system/AT-generated playback, regardless of what names are assigned to those features in different reading systems. When you start using generic names like "read aloud" it means different things to different people. I'm only pointing it out as a means of standardizing the language used to talk about the issue.

@HadrienGardeur

From a reading app perspective, I'm not sure that there's always a need to identify TTS and media overlay as two completely different affordances.

Framing this as a User Story: "As a user, I would like to listen to an ebook and have sufficient control over that experience".

The following preferences/features can apply to both of them; a shared-interface sketch follows the list:

  • play/pause/stop
  • skip to next/previous utterances
  • highlight colour
  • speed
  • continuous playback (this mostly applies to FXL content, where you might want to automatically pause the playback until the reader moves forward to the next page/spread)
  • skippability could apply to both, as Media Overlays/SMIL also provide semantic information that can be used to skip specific utterances
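
Sketched as a single shared surface, with hypothetical names rather than any existing API:

```typescript
// Illustrative only: one controller shape that could sit in front of either
// a TTS pipeline or a media-overlay (SMIL) player.
interface ListenPreferences {
  rate: number;                // playback speed multiplier
  highlightColor: string;      // overrides any authored highlight style
  continuousPlayback: boolean; // auto-advance across FXL pages/spreads
  skip: Set<"citation" | "footnote" | "altText" | "pageNumber">;
}

interface ListenController {
  play(): void;
  pause(): void;
  stop(): void;
  next(): void;     // skip to the next utterance
  previous(): void; // skip to the previous utterance
  apply(prefs: ListenPreferences): void;
}
```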

While Media Overlays can come with their own CSS class for highlighting, this authored preference could prove problematic for some users, and it makes sense to always offer the ability to customize things. I would need to double check, but as far as I can remember this is also optional in EPUB, which means that reading systems need a way to handle highlighting if it isn't authored in the file anyway.

For reading speed, it's well known by now that many users want the ability to tweak things to their own liking. This goes beyond ebooks/audiobooks, since podcast and video apps often offer this option as well (there are many people watching anime at a higher speed for example).

I believe that this eventually comes down to two key differences:

  • Media Overlay may provide a higher quality audio experience, if it's recorded by a real human narrator (TTS could also be used to mass produce such files)
  • and the way content is broken down into utterances (more control from the reading system with TTS)

As TTS becomes better and better, I believe that the barrier between the two of them will continue to break down. Just earlier this week, I read an article about Storytel providing TTS as an alternative option in a number of audiobooks that they provide: https://www.boktugg.se/2024/02/27/rostbytaren-storytel-lanserare-voice-switcher-pa-svenska/

The key argument being: "A whopping 89% of Storytel's listeners have at some point finished a book, not because the book was bad, but because the voice didn't suit them".

@dalerrogers

dalerrogers commented Mar 1, 2024 via email

@HadrienGardeur

@dalerrogers you're right that this can be entirely done in JS and in fact that's what a number of reading apps do (mostly the ones that are Web Apps since there are better native options available for dividing text into utterances and then reading these utterances using a TTS engine).

In such cases, the JS handling all of that is served by the reading app though, not the publication. That's consistent with using Edge's TTS feature, which works on every website that you visit. I think Chrome has something similar in testing as well.

The main issue when implementing TTS with Web technologies right now is mostly related to inconsistencies across browser implementations of lower-level APIs.
For example, Intl.Segmenter is very useful for dividing text into utterances, but it's not available on Firefox yet (it's in nightly builds, which is good news though). Getting a boundary event from SpeechSynthesisUtterance is very useful for following the progression of how an utterance is being read, but support is also lacking and/or inconsistent across the board.
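
For illustration, a minimal sketch combining those two APIs, subject to the same support caveats:

```typescript
// Split text into sentence-sized utterances with Intl.Segmenter, speak them
// with the Web Speech API, and track progress via boundary events.
function speakParagraph(text: string, lang = "en"): void {
  const segmenter = new Intl.Segmenter(lang, { granularity: "sentence" });
  for (const { segment } of segmenter.segment(text)) {
    const utterance = new SpeechSynthesisUtterance(segment);
    utterance.lang = lang;
    // Fires as the engine reaches word/sentence boundaries, where supported.
    utterance.onboundary = (e) => {
      console.log(`boundary at char ${e.charIndex} of "${segment.trim()}"`);
    };
    speechSynthesis.speak(utterance); // utterances queue and play in order
  }
}
```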

