[api-minor] Allow specifying custom match logic in PDFFindController #18549

nicolo-ribaudo · 2024-08-02T17:33:26Z

Allow specifying custom match logic in PDFFindController

This patch allows embedders of PDF.js to provide custom match logic for searching in PDFs. This is done by subclassing the PDFFindController class and overriding the match method.

match is called once per PDF page, receives as parameters the search query, the page contents, and the page index, and returns an array of { index, length } objects representing the search results.
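
As a rough illustration of that contract, here is a minimal sketch of a case-insensitive matcher. `BaseFindController` is a hypothetical stand-in so that the snippet runs on its own; in an actual embedding the subclass would extend `PDFFindController` instead. The class name `CaseInsensitiveFindController` is likewise only illustrative.

```javascript
// Hypothetical stand-in for PDFFindController, so that this sketch is
// self-contained. The real class does far more than this.
class BaseFindController {
  match(query, pageContent, pageIndex) {
    return [];
  }
}

class CaseInsensitiveFindController extends BaseFindController {
  // "query" is a string. Returns [{ index, length }, ...] for a single page.
  match(query, pageContent, pageIndex) {
    const matches = [];
    if (query.length === 0) {
      return matches;
    }
    // Assumption: lower-casing does not change string lengths, which holds
    // for ASCII but not for every Unicode character.
    const haystack = pageContent.toLowerCase();
    const needle = query.toLowerCase();
    let index = haystack.indexOf(needle);
    while (index !== -1) {
      matches.push({ index, length: needle.length });
      // Continue after the current match, so matches never overlap.
      index = haystack.indexOf(needle, index + needle.length);
    }
    return matches;
  }
}

const controller = new CaseInsensitiveFindController();
const results = controller.match("pdf", "PDF.js renders pdf files", 0);
```

Each returned object describes one match as an offset into the page contents plus the number of characters it covers.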

This is my proposed API for #18482. It is mostly moving code around, to carve out a (public) method with the minimum possible API that non-Firefox embedders can use to provide their own custom search. More specifically:

the logic in #calculateMatch that builds the RegExp has been moved to #calculateRegExpMatch, so that #calculateMatch is agnostic to the matching logic and only takes care of running the matcher and updating the state based on the match result
#calculateRegExpMatch has then been renamed to match(...), which subclasses can override
#calculateMatch supports calling match even when it's an async function. This does not affect PDF.js itself (since #calculateMatch() is already called in a non-awaited .then(() => ...)), but makes it possible for consumers to have async match logic.
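
The separation described in the points above can be sketched as follows. This is a heavily simplified, hypothetical controller and not the PDF.js implementation: the class name, the public `calculateMatch` method, and the property names are assumptions chosen for illustration only.

```javascript
// Hypothetical, simplified controller. It only illustrates the split between
// "matching" (overridable) and "state bookkeeping" (matcher-agnostic).
class SketchFindController {
  pageMatches = [];
  pageMatchesLength = [];
  matchesCountTotal = 0;

  constructor(pageContents) {
    this.pageContents = pageContents;
  }

  // Overridable matching logic: returns [{ index, length }, ...].
  match(query, pageContent, pageIndex) {
    if (!query) {
      return [];
    }
    // Escape RegExp metacharacters so the query is matched literally.
    const escaped = query.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const matches = [];
    for (const m of pageContent.matchAll(new RegExp(escaped, "gi"))) {
      matches.push({ index: m.index, length: m[0].length });
    }
    return matches;
  }

  // Matcher-agnostic part: run the matcher, then update the state.
  calculateMatch(query, pageIndex) {
    const result =
      this.match(query, this.pageContents[pageIndex], pageIndex) ?? [];
    this.pageMatches[pageIndex] = result.map(m => m.index);
    this.pageMatchesLength[pageIndex] = result.map(m => m.length);
    this.matchesCountTotal += result.length;
  }
}

const sketch = new SketchFindController(["foo bar Foo", "nothing here"]);
sketch.calculateMatch("foo", 0);
sketch.calculateMatch("foo", 1);
```

A subclass that overrides only `match` leaves the bookkeeping in `calculateMatch` untouched, which is the property the proposed API relies on.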

I believe that this API is minimal enough that it won't cause problems if in the future PDFFindController needs to be refactored, as @calixteman mentioned in #18482 (comment).

Some examples of how it can be used:

External search provider

import fuzzySearch from "some-fuzzy-search-library";

class FuzzyFindController extends PDFFindController {
  // "query" is a string
  match(query, text) {
    const results = fuzzySearch(query, text);
    return results.map(({ index, value }) => ({ index, length: value.length }));
  }
}

Multi-word search

This is already supported by PDF.js, but as far as I can tell it cannot be used through the Firefox UI. This example is how it would be implemented in an alternative timeline where this PR would have happened before adding support for multi-word search.

This example assumes that in that alternative universe convertToRegExpString is not private, and it accepts pageIndex instead of this._hasDiacritics[pageIndex].

class MultiWordFindController extends PDFFindController {
  // "query" is an array of strings
  match(query, text, pageIndex) {
    let isUnicode = false;
    // Words are sorted in reverse order to be sure that "foobar" is matched
    // before "foo" in case the query is "foobar foo".
    const queryStr = query
      .sort()
      .reverse()
      .map(q => {
        const [isUnicodePart, queryPart] = this.convertToRegExpString(
          q,
          pageIndex
        );
        isUnicode ||= isUnicodePart;
        return `(${queryPart})`;
      })
      .join("|");

    const flags = `g${isUnicode ? "u" : ""}${this.state.caseSensitive ? "" : "i"}`;
    query = new RegExp(queryStr, flags);

    const matches = [];
    for (const { index, 0: match } of text.matchAll(query)) {
      matches.push({ index, length: match.length });
    }
    return matches;
  }
}

Simple multi-page search

EDIT: This example does not apply anymore now that we only support sync .match. See #18549 (comment) for an async matcher example.

This implementation uses some _-prefixed properties of PDFFindController. Assuming that they are meant to be private (I can open a PR to replace _ with # if needed, after this PR lands), there is also a second implementation that only uses the real public API.

class MultiPageFindController extends PDFFindController {
  // "query" is a string
  async match(query, text, pageIndex) {
    let prefix = "", suffix = "";
    if (pageIndex > 0) {
      await this._extractTextPromises[pageIndex - 1];
      prefix = this._pageContents[pageIndex - 1].slice(1 - query.length) + " ";
    }
    if (pageIndex + 1 < this._linkService.pagesCount) {
      await this._extractTextPromises[pageIndex + 1];
      suffix = " " + this._pageContents[pageIndex + 1].slice(0, query.length - 1);
    }
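    // Remember the length of the current page before adding the context.
    const pageLength = text.length;
    text = prefix + text + suffix;

    const matches = [];
    let index = -1;
    while ((index = text.indexOf(query, index + 1)) !== -1) {
      // Clamp each match to the part that lies on the current page.
      const start = Math.max(prefix.length, index);
      const end = Math.min(index + query.length, prefix.length + pageLength);
      if (end > start) {
        matches.push({ index: start - prefix.length, length: end - start });
      }
    }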
    return matches;
  }
}

class MultiPageFindController extends PDFFindController {
  #linkService;
  #pageContents = [];
  #pageContentsPromises = [];

  constructor(opts) {
    super(opts);
    this.#linkService = opts.linkService;
  }

  async #getPageContent(index) {
    if (this.#pageContents[index] == null) {
      this.#pageContentsPromises[index] ??= Promise.withResolvers();
      await this.#pageContentsPromises[index].promise;
    }
    return this.#pageContents[index];
  }

  // "query" is a string
  async match(query, text, pageIndex) {
    if (this.#pageContents[pageIndex] == null) {
      this.#pageContents[pageIndex] = text;
      this.#pageContentsPromises[pageIndex]?.resolve();
    }

    const [prevPage, nextPage] = await Promise.all([
      pageIndex > 0 ? this.#getPageContent(pageIndex - 1) : "",
      pageIndex + 1 < this.#linkService.pagesCount ? this.#getPageContent(pageIndex + 1) : "",
    ]);
    const prefix = prevPage.slice(1 - query.length) + " ";
    const suffix = " " + nextPage.slice(0, query.length - 1);
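    // Remember the length of the current page before adding the context.
    const pageLength = text.length;
    text = prefix + text + suffix;

    const matches = [];
    let index = -1;
    while ((index = text.indexOf(query, index + 1)) !== -1) {
      // Clamp each match to the part that lies on the current page.
      const start = Math.max(prefix.length, index);
      const end = Math.min(index + query.length, prefix.length + pageLength);
      if (end > start) {
        matches.push({ index: start - prefix.length, length: end - start });
      }
    }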
    return matches;
  }
}

There are two questions on which I keep swinging back and forth:

Should the API be subclass-based, or parameter-based?
```
class CustomFindController extends PDFFindController {
  match(...) {}
}
```
vs
```
new PDFFindController({
  /* ... various other options... ,*/
  matcher(...) {}
});
```
PDFFindController already accepts multiple parameters to control its behavior, but having it as a subclass extension point leads to a cleaner implementation.
Should the entire-word check (isEntireWord) apply after running the matcher, or as part of the default matcher? Running it afterwards means that it would easily be available to custom matching logic (simply by setting entireWord: true on the dispatched "find" event), but on the other hand it feels like it's part of the match logic itself.
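
To make the "run after the matcher" option concrete, here is a minimal sketch of a post-filter applied to whatever a matcher returns. The helper names `isWholeWord` and `filterEntireWords` are hypothetical, and the word-character test is a deliberate simplification: PDF.js uses its own character classification, which this sketch does not reproduce.

```javascript
// Simplified notion of a "word character": letters, digits and underscore.
// Assumption for illustration only; this is not PDF.js's classification.
const WORD_CHAR = /[\p{L}\p{N}_]/u;

// True if the match is not directly preceded or followed by a word character.
function isWholeWord(text, index, length) {
  const before = index > 0 ? text[index - 1] : "";
  const after = index + length < text.length ? text[index + length] : "";
  return !WORD_CHAR.test(before) && !WORD_CHAR.test(after);
}

// Post-filter that works on the { index, length } results of any matcher.
function filterEntireWords(matches, text) {
  return matches.filter(({ index, length }) =>
    isWholeWord(text, index, length)
  );
}

const sampleText = "cat concat cat.";
const rawMatches = [
  { index: 0, length: 3 },
  { index: 7, length: 3 },
  { index: 11, length: 3 },
];
const wholeWordMatches = filterEntireWords(rawMatches, sampleText);
```

Because the filter only needs the page text and the `{ index, length }` pairs, it is independent of how the matches were produced, which is what would make it reusable by custom matchers.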

Please let me know what you think about this :)

PS. If this feature gets accepted and it will need any maintenance in the future feel free to ping me, similarly to how I have been keeping an eye on the various bugs related to text selection.

Snuffleupagus

Based on a very quick look, there appear to be some unrelated changes in the patch.

web/pdf_find_controller.js

nicolo-ribaudo · 2024-08-04T20:35:19Z

Thanks for the first review! I addressed all the comments except for those regarding the changes related to the new await.

The reason I added await in front of the this.match call (and thus, for making #calculateMatch async) is so that subclasses can easily use async search logic (mostly, for calling to external services) without being limited by the sync-ness of the API. This causes minimal changes to PDFFindController, since #calculateMatch was already called in a .then callback. I tried to document the need for await in the JSDoc comment of match, which lists Promise<SingleFindMatch[]> | SingleFindMatch[] as the return value.

While this await is a very-nice-to-have, it's not strictly necessary for consumers that need async search logic. Even if PDFFindController only supported sync searches, they would still be able to use async logic by triggering two separate find events with the same query (one that spawns all the search jobs and returns no matches, and then once they are done one to collect all the results) — if that await is a blocking problem for this PR and my explanation isn't convincing, I can remove it.

Snuffleupagus · 2024-08-05T07:43:36Z

> I tried to document the need for await in the JSDoc comment of match, which lists Promise<SingleFindMatch[]> | SingleFindMatch[] as the return value.

Sure, but my issue is that it's completely impossible to understand that from looking only at the code itself. Hence why it feels like increasing the maintenance burden, since you need to either remember or (somehow) figure out why the code has unneeded asynchronicity. (Unless I'm missing things, the async-support also doesn't appear to be tested...)

> While this await is a very-nice-to-have, it's not strictly necessary for consumers that need async search logic.

If we should even consider this (and that's a very big if, in my opinion), there need to be actual users wanting it; not just that it'd be theoretically nice to have.
Can we please skip the extra asynchronicity for now, and wait until an actual real-world use-case (that cannot be solved otherwise) emerges first?

Edit: In the event that my opinion is overruled by a majority wanting to keep the new async-behaviour, I'll however insist on that being properly covered by dedicated unit-tests.

nicolo-ribaudo · 2024-08-05T12:27:28Z

Well the reason I added it is that the application I'm working on would use it, it's not a theoretical use case. 😛 More specifically, we rely on an external search provider that supports searching semantically based on "similar meaning". This provider is however running asynchronously, and there is no way for me to call it synchronously.

However, as I mentioned above, I believe I can work around a sync-only API as follows:

class AsyncPDFFindController extends PDFFindController {
  #eventBus;

  #currentQuery = null;
  #pendingMatches = new Map();
  #matchResults = new Map();

  constructor(opts) {
    super(opts);
    this.#eventBus = opts.eventBus;
  }

  match(query, text, pageIndex) {
    if (this.#currentQuery !== query) {
      this.#matchResults.clear();
      this.#pendingMatches.clear();
      this.#currentQuery = query;
    }

    if (this.#matchResults.has(pageIndex)) {
      return this.#matchResults.get(pageIndex);
    }

    if (!this.#pendingMatches.has(pageIndex)) {
      this.#pendingMatches.set(
        pageIndex,
        this.matchAsync(query, text, pageIndex).then(matches => {
          if (this.#currentQuery !== query) {
            return;
          }
          this.#matchResults.set(pageIndex, matches);
          this.#pendingMatches.delete(pageIndex);
        })
      );
    }

    const { state } = this;
    if (state.type !== "custom-reloadmatches") {
      this.#pendingMatches.get(pageIndex).then(() => {
        if (this.#currentQuery !== query) {
          return;
        }
        this.#eventBus.dispatch("find", {
          ...state,
          type: "custom-reloadmatches",
        });
      });
    }

    return undefined;
  }

  async matchAsync() {
    throw new Error("Must be implemented by a sub-class");
  }
}

And then I can have my own async search provider by extending this AsyncPDFFindController class and defining a matchAsync method. It's not as nice as PDFFindController directly supporting an async match, but it should work too. For now I will remove it from this PR, and I will come back in the future if my approach ends up not working.

I agree that if we end up having async support I need to add a test for it.

timvandermeij · 2024-08-11T09:54:37Z

/botio-linux preview

moz-tools-bot · 2024-08-11T09:54:39Z

From: Bot.io (Linux m4)

Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/67527b455647076/output.txt

moz-tools-bot · 2024-08-11T09:55:48Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/67527b455647076/output.txt

Total script time: 1.14 mins

Published

Viewer: http://54.241.84.105:8877/67527b455647076/web/viewer.html
Viewer (legacy): http://54.241.84.105:8877/67527b455647076/legacy/web/viewer.html

timvandermeij · 2024-08-11T09:57:17Z

/botio unittest

moz-tools-bot · 2024-08-11T09:57:20Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/a5b9e383bb6ad15/output.txt

moz-tools-bot · 2024-08-11T09:57:20Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @timvandermeij received. Current queue size: 0

Live output at: http://54.193.163.58:8877/0932330f74df217/output.txt

moz-tools-bot · 2024-08-11T09:59:51Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/a5b9e383bb6ad15/output.txt

Total script time: 2.51 mins

Unit Tests: Passed

moz-tools-bot · 2024-08-11T10:05:14Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/0932330f74df217/output.txt

Total script time: 7.89 mins

Unit Tests: Passed

timvandermeij · 2024-08-11T10:05:31Z

/botio integrationtest

moz-tools-bot · 2024-08-11T10:05:33Z

From: Bot.io (Windows)

Received

Command cmd_integrationtest from @timvandermeij received. Current queue size: 0

Live output at: http://54.193.163.58:8877/7390ed6a542a064/output.txt

moz-tools-bot · 2024-08-11T10:05:33Z

From: Bot.io (Linux m4)

Received

Command cmd_integrationtest from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/b068649f1a0f514/output.txt

moz-tools-bot · 2024-08-11T10:14:08Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/b068649f1a0f514/output.txt

Total script time: 8.57 mins

Integration Tests: Passed

moz-tools-bot · 2024-08-11T10:23:15Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/7390ed6a542a064/output.txt

Total script time: 17.68 mins

Integration Tests: Passed

timvandermeij

Looks good to me, with one comment. Now that the asynchronous bits are removed and it's mainly moving existing code around I think that this refactoring is small and self-contained enough to be accepted.

Before we merge this, let's await a check from @Snuffleupagus too, given the previous questions about the implementation, to make sure we're all aligned. Thanks!

web/pdf_find_controller.js

Snuffleupagus

r=me, with the remaining comments addressed and passing tests; thank you.

web/pdf_find_controller.js

Snuffleupagus · 2024-08-13T07:00:05Z

> Now that the asynchronous bits are removed and it's mainly moving existing code around I think that this refactoring is small and self-contained enough to be accepted.

Agreed; since thinking more about the suggested async behaviour of the match-method, I'm less convinced that it'd have been generally safe and correct unfortunately. With that being async it'd have been possible for e.g. the active search-term to change while a previously pending (and slow) match-call resolves, in which case we'd then update various state with "old" data.
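
The hazard described here is the usual "stale result" problem. One common way to guard against it is a generation token, sketched below. This is not something the PR implements; `PageMatchStore` and its methods are hypothetical names used purely for illustration.

```javascript
// Hypothetical helper, not part of PDF.js: stores per-page results and
// rejects results that belong to an older search.
class PageMatchStore {
  #generation = 0;
  matches = new Map();

  // Call whenever the query changes; returns a token for the new search.
  newSearch() {
    this.matches.clear();
    return ++this.#generation;
  }

  // "token" is the value newSearch() returned when the work was started.
  commit(token, pageIndex, pageMatches) {
    if (token !== this.#generation) {
      // Stale: the query changed while this work was still pending.
      return false;
    }
    this.matches.set(pageIndex, pageMatches);
    return true;
  }
}

const store = new PageMatchStore();
const firstSearch = store.newSearch();
const secondSearch = store.newSearch(); // The user typed a new query.

// A slow result from the first search arrives late and is dropped.
const staleAccepted = store.commit(firstSearch, 0, [{ index: 0, length: 3 }]);
// A result from the current search is stored.
const freshAccepted = store.commit(secondSearch, 0, [{ index: 4, length: 5 }]);
```

The workaround class earlier in this thread uses the same idea, comparing the captured query against the current one before touching any state.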

This patch allows embedders of PDF.js to provide custom match logic for searching in PDFs. This is done by subclassing the PDFFindController class and overriding the `match` method. `match` is called once per PDF page, receives as parameters the search query, the page contents, and the page index, and returns an array of { index, length } objects representing the search results.

nicolo-ribaudo · 2024-08-13T08:46:30Z

Updated — thanks for the reviews!

Snuffleupagus · 2024-08-13T09:18:19Z

/botio unittest

moz-tools-bot · 2024-08-13T09:18:21Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/04a79ef1a87555b/output.txt

moz-tools-bot · 2024-08-13T09:18:21Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/95bce410b582ce2/output.txt

moz-tools-bot · 2024-08-13T09:20:57Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/04a79ef1a87555b/output.txt

Total script time: 2.59 mins

Unit Tests: Passed

moz-tools-bot · 2024-08-13T09:25:32Z

From: Bot.io (Windows)

Success

Full output at http://54.193.163.58:8877/95bce410b582ce2/output.txt

Total script time: 7.18 mins

Unit Tests: Passed

Snuffleupagus · 2024-08-13T09:26:39Z

/botio integrationtest

moz-tools-bot · 2024-08-13T09:26:41Z

From: Bot.io (Windows)

Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/10dc1652d064ee2/output.txt

moz-tools-bot · 2024-08-13T09:26:42Z

From: Bot.io (Linux m4)

Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/0baa4b24e2add7e/output.txt

moz-tools-bot · 2024-08-13T09:35:18Z

From: Bot.io (Linux m4)

Success

Full output at http://54.241.84.105:8877/0baa4b24e2add7e/output.txt

Total script time: 8.60 mins

Integration Tests: Passed

moz-tools-bot · 2024-08-13T09:44:50Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/10dc1652d064ee2/output.txt

Total script time: 18.13 mins

Integration Tests: FAILED

nicolo-ribaudo · 2024-08-13T10:15:15Z

Regarding the timeout on Windows for "must check the new alt text flow" and "New alt-text flow", this branch does not include those tests because I have not rebased it. Should I rebase? Or is it a flaky test?

Snuffleupagus · 2024-08-13T10:29:21Z

> Regarding the timeout on Windows for "must check the new alt text flow" and "New alt-text flow", this branch does not include those tests because I have not rebased it. Should I rebase? Or is it a flaky test?

I ignored that failing test here, since I don't understand how this PR could affect that one.
Given that it's a new test I'm guessing that it's got some intermittent problems.

timvandermeij · 2024-08-13T17:45:15Z

Thanks for noticing this! I have included this new one in the list of intermittents at #18396 for our overview.

/cc @calixteman in case you might have an idea what could cause this test to fail.

Snuffleupagus requested changes Aug 2, 2024

Snuffleupagus changed the title ~~Allow specifying custom match logic in PDFFindController~~ [api-minor] Allow specifying custom match logic in PDFFindController Aug 2, 2024

timvandermeij added the viewer label Aug 4, 2024

nicolo-ribaudo force-pushed the custom-find-matcher-subclass branch from 2847d84 to 42b2f48 Compare August 4, 2024 20:17

nicolo-ribaudo force-pushed the custom-find-matcher-subclass branch from 42b2f48 to 055c4d9 Compare August 5, 2024 13:18

nicolo-ribaudo mentioned this pull request Aug 8, 2024

Provide preview text of each search items through updatefindmatchescount event #16621

Closed

timvandermeij approved these changes Aug 11, 2024

timvandermeij requested a review from Snuffleupagus August 11, 2024 10:51

Snuffleupagus approved these changes Aug 13, 2024

nicolo-ribaudo force-pushed the custom-find-matcher-subclass branch from 055c4d9 to f051597 Compare August 13, 2024 08:46

Snuffleupagus linked an issue Aug 13, 2024 that may be closed by this pull request

[Feature]: Simple API for custom search for PDF.js embedders #18482

Closed

Snuffleupagus merged commit a999b34 into mozilla:master Aug 13, 2024
6 checks passed

nicolo-ribaudo deleted the custom-find-matcher-subclass branch August 13, 2024 10:23

YuvrajKaushal mentioned this pull request Sep 30, 2024

[Snyk] Upgrade pdfjs-dist from 4.4.168 to 4.6.82 YuvrajKaushal/lobe-chatSecure#5

Open
