How to highlight sentence over multiple lines? #614

Closed
3 tasks done
thiborose opened this issue Jul 23, 2020 · 12 comments
Assignees
Labels
question Further information is requested stale

Comments

@thiborose

thiborose commented Jul 23, 2020

Before you start - checklist

  • I have checked sample and test suites to see real life basic implementation
  • I have read documentation in README
  • I have checked if this question is not already asked

What are you trying to achieve? Please describe.

I would like to highlight patterns which are spread over multiple lines.

If I try to highlight a sentence which is broken by a line break, nothing will be highlighted since each line belongs to its own tag.

Describe solutions you've tried

I thought of looking for the rest of the sentence in the following span in the DOM, but this solution seems really laborious.

@wojtekmaj wojtekmaj added the question Further information is requested label Jul 27, 2020
@wojtekmaj
Owner

wojtekmaj commented Jul 27, 2020

Hah, that's a good one!

For this to work you need to:

  • Get all page text items, which you can do in <Page />'s onLoadSuccess callback
  • Implement customTextRenderer to hook into text rendering mechanism
    • Try finding full match in current text item.
      • If found, use highlightPattern to highlight the match.
      • If not, try finding a full match in the current text item combined with its n previous/next neighbours (in my case, n was 1).
        • If found, find out what part of this highlight, if any, is in current text item. Use highlightPattern to highlight the partial match.
        • If not, return text item untouched, nothing to do here.

And this is a simplified version, since it only supports single matches! For multiple matches, you will need to make it even more complicated. Anyway, in code it would look like so:

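import React, { useCallback, useState } from 'react';
import { Document, Page } from 'react-pdf';

// highlightPattern(text, pattern) and samplePDF are assumed to be defined elsewhere

const stringToHighlight = 'Donec sodales placerat dui';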

// You might want to merge the items a little smarter than that
function getTextItemWithNeighbors(textItems, itemIndex, span = 1) {
  return textItems.slice(
    Math.max(0, itemIndex - span), 
    itemIndex + 1 + span
  )
    .filter(Boolean)
    .map(item => item.str)
    .join('');
}

function getIndexRange(string, substring) {
  const indexStart = string.indexOf(substring);
  const indexEnd = indexStart + substring.length;

  return [indexStart, indexEnd];
}

function Test() {
  const [textItems, setTextItems] = useState();

  const onPageLoadSuccess = useCallback(async page => {
    const textContent = await page.getTextContent();
    setTextItems(textContent.items);
  }, []);

  const customTextRenderer = useCallback(textItem => {
    if (!textItems) {
      // Text items not loaded yet; render the text unchanged
      return textItem.str;
    }

    const { itemIndex } = textItem;

    const matchInTextItem = textItem.str.match(stringToHighlight);

    if (matchInTextItem) {
      // Found full match within current item, no need for black magic
      return highlightPattern(textItem.str, stringToHighlight);
    }

    // Full match within current item not found, let's check if we can find it
    // spanned across multiple lines

    // Get text item with neighbors
    const textItemWithNeighbors = getTextItemWithNeighbors(textItems, itemIndex);

    const matchInTextItemWithNeighbors = textItemWithNeighbors.match(stringToHighlight);

    if (!matchInTextItemWithNeighbors) {
      // No match
      return textItem.str;
    }

    // Now we need to figure out if the match we found was at least partially
    // in the line we're currently rendering
    const [matchIndexStart, matchIndexEnd] = getIndexRange(textItemWithNeighbors, stringToHighlight);
    const [textItemIndexStart, textItemIndexEnd] = getIndexRange(textItemWithNeighbors, textItem.str);

    if (
      // Match ends before the current line starts (end indices are exclusive)
      matchIndexEnd <= textItemIndexStart ||
      // Match starts after the current line ends
      matchIndexStart >= textItemIndexEnd
    ) {
      return textItem.str;
    }

    // The match is partially in the line we're currently rendering. Now
    // we need to figure out what "partially" means exactly

    // Find partial match in a line
    const indexOfCurrentTextItemInMergedLines = textItemWithNeighbors.indexOf(textItem.str);

    const matchIndexStartInTextItem = Math.max(0, matchIndexStart - indexOfCurrentTextItemInMergedLines);
    const matchIndexEndInTextItem = matchIndexEnd - indexOfCurrentTextItemInMergedLines;

    const partialStringToHighlight = textItem.str.slice(matchIndexStartInTextItem, matchIndexEndInTextItem);

    return highlightPattern(textItem.str, partialStringToHighlight);
  }, [stringToHighlight, textItems]);

  return (
    <Document file={samplePDF}>
      <Page
        customTextRenderer={customTextRenderer}
        onLoadSuccess={onPageLoadSuccess}
        pageNumber={1}
      />
    </Document>
  );
}

CodeSandbox working demo

Yeah, I hate it too.
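
One caveat about the snippet above: `String.prototype.match` converts a string argument into a regular expression, so a search string containing characters such as `(`, `.` or `?` may fail to match, or match something else. Below is a minimal sketch of two ways around this, assuming a literal (non-regex) search is what you want. `escapeRegExp` and `containsLiteral` are hypothetical helper names, not part of react-pdf.

```javascript
// Hypothetical helper: escape regex metacharacters so the string can be
// used safely inside a RegExp.
function escapeRegExp(string) {
  return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: literal containment check, no regular expression involved.
function containsLiteral(text, needle) {
  return text.includes(needle);
}

const text = 'Total (USD) due';
const needle = 'Total (USD)';

// The string is interpreted as /Total (USD)/, where the parentheses form a group
const naiveMatch = text.match(needle);

// Escaping first makes the parentheses match literally
const escapedMatch = text.match(new RegExp(escapeRegExp(needle)));
```

If only a yes/no answer is needed, as in the `if (matchInTextItem)` checks, `includes` is probably the simpler choice.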

@wojtekmaj wojtekmaj self-assigned this Jul 27, 2020
@wojtekmaj wojtekmaj changed the title Problem highlighting sentence over multiple lines How to highlight sentence over multiple lines? Jul 27, 2020
@thiborose
Author

thiborose commented Jul 30, 2020

Thank you for the algo and the piece of code.

However I noticed that depending on how the pdf is rendered, it may not work.

Do you know why in some PDFs each line is wrapped in a <span> tag, while in others each token is wrapped? In the second case, it's hard to make the algorithm work.

@wojtekmaj
Owner

However I noticed that depending on how the pdf is rendered, it may not work.

Absolutely!

Things to consider:

  1. getTextItemWithNeighbors might need to "grab" more neighbors if the text to highlight is particularly large or the text nodes in PDFs are particularly small
  2. getTextItemWithNeighbors is a very naive implementation, e.g. if text nodes are "hello" and "world" it'll simply return "helloworld". You may consider .trim()ming the text nodes and adding spaces by .join(' ') instead of .join('') to make it a little smarter, but in general, it's a separate programming issue and I think you can handle this ;)
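
A minimal sketch of the second point, assuming text items are objects with a `str` property as in the snippet earlier in the thread. `mergeWithNeighbors` is a hypothetical replacement for getTextItemWithNeighbors: it trims each node, joins the nodes with a single space, and also records where the current item sits in the merged string. That avoids a later `indexOf` lookup, which could pick the wrong occurrence when neighbouring items contain the same text. Note that the offsets refer to the trimmed text of the current item.

```javascript
// Hypothetical helper, not part of react-pdf.
function mergeWithNeighbors(textItems, itemIndex, span = 1, separator = ' ') {
  const start = Math.max(0, itemIndex - span);
  const end = Math.min(textItems.length, itemIndex + 1 + span);

  let merged = '';
  let currentStart = -1;
  let currentEnd = -1;

  for (let i = start; i < end; i += 1) {
    const item = textItems[i];
    const str = item && typeof item.str === 'string' ? item.str.trim() : '';

    if (str === '') {
      // Empty items contribute nothing; still record the position of the current one
      if (i === itemIndex) {
        currentStart = merged.length;
        currentEnd = merged.length;
      }
      continue;
    }

    if (merged !== '') {
      merged += separator;
    }

    if (i === itemIndex) {
      currentStart = merged.length;
      currentEnd = currentStart + str.length;
    }

    merged += str;
  }

  // currentEnd is exclusive, like the second argument of String.prototype.slice
  return { merged, currentStart, currentEnd };
}

const exampleItems = [{ str: 'Donec sodales ' }, { str: ' placerat dui' }];
const mergedExample = mergeWithNeighbors(exampleItems, 1);
```

Whether a space is the right separator depends on the PDF: when every token is its own text node it probably is, but when a word is split across nodes an empty separator would be needed.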

@pedro-surf

Tried to implement this and no matter what I do it leads to an infinite re-render loop. Please consider having a look

@wojtekmaj
Owner

@pedro-surf You have a working example in my comment above, so you need to either share the full code with us or find the differences yourself. Perhaps you're creating your custom text renderer on every render because you forgot to use useCallback? Just a blind guess though.

@github-actions
Contributor

github-actions bot commented Oct 4, 2021

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 14 days.

@github-actions github-actions bot added the stale label Oct 4, 2021
@github-actions
Contributor

This issue was closed because it has been stalled for 14 days with no activity.

@githubdebo

The CodeSandbox example link is not opening.

@thijssdaniels

Hello,

Thank you for providing the code for highlighting text spread over multiple lines! Works great.

I have another question. When working with PDFs that have multiple pages, what would that look like? Just implementing the code above doesn't seem to work.

Thanks

@nickjohnsonucsb

This would probably be more efficient if instead of trying to use nearest-neighbor for every textItem, we could just iterate through the textItems list and concatenate strings line by line to search for the stringToHighlight. Is there any alternative we have other than using customTextRenderer for this?

I see there is a PR for this, but how can I use that PR?
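
A minimal sketch of the concatenate-once idea described above; it is not taken from the PR mentioned. `findHighlightRanges` is a hypothetical helper that assumes text items are objects with a `str` property. It joins all items of a page once, finds every occurrence of the search string, and returns, per item, the ranges that fall inside that item. Something like customTextRenderer would still be needed to apply the ranges, for example by looking up `rangesPerItem[itemIndex]`.

```javascript
// Hypothetical helper, not part of react-pdf.
// Returns one array per text item containing [start, end) ranges, relative to
// that item's own string, that belong to a match of `needle`.
function findHighlightRanges(textItems, needle, separator = '') {
  const offsets = [];
  let fullText = '';

  // Concatenate once, remembering where each item starts in the full text
  textItems.forEach((item, index) => {
    if (index > 0) {
      fullText += separator;
    }
    offsets.push(fullText.length);
    fullText += item.str;
  });

  const rangesPerItem = textItems.map(() => []);

  if (needle === '') {
    return rangesPerItem;
  }

  let from = fullText.indexOf(needle);

  while (from !== -1) {
    const to = from + needle.length;

    // Intersect the match with every item; keep non-empty intersections
    textItems.forEach((item, index) => {
      const itemStart = offsets[index];
      const itemEnd = itemStart + item.str.length;
      const start = Math.max(from, itemStart);
      const end = Math.min(to, itemEnd);

      if (start < end) {
        rangesPerItem[index].push([start - itemStart, end - itemStart]);
      }
    });

    // Continue after this match, so overlapping matches are not reported
    from = fullText.indexOf(needle, to);
  }

  return rangesPerItem;
}

const pageItems = [{ str: 'Lorem Donec sodales ' }, { str: 'placerat dui ipsum' }];
const rangesPerItem = findHighlightRanges(pageItems, 'Donec sodales placerat dui');
```

Because the ranges depend only on the text items and the search string, they could be computed once when the page text loads rather than inside every renderer call. The inner loop over all items is kept for brevity; a binary search over `offsets` would avoid it on large pages.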

@goughjo02

goughjo02 commented Apr 18, 2024

I am just running into this. There is a transform matrix on the items. It looks like it might be possible to get the bounding box of each item: mozilla/pdf.js#5643 (comment)
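
A rough sketch of that idea, under several assumptions: that each text item exposes `transform` as a six-element matrix whose last two entries are the translation, together with `width` and `height`; that the text is neither rotated nor skewed; and that coordinates are in PDF units with the origin at the bottom-left. `approximateBoundingBox` and `isOnSameLine` are hypothetical helper names, and the box ignores font descent, so treat it as an approximation.

```javascript
// Hypothetical helper: approximate box of an unrotated text item.
function approximateBoundingBox(item) {
  const x = item.transform[4]; // horizontal translation
  const y = item.transform[5]; // vertical translation (roughly the baseline)

  return {
    left: x,
    bottom: y,
    right: x + item.width,
    top: y + item.height,
  };
}

// Hypothetical helper: treat two items as being on the same line when their
// vertical positions differ by no more than `tolerance` units.
function isOnSameLine(itemA, itemB, tolerance = 1) {
  return Math.abs(itemA.transform[5] - itemB.transform[5]) <= tolerance;
}

const sampleItem = { str: 'Donec sodales', transform: [12, 0, 0, 12, 72, 700], width: 100, height: 12 };
const nextLineItem = { str: 'placerat dui', transform: [12, 0, 0, 12, 72, 686], width: 90, height: 12 };

const box = approximateBoundingBox(sampleItem);
const sameLine = isOnSameLine(sampleItem, nextLineItem);
```

Knowing which items share a line could help decide whether to join neighbouring items with a space or treat them as a line break.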

@overmode

overmode commented Jun 6, 2024

If you just want to highlight text, https://markjs.io/ worked out of the box for me

```js
instance.mark(searchText, {
  separateWordSearch: false,
  acrossElements: true,
  // If you want to be overly robust
  ignorePunctuation: " \n :;.,-–—‒_(){}[]!'\"+=".split(""),
});
```
