The code in the "Scrubber" section of https://derwen.ai/docs/ptr/sample/ has a small bug: when you add a token that also exists as a single-token term in the file, like "two", the while loop consumes the whole span, and span[0] then fails on the empty span. Easy fix:
The original function (using my prefix tokens instead of the ones on the page):
from spacy.tokens import Span

def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while span[0].text in ("every", "other", "the", "two"):  # ATTN: different tokens; will fail in the original code
            span = span[1:]
        return span.text
    return scrubber_func
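To see the failure concretely, here is a minimal sketch (my own construction, not from the page) that feeds the loop a span made entirely of prefix tokens:

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["the", "two"])  # a phrase consisting only of prefix tokens
span = doc[:]
try:
    while span[0].text in ("every", "other", "the", "two"):
        span = span[1:]  # eventually leaves an empty span
except IndexError as err:
    print(f"span[0] failed on the empty span: {err}")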
just add a len(span) > 1 guard, i.e. replace
while span[0].text in ("every", "other", "the", "two"):
with
while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
to get
def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func
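For completeness, this is roughly how the fixed scrubber gets registered and wired into the pipeline, following the pattern on that page; the model name en_core_web_sm is my assumption, any English pipeline should do:

import spacy
import pytextrank  # noqa: F401 -- importing registers the "textrank" pipeline component
from spacy.tokens import Span

@spacy.registry.misc("prefix_scrubber")
def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func

nlp = spacy.load("en_core_web_sm")  # assumption: any English pipeline works here
nlp.add_pipe("textrank", config={"scrubber": {"@misc": "prefix_scrubber"}})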
Now, for the sample used on that page, I get
0.13134098, 05, sentences, [sentences, the two sentences, sentences, two sentences, the sentences]
0.07117996, 02, sentence, [every sentence, every other sentence]
and the line for "two" is still fine
0.00000000, 02, two, [two, two]
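The columns above are just the rank, count, text, and chunks of each entry in doc._.phrases; a loop like the following prints them (the exact format string is my guess at reproducing the layout shown):

doc = nlp(text)  # `text` being the sample passage from that page
for phrase in doc._.phrases:
    print(f"{phrase.rank:.8f}, {phrase.count:02d}, {phrase.text}, {list(phrase.chunks)}")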
You are welcome to use the token list I used, ("every", "other", "the", "two"); it gives even more merged results than the example on the page.