The code in the "Scrubber" section of https://derwen.ai/docs/ptr/sample/ has a small bug: when you add a token that also exists as a single-token term in the file, like "two", the while loop consumes the whole span, and span[0] then fails on the empty span. Easy fix:
The original function (using my prefix tokens instead of the ones on the page):
from spacy.tokens import Span

def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while span[0].text in ("every", "other", "the", "two"):  # ATTN: different tokens; will fail in the original code
            span = span[1:]
        return span.text
    return scrubber_func
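To see the failure concretely, here is a minimal sketch (my own construction, not from the page) that feeds the loop a span made entirely of prefix tokens:

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab, words=["the", "two"])  # a phrase consisting only of prefix tokens
span = doc[:]
try:
    while span[0].text in ("every", "other", "the", "two"):
        span = span[1:]  # eventually leaves an empty span
except IndexError as err:
    print(f"span[0] failed on the empty span: {err}")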
just add a len(span) > 1 guard, i.e. replace
while span[0].text in ("every", "other", "the", "two"):
with
while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
to get
def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func
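For completeness, this is roughly how the fixed scrubber gets registered and wired into the pipeline, following the pattern on that page; the model name en_core_web_sm is my assumption, any English pipeline should do:

import spacy
import pytextrank  # noqa: F401 -- importing registers the "textrank" pipeline component
from spacy.tokens import Span

@spacy.registry.misc("prefix_scrubber")
def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func

nlp = spacy.load("en_core_web_sm")  # assumption: any English pipeline works here
nlp.add_pipe("textrank", config={"scrubber": {"@misc": "prefix_scrubber"}})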
Now, for the sample used on that page, I get
0.13134098, 05, sentences, [sentences, the two sentences, sentences, two sentences, the sentences]
0.07117996, 02, sentence, [every sentence, every other sentence]
and the line for "two" is still fine
0.00000000, 02, two, [two, two]
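The columns above are just the rank, count, text, and chunks of each entry in doc._.phrases; a loop like the following prints them (the exact format string is my guess at reproducing the layout shown):

doc = nlp(text)  # `text` being the sample passage from that page
for phrase in doc._.phrases:
    print(f"{phrase.rank:.8f}, {phrase.count:02d}, {phrase.text}, {list(phrase.chunks)}")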
You are welcome to use the token list I used, ("every", "other", "the", "two"); it gives even more merged results than the example on the page.