Protect against monster sentences

@uhermjakob uhermjakob released this 20 Oct 03:57
· 59 commits to main since this release

Added protection against monster sentences that can cause uroman to exceed Python's recursion depth limit (even without an actual infinite loop). One example monster sentence had 11,051 characters, 3,368 words, and 511 symbols. The new "circuit breaker" monitors recursion depth, and uroman stops tokenizing a sentence once a limit is reached. This affects only one sentence at a time: the sentence is still output in full, but it might be only partially tokenized. There are sub-circuit breakers for (1) symbol tokenization and (2) number tokenization. An alert (to STDERR) is a preliminary warning that triggers no action; a warning (to STDERR) indicates that full tokenization of the sentence has actually been stopped. The alerts serve to collect monster sentences that can later be used to develop sentence chunking, which would avoid tokenization recursion problems in the first place.
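
The circuit-breaker idea described above can be illustrated with a minimal sketch. This is not uroman's actual code: the function name `tokenize`, the thresholds `ALERT_DEPTH` and `BREAK_DEPTH`, and the message wording are all hypothetical, and the toy tokenizer simply splits on whitespace, one token per recursive call. It only shows the general pattern of an alert without action, followed by a warning and partial tokenization.

```python
import sys

# Hypothetical thresholds, chosen for illustration only (not uroman's values).
ALERT_DEPTH = 50    # report to STDERR, but keep tokenizing
BREAK_DEPTH = 100   # report to STDERR and stop tokenizing this sentence


def tokenize(text, depth=0, state=None):
    """Toy recursive tokenizer: splits off one whitespace-delimited token per call."""
    if state is None:
        state = {"alerted": False}
    text = text.strip()
    if not text:
        return []
    if depth >= BREAK_DEPTH:
        # Circuit breaker trips: keep the rest of the sentence as one
        # untokenized chunk, so the full sentence is still output.
        print(f"Warning: recursion depth {depth} reached; "
              "sentence only partially tokenized", file=sys.stderr)
        return [text]
    if depth >= ALERT_DEPTH and not state["alerted"]:
        # Preliminary alert: no action taken, emitted once per sentence.
        state["alerted"] = True
        print(f"Alert: recursion depth {depth} reached", file=sys.stderr)
    parts = text.split(None, 1)
    rest = parts[1] if len(parts) > 1 else ""
    return [parts[0]] + tokenize(rest, depth + 1, state)
```

In this sketch a short sentence is tokenized fully, while a very long one yields `BREAK_DEPTH` tokens followed by a single chunk holding the untokenized remainder. Because the state is created per call, tripping the breaker on one sentence has no effect on the next.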