Added docs for tokenize
tabuna committed May 10, 2024
1 parent 5a89eaf commit 2456825
Showing 1 changed file (README.md) with 15 additions and 0 deletions.
@@ -54,6 +54,21 @@ items: array:2 [
*/
```

## Tokenizer

The algorithm uses a tokenizer to segment text into words. By default, it splits the text on spaces and keeps only words longer than three characters. You can also define a custom tokenizer, as in the following example:

```php
use Illuminate\Support\Str;

$classifier = new Classifier();

// Lowercase the input, extract alphabetic words, and keep
// only those longer than three characters.
$classifier->setTokenizer(function (string $string) {
    return Str::of($string)
        ->lower()
        ->matchAll('/[[:alpha:]]+/u')
        ->filter(fn (string $word) => Str::length($word) > 3);
});
```
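To see what such a tokenizer produces, the closure can be run on its own. This is a standalone sketch: the sample sentence and the composer autoload path are assumptions, and it requires `illuminate/support` to be installed.

```php
<?php

// Assumption: illuminate/support is installed via composer.
require __DIR__.'/vendor/autoload.php';

use Illuminate\Support\Str;

// The same tokenizer closure as above, kept standalone for inspection.
$tokenize = function (string $string) {
    return Str::of($string)
        ->lower()
        ->matchAll('/[[:alpha:]]+/u')
        ->filter(fn (string $word) => Str::length($word) > 3);
};

// Short words ("add", "for", "the") are dropped; longer words survive.
$tokens = $tokenize('Add docs for the tokenizer');

print_r($tokens->values()->all()); // ['docs', 'tokenizer']
```

Because `matchAll` returns a collection, the result can be chained further (counted, deduplicated, and so on) before the classifier consumes it.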

## Wrapping up

