Query with asymmetric matching for accented characters #3882

Closed
GrahamCobb opened this issue Mar 13, 2021 · 7 comments
Labels: feature (features we would like to implement)

Comments

@GrahamCobb
Contributor

Use case

OK - let me admit it before I start... I am a dumb native English speaker and I don't like letters with accents and don't even know how to enter most of them on my keyboard. Yep, I know: the rest of the world despises me.

However, for people like me (and, to be fair, for other cases where the use of accented letters varies even among speakers of the same or closely related languages, or has changed over time), it would be very nice to have an optional way to do simplified comparisons when searching in beets.

Today, in beets, if I use beet ls -a dvořák I get 5 matches in my database. However, if I use beet ls -a dvorak I get no matches. Although I support MusicBrainz's requirement to use correct spelling, as a non-speaker of the language I would have to search online to look up the correct accents to use. It would be useful if beets allowed (perhaps optionally, based on a config setting) less strict searching. The case is even worse with Noel, which appears in various places in my database in both the forms Noël and Noel - similar problems exist for other partly non-naturalised words which are sometimes spelt in English with accents and sometimes not.

Solution

A popular, powerful approach, standardised by the Unicode Consortium, is called asymmetric matching. Here, speakers choose a language (e.g. English) and can write search terms using the characters of that language: unaccented letters in the query then also match the accented forms in the data. However, if the user actually enters an accent in the search term, then only the same accented letter matches. The link above includes some examples of using the term "resume" to match the many words spelt "resume" but with accents on various letters.

This solution feels very natural and is easy for both speakers and non-speakers of the language to use. Unfortunately, I have not found an existing Python implementation (but I haven't looked particularly hard).
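To make the rule concrete, here is a toy sketch in Python (the asym_equal helper is hypothetical and not the real Unicode algorithm; it assumes precomposed, single-code-point letters):

```python
import unicodedata

def strip_accents(ch):
    # Drop combining marks from one character via NFD decomposition.
    return ''.join(c for c in unicodedata.normalize('NFD', ch)
                   if not unicodedata.combining(c))

def asym_equal(query, candidate):
    # Toy asymmetric comparison: an unaccented query letter matches any
    # accented form of that letter, but an accented query letter matches
    # only itself.
    query = unicodedata.normalize('NFC', query)
    candidate = unicodedata.normalize('NFC', candidate)
    if len(query) != len(candidate):
        return False
    for q, c in zip(query, candidate):
        if q == c:
            continue
        # q acts as an "accent wildcard" only if it is itself unaccented.
        if strip_accents(q) == q and strip_accents(c) == q:
            continue
        return False
    return True

print(asym_equal('resume', 'résumé'))  # True: plain query matches accented data
print(asym_equal('résumé', 'resume'))  # False: accented query demands the accent
```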

Alternatives

Regex searches provide some of this capability, but they require particular fields to be specified and use the arcane regex syntax. For example, the Dvořák search would probably need to be entered as `artist::Dvo.a.` and still requires me to know which letters are accented. It would also miss cases where the composer is mentioned in the title or album title but not as an artist.

@jackwilsdon
Member

The fuzzy plugin already kind of implements this, but it uses difflib with a threshold instead.

@GrahamCobb
Contributor Author

Thanks for the suggestion but, unfortunately, fuzzy doesn't seem to work usefully for these sorts of cases.

I have just tried it, with default config settings. Unfortunately, it seems to fail even to match many identical, unaccented strings: I have six tracks whose titles contain the unaccented string 'Noel', but ~Noel (suitably quoted to avoid shell substitution) fails to find any of them (and matches just one track, apparently on the word Floe?). It isn't a threshold issue either: if I reduce the threshold to 0.3, ~Noel returns 200 matches, but none of them are the 6 which actually contain Noel!

It seems to work even less well with accents. I have 12 tracks with Noël, and fuzzy returns no results for ~Noël. Neither ~dvorak nor ~dvořák matches anything.
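A plausible explanation, assuming fuzzy's matching is the difflib ratio mentioned above: difflib scores the query against the entire field value, so a short query embedded in a longer title scores far below the default threshold of 0.7:

```python
import difflib

# Only 4 of the 4 + 14 characters line up, so the score is roughly 0.44,
# well below fuzzy's default threshold of 0.7.
matcher = difflib.SequenceMatcher(None, 'noel', 'the first noel')
print(matcher.quick_ratio())  # ~0.44
```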

However, I think a similar plugin approach would be a good way to implement this feature: create a new plugin, similar to fuzzy, that uses the Unicode Consortium's asymmetric matching rules.

@GrahamCobb
Contributor Author

Info for anyone considering implementing this (maybe even a future me!):

pyicu is a Python interface to the ICU library. The ICU API documents USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD (which provides the Unicode-specified asymmetric matching when used with a collator at secondary strength) and USEARCH_ANY_BASE_WEIGHT_IS_WILDCARD, which may be a better choice since the track data itself may often be of poor quality regarding accents.
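For anyone who wants to experiment, here is a rough sketch of collated searching via pyicu, assuming the binding mirrors the ICU C++ StringSearch API. Note that it uses a primary-strength collator, which gives symmetric accent- and case-insensitive matching; for true asymmetric behaviour, the USEARCH attributes above would additionally need to be set on the searcher, if the binding exposes them:

```python
from icu import Collator, Locale, StringSearch

collator = Collator.createInstance(Locale('en'))
collator.setStrength(Collator.PRIMARY)  # fold accents and case entirely

# Search for the pattern anywhere in the text using the collator's rules.
search = StringSearch('dvorak', 'Antonín Dvořák: Symphony No. 9', collator)
pos = search.first()  # index of the first match, or -1 (USEARCH_DONE)
print(pos)  # expected: 8, the start of 'Dvořák'
```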

sampsyo added the feature (features we would like to implement) label on Mar 14, 2021
sampsyo changed the title from "Matching accented characters" to "Query with asymmetric matching for accented characters" on Mar 14, 2021
@sampsyo
Member

sampsyo commented Mar 14, 2021

Hello! Yes, I think this would be a cool thing to have, and similar requests come up from time to time (see #2811, for example). It was the original reason we introduced the fuzzy plugin, but that plugin is actually more general, so a more specific kind of query for this sort of thing could still be helpful.

An easy way to go about this would be to reuse our existing unidecode library and just do substring matching between the ASCIIfied query and the ASCIIfied data. That could be worth prototyping to see how satisfying it is!
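For reference, a minimal sketch of that ASCIIfied substring match using unidecode (note that it folds the query too, so it is symmetric rather than the asymmetric matching described earlier):

```python
from unidecode import unidecode

def bare_ascii_match(pattern, value):
    # Fold both sides to lowercase ASCII, then do a plain substring check.
    return unidecode(pattern).lower() in unidecode(value).lower()

print(bare_ascii_match('dvorak', 'Antonín Dvořák'))  # True
print(bare_ascii_match('noel', 'The First Noël'))    # True
```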

Unfortunately, I think it's inevitable that this is going to be a slow kind of query. SQLite does not support this sort of ASCIIfication, so we'd need to do the matching in Python. We have robust built-in support for this kind of slow query, but that doesn't make it fast.

@GrahamCobb
Contributor Author

I have created a quick and dirty plugin (called "bareasc") to experiment with bare ASCII matches using unidecode. By default it is activated using # as a prefix.

It seems to work quite well with the dvořák case but needs some more experimentation. And I haven't looked into performance at all. Would be interested to hear comments from anyone else.

The draft PR is #3883.
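For anyone curious how such a plugin hooks into beets queries, here is a rough sketch along the lines of the fuzzy plugin; the real code is in the PR, and the names here are illustrative (the actual plugin also reads the prefix from its config):

```python
from unidecode import unidecode

from beets.dbcore.query import StringFieldQuery
from beets.plugins import BeetsPlugin


class BareAscQuery(StringFieldQuery):
    """Match if the folded pattern is a substring of the folded field value."""

    @classmethod
    def string_match(cls, pattern, val):
        return unidecode(pattern).lower() in unidecode(val).lower()


class BareAscPlugin(BeetsPlugin):
    def queries(self):
        # Register the query class under the '#' prefix, so e.g.
        # `beet ls '#dvorak'` matches using BareAscQuery.
        return {'#': BareAscQuery}
```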

@GrahamCobb
Contributor Author

Apart from my idiocy last night with using == instead of in and then writing a test which worked anyway :-) the bareasc plugin seems to work. And it seems to meet my needs. What do others think?

I did a quick-and-dirty performance test... on my database (currently around 3200 tracks), a simple beet ls xyzzy (which does not exist) takes around 270ms; a beet ls \#xyzzy takes around 560ms. These are fairly consistent and remain so even for longer strings. It would be interesting to see what others find - for example with significantly larger databases.

If it turns out that doing the comparison in SQLite is actually faster, then I can imagine a way to make that happen: every database record write could add an extra internal column containing ALL the text, from all fields, appended into one string and put through lower() and unidecode. The initial match check could be done in SQL against that column, and only if that matches would the proper, field-specific match check be done in the Python code.
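A toy illustration of that pre-filter idea, outside beets' actual schema (the table and column names are made up):

```python
import sqlite3

from unidecode import unidecode

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (id INTEGER PRIMARY KEY,'
             ' artist TEXT, title TEXT, bare_text TEXT)')

def add_item(artist, title):
    # On every write, store all text fields appended together,
    # lowercased and ASCII-folded, in the extra column.
    bare = unidecode(' '.join([artist, title]).lower())
    conn.execute('INSERT INTO items (artist, title, bare_text) VALUES (?, ?, ?)',
                 (artist, title, bare))

add_item('Antonín Dvořák', 'Symphony No. 9')
add_item('John Coltrane', 'Naima')

# Cheap coarse filter in SQL; only the surviving rows would then go through
# the precise, field-specific match in Python.
bare_query = unidecode('dvorak'.lower())
rows = conn.execute('SELECT id, artist, title FROM items WHERE bare_text LIKE ?',
                    ('%' + bare_query + '%',)).fetchall()
print(rows)  # [(1, 'Antonín Dvořák', 'Symphony No. 9')]
```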

@GrahamCobb
Contributor Author

PR #3883 has now been accepted into beets - thanks to all who helped me with that. I think the bareasc plugin provides a good-enough solution, so I don't plan to work on a more formal approach. I will close this issue now.
