Query with asymmetric matching for accented characters #3882

Closed
GrahamCobb opened this issue Mar 13, 2021 · 7 comments
Labels: feature (features we would like to implement)

Comments

@GrahamCobb
Contributor

Use case

OK - let me admit it before I start... I am a dumb native English speaker and I don't like letters with accents and don't even know how to enter most of them on my keyboard. Yep, I know: the rest of the world despises me.

However, for people like me (and, to be fair, for other cases where the use of accented letters varies even among speakers of the same or closely related languages, or has changed over time), it would be very nice to have an optional way to do simplified comparisons when searching in beets.

Today, in beets, if I use beet ls -a dvořák I get 5 matches in my database. However, if I use beet ls -a dvorak I get no matches. Although I support MusicBrainz's requirement to use correct spelling, as a non-speaker of the language I would have to search online to look up the correct accents to use. It would be useful if beets allowed (perhaps optionally, based on a config setting) less strict searching. The case is even worse with Noel, which appears in various places in my database in both the forms Noël and Noel - similar problems exist for other partly non-naturalised words which are sometimes spelt in English with accents and sometimes not.

Solution

A popular, powerful approach, standardised by the Unicode Consortium, is called asymmetric matching. Here, speakers choose a language (e.g. English) and can write search terms using the characters of that language: unaccented letters in the query then also match the accented forms in the data. However, if the user actually enters an accent in the search term, then only the same accented letter matches. The link above includes some examples of using the term "resume" to match the many words spelt "resume" but with accents on various letters.

This solution feels very natural and is easy for both speakers and non-speakers of the language to use. Unfortunately, I have not found an existing Python implementation (but I haven't looked particularly hard).
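To make the rule concrete, here is a toy sketch in Python (the asym_equal helper is hypothetical and not the real Unicode algorithm; it assumes precomposed, single-code-point letters):

```python
import unicodedata

def strip_accents(ch):
    # Drop combining marks from one character via NFD decomposition.
    return ''.join(c for c in unicodedata.normalize('NFD', ch)
                   if not unicodedata.combining(c))

def asym_equal(query, candidate):
    # Toy asymmetric comparison: an unaccented query letter matches any
    # accented form of that letter, but an accented query letter matches
    # only itself.
    query = unicodedata.normalize('NFC', query)
    candidate = unicodedata.normalize('NFC', candidate)
    if len(query) != len(candidate):
        return False
    for q, c in zip(query, candidate):
        if q == c:
            continue
        # q acts as an "accent wildcard" only if it is itself unaccented.
        if strip_accents(q) == q and strip_accents(c) == q:
            continue
        return False
    return True

print(asym_equal('resume', 'résumé'))  # True: plain query matches accented data
print(asym_equal('résumé', 'resume'))  # False: accented query demands the accent
```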

Alternatives

Regex searches provide some of this capability, but they require particular fields to be specified and use the arcane regex syntax. For example, the Dvořák search would probably need to be entered as `artist::Dvo.a.` and still requires me to know which letters are accented. It would also miss cases where the composer is mentioned in the title or album title but not as an artist.

@jackwilsdon
Member

The fuzzy plugin already kind of implements this, but it uses difflib with a threshold instead.

@GrahamCobb
Contributor Author

Thanks for the suggestion but, unfortunately, fuzzy doesn't seem to work usefully for these sorts of cases.

I have just tried it, with default config settings. Unfortunately, it seems to fail even to match many identical, unaccented strings: I have six tracks whose titles contain the unaccented string 'Noel', but ~Noel (suitably quoted to avoid shell substitution) fails to find any of them (and matches just one track, apparently on the word Floe?). It isn't a threshold issue either: if I reduce the threshold to 0.3, ~Noel returns 200 matches, but none of them are the 6 which actually contain Noel!

It seems to work even less well with accents. I have 12 tracks with Noël, and fuzzy returns no results for ~Noël. Neither ~dvorak nor ~dvořák matches anything.
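A plausible explanation, assuming fuzzy's matching is the difflib ratio mentioned above: difflib scores the query against the entire field value, so a short query embedded in a longer title scores far below the default threshold of 0.7:

```python
import difflib

# Only 4 of the 4 + 14 characters line up, so the score is roughly 0.44,
# well below fuzzy's default threshold of 0.7.
matcher = difflib.SequenceMatcher(None, 'noel', 'the first noel')
print(matcher.quick_ratio())  # ~0.44
```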

However, I think a similar plugin approach would be a good way to implement this feature: create a new plugin, similar to fuzzy, that uses the Unicode Consortium's asymmetric matching rules.

@GrahamCobb
Contributor Author

Info for anyone considering implementing this (maybe even a future me!):

pyicu is a Python interface to the ICU library. The ICU API documents USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD (which provides the Unicode-specified asymmetric matching when used with a collator at secondary strength) and USEARCH_ANY_BASE_WEIGHT_IS_WILDCARD, which may be a better choice since the track data itself may often be of poor quality regarding accents.
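For anyone who wants to experiment, here is a rough sketch of collated searching via pyicu, assuming the binding mirrors the ICU C++ StringSearch API. Note that it uses a primary-strength collator, which gives symmetric accent- and case-insensitive matching; for true asymmetric behaviour, the USEARCH attributes above would additionally need to be set on the searcher, if the binding exposes them:

```python
from icu import Collator, Locale, StringSearch

collator = Collator.createInstance(Locale('en'))
collator.setStrength(Collator.PRIMARY)  # fold accents and case entirely

# Search for the pattern anywhere in the text using the collator's rules.
search = StringSearch('dvorak', 'Antonín Dvořák: Symphony No. 9', collator)
pos = search.first()  # index of the first match, or -1 (USEARCH_DONE)
print(pos)  # expected: 8, the start of 'Dvořák'
```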

sampsyo added the feature (features we would like to implement) label on Mar 14, 2021
sampsyo changed the title from "Matching accented characters" to "Query with asymmetric matching for accented characters" on Mar 14, 2021
@sampsyo
Member

sampsyo commented Mar 14, 2021

Hello! Yes, I think this would be a cool thing to have, and similar requests come up from time to time (see #2811, for example). It was the original reason we introduced the fuzzy plugin, but that plugin is actually more general, so a more specific kind of query for this sort of thing could still be helpful.

An easy way to go about this would be to reuse our existing unidecode library and just do substring matching between the ASCIIfied query and the ASCIIfied data. That could be worth prototyping to see how satisfying it is!
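For reference, a minimal sketch of that ASCIIfied substring match using unidecode (note that it folds the query too, so it is symmetric rather than the asymmetric matching described earlier):

```python
from unidecode import unidecode

def bare_ascii_match(pattern, value):
    # Fold both sides to lowercase ASCII, then do a plain substring check.
    return unidecode(pattern).lower() in unidecode(value).lower()

print(bare_ascii_match('dvorak', 'Antonín Dvořák'))  # True
print(bare_ascii_match('noel', 'The First Noël'))    # True
```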

Unfortunately, I think it's inevitable that this is going to be a slow kind of query. SQLite does not support this sort of ASCIIfication, so we'd need to do the matching in Python. We have robust built-in support for this kind of slow query, but that doesn't make it fast.

@GrahamCobb
Contributor Author

I have created a quick and dirty plugin (called "bareasc") to experiment with bare ASCII matches using unidecode. By default it is activated using # as a prefix.

It seems to work quite well with the dvořák case but needs some more experimentation. And I haven't looked into performance at all. Would be interested to hear comments from anyone else.

The draft PR is #3883.
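For anyone curious how such a plugin hooks into beets queries, here is a rough sketch along the lines of the fuzzy plugin; the real code is in the PR, and the names here are illustrative (the actual plugin also reads the prefix from its config):

```python
from unidecode import unidecode

from beets.dbcore.query import StringFieldQuery
from beets.plugins import BeetsPlugin


class BareAscQuery(StringFieldQuery):
    """Match if the folded pattern is a substring of the folded field value."""

    @classmethod
    def string_match(cls, pattern, val):
        return unidecode(pattern).lower() in unidecode(val).lower()


class BareAscPlugin(BeetsPlugin):
    def queries(self):
        # Register the query class under the '#' prefix, so e.g.
        # `beet ls '#dvorak'` matches using BareAscQuery.
        return {'#': BareAscQuery}
```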

@GrahamCobb
Contributor Author

Apart from my idiocy last night with using == instead of in and then writing a test which worked anyway :-) the bareasc plugin seems to work. And it seems to meet my needs. What do others think?

I did a quick-and-dirty performance test... on my database (currently around 3200 tracks), a simple beet ls xyzzy (which does not exist) takes around 270ms; a beet ls \#xyzzy takes around 560ms. These are fairly consistent and remain so even for longer strings. It would be interesting to see what others find - for example with significantly larger databases.

If it turns out that doing the comparison in SQLite is actually faster, then I can imagine a way to make that happen: every database record write could add an extra internal column containing ALL the text, from all fields, appended into one string and put through lower() and unidecode. The initial match check could be done in SQL against that column, and only if that matches would the proper, field-specific match check be done in the Python code.
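A toy illustration of that pre-filter idea, outside beets' actual schema (the table and column names are made up):

```python
import sqlite3

from unidecode import unidecode

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (id INTEGER PRIMARY KEY,'
             ' artist TEXT, title TEXT, bare_text TEXT)')

def add_item(artist, title):
    # On every write, store all text fields appended together,
    # lowercased and ASCII-folded, in the extra column.
    bare = unidecode(' '.join([artist, title]).lower())
    conn.execute('INSERT INTO items (artist, title, bare_text) VALUES (?, ?, ?)',
                 (artist, title, bare))

add_item('Antonín Dvořák', 'Symphony No. 9')
add_item('John Coltrane', 'Naima')

# Cheap coarse filter in SQL; only the surviving rows would then go through
# the precise, field-specific match in Python.
bare_query = unidecode('dvorak'.lower())
rows = conn.execute('SELECT id, artist, title FROM items WHERE bare_text LIKE ?',
                    ('%' + bare_query + '%',)).fetchall()
print(rows)  # [(1, 'Antonín Dvořák', 'Symphony No. 9')]
```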

@GrahamCobb
Contributor Author

PR #3883 has now been accepted into beets - thanks to all who helped me with that. I think the bareasc plugin provides a good-enough solution, so I don't plan to work on a more formal approach. I will close this issue now.
