Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/boolean search syntax #310

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

samuellieberman
Copy link
Contributor

Adds Boolean search syntax to the library search bar. For an explanation of how the search syntax works, check out the new "Search Cheat-Sheet" section in README.md.

Some existing features were changed. Tags with spaces now require underscores when being used in searches. Special consideration was made for the upcoming SQLite migration to hopefully reduce necessary refactoring.

I am happy to discuss these details or any other ones if there are any questions or concerns.

@Thesacraft
Copy link
Collaborator

The search feature is really cool but i tested it on a bigger library (about 2.2 million entries) and it makes searching basically unusable, because it took 21 min to search for an entry with a tag. This previously took 1.71s so if you could somehow improve the Performance it would be really nice.

@samuellieberman
Copy link
Contributor Author

The search feature is really cool but i tested it on a bigger library (about 2.2 million entries) and it makes searching basically unusable, because it took 21 min to search for an entry with a tag. This previously took 1.71s so if you could somehow improve the Performance it would be really nice.

Okay, I definitely wasn't expecting that. At it's very worst the syntax parsing should only increase the runtime by a small percentage. I wouldn't have been surprised if it decreased the runtime even.

But I was able to reproduce a 6x slowdown even with only about 1.4k entries, so something has definitely gone wrong. I will try to do some profiling to figure out what is causing this. Thanks.

* improve noticeable performance loss from calling .resolve() in Library.search_library
Remove unimplemented feature search for missing file
@samuellieberman
Copy link
Contributor Author

samuellieberman commented Jul 6, 2024

The search feature is really cool but i tested it on a bigger library (about 2.2 million entries) and it makes searching basically unusable, because it took 21 min to search for an entry with a tag. This previously took 1.71s so if you could somehow improve the Performance it would be really nice.

Okay, I believe I figured out what was causing the slowdown. Do you mind if I ask what you were searching for in your library? I suggest going back to the original TagStudioDev:main branch and trying a search for "missing" or "no file". If my suspicions are correct, then I bet those searches would take about 20 minutes to complete, even in the official project.

The reason for this is that Library.search_library() contains a costly .resolve() operation in order to check an entry's filepath against Library.missing_files. The worst part is that Library tracking missing or displaced files seems to have been broken and unimplemented in for TagStudio a while. The Library.missing_files attribute stays empty, even when I delete files and refresh the library and directories.

My fix is to remove the unimplemented feature from searches for now. I left the rest of the Library.missing_files related code untouched. Please let me know if this resolves your slowdown @Thesacraft. Thanks.

@Thesacraft
Copy link
Collaborator

Thesacraft commented Jul 6, 2024

I searched for a Tag and had one file that was tagged with the tag. I tested it and it made it much faster (34s) but i think if it could be improved it would be really great. And the current search in the main is still much faster with 1.71s so it would be realy nice if you could improve it to like 2-10s. Here is the search and the found file:
grafik

@KillyMXI
Copy link

KillyMXI commented Jul 6, 2024

Where does the syntax come from?
If it were up to me, I wouldn't trust a PR without prior convincing me in the issues that the query language is good enough for the target audience to understand and use.

I sense inspiration from some boorus, but I can't say it's a good thing.

Provided readme section seems insufficient for me to grasp it even if project owners consider the syntax good enough to adopt.
I can't get through the dense explanation with my ADHD.
And slightly lesser issue is improper use of markdown.

@Qronikarz
Copy link

Wasn't the boolean search started in this PR? #284

Also, as far as I know, CyanVoxel didn't decide yet how to handle tags with multiple words - #112

Move str(Path) and os.path.join() operations inside search.py to reduce the chance that they are run
@samuellieberman
Copy link
Contributor Author

I searched for a Tag and had one file that was tagged with the tag. I tested it and it made it much faster (34s) but i think if it could be improved it would be really great. And the current search in the main is still much faster with 1.71s so it would be realy nice if you could improve it to like 2-10s. Here is the search and the found file: grafik

Thank you for trying my project and sharing your searches with me @Thesacraft. I wanted to understand your situation better, so since my last post I have created my own library with 1 million items and performed lots of different searches. Now I have some search results I want to share with you! Have you tried searching for "missing" in the main project like I suggested earlier? I tried that search with the main project, and it took a very similar amount of time to regular searches in my PR before I pushed your performance update. I also wanted to replicate and experiment with your latest result, and I found that I could also get the main project to produce similar times to after my performance update. I suggest also searching "filename: ?" in the main project if you want to replicate this second search for yourself.

I have since pushed another smaller performance update for you to try. I think this will give you another boost to performance, but so far the performance improvements of my updates are all for search dependent reasons, just like in the main project. Right now, the only improvements left that I see involve making the code harder to understand and modify, or starting to sacrifice boolean search syntax entirely. I'd have looked into some of this earlier if I didn't expect non-syntax search performance to change entirely if the project switches to using an SQL database. Non-syntax search performance isn't something I've considered much, so I'm happy to listen to any other performance concerns with my PR.

Where does the syntax come from? If it were up to me, I wouldn't trust a PR without prior convincing me in the issues that the query language is good enough for the target audience to understand and use.

I sense inspiration from some boorus, but I can't say it's a good thing.

Provided readme section seems insufficient for me to grasp it even if project owners consider the syntax good enough to adopt. I can't get through the dense explanation with my ADHD. And slightly lesser issue is improper use of markdown.

Thank you for reading the documentation I wrote for my search syntax and thank you for posting your reply @KillyMXI. I don't want you to feel pressure to trust anything or to be convinced of anything. When I started this feature, I designed it for myself and based on my own needs and preferences. There wasn't a CONTRIBUTING.md and I hadn't even decided I wanted to contribute my code to the main project.

You are discerning to correctly realize that I took inspiration from online image boorus, but my personal search preferences are just a starting point. Now that the feature is working, I have no issue at all with changing the syntax based on feedback. I just see this as an opportunity to help a project that I have already benefited from.

I would also love to collaborate to improve my README cheat-sheet section. Thanks again for your consideration into my syntax and explanation. Does any of this sound like something you would want to help improve?

@Thesacraft
Copy link
Collaborator

Thank you for trying my project and sharing your searches with me @Thesacraft. I wanted to understand your situation better, so since my last post I have created my own library with 1 million items and performed lots of different searches. Now I have some search results I want to share with you! Have you tried searching for "missing" in the main project like I suggested earlier? I tried that search with the main project, and it took a very similar amount of time to regular searches in my PR before I pushed your performance update. I also wanted to replicate and experiment with your latest result, and I found that I could also get the main project to produce similar times to after my performance update. I suggest also searching "filename: ?" in the main project if you want to replicate this second search for yourself.

I have since pushed another smaller performance update for you to try. I think this will give you another boost to performance, but so far the performance improvements of my updates are all for search dependent reasons, just like in the main project. Right now, the only improvements left that I see involve making the code harder to understand and modify, or starting to sacrifice boolean search syntax entirely. I'd have looked into some of this earlier if I didn't expect non-syntax search performance to change entirely if the project switches to using an SQL database. Non-syntax search performance isn't something I've considered much, so I'm happy to listen to any other performance concerns with my PR.

I just tested the searches you suggested and it seems like you are completle right. But with the new peformance improvment a simple Tag search is down to about 5.4s in my library which is pretty good. So i think you shouldn't really sacrifice the boolean syntax to get it faster. I think a better solution is to maybe add a loading bar or some visual indicator (in bigger librarys) and maybe detach the search from the main thread to stop it from blocking the rendering of the application. But this is probably out of scope for this pr and could be added in the future. I really appreciate the work you put in to make it more peformant.

@KillyMXI
Copy link

KillyMXI commented Jul 7, 2024

Full disclosure, I can't pretend that I understand the needs of developers and users of TagStudio.
While I'm hanging around in Discord after finding it a curious app and tangential to some of my other interests, I'm not actively using it.

This issue just happens to appear just at the moment when I noticed it, and it is also tangential to some of my interests. I was unaware of previous issue and PR and associated discussions - didn't notice them before. At the same time, I'm not engaged enough to properly catch up with all of that and confidently deciding what help I can offer. (To be honest, I should focus on my own projects that desperately need some attention...)

So, I was here mostly out of curiosity. When I found that there is something I have issue with - I decided to comment.
I'm not very interested in further work on code and documentation.

What I can add:

  • If you try to teach Boolean logic to someone on an example of tags - you may end up with half decent both pitch and documentation for the feature. At this point I simply can't discuss something I have trouble to understand.
  • I don't think the feature documentation should be crammed into Readme. It deserves a separate document or even multiple ones. Take all the space you need to provide full query examples for every introduced syntax feature.
    • One way to read documentation is to skim it diagonally making notice of all examples. It is currently not possible.
  • I have an issue with booru syntax. I'm familiar Boolean algebra stuff. I'd prefer that being handled well and presented first. I don't care it is more verbose - I can be sure I understand what combines with what. Any weird parentheses-skipping shortcuts should be just shortcuts.
  • Virtual tags that only exist for the sake of search - I'd suggest them all to be namespaced. Look at Hydrus system: namespace.
  • Use proper markdown.
    • In particular, ` for inline, ``` for blocks to wrap syntax examples. It is both easier to read and harder to break.

Most importantly, this exists along with previous discussions (that might hint at other needs for the query language syntax and implementation) and someone else will be making decisions what to accept. Mine is just a bystander opinion, also not closely familiar with previous discussions and needs for the project.

@CyanVoxel CyanVoxel added Type: Enhancement New feature or request TagStudio: Library Relating to the TagStudio library system labels Jul 19, 2024
@CyanVoxel CyanVoxel modified the milestones: Alpha 9.4, Alpha 9.5 Jul 19, 2024
@CyanVoxel CyanVoxel added the TagStudio: Search The TagStudio search engine label Jul 20, 2024
Replacing instances of the term "operand" with "in" and "input" for the sake of brevity and clarity.

Changing _TagNode.match for the sake of brevity and consistency.
…StudioDev#272 and TagStudioDev#325)

Adds ability to check the existence of fields of any type using the following syntax:
```has_<field>```
```has_<field>:<True|False>```
Adds the ability to search the content of text_line and text_box fields using the following syntax:
```<field>:<text>```
(Replace whitespace with underscore _ in `text`)
Updated test_search.py for new behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
TagStudio: Library Relating to the TagStudio library system TagStudio: Search The TagStudio search engine Type: Enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants