
Some Indexers need an overwrite_db or last_indexed_time parameter. #279

Open
hwiorn opened this issue Mar 9, 2022 · 5 comments
Labels
backend Related to indexing/serving performance

Comments

@hwiorn

hwiorn commented Mar 9, 2022

I have made a Joplin indexer, but there is a problem: the indexer needs an incremental-update parameter when the database is large. I have 8000+ notes in my Joplin database, and the indexer finds 24000+ URLs that can become Visits. Indexing takes 17 minutes on my laptop.

Joplin has an update_time field in its notes table, so I think I can implement incremental indexing (updating) in the indexer.

However, the Indexer has no overwrite_db parameter for the case where a user passes the --overwrite flag and wants to restart indexing from scratch. Alternatively, if the promnesia framework passed a last_indexed_time into iter_all_visits, that would be even more helpful.
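A rough sketch of the incremental query I have in mind (table and column names are from memory and may differ between Joplin versions; last_indexed_time is the hypothetical parameter this issue asks for):

```python
import sqlite3
from datetime import datetime

def iter_updated_notes(db_path: str, last_indexed_time: datetime):
    # Joplin stores timestamps as Unix milliseconds; the column is called
    # update_time here, but the actual schema may name it updated_time
    cutoff_ms = int(last_indexed_time.timestamp() * 1000)
    conn = sqlite3.connect(f'file:{db_path}?mode=ro', uri=True)
    try:
        yield from conn.execute(
            'SELECT id, title, body FROM notes WHERE update_time > ?',
            (cutoff_ms,),
        )
    finally:
        conn.close()
```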

@karlicoss
Owner

Hi, sorry for the late response!

It's actually surprising that it takes 17 minutes for 8K notes/24K URLs -- do you know how many lines these are? Unless your laptop is really weak, I would expect it to index much faster. Maybe you could log indexing times for individual notes, figure out which one takes longest, and then we can profile it?
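Something like this, for instance (an untested sketch; extract_urls_from_note is a stand-in for whatever per-note extraction function the indexer actually has):

```python
import logging
import time

logger = logging.getLogger('promnesia.joplin')  # hypothetical logger name

def index_with_timings(notes):
    # log per-note extraction time so pathological notes stand out
    for note in notes:
        start = time.perf_counter()
        urls = list(extract_urls_from_note(note))  # hypothetical helper
        elapsed = time.perf_counter() - start
        logger.info('note %s: %d urls in %.3fs', note.id, len(urls), elapsed)
```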

Otherwise, you're suggesting something like this:

  • pass the last_indexing_time to the Joplin indexer
    (currently it's not stored anywhere, but I guess it wouldn't be too hard to store)
  • the Joplin indexer would query all notes modified between last_indexing_time and the current time, extract visits from them, and insert them into the DB
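Concretely, the flow might look roughly like this (a hypothetical sketch -- none of these parameters or helpers exist in promnesia today):

```python
from datetime import datetime
from typing import Iterator, Optional

def index(last_indexing_time: Optional[datetime] = None) -> Iterator['Visit']:
    # hypothetical incremental entry point: None means a full reindex
    # (e.g. when the user passes --overwrite)
    if last_indexing_time is None:
        notes = all_notes()  # hypothetical helper: every note
    else:
        notes = notes_updated_since(last_indexing_time)  # hypothetical helper
    for note in notes:
        yield from extract_visits(note)  # hypothetical helper
    # caveat: URLs deleted from a note since the last run would linger in
    # the DB as phantom visits, because this interface only ever adds
```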

It kinda makes sense, but one downside is that some URLs might have been removed from a note and would still be present in the promnesia database, because the 'interface' of indexers in Promnesia currently only supports adding new visits. So it would trigger some phantom visits. We might think about changing the interface somehow, but I'd much rather speed up the indexer, for simplicity.

@hwiorn
Author

hwiorn commented Mar 15, 2022

My laptop is a Dell Inspiron 7501 (i7-10750H CPU @ 2.60GHz, 16GB RAM). I don't think this laptop is a slow environment, but some machines such as RPis and AWS Lightsail (1 core) could be slow.

It's actually surprising that it takes 17 minutes for 8K notes/24K URLs -- do you know how many lines these are? Unless your laptop is really weak, I would expect it to index much faster.

Many notes came from Evernote. I used Joplin as an archiving tool and wrote a work journal in it. Some notes are web-clipped, and they seem to contain many useless links. Recently I have been switching from Joplin to org-roam and learning the Zettelkasten method, so now I use Joplin as a way-back machine.

Maybe you could log indexing times for individual notes, figure out which one takes longest, and then we can profile it?

The Joplin indexer was a proof of concept, and it is just an initial version, so I think I can profile the indexing.

It kinda makes sense, but one downside is that some URLs might have been removed from a note and would still be present in the promnesia database, because the 'interface' of indexers in Promnesia currently only supports adding new visits. So it would trigger some phantom visits.

Right. Incremental and partial updates need at least two pieces of metadata (sketched below):

  • The last sync time
  • An ID mapping between source and target
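For example, the bookkeeping on the promnesia side could look something like this (just a sketch; none of this is existing promnesia schema):

```python
import sqlite3

# hypothetical bookkeeping tables for incremental sync; not part of
# the current promnesia schema
SCHEMA = '''
CREATE TABLE IF NOT EXISTS sync_state (
    indexer      TEXT PRIMARY KEY,
    last_sync_ms INTEGER NOT NULL        -- last sync time
);
CREATE TABLE IF NOT EXISTS visit_source (
    visit_id  INTEGER NOT NULL,          -- target: row in the visits table
    source_id TEXT NOT NULL,             -- source: e.g. a Joplin note id
    PRIMARY KEY (visit_id, source_id)
);
'''

def init_sync_tables(conn: sqlite3.Connection) -> None:
    # with this mapping, visits whose source note changed can be deleted
    # and re-extracted, instead of only ever being appended
    conn.executescript(SCHEMA)
```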

We might think about changing the interface somehow, but I'd much rather speed up the indexer, for simplicity.

Yeah, you are right, I can optimize the indexer further. But I think Promnesia needs incremental-update support for slow machines and for efficient indexing.

@karlicoss
Owner

I don't think this laptop is a slow environment

Yep, looks decent; surprising that it takes so much time!

But I think Promnesia needs incremental-update support for slow machines and for efficient indexing.

Yep, I definitely agree it makes sense to make it as fast as we can :) I just mean there is a tradeoff between that and the simplicity of the architecture.

Right. Incremental and partial updates need at least two pieces of metadata: the last sync time, and an ID mapping between source and target.

Yeah -- the problem is the latter: basically, there is currently no way to tell which file a visit in the database came from. To be more precise, no reliable way: there is a Locator thing, but it's not guaranteed to be the exact filename.

Maybe a good compromise would be adding cachew support for file-based indexers: basically, each file would have a cache of its Visits (keyed on the file timestamp), which would automatically be recomputed when necessary.
That would allow keeping promnesia itself simple, without worrying about selectively removing stuff from the database.
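Roughly like this (a sketch only -- the exact cachew invocation may differ, and in practice you'd probably pass an explicit cache_path; the Visit here is a simplified stand-in for promnesia's):

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Iterator, NamedTuple

from cachew import cachew  # https://github.com/karlicoss/cachew

class Visit(NamedTuple):  # simplified stand-in for promnesia's Visit
    url: str
    dt: datetime

URL_RE = re.compile(r'https?://\S+')

# the cache key includes the file's mtime, so editing a note invalidates
# its cached Visits and they get recomputed on the next index run
@cachew(depends_on=lambda path: (str(path), path.stat().st_mtime))
def visits_for_file(path: Path) -> Iterator[Visit]:
    mtime = datetime.fromtimestamp(path.stat().st_mtime)
    for m in URL_RE.finditer(path.read_text()):
        yield Visit(url=m.group(), dt=mtime)
```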

@hwiorn
Author

hwiorn commented Mar 25, 2022

Maybe a good compromise would be adding cachew support for file-based indexers: basically, each file would have a cache of its Visits (keyed on the file timestamp), which would automatically be recomputed when necessary.
That would allow keeping promnesia itself simple, without worrying about selectively removing stuff from the database.

I had already seen cachew, but I thought it was not the right solution for caching; I guess I didn't look closely.
Let me add cachew to the indexer.

@hwiorn
Author

hwiorn commented Mar 25, 2022

Related: #243

karlicoss added the backend (Related to indexing/serving performance) label on Dec 31, 2022