Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First attempt at mutable indexes for lunr.js 2.x #315

Closed
wants to merge 7 commits into from

Conversation

hoelzro
Copy link
Contributor

@hoelzro hoelzro commented Nov 16, 2017

Hi there! I chimed in a separate issue about mutable indexes for 2.x, and I took a stab at adding them for my own purposes. This PR (which I don't expect you to merge if you don't want to - I just wanted to share my work, get some review, and start a conversation) is the result of that work. I'll be trying this for my FTS TiddlyWiki plugin - if it's something you're not interested in merging in, I can always publish it as a separate module, perhaps.

I made a few tradeoffs for this initial implementation:

  • Mutable indexes work by having a handle to their original builder - obviously this inflates the index size a bit. For my purposes, I've considered this to be an acceptable tradeoff.
  • There's no sugar like lunr(function() { ... }) for mutable indexes - I might add that later.
  • Changing a builder's tokenizer won't persist across serialization boundaries - I may add this if a) I can figure out the best way how and b) there's interest.
  • Gaps in builder.termIndex may build up when documents are deleted - I consider this to be acceptable since in the kinds of documents I'm working with, terms will seldom disappear permanently.
  • The index is completely rebuilt when a document is added/updated/removed - I intend on changing index operations to lazily refresh the index contents to cut down on redundant calculations.

JavaScript isn't a language that I "speak" fluently, so any feedback on idioms that I may have improperly used and how to correct them would be most welcome. I would appreciate any feedback on this, even "this won't work because of X, Y, and Z" because that would save me some headaches. =)

@olivernn
Copy link
Owner

I need to take a closer look through the change, but I think the general approach might work. Having a subclass of the immutable index that adds mutability. I especially like the fact that it can then be packaged as an extension that users can opt into if they need it.

Let me take a closer look through your implementation to see if there are any gotchas that you need to be aware of.

@hoelzro
Copy link
Contributor Author

hoelzro commented Nov 16, 2017

Sounds good - thanks for the quick response!

Some properties on the builder are shared with the index, so
we can save a bit of serialization space by not serializing
the exact same data twice
lunr.Builder.fieldTermFrequencies and lunr.Builder.fieldLengths
have the exact same keys, and are both keyed by field refs.  It
saves us some space when serializing a mutable builder to save
the list of field refs once, along with the values in these two
objects, and recombine them into objects at load time
@hoelzro
Copy link
Contributor Author

hoelzro commented Nov 17, 2017

I added a few new commits - just trimming down the serialized size for mutable indexes a bit.

Otherwise, if you're performing multiple changes to the index in-between
queries, you end up doing a lot of redundant work
@k00p
Copy link

k00p commented Dec 20, 2017

This PR looks great - it would be extremely useful for server side applications, or any use case requiring index maintenance. Anything I can do to help get this merged?

@hoelzro
Copy link
Contributor Author

hoelzro commented Dec 20, 2017

@k00p I put this here as a sort of proof-of-concept - I was actually thinking of publishing it as a separate extension for lunr.js. I've been using it on my own for a while now and it seems to work fine - I should probably just package it up on NPM =)

@k00p
Copy link

k00p commented Dec 20, 2017

I went ahead and installed it as a node module directly from this pull request:

$ npm i --save olivernn/lunr.js\#pull/315/head

Note that the \# is an escape on the # character for my shell. Use in other shells may vary.

Test errors out of the box - the tests need to be added to test/env/file_list.json.

I am still not getting it to work yet, but I will keep trying to figure it out.

Update: Have it working now but am still validating updates and removes. If you aren't concerned with the errors for the npm test for the module, so far using a reference to this PR is successful. At this point there is nothing that would lead me to discourage accepting this PR, and I will update if anything comes up. FWIW the tokenizer persistence may be a show stopper.

Last Update: The tokenizer persistence issue ended up not being a big deal. I was concerned because I was attempting to use metadata to carry document data, but that was a fundamentally flawed strategy. The fix was to maintain a dictionary in addition to the index, and then ornament the search results from the dictionary. For my purposes, the increase in memory footprint is worth the gains in response times.

@hoelzro
Copy link
Contributor Author

hoelzro commented Jan 11, 2018

So I finally got around to bundling this PR up into a standalone NPM package: https://www.npmjs.com/package/lunr-mutable-indexes

It's my first NPM package, so any feedback on how I could improve it would be great!

@andymcm
Copy link
Contributor

andymcm commented Feb 4, 2018

Just wanted to say that I'm using this and finding it extremely valuable. Would be great to see it as part of core Lunr.

@k00p
Copy link

k00p commented Feb 7, 2018

I switched my focus as well to @hoelzro lunr-mutable-indexes project. The benefits for server side index maintenance greatly outweigh the drawbacks, and so far the project "piggybacks" on lunr, so it should continue to be a mutually beneficial relationship between the two repos.

@hoelzro
Copy link
Contributor Author

hoelzro commented Feb 10, 2018

Just want to update everyone monitoring this - @k00p was kind enough to lend me some of their time, so I integrated their changes and released version 0.2.0 to NPM! Thanks @k00p!

@hoelzro
Copy link
Contributor Author

hoelzro commented Mar 7, 2018

I'm going to go ahead and close this - I think that this functionality existing as a separate module is the best place for this, and I can always submit a new PR in the future if @olivernn wants me to fold this into lunr core.

@hoelzro hoelzro closed this Mar 7, 2018
@NallT
Copy link

NallT commented May 1, 2018

@hoelzro This works great. I have server side code where the index is rather large and often updated.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants