Question about set_key performance #199

danpatton · 2019-07-06T09:50:02Z

This is a question, not a problem per se. I'm not sure of the correct etiquette, so please accept my apologies if I've got it wrong.

My use case involves repeatedly calling set_key on a Cursor object, for a large number of keys, purely in order to determine whether or not each such key exists within the database. The values associated with the keys may be large, and I don't want to read them.

The invocation of preload inside the _cursor_get_c function (added by @dw in this commit) results in every page of the value being touched, potentially generating a number of page faults. This is clearly undesirable if the caller has no intention of actually reading the values.

My question is whether it would be possible to allow the caller to "opt out" of this behaviour on a per-call (or per-Cursor) basis? I'd be happy to attempt the change myself and submit a PR, but obviously there's not a lot of point if you think it's a terrible idea (or have an alternative suggestion).

Many thanks.

The text was updated successfully, but these errors were encountered:

jnwatson · 2019-07-08T04:07:24Z

Wow. I think you've uncovered a pretty significant problem. This would be a problem for any iterator where values=False.

From a user-facing API perspective, I don't think it makes sense to make the user decide whether to preload or not. The right place to preload is actually when they retrieve the data. We should mark in the cursor the first time we preload (and clear when the position is moved) so we don't do it twice for the same key.

Good catch. I'm going to work on this later Monday.

danpatton · 2019-07-08T08:40:15Z

Great, that sounds better than what I had in mind!

On a related note, the user-facing API has a putmulti call, but no getmulti. What's your view on adding getmulti, and also existsmulti, to move the loops required for both operations out of Python and into C?

jnwatson · 2019-07-08T20:52:20Z

I like both your ideas. The only issue I see is working out the interaction between get and dupsort DBs.

jnwatson · 2019-07-09T06:21:50Z

I have a branch up. #202. It doesn't work on Windows right now. It would be fantastic if you could see if this helps your performance issue.

I'm scratching my head thinking of a way to actually evaluate whether this worked.

dw · 2019-07-09T15:07:05Z

You can test whether a page was pulled into cache using mincore():

mincore = ctypes.CDLL(None).mincore
mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_void_p]

bitset = bytearray(1)  # 8 bits, room to track status of 8 pages

somevalue = "foo"

assert 0 == mincore(
  ctypes.cast(somevalue, ctypes.c_void_p).value & ~0xfff,  # round down to pagesize
  4096,
  (ctypes.c_char*1).from_buffer(bitset),
)

print("page is in memory?", bitset[0] & 0x01)

It should be possible to extract the underlying pointer for buffer() objects returned when Transaction(buffers=True), but I can't immediately figure out how. Ctypes should be able to do this automatically, but cast() is throwing errors here

edit: seems ctypes won't do it directly, but it's still possible

danpatton · 2019-07-09T20:55:56Z

Thanks. I can confirm that it's a lot faster with this change.

My very rudimentary test was to create a new db and populate it with a number of large-ish values, each of which is just a chunk of random noise (I went with 100 x 32MB), then open the env again in a new process and do a set_key for a selection of keys, in random order, some of which are in the db and some not. With your branch, the second part runs ~1000x faster than with 0.94.

jnwatson · 2019-07-11T04:51:40Z

OK. I'm all set. David, would you mind reviewing?

jnwatson · 2019-07-15T02:29:56Z

Fix is merged to master and released in 0.96.

jnwatson mentioned this issue Jul 8, 2019

Feature request: implement getmulti, existmulti #201

Open

jnwatson closed this as completed Jul 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question about set_key performance #199

Question about set_key performance #199

danpatton commented Jul 6, 2019

jnwatson commented Jul 8, 2019

danpatton commented Jul 8, 2019

jnwatson commented Jul 8, 2019

jnwatson commented Jul 9, 2019

dw commented Jul 9, 2019 •

edited

Loading

danpatton commented Jul 9, 2019

jnwatson commented Jul 11, 2019 •

edited

Loading

jnwatson commented Jul 15, 2019

Question about set_key performance #199

Question about set_key performance #199

Comments

danpatton commented Jul 6, 2019

jnwatson commented Jul 8, 2019

danpatton commented Jul 8, 2019

jnwatson commented Jul 8, 2019

jnwatson commented Jul 9, 2019

dw commented Jul 9, 2019 • edited Loading

danpatton commented Jul 9, 2019

jnwatson commented Jul 11, 2019 • edited Loading

jnwatson commented Jul 15, 2019

dw commented Jul 9, 2019 •

edited

Loading

jnwatson commented Jul 11, 2019 •

edited

Loading