pageserver: getpage requests sometimes skip reading recently written image layers #9185

Open
jcsp opened this issue Sep 27, 2024 · 2 comments
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@jcsp
Collaborator

jcsp commented Sep 27, 2024

Via investigation of #9058 -- in that issue, it was observed that layers before recently written image layers were being visited by getpage requests.

It seems that under some circumstances, a getpage request at the exact LSN where an image layer exists can fail to hit that image layer. It is not clear whether being at the exact same LSN matters: it might just be that reads don't hit image layers until the current in-memory layer is closed.

There is a lot of uncertainty here; I am not claiming to have conclusively diagnosed this.

Branch with experimental test:
https://github.com/neondatabase/neon/tree/jcsp/layer-map-search-at-image-lsn-2

In that branch, there are some log lines hacked in to record which layers are visited at INFO level. In the test, there is a checkpoint line commented out:

    # Uncomment this checkpoint, and the logs will show getpage requests hitting the image layers we
    # just created.  However, without the checkpoint, getpage requests will hit one InMemoryLayer and
    # one persistent delta layer.
    # env.pageserver.http_client().timeline_checkpoint(tenant_id, timeline_id, wait_until_uploaded=True)

The presence or absence of in-memory layers shouldn't make any difference to whether reads hit an image layer, but apparently it does.

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Sep 27, 2024
@jcsp
Collaborator Author

jcsp commented Oct 14, 2024

A cleaner reproducer that uses layer eviction + on-demand downloads to prove which layers are touched by a getpage request: https://github.com/neondatabase/neon/tree/jcsp/layer-map-search-at-image-lsn-3

This test does reads at exactly the LSN of the image layer, but I can also reproduce the issue with some writes between generating the image layer and doing the read, so this is not something that only occurs when reading exactly at the image layer's LSN. I suspect our reads are skipping the image layer until the next time we freeze the ephemeral layer.

@jcsp
Collaborator Author

jcsp commented Oct 15, 2024

Perhaps this piece of logic is at fault in get_vectored_reconstruct_data_timeline:

                match in_memory_layer {
                    Some(l) => {
                        let lsn_range = l.get_lsn_range().start..cont_lsn;
                        fringe.update(
                            ReadableLayer::InMemoryLayer(l),
                            unmapped_keyspace.clone(),
                            lsn_range,
                        );
                    }

...because lsn_range is constructed from the absolute start of the layer. As a result, cont_lsn jumps back to the start of the oldest in-memory layer before we look at historic layers at all, so any image layer written above that LSN would never be considered.
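
To make the suspected mechanism concrete, here is a toy model of the layer walk. It is only a sketch of the hypothesis: `Kind`, `Layer`, and `visited_layers` are hypothetical names invented for illustration, not the pageserver's real layer map or fringe types, and the selection rule (pick the newest historic layer that starts below `cont_lsn`) is a deliberate simplification.

```rust
// Toy model of the hypothesis above. These types are illustrative only and
// are NOT the pageserver's real layer map or fringe types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    InMemory,
    Image,
    Delta,
}

#[derive(Debug, Clone, Copy)]
struct Layer {
    kind: Kind,
    start_lsn: u64, // inclusive start of the layer's LSN range
}

/// Returns the kinds of layers a read at `request_lsn` would visit, newest first.
fn visited_layers(
    open_in_memory: Option<Layer>,
    historic: &[Layer],
    request_lsn: u64,
) -> Vec<Kind> {
    let mut visited = Vec::new();
    // Exclusive upper bound for the next layer search.
    let mut cont_lsn = request_lsn.saturating_add(1);

    if let Some(l) = open_in_memory {
        visited.push(l.kind);
        // Mirrors `l.get_lsn_range().start..cont_lsn`: the search resumes from
        // the absolute start of the in-memory layer.
        cont_lsn = l.start_lsn;
    }

    // Repeatedly pick the newest historic layer that starts below `cont_lsn`.
    while let Some(l) = historic
        .iter()
        .filter(|l| l.start_lsn < cont_lsn)
        .max_by_key(|l| l.start_lsn)
    {
        visited.push(l.kind);
        if l.kind == Kind::Image {
            break; // an image holds the full page, so the walk stops here
        }
        cont_lsn = l.start_lsn;
    }
    visited
}
```

With made-up LSNs (historic layers: an image at 40, a delta starting at 50, and a recent image at 150; an open in-memory layer starting at 100; a read at LSN 150), this model visits the in-memory layer, then the delta, then the old image at 40, and never sees the image at 150. Passing `None` for the in-memory layer, as a stand-in for the post-checkpoint state, makes the same read hit the image at 150 directly. That matches the behaviour seen in the test, but it only illustrates the suspicion and does not confirm it.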
