Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary snapshots #489

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
Open

Conversation

lasernoises
Copy link

This is a relatively basic implementation of binary snapshots. I decided to adapt my implementation mentioned in #196 (comment) to handle binary files in general and remove the whole thing that prints images to the terminal.

The basic approach is store a file next to the snapshot file with the provided extension appended. The snapshot macro gets passed a closure with an &mut File parameter. Just using a Vec<u8> would be somewhat simpler in implementation, but I would be worried about memory usage if there's a bunch of tests running with large files.

I decided not to return a Result from the closure, because if there's an I/O error inside it probably makes sense just to panic there because we're inside a test anyway. But maybe it would still make sense to return a Result there just so there is no need for .unwrap() inside the closure?

Currently it directly writes the file at the location where it will end up at the beginning of the assert_snapshot function. This means the binary file is created as snap.new.$extension much sooner than the metadata is saved because we need the file to be able to compare it. What is a bit weird is if there is a crash somewhere between overwriting the binary file and storing the .snap file it will leave the two files in an inconsistent state. Creating it as .snap.new.$extension.tmp and then moving it when saving the metadata might be a good alternative approach.

There is also a problem if the extension changes between two runs where the old pending binary file will stay around. It doesn't seem to a problem in practice though because when running with cargo insta test because cargo-insta removes the old pending snapshots first. Maybe it would make sense to also do that generally at the start of assert_snapshot or alternatively cleanup all .snap.new.* somewhere.

The review TUI looks like this:

2024-05-02_15:36:43

2024-05-02_15:58:36

The open action opens both files (or just one if there's no existing snapshot, or the existing one is text) in an external application using the open crate.

The file paths are using the OSC-8 escape sequence (described here) to link to the files. Here it says that file:/// without a hostname should be avoided, but at least my terminal (WezTerm) seems to support it without a hostname and so I left that out for now since it would mean an extra dependency on a crate to get the hostname. As an alternative we could also just print the path including file:// to the terminal since many terminals seem to support clicking on those as well.

@mitsuhiko
Copy link
Owner

Creating it as .snap.new.$extension.tmp and then moving it when saving the metadata might be a good alternative approach.

I like this in theory. What might make this whole thing more robust would be to say that the following has to be true:

foo.snap.ANY_EXT has to be unique. On review/write if more than one extension is found, other files are deleted. A not yet fully persisted file is temporarily stored as foo.snap._ to mark it as temporary. If it's left over after a run, it would be cleaned up like any other duplicate extension.

The open action opens both files (or just one if there's no existing snapshot, or the existing one is text) in an external application using the open crate

It would be interesting to also add a p for preview on macos which would open quick-look on supported files. Not sure if an equivalent exists on windows. But this could be done in a followup.

@lasernoises
Copy link
Author

I've made some adjustments:

As you suggested I've changed it so the contents are first written to a temporary file with the .snap._ extension.

At the start of the assert_snapshot function and after accepting and rejecting in review any extra files are now cleaned up.

I had to change the Snapshot::save* methods to change &mut self so they can change the path to the newly saved path. Which makes it a little bit weird because it's not necessarily clear from the name that those functions will also move a file now.

This problem could also be solved in different ways:
Instead of moving the file we could copy it and then always regenerate the binary file path when needed. But that would probably mean an extra parameter so it knows where to copy the contents file from.
Or maybe save_new could make the assumption that there's a temporary file in .snap._.

Also the .new and ._ extensions and extensions starting with .new.* are now forbidden so there can't be conflicts.

Maybe I can add the quick-look thing for macos soon, but I'll have to borrow a mac from a colleague to test it. Also I'm not sure if opening both files in quick-look makes sense. Maybe two different shortcuts makes more sense.

insta/src/runtime.rs Outdated Show resolved Hide resolved
Comment on lines 516 to 519
let mut this = File::open(this).unwrap();
let mut other = File::open(other).unwrap();

file_eq(&mut this, &mut other).unwrap()
Copy link
Sponsor Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically we've had issues when running multiple tests at the same time; files can disappear under insta. Can we handle that a bit better here rather than blindly panicking? For example, at least with .expect with a message so we can debug?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it to return Result's with Box<dyn Error> to be more consistent with the rest of the code. It's not quite optimal for debugging because the origin of the error is missing, but I thought it would be better to be consistent.

Variants in a new error enum might be an alternative. Those could then also be returned as dyn Error's. Or just using expect as you suggested. I'm not quite sure what is best here.

Copy link
Sponsor Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agree there are tradeoffs, and just ignoring the errors might not be ideal.

A more detailed approach would be to think through when errors are really errors. For example, if a pending snapshot goes missing, that's fine — another process might have removed it. In contrast, if a target snapshot isn't available that's probably unexpected.

But so far ignoring errors with reading files seems to have been OK in practice; possibly we could print a message to protect against confusing results (though maybe others will disagree, ofc welcome).

Also fixes missing call to memoize_snapshot_file in the binary case.
@max-sixty
Copy link
Sponsor Collaborator

max-sixty commented Jun 7, 2024

While we wait for a more informed review, one thought (not confident): I notice we add new enums — would there be any case for collapsing "Inline | Named" enums and the new "String | Binary" enums into a single enum? Every Binary snapshot must be Named. But I'm not sure if that makes the structure simpler...

Edit: one low-hanging fruit as an example — the proposed code has SnapshotType with String | Binary. We currently don't encode whether it's an Inline snapshot in that struct, we just check whether it has a name — we could instead add Inline as an option there.

@lasernoises
Copy link
Author

While we wait for a more informed review, one thought (not confident): I notice we add new enums — would there be any case for collapsing "Inline | Named" enums and the new "String | Binary" enums into a single enum? Every Binary snapshot must be Named. But I'm not sure if that makes the structure simpler...

Edit: one low-hanging fruit as an example — the proposed code has SnapshotType with String | Binary. We currently don't encode whether it's an Inline snapshot in that struct, we just check whether it has a name — we could instead add Inline as an option there.

I suppose the refval thing is one of those cases. I could change the SnapshotContent enum to something like this:

type SnapshotName<'a> = Option<Cow<'a, str>>;

pub enum SnapshotValue<'a> {
    String {
        name: SnapshotName<'a>,
        content: &'a str,
    },

    InlineString {
        reference_content: &'a str,
        content: &'a str,
    },

    Binary {
        name: SnapshotName<'a>,
        write: &'a mut dyn FnMut(&mut File),
        extension: &'a str,
    },
}

or maybe

type SnapshotName<'a> = Option<Cow<'a, str>>;

pub enum ReferenceValue<'a> {
    Named(SnapshotName<'a>),
    Inline(&'a str),
}

pub enum SnapshotValue<'a> {
    String {
        refval: ReferenceValue<'a>,
        content: &'a str,
    },

    Binary {
        name: SnapshotName<'a>,
        write: &'a mut dyn FnMut(&mut File),
        extension: &'a str,
    },
}

I'm a bit less sure about SnapshotType. I don't think we need to represent the inline case there because the metadata isn't being stored in that case if I understand correctly.

@max-sixty
Copy link
Sponsor Collaborator

max-sixty commented Jun 14, 2024

Yes those both seem reasonable, though on reflection not obvious changes either.

(I note we're handling the contents differently between named string and named binary snapshots — string snapshots we hold the value in memory while binary snapshots we hold only the path, which leads to some of the split. Maybe that's because we expect the binary snapshots to be larger?)

I'm a bit less sure about SnapshotType. I don't think we need to represent the inline case there because the metadata isn't being stored in that case if I understand correctly.

(edit) — I tried creating a SnapshotType on the existing master; it doesn't pull its weight since the Snapshot is largely used similarly across types.

@lasernoises
Copy link
Author

(I note we're handling the contents differently between named string and named binary snapshots — string snapshots we hold the value in memory while binary snapshots we hold only the path, which leads to some of the split. Maybe that's because we expect the binary snapshots to be larger?)

Yes, the reason for storing them directly in files is the size. I was concerned about large files, especially when multiple tests are running concurrently. But this is definitely a tradeoff since passing around a Vec<u8> would make the code simpler and more uniform. I also think and it's unclear whether creating huge binary snapshots is a realistic usecase especially if you want to store them in git.

@max-sixty
Copy link
Sponsor Collaborator

But this is definitely a tradeoff since passing around a Vec<u8> would make the code simpler and more uniform. I also think and it's unclear whether creating huge binary snapshots is a realistic usecase especially if you want to store them in git.

Yeah. And IIUC we are storing them in memory when writing them at the moment. (+ agree given they're in git, the likelihood of multi-GB values seems very remote...)

@max-sixty
Copy link
Sponsor Collaborator

@lasernoises I guess we're still waiting on a review.

Not a verdict, but on further reflection, I do think simplifying the code to load snapshots into memory would make sense. The prospect of multi-GB sized snapshots seems quite unlikely — they're stored in git, the computational cost of diffing snapshots of that size will likely bind before the memory size becomes an issue. Even before your changes, the code handling different types of snapshots is a bit complicated.

(But ofc no need to make changes based on my opinion)

@lasernoises
Copy link
Author

Sorry for replying so late.

Yeah. And IIUC we are storing them in memory when writing them at the moment. (+ agree given they're in git, the likelihood of multi-GB values seems very remote...)

Because it's a callback that get's an &mut File it doesn't have to ever be in memory as a whole thing. For example in my usecase, which is for PDFs the PDF as an array of bytes is never fully in memory, only an intermediate datastructure that can then be written out as bytes.

@lasernoises I guess we're still waiting on a review.

Not a verdict, but on further reflection, I do think simplifying the code to load snapshots into memory would make sense. The prospect of multi-GB sized snapshots seems quite unlikely — they're stored in git, the computational cost of diffing snapshots of that size will likely bind before the memory size becomes an issue. Even before your changes, the code handling different types of snapshots is a bit complicated.

(But ofc no need to make changes based on my opinion)

I think you're right and that for real usecases you'll want to minimize size anyways. In my PDF case I avoid embedding fonts for example and use the builtin fonts instead.

I think I'll try to take some time next week to try it with the binaries in memory.

@max-sixty
Copy link
Sponsor Collaborator

Thanks for resolving the merge conflicts! That must have been a decent amount of work...

One note — I've recently spent lots of time simplifying the internals, and I'd be very hesitant to merge the approach that uses files rather than in-memory values — I think that brings another layer of complication without much benefit. (at least — I want to understand whether extremely large snapshots are really being used, such that it's worth maintaining the code for two different approaches)...

@lasernoises
Copy link
Author

Thanks for resolving the merge conflicts! That must have been a decent amount of work...

One note — I've recently spent lots of time simplifying the internals, and I'd be very hesitant to merge the approach that uses files rather than in-memory values — I think that brings another layer of complication without much benefit. (at least — I want to understand whether extremely large snapshots are really being used, such that it's worth maintaining the code for two different approaches)...

Yes, my plan is still to try it with the binary in memory (it just took me way longer to get to it than I thought..). I just decided to still try basing it on the previous work on this branch because after looking over the changes again it seems that a bunch of them will still be needed.

Although, maybe for a clean history and preventing unnecessary changes I now think I'll try to start clean. (Also because there's already more conflicts...)

@max-sixty
Copy link
Sponsor Collaborator

OK great! Hopefully it could use at least some of the existing code...

One note is that a couple of outstanding PRs change the SnapshotContents struct slightly, adding a kind field: https://github.com/mitsuhiko/insta/pull/581/files#diff-2666420e8389f1d842ce108018b05f3f68fa4b920f8c1793a3d4d01f21443544R549-R553.

We could adjust that to make SnapshotContents an Enum itself if the behavior becomes more different per kind, like you've done here, or delegate more behavior to that kind.

I'd like to merge that ASAP, though waiting on feedback from @mitsuhiko

lasernoises added a commit to lasernoises/insta that referenced this pull request Sep 18, 2024
This is the somewhat simplified version of mitsuhiko#489 where we keep the full files
in-memory during the assert and when reviewing.
@lasernoises
Copy link
Author

I ended up getting relatively far today. I currently have the new version at: https://github.com/lasernoises/insta/tree/in-memory-binary-snapshots. There's still some smaller stuff I need to take care of and then I'll probably have some questions about some of the details. (I'll also need to take a more detailed look at #581 and see how compatible that one is with my changes.)

But first: should I create a new PR and close this one or reset this branch to the other one?

@max-sixty
Copy link
Sponsor Collaborator

But first: should I create a new PR and close this one or reset this branch to the other one?

Totally as you wish!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants