metadata schema equality and hence table equality is not well-defined #763
Comments
Here is a more puzzling example. I have two NodeTables,
Note that if I'd done I'm not providing the code to make these because it's produced by development SLiM. Doing e.g. |
Hm, another difficulty: comparing dicts (i.e., the |
If the |
Yep - that did inject some confusion, but it's a minor bit at this point. Thanks!
To show what I'd like to avoid: here's what I'm currently doing to compare table collections: first, I made sure all my schema had |
This is tricky! For comparing things in C I don't think we can do any better than just byte-by-byte comparisons, but we can surely do better than this in Python. We can change the definition of So, there'll be some discrepancies between the definition of equality in C and Python, but I think we can get over that? I think we should probably catch this for 0.3.0, since it's a key interoperability issue. @benjeffery, any thoughts?
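One way to picture "doing better in Python" is to compare the parsed JSON rather than the raw bytes. The following is only a minimal sketch using the standard `json` module; `schemas_equivalent` is a hypothetical helper name, not part of tskit, and the thread does not say this is how equality would actually be redefined.

```python
import json


def schemas_equivalent(a: bytes, b: bytes) -> bool:
    """Hypothetical helper: compare two JSON schemas by content, not by bytes."""
    # Identical bytes are trivially equivalent (this also covers two empty schemas).
    if a == b:
        return True
    try:
        # Parsed dicts compare equal regardless of key order or whitespace.
        return json.loads(a) == json.loads(b)
    except ValueError:
        # At least one side is not valid JSON, so fall back to "not equal".
        return False
```

A comparison like this would ignore key ordering and whitespace, which is exactly where it would diverge from a byte-by-byte comparison in C.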
Hmm, thinking about this there is a way to canonicalise the strings generated by the python using |
I'm making a PR with the canonicalised strings, but keeping |
Sounds good @benjeffery, and a simple canonical form would help a lot.
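To make the idea of a canonical form concrete, here is a minimal sketch with the standard `json` module. It assumes "canonical" means sorted keys and no insignificant whitespace; the thread does not spell out the exact form, so treat that as an assumption, and `canonical_json` as a hypothetical name used for illustration.

```python
import json


def canonical_json(obj) -> str:
    """Hypothetical helper: serialise obj to one deterministic JSON string."""
    # Assumption: canonical = sorted keys + compact separators (no extra whitespace).
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


# Two writers that disagree on ordering and whitespace...
from_writer_a = '{"type": "object", "codec": "json"}'
from_writer_b = '{\n  "codec": "json",\n  "type": "object"\n}'

# ...produce the same string once both are parsed and re-serialised.
same = canonical_json(json.loads(from_writer_a)) == canonical_json(json.loads(from_writer_b))
print(same)  # True
```

With a rule like this, byte-level equality and content-level equality coincide, provided every writer applies the same rule.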
Take a look at #764 and see if it is enough to satisfy?
Hello! Thanks for the quick reply. #764 would make some of this make more sense, but doesn't fix the bigger problem, which is table equality. So, the proposal is to
If both the schema and metadata are in canonical representation, doesn't that fix the table equality issue? I'd really like to avoid special-casing equality.
Yes, it does! But, I'm not sure if I have a good way to get the metadata schema into the canonical representation, since I'm writing it in SLiM's code (using this json library). I guess the question is whether that library can match Note that I could (and probably should) just use json::dumps to dump a string representation of the table metadata schemas and include those in SLiM's code, rather than using a json parser - except that won't work for the top-level metadata and metadata schema, if we want to be able to edit those. Maybe we should abandon the goal of being able to edit those for the moment?
@petrelharp - are you gaining much from writing out JSON using an external library in C++? If you write the JSON by hand you can control the ordering. Writing JSON is pretty simple.
Well, it's not the writing out that's helpful, it's the parsing, which we only need for the top-level metadata. I think that the plan is for different programs to co-exist in the top-level metadata, so that if SLiM reads in a tree sequence, it should keep whatever is in the metadata already and only mess with the EDIT: Seems @benjeffery and I were typing at the same time. =)
@jeromekelleher The issue here is adding extra keys to an existing tree sequence, hence the need to parse, add keys, then dump. From this issue it seems like the library @petrelharp is using might be fine as it alphabetises by default! nlohmann/json#2179 EDIT: Seems @petrelharp and I were typing at the same time.
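The parse, add keys, then dump cycle can be sketched in Python as follows. This is illustrative only: `add_top_level_key` is a hypothetical helper, the metadata is assumed to be JSON-encoded bytes, and the sorted, compact output is an assumed canonical form rather than anything the thread settles on.

```python
import json


def add_top_level_key(metadata_bytes: bytes, key: str, value) -> bytes:
    """Hypothetical helper: add one key to JSON metadata, keeping existing keys."""
    # Parse whatever another program already wrote (empty bytes means no metadata yet).
    existing = json.loads(metadata_bytes) if metadata_bytes else {}
    merged = dict(existing)
    merged[key] = value
    # Assumption: dump with sorted keys and compact separators so that the
    # output bytes do not depend on the order in which keys were added.
    return json.dumps(merged, sort_keys=True, separators=(",", ":")).encode()


out = add_top_level_key(b'{"other": {"x": 1}}', "SLiM", {"generation": 10})
print(out)  # b'{"SLiM":{"generation":10},"other":{"x":1}}'
```

Because the keys are sorted on output, two programs that add their keys in different orders would still end up with identical bytes under this scheme.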
Let's not worry about this for now. We're going to need to coordinate on some shared vocabulary here if we want the metadata to be interoperable, so let's just use it as we like for the moment and figure out what's useful later.
We might be - but it depends on whitespace handling also, so it seems fragile...
Ok - I'm abandoning editing of existing top-level metadata in SLiM's code for now, or the json parsing of any metadata schema at all. It looks like #764 plus carefully pasting in the text output from However, I'm still using the json library to write out the metadata itself, and running into the same issue: table collections don't match because equality requires equality of the underlying string representations of top-level metadata, written by different programs. It doesn't seem possible to get SLiM's json library to output something matching
Notice that Er, talking through this: the issue is that when we recapitate, we do like
The second line is doing the following:
To make this thing work, I just need to actually copy the underlying bytes, which I could do by
Maybe in the future we want a way in python to get (and, set?) the "metadata bytes" directly? Same for the metadata schema? I would like to be able to set metadata in python and have it match metadata set by SLiM. To do this, we need to at least add Edit: it looks like setting the |
Ok - I've got everything to work (using #764), and hopefully we remember this discussion when we next need to think about top-level metadata. =) Thanks!
I'm going to reopen this, as I'm not convinced adding indentation is the right thing to do: #764 (review)
Maybe "not well-defined" isn't the right phrase, but it is currently pretty hard to get two tables or table collections produced by different sources to be equal to each other, because their metadata schemas must match at the byte level, and so equality relies on whitespace and the ordering of json objects matching. Furthermore, in python it seems impossible to diagnose the problem, since (a) equality of MetadataSchema objects is identity-only equality, and (b) their dict or string representations may match even if the underlying bytes do not.
So, for instance:
See below for other examples.
This is being pretty bothersome for testing things in pyslim: the schema as set by SLiM differs from that set by pyslim in ordering and whitespace, even though the source code for the two is as identical as possible. Testing for equality of tables and table collections is important in other places, too.
To do this properly we'd have to parse the json in C, right? Which we don't really want to do - any other ideas? Even if we didn't check metadata schema when testing equality, we'll run into similar problems with top-level metadata, if we are proposing that applications edit the top-level metadata schema to add new keys - it could easily be that two operations commute except for the adding to the metadata schema.
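The underlying mismatch can be reproduced without tskit at all. The snippet below is a minimal sketch using only the standard `json` module; the two schema strings are made up for illustration and are not the schemas written by SLiM or pyslim.

```python
import json

# Two hypothetical schema strings describing the same JSON object, as if
# serialised by different programs (different key order and whitespace).
schema_a = '{"codec": "json", "type": "object"}'
schema_b = '{\n  "type": "object",\n  "codec": "json"\n}'

# A byte-level comparison sees two different schemas...
bytes_equal = schema_a.encode() == schema_b.encode()
# ...while the parsed content is identical.
parsed_equal = json.loads(schema_a) == json.loads(schema_b)

print(bytes_equal)   # False
print(parsed_equal)  # True
```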