Initial draft of the DataSet spec #476
Conversation
@alan-geller Thanks for the spec. I am assuming this is for the next-generation DataSet. I am missing a couple of points in the specification that are important for our workflow.
Hi Alan,
Thanks for sharing this spec with us. I have made quite a few comments while reading. There are some higher-level remarks that I would like to make in this comment as well. I think we should test the DataSet against some concrete use cases, mostly because I have found that the core assumption of a Loop that sets and gets single values is too simple and leads to a lot of problems further on. I asked some questions that relate to this in the review, but it boils down to supporting non-matching shapes in a DataSet and potential nesting of DataSets within DataSets, a concept which I think would be very powerful but not free of controversy.
I am looking forward to your thoughts.
The DataSet class should be usable on its own, without other QCoDeS components.
In particular, the DataSet class should not require the use of Loop and parameters, although it should integrate with those components seamlessly.
This will significantly improve the modularity of QCoDeS by allowing users to plug into and extend the package in many different ways.
As long as a DataSet is used for data storage, users can freely select the QCoDeS components they want to use.
Does this imply that the other components will not work if the DataSet is not being used, or is the modularity intended to work both ways?
Modularity should work both ways, when it makes sense. Specifically, modularity should respect layering: it should always be possible to use lower-level, more "base" components without higher-level components, but it may be acceptable to have higher-level components rely on base components.
As an example, DataSet should be usable stand-alone, but Loop can require DataSet if that makes sense. An important corollary is that anything else that uses DataSet, like plotting, should work just as well if the DataSet is filled in by hand as when it is filled in using Loop.
Result
A result is the collection of parameter values associated with a single measurement in an experiment.
Roughly, a result corresponds to a row in a table of experimental data.
I would always put them in columns, but maybe that's just my transposed intuition. I do think columns are more readable if the dataset itself is intended to be directly viewable.
I guess I think of a parameter as a column and a result as a row. Too much relational database history...
I also tend to think of columns as more-or-less fixed, while you can add rows forever.
specs/DataSet.rst
Outdated
Role
Parameters may play different roles in a measurement.
Specifically, they may be input to the measurement (set-points) or outputs of the measurement (measured or computed values).
This distinction is important for plotting and for replicating an experiment.
I am afraid of the problems this distinction might impose. I can think of three examples where the relevant distinction between set-points and outputs is not natural.
- The R&S vector network analyzer (VNA) returns S-parameters as a function of frequency. However, instead of specifying the frequencies, you specify the start frequency, step, and stop frequency (or some similar parameterization). The VNA then returns the frequencies and the measured S-parameters all at once. The frequencies (which are your intended set-parameter) are not known beforehand, nor do you directly set them.
- There are measurements where you want to plot certain quantities against each other, e.g., qubit frequency vs. T1. Both are measured values and not set-points. For the purpose of plotting, having to specify this distinction will only hinder the analysis.
- Finally, there are adaptive measurements in which the set-points are generated during the measurement but are also outputs of the measurement; I'm not sure how to fit these into this distinction.
All of the above are issues that can be worked around and as such are not breaking. However, I think that we should design an architecture in which such hacking is not required.
Very much agree with this. From my experience the difficulty is the difference between handling 'real' set-points (i.e. things that actually get set on the instrument), calculated set-points (i.e. exactly the frequency example above, which you don't directly set on the instrument or read from it, but rather calculate from the values of some other parameters), and actual measured values (returned by the instrument). I think some examples should be worked out before too much time is spent writing this aspect of the DataSet, to avoid (as much as possible) problems coming up later that result in unpleasant hacking.
Further to this, should one parameter be able to have both set-points AND measured values? Currently they can, but from the above it doesn't sound like the ones you imagine have that functionality.
Another example would be a 'monitor' loop, that would make measurements for an indeterminate amount of time, i.e. measure the current through the device until the fridge is below 100 mK. You don't set anything (or all things are set at the beginning of the loop) and the size of the dataset is not known initially.
I don't have any problem dropping this distinction. It doesn't affect the behavior of the DataSet at all, as far as I can tell.
The primary use overall seems to be in plotting: inputs are X axis, outputs are Y, so knowing the role of a parameter allows some defaulting. I don't know how much this is ever used, though.
So I would be in favor of dropping this notion entirely, as long as it doesn't break plotting expectations.
The simplest approach would maybe be to have no 'set-points' or 'measured' distinction and just have the user specify what to plot...
I get the impression that some users would strongly prefer to have a default plot representation of 1D and 2D datasets. However, this could probably just as well be handled by writing some metadata to the dataset naming the default x and y axes (naturally overridable for more advanced plots).
@jenshnielsen I like your suggestion as it is far less constraining and would also work in more general cases. Moreover, I understand this would be optional, is that correct?
I guess it could be optional. It would make sense for Loop, or whatever stores the data, to flag data as sweep axis 1, 2, ..., but some other mechanism might not do that.
@jenshnielsen I like your idea as well. The plotting package can specify what metadata it looks for, and then it's up to the code that creates and fills the DataSet (e.g., Loop) to make sure that the right metadata is there.
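A minimal sketch of how this metadata-driven default could look, assuming a hypothetical DataSet exposing a plain metadata dictionary; the `default_plot` key and both helper names are invented for illustration:

```python
# Hypothetical sketch: the code that fills the DataSet (e.g. Loop) records the
# intended default axes as ordinary metadata, and the plotting package merely
# looks for that key. The "default_plot" key and the DataSet.metadata dict are
# assumptions, not part of the spec.

def tag_default_axes(dataset, x, y):
    """Record a default plot representation without constraining the data."""
    dataset.metadata.setdefault("default_plot", {}).update({"x": x, "y": y})

def default_axes(dataset):
    """Plotting side: use the hint if present, otherwise the user must choose."""
    hint = dataset.metadata.get("default_plot")
    return (hint["x"], hint["y"]) if hint else None
```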
specs/DataSet.rst
Outdated
Depending on the state of the experiment, a DataSet may be "in progress" or "completed".
ExperimentContainer
An ExperimentContainer is a QCoDeS object that stores all information about an experiment.
What is an experiment?
Mostly because we can either argue that an experiment runs for several weeks/months and has many datasets, or that it is a single run of some experiment.
I'm mostly asking this because how DataSets are grouped into experiments, and how we can combine and split multiple of these ExperimentContainers, has far-reaching consequences for the usability of the DataSet.
I can quickly answer on this point, as I have been very involved in the experiment container.
The experiment is a collection of datasets; it is up to the user to "flip the switch" to a new experiment, or load an old one and keep on going.
The dataset itself has no notion of being part of an experiment container.
Does this clarify?
Hi @giulioungaretti, can you please clarify the following about the experiment container?
- Can an individual dataset be part of multiple experiments?
- Can individual datasets be split out from multiple experiment containers, e.g. for comparison purposes?
- Can multiple experiments run simultaneously?
- Are the datasets available without an ExperimentContainer?
- What exactly does the experiment store? Links to the datasets, or actual datasets? Default parameters? Instrument drivers? Executable scripts?
To me, these features are all absolutely crucial for running long-term, complex experiments. We have to be able to easily compare data from all points in history, as well as between different Station Q experimental sites.
@majacassidy thanks for the feedback.
1. I would say yes, although it's more of an implementation issue than a conceptual one (i.e. how to make it easy to do this).
2. Datasets live on their own, so yes. The container just acts like a file system on steroids (allowing one to perform searches and so on). I guess we would also want to add this feature: select x from container z and y from container w, and compare them.
3. Yes, although one is always limited by hardware that can't operate simultaneously.
4. Yes.
5. The first implementation will store a pointer to the dataset. But it may in the long run be way more convenient to store the actual data (although this opens up a lot of accessibility problems).
The container would store:
- references to all the datasets generated, each with a unique hash and a timestamp
- metadata of all the datasets, linked to the unique hash of the dataset they describe
- metadata of the experiment (something that a user would want to stick inside, literally anything)
- the Git hash of the QCoDeS version one is using (easy to trace bugs, make references), and a diff of what has changed locally (custom drivers and so on)
- scripts (but to do that we'd have to agree on a standard way to include them)
Re: parameters, those are saved in the metadata of every dataset.
Great - thanks for the clarification @giulioungaretti !
specs/DataSet.rst
Outdated
---------
#. A DataSet can store data of (reasonably) arbitrary types.
#. A completed DataSet should be immutable; neither its metadata nor its results may be modified.
We like to add metadata, such as fitting results and figures, to a DataSet after it has been marked as completed. You also list similar purposes below. I think it makes sense to lock the main DataSet after the experiment is complete. However, I think that it should remain possible to add (and also modify) post-experiment metadata such as fitting results.
Maybe a distinction is in order here, but I sense conflicting requirements.
Also, just being able to add comments to a DataSet afterwards, like 'oops, power cut midway through', seems like a good use case for user-generated metadata after a data set is complete, and would add a lot of functionality. Whether or not analysis should be something other than 'metadata' I am a bit conflicted about, but it's worth a conversation.
I have no problem with allowing metadata to be modified post-completion.
In general, I've tried to start with the most restrictive possible requirements, because it's better to add functionality than to remove it.
specs/DataSet.rst
Outdated
This is a static method in the DataSet class.
It returns a new DataSet object.
DataSet.read_updates()
I am a bit worried about the implications of having a read_updates() command.
Such a feature is only useful if the data in the underlying location is expected to change, and as such it suggests the possibility of having multiple copies of the same dataset open. This may be exactly what we want (I can think of numerous applications where it is useful to have the DataSet open in another process), but we should then also ensure we properly manage any potential for conflicting changes to the data.
Agreed, having multiple in-memory copies of the same persistent object can be problematic.
The current design tries to address this by making DataSet append-only, so you can't overwrite or modify data written by some other process. You could step on metadata, though, and you could also add multiple copies of the same result.
We could require just a single copy of the DataSet, but for applications like plotting that might cause performance problems.
We could also tag one copy as the master copy and only allow updates at the master. In many ways that's the simplest way to solve the problem.
Thoughts?
specs/DataSet.rst
Outdated
At least for now, it seems useful to maintain the current behavior of the DataSet flushing to disk periodically.
#. Should there be a DataSet method similar to add_result that automatically adds a new result by calling the get() method on all parameters that are defined by QCoDeS Parameters?
No, there should not be such a method.
That is exactly what the Loop (or analogous function) is intended to do. I think the DataSet should be as simple and modular as possible.
I agree. In the best of all possible worlds, DataSet should be compilable and usable without any other QCoDeS files at all.
Right now that's not quite true; the ability to construct a DataSet from a QCoDeS Parameter breaks that layering. We could fix it by having Parameter inherit from ParamSpec, which might be worth the effort.
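One hedged sketch of what that layering fix could look like: ParamSpec as the QCoDeS-free base description, with a stand-in Parameter inheriting from it (the real QCoDeS Parameter is more involved, and `role` is omitted here since the thread leans toward dropping it):

```python
class ParamSpec:
    """Base-level parameter description; usable without any other QCoDeS code."""
    def __init__(self, name, type, desc=None, optional=False):
        self.name = name
        self.type = type          # e.g. a NumPy dtype or a type-name string
        self.desc = desc
        self.optional = optional

class Parameter(ParamSpec):
    """Stand-in for the QCoDeS Parameter: instrument behaviour layered on top."""
    def __init__(self, name, type, get_cmd, **kwargs):
        super().__init__(name, type, **kwargs)
        self._get_cmd = get_cmd   # hardware-facing logic lives only here

    def get(self):
        return self._get_cmd()
```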
specs/DataSet.rst
Outdated
It should be possible to read "new data" in a DataSet; that is, to read everything after a cursor.
#. It should be possible to subscribe to change notifications from a DataSet.
It is acceptable if such subscriptions must be in-process until QCoDeS multiprocessing is redone.
Change notifications should include the results that were added to the DataSet that triggered the notification.
I think this at least should be optional.
I can think of cases where the size of the data would just blow up the required memory if everything is sent along for every change.
Think of experiments where raw traces are stored and multiple of these come in at the same time, or just think of the Alazar card, where speed and performance are critical. Having the computer slow down because of some change notification is not desirable.
That makes sense. I think I actually changed this in the API, but forgot to modify the requirement; the notification indicates which results were added, but doesn't include the actual new results.
specs/DataSet.rst
Outdated
Basics
---------
#. A DataSet can store data of (reasonably) arbitrary types.
I have not read anything about the shapes of these data in the document. However, I think it is important to specify some use cases that do not involve the typical column of values coming in, as this has consequences for a lot of the methods defined (for instance, what is the length of a dataset when there are arrays of different shapes in it?).
To give a few examples:
- Simple loops: values come in one by one for one or more parameters. (Here basic columns would suffice.)
- Array-based measurements: values come in in chunks of n values that may or may not match the corresponding set-points. (Already non-matching shapes.)
- Array-based measurements with metadata: think of some hardware that gives you back raw traces but also the result of integrating them with some weight function. Depending on the experiment, the saving of these raw traces may be turned on or off, but they certainly belong in the same dataset.
I realize these descriptions may be a bit vague, as I try to describe them in terms of their consequences for the DataSet, but they relate to experiments we do on a daily basis. Let me know if you have any questions.
I will finish up the final Alazar9360 tweaks, but this is very relevant here, and I think some thinking about how drivers should shape and label data (which relates to my comment above) is important. Ideally the 'shape' of the QCoDeS parameter being measured/set/saved should be mirrored as intuitively as possible in the dataset shapes.
My thinking is that anything you can represent in a NumPy data type object should be allowable as the value of a single parameter in a single result. I think this covers simple scalars, tuples of scalars, arrays of scalars, tuples of arrays, arrays of tuples, etc.
For the array-based measurement with metadata, I would probably model that as a tuple containing an array and one or more scalars. I think storing more flexible (JSON-ish) metadata with each result is likely to cause problems accessing the data in a simple and efficient way.
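To make "anything you can represent in a NumPy data type object" concrete, here is a small self-contained NumPy example of a single parameter value that bundles a raw trace with scalars computed from it; the trace length and field names are invented:

```python
import numpy as np

# One result value = a raw trace plus the scalars derived from it.
# The 1024-sample trace and the field names are purely illustrative.
value_dtype = np.dtype([
    ("raw_trace", np.float64, (1024,)),  # array field: the full trace
    ("integrated", np.float64),          # scalar field: weighted integral
    ("rotation", np.float64),            # scalar field: e.g. IQ rotation angle
])

results = np.zeros(10, dtype=value_dtype)          # room for 10 such results
results[0]["raw_trace"] = np.random.randn(1024)
results[0]["integrated"] = results[0]["raw_trace"].mean()
```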
I'm still not really clear on how the metadata fits into it, but perhaps that is best demonstrated with an example when we progress to that point. Thanks.
@nataliejpg @AdriaanRol One thing I'm not clear on: do we need result-level metadata as well as DataSet-level metadata? We could certainly have metadata on each result, but that feels like it might be overkill -- and it might be really hard to use effectively.
specs/DataSet.rst
Outdated
#. A DataSet object should allow writing to and reading from storage in a variety of formats.
#. Users should be able to define new persistence formats.
#. Users should be able to specify where a DataSet is written.
I would like to suggest some helper functions that should be possible (though I realize that these may not be for the DataSet itself):
- It should be possible to load (parameter) settings from a DataSet onto the currently active instruments.
- It should be possible to easily compare parameter settings between different datasets, and between datasets and the active environment.
Both make sense. I agree that they should be separate helper functions, though, in order to keep DataSet from being dependent on Instrument.
specs/DataSet.rst
Outdated
Creates a parameter specification from a QCoDeS Parameter.
If optional is provided and is true, then the parameter is optional in each result.
ParamSpec(name, role, type, desc=, optional=)
An example of this would definitely be really great in terms of helping to clarify these ParamSpecs that don't correspond to actual Parameters. I'm trying to envisage a use case and wondering whether it's meant to replace calculated set-points, or whether it's for data calculated from the measured values when you want to store both raw data and some calculated data. Is either of those what is envisaged?
Yes, calculated values certainly fit this scenario.
It also allows you to create a DataSet without using QCoDeS Instruments and Parameters at all, and then use QCoDeS plotting and persistence for your data. For instance, when Damaz was using QCoDeS to drive simulations using LIQUi|>, this might have been an easier way (although it might have made it harder, too).
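As a hedged illustration of that stand-alone scenario, here is a sketch using the ParamSpec constructor quoted above; `DataSet.create` and `mark_complete` are invented names standing in for the spec's static constructor and completion call, and `simulate` is a placeholder for whatever produces the data:

```python
# Hypothetical stand-alone use: no Instruments, no Loop -- just ParamSpecs
# describing simulation outputs, filled in by hand. `DataSet.create` and
# `mark_complete` are invented stand-ins; `add_result` appears in the draft.
specs = [
    ParamSpec("detuning", "float", desc="simulated drive detuning (Hz)"),
    ParamSpec("p_excited", "float", desc="computed excited-state population"),
]
ds = DataSet.create("rabi_simulation", specs)

for detuning in (0.0, 1e6, 2e6):                       # values from the simulator
    ds.add_result({"detuning": detuning,
                   "p_excited": simulate(detuning)})   # simulate() is a stand-in
ds.mark_complete()   # freeze the results; metadata stays editable per the thread
```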
I would like to start a discussion on the ExperimentContainer and the nesting of DataSets. Besides the previous discussions, the following comments relate to this idea.
I think there is a general consensus that it should be possible to group DataSets. @giulioungaretti proposes an ExperimentContainer, which is a "file system on steroids". The current discussions suggest that we want to be able to group some closely related experiments. I would propose that instead of creating three or more different layers for experiments (Container, to-be-discussed, and DataSet), we instead make it possible to nest DataSets within DataSets. This could be as simple as allowing the GUID of a DataSet (as a string identifier) to be a valid entry in another DataSet. The viewer can then take care of opening the desired other dataset when required. The different DataSets can still be different files, allowing easy transfer of only part of the data and grouping them into one or more ExperimentContainers. Another advantage of solving this problem by nesting, as opposed to adding another layer, is that one only has to learn one abstraction, and that the superpowers @giulioungaretti is talking about will be applicable to searching the original dataset directly. Looking forward to everyone's thoughts.
@AdriaanRol I like the idea of the DataSet being as simple as possible, so that if you have a DataSet you know how deep it can be and what sort of objects it can contain (i.e. nothing more complicated than some labels and NumPy arrays, sensibly organised, plus an id/tag used to identify it), and you don't need information about its depth before you can interact with it. I appreciate the desire for not having too many layers, though, so I would be in support of having only one container object (which I wouldn't have a big problem with being nestable, although I think that in general the SQL relational-table version is cleaner). I do think that there should be a simple, well-defined DataSet at the lowest level, though. I've included the relevant quote from @giulioungaretti.
Also, I'm not sure how metadata fits into this. Perhaps a JSON-like object with some pretty-print and search functions, with a 1:1 relationship between metadata and dataset objects which share an id (or each store their own id and the corresponding object's id). A container can then have the ids of the datasets and its own metadata, analysis plots, etc., and you could have a 'calibration' container inside an 'experiment' container (if we went with nestable containers). This would also be nice because then you could have multiple containers knowing about/using the same DataSet just by having its id in the container, rather than a copy of the DataSet.
@AdriaanRol great discussion. Some thoughts
I've loaded a new version of the spec. Mostly I've removed things that don't seem to be required, and added a few new things. I really appreciate the feedback!!
I like the spec. Overall it reminds me quite strongly of a pandas DataFrame. Should there be metadata associated with single measurements? Would a typical measurement result contain most parameters from a dataset, or only a handful?
@AdriaanRol (and everyone else) On DataSets containing DataSets: to me, this doesn't really fit my conceptual model of what a data set is, but that is probably because my mental model of an experiment is not accurate. My (simplistic) mental model is that you do some basic set-up, then you run your experiment by setting a few parameters to different values and measuring a few variables at the different settings. From your example, it sounds like this is incorrect, and that a more accurate description of an experiment would allow for different stages to occur, where each stage is a sweep more or less as I've described (potentially adaptive, if it's a calibration sweep, but that's not conceptually a problem), but where you may perform a particular stage (e.g., calibration) more than once, and indeed you might want to interrupt one sweep (the main measurement sweep) and interject another sweep. I fear that trying to glue together DataSets within DataSets to handle all of these possibilities will turn into a nightmare of complexity. I'd be much more inclined to use references (by GUID or filename or whatever) from one DataSet to another.
One somewhat related question: it sounds like the values in each result may be different for different sweep points; e.g., you might only do certain computations every N points, but you want to store the computed values in the DataSet. This was my thinking behind optional parameters: values that were in some results but not all. In the latest version I've dropped "optional" and effectively all parameters are optional. Does this give you enough flexibility?
Removed API reference to QCoDeS Parameters, allowed addition of parameters with value arrays, dropped "optional", added min_count and min_wait to subscriptions, and some RST cleanup.
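A sketch of how a subscriber might use the new min_count/min_wait options, assuming (per the discussion above) that the notification says which results were added rather than carrying the data; the callback signature and the `get_data` arguments are guesses, not the spec's final API:

```python
# Hypothetical subscriber: batched notifications keep fast hardware (e.g. an
# Alazar card) from drowning the consumer, and the data is pulled on demand.
def on_new_results(dataset, first_index, count):       # signature is a guess
    rows = dataset.get_data(start=first_index, count=count)
    update_live_plot(rows)                             # assumed plotting helper

ds.subscribe(on_new_results,
             min_count=100,   # don't fire for fewer than 100 new results...
             min_wait=0.5)    # ...nor more often than every 0.5 seconds
```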
@alan-geller I like references as a way to navigate the nesting-type functionality; it would be nice to come up with a way to visualise these so you could look at your experiment/data structure etc., but that's not a first-iteration necessity by any means. @akhmerov good point, time would definitely be something good to have for all data points, maybe even a config option for whether in an experiment you want all results or even data points to be timestamped (and have this be an array in the dataset, i.e. have time as a ParamSpec).
Metadata is the still-unresolved issue in my mind. Do other people have a clear picture of how it is/should be structured (especially at what level we want to have it: data point, result, or dataset)?
For me, allowing the use of a GUID of a DataSet as a valid entry for a value in a DataSet would solve the problem and avoid the fears you are listing. It would allow nesting to arbitrary levels and thus allow constructing arbitrarily complex nested DataSets (which of course should be avoided whenever possible). It would at the same time not require us to rethink anything about the DataSet itself (as a GUID string is a valid NumPy data type). A requirement that follows from this is to have some way to browse/view these nested GUID references, as @nataliejpg notes. I think a suitable way to manage these things would be some DataSet container that can include some DataSets. I think it is important to explicitly allow not including all linked DataSets in this container, to ensure portability of files. Some more small points:
I would not let anything that is not data inside a dataset. It's too complex and dangerous. I am not sure I have ever seen a NumPy array with a pointer to another NumPy array; the same goes for pandas. What you may want is a way to relate different data_sets.
Don't worry about implementation; this is about specs.
Let's see; this is more of an implementation question.
Yes, part of the container, not the dataset.
I have not finished reading the spec yet, but I think it would be good if I add my two cents about nesting datasets. Exactly the same holds for datasets: as @AdriaanRol said, just having a GUID as an entry in the dataset would suffice to do this. All datasets contained in the experiment could then be stored in a single container.
A GUID is a proper datum IMHO; it is not as iffy as a pointer.
Wouldn't it be good if there are entries in the dataset to tell what is in there in the first place? I am really quite convinced that nesting datasets is the right way to go, so I think we should spend some time figuring out what problems this might cause and thinking towards a solution for them.
Added a name parameter to the constructor and an id attribute that returns a unique identifier for the DataSet, suitable for use as a reference.
Specified that the identifier should be automatically stored in the DataSet's metadata.
I've added a unique identifier as a fundamental attribute of the DataSet. I think this addresses many of the nesting scenarios.
@damazter I think complex result data can be handled without nesting. My assumption is that a result is a collection of values, and each value can hold arbitrary (NumPy-compatible) data -- not just scalars, but anything you can define a dtype object for, so records holding arrays of records of arrays...
@akhmerov @nataliejpg @AdriaanRol @giulioungaretti On time-stamping results, I think that functionality belongs to the layer that is filling in the DataSet, rather than to the DataSet itself. It would be easy to build a TimedDataSet extension that adds a "time" parameter to the passed-in list of parameters at construction time, and adds the current date/time to the values in the result dictionary in add_result. I would prefer to keep that functionality out of the core class because you might not always want it.
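Since the TimedDataSet idea is concrete, here is a hedged sketch of it, assuming the draft's constructor (a name plus a list of ParamSpecs) and dict-based add_result:

```python
import time

class TimedDataSet(DataSet):
    """Sketch of the extension described above: timestamps are added by a
    layer on top of the core DataSet, which itself stays time-agnostic."""

    def __init__(self, name, specs, **kwargs):
        # Append a "time" parameter to whatever the caller asked for.
        specs = list(specs) + [ParamSpec("time", "float",
                                         desc="seconds since the epoch")]
        super().__init__(name, specs, **kwargs)

    def add_result(self, values):
        # Stamp every result with the current time as it is added.
        return super().add_result(dict(values, time=time.time()))
```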
@alan-geller This would mean that any measurement code that spits out a dataset could then be used as a calibration measurement in a more convoluted experiment.
@damazter For the calibration scenario, is it sufficient to have the more convoluted experiment's dataset contain a reference (by unique ID) to the calibration dataset? You could put this in metadata; alternatively, if the calibration is performed multiple times during the experiment, you might want to add a result value that holds a reference to the calibration dataset, perhaps together with the resulting calibration values, and every time you do a calibration insert a result with only those values filled in (and leave them empty the rest of the time).
It occurs to me that a really useful helper function (not part of this spec) would take a dataset identifier and return the dataset, possibly looking up the dataset's location in a database or in a formatted text file or using a web service or...
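That helper might be as small as the following sketch; the resolver chain (local cache, database, web service) and all names here are illustrative, not part of the spec:

```python
# Illustrative only: given a DataSet identifier, try a chain of resolvers
# (local cache, database lookup, web service, ...) until one finds it.
def load_dataset(guid, resolvers):
    for resolve in resolvers:
        ds = resolve(guid)      # each resolver returns a DataSet or None
        if ds is not None:
            return ds
    raise KeyError(f"no resolver could locate DataSet {guid!r}")
```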
It occurs to me that we'll need to figure out how to plot (and analyze) optional parameters that don't always have values. There may need to be a helper function that takes a NumPy array and strips out all of the "empty" entries. If for some reason that's not feasible, we might have to store in the DataSet itself which parameters were given values in each result, and have a somewhat more complicated version of get_data that allows you to skip empty results or partially empty results, or somehow mark empties. Does anyone have any insight here?
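One possible convention, sketched below: store empties as NaN so that stripping them is a one-line mask (this is an assumption, not something the spec prescribes):

```python
import numpy as np

def strip_empty(x, y):
    """Return only the (x, y) pairs where y actually has a value,
    assuming empty entries are stored as NaN."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mask = ~np.isnan(y)
    return x[mask], y[mask]

# e.g. plt.plot(*strip_empty(setpoints, occasional_values))
```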
I have finally caught up with you all, and finished reading the document.
Finally, about nesting: if measurement containers can hold an arbitrary number of datasets, then I do not see how nesting is a bad thing (or how it can even be prevented).
One could adopt the simple convention that a dataset stops when it contains a GUID of another dataset, such that the top dataset does not actually contain any measurement data, but only a collection of references. How we solve that for plotting is a different issue (but I think this is not so important, because for many of the cases I have in mind, plotting data from different datasets at the same time via this nesting structure is not directly needed).
I am not a big fan of this, because it is less general than your suggestion below, but a user would always be free to add this reference in the metadata I guess.
I don't see how this is different from what I had in mind (am I missing something, @AdriaanRol?). It would be good if all these datasets were part of the same measurement_container, for mess prevention.
@damazter On the very last item: yes, I think having a result value that holds a DataSet reference by GUID is exactly the same as what you're suggesting.
A general question for everyone: Would it be sufficient for DataSet to simply be a class that holds a metadata dictionary and a pandas dataframe? If sufficient, would it be usable?
@alan-geller My suggestion to consider pandas dataframes or other structures was a suggestion to use existing, well-known Python packages and structures as much as possible. For example, right now the ...
@peendebak I agree with the suggestion -- the more we can leverage from existing packages, the less work we have to do. My mental model of the new DataSet class would use a NumPy array for each parameter. I don't anticipate bringing the DataArray class forward; I'm not sure what value it has, beyond letting you get at the underlying NumPy array. Looking at pandas, though, it does look like adding a helper function that creates a dataframe from a (completed) DataSet might be useful.
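That helper could be a few lines, assuming (hypothetically) that a completed DataSet exposes its ParamSpecs and one NumPy array per parameter via get_data; only same-length scalar columns are handled in this sketch:

```python
import pandas as pd

def to_dataframe(dataset):
    """Build a pandas DataFrame from a completed DataSet.
    Assumes dataset.parameters yields ParamSpecs and dataset.get_data(name)
    returns one NumPy array per parameter -- both hypothetical here."""
    return pd.DataFrame({spec.name: dataset.get_data(spec.name)
                         for spec in dataset.parameters})
```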
Removed persistence, added more metadata details, added utility function section.
I uploaded a new draft:
I'm going to plan to send out a "public" request for feedback on Friday afternoon (Seattle time), and try to close this spec and start implementation by the end of February.
Many items in this spec have metadata associated with them.
In all cases, we expect metadata to be represented as a dictionary with string keys.
While the values are arbitrary and up to the user, in many cases we expect metadata to be nested, string-keyed dictionaries with scalars (strings or numbers) as the final values.
I would argue that the final values need to support anything that a parameter can return.
The most direct example is some parameters that are arrays (like a vector of integration weights). We want to store this in the metadata (and are currently able to do so). I think it is important to note this here.
@alan-geller, I did not notice any reference to the ideas of nesting and the use of GUIDs. I think they solve a very basic problem in experiments (as also pointed out by @damazter). A simple workaround for me would be to store the GUID string as an entry in a dataset (which is currently allowed). It would then only amount to having the right helper functions to show the nested relation. Is there any consensus or verdict on this topic?
@AdriaanRol re: nesting, the consensus is no nesting for now.
@giulioungaretti |
@AdriaanRol @damazter @giulioungaretti The GUID idea is there -- it's the DataSet identifier (see the last requirements in Basics and Creation, and the DataSet.id attribute).
Hey @alan-geller, I've just read over the dataset specs and it feels like it would definitely be very useful in our measurements.
Add DataSet spec (#476)
Author: Alan Geller <[email protected]>
This is an initial draft of the specification for a new DataSet class. Please post comments, suggestions, and other feedback!