Define outline for retire a dataset guidelines #10

Open
paolap opened this issue Sep 16, 2021 · 10 comments

paolap commented Sep 16, 2021

Outline the steps for these guidelines.

@hot007 hot007 assigned hot007 and unassigned hot007 Oct 20, 2021
@paolap paolap self-assigned this Oct 20, 2021

hot007 commented Jan 27, 2022

Could @paolap , @chloemackallah and @organisergirl please review https://acdguide.github.io/Governance/retire/retire-intro.html with their particular experience hats on?
Are there any cases completely not covered?
Obviously I mostly talk about DOI'd data but I've tried to give a suggested guideline for unpublished data withdrawal too.
Thanks to Katie for all the references other than DataCite!

Also, I don't think I like how the new layout now does title - contents - title again; is there something I can do to avoid that? My 'index' list is redundant now, and I think there are generally problems with this layout.

Anyway hopefully this is a useful example from which to populate the other create/publish/update pages.


paolap commented Jan 27, 2022

I had a look but will need to check more in depth for specific comments. I think it looks good, only we'll need to split it into separate pages as it's already covering a lot.
My main suggestion would be to first break it into pages roughly where you have headers.
Then beef up the practical-steps pages with more info/examples.
I think what is missing now is a page dealing with replicas, as most of what you wrote applies to data, either published or unpublished, that is produced by whoever manages the data.
I think in these cases we often tend to keep the data around a lot longer than it's useful because of a lack of process. So rather than having retirement triggered only by an error, a new version, or in extreme cases a shortage of storage, it would help to manage storage more efficiently to have a process to assess data usefulness regularly.
It would be good to add a list of possible strategies to do so, as possibly there isn't a single one that would work.
The aim would be assessing usage:

  • access time, as you mentioned, though this might not work on all systems
  • anything else equivalent to access time?
  • surveys?
  • number of users and significance of use, i.e. is it used to produce more datasets?
  • clearly release of new versions, taking into consideration that this doesn't necessarily stop use of the older version immediately
  • etc.

Another checklist might suggest strategies to assess storage availability and costs, including ways of projecting future storage needs. Most people are completely unaware of these.
It would be good to have even just these as checklists.
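As a rough sketch of the access-time idea (a hypothetical script, not part of the guidelines): something like this could report files untouched for a given period. The caveat above applies: on filesystems mounted with noatime or relatime (common on HPC systems), atime is not updated reliably, so this under-reports real usage.

```python
import os
import time

def stale_files(root, days=365):
    """Yield (path, days_since_access) for files whose access time is
    older than `days`. Note: on noatime/relatime mounts atime is stale,
    so treat the output as a first-pass hint, not ground truth."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # broken symlink, permission error, etc.
            if atime < cutoff:
                yield path, (time.time() - atime) / 86400
```

This would only be one input alongside surveys and user counts, since a file can be mission-critical yet rarely read.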

Finally, great work! You did so much more than writing an outline; there's so much content already!


hot007 commented Feb 1, 2022

Sharon T at CSIRO today raised the related point of managing working-space volumes, e.g. if people are running a model and creating working model-output data, how do we create a retirement policy around that data? Store it for x time and then require the user to deal with it if they still want it after that time? Automatically purge after some time? Do we have reproducibility requirements to store it for a particular period? If so, would it be sufficient to store only the model config, and not the output, longer term? I have shared this page with her, so hopefully she may be able to contribute some thoughts :)
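The "store for x time, then purge" option could be sketched like this (all names and the 90-day window are placeholders, not anyone's actual policy): list output files past a retention window while keeping config files, so a run stays reproducible after its output is retired.

```python
import os
import time

RETENTION_DAYS = 90                 # placeholder retention window
KEEP_SUFFIXES = (".yaml", ".json")  # keep configs for reproducibility

def purge_candidates(workdir, now=None):
    """Return output files older than the retention window, skipping
    config files. Intended as a dry-run report: users should be
    notified before anything is actually deleted."""
    now = now or time.time()
    cutoff = now - RETENTION_DAYS * 86400
    candidates = []
    for dirpath, _dirs, files in os.walk(workdir):
        for name in files:
            if name.endswith(KEEP_SUFFIXES):
                continue
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime < cutoff:
                candidates.append(path)
    return candidates
```

In practice the report stage would run on a schedule and the actual deletion would be a separate, human-approved step.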


paolap commented Feb 1, 2022

At CLEX we have probably touched on this several times in different places. I'm not sure we have anything that goes past "try to consider this"; the main example is probably our policy, where we tried to depict a few potential scenarios.
Anyway, good point: another aspect of data retirement, which is in fact usually a gradual process.

@organisergirl
Contributor

This data versioning best practices document from the Research Data Alliance is probably worth referring to as well: https://www.rd-alliance.org/group/data-versioning-wg/outcomes/principles-and-best-practices-data-versioning-all-data-sets-big. I had a brainwave today that I should probably have a hunt through the RDA outputs for anything of relevance.

@organisergirl
Contributor

Something else that I've just been reading as well: the FAIR Data Maturity Model specification and guidelines has the recommendation RDA-A2-01M, "Metadata is guaranteed to remain available after data is no longer available" (this is on page 20 of the document). This indicator is linked to the FAIR principle "Metadata should be accessible even when the data is no longer available" (https://www.go-fair.org/fair-principles/a2-metadata-accessible-even-data-no-longer-available/).

Maybe we need to make sure that this is included.


hot007 commented Feb 1, 2022

Good finds, thanks @organisergirl! Please feel free to edit the page; otherwise I'll just use this issue to accumulate links for next time I get round to making edits :)

@sharon-tickell

Thanks to @hot007 for the link to this thread! I'm approaching this from the use case of managing the eReefs RECOM modelling system, which allows users to define model configurations, run the models and store the results, all within a CSIRO-hosted toolsuite. Users can choose to download and archive both their configurations and their run results if they wish, but it's not enforced... and the main eReefs visualisation portal only knows about the ones that stay in the system.

We're anticipating a fairly dramatic increase in the size of our userbase in the next few years to support RRAP and other GBR modelling efforts, and so need a policy for retiring several bits of this system, including:

  • intermediate results files for failed model runs.
  • users' run results for successful and published/DOI'd model runs
  • users' run results for successful but unpublished model runs
  • users' configuration files and uploaded custom forcing data (e.g. bathymetry)
  • user accounts?
  • older / superseded eReefs forcing datasets (boundary conditions)

Plus links into the various metadata catalogues, discovery portals and so on that integrate with this system. We don't yet have proper terms and conditions that either the admin team or the end users can refer to for any of this, but are going to need some really soon.

hot007 added a commit that referenced this issue Apr 29, 2022
hot007 added a commit that referenced this issue Apr 29, 2022
I have not added this file to the toc until it's been reviewed!!! Please add it on merge if approved, @paolap or @chloemackallah.

I am not happy with the number rendering here where I've done 'custom' things; if you wish to rework the numbered lists, I think it'd help.

This commit attempts to address most of Sharon's issues raised in #10

paolap commented May 16, 2022

These pull requests have now all been merged into main.
I'll close this issue.

@paolap paolap closed this as completed May 16, 2022

hot007 commented May 19, 2022

I'm just going to reopen this one for now so we remember to come back particularly to the unpublished data page and cross-check if all issues raised here are addressed in it.

@hot007 hot007 reopened this May 19, 2022