timezone-less NWB files are failing DANDI-side validation #1944

Open
bendichter opened this issue May 23, 2024 · 11 comments
Labels
bug Something isn't working UX Affects usability of the system

Comments

@bendichter
Member

We have a server-side issue with validation of NWB files. Until this is resolved, most NWB files submitted to DANDI will be unpublishable.

NWB recently removed the requirement for the NWB session_start_time to have a timezone (PR). We had previously defaulted to assigning the timezone of the machine where the conversion was run, but were concerned that this was often incorrect. It's also very unusual for recording systems or any other metadata to include the timezone, so the timezone requirement made it challenging to automate conversion.

The necessary changes have been made to the software, and these new timezone-less files propagate just fine through NWB GUIDE, NeuroConv, PyNWB, HDMF, NWB Inspector, and the dandi CLI. Creation, validation, and upload work well.

However, these new files are failing validation on DANDI server side, which means it is currently impossible to publish these files. See here for an example.

Our preferred solution would be to relax the validation on the DANDI server side to allow NWB files with datetimes that do not have timezones. We're not sure where this validation happens. The schema just lists the requirement as datetime, and it's not clear how that is meant to be serialized/deserialized and whether that requires a timezone to be present.

https://github.com/dandi/dandi-schema/blob/e135307e38e8fccf83676cf1d6faba9b142a8de0/dandischema/models.py#L1186
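For reference, Pydantic's plain `datetime` type accepts both naive and aware values, so model-level validation alone does not enforce a timezone. A minimal sketch (the `Session` model and field name are hypothetical stand-ins, not the actual dandischema model):

```python
from datetime import datetime

from pydantic import BaseModel


class Session(BaseModel):
    # hypothetical field standing in for dandischema's datetime-typed fields
    session_start_time: datetime


# Both a timezone-less and a timezone-aware string validate successfully:
naive = Session(session_start_time="2024-05-06T10:56:41")
aware = Session(session_start_time="2024-05-06T10:56:41+02:00")

print(naive.session_start_time.tzinfo)  # None
print(aware.session_start_time.tzinfo)
```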

@yarikoptic
Member

hm, not yet sure if that was a good step forward. @bendichter you said

, but were concerned that this was often incorrect.

was there some discussion of that concern and/or why you think they were incorrect? so far I seem to spot only technical considerations/problems which could have inspired such a change.

@bendichter
Member Author

@yarikoptic see discussion here: NeurodataWithoutBorders/nwb-schema#321

@satra
Member

satra commented May 23, 2024 via email

@satra
Member

satra commented May 23, 2024

i downloaded this example and ran it through dandischema validate and it does not error. so we need to check if versions are matched between the archive and the schema.

@bendichter
Member Author

@satra I can pull the metadata down using the API into a dandischema.models.Asset object, which I assume means it passes schema validation.

```python
client = DandiAPIClient(api_url="https://api-staging.dandiarchive.org/api")
metadata = client.get_asset("d5c1c760-2d7e-4a52-bbf0-52225e88cc3c").get_metadata()
type(metadata)
# dandischema.models.Asset
```

"d5c1c760-2d7e-4a52-bbf0-52225e88cc3c" is the asset ID of one of the files.

@candleindark
Member

candleindark commented May 23, 2024

The datetime validation is unlikely to be the cause of the problem, at least not at the Pydantic model level. I think @satra's example has demonstrated that, but here is a more pointed one.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class A(BaseModel):
    t: Optional[datetime]


a = A(t='2024-05-06T10:56:41')  # no error from here
```

@yarikoptic
Member

@yarikoptic see discussion here: NeurodataWithoutBorders/nwb-schema#321

ok, thank you! I think it would still be great, maybe at the level of NWB Inspector, to warn on times without a timezone and on the possible loss of the ability to align multiple data files/points.
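The alignment concern can be seen directly in Python: arithmetic between naive and aware datetimes is not allowed, so timestamps from mixed files cannot be compared without first assuming a timezone. A minimal sketch:

```python
from datetime import datetime, timezone

naive = datetime(2024, 5, 23, 10, 0, 0)                       # no timezone
aware = datetime(2024, 5, 23, 10, 0, 0, tzinfo=timezone.utc)  # UTC

# Subtracting an aware datetime from a naive one raises TypeError,
# so timestamps from mixed files cannot be aligned directly.
try:
    aware - naive
except TypeError as err:
    print("cannot align:", err)
```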

@CodyCBakerPhD

I think it would still be great, maybe at the level of NWB Inspector, to warn on times without a timezone and on the possible loss of the ability to align multiple data files/points.

We agree; there's already a WIP PR about that (awaiting some final debugging): NeurodataWithoutBorders/nwbinspector#458

@yarikoptic
Member

Summary so far from my look into the problem: it is due to the jsonschema format check requiring date-time values to be RFC 3339 compliant, and thus to include a time zone.
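A minimal reproduction of that jsonschema behavior (assuming the optional rfc3339-validator package is installed; without it, jsonschema silently skips the date-time format check):

```python
import jsonschema

schema = {"type": "string", "format": "date-time"}
checker = jsonschema.FormatChecker()

# RFC 3339 requires a UTC offset, so an aware timestamp conforms...
print(checker.conforms("2024-05-06T10:56:41+00:00", "date-time"))
# ...while a timezone-less one is rejected when the checker is active:
print(checker.conforms("2024-05-06T10:56:41", "date-time"))

# format checks only run when a format_checker is passed to validate:
jsonschema.validate("2024-05-06T10:56:41+00:00", schema, format_checker=checker)
```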

@rly

rly commented Sep 20, 2024

It would be nice to fix this upstream in pydantic, but until then, we could create our own special datetime type that uses a regex pattern in the JSON schema, rather than the built-in RFC 3339 date-time validator, to validate a datetime that may or may not have an offset/timezone. We could decide on whatever pattern is best here - ISO 8601, or RFC 3339 where offset is optional, or something else.

From what I can gather, there is no good consensus regex for ISO 8601; code-based validation seems to be preferred, but within JSON schema I think we are limited to regex. The code below uses one ISO 8601 regex that I found on the internet.

```python
import datetime
import json
from typing import Annotated

import jsonschema
from pydantic import BaseModel, WithJsonSchema

regex = r"(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d\.\d+)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d)"

BetterDatetime = Annotated[
    datetime.datetime,
    WithJsonSchema({"pattern": regex, "type": "string"}, mode="serialization"),
    WithJsonSchema({"pattern": regex, "type": "string"}, mode="validation"),
]

class Model(BaseModel):
    dt: BetterDatetime

schema = Model.model_json_schema()
schema
# {'properties': {'dt': {'pattern': '(\\d{4}-[01]\\d-[0-3]\\dT[0-2]\\d:[0-5]\\d:[0-5]\\d\\.\\d+)|(\\d{4}-[01]\\d-[0-3]\\dT[0-2]\\d:[0-5]\\d:[0-5]\\d)|(\\d{4}-[01]\\d-[0-3]\\dT[0-2]\\d:[0-5]\\d)', 'title': 'Dt', 'type': 'string'}}, 'required': ['dt'], 'title': 'Model', 'type': 'object'}

m1 = Model(dt="2032-04-23T10:20:30")
j1 = m1.model_dump_json()
j1
# '{"dt":"2032-04-23T10:20:30"}'

m2 = Model(dt="2032-04-23T10:20:30+02:30")
j2 = m2.model_dump_json()
j2
# '{"dt":"2032-04-23T10:20:30+02:30"}'

# note: jsonschema.validate takes (instance, schema) in that order
jsonschema.validate(json.loads(j1), schema)  # success
jsonschema.validate(json.loads(j2), schema)  # success
```

@candleindark
Member

```python
regex = r"(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d\.\d+)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d)"

from pydantic import WithJsonSchema, BaseModel
from typing import Annotated
import datetime

BetterDatetime = Annotated[
    datetime.datetime,
    WithJsonSchema({"pattern": regex, "type": "string"}, mode="serialization"),
    WithJsonSchema({"pattern": regex, "type": "string"}, mode="validation"),
]

class Model(BaseModel):
    dt: BetterDatetime
```
I think it is a good approach. It allows a custom JSON schema for a particular set of datetime fields without customizing every field of datetime type. However, according to the docs for WithJsonSchema, BetterDatetime can be defined more simply as:

```python
BetterDatetime = Annotated[
    datetime.datetime,
    WithJsonSchema({"pattern": regex, "type": "string"}),
]
```
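A quick self-contained check of that simplification (reusing the regex from the comment above): when `mode` is omitted, the override applies to both the validation and the serialization schema.

```python
import datetime
from typing import Annotated

from pydantic import BaseModel, WithJsonSchema

regex = r"(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d\.\d+)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d)"

BetterDatetime = Annotated[
    datetime.datetime,
    # no mode= argument, so the override applies in both schema modes
    WithJsonSchema({"pattern": regex, "type": "string"}),
]

class Model(BaseModel):
    dt: BetterDatetime

# The pattern override shows up in both the validation and serialization schemas:
assert Model.model_json_schema(mode="validation")["properties"]["dt"]["pattern"] == regex
assert Model.model_json_schema(mode="serialization")["properties"]["dt"]["pattern"] == regex
```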

7 participants