Describe Primary data origin #1

Open
kyyberi opened this issue Mar 26, 2024 · 3 comments

kyyberi commented Mar 26, 2024

It is data genesis!

Which problem is this feature request solving?

  • Where and from whom the data came.
  • Currently ODPS does not offer any attributes or mechanism to define it.
  • Expands on the existing checksums, etc.

Describe the solution you'd like

  1. A way to reliably identify and define the source, plus a mechanism to describe it
  2. What steps happened in between: where the data started and what path it took. Verify the stakeholders?
  3. What kinds of entities are involved: machine, human, or something else
  4. Where and by what exactly the data was created, with proof (authenticity)
  5. This is similar to observability, but it is about describing, among other things, the context of the actual sources behind the value chain (looking backward)

Any known practical use cases to apply?

Yes. It will provide business value, and the request comes from practitioners. Details are not revealed yet, to protect the frontrunner's business.

Can you submit a pull request?

No.

---- Leave intact! Approval of Contributor Agreement -----

By submitting issue you approve the Contributor Agreement, https://governance.opendataproducts.org/v1/contributions/contributor-agreement

kyyberi added the enhancement (New feature or request) label Mar 26, 2024
kyyberi self-assigned this Mar 26, 2024

kyyberi commented Mar 31, 2024

Implemented a first version of this in the DataOps component. The origin element attempts to describe the sources of the data:

dataOps:
  data:
    schemaLocationURL: http://192.168.10.1/schemas/2016/petshopML-2.3/schema/petstore.xsd
    origin:
      - source: human # sensor, human, analytics
        sourceId: 
        type: raw # raw, cleansed
        description: 
        checksum: # ?
      - source: sensor # sensor, human, analytics
        sourceId: 
        type: cleansed # raw, cleansed
        description: 
        checksum: # ?
    lineage:
      dataLineageTool: Collibra
      dataLineageOutput: http://192.168.10.1/lineage.json

  infrastructure:
    platform: Azure
    region: West US 2 (Washington)
    storageTechnology: Azure SQL
    storageType: sql
    containerTool: helm

  build:
    format: yaml
    hashType: SHA-2
    checksum: 7b7444ab8f5832e9ae8f54834782af995d0a83b4a1d77a75833eda7e19b4c921
    signatureType: JWK
    scriptURL: http://192.168.10.1/rundatapipeline.yml
    deploymentDocumentationURL: http://192.168.10.1/datapipeline
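
The checksum field of each origin entry is still open in the draft above. A minimal sketch of one way it might be filled and checked, assuming the checksum is a hex SHA-256 digest of the source payload (consistent with the SHA-2 hashType in build, but not decided in this issue). The Origin class, make_origin helper, and the sensor-42 identifier are hypothetical names used only for illustration; the allowed values come from the inline comments in the YAML.

```python
import hashlib
from dataclasses import dataclass

# Allowed values taken from the inline comments in the YAML draft.
ALLOWED_SOURCES = {"sensor", "human", "analytics"}
ALLOWED_TYPES = {"raw", "cleansed"}


@dataclass
class Origin:
    """Hypothetical in-memory form of one entry under dataOps.data.origin."""
    source: str
    source_id: str
    type: str
    description: str
    checksum: str  # assumption: hex SHA-256 digest of the source payload

    def validate(self) -> None:
        """Reject values outside the enumerations sketched in the draft."""
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown type: {self.type!r}")


def sha256_hex(payload: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the payload."""
    return hashlib.sha256(payload).hexdigest()


def make_origin(source: str, source_id: str, type_: str,
                description: str, payload: bytes) -> Origin:
    """Build a validated origin entry with its checksum filled in."""
    origin = Origin(source, source_id, type_, description, sha256_hex(payload))
    origin.validate()
    return origin


# Usage: describe a raw sensor feed (identifier and payload are made up).
entry = make_origin("sensor", "sensor-42", "raw", "demo feed", b"abc")
print(entry.checksum)
```

A consumer holding the same payload can recompute the digest and compare it with the published checksum. That detects changes to the data, but it does not by itself prove who produced it.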

kyyberi commented Mar 31, 2024

Can we somehow add an "as code" part here as well? A method to verify the source system and the authenticity of the data directly?

Something similar to what is in data quality:

dataQuality:
  - dimension: accuracy
    objective: 98
    unit: percentage
    monitoring:
      type: SodaCL 
      spec:
        - require_unique(member_id) 
        - require_range(age_band, 18, 100)
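
One possible shape for such an "as code" check, as a sketch only: the origin entry carries the checksum plus a signature, and the consumer recomputes both. HMAC-SHA256 with a shared key is used here purely as a stand-in for whatever signing scheme is eventually chosen. The signature field, the key, and the function names are assumptions and are not part of ODPS.

```python
import hashlib
import hmac


def sign_payload(payload: bytes, key: bytes) -> str:
    """Producer side: hex-encoded HMAC-SHA256 tag over the payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_origin(origin: dict, payload: bytes, key: bytes) -> bool:
    """Consumer side: check integrity (checksum) and authenticity (signature)."""
    checksum_ok = hmac.compare_digest(
        hashlib.sha256(payload).hexdigest(), origin["checksum"]
    )
    signature_ok = hmac.compare_digest(
        sign_payload(payload, key), origin["signature"]
    )
    return checksum_ok and signature_ok


# Usage with made-up values; "signature" is a hypothetical extra attribute.
key = b"shared-secret"
payload = b"member_id,age_band\n1,34\n"
origin = {
    "source": "sensor",
    "sourceId": "sensor-42",
    "type": "raw",
    "checksum": hashlib.sha256(payload).hexdigest(),
    "signature": sign_payload(payload, key),
}
print(verify_origin(origin, payload, key))          # True
print(verify_origin(origin, payload + b"x", key))   # False: payload was altered
```

A shared key only shows that the data came from someone holding that key. Proving a specific source system to third parties would need an asymmetric signature instead, which is outside this sketch.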

kyyberi commented Apr 29, 2024

To be moved to the next version.

kyyberi transferred this issue from Open-Data-Product-Initiative/v3.0 May 21, 2024
kyyberi added the RFC (To be developed next in meetings) label and removed the enhancement (New feature or request) label Jun 3, 2024