diff --git a/EIPS/eip-7517.md b/EIPS/eip-7517.md new file mode 100644 index 0000000000000..d9802095c2574 --- /dev/null +++ b/EIPS/eip-7517.md @@ -0,0 +1,209 @@ +--- +eip: 7517 +title: Content Consent for AI/ML Data Mining +description: A proposal adding "dataMiningPreference" in the metadata to preserve the digital content's original intent and respect creator's rights. +author: Bofu Chen (@bafu), Tammy Yang (@tammyyang) +discussions-to: https://ethereum-magicians.org/t/eip-7517-content-consent-for-ai-ml-data-mining/15755 +status: Draft +type: Standards Track +category: ERC +created: 2023-09-12 +--- + +## Abstract + +This EIP proposes a standardized approach to declaring mining preferences for digital media content on the EVM-compatible blockchains. This extends digital media metadata standards like [ERC-7053](./eip-7053.md) and NFT metadata standards like [ERC-721](./eip-721.md) and [ERC-1155](./eip-1155.md), allowing asset creators to specify how their assets are used in data mining, AI training, and machine learning workflows. + +## Motivation + +As digital assets become increasingly utilized in AI and machine learning workflows, it is critical that the rights and preferences of asset creators and license owners are respected, and the AI/ML creators can check and collect data easily and safely. Similar to robot.txt to websites, content owners and creators are looking for more direct control over how their creativities are used. + +This proposal aims to propose a standardized method of declaring these preferences. Adding `dataMiningPreference` in the content metadata allows creators to include the information about how they want their work whether the asset may be used as part of a data mining or AI/ML training workflow. This ensures the original intent of the content is maintained. + +For AI-focused applications, this information serves as a guideline, facilitating the ethical and efficient use of content while respecting the creator's rights and building a sustainable data mining and AI/ML environment. + +The introduction of the `dataMiningPreference` property in digital asset metadata covers the considerations including + +* Accessibility: A clear and easily accessible method with human-readibility and machine-readibility for digital asset creators and license owners to express their preferences for how their assets are used in data mining and AI/ML training workflows. The AI/ML creators can check and collect data systematically. +* Adoption: As Coalition for Content Provenance and Authenticity (C2PA) already outlines guidelines for indicating whether an asset may be used in data mining or AI/ML training, it's crucial that onchain metadata aligns with these standards. This ensures compatibility between in-media metadata and onchain records. + +## Specification + +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 and RFC 8174. + +This EIP introduces a new property, `dataMiningPreference`, to the metadata standards which signify the choices made by the asset creators or license owners regarding the suitability of their asset for inclusion in data mining or AI/ML training workflows. `dataMiningPreference` is an object that can include one or more specific conditions. + +* `dataMining`: Allow the asset to be used in data mining for determining "patterns, trends, and correlations". +* `aiInference`: Allow the asset to be used as input to a trained AI/ML model for inferring a result. +* `aiGenerativeTraining`: Allow the asset to be used as training data for an AI/ML model that could produce derivative assets. +* `aiGenerativeTrainingWithAuthorship`: Same as `aiGenerativeTraining`, but requires that the authorship is disclosed. +* `aiTraining`: Allow the asset to be used as training data for generative and non-generative AI/ML models. +* `aiTrainingWithAuthorship`: Same as `aiTraining`, but requires that the authorship is disclosed. + +Each category is defined by a set of permissions that can take on one of three values: `allowed`, `notAllowed`, and `constraint`. + +* `allowed` indicates that the asset can be freely used for the specific purpose without any limitations or restrictions. +* `notAllowed` means that the use of the asset for that particular purpose is strictly prohibited. +* `constrained` suggests that the use of the asset is permitted, but with certain conditions or restrictions that must be adhered to. + +For instance, the `aiInference` property indicates whether the asset can be used as input for an AI/ML model to derive results. If set to `allowed`, the asset can be utilized without restrictions. If `notAllowed`, the asset is prohibited from AI inference. + +If marked as `constrained`, certain conditions, detailed in the license document, must be met. When `constraint` is selected, parties intending to use the media files should respect the rules specified in the license. To avoid discrepancies with the content license, the specifics of these constraints are not detailed within the schema, but the license reference should be included in the content metadata. + +### Schema + +The JSON schema of `dataMiningPreference` is defined as follows: + +```json +{ + "type": "object", + "properties": { + "dataMining": { + "type": "string", + "enum": ["allowed", "notAllowed", "constrained"] + }, + "aiInference": { + "type": "string", + "enum": ["allowed", "notAllowed", "constrained"] + }, + "aiTraining": { + "type": "string", + "enum": ["allowed", "notAllowed", "constrained"] + }, + "aiGenerativeTraining": { + "type": "string", + "enum": ["allowed", "notAllowed", "constrained"] + }, + "aiTrainingWithAuthorship": { + "type": "string", + "enum": ["allowed", "notAllowed", "constrained"] + }, + "aiGenerativeTrainingWithAuthorship": { + "type": "string", + "enum": ["allowed", "notAllowed", "constrained"] + } + }, + "additionalProperties": true +} +``` + +### Examples + +The mining preference example for not allowing generative AI training: + +```json +{ + "dataMiningPreference": { + "dataMining": "allowed", + "aiInference": "allowed", + "aiTrainingWithAuthorship": "allowed", + "aiGenerativeTraining": "notAllowed" + } +} +``` + +The mining preference example for only allowing for AI inference: + +```json +{ + "dataMiningPreference": { + "aiInference": "allowed", + "aiTraining": "notAllowed", + "aiGenerativeTraining": "notAllowed" + } +} +``` + +The mining preference example for allowing generative AI training if mentioning authorship and follow license: + +```json +{ + "dataMiningPreference": { + "dataMining": "allowed", + "aiInference": "allowed", + "aiTrainingWithAuthorship": "allowed", + "aiGenerativeTrainingWithAuthorship": "constrained" + } +} +``` + +### Example Usage with ERC-721 + +The example using the `dataMiningPreference` property in NFT defined in [ERC-721](./eip-721.md). + +We can put the `dataMiningPreference` field in the NFT metadata below. The `license` field is an example for specifying how to use a constrained condition. A NFT has its way to describe its license. + +```json +{ + ... + "dataMiningPreference": { + "dataMining": "allowed", + "aiInference": "allowed", + "aiTrainingWithAuthorship": "allowed", + "aiGenerativeTrainingWithAuthorship": "constrained" + }, + "license": { + "name": "CC-BY-4.0", + "document": "https://creativecommons.org/licenses/by/4.0/legalcode" + } + ... +} +``` + +### Example Usage with ERC-7053 + +The example using the `dataMiningPreference` property in onchain media provenance registration defined in [ERC-7053](./eip-7053.md). + +Assuming the Decentralized Content Identifier (CID) is `bafyaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa`. After following up the CID, got the Commit data: + +```json +{ + "assetTreeCid": "bafyaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", + "assetTreeSha256": "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb", + "assetTreeSignature": "0xcccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc", + "author": "0xdddddddddddddddddddddddddddddddddddddddd", + "committer": "0xeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee", + "timestampCreated": 1691576790, +} +``` + +Following up the `assetTreeCid` which describes the properties of the registered asset: + +```json +{ + ... + "dataMiningPreference": { + "dataMining": "allowed", + "aiInference": "allowed", + "aiTrainingWithAuthorship": "allowed", + "aiGenerativeTrainingWithAuthorship": "constrained" + }, + "license": { + "name": "CC-BY-4.0", + "document": "https://creativecommons.org/licenses/by/4.0/legalcode" + } + ... +} +``` + +## Rationale + +The technical decisions behind this EIP have been carefully considered to address specific challenges and requirements in the digital asset landscape. Here are the clarifications for the rationale behind: + +1. Adoption of JSON schema: The use of JSON facilitates ease of integration and interaction, both manually and programmatically, with the metadata. +2. Detailed control with training types: The different categories like `aiGenerativeTraining`, `aiTraining`, and `aiInference` let creators control in detail, considering both ethics and computer resource needs. +3. Authorship options included: Options like `aiGenerativeTrainingWithAuthorship` and `aiTrainingWithAuthorship` make sure creators get credit, addressing ethical and legal issues. +4. Introduction of `constrained` category: The introduction of `constrained` category serves as an intermediary between `allowed` and `notAllowed`. It signals that additional permissions or clarifications may be required, defaulting to `notAllowed` in the absence of such information. +5. C2PA alignment for interoperability: The standard aligns with C2PA guidelines, ensuring seamless mapping between onchain metadata and existing offchain standards. + +## Security Considerations + +When adopting this EIP, it’s essential to address several security aspects to ensure the safety and integrity of adoption: + +* Data Integrity: Since this EIP facilitates the declaration of mining preferences for digital media assets, the integrity of the data should be assured. Any tampering with the `dataMiningPreference` property can lead to unauthorized data mining usage. Blockchain's immutability will play a significant role here, but additional security layers, such as cryptographic signatures, can further ensure data integrity. +* Verifiable Authenticity: Ensure that the individual or entity setting the `dataMiningPreference` is the legitimate owner or authorized representative of the digital asset. Unauthorized changes to preferences can lead to data misuse. Cross-checking asset provenance and ownership becomes paramount. Services or smart contracts should be implemented to verify the authenticity of assets before trusting the `dataMiningPreference`. +* Data Privacy: Ensure that the process of recording preferences doesn't inadvertently expose sensitive information about the asset creators or owners. Although the Ethereum blockchain is public, careful consideration is required to ensure no unintended data leakage. + +## Copyright + +Copyright and related rights waived via [CC0](../LICENSE.md).