Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Typesafe(r) properties #599

Open
c-thiel opened this issue Sep 3, 2024 · 7 comments
Open

Discussion: Typesafe(r) properties #599

c-thiel opened this issue Sep 3, 2024 · 7 comments

Comments

@c-thiel
Copy link
Contributor

c-thiel commented Sep 3, 2024

We currently have property names scattered around the crate. If a property needs to be used, it typically needs to be parsed from string first. This is a potentially falling operation that should be done on load. Also defaults are hard to maintain as we don't have a type holding the property key, a type safe value and its default.

For our iceberg catalog, we are currently using a system inspired by http's HeaderValue, which is type safe and customizable / expandable for properties defined even in downstream crates. It depends heavily on macros to generate their types.

We are thinking about polishing what we have, add things like defaults, and PR it to iceberg-rust. As this is quite some work, I want to make sure that everybody agrees that this could be a good way to go forward, or if other preferred ideas are already floating around.

Looking forward for your opinion!

A very raw reference: https://github.com/hansetag/iceberg-catalog/blob/0a01eaf87f32e7f393f0d8f0d104594171dccfce/crates/iceberg-ext/src/configs/mod.rs#L12

@twuebi
Copy link
Contributor

twuebi commented Sep 3, 2024

We've been through some iterations with this, our first idea was to use enums which did not work well since it's not really possible to have compile-time type checks attached to each variant.

The idea behind it is that something like this should already fail at compile time:

config.insert(NEEDS_URL, "not-a-url");

The current design aims to prevent errors at compile time when programmatically constructing properties, that is, if a property is expected to a Url, then there should be no way of passing in a String that is not a Url.

In some cases, that's not preventable, e.g. when we're on the receiving end of a REST request and some client decided to send a non Url for some property. In this case, a consumer can decide on their own if they want to use a type-safe api, which would return Err if the type couldn't be parsed or an unchecked api which just returns whatever String is stored.

In general, I think it's worthy to try and make as much of the HashMap<String, String> parameters as type-safe as possible. It can help us avoid many subtle papercuts in the future.

@liurenjie1024
Copy link
Collaborator

Thanks @c-thiel for raising this. I love this idea of type safe properties and believe this is the right direction to go. I took a look at your reference, but I didn't get a good understanding of how the api looks like. Would you mind to show some code examples of the usage? For example, what rest/sql catalog properties would look like? In fact, we can leave macros to last part since it's a tool to reduce duplication and maintaince effort.

@twuebi
Copy link
Contributor

twuebi commented Sep 6, 2024

Hi @liurenjie1024,

https://github.com/hansetag/iceberg-catalog/blob/213560059d69248f6f67a979ce2cc3c1e808602f/crates/iceberg-catalog/src/service/storage/s3.rs#L317.

config.insert(&s3::AccessKeyId(access_key_id));
config.insert(&s3::SecretAccessKey(secret_access_key));
config.insert(&s3::SessionToken(session_token));

and https://github.com/hansetag/iceberg-catalog/blob/213560059d69248f6f67a979ce2cc3c1e808602f/crates/iceberg-catalog/src/service/storage/az.rs#L174

config.insert(&custom::CustomConfig {
    key: self.iceberg_sas_property_key(),
    value: sas,
});

would be two locations showcasing writing of properties.

The first location uses well-known keys, so we can make use of our type-safe properties. The second example has dynamic keys for azdls sas tokens, hence we use the CustomConfig type here which has fewer guarantees.

https://github.com/hansetag/iceberg-catalog/blob/213560059d69248f6f67a979ce2cc3c1e808602f/crates/iceberg-catalog/src/catalog/tables.rs#L1039-L1043 constructs the config struct using the unchecked path and then fetches the location from it. The access to the underlying storage works elegantly through generics e.g.:

self.get_prop_opt::<Location>();

The creation API of TableProperties can be seen here: https://github.com/hansetag/iceberg-catalog/blob/213560059d69248f6f67a979ce2cc3c1e808602f/crates/iceberg-ext/src/configs/table.rs.

Declaration looks like this:

// creates Properties struct & marker trait
impl_properties!(TableProperties, TableProperty);

pub mod s3 {
    use super::super::ConfigProperty;
    use super::{ConfigParseError, NotCustomProp, ParseFromStr, TableProperties, TableProperty};
    use crate::configs::impl_config_values;
    use url::Url;

    impl_config_values!(
        Table, // prefix used for the property newtypes
        {
            // Suffix of the newtype, type of the newtype, key of the newtype, name of the accessor function
            Region, String, "s3.region", "s3_region";
            Endpoint, Url, "s3.endpoint", "s3_endpoint";
            PathStyleAccess, bool, "s3.path-style-access", "s3_path_style_access";
            AccessKeyId, String, "s3.access-key-id", "s3_access_key_id";
            SecretAccessKey, String, "s3.secret-access-key", "s3_secret_access_key";
            SessionToken, String, "s3.session-token", "s3_session_token";
            RemoteSigningEnabled, bool, "s3.remote-signing-enabled", "s3_remote_signing_enabled";
            Signer, String, "s3.signer", "s3_signer";
            SignerUri, String, "s3.signer.uri", "s3_signer_uri";
         }
    );
}

Expanding both the impl_properties! and impl_config_values! macros then yields this:

impl_properties!

#[derive(Debug, PartialEq, Default)]
pub struct TableProperties {
    props: std::collections::HashMap<String, String>,
}
pub trait TableProperty: ConfigProperty {}
impl TableProperties {
    ///   Inserts a property into the configuration. 
    ///
    ///   If the property already exists, the previous value is returned. 
    pub fn insert<S: TableProperty>(&mut self, pair: &S) -> Option<S::Type> {
        let prev = self
            .props
            .insert(pair.key().to_string(), pair.value_to_string());
        prev.and_then(|v| S::parse_value(v.as_str()).ok())
    }

    ///   Removes a property from the configuration. 
    ///
    ///   If the property exists, the value is returned. 
    ///   If the property exists but cannot be parsed to the expected type,  `None`  is returned. 
    pub fn remove<S: TableProperty + NotCustomProp>(&mut self) -> Option<S::Type> {
        self.props
            .remove(S::KEY)
            .and_then(|v| S::parse_value(v.as_str()).ok())
    }

    ///   Removes a property from the configuration. 
    ///
    ///   If the property exists, the value is returned. 
    pub fn remove_untyped(&mut self, key: &str) -> Option<String> {
        self.props.remove(key)
    }

    #[must_use]
    ///   Get a property from the configuration. 
    ///
    ///   If the property exists but cannot be parsed to the expected type,  `None`  is returned. 
    ///   If you want to fail on unparsable values, use  `get_prop_fallible` . 
    pub fn get_prop_opt<C: TableProperty + NotCustomProp>(&self) -> Option<C::Type> {
        self.props
            .get(C::KEY)
            .and_then(|v| ParseFromStr::parse_value(v.as_str()).ok())
    }

    #[must_use]
    ///   Get a property from the configuration. 
    ///
    ///   If the property does not exist  `None`  is returend 
    ///
    ///   # Errors 
    ///
    ///   If the property exists but cannot be parsed to the expected type, a  `ConfigParseError`  is returned. 
    pub fn get_prop_fallible<C: TableProperty + NotCustomProp>(
        &self,
    ) -> Option<Result<C::Type, ConfigParseError>> {
        self.props
            .get(C::KEY)
            .map(|v| ParseFromStr::parse_value(v.as_str()))
            .map(|r| r.map_err(|e| e.for_key(C::KEY)))
    }

    #[must_use]
    ///   Get a custom property from the configuration. 
    ///
    ///   A custom property is a property that is either dynamic in its key or a property that 
    ///   is not part of the standard interface. As such it is not bound to a specific type. 
    ///   Both key and value are strings. 
    pub fn get_custom_prop(&self, key: &str) -> Option<String> {
        self.props.get(key).cloned()
    }

    ///   Convenience constructor to create a new configuration from an optional list of 
    ///   properties. 
    ///
    ///   # Errors 
    ///
    ///   Returns an error if a known key has an incompatible value. 
    pub fn try_from_maybe_props(
        props: Option<impl IntoIterator<Item=(String, String)>>,
    ) -> Result<Self, ConfigParseError> {
        match props {
            Some(props) => Self::try_from_props(props),
            None => Ok(Self::default()),
        }
    }

    #[must_use]
    ///   Non validating constructor to create a new configuration from a list of properties. 
    ///
    ///   Useful when going from wire format ( `HashMap<String, String>` ) to this type when 
    ///   failing on a single invalid property is not desired.  `get_prop_fallible`  still 
    ///   offers the possibility to have runtime errors above unexpected failures elsewhere. 
    pub fn from_props_unchecked(props: impl IntoIterator<Item=(String, String)>) -> Self {
        let mut config = Self::default();
        for (key, value) in props {
            config.props.insert(key, value);
        }
        config
    }
}

impl_config_values

use crate::configs::impl_config_value;
#[derive(Debug, PartialEq, Clone)]
pub struct Endpoint(pub Url);
impl ConfigProperty for Endpoint {
    const KEY: &'static str = "s3.endpoint";
    type Type = Url;

    fn value(&self) -> &Self::Type {
        &self.0
    }

    fn into_value(self) -> Self::Type {
        self.0
    }

    fn parse_value(value: &str) -> Result<Self::Type, ConfigParseError>
    where
        Self::Type: ParseFromStr,
    {
        Self::Type::parse_value(value).map_err(|e| e.for_key(Self::KEY))
    }
}
impl NotCustomProp for Endpoint {}
impl TableProperty for Endpoint {}
impl TableProperties {
    #[must_use]
    pub fn s3_endpoint(&self) -> Option<Url> {
        self.get_prop_opt::<Endpoint>()
    }

    pub fn insert_s3_endpoint(&mut self, value: Url) -> Option<Url> {
        self.insert(&Endpoint(value))
    }
}
...

#[derive(Debug, PartialEq, Clone)]
pub struct SignerUri(pub String);
impl ConfigProperty for SignerUri {
    const KEY: &'static str = "s3.signer.uri";
    type Type = String;

    fn value(&self) -> &Self::Type {
        &self.0
    }

    fn into_value(self) -> Self::Type {
        self.0
    }

    fn parse_value(value: &str) -> Result<Self::Type, ConfigParseError>
    where
        Self::Type: ParseFromStr,
    {
        Self::Type::parse_value(value).map_err(|e| e.for_key(Self::KEY))
    }
}
impl NotCustomProp for SignerUri {}
impl TableProperty for SignerUri {}
impl TableProperties {
    #[must_use]
    pub fn s3_signer_uri(&self) -> Option<String> {
        self.get_prop_opt::<SignerUri>()
    }

    pub fn insert_s3_signer_uri(&mut self, value: String) -> Option<String> {
        self.insert(&SignerUri(value))
    }
}
pub(crate) fn validate(
    key: &str,
    value: &str,
) -> Result<(), ConfigParseError> {
    Ok(match key {
        Region::KEY => {
            _ = Region::parse_value(value)?;
        }
        Endpoint::KEY => {
            _ = Endpoint::parse_value(value)?;
        }
        PathStyleAccess::KEY => {
            _ = PathStyleAccess::parse_value(value)?;
        }
        AccessKeyId::KEY => {
            _ = AccessKeyId::parse_value(value)?;
        }
        SecretAccessKey::KEY => {
            _ = SecretAccessKey::parse_value(value)?;
        }
        SessionToken::KEY => {
            _ = SessionToken::parse_value(value)?;
        }
        RemoteSigningEnabled::KEY => {
            _ = RemoteSigningEnabled::parse_value(value)?;
        }
        Signer::KEY => {
            _ = Signer::parse_value(value)?;
        }
        SignerUri::KEY => {
            _ = SignerUri::parse_value(value)?;
        }
        _ => {}
    })
}

@Xuanwo
Copy link
Member

Xuanwo commented Sep 6, 2024

Hi, thank you for bringing this up. I feel it's a bit too complex for us. Could we maintain a static structure like TableProperties and TableS3Properties? We could implement a from_map method for it. After parsing, all our usage would be type-safe with no additional cost.

@twuebi
Copy link
Contributor

twuebi commented Sep 6, 2024

Hi @Xuanwo, to me that depends on how closed of a set the properties are. If they are a rather closed set, I'd also go for the static approach.

Going with that, I think we could arrive at nice solution by replacing the properties: HashMap<String, String> in e.g. TableMetadata with a typed struct which would serialize into HashMap<String, String> for wire-transfer. For custom properties, there could still be a HashMap<String, String>. In our Deserialize implementation, we'd then do all the verifications.

It would still come with a large amount of boilerplate and the properties struct would have to be a rather complex nested structure of enums to avoid having a ton of Options in it where only certain permutations are valid configurations for the type but that issue is shared with the original proposal in here.

@c-thiel
Copy link
Contributor Author

c-thiel commented Sep 6, 2024

I am not so sure about the closed set. Currently there is a massive amount of TableProperties in Java (https://iceberg.apache.org/javadoc/1.6.1/constant-values.html, search for TableProperties) and we only have a small set now. I have a tendency towards an open approach because:

  1. It allows us to granularly add properties as we need them
  2. Allow iceberg-rust users to define their own properties only relevant for their use-case without reinventing the type-system.

But I also understand the concern of complexity. We have a good template with http though.

@liurenjie1024
Copy link
Collaborator

I also have concern that this approach is too complicated. Could we have sth like:

#[config]
pub struct S3Properties {
  // We could write a macro for this
  #[config(key="s3.access.id")]
  pub access_key: String,
  pub access_key_id: String,
  pub region: Url,
  pub custom: HashMap<String, String>,
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants