Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement user defined delimiter for value format #3393

Conversation

purplefox
Copy link
Contributor

@purplefox purplefox commented Sep 21, 2019

Description

This PR implements configurable delimiters for value formats.

It is based on an original PR #2055 authored by @hasnat. Many thanks for your contribution!

Since the original PR was submitted the codebase has changed a lot so it was a complex merge, and I've had to reapply a lot of the changes as many of the originally modified classes do not exist or have moved in the current code base.

I've also added some more tests and implemented TAB and SPACE delimiters too.

Here’s a summary:

  • We introduce a new optional property VALUE_DELIMITER which can be used in WITH clauses when VALUE_FORMAT=‘DELIMITED’.
  • The default value for VALUE_DELIMITER is ','
  • VALUE_DELIMITER must be a single character or the special values ‘SPACE’ or ‘TAB’ which, unsurprisingly, represent space and tab delimited values.
  • We need to use special values for space and tab because
    • With clause properties in KSQL are managed by subclasses of org.apache.kafka.common.config.AbstractConfig. This class trims (i.e. removes whitespace) from any values it maintains. Thus, it cannot be used where the value is just whitespace (e.g. a tab or space character).
    • The KSQL cli currently does not support entering escaped tab characters, so it's much easier to use a different value to represent them.
  • Docs have been updated appropriately
  • Removed two param constructor of FormatInfo as it was only being used from tests
  • Updated DataGen

Testing done

  • Modified and added unit tests
  • Modified and added more QTT tests

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

hasnat and others added 30 commits October 31, 2018 18:41
* fork/master: (107 commits)
  copy-edited (confluentinc#2299)
  copy-edited (confluentinc#2299)
  remove unused map from command store `getRestoreCommands` (confluentinc#2296)
  Speed up integration tests. (confluentinc#2288)
  Move KSQL Clickstream demo to Examples repo (confluentinc#2270)
  Bump Confluent to 5.1.1-SNAPSHOT, Kafka to 2.1.1-SNAPSHOT
  Set Confluent to 5.1.0, Kafka to 2.1.0-cp1.
  Add ServiceContext (confluentinc#2243)
  Update to use CCL (confluentinc#2278)
  Remove redundant intro sentence in README.md (confluentinc#2277)
  Switch to use CCL (confluentinc#2275)
  Update readme for CCL (confluentinc#2276)
  Minor: add Hamcrest matchers for KsqlRestException and KsqlErrorMessage (confluentinc#2273)
  Update per relicensing (confluentinc#2274)
  CP-584: 5.1.0 changelog (confluentinc#2267)
  add null checks to min, max, and count aggregates; call Math.min/max for consistency (confluentinc#2246)
  Return and accept command topic offsets via REST API (confluentinc#2159)
  MINOR: Fix bug encountered when restoring RUN SCRIPT command. (again) (confluentinc#2265)
  More improvements to the way the CLI handles inline comments in multi-line statements. (confluentinc#2241)
  Fix issue where Avro subject does not exist (confluentinc#2260)
  ...
Copy link
Contributor

@rodesai rodesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, couple of nits inline.

Copy link
Member

@JimGalasyn JimGalasyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @purplefox

LGTM 'cept a load of nits and suggestions.

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @purplefox

I think the code is looking much better with the introduction of the Delimiter class.

There are still a fair few comments outstanding from previous reviews, plus some new ones below.

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 'cept for a few final nits.

Feel free to merge once you've cleaned them up.

* @param parser the parser.
* @return the validator
*/
public static Validator parses(final Function<String, ?> parser) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would benefit from a unit test - there's some in the patch I sent you.


private final char delimiter;

public static Delimiter parse(final char ch) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol - I only mean change the of(String) to parse. I think this one with a single char should of remained of. sorry. Should have been more explicit!

"CREATE STREAM S2 as SELECT id, name, value FROM test;"
],
"inputs": [
{"topic": "test_topic", "key": 0, "value": "0,zero,0", "timestamp": 0},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, for future reference, (no need to change), where the test case doesn't really care about the key or timestamp they can be excluded to make the test more succinct, e.g.

Suggested change
{"topic": "test_topic", "key": 0, "value": "0,zero,0", "timestamp": 0},
{"topic": "test_topic", "value": "0,zero,0"},

"statements": [
"CREATE STREAM TEST WITH (kafka_topic='test_topic', value_format='JSON', value_delimiter='|');"
],
"topics": [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned before - if you correct the error handling to store the properties class being created in an invalid state, (which it looks like you have), you shouldn't need this topics second.


public KsqlDelimitedSerdeFactory(final Optional<Delimiter> delimiter) {
this.csvFormat =
CSVFormat.DEFAULT.withDelimiter(delimiter.orElse(DEFAULT_DELIMITER).getDelimiter());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: strange formatting... blank line etc.

I'd probably go with:

    this.csvFormat = CSVFormat.DEFAULT
        .withDelimiter(delimiter.orElse(DEFAULT_DELIMITER).getDelimiter());

It it were me.

@purplefox purplefox merged commit b84d0aa into confluentinc:master Sep 26, 2019
@purplefox purplefox deleted the hasnat-add_optional_delimiter_for_delimited_format branch September 26, 2019 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants