A dummy/marker writer component for gathering outcome of validation rules #942

Closed
kaspersorensen opened this issue Nov 20, 2015 · 15 comments

Comments

@kaspersorensen
Member

In a number of scenarios I have seen that users/customers want to build validation rules in DataCleaner and monitor the success-/failure-rate of that rule. Example rules could be:

  • We should always be able to match a column against at least one of Dictionaries X, Y or Z.
  • TimestampX should always be greater than TimestampY.
  • The validation code from a correction transformation should be in range 1-100, whereas 101+ is regarded as failure.

We do already have various filters etc. which enable the user to build filtering rules. But we don't have a simple "writer"/consumer or analyzer which just considers all records passed to it as "counted" and pertaining to some category.

I'm having trouble figuring out if such an analyzer would be nice to have. For beginners its purpose would be quite unclear, but for the use cases above it would make sense (I think). Feedback welcome.
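
To make the idea concrete, here is a minimal sketch in Python of what such a consumer might do. It does not use any DataCleaner API: the `MarkRows` class, its `consume` method, the field names and the sample records are all hypothetical, chosen only to illustrate the second and third example rules above.

```python
from datetime import date


class MarkRows:
    """Hypothetical stand-in for the proposed component.

    It does nothing with a record except count it under one category.
    """

    def __init__(self, category):
        self.category = category
        self.row_count = 0

    def consume(self, record):
        self.row_count += 1


def timestamps_ordered(record):
    # Rule: TimestampX should always be greater than TimestampY.
    return record["timestamp_x"] > record["timestamp_y"]


def validation_code_ok(record):
    # Rule: validation code in range 1-100 passes, 101+ is a failure.
    return 1 <= record["validation_code"] <= 100


# Made-up sample data, for illustration only.
records = [
    {"timestamp_x": date(2015, 11, 20), "timestamp_y": date(2015, 11, 19), "validation_code": 42},
    {"timestamp_x": date(2015, 11, 18), "timestamp_y": date(2015, 11, 19), "validation_code": 42},
    {"timestamp_x": date(2015, 11, 20), "timestamp_y": date(2015, 11, 19), "validation_code": 101},
    {"timestamp_x": date(2015, 11, 20), "timestamp_y": date(2015, 11, 19), "validation_code": 100},
]

valid = MarkRows("valid")
invalid = MarkRows("invalid")

for record in records:
    # The filtering rules decide the route; the marker only counts.
    passed = timestamps_ordered(record) and validation_code_ok(record)
    (valid if passed else invalid).consume(record)

failure_rate = invalid.row_count / len(records)
```

The point of the sketch is the split of responsibilities: the rules live in the filters, and the marker at the end of each branch only contributes a count that can be tracked over time.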

@kaspersorensen
Member Author

We're considering alternatives to the name "Mark records as" (specifically the word 'mark', as discussed with @LosD). Some candidates:

  • Collect
  • Regard
  • Tag
  • Label
  • Annotate

More suggestions are very welcome.

@kaspersorensen
Member Author

Implemented a first draft of this extension in https://github.com/kaspersorensen/extension_annotate

Here's a screenshot:

[Screenshot, 2016-01-03 at 16:20:31]

@kaspersorensen
Member Author

What do you think @LosD? I'm thinking of adding this to the extension swap (also for preparing the workshop where we might need it later this month).

@LosD
Contributor

LosD commented Jan 3, 2016

Seems nice and simple... I guess we can continue discussing the name forever (my biggest worry with using any kind of "mark", "tag", "label" or "annotate" is that I could easily see it clash with some sort of feature where we actually DO mark or tag a record for inspection later in the chain, rather than more or less count it).

What's the plan for the result output?

@kaspersorensen
Member Author

Right, so far I just inherited the existing result renderer for "AnnotatedRowsResult":

[Screenshot, 2016-01-03 at 19:36:53]

And I added a metric "Row count" for monitoring it in the DC monitor webapp.

@LosD
Contributor

LosD commented Jan 3, 2016

I'm not sure if it would be feasible, or even desirable for your use case, but I imagine it could be nice to have a common result screen for all "mark" results, so it would be possible to compare them directly; i.e. "20 invalid, 13 valid and 7 special", or whatever counts the user had decided would be interesting. They don't seem quite as interesting in a vacuum (unless the annotation part is interesting in itself for your use).
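
As a rough illustration of that kind of combined view, here is a small Python sketch that folds several per-category row counts into one line and one set of shares. It assumes nothing about DataCleaner's result renderers; the `summarize` function and the `counts` mapping are hypothetical, and the numbers are simply the ones from the example sentence.

```python
def summarize(counts):
    """Render category counts as a single comparison line."""
    parts = [f"{n} {name}" for name, n in counts.items()]
    if len(parts) > 1:
        return ", ".join(parts[:-1]) + " and " + parts[-1]
    return "".join(parts)


# Hypothetical row counts collected from three separate "mark" components.
counts = {"invalid": 20, "valid": 13, "special": 7}

summary = summarize(counts)  # "20 invalid, 13 valid and 7 special"

# Shares of the total make the counts comparable across runs of different size.
total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
```

Seen side by side like this, each count gets its meaning from the others, which is what a single count shown on its own is missing.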

@kaspersorensen
Member Author

I think we should close this story since the functionality (except for the part described in the last comment by @LosD) is delivered via the "Mark rows as..." component.

As for the last remark: I would suggest that is more of a concern for filters. There could be a kind of overview somewhere of all categorizations made throughout all components (especially filters, since they "direct" the flow somewhere).

@LosD
Contributor

LosD commented May 17, 2016

Should we consider moving Mark Rows into main DC, then?

However, we'll probably need to document it better, then. For a component that is just an extension, it's pretty weird to have gotten several "but... what does it do?" reactions ☺

@kaspersorensen
Member Author

I believe it IS included in the default install of DC5! :-)

@LosD
Contributor

LosD commented May 17, 2016

Ah, yes, you are right. But that is just in the commercial distribution. I'm thinking we should put it into the actual DC project.

But that explains the many "huh?" reactions :)

@kaspersorensen
Member Author

OK with me to change the way it is bundled if you prefer. But from an end user's point of view, this issue has been fixed for a long time, I think.

@LosD
Contributor

LosD commented May 17, 2016

I don't think it is part of the community edition at all.

Of course, they can always just fetch it from the ExtensionSwap.

@LosD
Contributor

LosD commented May 17, 2016

It's also quite an issue that no one seems to have any idea what to use it for (let alone why they should fill in the sole non-inputcolumn property).

@kaspersorensen
Member Author

I would like to revisit this issue by contributing the extension to the community edition. Consider it bumped :-)

@LosD
Contributor

LosD commented Nov 21, 2017

That sounds like a great idea! 👍
