This repository has been archived by the owner on Apr 22, 2022. It is now read-only.

Avoid needless data loss during schema mapping failures #110

Open
cbsmith opened this issue Dec 4, 2015 · 5 comments

Comments

@cbsmith

cbsmith commented Dec 4, 2015

Divolte should minimize data loss due to schema-mapping errors, and should log them better when they occur. Right now, a failure while mapping a single field of a single event results in the loss not only of that event, but of all other events in the same micro-batch. It also results in the exception unwinding the stack all the way up to Thread.run(), despite this comment in ItemProcessor:

        // Note: processing should not throw an unchecked
        // exception unless no further processing should
        // take place.

At the very least, the rest of the batch should be preserved. Ideally, a failure in one field shouldn't prevent other fields from being populated, and the unstructured raw data should be passed into a field that allows the case to be analyzed later and potentially even corrected.
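The per-event isolation being asked for could look roughly like this. A minimal sketch only: all names here (BatchMapper, MappingResult, mapBatch) are hypothetical illustrations, not Divolte's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of per-event failure isolation: map each event individually so
// that one bad event no longer poisons the whole micro-batch, and keep
// the raw payload of any failed event around for later analysis.
public final class BatchMapper {
    public static final class MappingResult<T> {
        public final List<T> mapped = new ArrayList<>();
        // Raw payloads whose mapping threw, preserved for later inspection.
        public final List<String> failedRaw = new ArrayList<>();
    }

    public static <T> MappingResult<T> mapBatch(List<String> rawEvents,
                                                Function<String, T> mapper) {
        MappingResult<T> result = new MappingResult<>();
        for (String raw : rawEvents) {
            try {
                result.mapped.add(mapper.apply(raw));
            } catch (RuntimeException e) {
                // Record the failure and carry on with the rest of the batch.
                result.failedRaw.add(raw);
            }
        }
        return result;
    }
}
```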

Here's a sample of what I'm seeing in production:

2015-12-04 02:49:47.013Z [Incoming Request Processor - 0] WARN  [ProcessingPool]: Uncaught exception in incoming queue reader thread.
java.util.concurrent.CompletionException: java.lang.NullPointerException: null of string of map of union in field eventParameters of com.foo.bar.CustomRecord
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[na:1.8.0_66-internal]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[na:1.8.0_66-internal]
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1629) ~[na:1.8.0_66-internal]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_66-internal]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_66-internal]
    at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_66-internal]
Caused by: java.lang.NullPointerException: null of string of map of union in field eventParameters of com.foo.bar.CustomRecord
    at org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:93) ~[avro-1.7.7.jar:1.7.7]
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:87) ~[avro-1.7.7.jar:1.7.7]
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58) ~[avro-1.7.7.jar:1.7.7]
    at io.divolte.server.AvroRecordBuffer.<init>(AvroRecordBuffer.java:63) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.AvroRecordBuffer.fromRecord(AvroRecordBuffer.java:89) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.IncomingRequestProcessor.process(IncomingRequestProcessor.java:159) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.IncomingRequestProcessor.process(IncomingRequestProcessor.java:46) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.processing.ItemProcessor.process(ItemProcessor.java:32) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.processing.ProcessingPool.lambda$microBatchingQueueDrainerWithHeartBeat$6(ProcessingPool.java:148) ~[divolte-collector-0.3.0.jar:na]
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626) ~[na:1.8.0_66-internal]
    ... 3 common frames omitted
Caused by: java.lang.NullPointerException: null
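The "null of string of map of union in field eventParameters" message is Avro's way of saying a null turned up where the writer expected a string inside that map. If nulls are legitimately possible there, one mitigation (a sketch only; the actual CustomRecord schema isn't shown here) is to declare both the field and the map's values as unions with null, so a missing value serializes as null instead of crashing the writer:

```json
{
  "name": "eventParameters",
  "type": ["null", {"type": "map", "values": ["null", "string"]}],
  "default": null
}
```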

cbsmith mentioned this issue Dec 4, 2015
@friso
Collaborator

friso commented Dec 4, 2015

Agreed on the better logging.

Letting the exception propagate all the way up to Thread.run() is intentional: we force a new instance of the processor to be created, because whatever state the existing instance had built up could be the cause of the exception. This is similar to how actor systems deal with failure.

The fact that you lose the entire micro-batch is a bit of a tradeoff: the micro-batch just happens to be the granularity of processing.

In essence, a mapping that throws exceptions is a misconfigured system, and a misconfigured system may lose some events.
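The recovery strategy described above can be sketched as follows: the micro-batch is the unit of work, an unchecked exception abandons the rest of that batch, and the processor instance is discarded so the next batch starts fresh, actor-style. A rough illustration only; the names (SupervisedPool, Processor, drain) are hypothetical, not Divolte's actual ProcessingPool API.

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch of actor-style supervision: on failure, drop both the remainder
// of the current micro-batch and the processor instance itself, since the
// state the instance accumulated may be what caused the exception.
public final class SupervisedPool {
    public interface Processor { void process(String item); }

    /** Drains batches; returns how many processor instances were created. */
    public static int drain(List<List<String>> batches, Supplier<Processor> factory) {
        int instances = 1;
        Processor p = factory.get();
        for (List<String> batch : batches) {
            try {
                for (String item : batch) {
                    p.process(item);
                }
            } catch (RuntimeException e) {
                // Whatever state p built up may be corrupt: start fresh.
                // Note the rest of this batch is lost along with it.
                p = factory.get();
                instances++;
            }
        }
        return instances;
    }
}
```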

@cbsmith
Author

cbsmith commented Dec 4, 2015

Take a look at #111; I think it avoids losing the micro-batch (though perhaps there is a case where the entire batch needs to be discarded). Sure, it's possible that some state is messed up, but more than likely that isn't the case.

@cbsmith
Author

cbsmith commented Dec 4, 2015

Actually, I'm struggling to imagine what state in the thread would actually get cleaned up by discarding it; I'm drawing a blank.

@friso
Collaborator

friso commented Dec 4, 2015

This is a conceptual thing. We don't know what state could cause the exception now, and we don't know what future changes might introduce additional failure modes. We guard against all of that by simply cleaning up and starting fresh. The real remedy is: if mapping throws exceptions, fix the mapping, not the mechanism that catches it. An exception is an exceptional condition; it shouldn't occur frequently enough to warrant handling that takes performance into account.

@friso
Collaborator

friso commented Dec 4, 2015

As an aside: it's better to start these kinds of discussions on the mailing list and decide from there whether an issue needs to be logged. It helps keep this space clean.
