This repository has been archived by the owner on Apr 22, 2022. It is now read-only.

Avoid needless data loss during schema mapping failures #110

Open
cbsmith opened this issue Dec 4, 2015 · 5 comments

Comments

@cbsmith

cbsmith commented Dec 4, 2015

Divolte should minimize data loss due to schema-mapping errors, and should log them better when they occur. Right now, a failure while mapping a single field of a single event results in the loss not only of that event, but of all other events in the same micro-batch. It also results in the exception unwinding the stack all the way up to Thread.run(), despite this comment in ItemProcessor:

        // Note: processing should not throw an unchecked
        // exception unless no further processing should
        // take place.

At the very least, the rest of the batch should be preserved. Ideally, a failure in one field shouldn't prevent other fields from being populated, and the unstructured raw data should be passed into a field that allows the case to be analyzed later and potentially even corrected.
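The per-event isolation being asked for could look roughly like this. A minimal sketch only: all names here (BatchMapper, MappingResult, mapBatch) are hypothetical illustrations, not Divolte's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of per-event failure isolation: map each event individually so
// that one bad event no longer poisons the whole micro-batch, and keep
// the raw payload of any failed event around for later analysis.
public final class BatchMapper {
    public static final class MappingResult<T> {
        public final List<T> mapped = new ArrayList<>();
        // Raw payloads whose mapping threw, preserved for later inspection.
        public final List<String> failedRaw = new ArrayList<>();
    }

    public static <T> MappingResult<T> mapBatch(List<String> rawEvents,
                                                Function<String, T> mapper) {
        MappingResult<T> result = new MappingResult<>();
        for (String raw : rawEvents) {
            try {
                result.mapped.add(mapper.apply(raw));
            } catch (RuntimeException e) {
                // Record the failure and carry on with the rest of the batch.
                result.failedRaw.add(raw);
            }
        }
        return result;
    }
}
```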

Here's a sample of what I'm seeing in production:

2015-12-04 02:49:47.013Z [Incoming Request Processor - 0] WARN  [ProcessingPool]: Uncaught exception in incoming queue reader thread.
java.util.concurrent.CompletionException: java.lang.NullPointerException: null of string of map of union in field eventParameters of com.foo.bar.CustomRecord
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[na:1.8.0_66-internal]
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[na:1.8.0_66-internal]
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1629) ~[na:1.8.0_66-internal]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_66-internal]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_66-internal]
    at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_66-internal]
Caused by: java.lang.NullPointerException: null of string of map of union in field eventParameters of com.foo.bar.CustomRecord
    at org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:93) ~[avro-1.7.7.jar:1.7.7]
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:87) ~[avro-1.7.7.jar:1.7.7]
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58) ~[avro-1.7.7.jar:1.7.7]
    at io.divolte.server.AvroRecordBuffer.<init>(AvroRecordBuffer.java:63) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.AvroRecordBuffer.fromRecord(AvroRecordBuffer.java:89) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.IncomingRequestProcessor.process(IncomingRequestProcessor.java:159) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.IncomingRequestProcessor.process(IncomingRequestProcessor.java:46) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.processing.ItemProcessor.process(ItemProcessor.java:32) ~[divolte-collector-0.3.0.jar:na]
    at io.divolte.server.processing.ProcessingPool.lambda$microBatchingQueueDrainerWithHeartBeat$6(ProcessingPool.java:148) ~[divolte-collector-0.3.0.jar:na]
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626) ~[na:1.8.0_66-internal]
    ... 3 common frames omitted
Caused by: java.lang.NullPointerException: null
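The "null of string of map of union in field eventParameters" message is Avro's way of saying a null turned up where the writer expected a string inside that map. If nulls are legitimately possible there, one mitigation (a sketch only; the actual CustomRecord schema isn't shown here) is to declare both the field and the map's values as unions with null, so a missing value serializes as null instead of crashing the writer:

```json
{
  "name": "eventParameters",
  "type": ["null", {"type": "map", "values": ["null", "string"]}],
  "default": null
}
```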

cbsmith mentioned this issue Dec 4, 2015
@friso
Collaborator

friso commented Dec 4, 2015

Agreed on the better logging.

Letting the exception propagate all the way up to Thread.run() is intentional: we force a new instance of the processor to be created, because whatever state the existing instance had built up could be the cause of the exception. This is similar to how actor systems deal with failure.

The fact that you lose the entire micro-batch is a bit of a tradeoff: the micro-batch just happens to be the granularity of processing.

In essence, a mapping that throws exceptions is a misconfigured system, and a misconfigured system may lose some events.
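The recovery strategy described above can be sketched as follows: the micro-batch is the unit of work, an unchecked exception abandons the rest of that batch, and the processor instance is discarded so the next batch starts fresh, actor-style. A rough illustration only; the names (SupervisedPool, Processor, drain) are hypothetical, not Divolte's actual ProcessingPool API.

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch of actor-style supervision: on failure, drop both the remainder
// of the current micro-batch and the processor instance itself, since the
// state the instance accumulated may be what caused the exception.
public final class SupervisedPool {
    public interface Processor { void process(String item); }

    /** Drains batches; returns how many processor instances were created. */
    public static int drain(List<List<String>> batches, Supplier<Processor> factory) {
        int instances = 1;
        Processor p = factory.get();
        for (List<String> batch : batches) {
            try {
                for (String item : batch) {
                    p.process(item);
                }
            } catch (RuntimeException e) {
                // Whatever state p built up may be corrupt: start fresh.
                // Note the rest of this batch is lost along with it.
                p = factory.get();
                instances++;
            }
        }
        return instances;
    }
}
```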

@cbsmith
Author

cbsmith commented Dec 4, 2015

Take a look at #111; I think it avoids losing the micro-batch (though perhaps there is a case where the entire batch needs to be discarded). Sure, it's possible that some state is messed up, but more than likely that isn't the case.

@cbsmith
Author

cbsmith commented Dec 4, 2015

Actually, I'm struggling to imagine what state in the thread would actually get cleaned up by discarding it; I'm drawing a blank.

@friso
Collaborator

friso commented Dec 4, 2015

This is a conceptual thing. We don't know what state could cause the exception now, and we don't know what future changes might introduce additional failure modes. We guard against all of that by simply cleaning up and starting fresh. The real remedy is: if mapping throws exceptions, fix the mapping, not the mechanism that catches it. An exception is an exceptional condition; it shouldn't occur frequently enough to warrant handling that takes performance into account.

@friso
Collaborator

friso commented Dec 4, 2015

As an aside: it's better to start these kinds of discussions on the mailing list and decide from there whether an issue needs to be logged. It helps keep this space clean.
