WARC spout to emit captures into topology (implements #755) #799
Conversation
Nice one @sebastian-nagel! Would some of the existing components of the WARC module benefit from jwarc? Our code to generate WARCs leverages storm-hdfs (and can of course deal with local files as well), so why not have a spout based on HdfsSpout instead? This way we could deal with remote and/or larger input data?
Maybe, it's on my list to try to rewrite the WARCHdfsBolt using jwarc.
After a look at HdfsSpout: the way files are marked before and moved after processing doesn't really seem applicable to web archives - those usually stay in place. We would need the same mechanism as in the first version of WarcSpout: write paths/URLs pointing to WARCs into files and put these into the input folder of HdfsSpout. I'll think about it. Maybe it's the next step.
Makes sense. Could you please update the README for the WARC module as part of the PR? Good to go otherwise, thanks.
Yes, I'll update the README. Give me a few days to do some more testing - I've already found one open point: WARC records with the payload still compressed using Content-Encoding.
- add WARCSpout to warc module README
- refactor WARCSpout
- add WARC record location to metadata: warc.file.name, warc.record.offset and warc.record.length
- upgrade jwarc dependency to 0.12.0
- after an emitted fetched tuple: sleep to avoid overflowing the topology queues (configurable via `warc.spout.emit.fetched.sleep.ms`)
- sleep 1 microsec. after "failed" fetches (HTTP status != 200)
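The record location keys listed in the commit message above could be attached to the tuple metadata roughly as in this sketch (method and parameter names are illustrative, not the actual spout code):

```java
import com.digitalpebble.stormcrawler.Metadata;

// Sketch only: attach the WARC record location to the tuple metadata using
// the keys named in the commit message. Names here are illustrative.
public class WarcRecordLocationSketch {
    static Metadata addLocation(Metadata metadata, String warcFileName,
            long offset, long length) {
        metadata.setValue("warc.file.name", warcFileName);
        metadata.setValue("warc.record.offset", Long.toString(offset));
        metadata.setValue("warc.record.length", Long.toString(length));
        return metadata;
    }
}
```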
9fa1cae to f201dff
Rebased to current master and addressed the following points:
Open points:
Testing:
if (warcReader == null && buffer.isEmpty()) {
    // input exhausted
    try {
        Thread.sleep(10);
Don't need it - Storm will handle sleeping between calls to nextTuple if you don't emit anything during a call.
Ok, will remove it - I was following the API doc of ISpout::nextTuple, which is obviously outdated.
Let me double-check in Storm's code.
See http://storm.apache.org/releases/current/Performance.html -> wait strategy.
Default values:
topology.sleep.spout.wait.strategy.time.ms | 1
topology.spout.wait.strategy | "org.apache.storm.spout.SleepSpoutWaitStrategy"
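If those defaults ever need to be changed, a minimal sketch of overriding them in the topology configuration could look like this (using the plain config keys quoted above; the values are examples only, not recommendations):

```java
import org.apache.storm.Config;

public class SpoutWaitStrategyConfigSketch {
    // Sketch: override the spout wait strategy defaults quoted above.
    static Config sketch() {
        Config conf = new Config();
        conf.put("topology.spout.wait.strategy",
                "org.apache.storm.spout.SleepSpoutWaitStrategy");
        conf.put("topology.sleep.spout.wait.strategy.time.ms", 5);
        return conf;
    }
}
```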
Ok: sleep removed.
return;

LOG.info("Reading WARC file {}", warcFileInProgress);
ReadableByteChannel warcChannel = null;
Use try-with-resources to simplify the code and remove the close section below?
Good point. Thanks!
Actually, it does not work: the channel needs to stay open; it will be closed later by warcReader.close().
ok!
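For context, a sketch of the ownership pattern discussed here, assuming jwarc's WarcReader takes over the channel and closes it in its own close():

```java
import java.io.IOException;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.netpreserve.jwarc.WarcReader;

public class WarcChannelOwnershipSketch {
    // Sketch: the channel is handed over to the WarcReader, so it must not
    // sit in its own try-with-resources here; closing the reader (possibly
    // much later, e.g. in the spout's close()) also closes the channel.
    static WarcReader open(String path) throws IOException {
        ReadableByteChannel warcChannel = Files.newByteChannel(Paths.get(path));
        return new WarcReader(warcChannel);
    }
}
```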
LOG.debug("Skipped WARC record of type {}", | ||
record.get().type()); | ||
} | ||
if (storeHTTPHeaders && "request".equals(warcType)) { |
that string "request" is used in other places in the module - have it as a constant of enum somewhere?
Ok. But maybe it's better to use instanceof WarcRequest, as the literal "request" is internal to jwarc.
Implementing our own timeout mechanism is not right; max spout pending should be the way to go. Is there more than one emit per call to nextTuple()? If so, this would explain why topology.max.spout.pending doesn't have any effect. Looking at the code this does not seem to be the case though - any idea?
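For illustration, a sketch of relying on max spout pending instead of a spout-internal timeout (the value is an example only):

```java
import org.apache.storm.Config;

public class MaxSpoutPendingSketch {
    // Sketch: let Storm cap the number of unacked tuples per spout task
    // instead of relying on spout-internal sleeps.
    static Config sketch() {
        Config conf = new Config();
        conf.setMaxSpoutPending(100); // example value
        return conf;
    }
}
```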
- remove spout-internal sleep
- upgrade jwarc dependency to 0.13.0
Hi @jnioche, I've addressed all your comments. Some example topologies which push WARC content into the topology are in sebastian-nagel/warc-crawler. The "devnull" topology reads 1.2 million WARC records in 9 min. on my laptop. However, the workers of the "rewarc" topology repeatedly crash with an OOM exception. An analysis of the heap dump shows that almost the entire memory is occupied by the overflow queue (cf. here). Some 20k tuples are emitted but not yet acked.
Thanks @sebastian-nagel, I can reproduce the issue and will look into it.
You need to emit with a tuple ID, otherwise Storm won't track the acks and fails or the number of tuples being processed.
The change above seems to have fixed the issue in my case.
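A minimal sketch of the anchored emit described above, with the URL serving as the message ID so Storm can track acks/fails and enforce topology.max.spout.pending (the tuple fields and names shown are illustrative):

```java
import com.digitalpebble.stormcrawler.Metadata;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.tuple.Values;

public class AnchoredEmitSketch {
    // Sketch: emit the fetched tuple with the URL as message ID so that
    // Storm tracks acks/fails and max spout pending takes effect.
    static void emitFetched(SpoutOutputCollector collector, String url,
            byte[] content, Metadata metadata) {
        collector.emit(new Values(url, content, metadata), url);
    }
}
```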
private int maxContentSize = -1;
private int contentBufferSize = 8192;
private long sleepEmitFetched = 0;
not used - remove?
Done.
- emit content tuple with ID (URL)
Thanks, @jnioche! I've added the missing ID to the emit call. I've only run a short test: looks good, memory consumption is low.
Thanks @sebastian-nagel, this is a great addition to our WARC module.
A simple WARC spout:
Tested so far only with a few WARC files, both local files and Common Crawl files referenced by URL:
http://commoncrawl.s3.amazonaws.com/crawl-data/...
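Purely for illustration (this is not the spout itself), a sketch showing that jwarc can stream a WARC file referenced by URL, as in the Common Crawl example above; the URL is a placeholder to be replaced by a concrete crawl-data path:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Optional;
import org.netpreserve.jwarc.WarcReader;
import org.netpreserve.jwarc.WarcRecord;

public class RemoteWarcReadSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder: substitute a concrete Common Crawl WARC path.
        String warcUrl = "http://commoncrawl.s3.amazonaws.com/crawl-data/...";
        try (InputStream in = new URL(warcUrl).openStream();
                WarcReader reader = new WarcReader(in)) {
            // Iterate over the WARC records and print their types.
            for (Optional<WarcRecord> record = reader.next(); record.isPresent();
                    record = reader.next()) {
                System.out.println(record.get().type());
            }
        }
    }
}
```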