[ray] Consistent Crashes w/ Actor-Handle-Based Implementation of Parquet File Compactor #8687
Comments
Is it possible to create a simplified reproduction script you can share?
Full reproduction script available at:
Can you mock out the data/etc dependencies so the file can be run without any additional setup?
@pdames we merged a possible fix. Is it possible for you to try again with the latest master?
@rkooo567 Fix verified on the latest build from master - 10/10 job runs succeeded using the same source dataset and cluster config that consistently crashed before. Thanks!
@pdames Happy to hear that!!
What is the problem?
I’ve been running a task-based compactor against production Parquet datasets successfully and stably for the last few days, but just had my first crash with an actor-handle-based compactor running against a relatively small ~7GB Parquet input dataset (divided into 20MB chunks).
The original pure-task-based compactor launches one distributed task per input delta in a table’s stream. Each task hashes the primary key modulo a desired number of buckets to group “like” primary keys together and saves its bucket groupings into distinct S3 files; one parallel compaction task per hash bucket index then "compacts" (i.e. sorts by sort key columns and dedupes by primary keys) these like-groupings.
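For context, the per-task hash bucketing step in the pure-task version could look roughly like the sketch below. This is illustrative only: the column name, bucket count, and S3 layout are assumptions, not the compactor's actual code.

```python
import hashlib

import pandas as pd
import ray

NUM_BUCKETS = 75  # matches the 75 hash buckets used in the failing run


def bucket_index(primary_key, num_buckets=NUM_BUCKETS):
    # Deterministic hash of the primary key modulo the bucket count, so rows
    # with "like" primary keys land in the same bucket regardless of which
    # task processed them.
    digest = hashlib.sha1(str(primary_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


@ray.remote
def group_delta_to_s3(df, task_id):
    # Pure-task variant: each task writes one small Parquet file per
    # (task, bucket) pair; a later compaction task per bucket index sorts
    # and dedupes them. Writing to s3:// paths assumes s3fs is installed.
    for idx, group in df.groupby(df["primary_key"].map(bucket_index)):
        group.to_parquet(f"s3://my-bucket/buckets/{idx}/delta_{task_id}.parquet")
```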
The actor-based compactor I was just testing tries to reduce the number of small hash bucket grouping files by passing a list of hash bucket actor handles into the task, in this case 75 handles for 75 hash buckets. Basically this:
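The original snippet isn't reproduced here, but a minimal sketch of the pattern described, using hypothetical class, method, and column names inferred from the `append`/`write` calls mentioned in this issue, might be:

```python
import hashlib

import pandas as pd
import ray


@ray.remote
class HashBucket:
    # One actor per hash bucket; buffers the DataFrame chunks appended by
    # the parallel grouping tasks.
    def __init__(self):
        self.dataframes = []

    def append(self, dataframe):
        self.dataframes.append(dataframe)

    def write(self, file_path):
        # Concatenate, dedupe by primary key, and write a single output file
        # for this bucket (sorting and S3 upload details omitted).
        result = pd.concat(self.dataframes)
        result = result.drop_duplicates(subset=["primary_key"])
        result.to_parquet(file_path)


@ray.remote
def group_delta(df, hb_actors, num_buckets=75):
    # Route each bucket's rows to its hash bucket actor instead of writing
    # many small per-bucket files to S3.
    def bucket_index(pk):
        digest = hashlib.sha1(str(pk).encode("utf-8")).hexdigest()
        return int(digest, 16) % num_buckets

    for idx, group in df.groupby(df["primary_key"].map(bucket_index)):
        hb_actors[idx].append.remote(group)


# hb_actors = [HashBucket.remote() for _ in range(75)]
# ray.get([group_delta.remote(df, hb_actors) for df in input_deltas])
```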
So, whereas each parallel task would previously just write multiple small dataframe Parquet files to S3 for the same bucket, it now calls hashBucket.append.remote(dataframe).
This step went just fine.
Once all appends have completed, I iterate through the completed hash buckets to produce one output file in S3 per bucket, like so:
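A rough sketch of that write loop, assuming `hb_actors` is the list of hash bucket actor handles from the previous step and with an illustrative output path layout:

```python
import ray


def write_buckets(hb_actors):
    # One write call per hash bucket actor; each produces a single output
    # file. The s3:// key layout below is an assumption for illustration.
    write_refs = []
    for hash_bucket_index in range(len(hb_actors)):
        file_path = f"s3://my-bucket/compacted/bucket_{hash_bucket_index}.parquet"
        write_refs.append(hb_actors[hash_bucket_index].write.remote(file_path))
    # Block until every bucket has written its output.
    ray.get(write_refs)
```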
The failure occurred at this step: it wrote several outputs successfully, then crashed with the following stack trace while calling hb_actors[hash_bucket_index].write.remote(file_path):
simple_compactor_actor_stack_trace.txt
Subsequent ray exec attempts then fail to connect to the cluster:
Ray version and other system information (Python version, TensorFlow version, OS):
Python 3.6
99cc2e2
4 r5n.8xlarge instances in us-east-1 w/ ami-0dbb717f493016a1a (Deep Learning AMI, Ubuntu 18.04, Version 27.0)
Reproduction (REQUIRED)
Reproduction relies on the files available at #8707.