executer memory issue on CDSE #595

Closed
mbuchhorn opened this issue Nov 22, 2023 · 7 comments

@mbuchhorn

I have a processing line which runs fine on the Terrascope backend, but the same processing line does not run on the CDSE backend. I always get an executor memory issue.
I have already raised the executor-memory and/or executor-memoryOverhead settings several times, but the issue remains. On Terrascope I can run it with 3G executor memory and 2G overhead. On CDSE I am already at more than 2.5 times that amount and it still fails.

last job id: j-2311213ddcc94063ae7f28b03dad3b3e

Job-settings for CDSE (last test):
OPENEO_EXTRACT_JOB_OPTIONS = {
"driver-memory": "4G",
"driver-memoryOverhead": "8G",
"driver-cores": "2",
"executor-memory": "4G",
"executor-memoryOverhead": "8G",
"executor-cores": "2",
"max-executors": "50",
"soft-errors": "true"
}

Job IDs that also failed:
j-2311175da37e4fa997fc1e6c68007d37
j-231119e0c85b45368f81875cd73c6bc6
j-231119cec93544f08d81c6ee73580916
j-23112146b0c0410c897c563140644f5a

@jdries jdries self-assigned this Nov 24, 2023
Contributor

jdries commented Nov 24, 2023

@mbuchhorn executor-cores needs to be set to 1. With the current value of 2, it will try to run 2 tasks in the same executor, so you roughly need twice the memory.
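
As a rough illustration of that rule of thumb, the sketch below divides the executor heap by the number of concurrent tasks. It assumes the heap is shared evenly between tasks, which is a simplification rather than something the backend guarantees, and the function name `heap_per_task_gib` is made up for this example.

```python
def heap_per_task_gib(executor_memory_gib: float, executor_cores: int) -> float:
    """Rough JVM heap share per concurrently running task (even split assumed)."""
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    return executor_memory_gib / executor_cores


# Settings from the failing job: a 4G heap shared by 2 concurrent tasks.
print(heap_per_task_gib(4, 2))  # 2.0
# The same heap with executor-cores set to 1.
print(heap_per_task_gib(4, 1))  # 4.0
```

With executor-cores at 2, each task in the failing job effectively had about half of the configured 4G to work with.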

jdries commented Nov 24, 2023

Ok, the jobs fail in the final task, where it seems that all data is loaded in a single task on one executor.
The crash says 'JVM OOM', so we need to focus on increasing executor-memory rather than executor-memoryOverhead.

I'll try running it myself with adjusted settings.
I also need to figure out if it can do smarter partitioning, to avoid putting everything into 1 task. The main reason I see is that it's probably an output with only 4 tiles and 1 timestamp, but many bands.

jdries commented Nov 24, 2023

The logging tells the same story:
Computed band count 237 from metadata of OpenEORasterCube[199] at RDD at ContextRDD.scala:32

The linear_scale_range trick would also help here by casting to a smaller datatype, but we tried that before and it caused problems with setting the ranges correctly.
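
To make the trade-off concrete, here is a minimal sketch of the scaling arithmetic and of the per-tile memory footprint. The `linear_scale_range` function follows the openEO process definition (clip to the input range, then map linearly); it is not the backend's implementation. The tile size of 256 pixels and the 0..250 output range are assumptions for illustration only.

```python
def linear_scale_range(x, input_min, input_max, output_min=0.0, output_max=1.0):
    """Clip x to [input_min, input_max], then map it linearly to the output range."""
    clipped = min(max(x, input_min), input_max)
    scale = (output_max - output_min) / (input_max - input_min)
    return (clipped - input_min) * scale + output_min


def tile_bytes(bands, tile_size, bytes_per_sample):
    """Uncompressed size of one multiband tile in bytes."""
    return bands * tile_size * tile_size * bytes_per_sample


BANDS = 237      # band count reported in the log line above
TILE_SIZE = 256  # hypothetical tile edge length in pixels

print(tile_bytes(BANDS, TILE_SIZE, 4) / 2**20)  # float32: 59.25 MiB
print(tile_bytes(BANDS, TILE_SIZE, 1) / 2**20)  # uint8: 14.8125 MiB

# Values outside the input range are clipped, which is why the ranges matter.
print(linear_scale_range(0.5, 0.0, 1.0, 0, 250))  # 125.0
print(linear_scale_range(1.7, 0.0, 1.0, 0, 250))  # 250.0
```

A badly chosen input range silently saturates values, which matches the kind of range problem mentioned above.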

Found something weird: based on the code, I would expect 4x4 tiles == 16 tasks rather than 1 task in stage 70, which does the statistics calculation.
image

jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Nov 24, 2023

jdries commented Nov 24, 2023

In my own attempt, I did get the expected number of tasks, so maybe my previous screenshot was misleading because of the retries:
image

jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Nov 24, 2023

jdries commented Nov 24, 2023

The last merge_cubes process was basically setting a partitioner that could explain this OOM. I added a potential workaround, which still has to be rolled out.

jdries commented Nov 24, 2023

It seems like the fix worked: I now have 16 tasks at the end instead of one. The only thing I still need is a bit more driver memory, as I was running with default options.
Merging cubes with partitioners: Some(SpacePartitioner(KeyBounds(SpatialKey(0,0),SpatialKey(3,3)))) - None - many band case detected: true
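
The difference between 1 task and 16 tasks can be sketched in plain Python. This is only an illustration of the idea, not the geotrellis partitioner: the helper names `keys_in_bounds` and `partition` are hypothetical, and the modulo assignment is an assumed toy scheme.

```python
from collections import defaultdict
from itertools import product


def keys_in_bounds(min_key, max_key):
    """List all (col, row) keys inside the inclusive key bounds."""
    (min_col, min_row), (max_col, max_row) = min_key, max_key
    return list(product(range(min_col, max_col + 1), range(min_row, max_row + 1)))


def partition(keys, num_partitions):
    """Assign each key to a partition by its list position (toy scheme)."""
    parts = defaultdict(list)
    for index, key in enumerate(keys):
        parts[index % num_partitions].append(key)
    return dict(parts)


# KeyBounds(SpatialKey(0,0), SpatialKey(3,3)) from the log line above.
keys = keys_in_bounds((0, 0), (3, 3))
single = partition(keys, 1)
spread = partition(keys, 16)

print(len(keys))                               # 16
print(max(len(v) for v in single.values()))    # 16 tiles in one task
print(max(len(v) for v in spread.values()))    # 1 tile per task
```

With a single partition, one task has to hold all 16 many-band tiles at once, which is consistent with the JVM OOM seen in the final task.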

jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Nov 25, 2023
jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Nov 25, 2023
jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Nov 25, 2023
jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Nov 26, 2023

jdries commented Nov 26, 2023

Job worked, cost is down to 76 credits thanks to achieving higher parallelism and lower memory use!
The compressed output tiff is 3.3GB, hence the need for a lot of driver memory. Download will probably take longer than processing now.

The key things are these two extra parameters in the context of apply_dimension target='bands':

"context": {
          "parallel": true,
          "TileSize": 128
        },
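
For orientation, this is roughly where such a context could sit in a process graph node, written as a Python dict. It assumes that "target" above refers to the `target_dimension` argument of the openEO `apply_dimension` process; the upstream node id, the source dimension `"t"` and the empty child process are placeholders, not values from the actual job.

```python
import json

apply_dimension_node = {
    "process_id": "apply_dimension",
    "arguments": {
        "data": {"from_node": "loadcollection1"},  # hypothetical upstream node id
        "dimension": "t",                          # assumed source dimension
        "target_dimension": "bands",
        "process": {"process_graph": {}},          # child process elided
        # The two extra parameters quoted above.
        "context": {"parallel": True, "TileSize": 128},
    },
}

print(json.dumps(apply_dimension_node["arguments"]["context"]))
# {"parallel": true, "TileSize": 128}
```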

Next step is to make this work without those parameters, after validating that the output is still correct.

Image


job_options = {
    "driver-memory": "8G",
    "driver-memoryOverhead": "5G",
    "driver-cores": "1",
    "executor-cores": "1",
    "executor-request-cores": "800m",
    "executor-memory": "1500m",
    "executor-memoryOverhead": "2500m",
    "max-executors": "25",
    "executor-threads-jvm": "7",
    "logging-threshold": "info"
}
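
To put the lower memory use in numbers, the sketch below compares the maximum executor footprint of the failing settings with the working ones. It assumes that `m` means MiB and `G` means GiB (the Spark convention), and that all `max-executors` are actually allocated, so the result is an upper bound rather than a measurement. The helper names are made up for this example.

```python
import re

_UNIT_TO_MIB = {"m": 1, "g": 1024}


def to_mib(value: str) -> int:
    """Parse a memory string such as '1500m' or '4G' into MiB."""
    match = re.fullmatch(r"(\d+)([mMgG])", value)
    if match is None:
        raise ValueError(f"unsupported memory value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_MIB[unit.lower()]


def max_executor_footprint_mib(options: dict) -> int:
    """Upper bound: (heap + overhead) per executor times max-executors."""
    per_executor = to_mib(options["executor-memory"]) + to_mib(
        options["executor-memoryOverhead"]
    )
    return per_executor * int(options["max-executors"])


failing = {"executor-memory": "4G", "executor-memoryOverhead": "8G", "max-executors": "50"}
working = {"executor-memory": "1500m", "executor-memoryOverhead": "2500m", "max-executors": "25"}

print(max_executor_footprint_mib(failing))  # 614400
print(max_executor_footprint_mib(working))  # 100000
```

Under these assumptions the working settings cap out at about one sixth of the executor memory that the failing settings could request.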

jdries added a commit to Open-EO/openeo-geotrellis-extensions that referenced this issue Dec 5, 2023
@jdries jdries closed this as completed Dec 6, 2023