Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFdiffusion-1.1.0-foss-2022a-CUDA-11.7.0.eb with models and missing dependencies #18359

Open
wants to merge 24 commits into
base: develop
Choose a base branch
from

Conversation

ThomasHoffmann77
Copy link
Contributor

@ThomasHoffmann77 ThomasHoffmann77 commented Jul 20, 2023

@lexming
Copy link
Contributor

lexming commented Jul 23, 2023

Test report by @lexming
FAILED
Build succeeded for 6 out of 10 (9 easyconfigs in total)
node200.hydra.os - Linux CentOS Linux 7.9.2009, x86_64, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/lexming/92ef10c7ef1ab4a78e49ae586dd4cada for a full test report.

@ThomasHoffmann77
Copy link
Contributor Author

Test report by @lexming FAILED Build succeeded for 6 out of 10 (9 easyconfigs in total) node200.hydra.os - Linux CentOS Linux 7.9.2009, x86_64, Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (broadwell), Python 3.6.8 See https://gist.github.com/lexming/92ef10c7ef1ab4a78e49ae586dd4cada for a full test report.
@lexming I updated the checksums for the DGL easyconfigs.

{'dgl-1.1.1.tar.gz': '076026a7818f2396056252269d7960c561840bb0fe89494e11f8d6c524891c49'},
{'metis-5.1.1-DistDGL-v0.5.tar.gz': 'cedf0b32d32a8496bac7eb078b2b8260fb00ddb8d50c27e4082968a01bc33331'},
{'GKlib-METIS-v5.1.1-DistDGL-0.5.tar.gz': '52aa0d383d42360f4faa0ae9537ba2ca348eeab4db5f2dfd6343192d0ff4b833'},
{'tensorpipe-20230206.tar.gz': '97dcc6ab648d38f62ddc9d2a33de2a789c967e43e819481f817341881912a524'},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checksums for git checkouts are not stable, so should be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smoors I was not aware of the unstable checksums. Is this the case for any commit or only for recursive checkouts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the case whenever you are not downloading a tarball directly, but creating the tarball yourself, so any git clone

remove unstable checksums for tensorpipe, pcg, and libxsmm
remove unstable checksums for tensorpipe, pcg, and libxsmm
Comment on lines 76 to 77
# uncomment for installing models and schedules as components.
# Then comment out _get_schedules and _get_models in postinstallcmds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between the components and _get_schedules/_get_models?
i would prefer to use the best method and remove the other one to keep things clean

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smoors Using components would archive the models and would allow to reinstall RFdiffusion even if the models were not available for download any more. I'd prefer this option since RFdiffusion is not usable without the models.
However, I understand @lexming's concerns about to store the models data (3.9GB) at least twice, in the module directory and the sources directory.
Therefore I kept the components approach commented out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which solution do you prefer and which should I delete?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, i agree with you that having to re-download on reinstall is suboptimal, but i also agree with @lexming that storing the models twice is suboptimal.

a possible solution is to create a custom easyblock that checks the installdir and does not delete/re-download the models if they are already present and if their checksums are correct.
to make this configurable, we could define an extra parameter in the easyblock, e.g. keep_models_if_installed.
this is similar to easyblock Tarball, which allows merging an installation into an existing installation instead of wiping it first.

what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For long term it is probably better to place the data into a separate module RFdiffusion_data, which can be reused for minor updates of RFdiffusion, if the data does not change.
It might be good to have a generic easyblock for such data modules, because we can expect more of them with all the upcoming AI packages. (see also #19157 Model-Angelo_data, #19240 relion-classranker_data + relion-blush_data).
There was also a brief discussion at https://github.com/easybuilders/easybuild/wiki/Conference-call-notes-20230719

Copy link
Contributor

@smoors smoors Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's indeed a better solution long-term.
i'll try to give a stab at a generic data easyblock, but feel free to do this yourself if you're up for it.

in the meantime, i suggest to already go ahead with creating a separate easyconfig for RFdiffusion_data. you should be able to use the Bundle easyblock (which does not install anything), and download the models in a postinstallcmds, as you have done here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look on RFdiffusion_data in the next few days. RFdiffusion-1.1.0_data_paths.patch requires an update allowing to read the models and schedules from elsewhere defined in a environment variable.

@easybuilders easybuilders deleted a comment from boegelbot Nov 20, 2023
@easybuilders easybuilders deleted a comment from boegelbot Nov 20, 2023
@easybuilders easybuilders deleted a comment from boegelbot Nov 20, 2023
@easybuilders easybuilders deleted a comment from boegelbot Nov 20, 2023
@easybuilders easybuilders deleted a comment from boegelbot Nov 20, 2023
@pavelToman
Copy link
Contributor

Hi! I juts start to work on vscentrum/vsc-software-stack#283 pipeline of RFdiffusion - ProteinMPNN -AlphaFold. I'd like to use RFd with data models either. Do you working on that or you stop?

@ThomasHoffmann77
Copy link
Contributor Author

Hi! I juts start to work on vscentrum/vsc-software-stack#283 pipeline of RFdiffusion - ProteinMPNN -AlphaFold. I'd like to use RFd with data models either. Do you working on that or you stop?

@pavelToman I am currently working on AlphaFold/2.3.2-foss-2023a-CUDA-12.1.1. see also #19841


sanity_check_commands = [
"run_inference.py --help",
"run_inference.py 'contigmap.contigs=[150-150]' " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command will create two output directories in %(installdir)s, is there a way to avoid this?

],
}),
]

sanity_pip_check = True
postinstallcmds = [
'cp -rpP config examples %s/%%(namelower)s' % _pysp,
Copy link
Contributor

@WilleBell WilleBell Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'cp -rpP config examples %s/%%(namelower)s' % _pysp,
'cp -rpP config examples helper_scripts %s/%%(namelower)s' % _pysp,
'cp -pP %%(namelower)s/inference/sym_rots.npz %s/%%(namelower)s/inference/' % _pysp,

RFdiffusion needs sym_rots.npz to perform symmetry operations which is needed to use different symmetry settings.

@garadar
Copy link
Contributor

garadar commented Sep 19, 2024

From https://github.com/easybuilders/easybuild-easyconfigs/pull/18359/files#diff-03d4d74a13067c23b1773097b070efb41c27ac715ad10f07b0fd98775ba1a144

    {
        'source_urls': ['https://github.com/KarypisLab/GKlib/archive'],
        'download_filename': 'METIS-v6.1.1-DistDGL-0.5.tar.gz',
        'filename': 'GKlib-METIS-v6.1.1-DistDGL-0.5.tar.gz',
        'extract_cmd': "tar -C %(namelower)s-%(version)s/third_party/METIS/GKlib --strip-components=2 -xf %s",
    },

It seems the source v6.1.1-DistDGL-0.5 is not available : https://github.com/KarypisLab/GKlib/tags

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants