[draft] proposed fix for incorrect mask application in FSDP #1807

Merged · 4 commits from fsdp-masking-patch into refactor_hf_trainer on Nov 14, 2023

Conversation

@bfineran (Contributor) commented on Oct 31, 2023

In the current layer masking implementation, parameters are stored at initialization and the saved references are used to apply masks on each modifier update.

In FSDP mode, applying masks on top of these stored references does update the referenced tensors, but those tensors no longer have any effect on the FSDP module.

This fix implements the simple flow @dsikka used in the sparsify MVP: apply the masks over a reference to the current FSDP module at update time (i.e. instead of applying masks on the saved references to the layers, the masks are applied directly over fresh references pulled from the model).
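
Roughly, the idea looks like the sketch below (the helper name and the masks dict are illustrative, not the exact code in this PR): masks are written into parameters resolved from the live model when the update fires, rather than into references cached earlier.

import torch

def apply_masks(model: torch.nn.Module, masks: dict) -> None:
    # illustrative sketch: `masks` maps submodule names
    # (e.g. "model.layers.0.self_attn.q_proj") to boolean keep-masks
    with torch.no_grad():
        for name, mask in masks.items():
            # fresh reference, resolved at update time instead of cached at init
            layer = model.get_submodule(name)
            layer.weight.mul_(mask.to(device=layer.weight.device, dtype=layer.weight.dtype))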

Handing off to @Satrat.

Confirmation of fix

Test command:
accelerate launch --config_file fsdp_config.yaml test_trainer.py

Snippet of output with the sparsity log (previously 0.0 for all sparsity values):

2023-10-31 16:22:18 sparseml.transformers.finetune.session_mixin INFO     Finalized SparseML recipe argument applied to the model
2023-10-31 16:22:18 sparseml.transformers.finetune.session_mixin INFO     Sparsification info for ./obcq_deployment: 15191712 total params. Of those there are 15187968 prunable params which have 15.291973225121358 avg sparsity.
2023-10-31 16:22:18 sparseml.transformers.finetune.session_mixin INFO     sparse model detected, all sparsification info: {"params_summary": {"total": 15191712, "sparse": 2322540, "sparsity_percent": 15.288204515725418, "prunable": 15187968, "prunable_sparse": 2322540, "prunable_sparsity_percent": 15.291973225121358, "quantizable": 15187968, "quantized": 0, "quantized_percent": 0.0}, "params_info": {"_fsdp_wrapped_module.model.layers.0.self_attn.q_proj.weight": {"numel": 82944, "sparsity": 0.5000361800193787, "quantized": false}, "_fsdp_wrapped_module.model.layers.0.self_attn.k_proj.weight": {"numel": 82944, "sparsity": 0.5000361800193787, "quantized": false}, "_fsdp_wrapped_module.model.layers.0.self_attn.v_proj.weight": {"numel": 82944, "sparsity": 0.5000361800193787, "quantized": false}, "_fsdp_wrapped_module.model.layers.0.self_attn.o_proj.weight": {"numel": 82944, "sparsity": 0.5000361800193787, "quantized": false}, "_fsdp_wrapped_module.model.layers.0.mlp.gate_proj.weight": {"numel": 221184, "sparsity": 0.5000135898590088, "quantized": false}, "_fsdp_wrapped_module.model.layers.0.mlp.up_proj.weight": {"numel": 221184, "sparsity": 0.5000135898590088, "quantized": false}, "_fsdp_wrapped_module.model.layers.0.mlp.down_proj.weight": {"numel": 221184, "sparsity": 0.0, "quantized": false}, "_fsdp_wrapped_module.model.layers.1.self_attn.q_proj.weight": {"numel": 82944, "sparsity": 0.5000361800193787, "quantized": false}, "_fsdp_wrapped_module.model.layers.1.self_attn.k_proj.weight": {"numel": 82944, "sparsity": 0.5000361800193787, "quantized": false}, "_fsdp_wrapped_module.model.layers.1.self_attn.v_proj.weight": {"numel": 82944, "sparsity": 0.5000361800193787, "quantized": false}, "_fsdp_wrapped_module.model.layers.1.self_attn.o_proj.weight": {"numel": 82944, "sparsity": 0.5000361800193787, "quantized": false}, "_fsdp_wrapped_module.model.layers.1.mlp.gate_proj.weight": {"numel": 221184, "sparsity": 0.5000135898590088, "quantized": false}, "_fsdp_wrapped_module.model.layers.1.mlp.up_proj.weight": {"numel": 221184, "sparsity": 0.5000135898590088, "quantized": false},

Update 11/1/23

The above fix works for n_gpu=1 but not for multi-GPU. The latest commit should fix the multi-GPU issue. Essentially, we were initializing the model into our SparseSession before it was wrapped by FSDP. To fix this, I added a new on_train_begin callback that replaces the session's PyTorch model with the FSDP-wrapped one.
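
A rough sketch of what that callback could look like (the class name and session.set_model are assumptions, not the exact SparseML API; the real implementation may obtain the wrapped model differently):

from transformers import TrainerCallback

class SessionModelSwapCallback(TrainerCallback):
    def __init__(self, trainer, session):
        self.trainer = trainer  # the HF Trainer driving finetuning
        self.session = session  # assumed handle to the sparsification session

    def on_train_begin(self, args, state, control, **kwargs):
        # by the time training begins, trainer.model_wrapped should point at the
        # FSDP-wrapped module; re-register it so masks hit the live FSDP model
        self.session.set_model(self.trainer.model_wrapped)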

Using model.apply as implemented in the initial fix works because FSDP overrides the module's apply function to gather the full parameters before applying it; see https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.apply.

In order to access and update the underlying model outside of apply, we need to use the summon_full_params context manager; see https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.summon_full_params. This fixed the issue with reading out the sparsities after training:

from torch.distributed.fsdp import FullyShardedDataParallel

with FullyShardedDataParallel.summon_full_params(self.model):
    self.log_model_sparsification()

We may need to implement this idea in other areas of the codebase.

Remaining things to wrap up:

  • Currently this branch will only work with FSDP because of the summon_full_params call; update to check whether this context is actually needed before entering it (one possible shape is sketched after this list).
  • Having some issues with the script hanging when we try to save the model. Haven't had time to debug this, but https://huggingface.co/docs/accelerate/usage_guides/fsdp#saving-and-loading may be a good reference here. We might need to wrap this function so that only one process calls it.
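
For the first item, a small hypothetical wrapper (not code from this branch) could enter summon_full_params only when the model is actually FSDP-wrapped:

import contextlib

from torch.distributed.fsdp import FullyShardedDataParallel

def maybe_summon_full_params(model):
    # gather full parameters only when the model is FSDP-wrapped; otherwise
    # return a no-op context so the same call path works without FSDP
    if isinstance(model, FullyShardedDataParallel):
        return FullyShardedDataParallel.summon_full_params(model)
    return contextlib.nullcontext()

# usage:
# with maybe_summon_full_params(self.model):
#     self.log_model_sparsification()

For the save hang, the second bullet's own suggestion (gathering the full state dict and having a single process perform the save) seems like the likely direction.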

@bfineran bfineran requested a review from Satrat October 31, 2023 20:28
@bfineran bfineran marked this pull request as draft October 31, 2023 20:36
@Satrat Satrat marked this pull request as ready for review November 14, 2023 01:26
@Satrat Satrat merged commit f220740 into refactor_hf_trainer Nov 14, 2023
@Satrat Satrat deleted the fsdp-masking-patch branch November 14, 2023 01:26
bfineran pushed a commit that referenced this pull request Dec 7, 2023
* WIP

* WIP trainer refactor

* loss updates and removing manager references

* WIP generation script

* events updating properly

* fix structure init

* running for text_generation

* dataloaders and cleaning up finetuning script

* reorganizing and fsdp

* fix gradient bug

* add fsdp config

* clean up for debugging

* clean up textgen script

* model/recipe save and loading

* quality and fixing tests

* fix test

* fix recipe load

* [Finetuning] Model/Recipe reloading and Checkpoints (#1795)

* Initial commit

* Add end to end tests

* Add e2e tests for constant pruning modifier

* Move imports inside the test functions so
that torch isn't imported unless running the tests

* Update setup.py to not run modifier tests unless pytorch is specified

* [Bugfix] .dict() method on Recipe (#1753)

* Bugfix .dict() method on Recipe

* Remove extraneous local test, [faulty commit]

* [modifier refactor] Add serialization tests (#1755)

* Add serialization tests

* Clean up

* Keep original stage and group names
Clean up _get_yaml_dict

* fix comment

* Typo

* [Unit Tests][Modifier Refactor] (#1756)

* Move valid recipes to a helper file
Add tests for session.py

* Increase test coverage of src/sparseml/core/session.py
to 100%
Run Style
Add logs to .gitignore

* Increase coverage of tests/sparseml/core/test_state.py
to 100%

* add tests for lifecycle/event.py

* Increase code coverage of lifecycle/event to
100%

* increase lifecycle/session.py code coverage to 93%

* Address review comments from @Satrat

* Address review comments on 1752 (#1772)

Update makefile to only ignore *pytorch.py files in modifier dir
Fix order in test
Add regex to makefile
Add helper function to determine if torch tests should be run
Check masks
Make transformers import optional in sparsegpt.py

* Fix merge conflict

* Add more tests to check valid modifiers are created (#1774)

* [Bug][ConstantPruningModifier] Fix mask de register bug (#1773)

* Fix mask de-register logic

* forgot to remove commented out line

* Move tests inside pytorch directory as requested

* Fix session reset (#1790)

* save recipe with model

* saving/loading/checkpointing

* clean up structure initialization

* clean up end stages

* style

* fixing test failures

* fix test file

---------

Co-authored-by: rahul-tuli <[email protected]>

* style

* add init for modifiers util

* consolidate classes

* cleaning up mixin classes and precision callback

* specific train/eval fn

* clean print statements

* Additional Datasets for Finetuning (#1803)

* wip support for additional datasets

* support for splits and load_dataset args

* clean up

* c4 and op working with splits

* load less data, run faster

* [draft] proposed fix for incorrect mask application in FSDP (#1807)

* [draft] proposed fix for incorrect mask application in FSDP

* fix for multi-gpu

* fix for hanging model save

* clean up

---------

Co-authored-by: Sara Adkins <[email protected]>

* clean up logging

* adding transformers GHA tests

* clean up GHA

* clean up GHA

* Docstrings + Testing for Finetuning (#1832)

* initial commit

* docstrings for dataset registry

* docstrings for helpers and clean reload_model_state

* import fix

* session_mixin docstrings

* session mixin documentation and CLI hooks

* cleaning up CLI calls

* WIP unit tests

* tests for dataset loading

* session mixin unit tests

* addressing PR comments

* fix unit test

* more unit test fixes

* Distillation Support for Finetuning (#1865)

* initial commit

* propagate teacher to modifier

* cherrypick distil changes

* WIP for distillation loss fixes

* WIP fixing distillation

* fixing kd_wrapper issues

* fixing comparison reference issue

* cleanup for PR

* more cleanup

* fixing finalization sync

* fix for saving

* update example fsdp

* fixing unit tests

* update fsdp config

* update fsdp config

* remove copied function

* Misc Finetuning Checkpointing Fixes (#1881)

* initial commit

* speeding up fsdp, fixing (some) checkpoint bugs

---------