
Merge LoCo with Zero++ #6730

Open
wants to merge 1 commit into base: master

Conversation

XingyuXie

Integration of LoCo Method into ZeRO++

Overview

This PR integrates the LoCo method, as outlined in this paper, into the ZeRO++ framework of DeepSpeed. The key enhancement is applying error-feedback compensation to the 4-bit quantized gradients before communication. This improves pre-training loss without additional time overhead, though it requires extra GPU memory; the size of this increase depends on model size and training configuration.
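
For readers unfamiliar with error feedback, the sketch below shows the core idea in plain PyTorch: the quantization residual from one step is added back onto the next step's gradient before it is quantized again. The toy 4-bit quantizer and the `ErrorFeedbackCompressor` class here are simplified stand-ins for illustration only, not the fused ZeRO++/LoCo kernels added by this PR.

```python
import torch

def quantize_4bit(x: torch.Tensor):
    """Toy symmetric per-tensor 4-bit quantization (levels -7..7)."""
    scale = x.abs().max().clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(x / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize_4bit(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

class ErrorFeedbackCompressor:
    """Keeps a per-tensor residual so the error introduced by quantization
    is fed back into the next step's gradient instead of being lost."""
    def __init__(self):
        self.residual = {}

    def compress(self, name: str, grad: torch.Tensor):
        # Compensate with the residual left over from the previous step.
        compensated = grad + self.residual.get(name, torch.zeros_like(grad))
        q, scale = quantize_4bit(compensated)
        # New residual = whatever this step's quantization discarded.
        self.residual[name] = compensated - dequantize_4bit(q, scale)
        return q, scale

# Usage: compress a gradient right before it is communicated.
compressor = ErrorFeedbackCompressor()
grad = torch.randn(1024)
q, scale = compressor.compress("layer0.weight", grad)
```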

Experimental Results

We conducted pre-training experiments using the Llama2 architecture, adjusting the number of layers and hidden size. The experiments included:

  • A smaller-scale model with 0.8B parameters trained on 30B tokens.
  • A larger-scale model with 8B parameters trained on 5B tokens.

The training data was sampled from Redpajama-V2.

Findings:

  • Smaller Models (0.8B parameters): Significant gains were observed when applying the LoCo method.
  • Larger Models (8B parameters): The gains were present but less pronounced. This could be due to:
    1. Relatively smaller data volume.
    2. Lower pre-training loss for larger models, making significant improvements harder to achieve.

However, even a smaller pre-training loss gap in larger models can translate to meaningful gains in downstream tasks.

Example Script

For reference, the run.sh script used for the 8B-parameter, 5B-token experiment is attached. The experiment was conducted on the DeepSpeed-Megatron platform.
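
For context, ZeRO++ quantized-gradient communication is enabled through the `zero_optimization` section of the DeepSpeed config. The snippet below is only a sketch built as a Python dict using existing ZeRO++ options; the LoCo-specific option added by this PR is intentionally omitted, since its exact name and defaults should be taken from the PR diff rather than assumed here.

```python
# Sketch of a DeepSpeed config enabling ZeRO++ quantized gradient communication.
# zero_quantized_gradients / zero_quantized_weights / zero_hpz_partition_size
# are existing ZeRO++ options; the LoCo toggle from this PR is not shown.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_gradients": True,   # 4-bit gradient communication (qgZ)
        "zero_quantized_weights": True,     # quantized weight all-gather (qwZ)
        "zero_hpz_partition_size": 8,       # hierarchical partitioning (hpZ)
    },
}
```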

Acknowledgments

Special thanks to @GuanhuaWang for ongoing communication and guidance throughout this work.


We appreciate your consideration of this PR and welcome any feedback or questions!

@XingyuXie
Author

@XingyuXie please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

@loadams requested review from GuanhuaWang and removed the review request for awan-10 on November 12, 2024.