deepspeed #288

Merged · 67 commits · Oct 24, 2023

Commits
d1ea793
deepspeed
haqishen Jul 18, 2023
5ef4792
shard
haqishen Jul 19, 2023
b5fac57
full param deepspeed works by this commit
haqishen Jul 25, 2023
0f7086b
offload optimizer & documentation
haqishen Jul 26, 2023
687c456
format & fix save deepspeed weight
haqishen Aug 2, 2023
3b7ff0d
format & update save_checkpoint
haqishen Aug 3, 2023
105a849
update pipfile
haqishen Aug 4, 2023
f583395
update pipfile
haqishen Aug 5, 2023
cbc50fb
zero init for transformers
haqishen Aug 9, 2023
ffee1c0
add some new config
haqishen Aug 9, 2023
f40ef52
fix bug
haqishen Aug 9, 2023
9cd37ab
min 1e6
haqishen Aug 10, 2023
69e9eb1
update deepspeed config
haqishen Aug 17, 2023
1415cdc
Merge main to deepspeed
haqishen Aug 17, 2023
9db3fbd
Merge branch 'main' into deepspeed
haqishen Aug 17, 2023
b0df016
Update requirements.txt
haqishen Aug 17, 2023
d30b51c
remove duplicate code
haqishen Aug 18, 2023
a4b76c3
Merge branch 'deepspeed' of github.com:h2oai/h2o-llmstudio into deeps…
haqishen Aug 18, 2023
67629ee
throw warning when compile w/ deepspeed
haqishen Aug 18, 2023
48d7f71
black
haqishen Aug 18, 2023
d1efef5
integrate deepspeed into wrap_model_distributed
haqishen Aug 18, 2023
d6b0748
remove unuse code
haqishen Aug 18, 2023
3f89359
style
haqishen Aug 18, 2023
5c253f2
fix bug
haqishen Aug 18, 2023
9ff717f
fix bug
haqishen Aug 18, 2023
405b207
Merge branch 'main' into deepspeed
haqishen Aug 18, 2023
b3495d4
max token len to 16k
haqishen Aug 18, 2023
7b78538
deepspeed save lora
haqishen Aug 21, 2023
892f47c
update get optimizer
haqishen Aug 21, 2023
f2dfb89
fix check disk
haqishen Aug 21, 2023
efe77bb
Merge branch 'main' into deepspeed
haqishen Aug 23, 2023
d297ec9
comment out offload CPU
haqishen Aug 28, 2023
a6781f1
Merge branch 'deepspeed' of github.com:h2oai/h2o-llmstudio into deeps…
haqishen Aug 28, 2023
e6e46dc
Merge branch 'main' into deepspeed
haqishen Aug 28, 2023
e16cab8
Pipfile.lock
haqishen Aug 28, 2023
65a1b2d
Merge branch 'main' into deepspeed
haqishen Aug 28, 2023
32b16a5
Update requirements.txt
haqishen Aug 28, 2023
eb4c990
Merge branch 'main' into deepspeed
haqishen Aug 28, 2023
e36fada
make black
haqishen Aug 29, 2023
bc4c239
Merge branch 'deepspeed' of github.com:h2oai/h2o-llmstudio into deeps…
haqishen Aug 29, 2023
b5e59e9
add default
haqishen Aug 29, 2023
24eeb16
minor fix
haqishen Sep 4, 2023
b9e5934
minor fix
haqishen Sep 4, 2023
a296cca
minor fix
haqishen Sep 4, 2023
11a4b8d
fix val loader
haqishen Sep 5, 2023
3efa2c9
potential val loader fix
psinger Sep 7, 2023
14bc17e
update
psinger Sep 8, 2023
0f40322
merge
psinger Sep 8, 2023
bd1e134
lock
psinger Sep 8, 2023
6f81182
Update requirements.txt
psinger Sep 8, 2023
62fc9c5
improve model saving for deepspeed
haqishen Sep 26, 2023
dbbbcdf
solved INFLIGHT problem
haqishen Sep 26, 2023
c023d19
update doc
haqishen Sep 26, 2023
2785f9f
deepspeed default push to hub by cpu
haqishen Sep 28, 2023
aa17c0b
Revert "improve model saving for deepspeed"
haqishen Oct 5, 2023
4491c16
remove unuse code
haqishen Oct 5, 2023
fa031f2
Merge branch 'main' into deepspeed
haqishen Oct 10, 2023
9337741
Update requirements.txt
haqishen Oct 10, 2023
263f48a
deepspeed==0.11.1
haqishen Oct 19, 2023
83429b6
Merge branch 'main' into deepspeed
haqishen Oct 19, 2023
882631a
Update requirements.txt
haqishen Oct 19, 2023
368f0af
temp fix for deepspeed slow gen
haqishen Oct 20, 2023
011e269
Merge branch 'deepspeed' of github.com:h2oai/h2o-llmstudio into deeps…
haqishen Oct 20, 2023
d5dbbfb
style
haqishen Oct 20, 2023
5b8499c
style
haqishen Oct 20, 2023
07bb4b2
fix
psinger Oct 24, 2023
91562e9
Merge branch 'main' into deepspeed
haqishen Oct 24, 2023
1 change: 1 addition & 0 deletions Pipfile

@@ -50,6 +50,7 @@ tiktoken = "==0.5.1"
 hf-transfer = "==0.1.3"
 peft = "==0.5.0"
 azure-storage-file-datalake = ">=12.12.0"
+deepspeed = "==0.11.1"
 keyring = "==24.2.0"
 
 [dev-packages]
823 changes: 464 additions & 359 deletions Pipfile.lock

Large diffs are not rendered by default.

Seven new one-line tooltip files (file names not rendered):

@@ -0,0 +1 @@
+Whether to offload the optimizer to CPU to save more GPU RAM during training. Note that turning on offload_optimizer will further slow down training.

@@ -0,0 +1 @@
+Number of elements reduced/allreduced at a time. Limits the memory required for allgather with large model sizes. Smaller values use less memory but slow down training.

@@ -0,0 +1 @@
+The maximum number of parameters resident per GPU before releasing. Smaller values use less memory but slow down training.

@@ -0,0 +1 @@
+Do not release a parameter if it will be reused within this threshold of parameters. Smaller values use less memory but slow down training.

@@ -0,0 +1 @@
+Do not partition parameters smaller than this threshold. Smaller values use less memory but can greatly increase communication (especially latency-bound messages) and slow down training.

@@ -0,0 +1 @@
+Maximum number of parameter elements to fetch ahead of use. Smaller values use less memory but slow down training.

@@ -0,0 +1 @@
+Whether to use DeepSpeed to save GPU RAM during training. Note that turning on DeepSpeed can slow down training.
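These settings correspond to fields of DeepSpeed's ZeRO stage 3 configuration. As a rough illustration only — a sketch using the defaults this PR introduces, not the exact dictionary LLM Studio assembles — the mapping looks like:

# Sketch: how the tooltip options above map onto DeepSpeed's standard
# zero_optimization config keys. The offload_optimizer entry mirrors the
# (currently commented-out) offload option.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 1e6,                  # deepspeed_reduce_bucket_size
        "stage3_prefetch_bucket_size": 1e6,         # deepspeed_stage3_prefetch_bucket_size
        "stage3_param_persistence_threshold": 1e6,  # deepspeed_stage3_param_persistence_threshold
        "stage3_max_live_parameters": 1e9,          # max params resident per GPU before release
        "stage3_max_reuse_distance": 1e9,           # keep params expected to be reused soon
        # "offload_optimizer": {"device": "cpu"},   # saves GPU RAM, further slows training
    },
}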
4 changes: 3 additions & 1 deletion llm_studio/app_utils/sections/experiment.py

@@ -1680,7 +1680,9 @@ async def experiment_push_to_huggingface_dialog(q: Q, error: str = ""):
     num_running_queued = len(
         experiments[experiments["status"].isin(["queued", "running"])]
     )
-    if num_running_queued > 0:
+    experiment_path = q.client["experiment/display/experiment_path"]
+    cfg = load_config_yaml(os.path.join(experiment_path, "cfg.yaml"))
+    if num_running_queued > 0 or cfg.environment.use_deepspeed:
         default_device = "cpu"
 
     try:
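The hunk above makes experiments trained with DeepSpeed default to pushing weights to the hub from CPU, the same default used when other experiments occupy the GPUs. A standalone sketch of the guard (a hypothetical helper, not the repo's code; only the condition comes from the diff):

def default_push_device(cfg, num_running_queued: int) -> str:
    # Hypothetical helper. cfg is the experiment config loaded via
    # load_config_yaml(os.path.join(experiment_path, "cfg.yaml")) as in the
    # diff above.
    if num_running_queued > 0 or cfg.environment.use_deepspeed:
        return "cpu"  # DeepSpeed runs (and busy GPUs) push from CPU
    return "gpu"  # assumption: the non-CPU default is set elsewhere, not shown here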
73 changes: 42 additions & 31 deletions llm_studio/app_utils/utils.py

@@ -114,36 +114,41 @@ def start_process(
     env = {**os.environ, **env_vars}
 
     if num_gpus == 0:
-        p = subprocess.Popen(
-            [
-                "python",
-                "train_wave.py",
-                "-Y",
-                config_name,
-                "-Q",
-                ",".join([str(x) for x in process_queue]),
-            ],
-            env=env,
-        )
+        cmd = [
+            "python",
+            "train_wave.py",
+            "-Y",
+            config_name,
+        ]
     # Do not delete for debug purposes
     # elif num_gpus == 1:
-    #     p = subprocess.Popen(
-    #         [
-    #             "env",
-    #             f"CUDA_VISIBLE_DEVICES={','.join(gpu_list)}",
-    #             "python",
-    #             "-u",
-    #             "train_wave.py",
-    #             "-P",
-    #             config_name,
-    #             "-Q",
-    #             ",".join([str(x) for x in process_queue]),
-    #         ]
-    #     )
+    #     cmd = [
+    #         "env",
+    #         f"CUDA_VISIBLE_DEVICES={','.join(gpu_list)}",
+    #         "python",
+    #         "-u",
+    #         "train_wave.py",
+    #         "-P",
+    #         config_name,
+    #     ]
     else:
         free_port = find_free_port()
-        p = subprocess.Popen(
-            [
+        if cfg.environment.use_deepspeed:
+            logger.info("Starting deepspeed...")
+            cmd = [
+                "env",
+                "deepspeed",
+                "--include",
+                f"localhost:{','.join(gpu_list)}",
+                "--master_port",
+                f"{str(free_port)}",
+                "train_wave.py",
+                "-Y",
+                config_name,
+            ]
+        else:
+            logger.info("Starting torchrun...")
+            cmd = [
                 "env",
                 f"CUDA_VISIBLE_DEVICES={','.join(gpu_list)}",
                 "torchrun",
@@ -152,11 +157,17 @@
                 "train_wave.py",
                 "-Y",
                 config_name,
-                "-Q",
-                ",".join([str(x) for x in process_queue]),
-            ],
-            env=env,
-        )
+            ]
+
+    if len(process_queue) > 0:
+        cmd.append("-Q")
+        cmd.append(",".join([str(x) for x in process_queue]))
+
+    p = subprocess.Popen(
+        cmd,
+        env=env,
+    )
 
     logger.info(f"Percentage of RAM memory used: {psutil.virtual_memory().percent}")
 
     return p
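For illustration, a self-contained sketch of the command-assembly pattern this refactor introduces — pick a launcher, then append the shared process-queue arguments once. This is not the repo's function, and torchrun's distributed flags are collapsed in the diff and omitted here:

def build_cmd(use_deepspeed, gpu_list, free_port, config_name, process_queue):
    # Sketch of the pattern above (assumed signature, not the repo's API).
    if use_deepspeed:
        cmd = ["env", "deepspeed", "--include", f"localhost:{','.join(gpu_list)}",
               "--master_port", str(free_port), "train_wave.py", "-Y", config_name]
    else:
        cmd = ["env", f"CUDA_VISIBLE_DEVICES={','.join(gpu_list)}", "torchrun",
               # ...torchrun flags elided in the diff...
               "train_wave.py", "-Y", config_name]
    if len(process_queue) > 0:
        cmd += ["-Q", ",".join(str(x) for x in process_queue)]
    return cmd

# e.g. build_cmd(True, ["0", "1"], 29500, "cfg.yaml", [1]) yields the launcher
# invocation: env deepspeed --include localhost:0,1 --master_port 29500 ...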
Changes to the Python config classes (file name not rendered):

@@ -227,9 +227,9 @@ class ConfigNLPCausalLMTokenizer(DefaultConfig):
 
     def __post_init__(self):
         super().__post_init__()
-        self._possible_values["max_length_prompt"] = (32, 8192, 32)
-        self._possible_values["max_length_answer"] = (32, 8192, 32)
-        self._possible_values["max_length"] = (32, 8192, 32)
+        self._possible_values["max_length_prompt"] = (32, 1024 * 16, 32)
+        self._possible_values["max_length_answer"] = (32, 1024 * 16, 32)
+        self._possible_values["max_length"] = (32, 1024 * 16, 32)
         self._possible_values["padding_quantile"] = (0, 1, 0.01)
         self._padding_side = "left"
 
@@ -343,6 +343,13 @@ class ConfigNLPCausalLMEnvironment(DefaultConfig):
 
     compile_model: bool = False
     use_fsdp: bool = False
+    use_deepspeed: bool = False
+    deepspeed_reduce_bucket_size: int = 1e6
+    deepspeed_stage3_prefetch_bucket_size: int = 1e6
+    deepspeed_stage3_param_persistence_threshold: int = 1e6
+    # deepspeed_offload_optimizer: bool = False
+    # deepspeed_stage3_max_live_parameters: int = 1e9
+    # deepspeed_stage3_max_reuse_distance: int = 1e9
 
     find_unused_parameters: bool = False
     trust_remote_code: bool = True
@@ -376,6 +383,37 @@ def __post_init__(self):
 
         self._possible_values["number_of_workers"] = (1, multiprocessing.cpu_count(), 1)
         self._possible_values["seed"] = possible_values.Number(step=1, min=-1)
+        self._possible_values["deepspeed_reduce_bucket_size"] = possible_values.Number(
+            step=1, min=1e6
+        )
+        self._possible_values[
+            "deepspeed_stage3_prefetch_bucket_size"
+        ] = possible_values.Number(step=1, min=1e6)
+        self._possible_values[
+            "deepspeed_stage3_param_persistence_threshold"
+        ] = possible_values.Number(step=1, min=1e6)
+        self._possible_values[
+            "deepspeed_stage3_max_live_parameters"
+        ] = possible_values.Number(step=1, min=1e6)
+        self._possible_values[
+            "deepspeed_stage3_max_reuse_distance"
+        ] = possible_values.Number(step=1, min=1e6)
+        self._nesting.add(
+            [
+                "deepspeed_reduce_bucket_size",
+                "deepspeed_stage3_prefetch_bucket_size",
+                "deepspeed_stage3_param_persistence_threshold",
+                # "deepspeed_offload_optimizer",
+            ],
+            [Dependency(key="use_deepspeed", value=False, is_set=False)],
+        )
+        # self._nesting.add(
+        #     [
+        #         "deepspeed_stage3_max_live_parameters",
+        #         "deepspeed_stage3_max_reuse_distance",
+        #     ],
+        #     [Dependency(key="deepspeed_offload_optimizer", value=False, is_set=False)],  # noqa: E501
+        # )
 
 
 @dataclass