Video Neva Pretraining + Inference Implementation #9095
Conversation
@@ -11,7 +11,9 @@ inference:
   compute_logprob: False # a flag used to compute logprob of all the input text, a very special case of running inference, default False
   end_strings: ["<extra_id_1>","<extra_id_7>",] # generation will stop when one of these tokens is generated
   images_base_path: /pwd/images
+  insert_image_token: null # `left` or `right` or `null`
+  videos_base_path: null # /pwd/videos
Is this necessary? If we don't have a mixture of video + image, maybe just use a media base path?
[Resolved] Changed to media_base_path in the latest commit.
@@ -0,0 +1,92 @@
## Inference with multimodal
[Resolved] Moved under /docs.
@@ -126,6 +135,22 @@ def forward_loop():
    if responses is None:
        return

    results = []
Duplicated code! Remove this part and merge it into L154 below.
[Resolved]
else:
    frames = processor.preprocess(frames, return_tensors='pt')['pixel_values']

if neva_cfg.precision in [16, '16', '16-mixed']:
Use:
def torch_dtype_from_precision(precision: Union[int, str], megatron_amp_O2: Optional[bool] = None) -> torch.dtype:
[resolved]
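For context, the precision check being replaced can be sketched as a single mapping function, which is what the suggested helper does. This is a hedged sketch only: dtype names are returned as strings for illustration, while NeMo's actual `torch_dtype_from_precision` returns `torch.dtype` objects and also accepts a `megatron_amp_O2` flag.

```python
from typing import Union

def dtype_from_precision(precision: Union[int, str]) -> str:
    # Sketch of the precision-to-dtype mapping; the real NeMo helper
    # returns torch dtypes and may handle additional precision strings.
    if precision in [16, '16', '16-mixed']:
        return 'float16'
    if precision in ['bf16', 'bf16-mixed']:
        return 'bfloat16'
    if precision in [32, '32', '32-true']:
        return 'float32'
    raise ValueError(f"Could not infer dtype from precision: {precision}")
```

Centralizing the mapping avoids scattering `precision in [16, '16', '16-mixed']` checks through inference scripts.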
sources = preprocess_multimodal(
    copy.deepcopy(list_data_dict), multimodal_cfg, num_media_latents
)  # HARDCODED FOR NOW
num_media_latents = min((num_media_latents // 14) * (num_media_latents // 14), 576)
What do these changes mean? We don't want to hard-code 576 here.
This line is here because for video NeVa the actual image_token_len we need to pass is 256, so we need some logic to turn the original model_config.data.image_token_len from 224 into 256. However, if we don't hard-code 576 here, it breaks NeVa 1.5 model inference, which uses image_token_len = 576. I would appreciate suggestions for handling this more elegantly.
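For background on the numbers in this thread: with a ViT patch size of 14, the token count is the per-side patch count squared, which is presumably where 256 (224 px input) and 576 (336 px input) come from. A minimal sketch (the function name is hypothetical, not NeMo code):

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    # ViT-style token count: number of patches per side, squared.
    return (image_size // patch_size) ** 2

# A CLIP ViT-L/14 tower at 336 px yields 576 tokens (NeVa 1.5);
# at 224 px it yields 256 tokens (video NeVa), which appears to be
# what the min((n // 14) * (n // 14), 576) expression is capping.
```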
[resolved]
@@ -372,6 +379,8 @@ def neva_process_prompts(prompt, tokenizer, multimodal_cfg, num_media_latents, c
        turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value'])
    list_data_dict.append(record)

num_media_latents = min((num_media_latents // 14) * (num_media_latents // 14), 576)
Remove this line? It was already added above.
same comment as above
[resolved]
@@ -385,6 +394,7 @@ def neva_process_prompts(prompt, tokenizer, multimodal_cfg, num_media_latents, c
    if turn.get('value') is not None:
        turn['value'] = re.sub('<image>', f'{DEFAULT_IMAGE_TOKEN}\n', turn['value'])
    list_data_dict.append(record)
num_media_latents = min((num_media_latents // 14) * (num_media_latents // 14), 576)
Remove this line? It was already added above.
[resolved]
Please remove this file. If you need it, we can add it to assets later instead of keeping it in the GitHub source code.
[resolved]
* video_neva pretrain
* support video neva inference
* yaml update, adding media_type
* yaml update, adding media_type
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* modify neva inference config
* modify based on review
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* remove video test asset
* video_neva doc, describing config changes
* Revert "video_neva doc, describing config changes." (reverts commit 1a02ccd)
* vneva brief doc
* vneva doc update
* doc update
* Revert "doc update" (reverts commit 80af9a4)
* doc update
* Revert "doc update" (reverts commit 8c885c7)
* doc update
* Revert "doc update" (reverts commit 94aba65)
* doc update
* add inference doc to docs, resolve review
* modify inference config for other mlm

Signed-off-by: Vivian Chen <[email protected]>
Signed-off-by: paul-gibbons <[email protected]>
Co-authored-by: Vivian Chen <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do ?
This PR enables users to train NeVa models on video data by slicing videos into a specified number of frames.
Co-authored-by: Pratyush Muthukumar [email protected], Vivian Chen [email protected], Slyne Deng [email protected].
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
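The usage placeholder above is empty in the PR description. As a hedged illustration of the PR's core idea of slicing a video into a fixed number of frames, here is a minimal index-sampling sketch; the function name and the exact sampling scheme (evenly spaced, centered indices) are illustrative assumptions, not the PR's actual implementation.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list:
    # Illustrative sketch: pick num_frames evenly spaced frame indices,
    # each taken from the middle of its segment of the video.
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]

# e.g. an 8-frame slice of a 100-frame video
indices = sample_frame_indices(100, 8)
```

The sampled frames would then be preprocessed into pixel tensors (e.g. via the image processor's `preprocess` call seen in the diff) before being passed to the model.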
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information