Training codes release plan #56

Open
itechmusic opened this issue Apr 30, 2024 · 15 comments

@itechmusic (Contributor)

Thank you all for your interest in our open-source work MuseTalk.

We have observed that the training codes hold significant value for our community. With this in mind, we are pleased to share an initial release of the training codes here.

We are committed to enhancing our efforts to finalize the training codes in the near future.

Should you encounter any questions or need clarification about the codes, please feel free to reach out.

@shounakb1 commented May 10, 2024

@itechmusic I have created the data preparation code according to the architecture described in the Readme. I used dlib for the face masking. This is a really simple approach to lip-syncing and it works great. Thank you for this. I would be happy to take some load off you if you need any help with the data preparation, since I see you have already written the training script. Thanks again!

Please let me know if I can contribute in any way.

@shounakb1 commented May 10, 2024

I have noticed that you take half of the image as the mask. I'm not sure I understand this correctly, but it seems suboptimal to me compared to detecting the nose and masking the area below it; do correct me if I'm wrong.

mask = torch.zeros((ref_image.shape[1], ref_image.shape[2]))
mask[:ref_image.shape[2]//2,:] = 1
image = torch.FloatTensor(image)
mask, masked_image = prepare_mask_and_masked_image(image,mask)
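
For reference, here is a rough sketch of the nose-based alternative I have in mind, assuming dlib's 68-point predictor (landmark 33 is the nose tip); the model path and the mask convention below are just placeholders:

import cv2
import dlib
import torch

detector = dlib.get_frontal_face_detector()
# standard dlib 68-point landmark model (path is a placeholder)
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def nose_based_mask(image_bgr):
    """Mask everything below the nose tip instead of a fixed half of the crop."""
    h, w = image_bgr.shape[:2]
    mask = torch.zeros((h, w))
    faces = detector(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY))
    if len(faces) == 0:
        return mask                       # no face found: leave the mask empty
    landmarks = predictor(image_bgr, faces[0])
    nose_y = landmarks.part(33).y         # landmark 33 = nose tip in the 68-point scheme
    mask[nose_y:, :] = 1                  # 1 marks the region to be regenerated
    return mask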

@czk32611 (Contributor)

It would be great if you could contribute!

You can commit your data preparation code and create a pull request against the train_codes branch. Then we can review and revise it, and merge it into master once it is finalized.

Thanks!

@czk32611 (Contributor) commented May 11, 2024

Currently, we mask the entire lower half of the face, the same as is done in most prior work (e.g. Wav2Lip and IP_LAP).

In my opinion, masking only the region below the nose may not be a good idea, since our cheeks also move when we talk. That said, modifying the mask region is an interesting topic; DINet, for example, uses a smaller mask that removes the background.

We can discuss this further if you have a better idea.
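
To make the convention concrete, here is a minimal, self-contained illustration of lower-half masking (this is just a sketch assuming a (C, H, W) float tensor, not the exact helper used in the training code):

import torch

def lower_half_mask_and_masked_image(image):
    """image: float tensor of shape (C, H, W); returns (mask, masked_image)."""
    _, h, w = image.shape
    mask = torch.zeros((h, w))
    mask[h // 2:, :] = 1                               # 1 marks the lower half to be regenerated
    masked_image = image * (mask < 0.5).float()        # hide the masked region in the conditioning image
    return mask, masked_image

# usage with a random stand-in image
dummy = torch.rand(3, 256, 256)
mask, masked = lower_half_mask_and_masked_image(dummy)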

@shounakb1

Is there any way to get in touch on Slack, Discord or somewhere else? I would like to ask something about the audio data preparation code. I have some doubts there, since the number of whisper chunks returned is different from the number of frames processed during inference, yet the data preparation doc here shows the same number of npy files as frames.
I also have some ideas about detecting moving textures in the face and selecting based on that, but I'm not sure how well it would work out.

@shounakb1

I think there is something going on in the datagen function that takes care of matching frames to audio chunks; I'm going through it currently. I'll update here once I understand it.
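
A rough sketch of how I currently read the matching, assuming 25 fps video and Whisper encoder features at 50 feature frames per second (the window size and these numbers are my assumptions, not necessarily the repo's exact values):

import numpy as np

def audio_window_for_frame(whisper_features, frame_idx, fps=25, feat_per_sec=50, left=2, right=2):
    """Slice the Whisper features (T, D) around one video frame.

    At 25 fps and 50 feature frames per second, each video frame covers 2 feature frames.
    """
    feats_per_frame = feat_per_sec // fps
    center = frame_idx * feats_per_frame
    start = max(center - left * feats_per_frame, 0)
    end = min(center + (right + 1) * feats_per_frame, len(whisper_features))
    return whisper_features[start:end]

# saving one .npy per video frame keeps the file count equal to the frame count
# np.save(f"{frame_idx}.npy", audio_window_for_frame(feats, frame_idx))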

@czk32611 In the meantime, could you let me know how we can get in touch?

@shounakb1

@czk32611 @itechmusic I have written the code for both the audio and image training data preparation. Can you tell me how I can commit it for review? Right now I get access denied. It's a bit of dirty code, so I need your help to clean it up so it can be merged. I also changed the save structure to suit my training purposes, so it's a bit different from the one described in the doc. I was also wondering whether we could generate the whole face if we mask the entire face.

@czk32611 (Contributor)

You may refer to https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork

@czk32611 (Contributor)

You can find me on Discord with the user name czk32611. However, I do not use Discord much, so it may take some time for me to reply. Sorry.

@shounakb1

@czk32611 @itechmusic I have created a PR for the data creation part. It's a dirty implementation, so please just go through it once to check that the basic steps are correct; I got pretty decent results after training and attached a sample in the PR. I also needed to update Dataloader.py to train, but it worked quite well with only 4 minutes of training data.

I also applied some other tricks like diarization, VAD and head-motion detection to build the data automatically, and finally used dlib to cut out the face and paste it back onto the original video, because the earrings were a little blurry.

Please guide me through any further steps, as I'm not a regular open-source contributor but am really looking to get my hands dirty for the first time. Thanks!

@shounakb1

I sent a request from manwithplan6650

czk32611 pinned this issue May 31, 2024
@songcheng commented Jun 28, 2024

@czk32611
During training, after the audio data is processed with
audio_feature = audio_feature.to(dtype=weight_dtype)
why is no positional encoding (PE) applied to the audio features?
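
For reference, this is the kind of PE step I mean: a standard sinusoidal positional encoding added to the (B, T, D) audio features. The module below is my own sketch (the dimension is an assumption), not the repository's code:

import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Standard sinusoidal positional encoding added to a (B, T, D) sequence."""
    def __init__(self, d_model=384, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, x):                               # x: (B, T, D)
        return x + self.pe[:, : x.size(1)].to(x.dtype)

# audio_feature = audio_feature.to(dtype=weight_dtype)
# audio_feature = SinusoidalPE(d_model=audio_feature.shape[-1])(audio_feature)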

@IronSpiderMan

Are there any tips for choosing the reference image? I tried using one of the 5 frames around the target image, but the result wasn't very good. Thank you very much.

@limaoyue1

When I run this step, memory usage blows up. Why is that? As the images are read, memory consumption keeps climbing until it crashes.

def read_imgs(img_list):
    frames = []
    print('reading images...')
    for img_path in tqdm(img_list):
        frame = cv2.imread(img_path)
        frames.append(frame)
    return frames
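
One way to avoid holding every frame in RAM would be to read the frames lazily instead of accumulating them in a list; a minimal sketch, assuming the downstream code can consume frames one at a time:

import cv2
from tqdm import tqdm

def iter_imgs(img_list):
    """Yield frames one by one instead of keeping the whole list in memory."""
    print('reading images...')
    for img_path in tqdm(img_list):
        frame = cv2.imread(img_path)
        if frame is None:
            continue                      # skip unreadable files instead of yielding None
        yield frame

# for frame in iter_imgs(img_list):
#     process(frame)                      # process() is a placeholder for the downstream step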

@glennchina

The current training code doesn't include the SIS (Selective Information Sampling) part described in your paper. When will you release it?
The paper describes a "Selective Information Sampling (SIS) strategy that selects reference images with head poses closely aligned to the target, while ensuring distinct lip movements," but I can't find this in the current DataLoader.py. For now the reference frame is just sampled at random from the whole video.

Thanks a lot for your work; looking forward to your reply.
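
In the meantime, here is a rough sketch of what I imagine such a selection strategy could look like, based only on the sentence quoted from the paper (head pose approximated by non-lip landmarks, lip movement by mouth-landmark distance; the landmark indices and threshold are my assumptions):

import numpy as np

def select_reference(target_lms, candidate_lms, lip_idx=tuple(range(48, 68)), min_lip_dist=5.0):
    """Pick the candidate whose pose is closest to the target while its lips differ enough.

    target_lms: (68, 2) landmarks of the target frame.
    candidate_lms: list of (68, 2) landmark arrays, one per candidate frame.
    """
    lip_idx = list(lip_idx)
    pose_idx = [i for i in range(68) if i not in lip_idx]     # crude pose proxy: non-lip landmarks
    best, best_pose_dist = None, np.inf
    for i, lms in enumerate(candidate_lms):
        pose_dist = np.linalg.norm(lms[pose_idx] - target_lms[pose_idx], axis=1).mean()
        lip_dist = np.linalg.norm(lms[lip_idx] - target_lms[lip_idx], axis=1).mean()
        if lip_dist >= min_lip_dist and pose_dist < best_pose_dist:
            best, best_pose_dist = i, pose_dist
    return best    # index of the chosen reference frame, or None if nothing qualifies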
