
Finally I reproduce the result on the same wild skating video #23

bucktoothsir opened this issue Jan 18, 2019 · 9 comments


bucktoothsir commented Jan 18, 2019

[image: camera1]

In #6, I got a worse result on the skate video. Then I cut it into a square video and finally reproduced the result shown in https://s3.amazonaws.com/video-pose-3d/private/videos/wild4.mp4

@DongJT1996

Hello,
It's cool that you reproduced the result. May I ask whether you have tried changing the viewpoint to see the result?

@bucktoothsir

@DongJT1996 yes I have

@tobiascz

@bucktoothsir - We are discussing this topic in 3 different issues now 💃

Really cool! What do you think is the reason for this improvement?

  1. A smaller image, so that the dancer is always more in the center and Detectron estimates the 2D positions more accurately?
  2. Something internal in the VideoPose3D code?
  3. Your computer was in a good mood?

@bucktoothsir

@tobiascz
I think the reason is that the h36m dataset is square.
So the model trained on the h36m dataset only works on square pictures.

@tobiascz

> @tobiascz
> I think the reason is that the h36m dataset is square.
> So the model trained on the h36m dataset only works on square pictures.

But you don't train directly on the images. You train the network on 2D poses, and that's just a (17, 2) array for each frame. The network lifts the 2D poses to 3D, and you can compare the results to the ground truth provided in Human3.6M. The only part that cares about your image resolution or size is your 2D detector, which in our case is Detectron.


bucktoothsir commented Jan 29, 2019

> > @tobiascz
> > I think the reason is that the h36m dataset is square.
> > So the model trained on the h36m dataset only works on square pictures.
>
> But you don't train directly on the images. You train the network on 2D poses, and that's just a (17, 2) array for each frame. The network lifts the 2D poses to 3D, and you can compare the results to the ground truth provided in Human3.6M. The only part that cares about your image resolution or size is your 2D detector, which in our case is Detectron.

The 2D pose is relative to the size of the image. Check line 84 in run.py:
`kps[..., :2] = normalize_screen_coordinates(kps[..., :2], w=cam['res_w'], h=cam['res_h'])`
But your thought might be right.
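
For reference, a minimal sketch of what I understand that normalization to do (an assumption based on the call above, not a copy of the repo's code): the image width is mapped to [-1, 1] and y is scaled by the same factor, so the image center lands at (0, 0).

import numpy as np

def normalize_screen_coordinates(X, w, h):
    # Sketch only: map pixel coordinates so that x in [0, w] becomes [-1, 1],
    # scale y by the same factor (2 / w), and put the image center at (0, 0).
    assert X.shape[-1] == 2
    return X / w * 2 - np.array([1, h / w])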

It is easy to verify our guess:

1. First pipeline:
   1.1 Extract the 2D pose from the original videos.
   1.2 Cut the videos to square and shift the 2D pose by the corresponding translation (see the sketch below).
   1.3 Reconstruct the 3D pose.
2. Second pipeline:
   2.1 Cut the videos to square.
   2.2 Extract the 2D pose from the square videos.
   2.3 Reconstruct the 3D pose.

Compare the two results and you will find something.
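
A rough sketch of the translation in step 1.2, assuming a centered square crop (the helper below is made up for illustration, not part of the repo):

import numpy as np

def crop_square_and_translate(kps, w, h):
    # Hypothetical helper for step 1.2: crop the frame to a centered square of
    # side min(w, h) and shift the 2D keypoints into the cropped frame.
    side = min(w, h)
    x_off = (w - side) / 2.0   # pixels removed on the left
    y_off = (h - side) / 2.0   # pixels removed on the top
    kps = kps.astype(np.float64)   # copy, so the original keypoints stay untouched
    kps[..., 0] -= x_off
    kps[..., 1] -= y_off
    return kps, side, side         # translated keypoints plus the new resolution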


tobiascz commented Jan 30, 2019

@bucktoothsir Steps 1.1 to 1.3 are what I already did here, right?

So I tried what you suggested. I cut the original video to square, extracted the 2D pose with Detectron, and reconstructed the 3D pose using VideoPose3D.

2nd Method

[gif: result of the 2nd method]

So yeah it still doesn't look as nice as yours.

I have one more question.
In the render_animation call in run.py, which value for azim did you use?

def render_animation(keypoints, poses, skeleton, fps, bitrate, azim, output, viewport,
					 limit=-1, downsample=1, size=6, input_video_path=None, input_video_skip=0):

EDIT:
So I actually found another thing that might destroy my results. I used the h36m metadata for my 2D poses from Detectron:
metadata = {'layout_name': 'h36m', 'num_joints': 17, 'keypoints_symmetry': [[4, 5, 6, 11, 12, 13], [1, 2, 3, 14, 15, 16]]}

but Detectron actually produces COCO keypoints:

keypoints_coco = ['nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear', 'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow', 'left_wrist', 'right_wrist', 'left_hip', 'right_hip', 'left_knee', 'right_knee', 'left_ankle', 'right_ankle']
so it actually should be:
`keypoints_symmetry = [[1, 3, 5, 7, 9, 11, 13, 15], [2, 4, 6, 8, 10, 12, 14, 16]]`

But if I do that, my 3D estimation goes completely wrong.
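
For what it's worth, those symmetry lists can be derived directly from the COCO keypoint order; a small self-contained sketch:

keypoints_coco = ['nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
                  'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
                  'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
                  'left_knee', 'right_knee', 'left_ankle', 'right_ankle']

# Indices of the left/right joints in the COCO ordering
kps_left = [i for i, name in enumerate(keypoints_coco) if name.startswith('left_')]
kps_right = [i for i, name in enumerate(keypoints_coco) if name.startswith('right_')]
keypoints_symmetry = [kps_left, kps_right]
# -> [[1, 3, 5, 7, 9, 11, 13, 15], [2, 4, 6, 8, 10, 12, 14, 16]]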


tobiascz commented Feb 1, 2019

The 2D keypoints metadata was my problem:
`keypoints_symmetry = [[1, 3, 5, 7, 9, 11, 13, 15], [2, 4, 6, 8, 10, 12, 14, 16]]`
With this symmetry I get really good results. I had to fix my 3D visualization as mentioned, but now I am very happy with the results!

[gif: skater result]

The code I used

@dariopavllo
Contributor

Good catch! Indeed, the keypoint symmetry is used for test-time augmentation. If the latter is enabled (it is by default), the match between left/right keypoints should be specified properly to avoid messing things up.

Regarding previous comments: the image does not need to have a square aspect ratio. I made the videos using the original 1920x1080 resolution. The important thing is that the images are normalized so that the largest edge is between -1 and 1 and the center of the image is at (0, 0), but the normalization function normalize_screen_coordinates takes care of that already. You should just make sure to modify cam['res_w'] and cam['res_h'] in the dataset definition to reflect the actual resolution of the video.
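
A minimal sketch of that adjustment, assuming cam is the same camera entry used in the run.py call quoted above (how you reach it depends on your dataset definition):

# Make the stored camera resolution match the actual video before normalizing.
cam['res_w'], cam['res_h'] = 1920, 1080   # actual resolution of the wild video
kps[..., :2] = normalize_screen_coordinates(kps[..., :2], w=cam['res_w'], h=cam['res_h'])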
