
[Frontend][Pytorch] Improve Pytorch frontend for object detection models #6449

Merged
merged 23 commits into apache:master on Sep 17, 2020

Conversation

kevinthesun
Contributor

Some necessary improvements for PyTorch object detection (OD) models.

@zhiics @yongwww @masahi

@masahi
Member

masahi commented Sep 11, 2020

cc @siju-samuel @t-vi

@masahi
Member

masahi commented Sep 11, 2020

@kevinthesun Thanks for working on this. Can you split this into multiple PRs? In particular, besides the new op conversions, you made many non-trivial changes to existing ops. Without tests for the latter changes, it is hard to tell what they are for.

We can merge the new op conversions first (as they came with tests).

@kevinthesun
Contributor Author

@masahi These changes are mainly for torchvision RCNN models and enhance current features. There is another PR adding some backend work. After that I can add e2e torchvision RCNN tests to this PR, which should cover all the changes.

Contributor

@t-vi t-vi left a comment


Thank you for working on this, @kevinthesun !
I must admit I'm a bit concerned that the typing changes move us back to doing things in a less systematic way. Most importantly:

  • Have a view of how we handle types (where in the conversion we switch from PyTorch types to Python/TVM ones) and try to avoid per-op rules where possible. (Kudos for unifying the full shape arguments!)
  • One of the patterns I would like to avoid is try: ... except: as a regular way of processing inputs of different types. It might blur our understanding of what's going on.

There seems to be a lot going on at the same time; if @masahi's suggestion to split is feasible, it would be easier to see what is what.

dtype0 = _infer_type(inputs[0]).checked_type.dtype
if isinstance(inputs[1], _expr.Expr):
    dtype1 = _infer_type(inputs[1]).checked_type.dtype

Contributor


I must admit that I'd appreciate it if there were more commentary on the typing changes here.

  • In my opinion (and I could be wrong), it would be helpful to have a view of what kinds of types input_types and inputs can have, and to have a single place where we do implicit type promotion. I had hoped _pytorch_promote_types could be that.
  • If _pytorch_promote_types doesn't do the job, maybe we can comment on why it doesn't. Also, why patch these particular elementwise ops instead of amending _pytorch_promote_types?

I know this looks like I'm asking for busywork when you're mostly interested in getting a particular model to work, but I have the impression that we should avoid ad hoc type workarounds as much as possible if we want to avoid subtle bugs whenever someone uses something outside what our unit tests catch.

Contributor Author

@kevinthesun kevinthesun Sep 11, 2020


This comes from the weird behavior of prim::NumToTensor: it silently converts int32 to int64:

%11 : int = aten::size(%img.1, %10), scope: __module.model # /usr/local/lib/python3.6/dist-packages/torchvision/models/detection/generalized_rcnn.py:62:0
  %im_h : Long() = prim::NumToTensor(%11), scope: __module.model

Right now the PyTorch frontend just reuses the same dtype for this op's output. For an elementwise op, the PyTorch-reported input dtypes are ["int64", "int64"], which is fine; however, the actual input dtypes are ["int64", "int32"]. What I can do is enhance _pytorch_promote_types so that we run _infer_type on every input and get the actual input dtype, rather than relying solely on the PyTorch-reported dtype. Sounds like a plan?
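Roughly, the enhancement I have in mind would look like this (a minimal sketch, not the actual diff; `_result_type` stands in for a hypothetical promotion-table helper):

```python
# Sketch only: trust Relay type inference over PyTorch-reported dtypes.
def _pytorch_promote_types(inputs, dtypes):
    actual_dtypes = []
    for i, inp in enumerate(inputs):
        if isinstance(inp, _expr.Expr):
            # Type inference reveals e.g. the int32 that
            # prim::NumToTensor silently reported as int64.
            actual_dtypes.append(_infer_type(inp).checked_type.dtype)
        else:
            actual_dtypes.append(dtypes[i])
    dtype = _result_type(actual_dtypes)  # hypothetical promotion helper
    return [
        _op.cast(inp, dtype) if isinstance(inp, _expr.Expr) else _expr.const(inp, dtype)
        for inp in inputs
    ]
```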

Contributor


I think we would eventually want to look at using type propagation more.
However, the issue here is that PyTorch's default dtype for integral tensors is int64. I don't think we should be hacking around that, really, because we're bound to end up with cases where int64 is the right thing to have. If I understood the discussions on the forum correctly, the idea was to downcast 64-bit indexing to 32-bit where it is considered safe.

Contributor Author


%11 : int = aten::size(%img.1, %10) generates int32, but %im_h : Long() = prim::NumToTensor(%11) automatically converts it to int64, without any hint. When we convert prim::NumToTensor, we can just follow the input type, which is int32 here, since there is no other information. So this is about the weird behavior of prim::NumToTensor rather than about indexing. I'm not sure how many other PyTorch ops have such behavior, but it looks like inferring the actual input type in _pytorch_promote_types would fix these kinds of issues.

try:
    k = int(_infer_value(inputs[1], {}).asnumpy().tolist())
    k = _expr.const(k)
except Exception:
    k = inputs[1]
Contributor


The int is not needed here, is it?
Also, it might be worth avoiding try: ... except Exception: during non-error processing in favour of if isinstance(...): ... else:.

Contributor Author


The try/except block is mainly for _infer_value. Currently there is no reliable way to call _infer_value with explicit error types; that's why a general Exception is caught here.

Contributor


Still, I would prefer looking at what the type of inputs[1] is and having an if. We should at least know which types are fine to leave as is (the current except block).

Contributor Author


Sure. I can do what I did for arange: check whether the input is an _expr.Expr.
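For reference, the pattern is roughly this (a sketch, not the exact diff):

```python
# Sketch only: attempt value inference only for Relay expressions.
if isinstance(inputs[1], _expr.Expr):
    try:
        # Constant-fold to a Python int when the expression is static.
        k = _expr.const(int(_infer_value(inputs[1], {}).asnumpy()))
    except Exception:
        # Genuinely dynamic: keep the Relay expression.
        k = inputs[1]
else:
    # Already a plain Python number.
    k = _expr.const(int(inputs[1]))
```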

begin = _op.concatenate(tmp, axis=0)
btype = _infer_type(begin).checked_type.dtype
if str(btype) != "int32":
    begin = _op.cast(begin, "int32")
Contributor


Using int32 here while the index_size limit is 2**63 - 1 feels strange.

Contributor Author


Using int64 now.

if isinstance(inputs[3], _expr.Expr):
    try:
        target_end = np.asscalar(_infer_value(inputs[3], {}).asnumpy().astype(np.int))
    except Exception:
Contributor


For which types do we want to do this (or alternatively which can go straight through)?

Contributor Author


if isinstance(inputs[3], _expr.Expr):

Contributor


I'd have a strong preference for that, yeah.

    new_shape.append(dim)
else:
    try:
        dim = int(_infer_value(dim, {}).asnumpy())
Contributor


Here, too, maybe avoid try: ... except:
(there are more places; I didn't flag them all, but I think they should all be changed to use a plain if).

Contributor Author


Same as above: these try/except blocks are necessary to handle dynamic operators.

@masahi
Member

masahi commented Sep 11, 2020

@kevinthesun sounds like you already have mask rcnn working :) can't wait

@kevinthesun
Contributor Author

@masahi Coming soon. :D

@masahi
Member

masahi commented Sep 12, 2020

If try ... except is necessary, I recommend adding a wrapper function around _infer_value and isolating the try ... except logic there. I see many repeated try ... except blocks.
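Something along these lines, for example (just a sketch of the wrapper I have in mind; the callback parameters are one possible way to keep per-op handling):

```python
# Sketch only: isolate the try/except and report success via a flag,
# so each op converter keeps its own fallback logic.
def try_infer_value(val, params, on_success=None, on_failure=None):
    """Attempt to evaluate `val` to a constant numpy value."""
    try:
        ret = _infer_value(val, params).asnumpy()
        return (on_success(ret) if on_success else ret), True
    except Exception:
        return (on_failure() if on_failure else val), False
```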

@kevinthesun
Contributor Author

kevinthesun commented Sep 14, 2020

@masahi The problem with creating a try_infer_value API is that it doesn't simplify the code, since we need different handling in the except block for different ops. We would still need to check the output of try_infer_value and branch to decide what actions to take. In some cases we also need to do more processing inside the try block. There is no uniform logic for this kind of dynamic attribute inference.

@t-vi
Contributor

t-vi commented Sep 14, 2020

But we should still know when it is appropriate to run the inference, should we not?
I must admit I'm still troubled by the idea that we don't know beforehand which inputs to process with which method.

@kevinthesun
Contributor Author

kevinthesun commented Sep 14, 2020

> But we should still know when it is appropriate to run the inference, should we not?
> I must admit I'm still troubled by the idea that we don't know beforehand which inputs to process with which method.

We do this for some ops which have dynamic attributes. When those dynamic attributes are Relay Exprs, we need to try to infer their values to make the generated Relay program as static as possible. There is no good way to further tell which Relay Exprs need to be inferred (and it isn't necessary, since _infer_value does general evaluation of Relay Exprs).

@kevinthesun
Contributor Author

E2E tests added. Now waiting for #6464.

@zhiics
Member

zhiics commented Sep 14, 2020

Thanks for the efforts and discussions. @kevinthesun could you please summarize the solutions/decisions to align with @masahi and @t-vi so that we can move forward?

@kevinthesun
Contributor Author

@zhiics @masahi @t-vi Sure. One major thing in this PR is the handling of dynamic operators such as slice, arange, and topk. These ops have dynamic attributes which affect Relay type inference. The methodology here is to try to infer these values to make them as static as possible (similar to the TF parser).

@masahi suggested we could have an API that wraps the try/except blocks. However, the issue with this method is that for different ops the logic inside try/except can be quite different, so it is hard to create a uniform interface.

@t-vi suggested we check the input type to see whether we need such a try/except _infer_value. Currently we differentiate between Relay Exprs and numerical values. For these dynamic ops, we try to infer the value when a dynamic attribute is a Relay Expr, to see whether it can become a numerical value. For all subtypes of Relay Expr, we just call the uniform _infer_value interface to handle them.

Any comments or suggestions?

@masahi
Member

masahi commented Sep 15, 2020

Would moving to a fully dynamic frontend, as is being done for ONNX in #6351, help remove infer_value usage?

@kevinthesun
Contributor Author

kevinthesun commented Sep 15, 2020

> Would moving to a fully dynamic frontend, as is being done for ONNX in #6351, help remove infer_value usage?

While we try to make Relay expressions as static as possible, a lot of work can still only be done in the frontend. For these cases we still need to use _infer_value. This happens a lot for TF/PyTorch OD models. For some simple ops such as topk, we can directly use the dyn namespace op and eliminate _infer_value. Later, as we gradually improve dynamic ops, it will be possible to eliminate more _infer_value calls.
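For topk, for example, the pattern is roughly this (a sketch; `data` and `axis` are assumed bindings, and keeping `k` symbolic relies on the dynamic topk support discussed here):

```python
# Sketch only: prefer a static k so type inference yields static shapes.
if isinstance(inputs[1], _expr.Expr):
    try:
        k = int(_infer_value(inputs[1], {}).asnumpy())
    except Exception:
        # k stays symbolic; this relies on dynamic topk support.
        k = inputs[1]
else:
    k = int(inputs[1])
outputs = _op.topk(data, k=k, axis=axis, ret_type="both")
```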

@masahi
Member

masahi commented Sep 15, 2020

@kevinthesun Thanks, I'm trying to run e2e on my end. I have the following questions:

  1. How long does it take to compile Faster or Mask R-CNN from torchvision? I remember hearing that TF Faster R-CNN takes 20 min to compile. If it is too slow, it might not be a good idea to run them on CI...

  2. Interestingly, Mask R-CNN seems to have prim::Loop, even though it is traced. In general, does tracing make sense for object detection models (which might have data-dependent code paths)? The loop is coming from https://github.com/pytorch/vision/blob/1a04d3c265679e1a508e7cd627006aaa9ef1ccfb/torchvision/models/detection/roi_heads.py#L457, so this is a partly scripted model.

  3. I'm getting the error below; is it expected?

TVMError: Check failed: allow_alloc: Cannot find the Realization point of tensor Tensor(shape=[1], op.name=box_data_length)
During handling of the above exception, another exception occurred:

TVMError: Check failed: allow_alloc: Cannot find the Realization point of tensor Tensor(shape=[1], op.name=box_data_length)
Error during compile function
-----------------------------
#[version = "0.0.5"]
fn (%p0: Tensor[(1, ?, ?), float32], Primitive=1) -> (Tensor[(1), int32], Tensor[(1, ?, ?), float32], Tensor[(1, ?), int32]) {
  vision.get_valid_counts(%p0, meta[relay.attrs.GetValidCountsAttrs][0]) /* ty=(Tensor[(1), int32], Tensor[(1, ?, ?), float32], Tensor[(1, ?), int32]) */
}

@t-vi
Contributor

t-vi commented Sep 15, 2020

@zhiics @kevinthesun @masahi
Thank you, @kevinthesun, for your summary and all the work on the investigation and your PR.

I think using if isinstance(..., _expr.Expr): would be very much preferable to using exceptions.

  1. I see the uses of try: ... except: ... as using exceptions for regular control flow (because the error case is the one where the old normal logic is applicable, so we should have a clear view of when the new logic applies and when it doesn't), and it would then except on the old cases,
  2. Not using try: ... except: ... for regular control flow seems like good programming fundamentals to me. It would seem odd if TVM as a compiler stack did not strive to follow best practice here.

I am not entirely sure whether 1 is contentious, and a PR seems an odd place to form an opinion on 2. At the same time, I find the construct problematic enough that I have a really hard time liking the current state of the PR. It would bring me great joy if you could be convinced to move this to if.

I should emphasize that I'm entirely for having the new functionality and appreciate your work on this, @kevinthesun. Thank you!

@kevinthesun
Contributor Author

kevinthesun commented Sep 15, 2020

> @zhiics @kevinthesun @masahi
> Thank you, @kevinthesun, for your summary and all the work on the investigation and your PR.
>
> I think using if isinstance(..., _expr.Expr): would be very much preferable to using exceptions.
>
>   1. I see the uses of try: ... except: ... as using exceptions for regular control flow (because the error case is the one where the old normal logic is applicable, so we should have a clear view of when the new logic applies and when it doesn't), and it would then except on the old cases,
>   2. Not using try: ... except: ... for regular control flow seems like good programming fundamentals to me. It would seem odd if TVM as a compiler stack did not strive to follow best practice here.
>
> I am not entirely sure whether 1 is contentious, and a PR seems an odd place to form an opinion on 2. At the same time, I find the construct problematic enough that I have a really hard time liking the current state of the PR. It would bring me great joy if you could be convinced to move this to if.
>
> I should emphasize that I'm entirely for having the new functionality and appreciate your work on this, @kevinthesun. Thank you!

@t-vi Thanks for your thoughts. To handle dynamic ops correctly, we have to use if isinstance(..., _expr.Expr) together with try ... except. Do we agree that for the dynamic ops involved in this PR, try ... except is necessary? Currently there is no way other than trying _infer_value to get a constant attribute. Unfortunately this is not about good programming fundamentals but about functionality, since regular control flow won't do the trick. You can see a similar methodology in the TensorFlow frontend. We need to do this because TF/PyTorch OD models are among the most complicated models TVM has ever tried to compile. As @masahi mentioned, we can gradually move these complicated dynamic-inference logics to the backend as the dyn namespace improves. For now, however, they are necessary to support OD models.

@kevinthesun
Contributor Author

@masahi It looks like a recent change causes this error. I'm investigating.

@t-vi
Contributor

t-vi commented Sep 15, 2020

I don't actually get why we need try: ... except: ... What case does it not handle that is handled properly by the except part?

@kevinthesun
Contributor Author

kevinthesun commented Sep 15, 2020

The original PyTorch frontend handles limited cases, mostly static shapes/attributes. For static models it's fine to just keep the input as is. For more dynamic models, we need to do some extra work to reduce the dynamism during type inference. For example, there may be a chance to reduce an output shape of (?, ?, ?) to (1, ?, ?) in a dynamic op. This is necessary, since otherwise it's hard to ensure we are doing the right thing in the backend. The error @masahi pointed out is exactly such a case: the input shape of get_valid_counts should be (1, ?, 5), while somehow a recent change makes it (1, ?, ?), and get_valid_counts doesn't allow a dynamic box data length. This is an example of why we need to make the output Relay Expr as static as possible, and why _infer_value is necessary.

@t-vi
Contributor

t-vi commented Sep 15, 2020

I think I'm slowly starting to understand. But couldn't one have something like e_new = _try_to_make_static(e) that takes an expression, makes it as static as possible, and returns the result, which might be e if it cannot be made static? Ideally this would work without going through exceptions, but if we cannot have that, one could implement it with try: ... except: .... The thing I'm uncomfortable with is having so many places where we do try: ... except:. (It sounds like @masahi suggested this above.)
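To make that concrete, something like this (the name and details are made up, of course):

```python
# Hypothetical helper, not merged code: fold an expression to a
# constant when possible, otherwise return it unchanged.
def _try_to_make_static(expr):
    if not isinstance(expr, _expr.Expr):
        return expr  # already a static Python value
    try:
        return _expr.const(_infer_value(expr, {}).asnumpy())
    except Exception:
        return expr  # genuinely dynamic; leave as is
```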

@kevinthesun
Contributor Author

As I have discussed with @masahi, the problem with having a try interface is that there is no common logic across different dynamic ops when dealing with dynamic attributes. We need to take different actions in the try/except blocks depending on the actual op.

Contributor

@t-vi t-vi left a comment


So, the bits I had in mind seem OK-ish, even if they leave some cleanup opportunities.

@masahi
Member

masahi commented Sep 15, 2020

@kevinthesun The maskrcnn test worked for me, great!

But unfortunately, under torch 1.6, conversion fails with the following error. Most likely these ops come from PyTorch functions that are scripted: https://github.com/pytorch/vision/blob/1a04d3c265679e1a508e7cd627006aaa9ef1ccfb/torchvision/models/detection/roi_heads.py#L454. Most of them don't make sense for tracing (like raising an exception).

We need to come back to this problem later when we upgrade our CI.

NotImplementedError: The following operators are not implemented:[
'aten::append',
'aten::tensor',
'aten::dim',
'aten::warn',
'aten::__is__',
'aten::__isnot__',
'prim::RaiseException',
'prim::unchecked_cast',
'aten::IntImplicit',
'aten::empty',
'aten::numel',
'prim::Uninitialized',
'aten::__contains__'
]

@kevinthesun
Contributor Author

kevinthesun commented Sep 16, 2020

@masahi Yeah. Those ops look like they come from the scripted model. I believe for PyTorch 1.6, if we trace the model, there are 2 or 3 ops missing.

@masahi
Member

masahi commented Sep 16, 2020

> @masahi Yeah. Those ops look like they come from the scripted model. I believe for PyTorch 1.6, if we trace the model, there are 2 or 3 ops missing.

To be clear, that op list comes from tracing the Mask R-CNN model: since Mask R-CNN is partly scripted, even if we run torch.jit.trace on it, we still get partly scripted TorchScript IR. See the link to the torchvision code I added above.

For Faster R-CNN, which is not partly scripted and thus can be traced completely, I get the following missing ops with PyTorch 1.6:

NotImplementedError: The following operators are not implemented: ['aten::tensor', 'aten::empty', 'aten::numel']

pt_scores = pt_res[1].detach().numpy().tolist()
tvm_scores = tvm_res[1].asnumpy().tolist()
num_pt_valid_scores = num_tvm_valid_scores = 0

Member


I'm comparing the two outputs (box coordinates etc.) by eyeballing the raw numerical values, and it looks good!

I hope we can have a better way to test the outputs, for example extracting valid box indices based on score, sorting the indices by score, and sorting the boxes by the sorted indices, like I did below.

In [59]: boxes_pt[ind_pt]                                                                                                                                                                        
Out[59]: 
array([[2.04335907e+02, 1.14787331e+02, 2.59456146e+02, 2.23669510e+02],
       [1.44117985e+01, 1.24377182e+02, 6.13694534e+01, 2.14236847e+02],
       [1.74448120e+02, 1.58607117e+02, 2.78158417e+02, 2.36064560e+02],
       [1.17156494e+02, 1.18118942e+02, 1.53017059e+02, 1.92442230e+02],
       [1.00772736e+02, 1.22123978e+02, 1.23872040e+02, 1.93398422e+02],
       [1.49618347e+02, 1.32603149e+02, 2.18598679e+02, 1.74433960e+02],
       [2.13966250e-01, 1.39350525e+02, 1.12648888e+01, 1.53912018e+02],
       [1.33723541e+02, 1.24649574e+02, 1.64407623e+02, 1.61921951e+02],
       [8.67264709e+01, 1.28565033e+02, 9.51557159e+01, 1.56289093e+02]],
      dtype=float32)

In [60]: boxes_tvm[ind_tvm]                                                                                                                                                                   
Out[60]: 
array([[204.3359    , 114.78732   , 259.45615   , 223.66951   ],
       [ 14.411795  , 124.37717   ,  61.369446  , 214.23685   ],
       [174.44815   , 158.60712   , 278.1584    , 236.06454   ],
       [117.156494  , 118.118935  , 153.01706   , 192.44223   ],
       [100.772736  , 122.12396   , 123.87204   , 193.39842   ],
       [149.61836   , 132.60315   , 218.5987    , 174.43396   ],
       [  0.39432764, 139.76776   ,  11.332638  , 153.84328   ],
       [133.72354   , 124.64958   , 164.40762   , 161.92194   ],
       [ 86.72647   , 128.56502   ,  95.155716  , 156.28911   ]],
      dtype=float32)
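For reference, the comparison I have in mind is roughly this (a sketch; the 0.9 threshold and the `scores_*`/`boxes_*` names are my assumptions):

```python
import numpy as np

# Sketch only: keep boxes whose score passes a threshold, then sort
# them by descending score so the two result sets line up row by row.
score_threshold = 0.9
valid_pt = scores_pt > score_threshold
valid_tvm = scores_tvm > score_threshold
ind_pt = np.argsort(-scores_pt[valid_pt])
ind_tvm = np.argsort(-scores_tvm[valid_tvm])
# Eyeball (or numerically diff) the aligned boxes.
print(np.abs(boxes_pt[valid_pt][ind_pt] - boxes_tvm[valid_tvm][ind_tvm]).max())
```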

Contributor Author

@kevinthesun kevinthesun Sep 16, 2020


Yeah. Ideally in this case we should test against a test dataset and calculate mAP. We have tested against the COCO dataset and the accuracy is fine.

Member

@yongwww yongwww left a comment


lgtm

Member

@masahi masahi left a comment


I gave up on supporting Mask R-CNN from torchvision last January, concluding that it was not possible with TVM at that time. Really great to see this happening!!

@masahi masahi merged commit 8b4ca61 into apache:master Sep 17, 2020
@masahi
Member

masahi commented Sep 17, 2020

Thanks @kevinthesun for the great work!!
Thanks everyone for the review.

kevinthesun added a commit to kevinthesun/tvm that referenced this pull request Sep 17, 2020
…els (apache#6449)

* Improve Pytorch Frontend

* Add tests

* Fix pylint

* Improve data cast

* Use int64 for slice axis

* Fix lint

* fix roi_align(..., aligned=True)

* Minor fix

* Add e2e test

* Add asf header

* Minor change

* Use dynamic topk

* Improve test

* Rollback topk

* py format

* remove print

* More improve

* Fix test

* Improve addmm

* Fix test

* Fix format

* Fix format

* Fix test scatter

Co-authored-by: q.yao <[email protected]>
kevinthesun added a commit to kevinthesun/tvm that referenced this pull request Sep 18, 2020
…els (apache#6449)

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Sep 18, 2020
…els (apache#6449)

@masahi masahi mentioned this pull request Sep 30, 2020
@zhangzj03

> @kevinthesun The maskrcnn test worked for me, great!
>
> But unfortunately, under torch 1.6, conversion fails with the following error. Most likely these ops come from PyTorch functions that are scripted: https://github.com/pytorch/vision/blob/1a04d3c265679e1a508e7cd627006aaa9ef1ccfb/torchvision/models/detection/roi_heads.py#L454. Most of them don't make sense for tracing (like raising an exception).
>
> We need to come back to this problem later when we upgrade our CI.
>
> NotImplementedError: The following operators are not implemented:[
> 'aten::append',
> 'aten::tensor',
> 'aten::dim',
> 'aten::warn',
> 'aten::__is__',
> 'aten::__isnot__',
> 'prim::RaiseException',
> 'prim::unchecked_cast',
> 'aten::IntImplicit',
> 'aten::empty',
> 'aten::numel',
> 'prim::Uninitialized',
> 'aten::__contains__'
> ]
Hi, are these ops OK now?
