forked from santoshnlp/END2QnADatasets
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Json_Pytoch_Discuss.json
2769 lines (2769 loc) · 314 KB
/
Json_Pytoch_Discuss.json
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
{
"data": [
{
"X ": "calculating loss for entire batch using nllloss in 0 4 0",
"Z ": "Defake:loss = criterion_test(dec_outs.view(-1, vocab_size, batch_size), targets.view(-1, batch_size)) ,I think you need to do criterion_test(dec_outs.view(-1, vocab_size),targets.view(-1)) In your case, ( C )- vocab_size and (N)- (batch_size*seq_length). I am assuming all the batches have the same sequence length. If not, youll have to use pack_padded_sequence and also mask the loss for the pad token.",
"Y ": "I believe you should use criterion test(dec outs.view(-1, vocab size),targets.view(-1)) to test your hypothesis.( C ) = vocab size, and ( N ) = batch size*seq length in your case. I'm assuming that the sequence length in each batch is the same. If not, you'll need to use pack padded sequence and hide the pad token loss."
},
{
"X ": "masked fill operates weirdly",
"Z ": "From the error mesage, it is a size issue on the 3rd dimension, where one is of size 8 and the other of size 9.I would print the size of the tensors before the operation to check the dimensions.",
"Y ": "The error message indicates that there is a size issue on the third dimension, with one size 8 and the other size 9.Before the procedure, I would print the size of the tensors to double-check the dimensions."
},
{
"X ": "creating custom nn module for softmargin softmax",
"Z ": "Prefer not to use for loops, try to vectorize your code as much as possible.Refer this old question I asked for checking my implementation of softargmax which I believe you are lookin for. It's pretty decent and fast",
"Y ": "Avoid using for loops whenever feasible, and attempt to vectorize your code as much as possible."
},
{
"X ": "gru autoencoder is not working",
"Z ": "I solve this problem.Target tensor was wrong.",
"Y ": "Target tensor was wrong"
},
{
"X ": "issue with multiple gpu loss convergence",
"Z ": "I solved my issue. Since batch wasnt my first dimension, I had to mention dim=1 in the data parallel, that is the dimension I need to scatter my inputs.",
"Y ": " mention dim=1 in the data parallel, that is the dimension I need to scatter my inputs."
},
{
"X ": "building from source keeps failing ubuntu 18 04 02 lts no gpu",
"Z ": "hey there,I cant tell you the exact reason for you problem, but it is best practice to build pytorch in a clean anaconda environment. Here is how. Please report back if it helps",
"Y ": "Create new anaconda envinorment "
},
{
"X ": "segmentation fault core dumped with personnal nn function",
"Z ": "The problem seem to be solved by updating from v1.0 to v1.0.1",
"Y ": "Update version from v1.0 to v1.0.1"
},
{
"X ": "libtorch cmake error on centos7",
"Z ": "Solved. Cmake 3.10 is fine.",
"Y": "use Cmake 3.10"
},
{
"X ": "cmake error and fatal error lnk1181 building from source on windows 10",
"Z ": "Well, I saw that your build directory contains space. As a workground, you can just avoid that to make build pass. However, more details to fix this issue are welcomed.",
"Y ": "So, I noticed that your build directory has some empty space. You can simply avoid that as a workground to ensure that the build passes. "
},
{
"X ": "data float 1 segfaults when cudatype",
"Z ": "I found the answer on StackOverflow:stackoverflow.com Torch C++: Getting the value of a int tensor by using *.data<int () pytorch, torch, libtorch asked by Afshin Oroojlooy on 02: 22PM - 15 Jan 19 UTC The Tensor class despreately needs documentation!",
"Y ": "The Tensor class despreately needs documentation!"
},
{
"X ": "how to collect libtorch package like the official release when building from source",
"Z ": "I think using an appropriate CMake + make install should work, e.g. the Android build does this. You want to disable the Python bit for this. The suggested alternative there works well, too - building Python and picking the lib and include. Libtorch 1.0 used to be built that way (actually extracting from the whl, Best regards Thomas",
"Y ": "I believe that using an adequate CMake + make install, such as the Android build, should suffice. For this, you'll need to turn off Python.The suggested solution there, creating Python and selecting the lib and include, also works fine. That's how Libtorch 1.0 was made (by extracting from the whl...)."
},
{
"X ": "cuda is available true with python false with c",
"Z ": "Do you use the same PyTorch distribution (i.e. libtorch cmake from /usr/local/lib/python3.x/dist-packages/torch/share/cmake or somesuch)?In the end, the same libtorch should behave the same way Best regards Thomas",
"Y ": "use libtorch"
},
{
"X ": "how to self define a backward function for a net in libtorch i tested some code but failed",
"Z ": "Hi, Note that on the python side, the Function have changed slightly as you can see in the tuto.For cpp it is a bit more complex. a Function does only one way and its ‚Äúapply‚Äù method should be implemented. It is either implemented in pure autograd by performing operations on Variables or the output should be wrapped and the backward Function specified.You will need 2 functions if you want a custom backward. For example here, ‚DelayedError is the forward function and ‚ÄúError‚Äù is the backward.",
"Y ": "The Functions on the Python side have changed slightly, as you can see in the tutorial.It's a little more complicated in cpp. A Function can only be used in one direction, and its apply method should be used. It can be done in pure autograd by executing operations on variables, or it can be wrapped and a reverse function supplied.If you want a custom backward, you'll need two functions. In this case, the forward function is DelayedError, and the backward function is Error."
},
{
"X ": "is there any way to skip steps in a dataloader",
"Z ": "Yeah, I would say no built-in way for now. But, we are working on a new design of DataLoader, which IMO will provide this functionality.",
"Y ": "Bulding New desgin for DataLoader"
},
{
"X ": "understanding model to device",
"Z ": "Yes, your assumption should be correct as also seen in this post, since the model reference would be passed and its parameters (and buffers) updated inplace.You could use the code snippet in the linked post to verify it using your setup.",
"Y ": "Yes, as noted in this post, your assumption should be valid, as the model reference would be passed and its parameters (and buffers) updated in place.To test it with your configuration, you might use the code snippet in the linked post."
},
{
"X ": "update weight with same netoworks output",
"Z ": "That wouldn't be a fix, as it's still using the wrong behavior. Previous PyTorch versions allowed this wrong gradient calculations, which is why no errors were raised.",
"Y ": "Previous PyTorch versions allowed this wrong gradient calculations, which is why no errors were raised."
},
{
"X ": "update network after differentiation with autograd grad",
"Z ": "Update: The layers receive gradients. My problem has to do something with My GNN.Maybe the integration in pytorch geometric is breaking",
"Y ": "Maybe the integration in pytorch geometric is breaking"
},
{
"X ": "create a f score loss function",
"Z ": "AFAIK f-score is ill-suited as a loss function for training a network. F-score is better suited to judge a classifier‚its calibration, but does not hold enough information for the neural network to improve its predictions.Loss functions are differentiable so that they can propagate gradients throughout the network using the chain rule (see ‚backpropagation).",
"Y ": "AFAIK f-score is ill-suited as a loss function for training a network."
},
{
"X ": "how do i backpropagate through a modules parameters",
"Z ": "you could look at something like https://github.com/facebookresearch/higher for this purpose.It functionalizes the model, where it’s parameters can be detached and backproped through",
"Y ": "It functionalizes the model, where it’s parameters can be detached and backproped through"
},
{
"X ": "how am i supposed to cache tensors that require grad but arent learnable module params that are replaced by a different tensor several times each forward pass",
"Z ": "I solved my problem by:Not making activs and outputs nn.Parameters Not assigning them as model attributes. Instead I added them as optional key word arguments in the forward method and returned the activations as well.",
"Y ": "Activations and outputs are not nn. Parameters aren't being assigned as model attributes. Instead, I added them to the forward procedure as optional key word arguments and returned the activations as well."
},
{
"X ": "torch no grad makes any difference",
"Z ": "Hi,The eval mode will make a difference only if you use special Modules that behave differently in eval mode like dropout or batchnorm. But even in that case, the runtime might not change that much.The no_grad mode disables the autograd so it will make a significant difference in memory usage but should not change much for runtime (as we work hard to make the autograd light so that it can run at every forward).",
"Y ": "Only if you utilise specific Modules that behave differently in eval mode, such as dropout or batchnorm, will eval mode make a difference. However, even in such instance, the runtime may not alter significantly. The no grad setting disables autograd, resulting in a significant reduction in memory use."
},
{
"X ": "training breaks in pytorch 1 5 0 throws inplace modification error",
"Z ": "Setting set_detect_anomaly doesnt give any other output Make sure to use latest pytorch as we recently fix warnings not showing up in colab.Or run your code in command line to have the corresponding forward code.The code does quite a lot of inplace and viewing ops.Plus, it is working without any error, when Pytorch is downgraded to 1.5.0.This kind of check is here to make sure we don’t compute silently wrong gradients. So it is most likely that the old behavior was silently computing wrong gradients and has been fixed in more recent versions.",
"Y ": "There is no other output when set detect anomaly is used. Make sure you're using the most recent version of Pytorch, as we recently fixed a bug where warnings weren't showing up in colab. Alternatively, you can run your code from the command line to get the forward code. The code performs numerous in-place and viewing operations. It also collaborates with..."
},
{
"X ": "custom loss function class",
"Z ": "I am afraid BCELoss does not.But looking at the code, BCELossWithLogits does (https: //pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html?highlight=bce#torch.nn.BCEWithLogitsLoss) So if you actually use a sigmoid before it and want to merge both, that will work Otherwise, you will have to create a custom nn.Module yes. But the one line formula for BCELoss given above should not be significantly slower than the all-in-one version.",
"Y ": "BCELossWithLogits does"
},
{
"X ": "best way to downsample batch image tensors",
"Z ": "Refer to nn.PixelShuffle()",
"Y ": " use nn.PixelShuffle()"
},
{
"X ": "unexpected data error in the ms coco dataset valueerror all bounding boxes should have positive height and width",
"Z ": "Based on the error it seems that this particular bounding box has a width of 0, so you might want to filter out these images.",
"Y ": "Filter biunding box whcich has width 0 "
},
{
"X ": "softmax returns only 1s and 0s during inference",
"Z ": "Found the issue. The data was not the same after all (the pipeline was missing the normalization step, and I didn't notice).Let this be a lesson to anyone getting weird logits out of your network: Print the values, don't plot the image. :v)",
"Y ": "Print the logits values because ploting image won't help "
},
{
"X ": "mnist server down",
"Z ": "If the version of torchvision is 0.9.0, which is currently stable, being unable to download MNIST is (unfortunately) expected, but if the version is nightly, it's not expected.",
"Y ": "Version is not stable , Download the MNIST file "
},
{
"X ": "how to visualize model in pytorch",
"Z ": "Not a problem @hs99! I'd suggest reading the tutorial first Saving and Loading Models PyTorch Tutorials 1.8.1+cu102 documentation and if there are still problems, raise a new topic!",
"Y ": "read Saving and Loading Models PyTorch Tutorials 1.8.1+cu102 documentation "
},
{
"X ": "vgg16 using cifar10 not converging",
"Z ": "Is it possible your validation accuracy is for a single batch instead of the entire validation set? If so the fluctuation would be perfectly normal since your accuracy is based on only 16 predictions which would fluctuate heavily.Otherwise, the heavy fluctuations in your validation set would not make sense across a larger sample, especially as the training and validation losses steadily decline.",
"Y ": "There is no other output when set detect anomaly is used. Make sure you're using the most recent version of Pytorch, as we recently fixed a bug where warnings weren't showing up in colab. Alternatively, you can run your code from the command line to get the forward code. The code performs numerous in-place and viewing operations. It also collaborates with..."
},
{
"X ": "dataparallel trained on one gpu but inference used on multiple gpus",
"Z ": "Then, try to load your model before DP construction.",
"Y ": "Load model before DP Construction "
},
{
"X ": "runtimeerror each element in list of batch should be of equal size",
"Z ": "Issue resolved by downgrad back to PyTorch 1.5.0So it looks like a PyTorch 1.6 issue",
"Y ": "Downgrade pytroch version to 1.5 from 1.6 "
},
{
"X ": "valueerror not enough values to unpack expected 3 got 2",
"Z ": "your lstm layer returns a 3-tuple but you unpack it as 2",
"Y ": "Lstm returns 3-tuple instead you are unpackingit as 2 "
},
{
"X ": "nn nllloss valueerror dimension mismatch",
"Z ": "UPDATEAs I was iterating over the training set, I realized that the last batch contains only 4 labels as opposed to the expected 10. Since it was the last batch, this was the value that the variable target.size(0) referred to after finishing the iteration, which ultimately caused the ValueError raise.Take-home message: Know thy dataset inside out ",
"Y ": "Know the dataset inside out , last bacth conatins only 4 labels it is excpecting 10 "
},
{
"X ": "nn embedding input indices meaning",
"Z ": "Each item of input, like 1, will be changed to its embeddings. 1 means Embedding layer's weight first row, like this:You can get the embed layer weight by embedding.weight.Search word2vectors to learn more.",
"Y ": "embedding.weight."
},
{
"X ": "low gpu utilization for sequence task",
"Z ": "Based on this thread, I found a way to eliminate the inner for loop using bmm. Profiling indicates this has removed a lot of work from the CPU (especially the backwards pass) and has resulted in a considerable speedup.",
"Y ": "Eliminate the inner loop with bmm"
},
{
"X ": "conflict between libtorch and grpc",
"Z ": "There is an issue about it and it’s not fixed yet: https://github.com/pytorch/pytorch/issues/14573. Currently the easiest way is to compile libtorch with the protobuf library that grpc uses, or compile grpc with the protobuf library that libtorch uses.",
"Y ": "Compiling libtorch with the protobuf library that grpc uses, or compiling grpc with the protobuf library that libtorch uses, is currently the easiest approach."
},
{
"X ": "multidimensional slice in c",
"Z ": "We don’t have it now, but we will add it (aka. numpy-style indexing) by the end of this year. ",
"Y ": "Curretly it is not avaiable"
},
{
"X ": "where is the implementation of tensor slice",
"Z ": "I believe it's implenented here ",
"Y ": "Use this link https://github.com/pytorch/pytorch/blob/3ad1bbe16a3c1d6bb9566f09229afd63022a82df/aten/src/ATen/native/TensorShape.cpp#L655"
},
{
"X ": "include directory structure",
"Z ": "I usually look at the cpp_extension include path to answer this.In particular, I don't think you should be using this particular include anymore. But I have to admit I don’t the rational behind this.",
"Y ": "us ethis link https: //pytorch.org/cppdocs/frontend.html#end-to-end-example 9"
},
{
"X ": "at cuda memory leak when loading model",
"Z ": "ASAN doesn’t work with CUDA, it’s a pretty well known problem. As you note, if you can just use CPU only functionality, you’ll be fine.",
"Y ": "ASAN doesn’t work with CUDA, use only CPU functionality"
},
{
"X ": "solved thread safety issue in torch load",
"Z ": "I think the error might be because I compiled in debug mode while using the release version of the torch library.",
"Y ": "I believe the error occurred because I used the release version of the torch library and compiled in debug mode."
},
{
"X ": "dataparallel output differs from its module",
"Z ": "Since the error is that low, I would still assume it‚'s still due to floating point precision.",
"Y ": "Error due to floating point precision "
},
{
"X ": "use 4 gpu to train model loss batch size batch size 4",
"Z ": "Thank you very much , I have solve this problem. There is something wrong with my dataloader function, when I load data, I use padding to process my data, but I forgot to turn list into tensor, as a result nn.Dataparallel to split data wrong in batch dim. ",
"Y ": " turn list to tensor "
},
{
"X ": "torch distributed class definitions",
"Z ": "ReduceOp is a C++ enum, and is exposed to the python interface using pybind (https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/init.cpp#L145). That enum is defined here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/Types.hpp#L8",
"Y ": "us this link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/Types.hpp#L8 "
},
{
"X ": "how to freeze feature extractor and train only classifier in distributeddataparallel",
"Z ": "Looks like I see the same issue with 1.1.0 and 1.2.0, although it seems to work 1.3 onwards. Could you try out a version = 1.3?",
"Y ": "Try with updated version 1.3"
},
{
"X ": "attention weights with multiple heads in nn multiheadattention",
"Z ": "This seems to be because the attention weights are averaged across all of the heads:github.com/pytorch/pytorch nn.MultiHeadAttention should be able to return attention weights for each head. opened 01: 18PM - 10 Mar 20 UTC ironcadiz enhancement module: nn oncall: transformer/mhatriaged ## üöÄ Feature ## Motivation Currently when using the `nn.MultiHeadAttention` ‚Ķlayer, the `attn_output_weights` consists of an average of the attention weights of each head, therefore the original weights are inaccessible. That makes analysis like the one made in this [paper](https: //arxiv.org/abs/1906.04341v1) very difficult.## PitchWhen the `nn.MultiHeadAttention` forward is called with `need_weights=True` (and maybe a second parameter like `nead_attn_heads=True`), `attn_output_weights` should be a tensor of size `[N,num_heads,L,S]`,with the weights of each head, instead of the average of size `[N,L,S]` (following the notation in the [docs](https: //pytorch.org/docs/stable/nn.html#multiheadattention))## Alternatives ## Additional context A small discussion about this subject with a potential solution was made [here](https: //discuss.pytorch.org/t/getting-nn-multiheadattention-attention-weights-for-each-head/72195) If you guys agree, I'll gladly make a PR.",
"Y ": "This seems to be because the attention weights are averaged across all of the heads."
},
{
"X ": "about torch autograd function",
"Z ": "Well, the rename to ctx is a good idea, but really, you would need to find a source for your shape.For example TorchVision’s roi align-function takes some more parameters (vision/roi_align_kernel.cpp at 0013d9314cf1bd83eaf38c3ac6e0e9342fa99683 · pytorch/vision · GitHub), maybe the forward should, too, and then assign them to ctx members.",
"Y ": "use this link https: //github.com/pytorch/vision/blob/0013d9314cf1bd83eaf38c3ac6e0e9342fa99683/torchvision/csrc/ops/autograd/roi_align_kernel.cpp#L111-L127"
},
{
"X ": "how to initialize tensors such that memory is allocated",
"Z ": "It seems like rand needs additional memory to generate the random numbers, but then uses similar memory to zeros.",
"Y ": "Rand appears to require additional memory to produce random numbers, but then consumes memory similar to zeros."
},
{
"X ": "will the same model input data twice retain the gradient information of the first input data",
"Z ": "Yes, it should (if total.backward()). Try to print and see if they are different?Since the backward is on total i.e loss1+loss2, the computation graph would include both 1,2 inputs.You could also refer the GAN tutorial where something similar is done",
"Y ": "use total.backward()"
},
{
"X ": "jit tried to access nonexistent attribute or method forward of type tensor",
"Z ": "Looks like inheritance is not supported Unable to call `super` method with TorchScript · Issue #42885 · pytorch/pytorch · GitHub",
"Y ": " Unable to call troch script with super method "
},
{
"X ": "pytorch with cuda 11 compatibility",
"Z ": "As explained here, the binaries are not built yet with CUDA11. However, the initial CUDA11 enablement PRs are already merged, so that you could install from source using CUDA11.If you want to use the binaries, you would have to stick to 10.2 for now.",
"Y ": "install the latest CUDA11 "
},
{
"X ": "unsupported format string passed to list format",
"Z ": "I know what is wrong, that I pass an array rather than a value, thank you!",
"Y ": "Pass Array"
},
{
"X ": "increasing data set size slows loss backward even though batch size is constant",
"Z ": "I think I have found the issue. I had wrongly assumed that the input data tensors needed requires_grad=True for proper training but after experimenting a little and setting requires_grad=False for the input data everything is running much faster and the network still learns. I guess only model.parameters() needs required_grad=True.",
"Y ": "use requires_grad=False"
},
{
"X ": "torch logsumexp returning nan gradients when inputs are inf",
"Z ": "It depends on your optimizer.If you don't have momentum/accumulated terms, then you can simply set these gradients to 0 and your optimizer won't change the values.If you have a fancy optimizer that will update the weights even for a 0 gradient, the simplest solution might be to save the original value of the weights before performing the step and then restoring them after the optimizer step.",
"Y ": "Simply mention gradients to 0"
},
{
"X ": "multiple calls to autograd grad with same graph increases memory usage",
"Z ": "Okay, nevermind. There was an extra backwards hook being added in the saliency code I copied. Clearing the data fixed the memory issue.Thanks for your help!",
"Y ": "remove backwards hook which was added and it will solve memory issuses"
},
{
"X ": "pass all parameters to optimizer instead of only nonfrozen parameters",
"Z ": "Hi,Assuming no gradients were computed for them before and their .grad field is None to begin with. Then the optimizer will just ignore them because they don't have any gradient (as the backward won't populate them).",
"Y ": "Gradient values are empty because .grad is None"
},
{
"X ": "how to calculate gradients correctly without in place operations for custom unpooling layer",
"Z ": "melike:I wrote it before learning that in-place operations should be avoided in PyTorch.You don’t have to avoid them. It is just that autograd does not support every combination of them and it will raise an error if you hit such case.So if your code runs without error, it means that autograd can handle this case just fine.The only concern I would have with such implementation is the slowdown due to the nested loops. But that’s unrelated to gradient correctness.",
"Y ": "You don't have to stay away from them. It's just that autograd doesn't support every combination of them, and if you do, you'll get an error.So, if your code runs without errors, autograd is capable of handling this situation.The only problem I have with such an implementation is the nested loops' slowness. However, this has nothing to do with gradient accuracy."
},
{
"X ": "grad fn get whole graph in dot",
"Z ": "Hi,This package will return a dot graph: https: //github.com/szagoruyko/pytorchviz The objects are re-used because the first one goes out of scope and is free. But later one, since you redo an allocation of the same size, the same memory is returned to you (many allocator do caching for allocations of the same size).",
"Y ": "use this link https: //github.com/szagoruyko/pytorchviz"
},
{
"X ": "gradient computation when using forward hooks",
"Z ": "Hi,I think the simplest way to understand what will happen here is to know that the autograd lives below torch.nn and is completely unaware of what torch.nn does.So in this case, whatever is the Tensor you give to the rest of the net is the one that will get gradients (it does not matter if it comes from a hook or not).And in this case, since A_hooked depends on A, then the gradients will flow back from A_hooked to A.",
"Y ": "The simplest way to comprehend what will happen here is to remember that the autograd lives beneath torch.nn and has no idea what torch.nn does.In this situation, the Tensor you offer to the remainder of the network is the one that gets gradients (it does not matter if it comes from a hook or not).Because A hooked is dependent on A, the gradients will flow back from A hooked to A in this situation."
},
{
"X ": "dataset for cnn regression",
"Z ": "Hi @mattbevWelcome to the PyTorch community! You can consider object counting datasets, the idea is that object counting can be formulated as a regression problem. Here are some links:Visual Geometry Group - University of Oxford Object Counting | Papers With Code [2008.12470] Counting from Sky: A Large-scale Dataset for Remote Sensing Object Counting and A Benchmark MethodCrowd Counting | Kagglhttp: //visal.cs.cityu.edu.hk/static/pubs/conf/cvpr08-peoplecnt.pdf Hope this helps!",
"Y ": "You can consider object counting datasets, the idea is that object counting can be formulated as a regression problem. "
},
{
"X ": "is it better to set batch size as a integer power of 2 for torch utils data dataloader",
"Z ": "Powers of two could be preferred in all dimensions, so number of channels, spatial size etc.However, as described before, internally padding could be used, so that you wouldn' hit a performance cliff and should thus profile your workloads.",
"Y ": "In all dimensions, such as channel count, spatial size, and so on, powers of two may be preferred.However, as previously described, internally padding could be used to avoid a performance cliff, and you should thus profile your workloads."
},
{
"X ": "runtimeerror function addbackward0 returned an invalid gradient at index 1 expected type torch floattensor but got torch cuda floattensor",
"Z ": "I solved the problem. One variable which i was initializing within the loss function by the name ,processed‚was not being put on cuda.Thing to keep in mind for these problems is that some variable is not deployed on GPU or CPU whichever device you are using. So a shortcut is to put every single variable to GPU or CPU whichever device you are using by calling variable.to(device) function.",
"Y ": "One variable with the name,processed, which I was initialising within the loss function, was not being put on cuda.The important thing to remember for these issues is that some variables are not deployed on the GPU or CPU, whichever device you are using. As a result, a shortcut is to call the variable.to(device) function to assign every variable to the GPU or CPU, depending on which device you're using."
},
{
"X ": "where should i look to solve running mean error in resnet transfer learning",
"Z ": "The original resnet's first convolution out channel is 64, but you are using 128. Thus it does not work with the next batch norm as well as following layers.Please use self.conv1 = nn.Conv2d(1,64, kernel_size=(7,7), bias=False); self.inplanes = 128 or you have to change the entire network.",
"Y ": "use self.conv1 = nn.Conv2d(1,64, kernel_size=(7,7), bias=False); self.inplanes = 128 "
},
{
"X ": "save output image of cnn model",
"Z ": "It is possible but I am not sure if it'ss the best way to go for your problem.From what I understand you only want to reconstruct the RGB image from the output, am I right? If yes, do you know what each channel of your output represents? Isn’t one of the channels the edge map?In case you want to do that, you can either change the output shape of your 3rd conv2d or add another layer with the input channel of your last layer and your desired output dimension. But you may need to adjust your lost function as well depending on what loss function you are using.",
"Y ": "either change the output shape of your third conv2d or add another layer with the input channel of your previous layer and your desired output dimension. However, depending on the loss function you are using, you may need to adjust it as well"
},
{
"X ": "need feature maps of resnet50",
"Z ": "I usually use forward hooks as described here, which can store the intermediate activations. You could then pass these activations to further processing.",
"Y ": "Use forward hooks "
},
{
"X ": "tensors are at different cuda devices",
"Z ": "Solution by Yanai Elazar:You can define an environment variable like this: CUDA_VISIBLE_DEVICES=1 This way, only this gpu will be available for the running program, and you won’t leak into other gpus. This way in the code you need to run on a single gpu, and not specify one specifically.",
"Y ": "CUDA_VISIBLE_DEVICES=1"
},
{
"X ": "simple rnn stuck around the mean",
"Z ": "Are you making use of the hidden state? Maybe you could use the lstm like it'ss done in this tutorial: https: //pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html",
"Y ": "use this link https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html"
},
{
"X ": "pad packed sequence export to onnx",
"Z ": "I was able to solve this by creating my own packing and unpacking methods to use during export. From what I understand, exporting to ONNX does not support creating your own instance of PackedSequence. I submitted an issue to Pytorch.",
"Y ": "Use your own method of packing and unpacking urong export "
},
{
"X ": "a very strange phenomenon i met in training machine translation",
"Z ": "It is not necessary that your loss should decrease for every batch within each epoch (it can go up for different batches), but it should decrease across different epochs.If your loss is not decreasing across different epochs, learning rate could be a problem",
"Y ": "Learning rate could be problem"
},
{
"X ": "pytorch chatbot loss function with ignore index instead of targets padding mask",
"Z ": "I figured out what was wrong with my model. It turned out that despite that my loss function returned some reasonable values, the loss was not calculated properly, thus as a consequence model did not learn. Output from my AttentionDecoder was softmaxed, then I used CrossEntropyLoss or NLLLoss (tried them both), but I did not change the softmax to log_softmax in case of NLLLoss, or in case of using CrossEntropyLoss I did not get rid of softmax at all as CEL comprised of log_softmax and NLLLoss.",
"Y ": "Calluate the loss properly "
},
{
"X ": "do we need to set a fixed input sentence length when we use padding packing with rnn",
"Z ": "The RNN see each word, i.e., a vector of size 5, step by step. If there are 6 words, the RNN sees 6 vectors and then stops. Same with 8 words. Your confusion might stem that LSTM or GRU hides this step-wise processing. You give the model a sequence of a certain lengths, but internally the model loops over the sequence. More words just means more loops before it’s finished.Obviously, things get problematic with batches if the sequences in a batch have different lengths. One default solutions is to pad all short sequences to the length of the longest sequence.The size/complexity of the model (the number of neurons of you will, but it’s better to think in number of trainable parameters) of the LSTM/GRU depends on: the size of the input (e.g.,5 in your example) the size of the hidden dimension number of layers in case of a stacked LSTM/GRU whether you use uni- or bidirectional.It does not depend on the sequences lengths. Sure, the processing takes more time for longer sequences.",
"Y ": "It does not depend on the sequences lengths. Sure, the processing takes more time for longer sequences"
},
{
"X ": "pytorch c api missing headers",
"Z ": "I just went through the trouble of upgrading to the most recent stable version which I got from the home-page (The one that I downloaded from the installation help page was nightly version which did not compile). I checked with the source code on github that it indeed contained the Pooling functions. Glad to confirm that it does indeed work now. Thanks for the help!",
"Y ": "Update to the lastet version "
},
{
"X ": "about torchvision for c frontend",
"Z ": "Indeed, torchvision C++ support isn't matching Python support.However, @ShahriarSS is doing some good work on it, so the gap is getting smaller.My guess is that most people use OpenCV to do transforms or do them manually. (Personally, I incorporated things like “normalizing” into the traced/scripted model last time I did this.)Best regardsThomas",
"Y ": "use Opencv , torchvision c++ is not matching python support currently "
},
{
"X ": "why gpu memory allocations are associated with the cuda stream",
"Z ": "As @albanD wrote, limiting CUDA allocations to a single stream reduces the number of CPU-GPU synchronizations necessary. CUDA kernels are asynchronous, so when an allocation is “freed” the kernel may not be finished (or may not have even started). Reusing the same allocation in a different stream could cause memory corruption because work in that stream may start before previously launched work in the original stream finishes.It’s safe to immediately reuse the allocation in the same stream because operations within a stream are ordered sequentially. This is the strategy the caching allocator uses.The CUDA memory API handles this differently: The cudaFree call synchronize all streams – the CPU waits until all streams finish all outstanding work before the cudaFree call completes. This ensures that subsequent uses of the memory are guaranteed to happen after the previous uses finish. However, this makes cudaFree a relatively expensive call. The primary goal of the caching allocator is to avoid this type of synchronization.",
"Y ": "The CUDA memory API takes a different approach: The cudaFree call synchronises all streams – the CPU waits until all streams have completed all outstanding work before completing the cudaFree call. This ensures that subsequent uses of the memory will take place after the previous ones have finished. CudaFree, on the other hand, is a relatively expensive call. The caching allocator's primary goal is to avoid this type of synchronisation."
},
{
"X ": "c time sequence prediction py slow",
"Z ": "Error in test codes",
"Y ": "Error in the test code"
},
{
"X ": "model weights are not moved to the gpu",
"Z ": "I did find my problem. It was a rather unspectacular error.i forgot registering my layers with register_module(). When adding them i got the expected results ",
"Y ": "Register layers with register_module()"
},
{
"X ": "unable to install torchvision",
"Z ": "Likewise, you should select Release over Debug in the VS GUI.",
"Y ": "Use updated version "
},
{
"X ": "libtorch glog doesnt print",
"Z ": "Maybe you can try it: add add_definitions(-DC10_USE_GLOG) in your project‚s cmakelists.txt.",
"Y ": " AAdd add_definitions(-DC10_USE_GLOG"
},
{
"X ": "runtimeerror stop waiting response is expected",
"Z ": "The error has been fixed.‚Stop_waiting response is expected‚error occurred in TCPStore.cpp. So it was actually the communication problem. It works finally when I reinstalled NCCL: https: //github.com/NVIDIA/nccl.git",
"Y ": "Reinstall NCCL using this link https://github.com/NVIDIA/nccl.git"
},
{
"X ": "torch nn parallel data parallel for distributed training backward pass model update",
"Z ": "Yes the locking is builtin and the weights will properly be updated before they are used.",
"Y ": "ocking is bultin function and weights will be updated accordingly "
},
{
"X ": "why is float tensor addition on cpu slower for avx2 than the default aten cpu capability",
"Z ": "Resolved at On CPU, vectorized float tensor addition might be slower than unvectorized float tensor addition · Issue #60202 · pytorch/pytorch · GitHub.Basically, memory allocation & zero-filling costs are worse for AVX2.",
"Y ": "Due to memory allocation on CPU "
},
{
"X ": "pytorch in place operator issue in numpy conversion",
"Z ": "id() is inappropriate because python objects are not value objects, i.e. they link to other objects, and you just have multiple links here (see .storage().data_ptr() to reason about address identities)",
"Y ": "see .storage().data_ptr() "
},
{
"X ": "customdataset give me error",
"Z ": "Thank you, I could not find these subtle typo bug int. I actually meant init. many thanks",
"Y ": "check init "
},
{
"X ": "is the sgd in pytorch a real sgd",
"Z ": "Ok perfect, that was exactly what I thought. Actually, they should be named Stepper. For example with SGD that will be ‚SGDStepper. That seems more clear.",
"Y ": "It shoul dbe SGDStepper"
},
{
"X ": "runtimeerror number of dims dont match in permute",
"Z ": "alicanakca:mask’s shape is torch.Size([256,256]).This is the issue – the mask is 2-dimensional, but you’ve provided 3 arguments to mask.permute().I am guessing that you’re converting the image from h x w x c format to c x h x w. However, looks like the mask is only in an h x w format.",
"Y ": "This is the problem: the mask is two-dimensional, but you've given it three arguments. permute().I'm assuming you're converting the image from h x w x c to c x h x w. However, it appears that the mask is only in h x w format."
},
{
"X ": "in pytorch is there pdf logpdf function for distribution",
"Z ": "https://pytorch.org/docs/master/distributions.html?highlight=distributions#module-torch.distributions It looks like probs() and log_probs() are what you’re looking for",
"Y ": "use this link https: //pytorch.org/docs/master/distributions.html?highlight=distributions#module-torch.distributions"
},
{
"X ": "how to create computational graphs for updated parameters",
"Z ": "Hi,You might want to take a look at the higher library that is built to do just that.",
"Y ": "There are sepearte library to do that "
},
{
"X ": "neat way of temporarily disabling grads for a model",
"Z ": "Hi,I don't think there is any update. The for loop is simple and is the most efficient thing that can be done here.Especially with your special logic of things already not requiring gradients, that would be tricky.Note that you can add a method to your q_model module yourself to do that to make it a bit cleaner.",
"Y ": "I don't believe there has been an update. The for loop is straightforward and the most efficient option here.That would be tricky, especially with your special logic of things already not requiring gradients.To make it a little cleaner, you can add a method to your q model module yourself."
},
{
"X ": "why autograd will accumate gradients",
"Z ": "You could simulate a larger batch size by accumulating the gradients of smaller batches and scaling them with the number of accumulations. This can be useful e.g. if the larger batch size would be beneficial for training but doesn’t fit onto your GPU.Accumulating the gradients gives you the ability to scale them manually afterwards without enforcing any assumptions on your use case.",
"Y ": "Accumulating the gradients gives you the ability to scale them manually afterwards without enforcing any assumptions on your use case."
},
{
"X ": "complex functions exp does not support automatic differentiation for outputs with complex dtype",
"Z ": "Hi,In preparation for the 1.7 release and to avoid issues, we added error messages for all the functions that were not yet audited for complex autograd.We are working on auditing the formulas and re-enabling them.cc @anjali411 do we have an issue describing the process if people want to help here?",
"Y ": "use the latest version 0f 1.7 "
},
{
"X ": "optimizing parameters of function generating convolution kernel instead of raw weights",
"Z ": "You should probably use nn.functional.conv2d, with it you can use any tensor as kernel .",
"Y ": "use nn.functional.conv2d "
},
{
"X ": "where is the actual code for layernorm torch nn functional layer norm",
"Z ": "You can find the (CPU) C++ implementation here.",
"Y ": "use this link for CPU C++ implementation https://github.com/pytorch/pytorch/blob/392abde8e64b0d91b7d52aecee8dce9aff8d0b2f/aten/src/ATen/native/layer_norm.cpp "
},
{
"X ": "how can i apply l2 l1 loss with 3d voxels",
"Z ": "You can directly apply both mentioned losses, as they would expect the model output and target to have the same shape, which is the case for your use case.Unfortunately, I not sure how SSIM can be used for your use case, but if I‚m not mistaken the original implementation uses 2D convs internally, so you might change it to 3D ones.",
"Y ": "You can directly apply both mentioned losses "
},
{
"X ": "different init for training ensembles",
"Z ": "Setting the seed at the beginning of the script would make the pseudorandom number generator output deterministic “random” values. Creating multiple models in the same script would thus also create different parameters, since the sequence of the random number generation is defined by the seed, but the values won’t be the same.",
"Y ": "Setting the seed at the start of the script would cause the pseudorandom number generator to produce deterministic “random” values. Because the sequence of the random number generation is defined by the seed, creating multiple models in the same script would result in different parameters, but the values would not be the same."
},
{
"X ": "torch lstsq output size incorrect",
"Z ": "answered here: torch.lstsq returns wrong tensor size · Issue #56833 · pytorch/pytorch · GitHub",
"Y ": "use this link https://github.com/pytorch/pytorch/issues/56833"
},
{
"X ": "on the fly image rotation cpu bottleneck",
"Z ": "Have you tried alternative rotation implementations (e.g., skimage’s rotate or albumentations’s rotate)?Albumentations in particular claims to be very fast for rotation: benchmark.",
"Y ": "try Albumentatuins "
},
{
"X ": "my program stops at loss backward without any prompt in cmd",
"Z ": "I I tried to run my program on linux platform, and it ran successfully.Therefore, it is very likely that it is caused by different osPrevious os win 10",
"Y ": "Due to different OS "
},
{
"X ": "creating input for the model from the raw text",
"Z ": "Or, you could load your data with a new torchtext abstraction. Text classification datasets, mentioned by you, follow the same new abstraction. It should be very straightforward to copy/paste and write your own pipeline link.",
"Y ": "load data with torchtext abstraction "
},
{
"X ": "model before after loading weight totally different",
"Z ": "@ptrblck I found finally the issue. It cames when I tried to compute the gradient with the backward() function. I forgot to use amp.scale_loss. But it makes a weird behaviors because the training works well, until I load again the checkpoint Problem solved !",
"Y ": "use amp.scale_loss.' "
},
{
"X ": "cudaextension for multiple gpu architectures",
"Z ": "Apparently it was some kind of problem with an old cached version works now ",
"Y ": "update to new version or remove the cache of old version "
},
{
"X ": "edge case with register hook",
"Z ": "Ho right.The thing is that your hook actually waits on the other backward to finish because it waits on the the other thread.The thing is that because the hook is blocked waiting on this, another thread cannot use run backward (this current thread can though).So you either want to run this other backward in the same thread as the hook. Or not block the hook waiting on that backward.",
"Y ": "The problem is that your hook actually waits on the other thread to finish because it is dependent on it.The problem is that because the hook is blocked while waiting for this, another thread cannot run backward (this current thread can though).So you'll either want to run this other thread backwards in the same thread as the hook, or you'll want to run it forwards in a different thread. Or, alternatively, do not block the hook while waiting on that backward."
},
{
"X ": "network in q learning is predicting the same q values for all states",
"Z ": "Normalizing the input on a scale 0-1 instead of -1 to 1 solved this issue.",
"Y ": "Scale to 0-1 instead of -1 to 1"
},
{
"X ": "how do i map joblibs parallel function to pytorchs distributeddataparallel",
"Z ": "use torch.multiprocessing.pool",
"Y ": "use torch.multiprocessing.pool"
},
{
"X ": "how to get the batch dimension right in the forward path of a custom layer",
"Z ": "pytorch .dot function is different from tensorflow or numpy",
"Y ": "use pytorch .dot function "
},
{
"X ": "unexpected key in state dict bn1 num batches tracked",
"Z ": "I manage to solve the problem with following link How to load part of pre trained model? @apaszke post.",
"Y ": "use this link https://discuss.pytorch.org/t/how-to-load-part-of-pre-trained-model/1113/2"
},
{
"X ": "allow size mis match in autograd forward vs backward",
"Z ": "True, but the memory would be an issue.I’m not sure to see why.Currently, you already have a x → M → z → PADDING → z_pI think you want (x, z_g) → M_AND_PADDING → z_pAnd in that new custom Function, you don’t need to do anything beyond what the padding is currently doing.",
"Y ": "use this x, z_g) → M_AND_PADDING → z_p"
},
{
"X ": "gradients exist but weights not updating",
"Z ": "Hi,When you get the parameters of your net, it does not clone the tensors. So in your case, before and after contain the same tensors. So when the optimizer update the weights in place, it updates both your lists. You can try and change one weight by hand, they will still remain the same.",
"Y ": " Try changing the weight"
},
{
"X ": "variables are not updated after loss backward and optimizer step",
"Z ": "Finally, and after 5 days, I found the error.In fact, the computational graph was broken into two different places, due to two wrong operations. However, it was very difficult to debug it and find the issue source. No tools or Libs exist to visualize the graph, which is the main component for the gradient backpropagation.",
"Y ": "In fact, the computational graph was broken into two different places, due to two wrong operations."
},
{
"X ": "the second order derivative of a function with respective to the input",
"Z ": "Hi,The problem is that your function is linear. So the first gradient is constant and the second order gradient is independent of the input.This error message happens because of the independence (and thus, it is not used in the graph).",
"Y ": "The problem is that your function is linear. So the first gradient is constant and the second order gradient is independent of the input."
},
{
"X ": "loss backward time increases for each batch",
"Z ": "Could you check if you might be running out of memory and your system might be using the swap?",
"Y ": "Check Memory "
},
{
"X ": "weight of layer as a result of dot operation",
"Z ": "Use the functionals instead of the convolutional module. Functionals takes weights as inputs.",
"Y ": "Use the functionals instead of the convolutional module. Functionals takes weights as inputs."
},
{
"X ": "runtimeerror mat1 dim 1 must match mat2 dim 0 cnn",
"Z ": "It looks like you are already printing the shape so you should be able to see what N, and D are here.Flatten can work, but rather than reshaping to (-1, something), you should reshape to (batch_size,-1).",
"Y ": "use Flattenand reshape (batch_size,-1)"
},
{
"X ": "error on torch load pytorchstreamreader failed",
"Z ": "Ok, Im able to load the model. The problem was with the saved weight file. It wasn't saved properly and the weight file size was smaller (only 90 MB instead of 200 MB).",
"Y ": "Save the file size properly"
},
{
"X ": "the code that was working previously gets stuck at loading the checkpoint file that is cached on system",
"Z ": "hmm it was very weird. I reboot the machine and then I ran it again and it worked.",
"Y ": "Reboot the machine"
},
{
"X ": "nn transformerencoderlayer 3d mask doesnt match the broadcast shape",
"Z ": "Solution: Upgrade to PyTorch 1.5",
"Y ": "Upgrde the version "
},
{
"X ": "solved runtimeerror expected object of device type cuda but got device type cpu for argument 2 mat2 in call to th mm",
"Z ": "Okay, i just solved the problem by myself, the reason of this is the Attn() function which i wrote outside the model class as another def() function, and the Attn() function will not be moved to the GPU, so I create a new nn.Module class for Attn and i wrote : self.attn = Attn(hidden_size) in the model.",
"Y ": "use this self.attn = Attn(hidden_size) "
},
{
"X ": "my implementation of self attention",
"Z ": "I can't believe I made this silly mistake in verson1 queries are outputted from w_v, instead of w_q.",
"Y ": "queies are outputted from w_v"
},
{
"X ": "resume training validation loss going up increased",
"Z ": "Thank you sir, this issue is almost related to differences between the two datasets.",
"Y ": "use same datasets "
},
{
"X ": "lstm text generator repeats same words over and over",
"Z ": "Okay, it was actually a stupid mistake I made in producing the characters with the trained model: I got confused with the batch size and assumed that at each step the network would predict an entire batch of new characters when in fact it only predicts a single one Yikes!Anyways, thanks for your advice and see if I can use it to fine tune the results a bit!",
"Y ": "check on each batch outputs"
},
{
"X ": "what is the exactly implementation of torch embedding",
"Z ": "It should eventually call into this method for the forward pass.",
"Y ": "It should eventually call into this method for the forward pass"
},
{
"X ": "how would i do load state dict in c",
"Z ": "The current implementation of load_state_dict is in Python, and basically it parses the weights dictionary and copies them into the model's parameters.So I guess you'll need to do the same in CPP.",
"Y ": "it is inpython and it will be same for cpp also "
},
{
"X ": "compiler c not compatible with the compiler pytorch was built",
"Z ": "@MauroPfister ArchLinux‚s compiler does follow the rolling base of GNU, I would say they should be fully compatible. The reason we still give warning is that ArchLinux is a independent linux distribution, their software might contains their own Proprietary software and is not endorsed by the GNU project.",
"Y ": " ArchLinux is a independent linux distribution"
},
{
"X ": "unable to access my nets parameters",
"Z ": "try class ConvNet : public torch: :nn: :Module",
"Y ": "use class ConvNet : public torch: :nn: :Module"
},
{
"X ": "converting simple rnn model from python to c",
"Z ": "Sorry I was giving you this link : https://github.com/prabhuomkar/pytorch-cpp/tree/master/tutorials/intermediate/recurrent_neural_network By mistake I have given u the wrong link.",
"Y ": " use this link https: //github.com/prabhuomkar/pytorch-cpp/tree/master/tutorials/intermediate/recurrent_neural_network "
},
{
"X ": "using nn module list in c api",
"Z ": "@Aaditya_Chandrasekha We have a simple instruction in the comment here: https: //github.com/pytorch/pytorch/blob/cd0724f9f1b57dae12be2c3fc6be1bd41210ee88/torch/csrc/api/include/torch/nn/modules/container/modulelist.h#L11 We have tests here, it contains more examples. https: //github.com/ShahriarSS/pytorch/blob/678873103191c329e2ca4a53db1d398599ad9443/test/cpp/api/modulelist.cpp",
"Y ": "use this link https://github.com/pytorch/pytorch/blob/cd0724f9f1b57dae12be2c3fc6be1bd41210ee88/torch/csrc/api/include/torch/nn/modules/container/modulelist.h#L11 "
},
{
"X ": "gradient clipping in pytorch c libtorch",
"Z ": "The usage look correct and is also used in this way in this test.",
"Y ": " Same eay implement in test"
},
{
"X ": "futex wait hang",
"Z ": "Hi, No this is expected. Half of them are OMP worker thread and one of them is an autograd engine worker thread. These are worker threads that are kept around so that we don't have to recreate them every time we need them. OMP does that by default and we do it ourselves as well in the autograd engine.",
"Y ": "OMP and autograd does it by default "
},
{
"X ": "can we split a large pytorch built in nn module to multiple gpu",
"Z ": "Hi, I'm afraid we don't provide any construct to do this automatically. But you can simply create 8 different Linear that each take a subset of the input and split the input yourself and call each of these Linears and then add all the results (assuming your split on the input size here given that it is the biggest).",
"Y ": " you can simply create 8 different Linear that each take a subset of the input and split the input yourself and call each of these Linears and then add all the results"
},
{
"X ": "sharing model between processes automatically allocates new memory",
"Z ": "It turns out that every-time a process holds any pytorch object that is allocated on the GPU, then it allocates an individual copy of all the kernels (cuda functions) that pytorch uses, which is about 1GB. It seems there is no way around it, and if your machine has Xgb of GPU RAM, then you're limited to X processes. The only way around it is dedicating one process to hold the pytorch module and act with the other processes in a producer-consumers pattern, which is a real headache when it comes to scalability and much more for RT application .",
"Y ": "The only way around it is dedicating one process to hold the pytorch module and act with the other processes in a producer-consumers pattern, which is a real headache when it comes to scalability and much more for RT application"
},
{
"X ": "how to split a pretrained model for model parallelism",
"Z ": " Do I also need to change this or does this ‚to work with nn.sequential (no separate forward function) as well? ‚towould work on nn.sequential, although you need to modify the forward function since once you have completed execution for the module on GPU0, the output will be on GPU0. Now since the other module you want to execute is on GPU1, you need to move the output from GPU0 to GPU1 manually (using .to) and then you need to execute the module on GPU1.",
"Y ": " use nn.sequential and .to to move output from one GPU to another "
},
{
"X ": "build pytorch gpu for different gpu archs",
"Z ": "I‚ve answered in the GitHub issue.",
"Y ": "answered in the GitHub issue"
},
{
"X ": "confused about distributed data parallel behavior",
"Z ": "Hi,Could you try torch.cuda.set_device() instead, torch.cuda.device is a context manager, also see https: //github.com/pytorch/pytorch/issues/1608",
"Y ": "use this torch.cuda.set_device() "
},
{
"X ": "loss calculation within batch iteration",
"Z ": "This problem has been resolved. I derived a bit and figured those two loss calculation approaches are essentially the same.",
"Y ": "The two loss calualtion approaches are same "
},
{
"X ": "best way to handle variable number of inputs",
"Z ": "Why not use *args and **kwargs?",
"Y ": "use *args and **kwargs"
},
{
"X ": "model to cpu does not release gpu memory allocated by registered buffer",
"Z ": "you cannot delete the CUDA context while the PyTorch process is still runningClearing the GPU is a headache vision No, you cannot delete the CUDA context while the PyTorch process is still running and would have to shutdown the current process and use a new one for the downstream application.",
"Y ": "No, you cannot delete the CUDA context while the PyTorch process is still running and would have to shutdown the current process and use a new one for the downstream application."
},
{
"X ": "implementing a custom convolution using conv2d input and conv2d weight",
"Z ": "Hi, This OOM exception comes from the python api implement of conv2d_weight actually. In backprop weight calculation, the output gradients need to be expanded with output channel times. When default cudnn implement this with data prefetch block and block (not allocate more memory), python api uses a repeat that will allocate a huge size of memory on output gradients tensor with unnecessary duplication of data. you can easily fix this by convert the repeat into a loop function at conv2d_weight.",
"Y ": "convert into a loop function at conv2d_weight"
},
{
"X ": "why criterion cuda is not needed but model cuda is",
"Z ": "The impact of moving a module to cuda is actually to move all it'ss parameters to cuda. Criterion don't have parameters in general, so it is not necessary to do it.",
"Y ": " Critertion don't have parameters but cuda has parameters"
},
{
"X ": "debugging memory allocations in torch autograd grad",
"Z ": "Hi, You can enable anomaly mode. That will show you the forward op that corresponds to the one that is failing in the backward. Can you share this trace?",
"Y ": "enable anomly mode "
},
{
"X ": "integrated gradients for rnns",
"Z ": "Hi,You won't be able to get gradients wrt to the input of the embedding layer I a'm afraid. Since, as you pointed out, they are not of contiguous dtype. You might want to do use that technique on the output of the embedding layer instead?",
"Y ": "use the technique on the output of the embedding layer"
},
{
"X ": "minibatch size by iteration",
"Z ": "Your code looks correct, but you might want to divide the accumulated loss by the number of accumulation steps. Also, here is a nice overview of different approaches in case you want to trade compute for memory etc.",
"Y ": "Divide the accumulated loss by the number of accumulation steps"
},
{
"X ": "getting cant export a trace that didnt finish running error with profiler",
"Z ": "Solved: The print(prof) line should be outside the with block.",
"Y": "print(prof) should be outside the block "
},
{
"X ": "pytorch lightning number of training and validation batches",
"Z ": "I think this is the total number of batches (training + validation). Best regardsThomas",
"Y ": "total number of batches (training + validation)"
},
{
"X ": "how to load imagenet",
"Z ": "The validation set for ImageNet has 50,000 images or 50 per each of the 1,000 classes. If you don't shuffle the data then the expectation indeed is that you only see two classes for a batch size of 100.",
"Y ": " Shuffle the data "
},
{
"X ": "masking out locations in convolutional kernels",
"Z ": "I don't see any obvious problems here but you can do some simple tests like running your layer on a ones tensor input and checking that the results are what you expect based on the mask. If you are using batchnorm layers after the convolution, you can avoid the bias term entirely as it will be effectively undone by the batchnorm. Additionally, I don't think the bias is applied before the convolution, so it shouldn't be affected by (or affect) the mask that you are using.",
"Y ": "you can do some simple tests like running your layer on a ones tensor input and checking that the results are what you expect based on the mask"
},
{
"X ": "rewriting a crnn model with the same architecture gives different results than the original",
"Z ": "First thing, good job to simplify the stuff you find on the internet. I often do this, too, when I need to look at code from others.You are not using the same weights with this. The random init for the second will be different than the one for the first because you not re-seeding after instiatiating the first.In this case, if I copy the manual seed to before the second network is instantiated, I actually do get the same results.Now this is also good luck because apparently you are creating the modules in the exact same order in both networks. This can easily break through refactoring and in this case it is safer to try to copy the state_dict of one of them to the other (take the state dict, rename the keys as needed, load into the other model) to compare. For things with batch norm, one also needs to keep in mind that running it updates the running statistics in training mode.Best regardsThomas",
"Y ": ""
},
{
"X ": "dynamically replacing the last linear layer",
"Z ": "Sorry for not really answering your question, but you might want to test the training on the CPU first. Here the error messages are most of the time more useful than CUDA errorsApart form that, you don’t really replace the last linear layer. You simple have multiple linear layers and choose one dynamically, which is essentially the idea behind multitask learning. And from a quick look at your code, it seems alright. But I didn't check any details.What’s the error when running in the CPU?",
"Y ": "try running on CPU "
},
{
"X ": "unsure of output dimension and loss type newbie",
"Z ": "You might want to look a this post, it seems very related. The link Udacity tutorial is also exactly about a character RNN.",
"Y ": "Check character RNN"
},
{
"X ": "improving nmt model outputs",
"Z ": "Rare words or out-of-vocabulary words are a fundamental challenge for NMT. You still find very recent academic papers addressing this.For example, for a very simple NMT task, I used an off-the-shelf NER system to replace, say, person names. So 2 sentences “I met Alice” and “I met Bob” would be converted to I met ; same for the target sentences. After the translation, I would simple replace with the actual name. Replacing numbers with would also be very easy with a RegEx. It worked fine enough for my use case, but its probably too naive for the general case.",
"Y ": "Replacing numbers with would also be very easy with a RegEx. but its probably too naive for the general case "
},
{
"X ": "implement a keras model using pytorch doesnt learn",
"Z ": "Problem identified, the data need to be shuffled in train loader.",
"Y ": "the data need to be shuffled in train loader"
},
{
"X ": "transformer mask doesnt do anything",
"Z ": "I figured out the problem, I was not properly inserting SOS and EOS tokens, so even with proper masking it was able to copy straight from the given target.",
"Y ": "Insert SOS and EOS tokens properly "
},
{
"X ": "cant use from blob to construct tensor on gpu in c",
"Z ": "@farmersrice check this issue.https: //github.com/pytorch/pytorch/issues/15426, I think our document need update.You can not copy memory from CPU to GPU directly. Your temp[] is not on GPU.I think you have to use .to(device) at this point.",
"Y ": "use this link https: //github.com/pytorch/pytorch/issues/15426‚ "
},
{
"X ": "error in cmake while setting up libtorch",
"Z ": "cudnn version might not be found in cudnn.h. In the cuda.cmake change cudnn.h to cudnn_version.h and caffe2 is able to find the cudnn version.",
"Y ": "change cudnn.h to cudnn_version.h"
},
{
"X ": "how to mask tensor with boolean using c api how to achieve this python code with c api",
"Z ": "use masked_scatter function can do that",
"Y ": "use masked_scatter"
},
{
"X ": "libtorch ubuntu runtime error",
"Z ": "@yf225 sorry.It was my mistake.It happened because of a file path error",
"Y ": " use correct path address"
},
{
"X ": "during deserialization torch load fails at debug while it works fine in release mode unhandled exception at 0x00007fff7de1a308 in test exe microsoft c exception c10 error at memory location 0x000000cdee5bd950 occurred",
"Z ": "The reason was the debug version of the lib was missing! (again!) moving the needed libs next to the executable fixed the issue (the release versions were added to the PATH, so at runtime, it would pick the release version and boom!)",
"Y ": "update to latest version of build "
},
{
"X ": "per tensor channel quantization equivalents in pytorch caffe2",
"Z ": "Unfortunately, Caffe2 Int8Conv doesn’t support per-channel quantization. The DNNLOWP engine that uses FBGEMM backend does support group-wise quantization if that helps you. Please see https://github.com/pytorch/pytorch/blob/master/caffe2/quantization/server/conv_groupwise_dnnlowp_op_test.py for example of using group-wise quantization.",
"Y ": "use this link https://github.com/pytorch/pytorch/blob/master/caffe2/quantization/server/conv_groupwise_dnnlowp_op_test.py "
},
{
"X ": "quantized squeeze block mobilenetv3",
"Z ": "You can actually try to comment out the two lines as https://github.com/pytorch/pytorch/pull/30442, since the tensor iterator supports broadcast.",
"Y ": "use this link https://github.com/pytorch/pytorch/pull/30442, since the tensor iterator supports broadcast "
},
{
"X ": "cannot quantize nn conv2d with dynamic quantization",
"Z ": "Hi @babak_hss, Dynamic quantization is currently supported only for nn.Linear and nn.LSTM, please see: https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic",
"Y ": " for nn.Linear and nn.LSTM Dynamic quantization is currently supported . use this link https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic "
},
{
"X ": "when quantized max pool2d is used",
"Z ": "Yes, that is correct. it is dispatch here: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Pooling.cpp#L128 We have multiple ways to do dispatch right now in PyTorch, one common place is in native_functions.yaml, you can take a look at: https: //github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/README.md",
"Y ": "use tis link https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Pooling.cpp#L128"
},
{
"X ": "assertionerror torch nn quantized relu does not support inplace",
"Z ": "should be fixed in https://github.com/pytorch/pytorch/pull/33105, cc @raghuramank100",
"Y ": "use this link https://github.com/pytorch/pytorch/pull/33105"
},
{
"X ": "conv2d unpack and conv2d prepack behavior",
"Z ": "Bias is kept in fp32 format for eager mode quantization and dynamically quantized while computing quantized FC/Conv. It’s returned in fp32 because that’s how it’s passed in to an operator as well. The reason for keeping bias in fp32 is the unavailability of input scale until the operator has executed so we can’t quantize bias until then. To convert bias to quantized format, use input_scale * weight_scale with a zero_point = 0. See this https: //github.com/pytorch/FBGEMM/blob/master/include/fbgemm/OutputProcessing-inl.h#L104-L108 code for converting bias with act_times_weight scale. Check out the code in https: //github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp file for prepack function. If USE_FBGEMM is true, fbgemm_conv_prepack function is called for doing prepacking.",
"Y ": "check this code https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp "
},
{
"X ": "net in dataparallel make training aware quantization convert model acc error",
"Z ": "There are currently some issues with nn.DataParallel and Quantization Aware Training. There is a WIP PR to fix it - https://github.com/pytorch/pytorch/pull/37032 You can follow the toy example here to make sure you're following the steps for QAT correctly https: //gist.github.com/vkuzo/78b06c01f23f98ee2aaaeb37e55f8d40",
"Y ": " follow the steps from this link https://gist.github.com/vkuzo/78b06c01f23f98ee2aaaeb37e55f8d40"
},
{
"X ": "construct quantized tensor from int repr",
"Z ": "we do have some non-public API to do this: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/native_functions.yaml#L3862 and https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/native_functions.yaml#L3868 but they we might change the API when we officially release quantization as a stable feature.",
"Y ": "use this link https: //github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/native_functions.yaml#L3862 "
},
{
"X ": "tied conv1d and conv transpose1d not geting the same result as the input",
"Z ": "I think I misunderstand the ,tied weight concept.I wrote the conv_transposed1d in doubly block circulant matrix form and I find that one don't need to flip the temporal axis actually.Suppose the conv1d's matrix is and the corresponding conv_transpose1d's matrix is .The square matrix apprently is not always identity matrix. So the result need not to be identical to the input.",
"Y ": " reault should not be identical "
},
{
"X ": "dynamic quantization error mixed serialization of script and non script modules is not supported",
"Z ": "It looks like you are trying to quantize the scripted net.The correct order seems like first quantize your net then script it!",
"Y ": " first quantize your net and then script it "
},
{
"X ": "pytorch1 5 0 win7 64bit didnt find engine for operation quantized conv2d prepack noqengine",
"Z ": "We use VS 14.11 to build binaries for CUDA 9.2, so there is no FBGEMM support. If you need FBGEMM, then please use the binaries with other CUDA versions instead.",
"Y ": "use the VS 14.11 build "
},
{
"X ": "did pytorch support int16 quantization",
"Z ": "We currently do not support int16 quantization. There is support for fp16 dynamic quantization.",
"Y ": "use fp16 dynamic quantization"
},
{
"X ": "dose static quantization support cuda",
"Z ": "No, it only works on CPU right now, we will consider adding CUDA support in the second half of the year",
"Y ": "currently it works only on CPU "
},
{
"X ": "quantized model consists of relu6",
"Z ": "That is correct, we will work on adding support for fusing relu6 soon. For now, if you are doing post training quantization, you could replace relu6 with relu and proceed as a work around. Thanks,",
"Y ": " repalce relu6 with relu "
},
{
"X ": "loading of quantized model",
"Z ": "Hi mohit7,Make sure you create the net using previous definition, and let the net go through process that was applied during quantization before (prepare_model, fuse_model, and convert), without rerun the calibration process.After that you can load the quantized state_dict in. Hope it helps.",
"Y ": "Make sure you create the net using previous definition, and let the net go through process that was applied during quantization before (prepare_model, fuse_model, and convert), without rerun the calibration process.After that you can load the quantized state_dict in"
},
{
"X ": "problem in computing loss in multiple cpu distribution training",
"Z ": "Typically you want to run the forward and backward pass on each process separately and then average the gradients across all processes and then run the optimizer independently on each process.I‚m wondering what is the reason you‚re trying to build this yourself. PyTorch has a DistributedData Parallel module, which does all of this for you.",
"Y ": "use PyTorch Distributed Data Parallel module"
},
{
"X ": "how to link a custom nccl version",
"Z ": "You can see here that NCCL is statically linked to the binaries and can take a look at the repository for more information about the build process. ",
"Y ": "Check the build versions"
},
{
"X ": "training with ddp and syncbatchnorm hangs at the same training step on the first epoch",
"Z ": "[Solved] My problem was that I have random alternating training that go down different branches of my model. I needed to set the random seed that samples the probability of which alternating loss it will perform. This is probably because when pytorch does it reduce_all somewhere, it notices a difference in batch norm statistics since I believe it assumes some ordering on the statistics.",
"Y ": " set the random seed that samples the probability of which alternating loss it will perform"
},
{
"X ": "dataparallel and conv2d",
"Z ": "The conv2d library was not the problem. I found out problem was listed here : Since I was running VGG on cifar100, I had to rewrite the forward method on pytorch‚Äôs default VGG network since its built for ImageNet and includes a averagepool layer that will error with cifar100's data size. Using types.MethodType to replace methods in a network is incompatible with DataParallel. My solution was to create my own ‚MyVGG‚class that takes a VGG model as an input and takes all of its parameters, and then I could write my own forward function within that class.",
"Y ": "using types.MethodType to replace methods in a network is incompatible with DataParallel"
},
{
"X ": "how dose distributed sampler passes the value epoch to data loader",
"Z ": "The sampler is passed as an argument when initializing the DataLoader, so the train loader will have access to the sampler object. Neither the loader not the sampler need to be re-constructed every epoch.",
"Y ": "the sampler is passed as an argument when initializing the DataLoader, so the train loader will have access to the sampler object."
},
{
"X ": "the ddp seem to be disable to find the second node",
"Z ": "If I understand correctly, you are trying to train with 4 GPUs, 2 on one machine and 2 on another machine? If this is the case, then you will need to launch your training script separately on each machine. The node_rank for launch script on the first machine should be 0 and node_rank passed to the launch script on the second machine should be 1. It seems here like you are passing 2 separate node_ranks for processes launched on the same machine.See the multi-node multi-process distributed launch example here: Distributed communication package - torch.distributed ‚ PyTorch 1.7.0 documentation",
"Y ": "use Distributed communication package - torch.distributed ‚ PyTorch 1.7.0 documentation"
},
{
"X ": "why the output of children part of a network has low resolution",
"Z ": "Based on the posted code I assume the left image represents the input while the right one the model output?If thats the case, I guess your model isn't able to create sharp images and you could check the literature for new architectures, which could avoid the blurry output.",
"Y ": "I guess your model isn't able to create sharp images and you could check the literature for new architectures"
},
{
"X ": "build model from submodels",
"Z ": "I think I found the solution by myself.For everyone struggling with the same problem: You can use ModuleList. I my example, I can just append each encoder and the classifier to the ModuleList. Using this class, my Main-Model is aware of its submodels and for example the number of parameters is calculated correctly. I think there is a pretty good explanation of the concept here.",
"Y ": "use ModuleList"
},
{
"X ": "how to install pytorch 1 3 0 or above with cuda 8",
"Z ": "Thank you for your reply.I haven't tested building it from source. I decided to use the cpu version for now.",
"Y ": "install using build package "
},
{
"X ": "custom mean of tensor partitions",
"Z ": "Id look at the third-party package PyTorch scatter. It has a reduction=mean mode. You need to convert lst to a tensor and possibly use broadcasting. Now, the scatter implementation uses atomics, which is problematic e.g. in terms of performance. If the partitions are ordered (as your example suggests), you might compare to just doing a for loop and taking means over the slices. Best regards Thomas",
"Y ": "look into PyTorch scatter package "
},
{
"X ": "functional linear may cause runtimeerror one of the variables needed for gradient computation has been modified by an inplace operation",
"Z ": "Finally, I solved the problem.I wrongly use the output of the model as input for the next iteration.What a fool mistake!",
"Y ": "Use model input "
},
{
"X ": "how to remove the grad fn selectbackward in output array",
"Z ": "Hi,The detach() in the no_grad block is not needed. You will need to move all the ops into the no_grad block though to make sure no gradient is tracked ",
"Y ": "The detach() in the no_grad block is not needed."
},
{
"X ": "can i get gradients of network for each sample in the batch",
"Z ": "If you use simple NN, you can use tricks like the one mentionned here to reuse computations.",
"Y ": "Use simple NN"
},
{
"X ": "question about loading the model that was trained using 4gpu with distributed dataparallel to only 1 gpu job",
"Z ": "I’m not sure to understand the use case.It seems you would like to load the state_dict to a single GPU machine, but in your code you are wrapping the model again in DDP.Would creating the model, loading the state_dict, and pushing the model to the single GPU not work?",
"Y ": "Create the model loading the state_dict, and push the model to the single GPU"
},
{
"X ": "how to deploy different scripts on different gpus",
"Z ": "You could pass the device you want to train on as an argument to the script. For example cuda: 0 corresponds to the 1st GPU in your system, cuda: 1 corresponds to the 2nd GPU and so on. Then assuming you store the passed argument in a variable named device, all you have to do is to call .to(device) on your tensors etc.",
"Y ": " call .to(device) "
},
{
"X ": "distributeddataparralled not support cuda",
"Z ": "DistributedDataParallel (DDP) does supports CUDA. The comment suggests extra care might be necessary when backward run on non-default stream. Actually, even if backward occurs on non-default streams it should be fine for most use cases. Below is why:background: I learned from @albanD that autograd engine will use the same stream as the forward pass.Let’s take a look at what could go wrong for the code you quoted.1: the tensor is not ready when launching the allreduce operation2: the tensor was destroyed too soon before the allreduce finishes.We can rule out 2 for now, as all_reduce does recordStream() properly to prevent CUDA blocks to be freed too early.Then the only thing left is 1. The operation on that tensor before allreduce is bucket_view.copy_(grad.view({-1}), /* non_blocking */ true); in mark_variable_ready_dense. The copy here happens on the same device (replica.contents and grad). And Reducer itself does not switch streams in between. So the only case that could hit race condition is when the application used different streams for different operators during the forward pass, and grads associated with those operators fall into the same bucket in reducer.",
"Y ": "bucket_view.copy_(grad.view({-1}), /* non_blocking */ true); in mark_variable_ready_dense. "
},
{
"X ": "using custom method in distributed model",
"Z ": "bigyeet:Is this right, or do I have to write a custom DataParallel wrapper that has scatter, gather, etc methods? If so, how would I do it? It depends on what you expected reset_hidden_state to achieve. Below is what happens in EVERY forward pass when you use DataParallel. split input data replicate model to all devices feed input data splits to all model replicas gather outputs from all replicas done with forward After the forward pass, the autograd graph actually contains multiple model replicas. It looks sth likeoriginal model <- scatter <- model replicas <- replica output <- gather <- final output.So in your above use case, if reset_hidden_state has any side effect that you would like to apply to the backward pass, it will only apply to the original model, not to model replicas. But if you are only trying to clear some states for the next forward pass, it should work.",
"Y ": "original model <- scatter <- model replicas <- replica output <- gather <- final output."
},
{
"X ": "unable to load waveglow checkpoint after training with multiple gpus",
"Z ": "This usually happens when multiple processes try to write to a single file.However, this should be prevented with the if condition if rank == 0:.Did you remove it or changed the save logic somehow?",
"Y ": "use rank == 0"
},
{
"X ": "strange behavior nn dataparallel",
"Z ": "Thanks for the information. This points towards some communication issues between the GPUs.Could you run the PyTorch code using NCCL_P2P_DISABLE=1 to use shared memory instead of p2p access?",
"Y ": "run model using NCCL_P2P_DISABLE=1"
},
{
"X ": "loss collection for outputs on multiple gpus",
"Z ": "If you are using nn.DataParallel the model will be replicated to each GPU and each model will get a chunk of your input batch.The output will be gathered on the default device, so most likely you wouldn‚Äôt have to change anything. However, I‚Äôm not sure about the use case.How are you calculating the memory consumption and is this operation differentiable?I assume it‚s not differentiable so that your accumulated loss will in fact just be the nn.CrossEntropyLoss.",
"Y ": "If you are using nn.DataParallel the model will be replicated to each GPU and each model will get a chunk of your input batch."
},
{
"X ": "default collate fn sending data to cuda 0",
"Z ": "Have you tried setting CUDA_VISIBLE_DEVICES env var before launching the process? It would be more clear if you share some minimum code snippet ",
"Y ": "set CUDA_VISIBLE_DEVICES env var befor launching the model "
},
{
"X ": "distributed gpu calculations and cuda extensions",
"Z ": "Would splitting the data and sending each chunk to a specific device work? Something like this could already solve your use case: data = torch.randn(4,100) chunks = data.chunk(4,0) res = [] for idx, chunk in enumerate(chunks): res.append(my_fun(chunk.to('cuda: {}'.format(idx))).to('cuda: 0')) res = torch.stack(res)",
"Y ": "data = torch.randn(4, 100) chunks = data.chunk(4,) res = [] for idx, chunk in enumerate(chunks): res.append(my_fun(chunk.to('cuda: {}'.format(idx))).to('cuda: 0'))res = torch.stack(res)"
},
{
"X ": "question about torch distributed p2p communication",
"Z ": "Hey @yijingThe message will directly send from 10.0.0.2 to 10.0.0.3.In init_process_group, the init_method=“tcp: //10.0.0.1:8888” is only for rendezvous, i.e., all process will use the same ip:port to find each other. After that communications don’t need to go through master.BTW, if you are using p2p comm, torchrpc might be useful too. Here is a tutoral.",
"Y ": " The message will directly send from 10.0.0.2 to 10.0.0.3. and also use torchpc "
},
{
"X ": "behavior of dataloader when resuming training from the existing checkpoint",
"Z ": "Are you trying to train your model for only 1 epoch because you have so much data and it’ll take too long to do more, or are you possibly trying to do 1 epoch because your machine won’t allow it to finish and everything shuts off, so you’d like to save intermediate progress? (Epoch = single pass through your entire dataset) Just asking out of curiousity, no worries if there’s no reason. As for for your question, I’d do one of the following:Drop shuffle=True and as you train keep track of an id (either the step number, which will represent what batch you are on, or just the raw id of current sample you’re on). If you’re using a HuggingFace Trainer instance for your model training, you can use callbacks to do this (add a on_step_end or on_step_begin callback to write out current step # to a file, can be found here in the docs). When continuing training, You can slice examples starting from the id you left on, and ending with the last id of the dataset, then append all the samples you’ve already trained with at the end of this slice (essentially shifting the samples you trained with, but putting them at the end). If you don’t care about re-using the samples at the end, you can just use PyTorch’s Subset dataset class. Keep shuffle=True but have a small function call when you fetch a sample that writes-out the id that’s getting fetched/processed. When continuing training, do a similar process as above (option 1) but rather than working with a single slice from shuffle=False you can slice out a subset of your dataset using the ids you’ve saved",
"Y ": "Keep shuffle=True, but when you fetch a sample, run a tiny function that prints out the id that's being fetched/processed. Continue training in the same way as before (option 1), but instead of working with a single slice from shuffle=False, use the ids you've saved to slice out a portion of your dataset."
},
{
"X ": "architecture of deeplabv3 resnet50",
"Z ": "Printing out the model wouldn't show the computation graph and would only print the child modules, so I agree that this would not be sufficient to see‚the structure.You could check out e.g. PyTorchViz to visualize the computation graph in case that's helpful.PS: Often I also take a look at the source code, but for segmentation/detection models this is unfortunately also not trivial.",
"Y ": "use this https: //github.com/szagoruyko/pytorchviz"
},
{
"X ": "create diagonal matrices from batch",
"Z ": "Hi Samuel! Samue1: x = torch.rand(size=(M, N)) and want to create for each of the M inputs a diagonal matrixTry: torch.diag_embed (torch.rand (size = (M, N))) k",
"Y ": "torch.diag_embed (torch.rand (size = (M, N))) "
},
{
"X ": "how to display incorrect samples predicted by the model",
"Z ": "HarshRangwala:Invalid shape (1,3,224,224) for image dataThat first dimension should be squeezed out as an image should have 3 dimensions: number of channels, height, and width. (i.e. (3,224,224)).Try img = img.squeeze() before calling ax.imshow(img)",
"Y ": "Try img = img.squeeze() before calling ax.imshow(img)"
},
{
"X ": "moving tensor to cuda",
"Z ": "If you are pushing tensors to a device or host, you have to reassign them: a = a.to(device='cuda') nn.Modules push all parameters, buffers and submodules recursively and don't need the assignment.",
"Y ": "a = a.to(device='cuda')"
},
{
"X ": "efficient implementation of jacobian of softmax",
"Z ": "Hi Samuel! Samue1: Does this also work for batched versions of S? No. If you had tried it, you would have discovered that torch.outer() does not accept multidimensional tensors. Is the result correct if I use J = torch.diag_embed(S) - torch.outer(S, S) No, this will throw an error (because you pass a multidimensionaltensor to torch.outer()). You can, however, use pytorch’s swiss army knife of tensor multiplication functions to construct a batch version of outer: >>> import torch gt;>> torch.__version__ '1.9.0' >>> S = torch.arange (6).reshape (2,3).float() >>> S tensor([[0., 1., 2.],[3., 4., 5.]])>>> torch.diag_embed (S) - torch.einsum ('ij, ik -> ijk', S, S)tensor([[[ 0., 0., 0.],[ 0., 0., -2.],[ 0., -2., -2.]],[[ -6., -12., -15.],[-12., -12., -20.],[-15., -20., -20.]]])(As an aside, none of this has anything to do with the title you gavethis thread, namely “Jacobian of Softmax.”)Best.K. Frank",
"Y ": "use pytorch’s swiss army knife of tensormultiplication functions to construct a batch version of outer"
},
{
"X ": "pytorch can not move tensor to cuda",
"Z ": "Since you are using an Ampere GPU (3070), you would need to use CUDA;=11.0, so the old PyTorch 1.5.1 release with CUDA9.2 won’t work. Update to the latest release with CUDA11.1 and it should work.",