Json_Pytoch_Discuss.json

{
  "data": [
    {
      "X ": "calculating loss for entire batch using nllloss in 0 4 0",
      "Z ": "Defake:loss = criterion_test(dec_outs.view(-1, vocab_size, batch_size), targets.view(-1, batch_size)) ,I think you need to do criterion_test(dec_outs.view(-1, vocab_size),targets.view(-1)) In your case, ( C )-   vocab_size and (N)-   (batch_size*seq_length). I am assuming all the batches have the same sequence length. If not, youll have to use pack_padded_sequence and also mask the loss for the pad token.",
      "Y ": "I believe you should use criterion test(dec outs.view(-1, vocab size),targets.view(-1)) to test your hypothesis.( C ) = vocab size, and ( N ) = batch size*seq length in your case. I'm assuming that the sequence length in each batch is the same. If not, you'll need to use pack padded sequence and hide the pad token loss."
    },
    {
      "X ": "masked fill operates weirdly",
      "Z ": "From the error mesage, it is a size issue on the 3rd dimension, where one is of size 8 and the other of size 9.I would print the size of the tensors before the operation to check the dimensions.",
      "Y ": "The error message indicates that there is a size issue on the third dimension, with one size 8 and the other size 9.Before the procedure, I would print the size of the tensors to double-check the dimensions."
    },
    {
      "X ": "creating custom nn module for softmargin softmax",
      "Z ": "Prefer not to use for loops, try to vectorize your code as much as possible.Refer this old question I asked for checking my implementation of softargmax which I believe you are lookin for. It's pretty decent and fast",
      "Y ": "Avoid using for loops whenever feasible, and attempt to vectorize your code as much as possible."
    },
    {
      "X ": "gru autoencoder is not working",
      "Z ": "I solve this problem.Target tensor was wrong.",
      "Y ": "Target tensor was wrong"
    },
    {
      "X ": "issue with multiple gpu loss convergence",
      "Z ": "I solved my issue. Since batch wasnt my first dimension, I had to mention dim=1 in the data parallel, that is the dimension I need to scatter my inputs.",
      "Y ": " mention dim=1 in the data parallel, that is the dimension I need to scatter my inputs."
    },
    {
      "X ": "building from source keeps failing ubuntu 18 04 02 lts no gpu",
      "Z ": "hey there,I cant tell you the exact reason for you problem, but it is best practice to build pytorch in a clean anaconda environment. Here is how. Please report back if it helps",
      "Y ": "Create new anaconda  envinorment "
    },
    {
      "X ": "segmentation fault core dumped with personnal nn function",
      "Z ": "The problem seem to be solved by updating from v1.0 to v1.0.1",
      "Y ": "Update version from v1.0 to v1.0.1"
    },
    {
      "X ": "libtorch cmake error on centos7",
      "Z ": "Solved. Cmake 3.10 is fine.",
      "Y": "use Cmake 3.10"
    },
    {
      "X ": "cmake error and fatal error lnk1181 building from source on windows 10",
      "Z ": "Well, I saw that your build directory contains space. As a workground, you can just avoid that to make build pass. However, more details to fix this issue are welcomed.",
      "Y ": "So, I noticed that your build directory has some empty space. You can simply avoid that as a workground to ensure that the build passes. "
    },
    {
      "X ": "data float 1 segfaults when cudatype",
      "Z ": "I found the answer on StackOverflow:stackoverflow.com Torch C++: Getting the value of a int tensor by using *.data&lt;int  () pytorch, torch, libtorch asked by Afshin Oroojlooy on 02: 22PM - 15 Jan 19 UTC The Tensor class despreately needs documentation!",
      "Y ": "The Tensor class despreately needs documentation!"
    },
    {
      "X ": "how to collect libtorch package like the official release when building from source",
      "Z ": "I think using an appropriate CMake + make install should work, e.g. the Android build does this. You want to disable the Python bit for this. The suggested alternative there works well, too - building Python and picking the lib and include. Libtorch 1.0 used to be built that way (actually extracting from the whl, Best regards Thomas",
      "Y ": "I believe that using an adequate CMake + make install, such as the Android build, should suffice. For this, you'll need to turn off Python.The suggested solution there, creating Python and selecting the lib and include, also works fine. That's how Libtorch 1.0 was made (by extracting from the whl...)."
    },
    {
      "X ": "cuda is available true with python false with c",
      "Z ": "Do you use the same PyTorch distribution (i.e. libtorch cmake from /usr/local/lib/python3.x/dist-packages/torch/share/cmake or somesuch)?In the end, the same libtorch should behave the same way Best regards Thomas",
      "Y ": "use libtorch"
    },
    {
      "X ": "how to self define a backward function for a net in libtorch i tested some code but failed",
      "Z ": "Hi, Note that on the python side, the Function have changed slightly as you can see in the tuto.For cpp it is a bit more complex. a Function does only one way and its ‚Äúapply‚Äù method should be implemented. It is either implemented in pure autograd by performing operations on Variables or the output should be wrapped and the backward Function specified.You will need 2 functions if you want a custom backward. For example here, ‚DelayedError is the forward function and ‚ÄúError‚Äù is the backward.",
      "Y ": "The Functions on the Python side have changed slightly, as you can see in the tutorial.It's a little more complicated in cpp. A Function can only be used in one direction, and its apply method should be used. It can be done in pure autograd by executing operations on variables, or it can be wrapped and a reverse function supplied.If you want a custom backward, you'll need two functions. In this case, the forward function is DelayedError, and the backward function is Error."
    },
    {
      "X ": "is there any way to skip steps in a dataloader",
      "Z ": "Yeah, I would say no built-in way for now. But, we are working on a new design of DataLoader, which IMO will provide this functionality.",
      "Y ": "Bulding New desgin for DataLoader"
    },
    {
      "X ": "understanding model to device",
      "Z ": "Yes, your assumption should be correct as also seen in this post, since the model reference would be passed and its parameters (and buffers) updated inplace.You could use the code snippet in the linked post to verify it using your setup.",
      "Y ": "Yes, as noted in this post, your assumption should be valid, as the model reference would be passed and its parameters (and buffers) updated in place.To test it with your configuration, you might use the code snippet in the linked post."
    },
    {
      "X ": "update weight with same netoworks output",
      "Z ": "That wouldn't be a fix, as it's still using the wrong behavior. Previous PyTorch versions allowed this wrong gradient calculations, which is why no errors were raised.",
      "Y ": "Previous PyTorch versions allowed this wrong gradient calculations, which is why no errors were raised."
    },
    {
      "X ": "update network after differentiation with autograd grad",
      "Z ": "Update: The layers receive gradients. My problem has to do something with My GNN.Maybe the integration in pytorch geometric is breaking",
      "Y ": "Maybe the integration in pytorch geometric is breaking"
    },
    {
      "X ": "create a f score loss function",
      "Z ": "AFAIK f-score is ill-suited as a loss function for training a network. F-score is better suited to judge a classifier‚its calibration, but does not hold enough information for the neural network to improve its predictions.Loss functions are differentiable so that they can propagate gradients throughout the network using the chain rule (see ‚backpropagation).",
      "Y ": "AFAIK f-score is ill-suited as a loss function for training a network."
    },
    {
      "X ": "how do i backpropagate through a modules parameters",
      "Z ": "you could look at something like https://github.com/facebookresearch/higher for this purpose.It functionalizes the model, where it‚Äôs parameters can be detached and backproped through",
      "Y ": "It functionalizes the model, where it’s parameters can be detached and backproped through"
    },
    {
      "X ": "how am i supposed to cache tensors that require grad but arent learnable module params that are replaced by a different tensor several times each forward pass",
      "Z ": "I solved my problem by:Not making activs and outputs nn.Parameters Not assigning them as model attributes. Instead I added them as optional key word arguments in the forward method and returned the activations as well.",
      "Y ": "Activations and outputs are not nn. Parameters aren't being assigned as model attributes. Instead, I added them to the forward procedure as optional key word arguments and returned the activations as well."
    },
    {
      "X ": "torch no grad makes any difference",
      "Z ": "Hi,The eval mode will make a difference only if you use special Modules that behave differently in eval mode like dropout or batchnorm. But even in that case, the runtime might not change that much.The no_grad mode disables the autograd so it will make a significant difference in memory usage but should not change much for runtime (as we work hard to make the autograd light so that it can run at every forward).",
      "Y ": "Only if you utilise specific Modules that behave differently in eval mode, such as dropout or batchnorm, will eval mode make a difference. However, even in such instance, the runtime may not alter significantly. The no grad setting disables autograd, resulting in a significant reduction in memory use."
    },
    {
      "X ": "training breaks in pytorch 1 5 0 throws inplace modification error",
      "Z ": "Setting set_detect_anomaly doesnt give any other output Make sure to use latest pytorch as we recently fix warnings not showing up in colab.Or run your code in command line to have the corresponding forward code.The code does quite a lot of inplace and viewing ops.Plus, it is working without any error, when Pytorch is downgraded to 1.5.0.This kind of check is here to make sure we don‚Äôt compute silently wrong gradients. So it is most likely that the old behavior was silently computing wrong gradients and has been fixed in more recent versions.",
      "Y ": "There is no other output when set detect anomaly is used. Make sure you're using the most recent version of Pytorch, as we recently fixed a bug where warnings weren't showing up in colab. Alternatively, you can run your code from the command line to get the forward code. The code performs numerous in-place and viewing operations. It also collaborates with..."
    },
    {
      "X ": "custom loss function class",
      "Z ": "I am afraid BCELoss does not.But looking at the code, BCELossWithLogits does (https: //pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html?highlight=bce#torch.nn.BCEWithLogitsLoss) So if you actually use a sigmoid before it and want to merge both, that will work Otherwise, you will have to create a custom nn.Module yes. But the one line formula for BCELoss given above should not be significantly slower than the all-in-one version.",
      "Y ": "BCELossWithLogits does"
    },
    {
      "X ": "best way to downsample batch image tensors",
      "Z ": "Refer to nn.PixelShuffle()",
      "Y ": " use nn.PixelShuffle()"
    },
    {
      "X ": "unexpected data error in the ms coco dataset valueerror all bounding boxes should have positive height and width",
      "Z ": "Based on the error it seems that this particular bounding box has a width of 0, so you might want to filter out these images.",
      "Y ": "Filter  biunding box whcich has width 0 "
    },
    {
      "X ": "softmax returns only 1s and 0s during inference",
      "Z ": "Found the issue. The data was not the same after all (the pipeline was missing the normalization step, and I didn't notice).Let this be a lesson to anyone getting weird logits out of your network: Print the values, don't plot the image. :v)",
      "Y ": "Print the logits values because  ploting image won't help "
    },
    {
      "X ": "mnist server down",
      "Z ": "If the version of torchvision is 0.9.0, which is currently stable, being unable to download MNIST is (unfortunately) expected, but if the version is nightly, it's not expected.",
      "Y ": "Version is not stable , Download the MNIST file "
    },
    {
      "X ": "how to visualize model in pytorch",
      "Z ": "Not a problem @hs99! I'd suggest reading the tutorial first Saving and Loading Models PyTorch Tutorials 1.8.1+cu102 documentation and if there are still problems, raise a new topic!",
      "Y ": "read Saving and Loading Models PyTorch Tutorials 1.8.1+cu102 documentation "
    },
    {
      "X ": "vgg16 using cifar10 not converging",
      "Z ": "Is it possible your validation accuracy is for a single batch instead of the entire validation set? If so the fluctuation would be perfectly normal since your accuracy is based on only 16 predictions which would fluctuate heavily.Otherwise, the heavy fluctuations in your validation set would not make sense across a larger sample, especially as the training and validation losses steadily decline.",
      "Y ": "There is no other output when set detect anomaly is used. Make sure you're using the most recent version of Pytorch, as we recently fixed a bug where warnings weren't showing up in colab. Alternatively, you can run your code from the command line to get the forward code. The code performs numerous in-place and viewing operations. It also collaborates with..."
    },
    {
      "X ": "dataparallel trained on one gpu but inference used on multiple gpus",
      "Z ": "Then, try to load your model before DP construction.",
      "Y ": "Load model before DP Construction "
    },
    {
      "X ": "runtimeerror each element in list of batch should be of equal size",
      "Z ": "Issue resolved by downgrad back to PyTorch 1.5.0So it looks like a PyTorch 1.6 issue",
      "Y ": "Downgrade pytroch version to 1.5 from 1.6 "
    },
    {
      "X ": "valueerror not enough values to unpack expected 3 got 2",
      "Z ": "your lstm layer returns a 3-tuple but you unpack it as 2",
      "Y ": "Lstm returns 3-tuple instead you are unpackingit as 2  "
    },
    {
      "X ": "nn nllloss valueerror dimension mismatch",
      "Z ": "UPDATEAs I was iterating over the training set, I realized that the last batch contains only 4 labels as opposed to the expected 10. Since it was the last batch, this was the value that the variable target.size(0) referred to after finishing the iteration, which ultimately caused the ValueError raise.Take-home message: Know thy dataset inside out ",
      "Y ": "Know the dataset inside out , last bacth conatins only 4 labels it is excpecting 10 "
    },
    {
      "X ": "nn embedding input indices meaning",
      "Z ": "Each item of input, like 1, will be changed to its embeddings. 1 means Embedding layer's weight first row, like this:You can get the embed layer weight by embedding.weight.Search word2vectors to learn more.",
      "Y ": "embedding.weight."
    },
    {
      "X ": "low gpu utilization for sequence task",
      "Z ": "Based on this thread, I found a way to eliminate the inner for loop using bmm. Profiling indicates this has removed a lot of work from the CPU (especially the backwards pass) and has resulted in a considerable speedup.",
      "Y ": "Eliminate the inner loop with bmm"
    },
    {
      "X ": "conflict between libtorch and grpc",
      "Z ": "There is an issue about it and it‚Äôs not fixed yet: https://github.com/pytorch/pytorch/issues/14573. Currently the easiest way is to compile libtorch with the protobuf library that grpc uses, or compile grpc with the protobuf library that libtorch uses.",
      "Y ": "Compiling libtorch with the protobuf library that grpc uses, or compiling grpc with the protobuf library that libtorch uses, is currently the easiest approach."
    },
    {
      "X ": "multidimensional slice in c",
      "Z ": "We don‚Äôt have it now, but we will add it (aka. numpy-style indexing) by the end of this year. ",
      "Y ": "Curretly it is not avaiable"
    },
    {
      "X ": "where is the implementation of tensor slice",
      "Z ": "I believe it's implenented here ",
      "Y ": "Use this link https://github.com/pytorch/pytorch/blob/3ad1bbe16a3c1d6bb9566f09229afd63022a82df/aten/src/ATen/native/TensorShape.cpp#L655"
    },
    {
      "X ": "include directory structure",
      "Z ": "I usually look at the cpp_extension include path to answer this.In particular, I don't think you should be using this particular include anymore. But I have to admit I don‚Äôt the rational behind this.",
      "Y ": "us ethis link https: //pytorch.org/cppdocs/frontend.html#end-to-end-example 9"
    },
    {
      "X ": "at cuda memory leak when loading model",
      "Z ": "ASAN doesn‚Äôt work with CUDA, it‚Äôs a pretty well known problem. As you note, if you can just use CPU only functionality, you‚Äôll be fine.",
      "Y ": "ASAN doesn’t work with CUDA, use only CPU functionality"
    },
    {
      "X ": "solved thread safety issue in torch load",
      "Z ": "I think the error might be because I compiled in debug mode while using the release version of the torch library.",
      "Y ": "I believe the error occurred because I used the release version of the torch library and compiled in debug mode."
    },
    {
      "X ": "dataparallel output differs from its module",
      "Z ": "Since the error is that low, I would still assume it‚'s  still due to floating point precision.",
      "Y ": "Error due to floating point precision "
    },
    {
      "X ": "use 4 gpu to train model loss batch size batch size 4",
      "Z ": "Thank you very much , I have solve this problem. There is something wrong with my dataloader function, when I load data, I use padding to process my data, but I forgot to turn list into tensor, as a result nn.Dataparallel to split data wrong in batch dim. ",
      "Y ": " turn  list to tensor "
    },
    {
      "X ": "torch distributed class definitions",
      "Z ": "ReduceOp is a C++ enum, and is exposed to the python interface using pybind (https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/init.cpp#L145). That enum is defined here: https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/Types.hpp#L8",
      "Y ": "us this link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/Types.hpp#L8 "
    },
    {
      "X ": "how to freeze feature extractor and train only classifier in distributeddataparallel",
      "Z ": "Looks like I see the same issue with 1.1.0 and 1.2.0, although it seems to work 1.3 onwards. Could you try out a version   = 1.3?",
      "Y ": "Try with updated version 1.3"
    },
    {
      "X ": "attention weights with multiple heads in nn multiheadattention",
      "Z ": "This seems to be because the attention weights are averaged across all of the heads:github.com/pytorch/pytorch nn.MultiHeadAttention should be able to return attention weights for each head. opened 01: 18PM - 10 Mar 20 UTC ironcadiz enhancement module: nn oncall: transformer/mhatriaged ## üöÄ Feature ## Motivation Currently when using the `nn.MultiHeadAttention` ‚Ä¶layer, the `attn_output_weights` consists of an average of the attention weights of each head, therefore the original weights are inaccessible. That makes analysis like the one made in this [paper](https: //arxiv.org/abs/1906.04341v1) very difficult.## PitchWhen the `nn.MultiHeadAttention` forward is called with `need_weights=True` (and maybe a second parameter like `nead_attn_heads=True`),  `attn_output_weights` should be a tensor of size `[N,num_heads,L,S]`,with the weights of each head, instead of the average of size `[N,L,S]` (following the notation in the [docs](https: //pytorch.org/docs/stable/nn.html#multiheadattention))## Alternatives ## Additional context A small discussion about this subject with a potential solution was made [here](https: //discuss.pytorch.org/t/getting-nn-multiheadattention-attention-weights-for-each-head/72195) If you guys agree, I'll gladly make a PR.",
      "Y ": "This seems to be because the attention weights are averaged across all of the heads."
    },
    {
      "X ": "about torch autograd function",
      "Z ": "Well, the rename to ctx is a good idea, but really, you would need to find a source for your shape.For example TorchVision‚Äôs roi align-function takes some more parameters (vision/roi_align_kernel.cpp at 0013d9314cf1bd83eaf38c3ac6e0e9342fa99683 ¬∑ pytorch/vision ¬∑ GitHub), maybe the forward should, too, and then assign them to ctx members.",
      "Y ": "use this link https: //github.com/pytorch/vision/blob/0013d9314cf1bd83eaf38c3ac6e0e9342fa99683/torchvision/csrc/ops/autograd/roi_align_kernel.cpp#L111-L127"
    },
    {
      "X ": "how to initialize tensors such that memory is allocated",
      "Z ": "It seems like rand needs additional memory to generate the random numbers, but then uses similar memory to zeros.",
      "Y ": "Rand appears to require additional memory to produce random numbers, but then consumes memory similar to zeros."
    },
    {
      "X ": "will the same model input data twice retain the gradient information of the first input data",
      "Z ": "Yes, it should (if total.backward()). Try to print and see if they are different?Since the backward is on total i.e loss1+loss2, the computation graph would include both 1,2 inputs.You could also refer the GAN tutorial where something similar is done",
      "Y ": "use total.backward()"
    },
    {
      "X ": "jit tried to access nonexistent attribute or method forward of type tensor",
      "Z ": "Looks like inheritance is not supported Unable to call `super` method with TorchScript ¬∑ Issue #42885 ¬∑ pytorch/pytorch ¬∑ GitHub",
      "Y ": " Unable to call troch script with super method "
    },
    {
      "X ": "pytorch with cuda 11 compatibility",
      "Z ": "As explained here, the binaries are not built yet with CUDA11. However, the initial CUDA11 enablement PRs are already merged, so that you could install from source using CUDA11.If you want to use the binaries, you would have to stick to 10.2 for now.",
      "Y ": "install  the latest CUDA11 "
    },
    {
      "X ": "unsupported format string passed to list format",
      "Z ": "I know what is wrong, that I pass an array rather than a value, thank you!",
      "Y ": "Pass Array"
    },
    {
      "X ": "increasing data set size slows loss backward even though batch size is constant",
      "Z ": "I think I have found the issue. I had wrongly assumed that the input data tensors needed requires_grad=True for proper training but after experimenting a little and setting requires_grad=False for the input data everything is running much faster and the network still learns. I guess only model.parameters() needs required_grad=True.",
      "Y ": "use requires_grad=False"
    },
    {
      "X ": "torch logsumexp returning nan gradients when inputs are inf",
      "Z ": "It depends on your optimizer.If you don't have momentum/accumulated terms, then you can simply set these gradients to 0 and your optimizer won't change the values.If you have a fancy optimizer that will update the weights even for a 0 gradient, the simplest solution might be to save the original value of the weights before performing the step and then restoring them after the optimizer step.",
      "Y ": "Simply mention gradients to 0"
    },
    {
      "X ": "multiple calls to autograd grad with same graph increases memory usage",
      "Z ": "Okay, nevermind. There was an extra backwards hook being added in the saliency code I copied. Clearing the data fixed the memory issue.Thanks for your help!",
      "Y ": "remove backwards hook which was added and it will solve memory issuses"
    },
    {
      "X ": "pass all parameters to optimizer instead of only nonfrozen parameters",
      "Z ": "Hi,Assuming no gradients were computed for them before and their .grad field is None to begin with. Then the optimizer will just ignore them because they don't have any gradient (as the backward won't populate them).",
      "Y ": "Gradient values are empty because .grad is None"
    },
    {
      "X ": "how to calculate gradients correctly without in place operations for custom unpooling layer",
      "Z ": "melike:I wrote it before learning that in-place operations should be avoided in PyTorch.You don‚Äôt have to avoid them. It is just that autograd does not support every combination of them and it will raise an error if you hit such case.So if your code runs without error, it means that autograd can handle this case just fine.The only concern I would have with such implementation is the slowdown due to the nested loops. But that‚Äôs unrelated to gradient correctness.",
      "Y ": "You don't have to stay away from them. It's just that autograd doesn't support every combination of them, and if you do, you'll get an error.So, if your code runs without errors, autograd is capable of handling this situation.The only problem I have with such an implementation is the nested loops' slowness. However, this has nothing to do with gradient accuracy."
    },
    {
      "X ": "grad fn get whole graph in dot",
      "Z ": "Hi,This package will return a dot graph: https: //github.com/szagoruyko/pytorchviz The objects are re-used because the first one goes out of scope and is free. But later one, since you redo an allocation of the same size, the same memory is returned to you (many allocator do caching for allocations of the same size).",
      "Y ": "use this link   https: //github.com/szagoruyko/pytorchviz"
    },
    {
      "X ": "gradient computation when using forward hooks",
      "Z ": "Hi,I think the simplest way to understand what will happen here is to know that the autograd lives below torch.nn and is completely unaware of what torch.nn does.So in this case, whatever is the Tensor you give to the rest of the net is the one that will get gradients (it does not matter if it comes from a hook or not).And in this case, since A_hooked depends on A, then the gradients will flow back from A_hooked to A.",
      "Y ": "The simplest way to comprehend what will happen here is to remember that the autograd lives beneath torch.nn and has no idea what torch.nn does.In this situation, the Tensor you offer to the remainder of the network is the one that gets gradients (it does not matter if it comes from a hook or not).Because A hooked is dependent on A, the gradients will flow back from A hooked to A in this situation."
    },
    {
      "X ": "dataset for cnn regression",
      "Z ": "Hi @mattbevWelcome to the PyTorch community! You can consider object counting datasets, the idea is that object counting can be formulated as a regression problem. Here are some links:Visual Geometry Group - University of Oxford Object Counting | Papers With Code [2008.12470] Counting from Sky: A Large-scale Dataset for Remote Sensing Object Counting and A Benchmark MethodCrowd Counting | Kagglhttp: //visal.cs.cityu.edu.hk/static/pubs/conf/cvpr08-peoplecnt.pdf Hope this helps!",
      "Y ": "You can consider object counting datasets, the idea is that object counting can be formulated as a regression problem. "
    },
    {
      "X ": "is it better to set batch size as a integer power of 2 for torch utils data dataloader",
      "Z ": "Powers of two could be preferred in all dimensions, so number of channels, spatial size etc.However, as described before, internally padding could be used, so that you wouldn' hit a performance cliff and should thus profile your workloads.",
      "Y ": "In all dimensions, such as channel count, spatial size, and so on, powers of two may be preferred.However, as previously described, internally padding could be used to avoid a performance cliff, and you should thus profile your workloads."
    },
    {
      "X ": "runtimeerror function addbackward0 returned an invalid gradient at index 1 expected type torch floattensor but got torch cuda floattensor",
      "Z ": "I solved the problem. One variable which i was initializing within the loss function by the name ,processed‚was not being put on cuda.Thing to keep in mind for these problems is that some variable is not deployed on GPU or CPU whichever device you are using. So a shortcut is to put every single variable to GPU or CPU whichever device you are using by calling variable.to(device) function.",
      "Y ": "One variable with the name,processed, which I was initialising within the loss function, was not being put on cuda.The important thing to remember for these issues is that some variables are not deployed on the GPU or CPU, whichever device you are using. As a result, a shortcut is to call the variable.to(device) function to assign every variable to the GPU or CPU, depending on which device you're using."
    },
    {
      "X ": "where should i look to solve running mean error in resnet transfer learning",
      "Z ": "The original resnet's first convolution out channel is 64, but you are using 128. Thus it does not work with the next batch norm as well as following layers.Please use self.conv1 = nn.Conv2d(1,64, kernel_size=(7,7), bias=False); self.inplanes = 128 or you have to change the entire network.",
      "Y ": "use self.conv1 = nn.Conv2d(1,64, kernel_size=(7,7), bias=False); self.inplanes = 128 "
    },
    {
      "X ": "save output image of cnn model",
      "Z ": "It is possible but I am not sure if it'ss the best way to go for your problem.From what I understand you only want to reconstruct the RGB image from the output, am I right? If yes, do you know what each channel of your output represents? Isn‚Äôt one of the channels the edge map?In case you want to do that, you can either change the output shape of your 3rd conv2d or add another layer with the input channel of your last layer and your desired output dimension. But you may need to adjust your lost function as well depending on what loss function you are using.",
      "Y ": "either change the output shape of your third conv2d or add another layer with the input channel of your previous layer and your desired output dimension. However, depending on the loss function you are using, you may need to adjust it as well"
    },
    {
      "X ": "need feature maps of resnet50",
      "Z ": "I usually use forward hooks as described here, which can store the intermediate activations. You could then pass these activations to further processing.",
      "Y ": "Use forward hooks "
    },
    {
      "X ": "tensors are at different cuda devices",
      "Z ": "Solution by Yanai Elazar:You can define an environment variable like this: CUDA_VISIBLE_DEVICES=1 This way, only this gpu will be available for the running program, and you won‚Äôt leak into other gpus. This way in the code you need to run on a single gpu, and not specify one specifically.",
      "Y ": "CUDA_VISIBLE_DEVICES=1"
    },
    {
      "X ": "simple rnn stuck around the mean",
      "Z ": "Are you making use of the hidden state? Maybe you could use the lstm like it'ss done in this tutorial: https: //pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html",
      "Y ": "use this link https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html"
    },
    {
      "X ": "pad packed sequence export to onnx",
      "Z ": "I was able to solve this by creating my own packing and unpacking methods to use during export. From what I understand, exporting to ONNX does not support creating your own instance of PackedSequence. I submitted an issue to Pytorch.",
      "Y ": "Use your own method of   packing and unpacking   urong export "
    },
    {
      "X ": "a very strange phenomenon i met in training machine translation",
      "Z ": "It is not necessary that your loss should decrease for every batch within each epoch (it can go up for different batches), but it should decrease across different epochs.If your loss is not decreasing across different epochs, learning rate could be a problem",
      "Y ": "Learning rate could be problem"
    },
    {
      "X ": "pytorch chatbot loss function with ignore index instead of targets padding mask",
      "Z ": "I figured out what was wrong with my model. It turned out that despite that my loss function returned some reasonable values, the loss was not calculated properly, thus as a consequence model did not learn. Output from my AttentionDecoder was softmaxed, then I used CrossEntropyLoss or NLLLoss (tried them both), but I did not change the softmax to log_softmax in case of NLLLoss, or in case of using CrossEntropyLoss I did not get rid of softmax at all as CEL comprised of log_softmax and NLLLoss.",
      "Y ": "Calluate the loss properly "
    },
    {
      "X ": "do we need to set a fixed input sentence length when we use padding packing with rnn",
      "Z ": "The RNN see each word, i.e., a vector of size 5, step by step. If there are 6 words, the RNN sees 6 vectors and then stops. Same with 8 words. Your confusion might stem that LSTM or GRU hides this step-wise processing. You give the model a sequence of a certain lengths, but internally the model loops over the sequence. More words just means more loops before it‚Äôs finished.Obviously, things get problematic with batches if the sequences in a batch have different lengths. One default solutions is to pad all short sequences to the length of the longest sequence.The size/complexity of the model (the number of neurons of you will, but it‚Äôs better to think in number of trainable parameters) of the LSTM/GRU depends on: the size of the input (e.g.,5 in your example) the size of the hidden dimension number of layers in case of a stacked LSTM/GRU whether you use uni- or bidirectional.It does not depend on the sequences lengths. Sure, the processing takes more time for longer sequences.",
      "Y ": "It does not depend on the sequences lengths. Sure, the processing takes more time for longer sequences"
    },
    {
      "X ": "pytorch c api missing headers",
      "Z ": "I just went through the trouble of upgrading to the most recent stable version which I got from the home-page (The one that I downloaded from the installation help page was nightly version which did not compile). I checked with the source code on github that it indeed contained the Pooling functions. Glad to confirm that it does indeed work now. Thanks for the help!",
      "Y ": "Update to the lastet version "
    },
    {
      "X ": "about torchvision for c frontend",
      "Z ": "Indeed, torchvision C++ support isn't matching Python support.However, @ShahriarSS is doing some good work on it, so the gap is getting smaller.My guess is that most people use OpenCV to do transforms or do them manually. (Personally, I incorporated things like ‚Äúnormalizing‚Äù into the traced/scripted model last time I did this.)Best regardsThomas",
      "Y ": "use Opencv , torchvision c++ is not matching python support currently "
    },
    {
      "X ": "why gpu memory allocations are associated with the cuda stream",
      "Z ": "As @albanD wrote, limiting CUDA allocations to a single stream reduces the number of CPU-GPU synchronizations necessary. CUDA kernels are asynchronous, so when an allocation is ‚Äúfreed‚Äù the kernel may not be finished (or may not have even started). Reusing the same allocation in a different stream could cause memory corruption because work in that stream may start before previously launched work in the original stream finishes.It‚Äôs safe to immediately reuse the allocation in the same stream because operations within a stream are ordered sequentially. This is the strategy the caching allocator uses.The CUDA memory API handles this differently: The cudaFree call synchronize all streams ‚Äì the CPU waits until all streams finish all outstanding work before the cudaFree call completes. This ensures that subsequent uses of the memory are guaranteed to happen after the previous uses finish. However, this makes cudaFree a relatively expensive call. The primary goal of the caching allocator is to avoid this type of synchronization.",
      "Y ": "The CUDA memory API takes a different approach: The cudaFree call synchronises all streams – the CPU waits until all streams have completed all outstanding work before completing the cudaFree call. This ensures that subsequent uses of the memory will take place after the previous ones have finished. CudaFree, on the other hand, is a relatively expensive call. The caching allocator's primary goal is to avoid this type of synchronisation."
    },
    {
      "X ": "c time sequence prediction py slow",
      "Z ": "Error in test codes",
      "Y ": "Error in the test code"
    },
    {
      "X ": "model weights are not moved to the gpu",
      "Z ": "I did find my problem. It was a rather unspectacular error.i forgot registering my layers with register_module(). When adding them i got the expected results ",
      "Y ": "Register layers with register_module()"
    },
    {
      "X ": "unable to install torchvision",
      "Z ": "Likewise, you should select Release over Debug in the VS GUI.",
      "Y ": "Use updated version "
    },
    {
      "X ": "libtorch glog doesnt print",
      "Z ": "Maybe you can try it: add add_definitions(-DC10_USE_GLOG) in your project‚s cmakelists.txt.",
      "Y ": " AAdd add_definitions(-DC10_USE_GLOG"
    },
    {
      "X ": "runtimeerror stop waiting response is expected",
      "Z ": "The error has been fixed.‚Stop_waiting response is expected‚error occurred in TCPStore.cpp. So it was actually the communication problem. It works finally when I reinstalled NCCL: https: //github.com/NVIDIA/nccl.git",
      "Y ": "Reinstall  NCCL using this link  https://github.com/NVIDIA/nccl.git"
    },
    {
      "X ": "torch nn parallel data parallel for distributed training backward pass model update",
      "Z ": "Yes the locking is builtin and the weights will properly be updated before they are used.",
      "Y ": "ocking is bultin function and weights will be updated  accordingly "
    },
    {
      "X ": "why is float tensor addition on cpu slower for avx2 than the default aten cpu capability",
      "Z ": "Resolved at On CPU, vectorized float tensor addition might be slower than unvectorized float tensor addition ¬∑ Issue #60202 ¬∑ pytorch/pytorch ¬∑ GitHub.Basically, memory allocation &amp; zero-filling costs are worse for AVX2.",
      "Y ": "Due to memory allocation on CPU "
    },
    {
      "X ": "pytorch in place operator issue in numpy conversion",
      "Z ": "id() is inappropriate because python objects are not value objects, i.e. they link to other objects, and you just have multiple links here (see .storage().data_ptr() to reason about address identities)",
      "Y ": "see .storage().data_ptr() "
    },
    {
      "X ": "customdataset give me error",
      "Z ": "Thank you, I could not find these subtle typo bug int. I actually meant init. many thanks",
      "Y ": "check init "
    },
    {
      "X ": "is the sgd in pytorch a real sgd",
      "Z ": "Ok perfect, that was exactly what I thought. Actually, they should be named Stepper. For example with SGD that will be ‚SGDStepper. That seems more clear.",
      "Y ": "It shoul dbe SGDStepper"
    },
    {
      "X ": "runtimeerror number of dims dont match in permute",
      "Z ": "alicanakca:mask‚Äôs shape is torch.Size([256,256]).This is the issue ‚Äì the mask is 2-dimensional, but you‚Äôve provided 3 arguments to mask.permute().I am guessing that you‚Äôre converting the image from h x w x c format to c x h x w. However, looks like the mask is only in an h x w format.",
      "Y ": "This is the problem: the mask is two-dimensional, but you've given it three arguments. permute().I'm assuming you're converting the image from h x w x c to c x h x w. However, it appears that the mask is only in h x w format."
    },
    {
      "X ": "in pytorch is there pdf logpdf function for distribution",
      "Z ": "https://pytorch.org/docs/master/distributions.html?highlight=distributions#module-torch.distributions It looks like probs() and log_probs() are what you‚Äôre looking for",
      "Y ": "use this link  https: //pytorch.org/docs/master/distributions.html?highlight=distributions#module-torch.distributions"
    },
    {
      "X ": "how to create computational graphs for updated parameters",
      "Z ": "Hi,You might want to take a look at the higher library that is built to do just that.",
      "Y ": "There are sepearte library to do that "
    },
    {
      "X ": "neat way of temporarily disabling grads for a model",
      "Z ": "Hi,I don't think there is any update. The for loop is simple and is the most efficient thing that can be done here.Especially with your special logic of things already not requiring gradients, that would be tricky.Note that you can add a method to your q_model module yourself to do that to make it a bit cleaner.",
      "Y ": "I don't believe there has been an update. The for loop is straightforward and the most efficient option here.That would be tricky, especially with your special logic of things already not requiring gradients.To make it a little cleaner, you can add a method to your q model module yourself."
    },
    {
      "X ": "why autograd will accumate gradients",
      "Z ": "You could simulate a larger batch size by accumulating the gradients of smaller batches and scaling them with the number of accumulations. This can be useful e.g. if the larger batch size would be beneficial for training but doesn‚Äôt fit onto your GPU.Accumulating the gradients gives you the ability to scale them manually afterwards without enforcing any assumptions on your use case.",
      "Y ": "Accumulating the gradients gives you the ability to scale them manually afterwards without enforcing any assumptions on your use case."
    },
    {
      "X ": "complex functions exp does not support automatic differentiation for outputs with complex dtype",
      "Z ": "Hi,In preparation for the 1.7 release and to avoid issues, we added error messages for all the functions that were not yet audited for complex autograd.We are working on auditing the formulas and re-enabling them.cc @anjali411 do we have an issue describing the process if people want to help here?",
      "Y ": "use the  latest version 0f 1.7 "
    },
    {
      "X ": "optimizing parameters of function generating convolution kernel instead of raw weights",
      "Z ": "You should probably use nn.functional.conv2d, with it you can use any tensor as kernel .",
      "Y ": "use  nn.functional.conv2d "
    },
    {
      "X ": "where is the actual code for layernorm torch nn functional layer norm",
      "Z ": "You can find the (CPU) C++ implementation here.",
      "Y ": "use this link for  CPU C++ implementation https://github.com/pytorch/pytorch/blob/392abde8e64b0d91b7d52aecee8dce9aff8d0b2f/aten/src/ATen/native/layer_norm.cpp "
    },
    {
      "X ": "how can i apply l2 l1 loss with 3d voxels",
      "Z ": "You can directly apply both mentioned losses, as they would expect the model output and target to have the same shape, which is the case for your use case.Unfortunately, I not sure how SSIM can be used for your use case, but if I‚m not mistaken the original implementation uses 2D convs internally, so you might change it to 3D ones.",
      "Y ": "You can directly apply both mentioned losses "
    },
    {
      "X ": "different init for training ensembles",
      "Z ": "Setting the seed at the beginning of the script would make the pseudorandom number generator output deterministic ‚Äúrandom‚Äù values. Creating multiple models in the same script would thus also create different parameters, since the sequence of the random number generation is defined by the seed, but the values won‚Äôt be the same.",
      "Y ": "Setting the seed at the start of the script would cause the pseudorandom number generator to produce deterministic “random” values. Because the sequence of the random number generation is defined by the seed, creating multiple models in the same script would result in different parameters, but the values would not be the same."
    },
    {
      "X ": "torch lstsq output size incorrect",
      "Z ": "answered here: torch.lstsq returns wrong tensor size ¬∑ Issue #56833 ¬∑ pytorch/pytorch ¬∑ GitHub",
      "Y ": "use this link https://github.com/pytorch/pytorch/issues/56833"
    },
    {
      "X ": "on the fly image rotation cpu bottleneck",
      "Z ": "Have you tried alternative rotation implementations (e.g., skimage‚Äôs rotate or albumentations‚Äôs rotate)?Albumentations in particular claims to be very fast for rotation: benchmark.",
      "Y ": "try Albumentatuins "
    },
    {
      "X ": "my program stops at loss backward without any prompt in cmd",
      "Z ": "I I tried to run my program on linux platform, and it ran successfully.Therefore, it is very likely that it is caused by different osPrevious os win 10",
      "Y ": "Due to different OS "
    },
    {
      "X ": "creating input for the model from the raw text",
      "Z ": "Or, you could load your data with a new torchtext abstraction. Text classification datasets, mentioned by you, follow the same new abstraction. It should be very straightforward to copy/paste and write your own pipeline link.",
      "Y ": "load data with torchtext abstraction "
    },
    {
      "X ": "model before after loading weight totally different",
      "Z ": "@ptrblck I found finally the issue. It cames when I tried to compute the gradient with the backward() function. I forgot to use amp.scale_loss. But it makes a weird behaviors because the training works well, until I load again the checkpoint Problem solved !",
      "Y ": "use amp.scale_loss.' "
    },
    {
      "X ": "cudaextension for multiple gpu architectures",
      "Z ": "Apparently it was some kind of problem with an old cached version works now ",
      "Y ": "update to new version or remove the cache of old version "
    },
    {
      "X ": "edge case with register hook",
      "Z ": "Ho right.The thing is that your hook actually waits on the other backward to finish because it waits on the the other thread.The thing is that because the hook is blocked waiting on this, another thread cannot use run backward (this current thread can though).So you either want to run this other backward in the same thread as the hook. Or not block the hook waiting on that backward.",
      "Y ": "The problem is that your hook actually waits on the other thread to finish because it is dependent on it.The problem is that because the hook is blocked while waiting for this, another thread cannot run backward (this current thread can though).So you'll either want to run this other thread backwards in the same thread as the hook, or you'll want to run it forwards in a different thread. Or, alternatively, do not block the hook while waiting on that backward."
    },
    {
      "X ": "network in q learning is predicting the same q values for all states",
      "Z ": "Normalizing the input on a scale 0-1 instead of -1 to 1 solved this issue.",
      "Y ": "Scale   to 0-1 instead of -1 to 1"
    },
    {
      "X ": "how do i map joblibs parallel function to pytorchs distributeddataparallel",
      "Z ": "use torch.multiprocessing.pool",
      "Y ": "use  torch.multiprocessing.pool"
    },
    {
      "X ": "how to get the batch dimension right in the forward path of a custom layer",
      "Z ": "pytorch .dot function is different from tensorflow or numpy",
      "Y ": "use pytorch .dot function "
    },
    {
      "X ": "unexpected key in state dict bn1 num batches tracked",
      "Z ": "I manage to solve the problem with following link How to load part of pre trained model? @apaszke post.",
      "Y ": "use this link https://discuss.pytorch.org/t/how-to-load-part-of-pre-trained-model/1113/2"
    },
    {
      "X ": "allow size mis match in autograd forward vs backward",
      "Z ": "True, but the memory would be an issue.I‚Äôm not sure to see why.Currently, you already have a x ‚Üí M ‚Üí z ‚Üí PADDING ‚Üí z_pI think you want (x, z_g) ‚Üí M_AND_PADDING ‚Üí z_pAnd in that new custom Function, you don‚Äôt need to do anything beyond what the padding is currently doing.",
      "Y ": "use this x, z_g) → M_AND_PADDING → z_p"
    },
    {
      "X ": "gradients exist but weights not updating",
      "Z ": "Hi,When you get the parameters of your net, it does not clone the tensors. So in your case, before and after contain the same tensors. So when the optimizer update the weights in place, it updates both your lists. You can try and change one weight by hand, they will still remain the same.",
      "Y ": " Try changing the weight"
    },
    {
      "X ": "variables are not updated after loss backward and optimizer step",
      "Z ": "Finally, and after 5 days, I found the error.In fact, the computational graph was broken into two different places, due to two wrong operations. However, it was very difficult to debug it and find the issue source. No tools or Libs exist to visualize the graph, which is the main component for the gradient backpropagation.",
      "Y ": "In fact, the computational graph was broken into two different places, due to two wrong operations."
    },
    {
      "X ": "the second order derivative of a function with respective to the input",
      "Z ": "Hi,The problem is that your function is linear. So the first gradient is constant and the second order gradient is independent of the input.This error message happens because of the independence (and thus, it is not used in the graph).",
      "Y ": "The problem is that your function is linear. So the first gradient is constant and the second order gradient is independent of the input."
    },
    {
      "X ": "loss backward time increases for each batch",
      "Z ": "Could you check if you might be running out of memory and your system might be using the swap?",
      "Y ": "Check Memory "
    },
    {
      "X ": "weight of layer as a result of dot operation",
      "Z ": "Use the functionals instead of the convolutional module. Functionals takes weights as inputs.",
      "Y ": "Use the functionals instead of the convolutional module. Functionals takes weights as inputs."
    },
    {
      "X ": "runtimeerror mat1 dim 1 must match mat2 dim 0 cnn",
      "Z ": "It looks like you are already printing the shape so you should be able to see what N, and D are here.Flatten can work, but rather than reshaping to (-1, something), you should reshape to (batch_size,-1).",
      "Y ": "use Flattenand reshape (batch_size,-1)"
    },
    {
      "X ": "error on torch load pytorchstreamreader failed",
      "Z ": "Ok, Im able to load the model. The problem was with the saved weight file. It wasn't saved properly and the weight file size was smaller (only 90 MB instead of 200 MB).",
      "Y ": "Save the file size properly"
    },
    {
      "X ": "the code that was working previously gets stuck at loading the checkpoint file that is cached on system",
      "Z ": "hmm it was very weird. I reboot the machine and then I ran it again and it worked.",
      "Y ": "Reboot the machine"
    },
    {
      "X ": "nn transformerencoderlayer 3d mask doesnt match the broadcast shape",
      "Z ": "Solution: Upgrade to PyTorch 1.5",
      "Y ": "Upgrde the version "
    },
    {
      "X ": "solved runtimeerror expected object of device type cuda but got device type cpu for argument 2 mat2 in call to th mm",
      "Z ": "Okay, i just solved the problem by myself, the reason of this is the Attn() function which i wrote outside the model class as another def() function, and the Attn() function will not be moved to the GPU, so I create a new nn.Module class for Attn and i wrote : self.attn = Attn(hidden_size) in the model.",
      "Y ": "use this self.attn = Attn(hidden_size) "
    },
    {
      "X ": "my implementation of self attention",
      "Z ": "I can't  believe I made this silly mistake in verson1 queries are outputted from w_v, instead of w_q.",
      "Y ": "queies are outputted from w_v"
    },
    {
      "X ": "resume training validation loss going up increased",
      "Z ": "Thank you sir, this issue is almost related to differences between the two datasets.",
      "Y ": "use same datasets "
    },
    {
      "X ": "lstm text generator repeats same words over and over",
      "Z ": "Okay, it was actually a stupid mistake I made in producing the characters with the trained model: I got confused with the batch size and assumed that at each step the network would predict an entire batch of new characters when in fact it only predicts a single one Yikes!Anyways, thanks for your advice and  see if I can use it to fine tune the results a bit!",
      "Y ": "check on each batch outputs"
    },
    {
      "X ": "what is the exactly implementation of torch embedding",
      "Z ": "It should eventually call into this method for the forward pass.",
      "Y ": "It should eventually call into this method for the forward pass"
    },
    {
      "X ": "how would i do load state dict in c",
      "Z ": "The current implementation of load_state_dict is in Python, and basically it parses the weights dictionary and copies them into the model's parameters.So I guess you'll need to do the same in CPP.",
      "Y ": "it is inpython and it will be same for  cpp also "
    },
    {
      "X ": "compiler c not compatible with the compiler pytorch was built",
      "Z ": "@MauroPfister ArchLinux‚s compiler does follow the rolling base of GNU, I would say they should be fully compatible. The reason we still give warning is that ArchLinux is a independent linux distribution, their software might contains their own Proprietary software and is not endorsed by the GNU project.",
      "Y ": " ArchLinux is a independent linux distribution"
    },
    {
      "X ": "unable to access my nets parameters",
      "Z ": "try class ConvNet : public torch: :nn: :Module",
      "Y ": "use  class ConvNet : public torch: :nn: :Module"
    },
    {
      "X ": "converting simple rnn model from python to c",
      "Z ": "Sorry I was giving you this link : https://github.com/prabhuomkar/pytorch-cpp/tree/master/tutorials/intermediate/recurrent_neural_network By mistake I have given u the wrong link.",
      "Y ": " use this link https: //github.com/prabhuomkar/pytorch-cpp/tree/master/tutorials/intermediate/recurrent_neural_network "
    },
    {
      "X ": "using nn module list in c api",
      "Z ": "@Aaditya_Chandrasekha We have a simple instruction in the comment here: https: //github.com/pytorch/pytorch/blob/cd0724f9f1b57dae12be2c3fc6be1bd41210ee88/torch/csrc/api/include/torch/nn/modules/container/modulelist.h#L11 We have tests here, it contains more examples. https: //github.com/ShahriarSS/pytorch/blob/678873103191c329e2ca4a53db1d398599ad9443/test/cpp/api/modulelist.cpp",
      "Y ": "use this link https://github.com/pytorch/pytorch/blob/cd0724f9f1b57dae12be2c3fc6be1bd41210ee88/torch/csrc/api/include/torch/nn/modules/container/modulelist.h#L11 "
    },
    {
      "X ": "gradient clipping in pytorch c libtorch",
      "Z ": "The usage look correct and is also used in this way in this test.",
      "Y ": " Same eay implement in test"
    },
    {
      "X ": "futex wait hang",
      "Z ": "Hi, No this is expected. Half of them are OMP worker thread and one of them is an autograd engine worker thread. These are worker threads that are kept around so that we don't have to recreate them every time we need them. OMP does that by default and we do it ourselves as well in the autograd engine.",
      "Y ": "OMP and autograd  does it by default "
    },
    {
      "X ": "can we split a large pytorch built in nn module to multiple gpu",
      "Z ": "Hi, I'm afraid we don't provide any construct to do this automatically. But you can simply create 8 different Linear that each take a subset of the input and split the input yourself and call each of these Linears and then add all the results (assuming your split on the input size here given that it is the biggest).",
      "Y ": " you can simply create 8 different Linear that each take a subset of the input and split the input yourself and call each of these Linears and then add all the results"
    },
    {
      "X ": "sharing model between processes automatically allocates new memory",
      "Z ": "It turns out that every-time a process holds any pytorch object that is allocated on the GPU, then it allocates an individual copy of all the kernels (cuda functions) that pytorch uses, which is about 1GB. It seems there is no way around it, and if your machine has Xgb of GPU RAM, then you're limited to X processes. The only way around it is dedicating one process to hold the pytorch module and act with the other processes in a producer-consumers pattern, which is a real headache when it comes to scalability and much more for RT application .",
      "Y ": "The only way around it is dedicating one process to hold the pytorch module and act with the other processes in a producer-consumers pattern, which is a real headache when it comes to scalability and much more for RT application"
    },
    {
      "X ": "how to split a pretrained model for model parallelism",
      "Z ": " Do I also need to change this or does this ‚to  work with nn.sequential (no separate forward function) as well? ‚towould work on nn.sequential, although you need to modify the forward function since once you have completed execution for the module on GPU0, the output will be on GPU0. Now since the other module you want to execute is on GPU1, you need to move the output from GPU0 to GPU1 manually (using .to) and then you need to execute the module on GPU1.",
      "Y ": " use nn.sequential and .to to move output from one GPU to another "
    },
    {
      "X ": "build pytorch gpu for different gpu archs",
      "Z ": "I‚ve answered in the GitHub issue.",
      "Y ": "answered in the GitHub issue"
    },
    {
      "X ": "confused about distributed data parallel behavior",
      "Z ": "Hi,Could you try torch.cuda.set_device() instead, torch.cuda.device is a context manager, also see https: //github.com/pytorch/pytorch/issues/1608",
      "Y ": "use this torch.cuda.set_device() "
    },
    {
      "X ": "loss calculation within batch iteration",
      "Z ": "This problem has been resolved. I derived a bit and figured those two loss calculation approaches are essentially the same.",
      "Y ": "The two loss calualtion approaches are same "
    },
    {
      "X ": "best way to handle variable number of inputs",
      "Z ": "Why not use *args and **kwargs?",
      "Y ": "use *args and **kwargs"
    },
    {
      "X ": "model to cpu does not release gpu memory allocated by registered buffer",
      "Z ": "you cannot delete the CUDA context while the PyTorch process is still runningClearing the GPU is a headache vision No, you cannot delete the CUDA context while the PyTorch process is still running and would have to shutdown the current process and use a new one for the downstream application.",
      "Y ": "No, you cannot delete the CUDA context while the PyTorch process is still running and would have to shutdown the current process and use a new one for the downstream application."
    },
    {
      "X ": "implementing a custom convolution using conv2d input and conv2d weight",
      "Z ": "Hi, This OOM exception comes from the python api implement of conv2d_weight actually. In backprop weight calculation, the output gradients need to be expanded with output channel times. When default cudnn implement this with data prefetch block and block (not allocate more memory), python api uses a repeat that will allocate a huge size of memory on output gradients tensor with unnecessary duplication of data. you can easily fix this by convert the repeat into a loop function at conv2d_weight.",
      "Y ": "convert  into a loop function at conv2d_weight"
    },
    {
      "X ": "why criterion cuda is not needed but model cuda is",
      "Z ": "The impact of moving a module to cuda is actually to move all it'ss parameters to cuda. Criterion don't have parameters in general, so it is not necessary to do it.",
      "Y ": " Critertion don't have parameters but cuda has parameters"
    },
    {
      "X ": "debugging memory allocations in torch autograd grad",
      "Z ": "Hi, You can enable anomaly mode. That will show you the forward op that corresponds to the one that is failing in the backward. Can you share this trace?",
      "Y ": "enable anomly mode "
    },
    {
      "X ": "integrated gradients for rnns",
      "Z ": "Hi,You won't be able to get gradients wrt to the input of the embedding layer I a'm afraid. Since, as you pointed out, they are not of contiguous dtype. You might want to do use that technique on the output of the embedding layer instead?",
      "Y ": "use  the  technique on the output of the embedding layer"
    },
    {
      "X ": "minibatch size by iteration",
      "Z ": "Your code looks correct, but you might want to divide the accumulated loss by the number of accumulation steps. Also, here is a nice overview of different approaches in case you want to trade compute for memory etc.",
      "Y ": "Divide the accumulated loss by the number of accumulation steps"
    },
    {
      "X ": "getting cant export a trace that didnt finish running error with profiler",
      "Z ": "Solved: The print(prof) line should be outside the with block.",
      "Y": "print(prof) should be outside the  block "
    },
    {
      "X ": "pytorch lightning number of training and validation batches",
      "Z ": "I think this is the total number of batches (training + validation). Best regardsThomas",
      "Y ": "total number of batches (training + validation)"
    },
    {
      "X ": "how to load imagenet",
      "Z ": "The validation set for ImageNet has 50,000 images or 50 per each of the 1,000 classes. If you don't shuffle the data then the expectation indeed is that you only see two classes for a batch size of 100.",
      "Y ": " Shuffle the data  "
    },
    {
      "X ": "masking out locations in convolutional kernels",
      "Z ": "I don't see any obvious problems here but you can do some simple tests like running your layer on a ones tensor input and checking that the results are what you expect based on the mask. If you are using batchnorm layers after the convolution, you can avoid the bias term entirely as it will be effectively undone by the batchnorm. Additionally, I don't think the bias is applied before the convolution, so it shouldn't be affected by (or affect) the mask that you are using.",
      "Y ": "you can do some simple tests like running your layer on a ones tensor input and checking that the results are what you expect based on the mask"
    },
    {
      "X ": "rewriting a crnn model with the same architecture gives different results than the original",
      "Z ": "First thing, good job to simplify the stuff you find on the internet. I often do this, too, when I need to look at code from others.You are not using the same weights with this. The random init for the second will be different than the one for the first because you not re-seeding after instiatiating the first.In this case, if I copy the manual seed to before the second network is instantiated, I actually do get the same results.Now this is also good luck because apparently you are creating the modules in the exact same order in both networks. This can easily break through refactoring and in this case it is safer to try to copy the state_dict of one of them to the other (take the state dict, rename the keys as needed, load into the other model) to compare. For things with batch norm, one also needs to keep in mind that running it updates the running statistics in training mode.Best regardsThomas",
      "Y ": ""
    },
    {
      "X ": "dynamically replacing the last linear layer",
      "Z ": "Sorry for not really answering your question, but you might want to test the training on the CPU first. Here the error messages are most of the time more useful than CUDA errorsApart form that, you don‚Äôt really replace the last linear layer. You simple have multiple linear layers and choose one dynamically, which is essentially the idea behind multitask learning. And from a quick look at your code, it seems alright. But I didn't check any details.What‚Äôs the error when running in the CPU?",
      "Y ": "try running on CPU "
    },
    {
      "X ": "unsure of output dimension and loss type newbie",
      "Z ": "You might want to look a this post, it seems very related. The link Udacity tutorial is also exactly about a character RNN.",
      "Y ": "Check character RNN"
    },
    {
      "X ": "improving nmt model outputs",
      "Z ": "Rare words or out-of-vocabulary words are a fundamental challenge for NMT. You still find very recent academic papers addressing this.For example, for a very simple NMT task, I used an off-the-shelf NER system to replace, say, person names. So 2 sentences ‚ÄúI met Alice‚Äù and ‚ÄúI met Bob‚Äù would be converted to I met ; same for the target sentences. After the translation, I would simple replace  with the actual name. Replacing numbers with  would also be very easy with a RegEx. It worked fine enough for my use case, but its probably too naive for the general case.",
      "Y ": "Replacing numbers with  would also be very easy with a RegEx. but its probably too naive for the general case "
    },
    {
      "X ": "implement a keras model using pytorch doesnt learn",
      "Z ": "Problem identified, the data need to be shuffled in train loader.",
      "Y ": "the data need to be shuffled in train loader"
    },
    {
      "X ": "transformer mask doesnt do anything",
      "Z ": "I figured out the problem, I was not properly inserting SOS and EOS tokens, so even with proper masking it was able to copy straight from the given target.",
      "Y ": "Insert SOS and EOS tokens properly "
    },
    {
      "X ": "cant use from blob to construct tensor on gpu in c",
      "Z ": "@farmersrice check this issue.https: //github.com/pytorch/pytorch/issues/15426, I think our document need update.You can not copy memory from CPU to GPU directly. Your temp[] is not on GPU.I think you have to use .to(device) at this point.",
      "Y ": "use this link https: //github.com/pytorch/pytorch/issues/15426‚ "
    },
    {
      "X ": "error in cmake while setting up libtorch",
      "Z ": "cudnn version might not be found in cudnn.h. In the cuda.cmake change cudnn.h to cudnn_version.h and caffe2 is able to find the cudnn version.",
      "Y ": "change cudnn.h to cudnn_version.h"
    },
    {
      "X ": "how to mask tensor with boolean using c api how to achieve this python code with c api",
      "Z ": "use masked_scatter function can do that",
      "Y ": "use masked_scatter"
    },
    {
      "X ": "libtorch ubuntu runtime error",
      "Z ": "@yf225 sorry.It was my mistake.It happened because of a file path error",
      "Y ": " use correct path address"
    },
    {
      "X ": "during deserialization torch load fails at debug while it works fine in release mode unhandled exception at 0x00007fff7de1a308 in test exe microsoft c exception c10 error at memory location 0x000000cdee5bd950 occurred",
      "Z ": "The reason was  the debug version of the lib was missing! (again!) moving the needed libs next to  the executable fixed the issue  (the release versions were added to the PATH, so at runtime, it would pick the release version and boom!)",
      "Y ": "update to latest version of build "
    },
    {
      "X ": "per tensor channel quantization equivalents in pytorch caffe2",
      "Z ": "Unfortunately, Caffe2 Int8Conv doesn‚Äôt support per-channel quantization. The DNNLOWP engine that uses FBGEMM backend does support group-wise quantization if that helps you. Please see https://github.com/pytorch/pytorch/blob/master/caffe2/quantization/server/conv_groupwise_dnnlowp_op_test.py for example of using group-wise quantization.",
      "Y ": "use this link https://github.com/pytorch/pytorch/blob/master/caffe2/quantization/server/conv_groupwise_dnnlowp_op_test.py "
    },
    {
      "X ": "quantized squeeze block mobilenetv3",
      "Z ": "You can actually try to comment out the two lines as https://github.com/pytorch/pytorch/pull/30442, since the tensor iterator supports broadcast.",
      "Y ": "use this link https://github.com/pytorch/pytorch/pull/30442, since the tensor iterator supports broadcast "
    },
    {
      "X ": "cannot quantize nn conv2d with dynamic quantization",
      "Z ": "Hi @babak_hss, Dynamic quantization is currently supported only for nn.Linear and nn.LSTM, please see: https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic",
      "Y ": " for nn.Linear and nn.LSTM   Dynamic quantization is currently supported . use this link https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic "
    },
    {
      "X ": "when quantized max pool2d is used",
      "Z ": "Yes, that is correct. it is dispatch here: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Pooling.cpp#L128 We have multiple ways to do dispatch right now in PyTorch, one common place is in native_functions.yaml, you can take a look at: https: //github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/README.md",
      "Y ": "use tis link https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Pooling.cpp#L128"
    },
    {
      "X ": "assertionerror torch nn quantized relu does not support inplace",
      "Z ": "should be fixed in https://github.com/pytorch/pytorch/pull/33105, cc @raghuramank100",
      "Y ": "use this link https://github.com/pytorch/pytorch/pull/33105"
    },
    {
      "X ": "conv2d unpack and conv2d prepack behavior",
      "Z ": "Bias is kept in fp32 format for eager mode quantization and dynamically quantized while computing quantized FC/Conv. It‚Äôs returned in fp32 because that‚Äôs how it‚Äôs passed in to an operator as well. The reason for keeping bias in fp32 is the unavailability of input scale until the operator has executed so we can‚Äôt quantize bias until then. To convert bias to quantized format, use input_scale * weight_scale with a zero_point = 0. See this https: //github.com/pytorch/FBGEMM/blob/master/include/fbgemm/OutputProcessing-inl.h#L104-L108 code for converting bias with act_times_weight scale. Check out the code in https: //github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp file for prepack function. If USE_FBGEMM is true, fbgemm_conv_prepack function is called for doing prepacking.",
      "Y ": "check this code https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp "
    },
    {
      "X ": "net in dataparallel make training aware quantization convert model acc error",
      "Z ": "There are currently some issues with nn.DataParallel and Quantization Aware Training. There is a WIP PR to fix it - https://github.com/pytorch/pytorch/pull/37032 You can follow the toy example here to make sure you're following the steps for QAT correctly https: //gist.github.com/vkuzo/78b06c01f23f98ee2aaaeb37e55f8d40",
      "Y ": " follow the steps from this link https://gist.github.com/vkuzo/78b06c01f23f98ee2aaaeb37e55f8d40"
    },
    {
      "X ": "construct quantized tensor from int repr",
      "Z ": "we do have some non-public API to do this: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/native_functions.yaml#L3862 and https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/native_functions.yaml#L3868 but they we might change the API when we officially release quantization as a stable feature.",
      "Y ": "use this link https: //github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/native_functions.yaml#L3862 "
    },
    {
      "X ": "tied conv1d and conv transpose1d not geting the same result as the input",
      "Z ": "I think I misunderstand the ,tied weight concept.I wrote the conv_transposed1d in doubly block circulant matrix form and I find that one don't need to flip the temporal axis actually.Suppose the conv1d's matrix is  and the corresponding conv_transpose1d's matrix is  .The square matrix  apprently is not always identity matrix. So the result need not to be identical to the input.",
      "Y ": " reault should not be identical "
    },
    {
      "X ": "dynamic quantization error mixed serialization of script and non script modules is not supported",
      "Z ": "It looks like you are trying to quantize the scripted net.The correct order seems like first quantize your net then script it!",
      "Y ": " first quantize your net and then script it  "
    },
    {
      "X ": "pytorch1 5 0 win7 64bit didnt find engine for operation quantized conv2d prepack noqengine",
      "Z ": "We use VS 14.11 to build binaries for CUDA 9.2, so there is no FBGEMM support. If you need FBGEMM, then please use the binaries with other CUDA versions instead.",
      "Y ": "use the VS 14.11  build "
    },
    {
      "X ": "did pytorch support int16 quantization",
      "Z ": "We currently do not support int16 quantization. There is support for fp16 dynamic quantization.",
      "Y ": "use fp16 dynamic quantization"
    },
    {
      "X ": "dose static quantization support cuda",
      "Z ": "No, it only works on CPU right now, we will consider adding CUDA support in the second half of the year",
      "Y ": "currently it works only on CPU "
    },
    {
      "X ": "quantized model consists of relu6",
      "Z ": "That is correct, we will work on adding support for fusing relu6 soon. For now, if you are doing post training quantization, you could replace relu6 with relu and proceed as a work around. Thanks,",
      "Y ": " repalce relu6 with relu "
    },
    {
      "X ": "loading of quantized model",
      "Z ": "Hi mohit7,Make sure you create the net using previous definition, and let the net go through process that was applied during quantization before (prepare_model, fuse_model, and convert), without rerun the calibration process.After that you can load the quantized state_dict in. Hope it helps.",
      "Y ": "Make sure you create the net using previous definition, and let the net go through process that was applied during quantization before (prepare_model, fuse_model, and convert), without rerun the calibration process.After that you can load the quantized state_dict in"
    },
    {
      "X ": "problem in computing loss in multiple cpu distribution training",
      "Z ": "Typically you want to run the forward and backward pass on each process separately and then average the gradients across all processes and then run the optimizer independently on each process.I‚m wondering what is the reason you‚re trying to build this yourself. PyTorch has a DistributedData Parallel module, which does all of this for you.",
      "Y ": "use PyTorch Distributed  Data Parallel module"
    },
    {
      "X ": "how to link a custom nccl version",
      "Z ": "You can see here that NCCL is statically linked to the binaries and can take a look at the repository for more information about the build process. ",
      "Y ": "Check the build versions"
    },
    {
      "X ": "training with ddp and syncbatchnorm hangs at the same training step on the first epoch",
      "Z ": "[Solved] My problem was that I have random alternating training that go down different branches of my model. I needed to set the random seed that samples the probability of which alternating loss it will perform. This is probably because when pytorch does it reduce_all somewhere, it notices a difference in batch norm statistics since I believe it assumes some ordering on the statistics.",
      "Y ": " set the random seed that samples the probability of which alternating loss it will perform"
    },
    {
      "X ": "dataparallel and conv2d",
      "Z ": "The conv2d library was not the problem. I found out problem was listed here : Since I was running VGG on cifar100, I had to rewrite the forward method on pytorch‚Äôs default VGG network since its built for ImageNet and includes a averagepool layer that will error with cifar100's  data size. Using types.MethodType to replace methods in a network is incompatible with DataParallel. My solution was to create my own ‚MyVGG‚class that takes a VGG model as an input and takes all of its parameters, and then I could write my own forward function within that class.",
      "Y ": "using types.MethodType to replace methods in a network is incompatible with DataParallel"
    },
    {
      "X ": "how dose distributed sampler passes the value epoch to data loader",
      "Z ": "The sampler is passed as an argument when initializing the DataLoader, so the train loader will have access to the sampler object. Neither the loader not the sampler need to be re-constructed every epoch.",
      "Y ": "the sampler is passed as an argument when initializing the DataLoader, so the train loader will have access to the sampler object."
    },
    {
      "X ": "the ddp seem to be disable to find the second node",
      "Z ": "If I understand correctly, you are trying to train with 4 GPUs, 2 on one machine and 2 on another machine? If this is the case, then you will need to launch your training script separately on each machine. The node_rank for launch script on the first machine should be 0 and node_rank passed to the launch script on the second machine should be 1. It seems here like you are passing 2 separate node_ranks for processes launched on the same machine.See the multi-node multi-process distributed launch example here: Distributed communication package - torch.distributed ‚ PyTorch 1.7.0 documentation",
      "Y ": "use Distributed communication package - torch.distributed ‚ PyTorch 1.7.0 documentation"
    },
    {
      "X ": "why the output of children part of a network has low resolution",
      "Z ": "Based on the posted code I assume the left image represents the input while the right one the model output?If thats the case, I guess your model isn't able to create sharp images and you could check the literature for new architectures, which could avoid the blurry output.",
      "Y ": "I guess your model isn't able to create sharp images and you could check the literature for new architectures"
    },
    {
      "X ": "build model from submodels",
      "Z ": "I think I found the solution by myself.For everyone struggling with the same problem: You can use ModuleList. I my example, I can just append each encoder and the classifier to the ModuleList. Using this class, my Main-Model is aware of its submodels and for example the number of parameters is calculated correctly. I think there is a pretty good explanation of the concept here.",
      "Y ": "use ModuleList"
    },
    {
      "X ": "how to install pytorch 1 3 0 or above with cuda 8",
      "Z ": "Thank you for your reply.I haven't tested building it from source. I decided to use the cpu version for now.",
      "Y ": "install using build package "
    },
    {
      "X ": "custom mean of tensor partitions",
      "Z ": "Id look at the third-party package PyTorch scatter. It has a reduction=mean mode. You need to convert lst to a tensor and possibly use broadcasting. Now, the scatter implementation uses atomics, which is problematic e.g. in terms of performance. If the partitions are ordered (as your example suggests), you might compare to just doing a for loop and taking means over the slices. Best regards Thomas",
      "Y ": "look into PyTorch scatter package "
    },
    {
      "X ": "functional linear may cause runtimeerror one of the variables needed for gradient computation has been modified by an inplace operation",
      "Z ": "Finally, I solved the problem.I wrongly use the output of the model as input for the next iteration.What a fool mistake!",
      "Y ": "Use model input "
    },
    {
      "X ": "how to remove the grad fn selectbackward in output array",
      "Z ": "Hi,The detach() in the no_grad block is not needed. You will need to move all the ops into the no_grad block though to make sure no gradient is tracked ",
      "Y ": "The detach() in the no_grad block is not needed."
    },
    {
      "X ": "can i get gradients of network for each sample in the batch",
      "Z ": "If you use simple  NN, you can use tricks like the one mentionned here to reuse computations.",
      "Y ": "Use simple NN"
    },
    {
      "X ": "question about loading the model that was trained using 4gpu with distributed dataparallel to only 1 gpu job",
      "Z ": "I‚Äôm not sure to understand the use case.It seems you would like to load the state_dict to a single GPU machine, but in your code you are wrapping the model again in DDP.Would creating the model, loading the state_dict, and pushing the model to the single GPU not work?",
      "Y ": "Create  the model   loading the state_dict, and push  the model to the single GPU"
    },
    {
      "X ": "how to deploy different scripts on different gpus",
      "Z ": "You could pass the device you want to train on as an argument to the script. For example cuda: 0 corresponds to the 1st GPU in your system, cuda: 1 corresponds to the 2nd GPU and so on. Then assuming you store the passed argument in a variable named device, all you have to do is to call .to(device) on your tensors etc.",
      "Y ": " call .to(device) "
    },
    {
      "X ": "distributeddataparralled not support cuda",
      "Z ": "DistributedDataParallel (DDP) does supports CUDA. The comment suggests extra care might be necessary when backward run on non-default stream. Actually, even if backward occurs on non-default streams it should be fine for most use cases. Below is why:background: I learned from @albanD that autograd engine will use the same stream as the forward pass.Let‚Äôs take a look at what could go wrong for the code you quoted.1: the tensor is not ready when launching the allreduce operation2: the tensor was destroyed too soon before the allreduce finishes.We can rule out 2 for now, as all_reduce does recordStream() properly to prevent CUDA blocks to be freed too early.Then the only thing left is 1. The operation on that tensor before allreduce is bucket_view.copy_(grad.view({-1}), /* non_blocking */ true); in mark_variable_ready_dense. The copy here happens on the same device (replica.contents and grad). And Reducer itself does not switch streams in between. So the only case that could hit race condition is when the application used different streams for different operators during the forward pass, and grads associated with those operators fall into the same bucket in reducer.",
      "Y ": "bucket_view.copy_(grad.view({-1}), /* non_blocking */ true); in mark_variable_ready_dense. "
    },
    {
      "X ": "using custom method in distributed model",
      "Z ": "bigyeet:Is this right, or do I have to write a custom DataParallel wrapper that has scatter, gather, etc methods? If so, how would I do it? It depends on what you expected reset_hidden_state to achieve. Below is what happens in EVERY forward pass when you use DataParallel. split input data replicate model to all devices feed input data splits to all model replicas gather outputs from all replicas done with forward After the forward pass, the autograd graph actually contains multiple model replicas. It looks sth likeoriginal model &lt;- scatter &lt;- model replicas &lt;- replica output &lt;- gather &lt;- final output.So in your above use case, if reset_hidden_state has any side effect that you would like to apply to the backward pass, it will only apply to the original model, not to model replicas. But if you are only trying to clear some states for the next forward pass, it should work.",
      "Y ": "original model <- scatter <- model replicas <- replica output <- gather <- final output."
    },
    {
      "X ": "unable to load waveglow checkpoint after training with multiple gpus",
      "Z ": "This usually happens when multiple processes try to write to a single file.However, this should be prevented with the if condition if rank == 0:.Did you remove it or changed the save logic somehow?",
      "Y ": "use rank == 0"
    },
    {
      "X ": "strange behavior nn dataparallel",
      "Z ": "Thanks for the information. This points towards some communication issues between the GPUs.Could you run the PyTorch code using NCCL_P2P_DISABLE=1 to use shared memory instead of p2p access?",
      "Y ": "run model using NCCL_P2P_DISABLE=1"
    },
    {
      "X ": "loss collection for outputs on multiple gpus",
      "Z ": "If you are using nn.DataParallel the model will be replicated to each GPU and each model will get a chunk of your input batch.The output will be gathered on the default device, so most likely you wouldn‚Äôt have to change anything. However, I‚Äôm not sure about the use case.How are you calculating the memory consumption and is this operation differentiable?I assume it‚s not differentiable so that your accumulated loss will in fact just be the nn.CrossEntropyLoss.",
      "Y ": "If you are using nn.DataParallel the model will be replicated to each GPU and each model will get a chunk of your input batch."
    },
    {
      "X ": "default collate fn sending data to cuda 0",
      "Z ": "Have you tried setting CUDA_VISIBLE_DEVICES env var before launching the process? It would be more clear if you share some minimum code snippet ",
      "Y ": "set CUDA_VISIBLE_DEVICES env var befor launching the model  "
    },
    {
      "X ": "distributed gpu calculations and cuda extensions",
      "Z ": "Would splitting the data and sending each chunk to a specific device work? Something like this could already solve your use case: data = torch.randn(4,100) chunks = data.chunk(4,0) res = [] for idx, chunk in enumerate(chunks): res.append(my_fun(chunk.to('cuda: {}'.format(idx))).to('cuda: 0')) res = torch.stack(res)",
      "Y ": "data = torch.randn(4, 100) chunks = data.chunk(4,) res = [] for idx, chunk in enumerate(chunks): res.append(my_fun(chunk.to('cuda: {}'.format(idx))).to('cuda: 0'))res = torch.stack(res)"
    },
    {
      "X ": "question about torch distributed p2p communication",
      "Z ": "Hey @yijingThe message will directly send from  10.0.0.2 to 10.0.0.3.In init_process_group, the init_method=‚Äútcp: //10.0.0.1:8888‚Äù is only for rendezvous, i.e., all process will use the same ip:port to find each other. After that communications don‚Äôt need to go through master.BTW, if you are using p2p comm, torchrpc might be useful too. Here is a tutoral.",
      "Y ": " The message will directly send from  10.0.0.2 to 10.0.0.3.  and also use  torchpc "
    },
    {
      "X ": "behavior of dataloader when resuming training from the existing checkpoint",
      "Z ": "Are you trying to train your model for only 1 epoch because you have so much data and it‚Äôll take too long to do more, or are you possibly trying to do 1 epoch because your machine won‚Äôt allow it to finish and everything shuts off, so you‚Äôd like to save intermediate progress? (Epoch = single pass through your entire dataset) Just asking out of curiousity, no worries if there‚Äôs no reason. As for for your question, I‚Äôd do one of the following:Drop shuffle=True and as you train keep track of an id (either the step number, which will represent what batch you are on, or just the raw id of current sample you‚Äôre on). If you‚Äôre using a HuggingFace Trainer instance for your model training, you can use callbacks to do this (add a on_step_end or on_step_begin callback to write out current step # to a file, can be found here in the docs). When continuing training, You can slice examples starting from the id you left on, and ending with the last id of the dataset, then append all the samples you‚Äôve already trained with at the end of this slice (essentially shifting the samples you trained with, but putting them at the end). If you don‚Äôt care about re-using the samples at the end, you can just use PyTorch‚Äôs Subset dataset class. Keep shuffle=True but have a small function call when you fetch a sample that writes-out the id that‚Äôs getting fetched/processed. When continuing training, do a similar process as above (option 1) but rather than working with a single slice from shuffle=False you can slice out a subset of your dataset using the ids you‚Äôve saved",
      "Y ": "Keep shuffle=True, but when you fetch a sample, run a tiny function that prints out the id that's being fetched/processed. Continue training in the same way as before (option 1), but instead of working with a single slice from shuffle=False, use the ids you've saved to slice out a portion of your dataset."
    },
    {
      "X ": "architecture of deeplabv3 resnet50",
      "Z ": "Printing out the model wouldn't show the computation graph and would only print the child modules, so I agree that this would not be sufficient to see‚the structure.You could check out e.g. PyTorchViz to visualize the computation graph in case that's helpful.PS: Often I also take a look at the source code, but for segmentation/detection models this is unfortunately also not trivial.",
      "Y ": "use this https: //github.com/szagoruyko/pytorchviz"
    },
    {
      "X ": "create diagonal matrices from batch",
      "Z ": "Hi Samuel! Samue1: x = torch.rand(size=(M, N)) and want to create for each of the M inputs a diagonal matrixTry: torch.diag_embed (torch.rand (size = (M, N))) k",
      "Y ": "torch.diag_embed (torch.rand (size = (M, N))) "
    },
    {
      "X ": "how to display incorrect samples predicted by the model",
      "Z ": "HarshRangwala:Invalid shape (1,3,224,224) for image dataThat first dimension should be squeezed out as an image should have 3 dimensions: number of channels, height, and width. (i.e. (3,224,224)).Try img = img.squeeze() before calling ax.imshow(img)",
      "Y ": "Try img = img.squeeze() before calling ax.imshow(img)"
    },
    {
      "X ": "moving tensor to cuda",
      "Z ": "If you are pushing tensors to a device or host, you have to reassign them: a = a.to(device='cuda') nn.Modules push all parameters, buffers and submodules recursively and don't need the assignment.",
      "Y ": "a = a.to(device='cuda')"
    },
    {
      "X ": "efficient implementation of jacobian of softmax",
      "Z ": "Hi Samuel! Samue1: Does this also work for batched versions of S? No.  If you had tried it, you would have discovered that torch.outer() does not accept multidimensional tensors. Is the result correct if I use J = torch.diag_embed(S) - torch.outer(S, S) No, this will throw an error (because you pass a multidimensionaltensor to torch.outer()). You can, however, use pytorch‚Äôs swiss army knife of tensor multiplication functions to construct a batch version of outer: &gt;&gt;&gt; import torch gt;&gt;&gt; torch.__version__ '1.9.0' &gt;&gt;&gt; S = torch.arange (6).reshape (2,3).float() &gt;&gt;&gt; S tensor([[0., 1., 2.],[3., 4., 5.]])&gt;&gt;&gt; torch.diag_embed (S) - torch.einsum ('ij, ik -&gt; ijk', S, S)tensor([[[  0.,   0.,   0.],[  0.,   0.,  -2.],[  0.,  -2.,  -2.]],[[ -6., -12., -15.],[-12., -12., -20.],[-15., -20., -20.]]])(As an aside, none of this has anything to do with the title you gavethis thread, namely ‚ÄúJacobian of Softmax.‚Äù)Best.K. Frank",
      "Y ": "use pytorch’s swiss army knife of tensormultiplication functions to construct a batch version of outer"
    },
    {
      "X ": "pytorch can not move tensor to cuda",
      "Z ": "Since you are using an Ampere GPU (3070), you would need to use CUDA;=11.0, so the old PyTorch 1.5.1 release with CUDA9.2 won‚Äôt work. Update to the latest release with CUDA11.1 and it should work.",
      "Y ": "use updated version "
    },
    {
      "X ": "torch from numpy not support negative strides",
      "Z ": "how about torch.from_numpy(np.flip(x,axis=0).copy())",
      "Y ": "use torch.from_numpy(np.flip(x,axis=0).copy())"
    },
    {
      "X ": "dropout argument input position 1 must be tensor not str",
      "Z ": "Hey, The fix was to keep transformers to version 3 pip install transformers==3",
      "Y ": "update the latest version "
    },
    {
      "X ": "discrepancy in matmul while batching",
      "Z ": "Hi Vignesh! Vignesh_Baskaran: Is this normal to have a slight discrepancy? Yes, such a discrepancy is to be expected due to floating-point round-off error.Can someone please explain why this discrepancy happens with some under the hood explanation or please point me to resources to understand this in depth?In short, when you use  floating-point arithmetic to perform operationsin different orders that should be mathematically equivalent, you, ingeneral, expect to get (slightly) different results.In particular, the associative law no longer holds, in that withfloating-point arithmetic, (a + b) + c  !=  a + (b + c).  (Thetwo expressions can be equal for some values of a, b, and, c,but, in general, will differ.)For a start, you can look at the WIkipedia article for round-off error, and dig into it more deeply, if you like, with Goldberg‚Äôs classic paper,‚ÄúWhat Every Computer Scientist Should Know About Floating-Point Arithmetic.‚Äù Best. K. Frank",
      "Y ": "use this link http: //perso.ens-lyon.fr/jean-michel.muller/goldberg.pdf"
    },
    {
      "X ": "iterabledataset never stops despite stopiteration being raised",
      "Z ": "Okay I had to set drop_last=True. Also I was making the mistake of assuming __init__ would be called again when the dataset is recreated for each dataloader process but that is not the case.",
      "Y ": "set drop_last=True"
    },
    {
      "X ": "bizarre cuda behavior with nn linear",
      "Z ": "Could you update to 1.9.0 and rerun the code, please?We‚Äôve seen some issues in the pip wheels in 1.8.0 and 1.8.1 in particular for sm_61 by leaking cublas symbols, which might be also visible on the K80.",
      "Y ": "update the latest version "
    },
    {
      "X ": "simultaneously multiple block indexing in a torch tensor",
      "Z ": "i got it!Instead of using the for loop for the whole matrix, i just have to use it for the block_matrix size####a=torch.ones(8, 8)####print(a)####block_size=2####for i in range(block_size):####a[i: : 4]=2*a[i: : 4] #### print(a)",
      "Y ": "Instead of using the for loop for the whole matrix, i just have to use it for the block_matrix size"
    },
    {
      "X ": "gaussianblur transform causing error",
      "Z ": "This seems to be the PIL/Pillow issue as described here .",
      "Y ": "use this link https://discuss.pytorch.org/t/quickstart-tutorial-array-takes-1-positional-argument-but-2-were-given/125550/2 "
    },
    {
      "X ": "dynamically set conv2d based on input channels",
      "Z ": "You are creating the new self.conv and self.bn layers inside the forward pass without specifying the device, so they will be created on the CPU by default. To properly push them to the GPU, you could use:def reset_parameters(self, x): self.conv = nn.Conv2d(x.size(1), self.num_classes, kernel_size=1).to(x.device)self.bn = nn.BatchNorm2d(self.num_classes).to(x.device) Additionally, you could also check the lazy modules, e.g. LazyConv2d, which perform a similar approach.",
      "Y ": "    def reset_parameters(self, x): self.conv = nn.Conv2d(x.size(1), self.num_classes, kernel_size=1).to(x.device)self.bn = nn.BatchNorm2d(self.num_classes).to(x.device)"
    },
    {
      "X ": "inferring shape via flatten operator",
      "Z ": "As of 1.8, PyTorch now has LazyLinear which infers the input dimension:A torch.nn.Linear module where in_features is inferred.",
      "Y ": "use torch.nn.Linear module"
    },
    {
      "X ": "is it possible to execute two modules in parallel in pytorch",
      "Z ": "If you are running the code on a GPU and call p1 and p2 after each other, these calls will be queued onto the device and executed asynchronously. Depending on the workload of p1, p2 might start while p1 is still executing.If you have multiple GPUs in your system (and don‚Äôt use data parallel), you could execute p1 and p2 on each device and concatenate the result back on a single device.",
      "Y ": "If you call p1 and p2 after each other when running the code on a GPU, these calls will be queued on the device and executed asynchronously. Depending on p1's workload, p2 may begin while p1 is still running. If your system has numerous GPUs (and you aren't using data parallel), you could run p1 and p2 on each one and then concatenate the results on a single device."
    },
    {
      "X ": "detected that pytorch and torch scatter were compiled with",
      "Z ": "Since it'ss the same error I assume pytorch-geometric still claims to be using CUDA10.1?In that case, your install command wasn‚Äôt successful and you would have to uninstall the previous package and reinstall the new one.",
      "Y ": "install the new package"
    },
    {
      "X ": "typeerror tuple object is not callable",
      "Z ": "The trailing comma might create transform as a tuple.Could you remove it and try your code again?",
      "Y ": "remove the trailing comma "
    },
    {
      "X ": "typeerror cant convert cuda 0 device type tensor to numpy use tensor cpu to copy the tensor to host memory first",
      "Z ": "As the error message suggests, you would have to push the tensor to the CPU first before converting it to a numpy array via tensor.cpu(). In particular np.array(targets.argmax(1)) seems to raise the error to use: targets = targets.argmax(1).cpu().numpy() instead. PS: you can post code snippets by wrapping them into three backticks ```, which makes debugging easier.",
      "Y ": "use targets = targets.argmax(1).cpu().numpy()"
    },
    {
      "X ": "expected tensors on same device",
      "Z ": "Have you checked if x,y (the test data here) are also on GPU or if they are on CPU? The training code has X,y=X.to(device),y.to(device);so something similar might be needed here.",
      "Y ": "X,y=X.to(device),y.to(device);"
    },
    {
      "X ": "initialization of first hidden state in lstm and truncated bptt",
      "Z ": "Yes, zero initial hiddenstate is standard so much so that it is the default in nn.LSTM if you don‚Äôt pass in a hidden state (rather than, e.g. throwing an error).  Random initialization could also be used if zeros don‚Äôt work. Two basic ideas here: If your hidden state evolution is ‚Äúergodic‚Äù, the state will move closer to some ‚Äústeady distribution‚Äù anyways, so it doesn‚Äôt matter as much. You want the initial hidden state handling to be somewhat consistent between training and inference. The fancy Bayesian way would be to sample from said steady state, but deep learning is too wild to resort to fancy when it isn‚Äôt necessary. For BPTT (aka ‚Äúfeeding in a long sequence bit by bit‚Äù), you could keep the last hidden state and use that (detached) as the new initial hidden state if you think that state should be carried between batches. Language model training on Wikipedia (as a common example) will do things like that.In theory this is really true. In practice you would run out of memory instead. I can‚Äôt speak about other tutorials, but those that I have seen do the detaching (or don‚Äôt keep state from previous batches) and it would seem necessary to do so.Best regards Thomas P.S.: For the forum: lines with triple backticks  ```python at the beginning of your codeand ``` at the end will make it look nice.",
      "Y ": "zero initial hiddenstate is standard so much so that it is the default in nn.LSTM if you don’t pass in a hidden state (rather than, e.g. throwing an error). Random initialization could also be used if zeros don’t work."
    },
    {
      "X ": "which copy is better",
      "Z ": "Hi,The first one should never be used as .data should not be used. The new api for #1 is: with torch.no_grad(): x.copy_(k) And the difference is clearer: use #1 (EDIT: the version of #1 given just above in my comment, not your original one) if the copy should not be tracked by the autograd (like initializing some weights) and use #2 if you want gradients.",
      "Y ": "with torch.no_grad():x.copy_(k)"
    },
    {
      "X ": "dataloader parameter shuffle affects model accuracy",
      "Z ": "auzuha:Also , somehow using model.eval() solves the above problem.This would be expected, since calling model.eval() would disable dropout layers (shouldn‚Äôt make a difference regarding shuffling the dataset) and would use the running stats of all batchnorm layers.If you leave the model in training mode, the batchnorm layers would use the current batch stats to normalize the inputs and would also update the running stats. Shuffling the dataset is thus making a difference, which is why you should call model.eval() during validation and testing.",
      "Y ": "This is to be expected given the calling model. eval() disables dropout layers and uses the running statistics of all batchnorm layers.If you leave the model in training mode, the batchnorm layers will normalise the inputs using the current batch stats and will also update the running stats. Thus, rearranging the dataset makes a difference, which is why you should call model. During validation and testing, use eval()."
    },
    {
      "X ": "expected scalar type long but found float",
      "Z ": "I'm unsure how the kernel size is related to the input shape of the batchnorm layer.However, if the input contains only a single value for each channel (as is the case here), you won't be able to use batchnorm layers in training mode, since they need to calculate the stats from the input.Since it's impossible to calculate the var from a single sample (it would create NaNs) and also subtracting the mean from a single value would create a zero output, you could remove these layers from the model.",
      "Y ": "I'm not clear how the kernel size relates to the batchnorm layer's input shape.However, because batchnorm layers require to generate the stats from the input, you won't be able to utilise them in training mode if the input just contains a single value for each channel (as is the case here).You might eliminate these layers from the model because calculating the var from a single sample (which would result in NaNs) and subtracting the mean from a single value would result in a zero output."
    },
    {
      "X ": "smush multi channel tensor into one image channel with maximums",
      "Z ": "To create the predictions containing the class index associated with the highest probability (or logit) you could use: preds = torch.argmax(output, dim=0) # assuming output has the mentioned shape [nb_classes, height, weight]",
      "Y ": "preds = torch.argmax(output, dim=0) "
    },
    {
      "X ": "how to split dataset into test and validation sets",
      "Z ": "You can use built in function torch.utils.data.random_split(dataset, lengths). Check docs here: https: //pytorch.org/docs/stable/data.html#torch.utils.data.random_split Check source here: https: //pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#random_split Also, you can see this nice example.",
      "Y ": "torch.utils.data.random_split(dataset, lengths)."
    },
    {
      "X ": "bcewithlogitsloss giving negative loss",
      "Z ": "You shouldn't use a softmax on the model outputs when you want to calculate the loss using nn.CrossEntropyLoss, since (as you‚Äôve already said) nn.CrossEntropyLoss applies F.log_softmax and nn.NLLLoss internally, so pass the raw logits to this loss function instead.On the other hand, you can apply a softmax on the model outputs (logits), if you want to ‚Äúvisualize‚Äù the probabilities or use them in any other way besides the input to nn.CrossEntropyLoss.",
      "Y ": "calculate loss using nn.CrossEntropyLoss,"
    },
    {
      "X ": "how to backpropagate a transformer",
      "Z ": "Autograd will use the backward methods of each submodule used in the nn.Transformer.forward method to calculate the gradients so the nn.Transformer module doesn't necessarily need to implement a custom backward method.You could check these submodules and see, how the backward methods are defined (i.e. if they are using custom ones or just other PyTorch operations with already defined backwards).",
      "Y ": "nn.Transformer.forward method"
    },
    {
      "X ": "learnable parameter as int not tensor",
      "Z ": "The intent here looks like some kind of differentiable neural architecture search (e.g., DARTS: Differentiable Architecture Search ‚Äì Google Research)However, the shape parameters to layer declarations like max_pool1d are not differentiable; the output of the model doesn‚Äôt directly depend on them! For example, it is similar to writing y = a + b + c and expecting to take a derivative to optimize how many variables we are using in our equation. One workaround (as shown in the DARTS paper) is to consider a model with several weighted branches (where the weights are differentiable and learnable), where each branch has a different maxpool_1d configuration. The weights of each branch are updated during the architecture search training process and the branch(es) with the highest weight(s) are chosen.",
      "Y ": "consider a model with several weighted branches (where the weights are differentiable and learnable), where each branch has a different maxpool_1d configuration. The weights of each branch are updated during the architecture search training process and the branch(es) with the highest weight(s) are chosen."
    },
    {
      "X ": "how to unnormalize output of batch norm",
      "Z ": "The model.train() and model.eval() calls will switch between the training mode (normalizing the input batch with its own stats and updating the running stats) and evaluation mode (normalizing the input batch/sample with the running stats). You don't need to apply the running stats manually.",
      "Y ": "model. train and model.eval()  in training mode normalizing the input batch with its own stats and updating the running stats and normalizing the input batch/sample with the running stats. no need to apply any running stats"
    },
    {
      "X ": "understand dataset and dataloader",
      "Z ": "change torch.LongTensor(index) to  torch.LongTensor([index]) the first creates a tensor of size index, the second creates a tensor with value index.",
      "Y ": "use torch.LongTensor([index])"
    },
    {
      "X ": "ask about run my model with parallel device",
      "Z ": "Hyeonuk_Woo: If I use 4 devices and dim0 of my input data is 8, in this case, each GPU handle two process simultaneously or sequentially? +) the word ‚Äòsingle process‚Äô means that batch size of the data assigned to single GPU is 1? No, each device will get a batch containing 2 samples.Hyeonuk_Woo: Why single process per GPU is recommended?DDP should be faster as it reduces the communication overhead in nn.DataParallel.The details of the latter (including the scatter/gather calls) are described in this blog post.",
      "Y ": "DDP should be faster as it reduces the communication overhead in nn.DataParallel."
    },
    {
      "X ": "do loss backward and optimizer step with different frquency",
      "Z ": "Based on your description I understand that you are calling optimizer.step() more often (1 out of 100 steps) and calculate the gradients only 1 out of 1000 steps.In this case, the general problem could be that the optimizer updates the parameters with ‚old‚Ägradients, which might not work.To change the gradients, you could either scale the loss itself (divide with a constant) or use hooks to manipulate the .grad attributes of all parameters.",
      "Y ": " scale the loss itself (divide with a constant) or use hooks to manipulate the .grad attributes of all parameters."
    },
    {
      "X ": "windowed tensor duplication",
      "Z ": "You could use tensor.unfold for it:out = x.unfold(1, window_size,1) out = out.squeeze(0).permute(0,3,1,2) print(out.shape) # torch.Size([8,3,60,256]) print((out == new_x).all()) # tensor(True)",
      "Y ": "use tensor.unfold "
    },
    {
      "X ": "check the norm of gradients",
      "Z ": "Actually it seems the answer is in the code I linked to:For a 2-norm: for p in model.parameters():param_norm = p.grad.data.norm(2)total_norm += param_norm.item() ** 2 total_norm = total_norm ** (1. / 2)",
      "Y ": "for p in model.parameters(): param_norm = p.grad.data.norm(2) total_norm += param_norm.item() ** 2 total_norm = total_norm ** (1. / 2)"
    },
    {
      "X ": "assertionerror torch not compiled with cuda enabled",
      "Z ": "You can just install the NVIDIA driver, select a desired PyTorch binary using the posted link, and install it. E.g.conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch will install PyTorch with the CUDA 10.2 runtime (as well as cudnn7.6.5).",
      "Y ": "Install NVIDIA driver "
    },
    {
      "X ": "batchnorm1d input shape",
      "Z ": "Your code looks correct, since the batchnorm layer expects an input in [batch_size, features, temp. dim] so you would need to permute it before (and after to match the input of the rnn). In your code snippet you could of course initialize the tensor in the right shape, but I assume that code is just to show the usage.",
      "Y ": "batchnorm layer expects an input in [batch_size, features, temp. dim] "
    },
    {
      "X ": "why is pytorch ok with this wierd operation and not fail",
      "Z ": "Shisho_Sama:basically, here the author is using a one-hot encoded output with the shape (batch, sequence_length, features) with a normal not-one-hot encoded target tensor of shape (batchsize, sequence_length) !For this part what I can tell you is that nn.CrossEntropyLoss() does take output as one-hot encoded and targets with indices that isn‚Äôt encoded into one-hot. Internally it converts it into one-hot encoding and computes the loss.You can find this easily on documentation and its source code.",
      "Y ": "For this part what I can tell you is that nn.CrossEntropyLoss() does take output as one-hot encoded and targets with indices that isn’t encoded into one-hot. Internally it converts it into one-hot encoding and computes the loss."
    },
    {
      "X ": "understanding time taken from moving data from gpu to cpu",
      "Z ": "CUDA operations are called asynchronously, so you should synchronize the code before starting and stopping the timer using torch.cuda.synchronize(). In your current code snippet logits = logits.detach().cpu().numpy() will create a synchronization point, so that your code will wait at this line of code for all preceding operations to finish (which also might include the forward pass of your models) while the timer was already started.",
      "Y ": "  synchronize the code before starting and stopping the timer usinguse torch.cuda.synchronize(), as CUDA operations are called asynchronously "
    },
    {
      "X ": "should i flatten before the linear layer",
      "Z ": "Yes, you should have something of the shape (batch_size, linear_in), and after the linear layer you will have something of the shape (batch_size, linear_out), where your linear layer should be declared like nn.Linear(linear_in, linear_out).",
      "Y ": "Yes you. should have something shape of  , (batch_size, linear_in),  (batch_size, linear_out),, "
    },
    {
      "X ": "runtime error size mismatch m1 2048 x 1 m2 2048 x 1",
      "Z ": "Two comments:You define self.out=nn.Linear(2048,1) but your x after flattening is (16,128). So it should be self.out=nn.Linear(128,1) to fit your current setup.For a text CNN, Conv1d should be the way to go. Sure each data item is 2d with (seq_len, hidden_dim) but 2d convolutions don‚Äôt make sense here. You might want to check examples using Conv1d. For the first layer, it would look like nn.Conv1d(in_channels=embedding_dim, out_channels=..., kernel_size=...). Then you also have to adjust the embedding output with x = embeds.transpose(1,2), since Conv1d wants (batch size x input_channels x seq_length) as input shape (with input_channels = ‚Äòhidden_dim‚Äô here) Please see my working example of a Text CNN Classifier; sorry, I don‚Äôt have it in a proper GitHub repository! Note that I don‚Äôt use multiple deep Conv1d layers but multiple parallel ones with different kernel sizes (reflecting different ngram sizes). This is in line with this paper.",
      "Y ": "use self.out=nn.Linear(128,1)  and Conv1d wants (batch size x input_channels x seq_length) as input shape (with input_channels = ‘hidden_dim’ here)"
    },
    {
      "X ": "neural net pytorch error multi target not supported",
      "Z ": "There is a convention that the labels should be a 1-dimensional array, not a 2-dimensional array. I.e., right now, you have 200x1 as the label array, but it should be simply 200. Try loss = loss_fn(outputs, labels.view(-1).long()) instead of loss = loss_fn(outputs, labels.long()) This be able to solve your issue. Btw I see that you are using Variable etc, which is old PyTorch code (it has been deprecated and removed in newer versions) ‚Äì just wanted to mention that in case you encounter errors related to that later on when you update your PyTorch installation to the current ones.",
      "Y ": "use loss = loss_fn(outputs, labels.long())"
    },
    {
      "X ": "help runtimeerror assertion cur target 0 cur target n classes failed",
      "Z ": "Use nn.BCELoss as the loss_func.Alternatively, you could also remove the sigmoid and use nn.BCEWithLogitsLoss.",
      "Y ": "Use nn.BCELoss or nn.BCEWithLogitsLoss "
    },
    {
      "X ": "handling the hidden state with minibatches in a rnn for language modelling",
      "Z ": "No, the sentences in a batch are processed independently and in parallel in one go. Otherwise there would be no need for the batch_size dimension in the shape of the hidden state.",
      "Y ": "No, the sentences in a batch are processed independently and in parallel in one go."
    },
    {
      "X ": "expected object of scalar type long but got scalar type float for argument 2 mat2",
      "Z ": "Have you checked your weights? Initially Pytorch Tensors have Float type and it will cause problems when you want to use different type of tensors. For example:For solution, you should change your layer weight type. Also you can set default tensor type if you are using all LongTensor:torch.set_default_tensor_type('torch.cuda.LongTensor')",
      "Y ": "change to torch.set_default_tensor_type('torch.cuda.LongTensor')"
    },
    {
      "X ": "effect of vocabulary size on gpu memory slow training",
      "Z ": "Maybe I misunderstand something wrong, but I think nothing should change when using pretrained word embeddings. When your dataset has a vocabulary of size 20k, you only need the 20k respective word embeddings.An embedding layer is essentially just a look-up layer. So there is no need for embeddings that are never looked up :). Just create a embedding matrix of size (vocab_size, embed_dim) ‚Äì e.g., (20000,300) ‚Äì and fill the 20k embeddings using the pretrained ones.",
      "Y ": "create a embedding matrix of size (vocab_size, embed_dim) , nothing should change when using pretrained embeddings"
    },
    {
      "X ": "cuda runtimeerror cuda error device side assert triggered occuring only with some data",
      "Z ": "You can see in the exception  Assertion srcIndex srcSelectDimSize failed.This means that you tried to index a Tensor with an index that is larger than it‚Äôs size. You want to check in more details all your indices and make sure they are correct.Note that to get a stack trace that points exactly to the operation that cause the issue, you can either run on the CPU or set CUDA_LAUNCH_BLOCKING=1 before launching you script on the GPU.",
      "Y ": "you can either run on the CPU or set CUDA_LAUNCH_BLOCKING=1 before launching you script on the GPU."
    },
    {
      "X ": "creating a sparse tensor from csr matrix",
      "Z ": "You could convert a csr format matrix to coo, and then process that a little before sending it into the sparse tensor constructor. I believe scipy's coo format looks similar to pytorch' sparse tensors.",
      "Y ": "Convert csr format matrix to coo , you can use spacy's coo format it looks similar to  pytroch sparse tensors "
    },
    {
      "X ": "clarification regarding the return of nn gru",
      "Z ": "Short answer: You can use either.For a bit more understanding, first check the output of output and h_n output.shape = (seq_len, batch, num_directions * hidden_size), so you get a (context) vector for each item in your sequence h_n.shape = (num_layers * num_directions, batch, hidden_size) Now look at the following archtecture ‚Äì it‚Äôs for an LSTM, for a GRU you can simple ignore the c_n: lstm-output-vs-hidden.png640√ó548 20.9 KB In the basic case where num_layer=1 and bidirectional=False, and you don‚Äôt use PackedSequence, then output[-1]=h_n. In other words, h_n is the vector after the last layer and the last sequence item (bidirectional=True makes it more complicated).Fore Seq2Seq, most people use h_n as context vector since the encoder and decoder often have symmetric architectures (e.g., same number of layers), so they can simple copy h_n between encoder to the decoder.For classification, I usually use h_n instead of output. The reasons is ‚Äì as far as I understand ‚Äì then when you have batches with sequences of different lengths and padding, and use PackedSequence, output[-1] != h_n. It still works, of course, but I experience less accuracy.",
      "Y ": "output.shape = (seq_len, batch, num_directions * hidden_size), so you get a (context) vector for each item in your sequence h_n.shape = (num_layers * num_directions, batch, hidden_size)"
    },
    {
      "X ": "clarification on backpropagation for lstms",
      "Z ": "Re #1: No, I was talking solely about LSTM. It wraps an LSTMCell to support multiple layers, dropout, bidirectionality, etc. An LSTMCell only ever takes just one word as input. LSTM is a full layer allowing for whole sequences as output. It‚Äôs just that no-one is stoping you to give it sequences of length 1. An LSTM with num_layers=1, bidirectional=False and dropout=0.0that takes one word at a time should be more or less the same as an LSTMCell. My recommendation, stick with LSTM for the time being, and consider LSTMCell if you really need more control of the recursion.Re #2: Yes, padding is the common technique. Don‚Äôt forget to at a fixed EOS (end-of-sequence) token behind each question, so the decoder learns when to stop generating a caption. At the beginning, you can just use padding and see how it works. For more advanced methods, you can check PackedSequence or even sort your dataset in such a way, that each batch has captions of the same length. In short: Use LSTM and padding to get a basic network training. Then you can see if you want and can improve.",
      "Y ": "No, I was only referring to LSTM. It encases an LSTMCell in order to support multiple layers, dropout, bidirectionality, and other features. An LSTMCell can only ever accept one word as input. LSTM is a full layer that can output entire sequences. It's just that no one is stopping you from giving it 1-length sequences. An LSTM with num layers=1, bidirectional=False, and dropout=0.0 that accepts one word at a time should behave similarly to an LSTMCell. Stick with LSTM for the time being, and consider LSTMCell if you require more control over the recursion."
    },
    {
      "X ": "when shuffle is true for dataloader what is the default sampler",
      "Z ": "If no sampler was specified and shuffle=True, the RandomSampler will be used as shown in this line of code.",
      "Y ": "sampler = RandomSampler(dataset)"
    },
    {
      "X ": "what is the most standard way to put data in batch",
      "Z ": "I don't know exactly what difference you would like to highlight.In the first approach train_inputs and train_labels seem to be undefined (as well as validation_x), so I assume you would like to use x_train etc.?Also, the DataLoader loop is different, since you are unpacking the values in the first approach inside the loop (but that doesn't matter and is just different coding style).Are these the differences or what would you like to discuss?",
      "Y ": "In the first approach train_inputs and train_labels seem to be undefined,  he DataLoader loop is different, since you are unpacking the values in the first approach inside the loop"
    },
    {
      "X ": "what is the most efficient way to get the closest pair of a set of vectors",
      "Z ": "Hi Fly!flyaway:I have a matrix M in size of n*d, where n is the number of vectors and d is the dimension of each vector. Now, I want to find the closest pair of this n vectors.functional.pdist works well, it does not double computing.Any suggestions about solving this efficiently? In practice, for reasonable values of n and d, I expect that pdist will do about as well as you can. However, the so-called computational cost of your pdist solution is a suboptimal O(d * n^2).  If you search on closest-pair problem, you will see that your problem can be solved in O(d * n * log n) time.  However, these ‚Äúfaster‚Äù algorithms don‚Äôt vectorize naturally, so you give up much of the benefit of using a gpu (and even pipelines in a cpu).  Thus your naive pdist solution may well run faster (even as it requires more operations). Good luck. K. Frank",
      "Y ": "computational cost of your pdist solution is a suboptimal O(d * n^2). If you search on closest-pair problem, you will see that your problem can be solved in O(d * n * log n)"
    },
    {
      "X ": "installing pytorch fails on macos",
      "Z ": "Ultimately this is what solved my problem: conda install -y pytorch torchvision torchaudio -c pytorch -c conda-forge @smth sorry to ping you again but why does the official pytorch installation not include the conda-forge channel? It seems I always have issues without it‚Ä¶at least letting you know. Hope this helps!",
      "Y ": "use this conda install -y pytorch torchvision torchaudio -c pytorch -c conda-forge"
    },
    {
      "X ": "np pi equivalent in pytorch",
      "Z ": "No there isn‚Äôt a pretty built-in way.There are however a lot of ugly ways of doing so. You can, for instance, define torch.pi at run-time: import torch torch.pi = torch.acos(torch.zeros(1)).item() * 2 # which is 3.1415927410125732 You can even modify the __init__.py file (which is located in print(torch.__file__)) in your environment by adding a line with pi = 3.14 etc, so that you will always be able to use the torch.pi constant.",
      "Y ": "use import torch torch.pi = torch.acos(torch.zeros(1)).item() * 2 "
    },
    {
      "X ": "an iterabledataset implementation for chunked data",
      "Z ": "Thanks for sharing this implementation!I think you could start with a feature request on GitHub and explain your use case as well as your implementation. Currently you are using 3rd party modules such as pandas, which I assume could be removed to allow for a more general use case.Once the feature request is done the code owners will take a look at it ",
      "Y ": " Start a featurerequest on github "
    },
    {
      "X ": "return item in forward hook",
      "Z ": "No, the forward hook won't terminate the execution and you could store the intermediate output in e.g. a list or dict as seen here.",
      "Y ": "the forward hook won't terminate the execution"
    },
    {
      "X ": "saving torch distributions with state dict",
      "Z ": "bigger issue is that Distribution objects are immutable, so you can‚Äôt push nn.Parameters inside them in __init__. Keep parameters in modules, create transient Distribution objects in forward, then saving &amp; training will work correctly.",
      "Y ": "The main problem is because Distribution objects are immutable, which means you can't just push nn. In __init__, parameters are contained within them. Maintain parameters in modules, construct transient Distribution objects in forward, then save and train as usual."
    },
    {
      "X ": "training routine inside model definition class",
      "Z ": "Yes it is possible. In general: nearly everything that can be done with classes can be done by inheriting torch.nn.Module and defining the missing things yourself. Whether it is good practice is hard to tell. If it helps, I can ensure you, that I have done something similar quite often as it is one of the easiest ways to define a API. Another approach would be to only define a closure for each network. I've done this to write a generic trainer class.",
      "Y ": "Inherit torch.nn.Module and define missings things"
    },
    {
      "X ": "updating the transform class variable of a dataset instance",
      "Z ": "YannLe: random_split So random_split returns to Subset instances that themselves refer to the (single!) DataSet instance. The modern workflow for this is to override the collate function in the dataloader (calling default collate and then optionally augmenting the dataset). Or you could make full instances of your dataset. In the end, it just is running idx through an index indirection generated randperm and adjusting the length.Best regards Thomas",
      "Y ": "Overriding the dataloader's collate method is the modern solution for this (calling default collate and then optionally augmenting the dataset). You could also create entire replicas of your dataset. In the end, it's just a matter of running idx through a randperm generated via index indirection and tweaking the length."
    },
    {
      "X ": "check if tensor elements in a value list",
      "Z ": "You can achieve this with a for loop: &gt;&gt;&gt; sum(a==i for i in vals).bool() tensor([[ True,  True, False],[ True,  True, False ]])",
      "Y ": "sum(a==i for i in vals).bool() tensor([[ True,  True, False],[ True,  True, False]])"
    },
    {
      "X ": "typeerror new received an invalid combination of arguments got tensor int",
      "Z ": "You would have to create a model instance before passing in the inputs:model = CustomNeuralNet(num_classes) outputs = model(inputs) Based on the error message it seems you are trying to pass the inputs directly to the initialization of the model.",
      "Y ": "ou are trying to pass the inputs directly to the initialization of the model"
    },
    {
      "X ": "save torchsummary",
      "Z ": "Assuming you are using this method from torchsummary you could call:result, params_info = summary_string( model, input_size, batch_size, device, dtypes) and save the result into a file instead of printing it out.",
      "Y ": "result, params_info = summary_string(model, input_size, batch_size, device, dtypes)"
    },
    {
      "X ": "using for loop to train model multiple times produces vastly different results",
      "Z ": "Based on the description it seems that your overall training is unstable and the success rate of the training could be low. You could retrain N models using different seeds for each run and check how many times the model converges properly vs. a failure. To stabilize it, you could try to use other parameter initialization methods etc. Rohit_R:and think it might be due to PyTorch's random seed or with cuda memory allocation.I don't quite understand how the CUDA memory allocation could be related to this issue. Could you explain your concern a bit more?",
      "Y ": "Based on the description, it appears that your overall training is insecure, and the program's success rate may be poor. You could retrain N models with different seeds for each run and see how many times the model converges correctly vs. how many times the model fails. You might try using different parameter initialization methods, for example, to stabilise it."
    },
    {
      "X ": "legacy autograd function with non static forward method is deprecated and will be removed in 1 3",
      "Z ": "Try to use: x, mean = BinActive.apply(x)",
      "Y ": "mean = BinActive.apply(x)"
    },
    {
      "X ": "how to use netcdf data for dcgan",
      "Z ": "ljeonjko: x already has (lat, lon) dimensions.In that case I think you can just unsqueeze dim0 and dim1 as x should already contain the values to create the batch and channel dimension.This would create a tensor in the shape [1,1, lat, lon] which would be an ‚mage-like tensor.",
      "Y ": "unsqueeze dim0 and dim1 as x should already contain the values to create the batch and channel dimension."
    },
    {
      "X ": "very slow training on gpu for lstm nlp multiclass classification",
      "Z ": "Hm, can' t see anything obvious what might cause the performance issues. Just some generic comments:Since you initialize h0 and c0 you don't need to detach them as well. Maybe that's even bad for the training itself, but here I'm not sure.I cannot see the point of the lines below. Again, I don‚Äôt think they hurt, but in the end you return the last hidden state after being pushed through a linear layer. lstm_out is never used lstm_out.size() lstm_out[: ,-1,:] lstm_out = self.linear(lstm_out[: ,-1,:])lstm_out.size() I have a implementation of a multiclass LSTM/GRU classifier. Maybe you can have a look to see what might be different. The code is a bit verbose to make the classifier configurable.",
      "Y ": "Since you initialize h0 and c0 you don’t need to detach them as well. Maybe that’s even bad for the training itself,"
    },
    {
      "X ": "valueerror expected input whentraining bert",
      "Z ": "The input batch-size should be (512,1) and not (1,512) .",
      "Y ": "Input batch size should be (512,1"
    },
    {
      "X ": "understanding input shape to pytorch conv1d",
      "Z ": "I think the confusion here stems from the fact that PyTorch by default uses a NCHW memory format, with tensor dimensions structured accordingly. In other words, tensor dimensions are (batch, channel, height, width).Therefore, in your example, if you change the input tensor to the assumed structure, everything should work as expected;input = torch.randn(6,68,512)  Note that some work has been done, which shipped in v1.5, that allows you to change the memory format to one that is (arguably) more intuitive, namely the NHWC format. An additional benefit of the NHWC format is that it can lead to drastic performance improvements in some specific scenarios due to CuDNN optimizations. Some references regarding the memory format: https: //pytorch.org/tutorials/intermediate/memory_format_tutorial.html https: //github.com/pytorch/pytorch/issues/28619",
      "Y ": " PyTorch by default uses a NCHW memory format, with tensor dimensions structured accordingly. In other words, tensor dimensions are (batch, channel, height, width)."
    },
    {
      "X ": "unboundlocalerror local variable loss referenced before assignment",
      "Z ": "I guess get_batches might not return anything, so that the complete training and thus the loss calculation will be skipped and will later raise this error in: train_history['loss'].append(loss.item())",
      "Y ": "get_batches might not return anything, so it might cause the error "
    },
    {
      "X ": "ulmfit fine tuning freeze",
      "Z ": "Try this,for name, params in your_learner_object.model.named_parameters():if params.requires_grad:print(name)",
      "Y ": " try this for name, params in your_learner_object.model.named_parameters(): if params.requires_grad: print(name)"
    },
    {
      "X ": "typeerror cannot assign torch floattensor as parameter layer weights torch nn parameter or none expected",
      "Z ": "self.layer_weights is not a tensor but a Parameter that is registered as an attribute of your module. That's why when you assign it a tensor it throws an error. Instead, try self.layer_weights = nn.Parameter(F.softmax(self.layer_weights,dim=0))",
      "Y ": " try this self.layer_weights = nn.Parameter(F.softmax(self.layer_weights,dim=0))"
    },
    {
      "X ": "training loss is not changing at all while training lstm",
      "Z ": "nn.CrossEntropyLoss uses F.log_softmax and nn.NLLLoss internally, so it expects raw logits as the model outputs.Remove the F.softmax at the end of your model and pass the output of self.lin(3) directly to the criterion. Let me know, if that helps. ",
      "Y ": "Remove  F.softmax at the end of your model and pass the output of self.lin(3) "
    },
    {
      "X ": "runtimeerror size mismatch m1 1 x 30 m2 20 x 128 at c w 1 s windows pytorch aten src th generic thtensormath cpp 752",
      "Z ": "Your code and comments are not really helpful, so I can only make some half-informed guesses.I assume the shape of inputs is (1,3) since it comes from a array called trigrams. After embeds=self.embeddings(inputs) the shape of embeds is (1,3,10) with 10 being the embed_dim. After embeds=embeds.view(1,-1) the shape of embeds is (1,30).However, self.linear1=nn.Linear(context_size*embedding_dim,128), given your parameter values means self.linear1=nn.Linear(2*10,128). Here you have to mismatch between the tensor of size 30 from embeds and the expected size of 20 from self.linear1.If my assumptions are correct, model=NGram(len(vocab),10,3) should help.",
      "Y ": " use  model=NGram(len(vocab),10,3)"
    },
    {
      "X ": "runtimeerror expected object of scalar type long but got scalar type float for argument 2 mat2",
      "Z ": "Could you print the shape or words? nn.Linear expects the input as a tensor of shape [batch_size, *, in_feautres].Also, nn.NLLLoss expects a LongTensor as the target containing the class indices. It should therefore not be one-hot encoded.",
      "Y ": "nn.Linear expects the input as a tensor of shape [batch_size, *, in_feautres]."
    },
    {
      "X ": "rnn working on cpu shows loss but cannot send to gpu",
      "Z ": "Were you able to use the GPU in the past?If so, are you able to create a simple tensor on the device via: torch.randn(10, device='cuda')?",
      "Y ": " use torch.randn(10, device='cuda')"
    },
    {
      "X ": "retrieving hidden and cell states from lstm in a language model",
      "Z ": "Hi. The outputs for the LSTM is shown in the attached figure. The output shape for h_n would be (num_layers * num_directions, batch, hidden_size). This is basically the output for the last timestep. Your output is (2,1,1500) so you are using 2 layers*1 (unidirectional) ,1 sample and a hidden size of 1500).Now the LSTM would return for you output, (h_n, c_n). In your case, (h_n, c_n) is named hidden. So by indexing hidden you can extract the h_n and c_n (i.e hidden[0] = h_n and hidden[1] = c_n)",
      "Y ": "LSTM would return for you output, (h_n, c_n). In your case, (h_n, c_n) is named hidden. So by indexing hidden you can extract the h_n and c_n (i.e hidden[0] = h_n and hidden[1] = c_n)"
    },
    {
      "X ": "randomly initialized embeddings for torchtext",
      "Z ": "This code snippet would assign embedding vectors to the nn.Embedding layer.Note that nn.Embedding will already randomly initialize the weight parameter, but you can of course reassign it.You could use torch.from_numpy(np.random.rand(...)).float() to avoid a copy, but your code should also work.",
      "Y ": " use  torch.from_numpy(np.random.rand(...)).float() "
    },
    {
      "X ": "runtimeerror cuda out of memory after many epochs",
      "Z ": "Your script might be already hitting OOM issues and would call empty_cache internally. You can check it via torch.cuda.memory_stats(). If you see that OOMs were detected, lower the batch size as suggested.",
      "Y ": "use torch.cuda.memory_stats()"
    },
    {
      "X ": "module list activation function",
      "Z ": "Hi James! drbeethoven: how do I put a sigmoid layer between each module within the module list? I‚Äôm not entirely sure what you‚Äôre asking or how you intend to use your ModuleList, but note that a torch.nn.ReLU is a Module so you can include it in your ModuleList, in between, for example, some torch.nn.Linears. Best.K. Frank",
      "Y ": " torch.nn.ReLU is a module "
    },
    {
      "X ": "rescale image data 0 255 to 1 1",
      "Z ": "Given that torchvision.transform.ToTensor already standardize the data to [0, 1], you could just multiply that by 2 and subtract 1 on your input layer.",
      "Y ": "Given that torchvision.transform.ToTensor already standardize the data to [0, 1], you could just multiply that by 2 and subtract 1 on your input layer."
    },
    {
      "X ": "loading a tensor from file in batches",
      "Z ": "Based on this older post it seems that you could use a Storage to load the data in chunks.However, I don‚Äôt see an offset argument, so I guess the proper way would be to use np.memmap and load chunks of a numpy array (assuming you could store it via numpy).",
      "Y ": "use numpy.memmap"
    },
    {
      "X ": "a100 is slower than 1080ti with pytorch",
      "Z ": "Your command should install the latest 1.9.0 release, which would ship with the missing cutlass kernels in cudnn8.0.5. However, you might still get a better performance by building from source with cudnn8.2 as said before.",
      "Y ": "install the latest release"
    },
    {
      "X ": "how to get batch norms running stats such as running var and running mean in pytorch",
      "Z ": "You can directly access them via: my_net.bn.running_mean my_net.bn.running_var",
      "Y ": "my_net.bn.running_mean my_net.bn.running_var"
    },
    {
      "X ": "parameters initialised by nn parameter not present in the model parameters",
      "Z ": "Yes, plain Python ‚containers‚such as list, dict etc. won't (recursively) register the parameters and buffers, so you should use the PyTorch equivalents instead such as nn.ParameterList, nn.ModuleList, nn.ModuleDict etc.",
      "Y ": "use pytroch equivalents such as nn.ParameterList, nn.ModuleList, nn.ModuleDict "
    },
    {
      "X ": "encounter the runtimeerror one of the variables needed for gradient computation has been modified by an inplace operation",
      "Z ": "x = x + 1 is not in-place, because it takes the objects pointed to by x, creates a new Variable, adds 1 to x putting the result in the new Variable, and overwrites the object referenced by x to point to the new var. There are no in-place modifications, you only change Python references (you can check that id(x) is different before and after that line).On the other hand, doing x += 1 or x[0] = 1 will modify the data of the Variable in-place, so that no copy is done. However some functions (in your case *) require the inputs to never change after they compute the output, or they wouldn‚Äôt be able to compute the gradient. That‚Äôs why an error is raised.",
      "Y ": "x += 1 or x[0] = 1  changes the data variable in place "
    },
    {
      "X ": "how to train a network with two outputs",
      "Z ": "Your model (its forward function) returns a tuple (x,x1) and not a tensor. And below, you use this tuple to try to calculate the loss function, hence the error. # ... recon = model(img) # x, x2 loss = criterion(recon, img) # ... I don‚Äôt know what you want to achieve as a goal, but looking at it I think it‚Äôs x that you want to use in the loss function. loss = criterion(recon[0], img)",
      "Y ": "recon = model(img) # x, x2 loss = criterion(recon, img), loss = criterion(recon[0], img)"
    },
    {
      "X ": "softmax cross entropy loss",
      "Z ": "Oh sorry I misunderstood what you said. Yea from my experience I always use Integer Encoding (e.g. [2, 3, 1, 1] vs [[0 0 1 0], [0 0 0 1], [0 1 0 0], [0 1 0 0] for CrossEntropyLoss and BCELoss The docs for CrossEntropyLoss and BCELoss / BCEWithLogitsLoss both show examples using integer encoding the labels rather than one-hot encodingThe predictions that go into these loss functions are just the raw logits from the model",
      "Y ": "use this links for CrossEntropyLoss (https: //pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss), and BCELoss(https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) "
    },
    {
      "X ": "how to change cudnn version in pytorch",
      "Z ": "You could try to install the binaries with cudatoolkit=10.2, which should ship with a newer cudnn version, use the NGC container, or build PyTorch from source with your local CUDA and cudnn installation.",
      "Y ": "use cudatoolkit=10.2"
    },
    {
      "X ": "how to process variable length sequence of images with cnn",
      "Z ": "Sure, each data sequence is of format [image_0, action_1, image_1, action_2, image_2, action_3, image_3‚Ä¶], and the task is to predict action sequence [action_1, action_,‚Ä¶, action_k] given image sequence [image_0, image_1,‚Ä¶, image_k]. The image sequence of a variable length k+1 in a batch, so I pad each sequence with zero images until sequence length is max_seq_len. The batched input is thus of shape (B, max_seq_len, C, H, W). My network uses a CNN model to embed each image into a feature vector state, and then uses a LSTM model to predict the action sequence from the state sequence. To avoid embedding all the zero images that are just padding I use pack_padded_sequence(images, image_seq_lens, batch_first=True, enforce_sorted=False) to produce packed_images. Run the CNN on packed_images.data to get packed_states_data. Instantiate (a hacked advised against)  packed_states = PackedSequence(packed_states_data, packed_images.batch_sizes, packed_images.sorted_indices, packed_images.unsorted_indices) Send packed_states to the LSTM to predict packed_actions Unpack actions with actions, action_seq_len = pad_packed_sequence(packed_actions, batch_first=True, total_length=self.max_seq_len-1) To calculate the loss between predicted actions and ground truth actions_gt, I also pack actions_gt into packed_actions_gt. I use assert (packed_actions.sorted_indices == packed_actions_gt.sorted_indices).all() to make sure they are sorted in the same way before packing. Then I compute the loss on packed_actions.data and packed_actions_gt.data. My questions are that 1) are these valid operations? 2) is there going to be a performance issue (longer time) with numerous packing and padding back and forth on multiple GPUs? Right now my batched input live on GPU devices and these hacks happen there. Thank you for your time!",
      "Y ": "pack_padded_sequence(images, image_seq_lens, batch_first=True, enforce_sorted=False) to produce packed_images.Run the CNN on packed_images.data to get packed_states_data."
    },
    {
      "X ": "python dictionary in the model not trained",
      "Z ": "Plain Python containers, such as list and dict won't be properly registered inside your module, so use nn.ModuleDict in that case (or nn.ModuleList instead of list).",
      "Y ": "use nn.ModuleList "
    },
    {
      "X ": "problem with dimension size lstm",
      "Z ": "Your current code throws an error, if I pass an input of shape [50, 256] to it in the self.lstm layer.For the batch_first=True argument, the input should be [batch_size, seq, feature], so just passing embeds without the view gets rid of this error. If you would like to use the output features of all time samples, setting in_features=256*256 should work, as you already tried.That being said, if you are using nn.NLLLoss, you should apply F.log_softmax on your output instead of F.sigmoid.Could you change that and try to train your model again?",
      "Y ": " if batch_first=True argument, the input should be [batch_size, seq, feature]"
    },
    {
      "X ": "problem in making embedding layer for a cnn document classification",
      "Z ": "nn.EmbeddingBag != nn.EmbeddingThe first will give you the average of all the embeddings in the sequence while the second will give you the embeddings in a sequence.  I think you want to use the latter rather than the former.",
      "Y ": "nn.EmbeddingBag != nn.Embedding, use nn.Embedding  "
    },
    {
      "X ": "packing in rnns",
      "Z ": " longer answer: when using packed sequence the strings are rearranged by length then they are stacked vertically in your example it will be Jemaldinov Griffiths Naifeh each iteration it will process the i‚Äôth column across all rows if exists and pass it to the next rnn cell that‚Äôs why it seems to you like the letters got scrambled. check this blog for a good example of how to use padded seq: https: //towardsdatascience.com/taming-lstms-variable-sized-mini-batches-and-why-pytorch-is-good-for-your-health-61d35642972echeck this repo for understanding the inside of a packed seq:https: //github.com/sgrvinod/a-PyTorch-Tutorial-to-Sequence-Labeling#language-models",
      "Y ": " use thi slink https: //towardsdatascience.com/taming-lstms-variable-sized-mini-batches-and-why-pytorch-is-good-for-your-health-61d35642972e 1"
    },
    {
      "X ": "pack padded sequence with unsorted question length",
      "Z ": "Yes, if you are not planning to export the model via ONNX, you could use enforce_sorted=False and pass the unsorted padded sequence to this method.",
      "Y ": " use enforce_sorted=False"
    },
    {
      "X ": "optimization of performance",
      "Z ": "If you want to reduce the model capacity, you could lower the number of hidden neurons in the model. Also note that nn.BCELoss is expecting a single neuron with a sigmoid applied on it for a binary classification use case.So you would have to change the last linear layer to nn.Linear(512,1) and use nn.Sigmoid. Or remove the softmax and sigmoid, return the raw logits, and use nn.BCEWithLogitsLoss`.If you want to use two neurons for the binary classification, you could keep the last linear layer, remove the softmax, and use nn.CrossEntropyLoss.",
      "Y ": "So you would have to change the last linear layer to nn.Linear(512,1) and use nn.Sigmoid"
    },
    {
      "X ": "noob model with embedding bag not learning problem with my model",
      "Z ": "It's usually better to remove the sigmoid and use nn.BCEWithLogitsLoss or F.binary_cross_entropy_with_logits.However, your model might be too small, so I would recommend to add a relu and another linear layer to it.Let me know, if that helps. ",
      "Y ": " use nn.BCEWithLogitsLoss or F.binary_cross_entropy_with_logits. instead of sigmoid  and add a relu and another linear layer to it   "
    },
    {
      "X ": "multi head attention forward and batch dimension index",
      "Z ": "Hi, Actually, Linear consider anything that is not the last dimension as batch. So this is not an issue here.",
      "Y ": "Linear consider anything that is not the last dimension as batch"
    },
    {
      "X ": "lstm training loss does not decrease",
      "Z ": "Thank you Olivier for looking into it. Your hunch on the learning rate was in right direction. However, the problem was rather simple. I am not sure anyone can run into this. It may be very basic about pytorch. That being said, at the risk of sounding stupid, here's the problem.overall_loss += loss.tolist() before loss.backward() was the issue. It wasn't optimizing at all. loss.tolist() is a method that shouldn't be called I suppose. The correct way to access loss is loss.item().Now the network does what it should. Thanks everyone for looking into this and it‚Äôs less likely but hope this saves someone else is time if somebody runs into this.",
      "Y ": "The correct way to access loss is loss.item()"
    },
    {
      "X ": "lstm layer dimensionality",
      "Z ": "if u set batch_first=True, the shape of input must be (B, L, D). batch_first  If  True , then the input and output tensors are provided as (batch, seq, feature). Default:  False",
      "Y ": " input and output tensors are provided as (batch, seq, feature) if batch_first = True "
    },
    {
      "X ": "lstm input and output dimentions",
      "Z ": "You define your LSTM with batch_first=True. This changes the expected dimensions if the input to (batch, seq_len, input_size). Try setting batch_first=False, and you will see that an error is thrown due to incorrect dimensions.",
      "Y ": "set batch_first=True,  expected dimensions of the input  to be in  (batch, seq_len, input_size) "
    },
    {
      "X ": "lstm hidden states issue",
      "Z ": "If you look at the docs, you can resolve the num_layers*num_directions withh = h.view(num_layers, num_directions, batch, hidden_size) After that, the last layer is simply h[-1], the last layer for the forward pass is h[-1][0], and the last layer for the backward pass is h[-1][1]. Does this answer your question? I'm not quite if I understood what you need to know.",
      "Y ": " you can resolve the num_layers*num_directions with h = h.view(num_layers, num_directions, batch, hidden_size)"
    },
    {
      "X ": "lstm error when using a saved model for prediction",
      "Z ": "I got this workaround. I don‚Äôt know what the problem is and how to correctly resolve it. But just posting it.Since the weights are already there in the loaded_model, so I copied those weights into another model that I created using same arguments as in the training time:model = charGen(n_letters,512, hidden_dim=512).to(device)  # This is exactly how I had my model during trainingmodel.load_state_dict(loaded_model.state_dict()) And it worked now. predict(model, i am happy , device)#'{i am happy}and away along in the act and finish a straightforward battle on the starboard, and starts to the ope'",
      "Y ": "use this model = charGen(n_letters,512, hidden_dim=512).to(device)  # This is exactly how I had my model during training model.load_state_dict(loaded_model.state_dict())"
    },
    {
      "X ": "lstm error in second epoch only on gpu",
      "Z ": "Do you re-initialize your hidden sate after each batch? See this Seq2Seq PyTorch tutorial, specifically the use if the initHidden() methods of the encoder and decoder.Without re-initalizing (or the usage of detach() at the right spot), the backpropagation path of your RNN continuously grows, definitely leading to memory issues.",
      "Y ": " use inithidden() and use this link  https: //pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html"
    },
    {
      "X ": "clarification on weightedrandomsampler",
      "Z ": "Pyroka: As far as i understand the WeightedRandomSampler makes sure (if correctly used), that in each Batch there is approximatly the same amount of samples for each Class. That‚Äôs one use case, but I wouldn‚Äôt say that it‚Äôs the only correct one. As the name suggests this sampler is used for weighting the sampling strategy, not only balancing it. Pyroka: Now i can set ‚Äúreplacement‚Äù to false,Yes, your description is correct. If you want to have balanced batches, you should not use replacement=False.Pyroka: But that also means it is possible to have the same exact image of class ‚Äú1‚Äù multiple times in my training leading to an at least slightly biased learning.Yes, that might be the case, but could still perform better than the imbalanced dataset. You would have to apply the sampler on your use case and compare the model performance metric to see how e.g. the confusion matrix changes and what the acceptable result would be.",
      "Y ": "If you want to have balanced batches, you should not use replacement=False."
    },
    {
      "X ": "loss function and lstm dimension issues",
      "Z ": "tcsn_wty: But if I understand correctly, you meant that I need to do the permutation before log_softmax called, am I correct? If you want to keep F.log_softmax(x, dim=1), then yes.Otherwise use dim=2, if you want to permute the tensor afterwards.tcsn_wty:As for loss function, if I‚Äôm using nn.CrossEntropyLoss, do you meant that there‚Äôs no need to apply F.log_softmax in forward() function?Yes, since it will be applied twice at the moment (in your forward and inside nn.CrossEntropyLoss).",
      "Y ": "use F.log_softmax(x, dim=1), and "
    },
    {
      "X ": "last hidden state in bidirectional stacked gru",
      "Z ": "If you're interested in the last hidden state, i.e., the hidden state after the last time step, I wouldn‚Äôt bother with gru_out and simply use hidden (w.r.t. to your examples). According to the docs: hidden.shape = (num_layers*num_directions, batch, hidden_size) layers can be separated using  h_n.view(num_layers, num_directions, batch, hidden_size) So you shouldn‚Äôt simply do hidden[-1] but first do a view() to separate the num_layers and num_directions (1 or 2). If you do hidden = hidden.view(num_layers,2, batch, hidden_size) # 2 for bidirectional last_hidden = hidden[-1] then last_hidden.shape = (2, batch, hidden_size) and you can do last_hidden_fwd = last_hidden[0] last_hidden_bwd = last_hidden[1]If you want to sum them up or concatenate them is up to you. You can also have a look at this related post.",
      "Y ": "use this hidden = hidden.view(num_layers,2, batch, hidden_size) # 2 for bidirectionallast_hidden = hidden[-1] "
    },
    {
      "X ": "how to handle a memory expensive step in the model to support higher batch size",
      "Z ": "If intermediate tensors in vocab_process need the device memory and will thus raise the OOM issue, your loop approach could work (assuming that vocab_process returns the right output for sub-tensors). Since Python uses function scoping you wouldn‚Äôt have to delete the intermediate tensors, as they will be freed once you exit vocab_process. If you are running into the OOM before exiting the method, the del methods might still be useful. In-place operations won‚Äôt allocate new memory, but will instead manipulate the tensor directly.",
      "Y ": "If the intermediate tensors in vocab process require device memory and thus raise the OOM issue, your loop approach might work (assuming vocab process returns the correct output for sub-tensors). Because Python uses function scoping, you don't need to delete the intermediate tensors because they'll be freed when you exit vocab process. If you encounter an OOM before exiting the method, the del methods may still be useful.In-place operations do not allocate new memory, but instead directly manipulate the tensor."
    },
    {
      "X ": "how do i use nn crossentropyloss for seq2seq where my predication is of size bs seq len vocab size and truth of size bs seq len",
      "Z ": "Try to permute the dimensions in your predication tensor to [batch_size, nb_classes, seq_len], i.e. [2, 5, 3] and it should work.",
      "Y ": "permute the dimensions in your predication tensor   to   [batch_size, nb_classes, seq_len]"
    },
    {
      "X ": "error expected object of scalar type long but got scalar type float for argument 2 mat2",
      "Z ": "If I understand properly, the error is when you call self.gru. Based on the documentation https: //pytorch.org/docs/stable/nn.html#torch.nn.GRU, you have to pass float numbers to GRU object, so you have to remove .long() before passing outputs to self.gru.By the way, can you test your model by passing hidden_intitial_state like below example in the documentation: rnn = nn.GRU(10,20,2) input = torch.randn(5,3,10) h0 = torch.randn(2,3,20) output, hn = rnn(input, h0) In the end, I tested a simple model and for sure you have to pass float numbers to GRU.Please let me know the result.",
      "Y ": "use this link https: //pytorch.org/docs/stable/nn.html#torch.nn.GRU"
    },
    {
      "X ": "embeddings shows index out of range if padding id is 1",
      "Z ": "&gt;&gt;&gt; embedding = nn.Embedding(10, 3, padding_idx=-1); tokens = torch.tensor([1, 2]).long()   # positive indexes here; embedding(tokens)  grad_fn=&lt;EmbeddingBackward&gt;)",
      "Y ": "use embedding = nn.Embedding(10,3, padding_idx=-1) ; tokens = torch.tensor([1,2]).long()"
    },
    {
      "X ": "efficient way to use torchtext tabulardataset on large dataset",
      "Z ": "If your dataset is indeed too large to fit into memory, you need to split it and train using the different ‚Äúsub-datasets‚Äù one after another. In this case, I would actually prepocess the text data completely so that the file already contained the sequences of indexes to speed up the training. Otherwise, you would have to do the preprocessing potentially in each epoch.As a side comment, when using tokenize = lambda x:x.split() I hope your input text documents have whitespaces before punctuation marks ‚Äì and after them, which is not a given in user-generated data.",
      "Y ": "use tokenize = lambda x:x.split() "
    },
    {
      "X ": "creat a tensor with random number of 1 and 1",
      "Z ": "If I understand correctly, you just want to generate a tensor which contains -1 and 1 values ?Would something like this work for you ? a = torch.Tensor([ -1,1]) idx = np.random.randint(2, size=your_shape) noise = a[idxEdit : Furthermore, if you want to have some control on the number of 1 or -1 in your tensor, I would suggest: a = torch.Tensor([-1,1]) idx = np.random.choice(2, size=your_shape, p=[r,1-r]) noise = a[idx]",
      "Y ": "use a = torch.Tensor([-1,1]) idx = np.random.choice(2, size=your_shape, p=[r,1-r])"
    },
    {
      "X ": "quantized cat running time is slower than fp32 model",
      "Z ": "This is because torch quantized concat op just piggy back to FP32 by dequnatize all inputs -&gt; do concat in FP32 -&gt; quantize concatenated tensorSo it will never be faster than FP32 concat. See the implementation github.com pytorch/pytorch/blob/f6c46df856d0588360db5b807960d1fc5e888c36/aten/src/ATen/native/quantized/cpu/qconcat.cpp#L61-L66 xs.push_back(qx.dequantize());} const Tensor y = at: :cat(xs, dim); Tensor qy; AT_DISPATCH_QINT_TYPES(x_dtype,qcat,[&amp;]() { qy = at: :quantize_per_tensor(y, scale, zero_point, SCALAR_TYPE); There are other operators that follow the same FP32 fallback approach (hence slower than FP32) such as quantized elemwise add, mul etc.",
      "Y ": "us ethis link pytorch/pytorch/blob/f6c46df856d0588360db5b807960d1fc5e888c36/aten/src/ATen/native/quantized/cpu/qconcat.cpp#L61-L66"
    },
    {
      "X ": "how to extract individual weights after per channel static quantization",
      "Z ": "OK. It seems like we can use layer[0].weight().int_repr().data[i,j,l,m] to get the INT8 representation of the weight entries. Also, layer[0].weight().dequantize() gives the tensor of the weights in FP format to have element-wise access to its contents.",
      "Y ": "use layer[0].weight().int_repr().data[i,j,l,m ]"
    },
    {
      "X ": "the parameters saved in the checkpoint are different from the ones in the fused model",
      "Z ": "Thanks to dear God, after hours of debugging I finally found out the cause:The error messages with the form of : While copying the parameter named ‚Äúxxx.weight‚Äù, whose dimensions in the model are torch.Size([yyy]) and whose dimensions in the checkpoint are torch.Size([yyy]). are actually generic messages, only returned when an exception has occured while copying the parameters in question. Pytorch developers could easily, add the actual exception args into this spurious yet unhelpful message, so it could actually help better debug the issue at hand. Anyway, looking at the exception which was by the way : 'copy_' not implemented for  QInt8' you‚Äôll now know what the actual issue is/was!",
      "Y ": "Pytorch developers could easily insert the actual exception args into this functions  but unhelpful message, allowing them to better debug the issue at hand."
    },
    {
      "X ": "loading a dynamically quantized transformers model",
      "Z ": "While loading the model is the model now a quantized model? If you convert the model to quantized model and then load the quantized state_dict it should work.",
      "Y ": "If you convert the model to quantized model and then load the quantized state_dict it should work."
    },
    {
      "X ": "post quantizing conv1d prelu layernorm layers can be done",
      "Z ": "Only the quantized model will work if all the layers were quantized is it right? Or else we need to dequantize the parameters again before it passes through not quantized layer is it so?You can insert QuantStub, DequantStub blocks around the code that can be quantized. Please see https: //pytorch.org/tutorials/advanced/static_quantization_tutorial.html for an example of this.May I know that the changes in that PR is applicable for torch CPU version or not?The change is applicable for CPU, to get these changes you can either build pytorch from source or install from pytorch nightly.",
      "Y ": "You can insert QuantStub, DequantStub blocks around the code that can be quantized. "
    },
    {
      "X ": "error in qat evaluate",
      "Z ": "I think somebody have the same error: Issue with Quantization I think i know the answew: # pytorch 1.4 #save torch.save(net.state_dict(),'xx') # fp32 model #load model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm') y = net(x) print(y) print('Success')",
      "Y ": "use this link https: //discuss.pytorch.org/t/issue-with-quantization/67457"
    },
    {
      "X ": "mobilenetv2 ssdlite quantization results in different model definition",
      "Z ": "yeah, you‚Äôll need to quantize lq_model after lq_model = create_mobilenetv2_ssd_lite(len(class_names), is_test=True)  before you load from the quantized model",
      "Y ": "use lq_model = create_mobilenetv2_ssd_lite(len(class_names), is_test=True)"
    },
    {
      "X ": "quantization not decreasing model size static and qat",
      "Z ": "Hi Raghav, For post training quantization, we want the model to be in eval mode (see https: //github.com/pytorch/pytorch/blob/530d48e93a3f04a5ec63a1b789c19a5f775bf497/torch/quantization/fuse_modules.py#L63).  So, you can add a model.eval() call before you fuse modules: model.eval() torch.quantization.fuse_modules(...)",
      "Y ": "use this link https: //github.com/pytorch/pytorch/blob/530d48e93a3f04a5ec63a1b789c19a5f775bf497/torch/quantization/fuse_modules.py#L63"
    },
    {
      "X ": "quantstub dequantstubs for qat confusion",
      "Z ": " kekpirat:From my understanding only layers in between the Quant/DequantStubs are supposed to be quantised (is that correct?)Which layers to quantize is controlled by the qconfig, when we do model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm') it applies the defaults settings of which modules to swap to the entire model.",
      "Y ": " use this model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')"
    },
    {
      "X ": "implement selected sparse connected neural network",
      "Z ": "The parameters of MySmallModels are most likely missing in model.parameters(), since you are storing them in a plain Python list, thus the optimizer is ignoring them.Try to use self.networks = nn.ModuleList instead.",
      "Y ": "use self.networks = nn.ModuleList"
    },
    {
      "X ": "how to know the memory that is allocated from your model on gpu",
      "Z ": "Yes, you could use print(torch.cuda.memory_summary()).",
      "Y ": "use print(torch.cuda.memory_summary())"
    },
    {
      "X ": "gpu is not utilized while occur runtimeerror cuda runtime error out of memory at",
      "Z": "Hi, Tensorflow has the bad habbit of taking all the memory on the device and prevent anything from happening on it as anything will OOM. There was a small bug in pytorch that was initializing the cuda runtime on device 0 when printing that has been fixed A simple workaround is to use CUDA_VISIBLE_DEVICES=2. This will hide all devices but the one you specify and will make sure you never use other devices.",
      "Y ": "use CUDA_VISIBLE_DEVICES=2. "
    },
    {
      "X ": "how to take three tensor elements one by one",
      "Z ": "Maybe something like this script would be helpful: a = torch.randn(2)",
      "Y ": "new_tensor = [] for i, j, k in zip(a,b,c): new_tensor.append([i.item(),j.item(),k.item() ])"
    },
    {
      "X ": "how to use a self argument in a function define outside the class attributeerror function object has no attribute data",
      "Z ": "Yes, exactly. I've used args as a placeholder for the actual input arguments for generate.",
      "Y ": "use args as a place holder"
    },
    {
      "X ": "efficiency of dataloader and collate for large array like datasets",
      "Z ": "In case where bulk read makes sense, you should not use DataLoader in the default random read style. Here is a quick start on how to switch to bulk loading mode. Instead, write a sampler that spits out keys representing batch index (e.g., range, or list of indices). For DataLoader, set batch_size=None and use sampler= for your new sampler. And in dataset code __getitem__, the input is what your custom sampler yields, so something that represents batch indices, so use that to index into your hdf5 and return the array.",
      "Y ": "you should not use DataLoader in the default random read style."
    },
    {
      "X ": "difference between seq len and input size in lstm",
      "Z ": "Yes, in case of sentences, input_size derives from the size of your word or character embeddings. For example, if you use Word2Vec or GloVe embeddings of size 300 (i.e., each word is represented as a vector with 300 dimensions), the input_size of your LSTM needs to be 300.",
      "Y ": "  input_size derives from the size of your word or character embeddings"
    },
    {
      "X ": "convert lstm from keras to pytorch",
      "Z ": "So one thing you need to do to get it to work is to pass batch_first to the LSTM instantiation if that is what you want.While taking the last timestep (as you do with lstm_out[: ,-1,:]) is certainly a common way to set up sequence-to-one problems (assuming your inputs are of the same length), I would not call it a ‚size adjustment. It says that the LSTM should ‚memorize‚the relevant information. Other ways are possible, in fast.ai‚Äôs (Howard and Ruder) ULMFiT, they recommend to concatenate the last timestep with the average and max over time (but then the linear would have 3 * hidden_size inputs, of course).Best regards Thomas",
      "Y ": "If that's what you want, one thing you'll need to do is pass batch first to the LSTM instantiation.While using the last timestep (as with lstm out[: ,-1,:]) is certainly a common way to set up sequence-to-one problems (assuming your inputs are all the same length), I would not call it a size adjustment. It is stated that the LSTM should memorise the pertinent information. Other options are available in a hurry. In ai's (Howard and Ruder) ULMFiT, they recommend concatenating the last timestep with the average and maximum over time (of course, the linear would have 3 * hidden size inputs)."
    },
    {
      "X ": "combine train and test set torchtext but concatdataset object has no attribute get vocab",
      "Z ": "random_split will return Subsets, which wrap the dataset. To access the underlying dataset, you could use train_dataset.dataset.get_vocab().However, this would call the get_vocab() methon of course on the complete dataset.I'm not familiar with this method, but if it creates the vocabulary based on the used dataset, it would contain the words from the training and validation dataset.",
      "Y ": "random_split will return Subsets"
    },
    {
      "X ": "cant find inplace operation that messep up backwards",
      "Z ": "Thanks for the code.The offending call should be the inplace unsqueeze of prototypes[ 0]. Change it to this and it should work: dist = M_2 - self.distance_function(x_in, prototypes[0].unsqueeze(0))",
      "Y ": "use dist = M_2 - self.distance_function(x_in, prototypes[0].unsqueeze(0))"
    },
    {
      "X ": "batchnorm1d for evaluation",
      "Z ": "If the batch norm layers are between linear layers, the shape should most likely once be changed before the first linear layer.Anyway, you could define a Permute layer and use it inside your nn.Sequential container:",
      "Y ": " define a Permute layer and use it inside your nn.Sequential container"
    },
    {
      "X ": "batch examples in document",
      "Z ": "I solved my problem by overriding the create_batches methods inside the data.Iterator.",
      "Y ": "use this class class EvalIterator(data.Iterator): "
    },
    {
      "X ": "batch dimension for calculating loss",
      "Z ": " Marla:For an RNN, ‚output‚ is given as (seq_len, batch, num_directions * hidden_size) with ‚batch‚no longer represented (as per the documentation).What do you mean the batch is no longer represented? It's in the second dimension :). If you need it in the first one you could simply do output.transpose(0,1)",
      "Y ": "use output.transpose(0,1)"
    },
    {
      "X ": "average of the gru lstm outputs for variable length sequences",
      "Z ": "You could sum the padded tensor, and then divide it by the lengths of the sequences. I suppose padded is shaped [batch_size, seq_len, embed_size] so padded.sum(dim=1) is shaped [batch_size, embed_size] and lengths is shaped [batch_size out = padded.sum(dim=1).div(lengths.float().unsqueeze(dim=1))",
      "Y ": "use this out = padded.sum(dim=1).div(lengths.float().unsqueeze(dim=1))"
    },
    {
      "X ": "attention implemented using encoder outputs rather than hidden states",
      "Z ": "The encoder outputs ARE the hidden states at each time step. Look at the following sketch of an LSTM cell: 1024px-The_LSTM_cell1024√ó672 75.5 KB there is not extra output, it's just the current hidden state h_t. For example, if you define an lstm with num_layers=0 and batch_first=False, and given, say output, (h, c) = lstm(inputs, hidden) then output[-1] = h_n, simply meaning that you can get the final hidden state directly from h or as the last hidden state from output. If you have more layers (num_layers&gt;1), things are a bit different: output will contain only the hidden states for each time stamp after the last layer, while h will contain the last hidden states for each layer.",
      "Y ": "output, (h, c) = lstm(inputs, hidden)"
    },
    {
      "X ": "accessing the lstm properties",
      "Z ": "You can directly access these properties and print them: model = nn.LSTM(10,10, 2,0.3, bidirectional=True)",
      "Y ": "model = nn.LSTM(10,10,2,0.3, bidirectional=True)"
    },
    {
      "X ": "proper distributeddataparallel usage",
      "Z ": "QUESTION: Suppose each process has a different random generator state, when DistributedDataParallel is initialized does each process need to have the same parameter values?No. Rank 0 will broadcast model states to all other ranks when you construct DDP. Code for that is here.In order to evaluate, on one GPU, can we use ddp_model.module? Yes, this should work.Can we use something like EMA to copy new parameters to ddp_model.module and then restore them after evaluation?Yes, if you make sure you restored those model param values correctly. Otherwise, if this introduces inconsistency across param values across different processes, DDP will not fix that for you, as DDP only syncs grad instead of params. This might be helpful to explain. In order to save the model, can we use ddp_model.module Yes. And when you restore from the checkpoint, it‚Äôs better to reconstruct the DDP instance using the restored module to make sure that DDP starts from a clean state. Do we need to use torch.distributed.barrier so that the other processes don‚Äôt continue training while the master evaluates? It‚Äôs recommended this way. But if you are not consuming the checkpoint right away and not worried about timeout due to rank0 is doing more work, this is not necessary. Because the next DDP backward will launch allreduce comm ops, which will sync anyway. Some of this is also explained here.",
      "Y ": "https: //pytorch.org/tutorials/intermediate/ddp_tutorial.html"
    },
    {
      "X ": "distributed machine learning on multiple cores",
      "Z ": "IIUC, this is typical DistributedDataParallel training? If so, yes, PyTorch natively support that. Here is another tutorial.",
      "Y ": "use this tutorial https://pytorch.org/tutorials/intermediate/ddp_tutorial.html"
    },
    {
      "X ": "multiple gpu model getting runtimeerror caught runtimeerror in replica 0 on device 0",
      "Z ": "Looks like some layers of the model lives on GPU and others live on CPU. Is this intentional? DataParallel does not support mixed CPU-GPU model, all layers of the same model need to live on the same GPU. If you have multi-GPU model, e.g., some layers live on cuda: 0 and others live on cuda: 1, you can try DistributedDataParallel. Check out this.",
      "Y ": "try DistributedDataParallel "
    },
    {
      "X ": "distributed training creates multiple processes in gpu0",
      "Z ": "Found the bug. So we need to be careful with setting the right GPU context while calling clear_cache() function, otherwise it allocates fixed memory on GPU0 for the other GPUs. Relevant issue here.",
      "Y ": " Set the right GPU context while calling clear_cache"
    },
    {
      "X ": "is it expected for distributeddataparallel to use more memory on 1 gpu in a 1gpu 1process setup",
      "Z ": "Is the DDP process the only process using that GPU? The extra size ~500MB looks like an extra cuda context. Does this behavior still persist if you set CUDA_VISIBLE_DEVICES  env var properly (instead of using torch.cuda.set_device(rank)) before launching each process?",
      "Y ": "torch.cuda.set_device(rank))"
    },
    {
      "X ": "how to divide the dataset when it is distributed",
      "Z ": "Yeah, I find a solution with the help of Szymon Maszke. Use torch.utils.data.random_split instead. Namelytrain_data, val_data = torch.utils.data.random_split(train_data, (num_train, num_val))",
      "Y ": "train_data, val_data = torch.utils.data.random_split(train_data, (num_train, num_val))"
    },
    {
      "X ": "rpc with raw irecv reduce distributed primitives",
      "Z ": "Yes, this is possible, but there is one caveat. The current implementation of init_rpc() sets the default process group. It actually shouldn‚Äôt do that, and we are fixing it (see this issue). With the current RPC implementation, I see two ways to implement this feature. Option 1 First, call init_rpc() and let it set the default process group. Then use the new_group API to create new set of process group instances, and only call irecv/reduce/broadcast on the new process group instances. This would avoid messing up states of the default process group that RPC is running on. Option  2 Directly create process group using its constructor. See the test below as an example: github.com pytorch/pytorch/blob/f6f1384811b9cc722f650ed9ead8ee99938c009a/test/distributed/test_c10d.py#L1501-L1502store = c10d.FileStore(self.file_name, self.world_size) pg = c10d.ProcessGroupGloo(store, self.rank, self.world_size) Just curious, do you mind share more details of your use case? What‚Äôs the motivation to combine RPC with collective communications? I saw some requirement for this to support combining RPC with DDP. Is this also your use case?",
      "Y ": "use this link "
    },
    {
      "X": "adding dropout in pre trained vgg16.  How can I add dropout layer after every convolution layer in VGG 16 if I want to keep the value of VGG16 parameters ",
      "Z": "Youll need to change VGG.features to do so. Because it's a nn.Sequential object, all you have to do is change the layers for this particular item; you don't have to change the forward.",
      "Y": "To do so, you need to modify VGG.features. It is a nn.Sequential object and so all you need to do is modify the layers this specific object, you don’t need to modify the forward."
    },
    {
      "X": " copy encoder with same key name.,How can i copy the encoder with same key name as my base class.",
      "Z ": "You can just use net.named children() to pass an OrderedDict to nn.",
      "Y": "You can simply pass an OrderedDict with net.named_children() to nn.Sequential like this "
    },
    {
      "X ": "fine tuning a retinaface model. I'm having trouble mapping Pytorch's Resnet18 tutorial instructions to the Retinaface architecture.",
      "Z ": "The number of input features to the last linear layer is represented by in features in the lesson.As a result, you'll probably want to replace these modules or the last layer(s) within each of them.",
      "Y ": "The in_features used in the tutorial represent the number of input features to the last linear layer.\nBased on your code snippet it seems that the last activations are calculated in:\n        bbox_regressions = torch.cat([self.BboxHead[i](feature) for i, feature in enumerate(features)], dim=1)\n        classifications = torch.cat([self.ClassHead[i](feature) for i, feature in enumerate(features)],dim=1)\n        ldm_regressions = torch.cat([self.LandmarkHead[i](feature) for i, feature in enumerate(features)], dim=1)\n\nso you would most likely want to either replace these modules or the last layer(s) inside each of these modules."
    },
    {
      "X ": "cgan image labels to linear layer returns nan ,The problem is that after a certain number of steps, <code>self.head</code> returns NaN values while none of the inputs contain NaN values.",
      "Z ": "The problem was not in the Discriminator but rather in an unstable module of the generator (which previously worked fine and appeared stable, but in reality was not),The architecture described above seems to be working fine",
      "Y ": "The problem was not in the Discriminator but rather in an unstable module of the generator "
    },
    {
      "X ": "concatante k linear layers results to single output tensor",
      "Z ": "You could do something like You could do something like that should output the thing you are looking for",
      "Y ": "torch.stack(res, dim = 1)"
    },
    {
      "X ": "tensor shape for rnn batch training",
      "Z ": "x.permute(1,0,2)",
      "Y ": "x.permute(1,0,2)"
    },
    {
      "X ": "gru works even with incorrect initial hidden state",
      "Z ": "This is a bug, and has been fixed on the master branch of pytorch with link https://github.com/pytorch/pytorch/pull/3925 https: //github.com/pytorch/pytorch/pull/3925",
      "Y ": "This is a bug, and has been fixed on the master branch of pytorch"
    },
    {
      "X ": "passing text as input to a cnn",
      "Z ": "The size of the embedding weight tensor is (Vocab size,Embedding Size)  but according to the docs with link  http://pytorch.org/docs/0.3.0/nn.html#torch.nn.Embedding the embedding layer takes input of shape (batch_size, words) where each element is the index of a word, and produces output of shape  (batch_size, words, embedding_size)where each word index has been replaced by the corresponding embedding vector.Basically, the embedding is applied individually to each word of the input.",
      "Y ": "the embedding is applied individually to each word of the input."
    },
    {
      "X ": "runtimeerror inconsistent tensor size while using pre trained weights,RuntimeError: inconsistent tensor size, expected tensor [426 x 50] and src [114044 x 50] to have the same number of elements, but got 21300 and 5702200 elements respectively at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/TH/generic/THTensorCopy.c:121</code>",
      "Z ": "The pretrained embedding was trained on a vocab of 114044 unique words.<br>But your embedding layer was be initialised for a vocab of 426 words.I see three possible solutions…,Initialise the embedding layer for a vocab of 114044 and make sure that the 426 words of your new dataset use the word indices from the 114044 word vocabulary.Extract the embedding data for those 426 words from the 114044 word pretrained embedding,Train new embeddings",
      "Y ": "A vocabulary of 114044 distinct words was used to train the pretrained embedding.However, your embedding layer was set up with a vocabulary of 426 words.There are three options that come to mind...Create an embedding layer with a vocabulary of 114044 words and ensure that the 426 words in your new dataset use the 114044 word vocabulary's word indices.From the 114044 w, extract the embedding data for the 426 words."
    },
    {
      "X ": "how can i build an rnn without using nn rnn",
      "Z ": "Sure, then you would have to call F.log_softmax(output) before passing it to <code>NLLLoss</code> or add it as a layer in your model.CrossEntropyLoss basically combines a log_softmax with NLLLoss",
      "Y ": "CrossEntropyLoss basically combines a log_softmax with NLLLoss"
    },
    {
      "X ": "how to padding sequence of variable length in nlp task,",
      "Z ": "Try  pad_sequence with link  https://pytorch.org/docs/stable/nn.html#pad-sequence ",
      "Y ": "Try pad_sequence."
    },
    {
      "X ": "loading saved model with hacky weight sharing resulting in gibberish",
      "Z ": "I figured it out. Basically I should use torch.matmul with self.embeds.weight. Doing this solved the saving / loading problem. The hacky way apparently doesn’t work!logits is a Variable of shape (Batch, Embedding_Dimension) and self.embeds.weight is a Parameter with shape (Vocabulary_Size, Embedding_Dimension)logits = torch.matmul(logits, self.embeds.weight.t())",
      "Y ": "logits = torch.matmul(logits, self.embeds.weight.t())"
    },
    {
      "X ": "my custom embeddding layer is returning only zeros",
      "Z ": "Correction is max_norm should be None not Falseoutputs = F.embedding(inputs, self.weights * (self.dmodel ** 0.5), self.pad_idx, None2, False, False)",
      "Y ": "outputs = F.embedding(inputs, self.weights * (self.dmodel ** 0.5), self.pad_idx, None,2, False, False)"
    },
    {
      "X ": "error loading bidirectional lstm model,When I loaded it using load_state_dict(torch.load()), it gives me an error says Unexpected key(s) in state_dict: ",
      "Z ": "Maybe your trained model had two or more layers in the layer1.lstm (l1 is the second) and the model you’re loading it to only one?",
      "Y ": "Maybe the layer1.lstm in your trained model had two or more layers (l1 is the first), but the model you're loading only has one?"
    },
    {
      "X ": "rnn usage in pytorch, I see that people often use RNN in a loop and set sequence length to 1. I would like to ask if there is any different if RNN is run on the entire sequence without the loop. In addition, if RNN can be used that way, how is it different from RNNCell?",
      "Z ": "Many interesting RNN architectures use things besides what can be done with the “vanilla” RNN/GRU/LSTM - the classic is when one timestep’s output has an impact on the next timestep’s input - one prominent example is attention.Once you have things like that, you cannot use the multi-timesteps anymore. I guess that people still like them, either because they’re used to them or because they still offer some speed advantage, but fundamentally, most uses are very similar to the *Cell variants.",
      "Y ": "Many fascinating RNN designs use things other than what the vanilla RNN/GRU/LSTM can do - the classic is when one timestep's result affects the following timestep's input - one noteworthy example is attentiveness.You can no longer use multi-timesteps after you've got something like that. People may still choose them because they are familiar with them or because they provide a speed advantage, although the majority of applications are fairly comparable to the *Cell variants."
    },
    {
      "X ": "masking recurrent layers,How can we mask our input sequences in RNNs?",
      "Z ": "I think this should be what you’re looking for  Simple working example how to use packing for variable-length sequence inputs for rnn  with link https://discuss.pytorch.org/t/simple-working-example-how-to-use-packing-for-variable-length-sequence-inputs-for-rnn/2120",
      "Y ": "For rnn, a simple working example of using packing for variable-length sequence inputs."
    },
    {
      "X ": "what is nn embedding exactly doing,For example is a pre-trained embedding being used to project the word tokens to its hypothetical space? Is there a distance measure being used? Or is it embedding models like word2vec?",
      "Z ": "It is just a look up table from indices to vectors. You can manually initialize them however you want, e.g. to work2vec weights",
      "Y ": "It's merely a table that converts indices to vectors. You can manually initialise them to work2vec weights, for example."
    },
    {
      "X ": "same model on scikit learn does well but fails in pytorch",
      "Z ": "Your loss_function CrossEntropyLoss with link  https://pytorch.org/docs/stable/nn.html#torch.nn.expects raw logits and the end of your forward function applies a F.softmax. You should either not do the softmax and just return tag_space or you should change it to a log_softmaxand change your loss_function to NLLLoss with link  https://pytorch.org/docs/stable/nn.html#torch.nn.NLLLoss ",
      "Y ": "Your loss feature CrossEntropyLoss raw logs and F.softmax applies to the end of your forward function. You either should not make the softmax or simply return tag space to a log softmax and change your loss function to NLLoss."
    },
    {
      "X ": "gradient of hidden state in lstm",
      "Z ": "For this you need to spell out the time loop using LSTMCell. That makes you pass the hidden state at each timestep and you can use hx.retain_grad()  etc.",
      "Y ": "use hx.retain_grad() "
    },
    {
      "X ": "embedding class in nlp tutorial",
      "Z ": "The nn.Embedding class is a lookup table mapping each index to a vector. In effect it is a matrix, the weight matrix of the embedding module.Changing the vectors by a little bit will change the following computation by a small amount and eventually the (cross-entropy, typically) loss. Thus the vectors get a gradient in the backward step.You could replace the nn.Embedding by a single parameter of dimension vocab_size * embedding_size and use indexing a[idxes                                                                                    ] to get the vectors to get something that is mathematically equivalent, unless you use fancy features like inverse frequency scaling of gradients.",
      "Y ": "The nn.Embedding class is a lookup  table that maps a vector to every index. It is actually a matrix, the weight matrix of the module of incorporation.When the vectors change slightly, the following calculation will change in a small quantity and eventually the loss (usually cross-entropy). Thus the vectors get a reverse gradient.You could use nn.Embedding to replace a single vocab size* dimensional parameter, using a [idxes] indexing, to get the vectors mathematically equivalent, without using fancy features such as gradient reversal."
    },
    {
      "X ": "does feeding input one by one affect gru output",
      "Z ": "It turns out that after I rewrote the same code in a simpler fashion. The problem is solved.",
      "Y ": "It turned out that when I recreated the identical code in a more straightforward manner. The issue has been resolved."
    },
    {
      "X ": "4 dimension input to lstm,My input is of shape (batch_size * num_sentences * sentence * embeddings). Any help on how to pass this 4-d input to an LSTM ?",
      "Z ": "Depending on what constraints you have on the dependence between sentences, you could have,Independent sentences: Each sentence modeled individually. (batch_size*num_sentences, sentence, embeddings),Independent paragraphs: Each paragraph(sequence of sentences) modeled independently. (batch_size, num_sentences*sentence, embeddings,Alternatively, you could have two layers of LSTM, one modeling and embedding individual sentences, the second taking these sentence embeddings as input and modeling paragraphs",
      "Y ": "Depending on the limits you impose on sentence dependencies, you could end up withSentences that are independent of one another are modelled separately. (num sentences, sentence, embeddings) (batch size*num sentences) (batch size*num sentences) (batch size*numParagraphs with distinct models: Each paragraph (sequence of sentences) is modelled separately. (num sentences*sentence, embeddings) (batch size, num sentences*sentence) (batch size, num sentences*sentence) (bAlternatively, you might use two layers of LSTM, with the first modelling and embedding individual sentences and the second using these sentence embeddings as input to model paragraphs."
    },
    {
      "X ": "why does code like fairseq override the default initialization of the nn embedding layer",
      "Z ": "IMHO, this is more Pythonic class Embedding(nn.Embedding):def __init__(self, num_embeddings, embedding_dim, padding_idx):super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)nn.init.uniform_(self.weight,-0.1,0.1)nn.init.constant_(self.weight[padding_idx],0)",
      "Y ": "class Embedding(nn.Embedding): def __init__(self, num_embeddings, embedding_dim, padding_idx): super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx) nn.init.uniform_(self.weight,-0.1,0.1) nn.init.constant_(self.weight[padding_idx,0)"
    },
    {
      "X ": "modifying a model without using nn sequential",
      "Z ": "Your pointer actually made me think a little more and I ended up doing it much for simply num_classes_in_new_model = 5 new_model = copy.deepcopy(old_model) in_features_last_layer = new_model.__getattr__('fc2').in_features new_model.__setattr__('fc2', nn.Linear(in_features_last_layer, num_classes_in_new_model))",
      "Y ": "num_classes_in_new_model = 5 new_model = copy.deepcopy(old_model) in_features_last_layer = new_model.__getattr__('fc2').in_features new_model.__setattr__('fc2', nn.Linear(in_features_last_layer, num_classes_in_new_model))"
    },
    {
      "X ": "model to device not identified error in c frontend",
      "Z ": "Hi All,Got the answer myself. Because we are using TORCH_MODULE() wrapper.  Which creates a shared pointer for us. Hence instead of m.to(device), we should use m-&gt;to(device). Notice the -&gt; insetd of .(dot). That builds as expected.",
      "Y ": "nstead of m.to(device),we should use m-;to(device)"
    },
    {
      "X ": "expected tensor not variable for argument 2 mat2",
      "Z ": "I guess the error might be thrown because you are using at::Tensor for your input  as well as in forward instead of torch::Tensor.PS: Tagging certain people might discourage others to answer in your thread and I‚am pretty sure there are more experienced C++ frontend users here. ",
      "Y ": " use  at: :Tensor for your input  as well as in forward instead of torch: :Tensor."
    },
    {
      "X ": "sparse tensor creation in the c api",
      "Z ": "I think this is the function you are looking for with link https://pytorch.org/cppdocs/api/function_namespacetorch_1afa32eb8069eda5ae03b45ee9f26de3c3.html?highlight=sparse#_CPPv4N5torch17sparse_coo_tensorERKN2at6TensorERKN2at6TensorERKN2at13TensorOptionsE,",
      "Y ": "I think this is the function you are looking for"
    },
    {
      "X ": "torchscript c extension list type,What is the type that torch considers as the proper list type for C++ when going from Torchscript to C++?",
      "Z ": "t’s a std::vector&lt;std::string&gt;. Sometimes just try shit and hope that the programmers were nice.",
      "Y ": "It’s a std::vector<std::string>"
    },
    {
      "X ": "dcgan tutorial where is the output image size of the generator set",
      "Z ": "I found the answer myself, keeping it here for documentary purposes…with link https://arxiv.org/pdf/1603.07285.pdf",
      "Y ": "I found the answer myself"
    },
    {
      "X ": "how to convert tensor entry to a float or double",
      "Z ": "Like this:torch: :Tensor tensor = torch: :randn({3,4});",
      "Y ": "torch: :Tensor tensor = torch: :randn({3,4});"
    },
    {
      "X ": "pytorch random number generation algorithm",
      "Z ": "Based on this recent PR use this link https://github.com/pytorch/pytorch/pull/21301 the curandStateMTGP32 usage was removed and instead curandStatePhilox seems to be used now.",
      "Y ": "Use curandStatePhilox now"
    },
    {
      "X ": "why didnt this cuda kernel implement",
      "Z ": "You probably want touse const torch: :OptionalDeviceGuard device_guard(a.device()); or so. This is similar to with torch.cuda.device(...) in Python. Check that a, b, c actually are on the same device. Best regards Thomas P.S.: If you take inspiration from PyTorch internals: The reason you don‚'t see this in aten/src/Aten/native/cuda/* is that there is a auto-generated wrapper doing it for you.P.P.S.: The new torchvision C++ extension  with link  https://github.com/pytorch/vision/blob/master/torchvision/csrc/cuda/ROIAlign_cuda.cu#L316 does it right and is a great example of a model PyTorch extension.",
      "Y ": "use const torch::OptionalDeviceGuard device_guard(a.device())"
    },
    {
      "X ": "parameters not changed after optimized",
      "Z ": "You win the ‚exceptionally bad luck in picking a small example award . Your parameters do change, just your loss does not. So what happens is that the gradient of (x-1)**2 is 2*(x-1) and with an SGD lr or 1, you get an oscillation between your initial value and 2-x which both happen to have the same loss value. With an lr of 0.1 or anything else below 1, your loss decreases as you would expect. If you allow two unsolicited comments: Many people claim that using std is a bad habit. Given that many people will read your code on the forums, I'd advise against doing this.Just 10 iterations would be fully sufficient so see the loss doesn't change. I will admit that I spent quite a while looking whether you somewhere made the parameter a non-leaf tensor by accident and only when I printed the gradient (to see if there was any at all) I noticed what happens.Best regards Thomas",
      "Y ": "If you allow two unsolicited comments:Many people claim that using std is a bad habit. Given that many people will read your code on the forums, I'd advise against doing this. Just 10 iterations would be fully sufficient so see the loss doesn't change. "
    },
    {
      "X ": "advanced tensor slicing in c",
      "Z ": "The C++ equivalent of [...] indexing is torch::index(self, indices) where indices is a vector of Tensors (which can be undefined). In your case: std: :vector&lt;torch: :Tensor&gt; idx_v{indexing}; auto result = torch: :index(xyz, idx_v);Best regards Thomas",
      "Y ": "The C++ equivalent of [...] indexing is torch: :index(self, indices) where indices is a vector of Tensors (which can be undefined)"
    },
    {
      "X ": "terminate called after throwing an instance of c10 error what istensor for lstm",
      "Z ": "Comparing output, hidden = model(examples.to(cpu, hidden) and at: :Tensor output = module-&gt;forward({indata1, tuple }).toTensor(); you would not expect the .toTensor() on the output to succeed, but you probably have a tuple that you need to unpack. This is why isTensor fails. Best regards Thomas P.S.: While you might have preferred ptrblck to answer, it's generally preferred to not tag people to not discourage anyone who wants to contribute from chiming in.",
      "Y ": "output, hidden = model(examples.to(cpu, hidden)and at: :Tensor output = module-&gt;forward({indata1, tuple  }).toTensor();"
    },
    {
      "X ": "vector of torch tensors to torch tensor",
      "Z ": "Just like you would in Python, use torch::cat to concatenate along a single axis and torch::stack to create a new axis. Note that dimensions (except possible the cat dimension) must match for this to succeed.",
      "Y ": "use torch::cat to concatenate along a single axis and torch::stack to create a new axis"
    },
    {
      "X ": "initializing a tensor from a tensor c",
      "Z ": "In the Python frontend, using torch.tensor() to copy-construct a Tensor from a Tensor is actually deprecated, and the recommended way is sourceTensor.clone().detach().requires_grad_().In C++ frontend, currently the best way to achieve the same functionality is:#include torch/csrc/autograd/variable.h; torch: :Tensor a = torch: :eye(3); torch: :Tensor b = torch: :autograd: :Variable(a.clone()).detach().set_requires_grad(true); We are actively working on improving the API to make a.clone().detach().requires_grad_() work soon.",
      "Y ": "sourceTensor.clone().detach().requires_grad_()"
    },
    {
      "X ": "error when initializing registered tensors",
      "Z ": "I have found the problem. I needed to move the guard up. Here is the body of the constructor that fixes this.",
      "Y": "I needed to move the guard up"
    },
    {
      "X ": "if i use scripting to convert my model how can i call other methods in c",
      "Z ": "You follow the tutorial and at step 4 you replace at: :Tensor output = module->forward(inputs).toTensor(); with torch: :Tensor output = module -> run_method( weighted_kernel_sum , input).toTensor(); where input is your input tensor.",
      "Y ": " use this at: :Tensor output = module->forward(inputs).toTensor(); with torch: :Tensor output = module -> run_method(weighted_kernel_sum, input).toTensor();"
    },
    {
      "X ": "loss does not improve on training",
      "Z ": "Thanks tom for your reply. The learning rate was indeed set to 1.0 which is a bit high I guess but maybe not yet rediculous. Switching the learning rate back to 0.01 alone is not enough as it seems because I'm still getting my nans then. While I have still trouble getting a proper standard deviation to work, scaling by only subtracting the mean from input data seems to mitigate the problem. So far the network loss is decreasing / the network is converging in the couple of tests I did with the current settings. So the solution to the specific problem in my case seemed to be Do not detach output Scale input data As for the standard deviation problem I will open another thread.Thanks tom for your help! Edit 1: Guess I was cheering too early once again In my scaling method I am subtracting the input data‚Äôs mean from input and dividing by the input data's standard deviation which happens to be broken. As soon as I remove the division by the broken std-Tensor I keep getting nans again  Edit 2: Interestingly when printing the result of torch: :std(data, 0); I get [ Variable[CPUType]{} ] Which I interpret as an empty Tensor with no values (?). Nevertheless division by that Tensor seems to scale values of the data Tensor which seems to fix the nan issue weird. Is this correct expected behaviour?",
      "Y ": "The learning rate was definitely set to 1.0, which is a little high but not ridiculous. Switching the learning rate back to 0.01 seems to be insufficient, as I'm still getting my nans.While I'm still having difficulties getting a decent standard deviation to operate, scaling the data by merely subtracting the mean seems to help. In the few tests I've run with the current settings, network loss is lowering and the network is converging. torch: :std(data,0)"
    },
    {
      "X ": "how to replace specific values in a tensor",
      "Z ": "I decided to use an accessor now, iterating over the tensor elements and comparing &amp; setting where necessary. May not be elegant but solves the problem at hand for now. Code is like: auto a = t.accessor&lt;float,1&gt;();for(int i=0; i&lt;a.size(0); i++) if(a[i] == 0.0)a[i] = 1.0; Thanks for your help!  )",
      "Y ": "auto a = t.accessor&lt;float, 1&gt;(); for(int i=0; i&lt;a.size(0); i++) if(a[i] == 0.0)a[i] = 1.0;"
    },
    {
      "X ": "deleting tensors in context save for backward",
      "Z ": "Yes, these tensors should be freed after the backward().To double check it, you could use this example and add some print statements to check the memory: Note that I've replaced the .grad attributes with None to free these tensors as well instead of zeroing them out.",
      "Y ": " Replace the .grad attributes with None to free these tensors as well instead of zeroing them out"
    },
    {
      "X ": "check if model is eval or train",
      "Z ": "module.training is the boolean you are looking for. ",
      "Y ": " using module.training   "
    },
    {
      "X ": "backpropagation in a multi branched cnn rnn network",
      "Z ": "Hi Louis-Vincent! lv_Poellhuber:When I apply the loss, would it backpropagate properly? I‚Äôm afraid it would only apply the loss to the LSTM and not the CNN, since they‚Äôre not directly connected.I haven't looked at your code, but yes, in general the kind of thingyou propose doing should work just fine.Make sure you understand the basics of how pytorch uses therequire_grad property of a tensor.  Normally the weights, etc.,of your model, and the tensors that depend on them, will haverequires_grad = True.  The input to your CNN would normallyhave requires_grad = False.  However, the output of your CNNwill have requires_grad = True (because it depends on theweights of the CNN).  The connection‚ between the LSTM andthe CNN is provided by the fact that the output of your CNN ‚ thathas requires_grad = True  is also the input to your LSTM.  Thisis what permits pytorch‚Äôs autograd system to backpropagate yourfinal loss through the LSTM all the way back to the weights of yourupstream CNN.",
      "Y ": "Make sure you understand how pytorch works with a tensor's require grad attribute. The weights, etc., of your model, as well as the tensors that rely on them, should normally have requires grad = True. Normally, requires grad = False would be the input to your CNN. Your CNN's output, on the other hand, will contain requires grad = True."
    },
    {
      "X ": "modulenotfounderror no module named engine",
      "Z ": "From the tutorial: In references/detection/ , we have a number of helper functions to simplify training and evaluating detection models. Here, we will use references/detection/engine.py , references/detection/utils.py and references/detection/transforms.py . Just copy them to your folder and use them here.",
      "Y ": " use references/detection/engine.py , references/detection/utils.py and references/detection/transforms.py "
    },
    {
      "X ": "division in batches of a 3d tensor",
      "Z ": "I think you can unsqueeze b so that a.dim() = b.dim() import torch n, m, p = 100,5,2 ; a = torch.rand(n, m, p) ;b = torch.rand(n, p);b = b.unsqueeze(dim=1) # n x 1 x p ;z=a/b",
      "Y ": " a.dim() = b.dim()import torch n, m, p = 100,5,2 ;a = torch.rand(n, m, p);b = torch.rand(n, p);b = b.unsqueeze(dim=1) # n x 1 x p ;z=a/b"
    },
    {
      "X ": "efficient way to get ordered permuations",
      "Z ": "I guess you could use itertools.permutations to permute your list.PS: this doesn't seem to be PyTorch-specific, so note that you might get a faster and better answer e.g. on StackOverflow",
      "Y ": "Use itertools.permutations to permute your list."
    },
    {
      "X ": "how to load model weights that are stored as an ordereddict",
      "Z ": "The stored checkpoints contains most likely the state_dicts of the model and optimizer.You would have to create a model instance first and load the state_dict afterwards as explained in the serialization docs:model = MyModel() optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) # change to whatever optimizer was used checkpoint = torch.load(path_to_checkpoint);model.load_state_dict(checkpoint['model']);optimizer.load_state_dict(checkpoint['opt']) ; Note that you need the model definition to create the model instance.",
      "Y ": "model = MyModel() optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) # change to whatever optimizer was used checkpoint = torch.load(path_to_checkpoint) model.load_state_dict(checkpoint['model'])optimizer.load_state_dict(checkpoint['opt'])"
    },
    {
      "X ": "torch nn ctcloss returns infinity for some inputs",
      "Z ": "Hi again, I‚ve found out that that is not a bug or a documentation issue.It due to your special example:your labels are [11,24,24]. Therefore especially the last two are the same. So why does the problem occur, when the lengths are equal? Well, for the input to represent the same label twice in a row, there must be a blank in between those two, as that‚how CTC works, what is impossible to achieve with equal input lengths and target lengths.",
      "Y ": "Well, for the input to represent the same label twice in a row, there must be a blank in between those two, as that's how CTC work"
    },
    {
      "X ": "why do we need gradient checking",
      "Z ": "This tutorial explains how to write custom autograd.Fucntions. TheOraware:Still i am not clear if i write following code framework to train neural network then how could it be break? Calculating gradients loss.backward and update weights optimizer.step are builtin in pytorch how come it will break? Are you talking about changes made by pytroch team within new package/release of pytorch?If you are ‚Äújust using‚Äù built-in PyTorch methods, you are most likely not interested in testing utilities.Yes, since PyTorch is an open source framework with a lot of active developers, things can accidentally break, which is why testing is needed.",
      "Y ": "If you only use PyTorch's built-in methods, you're probably not interested in testing utilities.Yes, because PyTorch is an open source framework with a large community of contributors, things can go wrong, which is why testing is necessary."
    },
    {
      "X ": "store index of max values,How can find the top 3 max values in the tensor and how can I store the index values in one variable",
      "Z ": "Thank you ,But my question is about how to pick the top 3 max values of the indexes and store that indexes in a variable . This is a solution I found frame = sorted(range(len(mul_reward)), key=lambda i: mul_reward[i], reverse=True)[:4]print(frame) con_frame = torch.Tensor(frame)",
      "Y ": "This is a solution I found frame = sorted(range(len(mul_reward)), key=lambda i: mul_reward[i], reverse=True)[: 4] print(frame) con_frame = torch.Tensor(frame)"
    },
    {
      "X ": "different time consumption of the same operation",
      "Z ": "Yes, the nonzero would need some time to be executed, but you are right that your profiling is wrong as it accumulates the model forward time into the nonzero operation, so it looks as if the nonzero op is more expensive than it is.As mentioned before, this operation is synchronizing, so should be avoided for this reason (in case you want to remove all syncs, if possible).markcheung:But I believe these indexing operations should be less effective than convolutions when the code is running on GPU.I would claim it depends on the actual workflow, so you could profile it with synchronizations and compare the runtimes, if needed.",
      "Y ": "Yes, the nonzero operation will take some time to complete, but you are correct that your profiling is incorrect since it adds the model forward time to the nonzero operation, making it appear like the nonzero operation is more expensive than it actually is.As previously stated, this operation is synchronous and should therefore be avoided (in case you want to remove all syncs, if possible).I'd say it depends on the procedure in question, so you could profile it with synchronizations and compare runtimes if necessary."
    },
    {
      "X ": "skip weights in covolution",
      "Z ": "Hi NPC!NPC: Lets say i have a kernel like [a, b, c].But i want  a == c (or more precisely I want the optim. to threat it as one parameter).To me the conceptually cleanest approach would be to build your constrained kernel from a separate two-element Parameter thatyou actually optimize.  I believe that the weight of a torch.nn.Conv1d has tobe a leaf tensor, so we have to use the functional form, torch.nn.functional.conv1d ().  By using differentiablepytorch tensor operations to build ker from k, you will be ableto backpropagate through the construction of ker and properlyoptimize the two elements of k.",
      "Y ": "We must use the functional form, torch.nn.functional.conv1d, because I assume the weight of a torch.nn.Conv1d must be a leaf tensor (). You may backpropagate through the creation of ker and appropriately optimise the two elements of k by utilising differentiable pytorch tensor operations to build ker from k."
    },
    {
      "X ": "what pytorch means by buffers",
      "Z ": "Buffers are tensors, which are registered in the module and will thus be inside the state_dict.These tensors do not require gradients and are thus not registered as parameters.This is useful e.g. to track the mean and std in batchnorm layers etc. which should be stored and loaded using the state_dict of the module.",
      "Y ": "Buffers are tensors that have been registered in the module and are thus contained within the state dict.Because these tensors don't require gradients, they aren't recorded as parameters.This is important for tracking the mean and standard deviation in batchnorm layers, for example, which should be stored and loaded using the module's state dict."
    },
    {
      "X ": "if i write a function using class name nn module and then call it 3 times will each call have its own parameters or will they share parameters",
      "Z ": "Hi,Everytime you instantiate the class WienerFilter() it will call the __init__ function and thus create a new set of Paramters.",
      "Y ": "you instantiate the class WienerFilter() it will call the __init__ function"
    },
    {
      "X ": "runtimeerror result type double cant be cast to the desired output type long",
      "Z ": "OK, I think the issue is that the targets are supposed to be of float dtype rather than long here. Does tgts.to(torch.float) work ",
      "Y ": "tgts.to(torch.float)"
    },
    {
      "X ": "concatenating images",
      "Z ": "If you concatenate the images, you'll get ,less‚samples, so I'm not sure how you would like to keep the batch size as 6.Could you explain your use case a bit?Since you are now dealing with multi-hot encoded targets (i.e. multi-label classification), you could use nn.BCELoss or nn.BCEWithLogitsLoss instead.",
      "Y ": "you could use nn.BCELoss or nn.BCEWithLogitsLoss instead"
    },
    {
      "X ": "cuda out of memory error during forward pass",
      "Z ": "Yes, Autograd will save the computation graphs, if you sum the losses (or store the references to those graphs in any other way) until a backward operation is performed.To accumulate gradients you could take a look at this post, which explains different approaches and their computation as well as memory usage.",
      "Y ": "If you sum the losses (or store the references to those graphs in any other way), Autograd will save the computation graphs until a backward operation is performed.You may learn how to accumulate gradients by reading this post, which describes alternative ways, their calculation, and memory usage."
    },
    {
      "X ": "do pytorch containers come with cudnn installed",
      "Z ": " mkserge: Edit: So I dug into the source code a bit, and it looks like pytorch has a completely separate implementation of cuDNN inside it‚Äôs own codebase. Is this true?No, PyTorch uses the official cudnn release and either links it dynamically or statically. Note that you are using the runtime container, so nvcc isn't installed either: root@f79b17da2a55:/workspace# nvcc --version bash: nvcc: command not found Also, the lib path is also empty: root@f79b17da2a55:/workspace# ls /usr/local/nvidia ls: cannot access '/usr/local/nvidia': No such file or directory If you want to build applications inside the container, use the devel container: root@389363a6c5ec:/workspace# find /usr/ -name libcudnn.so /usr/lib/x86_64-linux-gnu/libcudnn.so root@389363a6c5ec:/workspace# nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2020 NVIDIA Corporation Built on Wed_Jul_22_19: 09: 09_PDT_2020 Cuda compilation tools, release 11.0, V11.0.221 Build cuda_11.0_bu.TC445_37.28845127_0",
      "Y ": "PyTorch utilises the official cudnn release and dynamically or statically links to it.Because you're using the runtime container, you don't need to install nvcc:Use the devel container to develop applications inside the container:"
    },
    {
      "X ": "loading big dataset bigger than memory using pytorch",
      "Z ": "You can start by taking a look at the default dataset classes: torch.utils.data  PyTorch 1.8.1 documentation and seeing if your data fits the map style of iterable style abstraction. The map style is usually a straightforward abstraction for many datasets as you only need to define an __getitem__ and a __len__ function. Once you have a usable dataset, using a dataloader torch.utils.data.dataloader ‚PyTorch 1.8.1 documentation will handle the parallelization and loading in memory for you.",
      "Y ": "Use torch.utils.data  "
    },
    {
      "X ": "problem with backprop runtime error expected tensoroptions on cuda but got tensoroptions on cpu with function mulbackward0",
      "Z ": "Could you post a minimal executable code snippet using your classes to reproduce this issue? Also, which PyTorch version are you using at the moment? If you are not using the latest stable version (1.7.0), could you update to it?",
      "Y ": "Could you please provide a simple executable code snippet that uses your classes to reproduce the problem?Also, what PyTorch version are you currently using? Could you perhaps update to the newest stable version (1.7.0) if you aren't already using it"
    },
    {
      "X ": "log softmax and softmax",
      "Z ": "Hello Carmelo! ccalafiore: Or do I need to train the model with nn.Softmax() to be able to use nn.Softmax() during the test phase?You can do anything you want in your test phase, regardless of how you trained your model.  (Just make sure that your test-phase results mean what you want them to mean.) You say you use CrossEntropyLoss.  The output of your model should therefore be raw-score logits, and these will most likely be the output of a final Linear layer without any subsequent activation function.  That‚Äôs fine. if you now want to do something in your test phase with the actual probabilities (instead of the logits), pass the output of your model through softmax() to convert them. (You should think through why you want probabilities in your test phase and whether or not you need them.  If all you want to do is convert those probabilities to integer class-label predictions, you can do that by applying agrmax() directly to the logits without converting them to probabilities first.)",
      "Y ": "Regardless of how you trained your model, you can do whatever you want in the test phase. (Just make sure the findings of the test phase imply what you want them to mean.)You mention CrossEntropyLoss. Raw-score logits should be the output of your model, and these will most likely be the output of a final Linear layer with no subsequent activation function. That's all right.If you wish to do something in your test phase with actual probabilities (rather than logits), use softmax() to convert the output of your model."
    },
    {
      "X ": "fourier transform and complex dtype restrictions",
      "Z ": "You might want to try to do the multiplication by hand as I think the regular multiplication is already supported:# replace t = torch.prod(t, dim=1) by dim = 1",
      "Y ": "dim = 1"
    },
    {
      "X ": "some additional output of forward function",
      "Z ": "I think you want ctx.mark_non_differentiable.It does not, however, spare you from taking a second grad_out argument in the backward even if it will be all zero (and indeed fully materialized at the moment):",
      "Y ": "Custom Function  MyFunction(torch.autograd.Function)"
    },
    {
      "X ": "how to add additional information for backward",
      "Z ": "Your backward won,'t be called because you return the original input inp‚ (This is changed in 0.3, but until then return inp.clone()) Your backward takes in too many arguments. It should only take in grad_out because the forward only returns a single argument. Your backward returns too many values. It should only return a single value because the forward only takes in a single value.",
      "Y ": "Because you return the original input inp (this is changed in 0.3, but before then, return inp.clone()), your reverse won't be called. Your backwards accepts much too many counter-arguments. Because the forward only returns a single argument, it should just take in grad out. Your reverse returns an excessive number of values. Because the forward only accepts a single argument, it should only return a single value."
    },
    {
      "X ": "torch model parameters by dictionary type",
      "Z ": "Using a Python dict will not properly register the parameters so use nn.ParameterDict instead. ",
      "Y ": "Use nn.ParameterDict  "
    },
    {
      "X ": "plot magnitude of gradient in sdm with training step",
      "Z ": "You can get all gradients by iterating the parameters after the backward operation: for name, param in model.named_parameters(): print(name, param.grad) Using this approach you can store the gradient magnitude and plot it for each iteration using e.g. matplotlib. To get the learning rate and other internal states you can use optimizer.state_dict() or optimizer.param_groups[id] and check th",
      "Y ": "for name, param in model.named_parameters(): print(name, param.grad)"
    },
    {
      "X ": "gradient is none after applying custom mask",
      "Z ": "module1 should create valid gradients as long as mask2 isn't full of zeros",
      "Y ": "module1 should create valid gradients as long as mask2 isn't full of zeros "
    },
    {
      "X ": "backpropagation over summed hidden states in birnn",
      "Z ": "Hi,Do you actually see an error? From your comments, you should not set requires_grad=True on the output if you don't  need to access its .grad field later. And you don't need to wrap things in Variables So it should work just fine yes.",
      "Y ": "If you don't need to use the output's.grad field afterwards, don't set requires grad=True on it.You also don't have to wrap everything with Variables.So, yeah, it should work properly."
    },
    {
      "X ": "where is the grad defined",
      "Z ": "Modules don't have a grad definition, but parameters or tensors.You can find the attribute/property definition here. tensor.grad will be populated e.g. after a backward operation or torch.autograd.grad call and will give you the gradient of X w.r.t. this tensor.",
      "Y ": "Modules lack a grad definition in favour of parameters or tensors.The attribute/property definition can be found here. Tensor.grad will be filled in after a backward operation or torch, for example. The autograd.grad function will return the gradient of X with respect to this tensor."
    },
    {
      "X ": "differentiate through sgd step",
      "Z ": "Hi,Unfortunately there isn't as the  nn.Module is by design, not functional. I woudl advise using a library like with this  https: //github.com/facebookresearch/higher that handles all of that for you (and provide differentiable version of the pytorch optimizers as well).",
      "Y ": "Unfortunately there isn't as the  nn.Module is by design, not functional , and  refer this link https://github.com/facebookresearch/higher "
    },
    {
      "X ": "let custom autograd function know if create graph is enabled",
      "Z ": "Hi, You can check if torch.is_grad_enabled() in the backward and if gd.requires_grad.That will tell if you something wants the gradients to be computed for your function. Namely if grad mode is enabled and the input requires_grad then you should create the graph. Otherwise, it is not needed.",
      "Y ": "check if torch.is_grad_enabled() in the backward and if gd.requires_grad. Namely if grad mode is enabled and the input requires_grad then you should create the graph. Otherwise, it is not needed."
    },
    {
      "X ": "kernel size cant greater than actual input size",
      "Z ": "Your last conv layer in your encoder throws this error, since it receives an input of [1, 30, 1017, 1] and tries to use kernel_size=3. You have to lower the kernel size to 1 or modify your model somewhere before this layer.",
      "Y ": "You must either reduce the kernel size to 1 or adjust your model before to this layer."
    },
    {
      "X ": "problem with autograd",
      "Z ": " Andrea_Rosasco: What could I do to prevent this? If you use nightly build, you can pass inputs= to .backward() to specify what you want to compute gradients for. In this case ds_loss.backward(inputs=(lr_list[j],), retrain_graph=True). Otherwise, you can do grad, = autograd.grad(ds_loss, lr_list[j], retain_graph=True); lr_list[j].grad = grad but it is not as nice ",
      "Y ": "ds_loss.backward(inputs=(lr_list[j],), retrain_graph=True). Otherwise, you can do grad, = autograd.grad(ds_loss, lr_list[j], retain_graph=True"
    },
    {
      "X ": "triplet loss backprop equations,However, I am still confused about the computation of backprop in this implementation. Specifically, as PyTorch accumulates the derivatives, the gradients of the triplet loss w.r.t. to the last linear layer (embedding) ",
      "Z ": "Right but these formulas give the gradients wrt to the output of the net, not the weights. And you never actually sum the gradient contribution of the outputs (because these are different Tensors), only the weights and actually shared and so gradients are accumulated",
      "Y ": "Correct, however these formulas only supply the gradients for the net's output, not the weights. And because the outputs are different Tensors, you never truly aggregate the gradient contribution of the outputs; just the weights are shared, and hence gradients are compounded."
    },
    {
      "X ": "reentering variable execution ending run backward twice error trying to backward through the graph a second time but the saved intermediate results have already been freed",
      "Z ": "Hey guys, I figured it out. As someone new to using BERT encodings, I forgot to wrap my BERT generated embeddings with with torch.no_grad() I pray for any new NLP learners out there to find this post and not spend 12+ hours debugging. Good luck to you all!",
      "Y ": " surround  BERT generated embeddings with torch.nograd()  "
    },
    {
      "X ": "torchvision totensor dont change channel order,When you use transforms.ToTensor() , by default it changes the input arrays from HWC to CHW order.Is it possible to have ToTensor not do this if my input data are already in the ordering I want?What can I replace this code with to make it keep the ordering I want and still convert to a torch tensor?",
      "Z ": "As there is no parameter in transforms.ToTensor()  to control this behavior, the easiest and probably efficient way to solve this issue is to do what PyTorch have in its source code. <li>        img = torch.from_numpy(np.array(pic, np.int16, copy=False))</li><li>    elif pic.mode == 'F':</li><li>        img = torch.from_numpy(np.array(pic, np.float32, copy=False))</li><li>    elif pic.mode == '1':</li><li>        img = 255 * torch.from_numpy(np.array(pic, np.uint8, copy=False))</li><li>    else:</li><li>        img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))</li><li></li><li>    img = img.view(pic.size[ 1], pic.size[0], len(pic.getbands()))</li><li>    # put it from HWC to CHW format</li><li class=selected>    img = img.permute((2, 0,1)).contiguous()</li><li>    if isinstance(img, torch.ByteTensor):</li><li>        return img.float().div(255)</li><li>    else:</li><li>        return img</li> As you can see, it permutes the channels, So, same code can be used to reverse the order.the reason that PyTorch is doing this is that, PyTorch uses CHW convention by default. ",
      "Y ": "the reason that PyTorch is doing this is that, PyTorch uses CHW convention by default."
    },
    {
      "X ": "runtime error expanded size error,in this case bias.shape = torch.Size([1104])  and result.shape = torch.Size([1,1104,15,20])I am getting the error in the line: result += bias.unsqueeze(0).expand_as(result)",
      "Z ": "The error is raised due to a mismatch in the number of dimensions.New dimensions would be added to the front of the expanded tensor, so you would have to unsqueeze the missing last dimensions:bias = torch.randn([1104]) result = torch.randn([1,1104,15,20]) bias[None,: , None, None].expand_as(result)",
      "Y ": "Due to a discrepancy in the number of dimensions, the error is increased.New dimensions are added to the front of the expanded tensor, so that the missing last dimensions are unpressed:"
    },
    {
      "X ": "handling large 3d image dataset with dataloader",
      "Z ": "the  num_workers specified in the DataLoader are executed on the CPU using multiprocessing and will not be shown on nvidia-smi The GPU utilization might be low, if you are facing bottlenecks in your code, such as data loadingTake a look at this post with link https: //discuss.pytorch.org/t/how-to-prefetch-data-when-processing-with-gpu/548/19 which explains it well and suggests some performance improvements,This is not correct. The model will use the specified device and PyTorch will not automatically use the CPU based on the performance of the DataLoader",
      "Y ": "The DataLoader num workers are executed with multiprocessing on the CPU and will not be displayed on the nvidia-smiIf you face bottlenecks in your code, for example data loading, GPU use may be low.See this post, which is good explanation and suggests certain improvements in performance, that's not right. The model uses the device and PyTorch won't use the CPU automatically based on DataLoader performance"
    },
    {
      "X ": "data sampler to handle class imbalance,Is there a way in PyTorch to make such data loader and/or data sampler to handle aforementioned scenario?",
      "Z ": "If you want a predefined number of samples from each class, you could implement a custom sampler which would yield the batch indices using your desired logic.On the other hand, if your use case would allow weighted random sampling with replacement, you could use a  WeightedRandomSampler as described here with link  https: //discuss.pytorch.org/t/how-to-handle-imbalanced-classes/11264/2",
      "Y ": "use WeightedRandomSampler "
    },
    {
      "X ": "training process keep getting killed ",
      "Z ": "could you try to run it via <code>gdb</code> and see, if you would get any proper error? gdb --args python train.py --model.... ... run .. bt",
      "Y ": "gdb --args python train.py --model...."
    },
    {
      "X ": "torchvision totensor dont change channel order,When you use transforms.ToTensor() , by default it changes the input arrays from HWC to CHW order.Is it possible to have ToTensor not do this if my input data are already in the ordering I want?What can I replace this code with to make it keep the ordering I want and still convert to a torch tensor?",
      "Z ": "As there is no parameter in transforms.ToTensor()  to control this behavior, the easiest and probably efficient way to solve this issue is to do what PyTorch have in its source code. <li>        img = torch.from_numpy(np.array(pic, np.int16, copy=False))</li><li>    elif pic.mode == 'F':</li><li>        img = torch.from_numpy(np.array(pic, np.float32, copy=False))</li><li>    elif pic.mode == '1':</li><li>        img = 255 * torch.from_numpy(np.array(pic, np.uint8, copy=False))</li><li>    else:</li><li>        img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))</li><li></li><li>    img = img.view(pic.size[ 1], pic.size[0], len(pic.getbands()))</li><li>    # put it from HWC to CHW format</li><li class=selected>    img = img.permute((2, 0,1)).contiguous()</li><li>    if isinstance(img, torch.ByteTensor):</li><li>        return img.float().div(255)</li><li>    else:</li><li>        return img</li> As you can see, it permutes the channels, So, same code can be used to reverse the order.the reason that PyTorch is doing this is that, PyTorch uses CHW convention by default. ",
      "Y ": "the reason that PyTorch is doing this is that, PyTorch uses CHW convention by default."
    },
    {
      "X ": "runtime error expanded size error,in this case bias.shape = torch.Size([1104])  and result.shape = torch.Size([1,1104,15,20])I am getting the error in the line: result += bias.unsqueeze(0).expand_as(result)",
      "Z ": "The error is raised due to a mismatch in the number of dimensions.New dimensions would be added to the front of the expanded tensor, so you would have to unsqueeze the missing last dimensions:bias = torch.randn([1104]) result = torch.randn([1,1104,15,20]) bias[None,: , None, None].expand_as(result)",
      "Y ": "Due to a discrepancy in the number of dimensions, the error is increased.New dimensions are added to the front of the expanded tensor, so that the missing last dimensions are unpressed:"
    },
    {
      "X ": "handling large 3d image dataset with dataloader",
      "Z ": "the  num_workers specified in the DataLoader are executed on the CPU using multiprocessing and will not be shown on nvidia-smi The GPU utilization might be low, if you are facing bottlenecks in your code, such as data loadingTake a look at this post with link https: //discuss.pytorch.org/t/how-to-prefetch-data-when-processing-with-gpu/548/19 which explains it well and suggests some performance improvements,This is not correct. The model will use the specified device and PyTorch will not automatically use the CPU based on the performance of the DataLoader",
      "Y ": "The DataLoader num workers are executed with multiprocessing on the CPU and will not be displayed on the nvidia-smiIf you face bottlenecks in your code, for example data loading, GPU use may be low.See this post, which is good explanation and suggests certain improvements in performance, that's not right. The model uses the device and PyTorch won't use the CPU automatically based on DataLoader performance"
    },
    {
      "X ": "data sampler to handle class imbalance,Is there a way in PyTorch to make such data loader and/or data sampler to handle aforementioned scenario?",
      "Z ": "If you want a predefined number of samples from each class, you could implement a custom sampler which would yield the batch indices using your desired logic.On the other hand, if your use case would allow weighted random sampling with replacement, you could use a  WeightedRandomSampler as described here with link  https: //discuss.pytorch.org/t/how-to-handle-imbalanced-classes/11264/2",
      "Y ": "use WeightedRandomSampler "
    },
    {
      "X ": "training process keep getting killed ",
      "Z ": "could you try to run it via <code>gdb</code> and see, if you would get any proper error? gdb --args python train.py --model.... ... run .. bt",
      "Y ": "gdb --args python train.py --model...."
    },
    {
      "X ": "how to reset gru, Should I reset hidden state of GRU to predict MOS value for another video file? because hidden state of GRU may affect different behavior?,Should I reset hidden state of GRU to predict MOS value for another video file?<br>because hidden state of GRU may affect different behavior?",
      "Z ": "The <code>nn.GRU</code> module expects the initial hidden state as the second input argument and will default it to zeros, if none is provided.<br>In case you are not manually passing the hidden state to it, you are already using the zeroed out hidden state.",
      "Y ": "nn.GRU module expects the intial hidden state"
    },
    {
      "X ": "pytorch training network with cudnn inference network",
      "Z ": "Unfortunately, I don’t know how cudnn logging can be enabled in Jupyter notebooks or Google Colab and also don’t know if it’s a limitation of the former of latter.<br>One common issue users are seeing when trying to set env variables inside a notebook is the order:<br>e.g. setting CUDA_VISIBLE_DEVICES  or CUDA_LAUNCH_BLOCKING</code> after creating the CUDA context wouldn’t have any effect. Maybe you are facing a similar issue and could move the <code>os.environ</code> calls to the very beginning of the notebook.</p>",
      "Y ": "etting CUDA_VISIBLE_DEVICES  or CUDA_LAUNCH_BLOCKING after creating the CUDA context wouldn’t have any effect."
    },
    {
      "X ": "difference between adam and adamw implementation,What is the difference between the implementation of Adam(weight_decay=…) and AdamW(weight_decay=…)?",
      "Z ": "please check the paper behind AdamW with link https://arxiv.org/pdf/1711.05101.pdf",
      "Y ": " check the paper behind AdamW"
    },
    {
      "X ": "even though the seed is fixed i still get different results,I even add torch.manual_seed(seed) and torch.cuda.manual_seed_all(seed) in init() of every model or submodel. The trained models still got different performance on test dataset.",
      "Z ": "Have a look at Reproducibility docs with link   https://pytorch.org/docs/stable/notes/randomness.htmlReproducibility docs and in particular set torch.use_deterministic_algorithms(True), which would yield an error, if no deterministic algorithms can be found for your workload",
      "Y ": "torch.use_deterministic_algorithms(True)"
    },
    {
      "X ": "typeerror new received an invalid combination of arguments got tensor int but expected one of torch device device didnt match because some of the arguments have invalid types tensor int",
      "Z ": "So the problem looks to be in the way you defined your model. In pytorch when using a model and the nn library you need to define the model before you can execute the forward pass in it. Currently you do  not define your model before you are trying to use it so it is still expecting the number of classes and not an input to the model. You probably want to define your model before hand. You can do something like this:   model = ResNet18(num_classes)train(train_dataset,model,criterion1,criterion2,optimizer,epoch=2,result_directory=result_dir )then just input how many classes the model should output where I wrote num_classes. This will initialize your model correctly",
      "Y ": "model = ResNet18(num_classes)train(train_dataset,model,criterion1,criterion2,optimizer,epoch=2,result_directory=result_dir )"
    },
    {
      "X ": "splitting unbalanced classes of images into training validation and testing sets",
      "Z ": "You could call train_test_split again on train_idx instead of the np.arange input with stratify=class_array[train_idx] to create the train and validation indices. Let me know, if this works",
      "Y ": "stratify=class_array[train_idx] "
    },
    {
      "X": "simplest lstm possible and it does not work",
      "Z ": "Hi Tom The problem is that the accumulated gradients are not reset before each iteration. Therefore the optimizer is updating the weights based on not just the gradients from the current iteration but also all previous iterations. As explained in the  torch.optim docs with link https: //pytorch.org/docs/master/optim.html#taking-an-optimization-step you need to call  optimizer.zero_grad()to reset the gradients at each step.With this diff:y = torch.tensor(np.arange(2,L+2)/20,  , grad_fn=&lt;CopySlices",
      "Y ": "you need to call  optimizer.zero_grad()to reset the gradients at each step."
    },
    {
      "X ": "combine several features,How can I use several features in my net, e.g. current token, next token, current pos-tag, next pos-tag?",
      "Z ": "You may need two embedding matrices, one for token, one for pos tags. And the network is like this input token index -&gt; token embedding matrices -&gt; token embeddinginput pos-tag index -&gt; pos-tag embedding matrices -&gt; pos-tag embeddingand then concatenate token embedding and its corresponding pos-tag embedding",
      "Y ": "Two embedding matrices, one for tokens and the other for pos tags, may be required. And the network looks like this: input token index -&gt; token embedding matrices -&gt; token embedding input pos-tag index -&gt; pos-tag embedding concatenate token embedding and its corresponding pos-tag embedding"
    },
    {
      "X ": "is there any comprehensive torchtext tutorial,, I was wondering that is there any comprehensive “official” tutorial for torchtext package?",
      "Z ": "If you’ve ever worked on a project for deep learning for NLP, you’ll know how painful and tedious all the preprocessing is. Before you start training your model, you have to: Read the d… with link  https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/ ",
      "Y ": "A Comprehensive Introduction to Torchtext (Practical Torchtext part 1)"
    },
    {
      "X ": "confused by pytorch embeding layer,I have a word index tensor, it was converted to python list type after I pass it to an embedding layer! What’s the problem?",
      "Z ": "It’s my fault 0.0, there is a bug in my code",
      "Y ": "Bug in the code"
    },
    {
      "X ": "what kind of attention mechanism used in seq2seq tutorial,",
      "Z ": "I find this Seq2Seq example more useful with link  https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb As far as I can tell, it expands on the linked tutorial by using different attention mechanisms (Bahdanau &amp; Luong). The tutorial uses only the Bahdanau approach.",
      "Y ": "It uses only the Bahdanau   & Luong approach."
    },
    {
      "X ": "Convert int * into tensor",
      "Z ": "Yes there is a good way: use torch::from_blob with link https://pytorch.org/cppdocs/api/function_namespacetorch_1aff6f8e6185457b2b67a1a9f292effe6b.html ",
      "Y ": "torch::from_blob "
    },
    {
      "X ": "compilation c extension using recentpytorch version and cuda 1 20 and 10 1",
      "Z ": "Maybe it’s something about having the version it is compiled with and the one it is run with go out of sync?My impression is that unless you use some internal API, that is the most common source of the error.",
      "Y ": "Maybe it's because the version it's compiled with and the one it's running with aren't in sync?Unless you're using an internal API, I believe this is the most typical cause of error."
    },
    {
      "X ": "source code for conv2d,Can someone point to the cpp code for conv2d. I would like to know how pytorch implements convolution internally.",
      "Z ": "We try to use highly efficient libraries such as cudnn or mkldnn as much as possible. So in practice, their code is most likely to be the one that runs when you call conv.he general implementation we have that works in all cases can be found there with link  https://github.com/pytorch/pytorch/blob/4f1f084d221a7ab9edbf09ab1904edde6d49848c/aten/src/THCUNN/generic/SpatialConvolutionMM.cu for gpu and with link  https://github.com/pytorch/pytorch/blob/master/aten/src/THNN/generic/SpatialConvolutionMM.c for cpu ",
      "Y ": "Hi, We try to use high-performance libraries like cudnn and mkldnn as much as possible. In practise, their code is most likely to be the one that is executed when conv is called. There you can find the general implementation we have that works in all cases: for gpu and for cpu."
    },
    {
      "X ": "how to create multiple distributeddataparallel tasks on a single node, I’m trying to start 2 training tasks on a single node and I want each task to occupy 2 GPUs respectively.",
      "Z ": "Hi!See  the link https: //pytorch.org/docs/stable/distributed.html#environment-variable-initialization for an overview of the env initialization method. Also see the help output of the launch utility run with help You’ll find that you’re silently trying to use the same port on localhost for the processes in a single task to find each other. Specify a different port for each task and it will work.",
      "Y ": "A general overview of the Env initialization method is the environment-variable-initialization. See also the aid of the launch utility running with assistance You will discover that you try to use the same port on localhost to find each other in a single task. Specify and it will work for a different port for each task."
    },
    {
      "X ": "gradient scaling in federated learning,What should I do if I want to scale some workers’ gradient since their data are more valuable than others? ",
      "Z ": "tensor.register_hook(customHook) may work, you need to write customHook to modify grad of the tensor.but as far as I know customHook should be a function of grad only. For your case, you want to make customHook to be a function of grad and workerRank as well?",
      "Y ": "Tensor.register hook(customHook) may work, but you must write customHook to change the tensor's grad. However, as far as I know, customHook should only be a function of grad. In your case, do you want customHook to be a function of both grad and workerRank?"
    },
    {
      "X ": "accessing tensors present on different gpus",
      "Z ": " To send all tensors to one GPU, you’d want to use dist.gather, which will gather all of the tensors onto one gpu (this is assuming you have one process running per gpu). If your tensor is t, then your call would look like: t = your_tensor(size) # rank 0 is the node all tensors will be gathered on Then, you can compute what you want and send the result back wrapped in a tensor via a <code>dist.scatter</code>. To make sure the tensors don’t change on the gpu before they are gathered, ensure all nodes call into <code>gather</code> at the same time when all nodes have the desired value. You could also use torch.distributed.barrier if you need additional synchronization. Check out the docs at https: //pytorch.org/docs/stable/distributed.html",
      "Y ": "Then, you can compute what you want and send the result back wrapped in a tensor via a <code>dist.scatter</code>. To make sure the tensors don’t change on the gpu before they are gathered, ensure all nodes call into <code>gather</code> at the same time when all nodes have the desired value. You could also use torch.distributed.barrier if you need additional synchronization. Check out the docs at https://pytorch.org/docs/stable/distributed.html"
    },
    {
      "X ": "memory issue of using nn dataparallel, ’m currently using nn.DataParallel for mutli-gpu (8-gpu) training in a single node. However, if I put the data and model to devices[0], I found the memory on GPU 0 will be huge and make the program exits (cuda out of memory) at the begining of training.",
      "Z ": "This effect is described by thomas_wolf with the link https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255,We generally recommend using DDP ",
      "Y ": "We generally recommend using DDP"
    },
    {
      "X ": "loading of duplicated data in distributed train,I would like to pretrain BERT by using DDP",
      "Z ": "Did you store the complete dataset in a single tensor?,If so, I think you might need to load it once and store smaller chunks of the data (and load only certain chunks in each process) or load the data lazily from the beginning",
      "Y ": "Did you keep the entire dataset in a single tensor? If that's the case, I believe you should load it once and store smaller chunks of data (and load only certain chunks in each process) or load the data lazily from the start."
    },
    {
      "X ": "change my username",
      "Z ": "Sure, it should be good now",
      "Y ": "Sure, it should be good now"
    },
    {
      "X ": "gradient of output of network with respect to parameters,I need to calculate the gradient of output of network with respect to the parameters of network ( say with respect to first layer weights).",
      "Z ": "torch.autograd.functional.jacobian works.",
      "Y ": "torch.autograd.functional.jacobian"
    },
    {
      "X ": "pytorch 1 8 and rtx 3070,Should we except any performance changes from pytorch 1.8 Cuda 11.1 in comparison to pytorch 1.7.1 with Cuda 11?",
      "Z ": "You shouldn’t expect a lot of changes for CNNs, as both versions ship with cudnn8.0.5. Due to  this issue we couldn’t use the cudnn8.1 release and are working on it with link  https://github.com/pytorch/pytorch/issues/50153,For the best performance on your 3070 you could build from source using the latest cudnn release",
      "Y ": "For the best performance on your 3070 you could build from source using the latest cudnn release"
    },
    {
      "X ": "how to include gradient of nn wrt the input in the loss function",
      "Z ": "I guess the error is raised by using retain_graph=True and trying to update parameters multiple times, which would be wrong, since the gradient would be stale.Could you explain why you are using this argument?",
      "Y ": "I believe the error is caused by using retain graph=True and repeatedly updating parameters, which would be incorrect because the gradient would be stale."
    },
    {
      "X ": "what step backward and zero grad do",
      "Z ": "Hopefully, you use them in the other order - opt.zero_grad(), loss.backward(), opt.step().zero_grad clears old gradients from the last step (otherwise you‚Äôd just accumulate the gradients from all loss.backward() calls).loss.backward() computes the derivative of the loss w.r.t. the parameters (or anything requiring gradients) using backpropagation.opt.step() causes the optimizer to take a step based on the gradients of the parameters.",
      "Y ": " use them in the other order - opt.zero_grad(), loss.backward(), opt.step()."
    },
    {
      "X ": "why loss backward didnt update parameters gradient",
      "Z ": "In self.Bi_weight = Parameter(torch.Tensor(self.weight.shape)).cuda() the .cuda() is computation and you don't have a Parameter in self.Bi_weight.Use self.Bi_weight = Parameter(torch.Tensor(self.weight.shape).cuda()) or better yet just leave the .cuda() alone and do model.cuda() at the end.Best regards Thomas",
      "Y ": "Use self.Bi_weight = Parameter(torch.Tensor(self.weight.shape).cuda()) or better yet just leave the .cuda() alone and do model.cuda() at the end"
    },
    {
      "X ": "custom loss function runtimeerror tensor does not have a grad fn",
      "Z ": "You are breaking the computation graph by recreating tensors and the deprecated Variables. To keep the computation graph intact you would need to apply operations on tensors only, so remove ... Variable(output),Variable(target) torch.tensor(error_diff,requires_grad=True) loss = Variable(loss) etc.",
      "Y ": "Remove Variable(output),Variable(target) torch.tensor(error_diff,requires_grad=True) loss = Variable(loss)"
    },
    {
      "X ": "custom autograd function must it be static",
      "Z ": "Hi,Option (1) is the old way to define Functions. This does not support gradients of gradients and it‚Äôs support might be discontinued in the future (not sure when).The second one is the way to go. Note that you can do exactly the same thing as you can save arbitrary stuff in the ctx (the same way you would save in self in (1)), and the apply method that calls forward accept any parameter, so you can just pass what you used to give the __init__() function here. That means that you don‚Äôt need to define options globally, just pass them to the forward method.",
      "Y ": "Option (1) is the traditional method of defining functions. This does not support gradients of gradients, and it may be phased out in the future (not sure when).The second option is the best option. Note that you can save arbitrary data in the ctx (in the same manner that you could save data in self in (1)), and the apply method that calls forward accepts any parameter, so you can just send whatever you passed to the __init__() function here. That is, instead of defining options globally, you may simply send them to the forward method."
    },
    {
      "X ": "can gradient back propagate through circular padding,What about circular or reflect mode? In these two modes, the padding values are the sliced results of the previous layer, so that the gradient can back-propagate to previous layers",
      "Z ": "Yes the gradients will flow back to the input corresponding to everywhere it is used (including padding).",
      "Y ": "As a result, the gradients will flow back to the input corresponding to wherever it is used (including padding)."
    },
    {
      "X ": "error running multiple models in torch 1 7 1 but works in torch 1 0",
      "Z ": "Yes, the inplace updates of parameters are raising an error now, if you are using stale gradients as described in the 1.5 release notes (described in the torch.optim optimizers changed to fix in-place checks for the changes made by the optimizer section).The reason is that the gradient computation would be incorrect. In your example you would calculate loss1 and loss2 using the model parameters in the initial state s0. loss1.backward() calculates the gradients and opt1.step() updates the parameters to state s1.loss2.backward() was computed using the model in state s0 and would thus calculate the gradients of loss2 w.r.t. parameters s0, while the model is already updated to s1. These gradients would thus be wrong and the error is raised.",
      "Y ": "Yes, if you're using stale gradients as indicated in the 1.5 release notes (explained in the torch), the inplace changes of parameters are now throwing an issue. Optimizers were updated to fix in-place tests for the optimizer section's changes).The reason for this is because the gradient calculation would be wrong. In your case, you'd use the model parameters in the starting state s0 to calculate loss1 and loss2. loss1.backward() calculates the gradients, while opt1.step() updates the parameters to state s1. loss2.backward() was calculated using the model in state s0, and so calculates the gradients of loss2 w.r.t. parameters s0, even though the model has already been updated to s1. These are the grades"
    },
    {
      "X ": "pytorch jvp slow Are there any known issues on using JVP? Are there any methods to make it more efficient (maybe using other packages)?",
      "Z ": "hi As mentioned in the note on the function definition, the function uses the “double backward trick” to compute the jvp and is thus expected to be slower.We are working on changing that for 1.7.",
      "Y ": "As noted in the function definition note, the function computes the jvp using the double backward trick,hence it is expected to be slower.We're working on a solution for 1.7."
    },
    {
      "X ": "cuda invalid configuration error on gpu only",
      "Z ": "Thanks for replying @ptrblck!  While trying to reproduce the error at home (I was at work when I first got the error) I started with a smaller batch size (I was using 1024), and lo and behold it worked!  I did some investigating, and apparently 1023 works, but any batch_size = 1024 causes that error.  It seems to me that it‚Äôs a GPU memory issue, but I don‚Äôt know why I‚Äôm not getting the traditional RuntimeError: CUDA out of memory.  I‚Äôll double-check that this is the case on my work setup as well, and will reply back tomorrow.",
      "Y ": "Batch size of 1023 works but >= 1024 throwas error"
    },
    {
      "X ": "different train result when just defining extra layers in code",
      "Z ": "Initializing an additional layer will initialize all parameters by calling into the pseudorandom number generator and will thus change all following random calls. This is expected behavior and we had some topics about the same behavior here. with link  https://discuss.pytorch.org/images/emoji/apple/wink.png?v=9",
      "Y ": "By running the pseudorandom number generator to initialise all parameters, you will affect all subsequent random calls. This is normal behaviour, and we've discussed it before on this site. "
    },
    {
      "X ": "tensor backward low level computation",
      "Z ": "Yes, the dtype would be recorded by Autograd and the backward computation would be executed in the same precision as its forward part, if you don't change it.This would also mean that your FP16 gradient updates might easily underflow, which is why we recommend to use the mixed-precision training utility via torch.cuda.amp.",
      "Y ": "If you don't modify it, Autograd will record the dtype and the backward computation will be performed with the same precision as the forward component.This also means that your FP16 gradient updates could easily underflow, which is why we advocate using torch.cuda.amp's mixed-precision training tool."
    },
    {
      "X ": "control the training batch",
      "Z ": "I guess you might be shuffling the data in your training DataLoader, which would thus yield random samples in all batches.If you want to use the same image, you could directly get it by indexing the dataset via: data, target = train_loader.dataset[index].",
      "Y ": "I assume you're shuffling the data in your training DataLoader, resulting in random samples across all batches.If you want to use the same image, you may access it by indexing the dataset with the following command: data, target = train loader.dataset[index]."
    },
    {
      "X ": "how to get the gradients for both the input and intermediate variables",
      "Z ": "Your code should raise a warning and explain the reason and workaround for it:userWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the gradient for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations.print(y.grad)After adding y.retain_grad() you‚Äôll get the gradient value for y.",
      "Y ": "You'll get the gradient value for y after calling y.retain grad()."
    },
    {
      "X ": "why does autograd track things like torch cat",
      "Z ": "Yes and by default the torch.cat operation is tracked, so based on your description you would append the outputs of the model with the complete computation graph and would then backpropagate through all these outputs and the model.",
      "Y ": "Yes, the torch.cat operation is tracked by default, thus based on your description, you'd append the model's outputs to the entire computation graph, then backpropagate via all of these outputs and the model."
    },
    {
      "X ": "confusing about autograd mechanism with pytorch1 8 0",
      "Z ": "Hi,The issue is that old versions of pytorch were not considering the optimizer step as an inplace operation properly. And so this check was not working properly and was not raising an error even though it was computing wrong gradients. This has been fixed in latest master and this error is expected in this case.The problem being that you need the value of the parameters to compute the backward but optimizer.step() modifies them inplace. So you will need to wait for all the backward to be done before doing the step. Or you need to redo the forward after the step to use the new value of the weights.",
      "Y ": "The problem is that older versions of Pytorch didn't properly treat the optimizer step as an inplace operation. As a result, despite computing incorrect gradients, this check was not performing properly and did not raise an error. This has been resolved in the newest master, and in this instance, the error is expected.The issue is that you require the parameter values to compute the backward, but optimizer.step() updates them in the middle of the computation. As a result, you'll have to wait until the entire reverse process is completed before proceeding to the next step. Alternatively, you can perform the forward after the step to use the updated weight values."
    },
    {
      "X ": "pytorch no computation graph mode",
      "Z ": "Hi,To prevent the autograd from saving anything, setting with torch.no_grad(): is enough.The other issue you might have is about how you measure GPU memory as pytorch use a custom allocator and so memory reported by the OS is not necessarily correct.Also when you do out = relu(out), the relu is first computed, and thus a temporary output is allocated. Then the old content of out is discarded and replaced with the temporary one. So I would expect to see a memory bump here.Note that most activation functions have an inplace flag to make them perform their update inplace and avoid this issue.",
      "Y ": "Setting with torch to prevent the autograd from saving anything. no grad(): is sufficient.The other difficulty you may encounter is how you assess GPU memory, as Pytorch uses a proprietary allocator, and so memory provided by the OS is not always accurate.Out = relu(out) also computes the relu first, allocating a temporary output in the process. The original content of out is then discarded in favour of the temporary one. As a result, I'm expecting a memory blip here.It's worth noting that most activation functions provide an inplace flag that forces them to update in place, avoiding this problem."
    },
    {
      "X ": "one of the differentiated tensors appears to not have been used in the graph",
      "Z ": "I cannot find any obvious issues by skimming through the models.Try to add print statements into the forward methods of all models and check, if the activations have a valid .grad_fn via print(x.grad_fn).At one point in a model the statement should return None and the previous operation would detach the tensor from the graph. Also make sure you haven't globally disabled the gradient calculation e.g. via torch.set_grad_enabled(False).",
      "Y ": "By skimming through the models, I am unable to identify any glaring flaws.Try adding print statements to all models' forward procedures and using print(x.grad fn) to see if the activations have a valid.grad fn.The statement should yield None at some point in the model, and the prior operation should remove the tensor from the graph. Also, double-check that you haven't disabled the gradient calculation worldwide, such as using torch.set grad enabled (False)."
    },
    {
      "X ": "error grad can be implicitly created only for scalar outputs",
      "Z ": "As @bharat0to said, your loss is most likely a multi-dimensional tensor, which will thus throw this error.You could add some reduction or pass a gradient with the same shape as loss.",
      "Y ": "Your loss is very certainly a multi-dimensional tensor, resulting in this inaccuracy.You might apply a reduction or pass a gradient that has the same shape as loss."
    },
    {
      "X ": "when should we set torch backends cudnn enabled to false especially for lstm",
      "Z ": "It's a limitation from the cudnn implementation. From the cudnn docs:fwdMode Input. Specifies inference or training mode (CUDNN_FWD_MODE_INFERENCE and CUDNN_FWD_MODE_TRAINING). In the training mode, additional data is stored in the reserve space buffer. This information is used in the backward pass to compute derivatives.Thus if you call model.eval(), you won't be able to calculate the gradients using the cudnn RNN implementation. As described before, calling: model.eval() model.rnn_layer.train() could solve this issue, as it would keep the RNN in training mode.",
      "Y ": "Thus if you call model.eval(), you won't be able to calculate the gradients using the cudnn RNN implementation. As described before, calling:model.eval() model.rnn_layer.train()"
    },
    {
      "X ": "difficulty using multiprocessing num workers",
      "Z ": "I guess this line of code:orch.set_default_tensor_type('torch.cuda.FloatTensor')might be problematic, as it could use CUDA tensors inside the <code>Dataset</code> and thus inside each process, which could then fail due to using multiprocessing with multiple CUDA context instances.",
      "Y ": "torch.set_default_tensor_type('torch.cuda.FloatTensor')"
    },
    {
      "X ": "how to calculate integral of function",
      "Z ": "As you guessed, you can only approximate the integral as a sum, not that the integral you are dealing with is also an expectation, so all your formula can be seen as the expectation over X,Y and delta. In my opinion, you can create a loader over your original dataset, and that yields in addition to X and Y the perturbed X’. Then you can do the rest of the training as usual,One other thing that seems strange in you case, is that you model $i_{omega}$ seems to be non-deterministic. However I think in practice I think this will be a neural network, so the second expectation on X’ can be ignored, and you can just  compute the loss using the output of   $pi_{omega}$You use X’ as your new input to the classification model you are training, i.e., the model labeled with $\theta$, and you compute the first part of the loss as for any classification problem, and you add  the regularization term.",
      "Y ": "You can only approximate the integral as a sum, as you guessed, because the integral you're dealing with is also an expectation, so your entire formula can be viewed as the expectation over X,Y, and delta. In my opinion, you can create a loader over your original dataset that returns the perturbed X' in addition to X and Y. Then you can continue with the rest of the training as usual. Another oddity in your case is that your model $pi omega$ appears to be non-deterministic. However, I believe that in practise, this will be a neural network, so the second expectation on X' can be ignored, and you can simply compute the loss using the output of $pi omega$."
    },
    {
      "X ": "what are the main reasons for receiving runtimeerror stack expects a non empty tensorlist error for torch stack",
      "Z ": "I’m not sure how you are loading the images, but it seems that the used library has trouble loading these problematic image files (e.g. PIL</code> or <code>OpenCV</code>)Based on the error message a <code>None</code> object will be returned which then fails in the <code>torch.stack</code> operationCould you try to load the mentioned images using the image lib and check what the output would be?",
      "Y ": " Based on the error message a None object will be returned which then fails in the <code>torch.stack</code> operationCould you try to load the mentioned images using the image lib and check what the output would be?"
    },
    {
      "X ": "how to free gpu memory changing architectures while training",
      "Z ": "Hi Two things We have a custom allocator, so even when the memory is released, you won’t see it available on nvidia-smi  but you will be able to use it in pytorch.The memory is realeased only when you don’t reference it anymore. You might want to wrap the content of your inner loop in a function so that all the intermediary results go out of scope (and are thus released) between loop iterations.",
      "Y ": "We have a custom allocator so you won't see it on nvidia-smi, even if the memory is released, but can use it in pytorch.only if you no longer refer to the memory is realised. You may want to pack your inner loop into a function to remove (and release) all intermediate results between loop iterations."
    },
    {
      "X ": "how to solve resnet overfitting",
      "Z ": "Got better results after switching to SGD from Adam Optimizer. But validation accuracy and loss saturate early. Learning curves (before early stopping)with link  https: //discuss.pytorch.org/uploads/default/original/3X/4/e/4ebb329817637565687739ed3594d7c57ca2af8c.jpeg",
      "Y ": "I got better results after switching from Adam Optimizer to SGD. However, validation accuracy and loss saturate quickly.Curves of learning (before early stopping)"
    },
    {
      "X ": "how shift an image vertically or horizontally",
      "Z ": "You should be able to do this using <code>torchvision.transforms.RandomAffine</code>. You can set the <code>degrees</code> to 0 and set the <code>translate</code> parameter to what you need. with link https: //pytorch.org/vision/stable/transforms.html#torchvision.transforms.RandomAffine An example. Note that translate refers to the fraction of the length of that dimension to translate by.shift = transforms.RandomAffine(degrees = 0, translate = (0.2,0.2))",
      "Y ": "shift = transforms.RandomAffine(degrees = 0, translate = (0.2,0.2))"
    },
    {
      "X ": "transfer learning with mixed precision,I am currently trying to figure out how to facilitate mixed-precision training when using transfer learning",
      "Z ": "Since you are working with a ResNet, have a look at this link  https://github.com/pytorch/vision/blob/e79a74e1bfdf4e8e0275eeecea9063adba600ba0/torchvision/models/resnet.py#L230-L246 its forward. As you can see, there is unfortunately no model.featuresattribute, so in this case it would be easier to replace model.classifier = nn.Identity()  and just call the model directly.",
      "Y ": "replace model.classifier = nn.Identity()  and just call the model directly"
    },
    {
      "X ": "resnet 50 takes 10 13gb to run with batch size of 96, I have been working on using a ResNet-50 and have images of shape (3, 256, 256) and I’m trying to run it in batch size of 96, but I keep getting an error stating that it ran out of CUDA memory.",
      "Z ": "A memory usage of ~10GB would be expected for a ResNet50 with the specified input shape.<br>Note that the input itself, all parameters, and especially the intermediate forward activations will use device memory.<br>While the former two might be small the intermediates (which are needed for the gradient computation) would most likely use the majority of the memoryA quick way of checking it would be to register forward hooks and check all output activations.Note that this approach is not exact, as you would need to check which operations are performed inplace etc., but would at least give you an approx. estimation:                                  nb_acts = 0",
      "Y ": "If you don’t want to train the model and would like to save memory by not storing the intermediates, wrap the forward pass into with torch.no_grad(),"
    },
    {
      "X ": "torchvision cifar 10 and imagefolder cifar 10 has different behaviors in torchvision preset networks",
      "Z ": "The simple solution is actually shuffling the CIFAR-10 dataset in the dataloader.",
      "Y ": "shuffling the CIFAR-10 dataset in the dataloader."
    },
    {
      "X ": "failedpreconditionerror tagger master8 train is a directory,while running a training session of semantic role labeling. I`m using python 2.7 (anaconda) with TensorFlow 1.12 on Ubuntu 18.04.",
      "Z ": "Problem is solved. Was due to the improper organization of the directory structure…",
      "Y ": "This was due to the directory structure being improperly organised..."
    },
    {
      "X ": "where did gru implement,I want to know how GRU is implemented in pytorch. ",
      "Z ": "The “native” implantations are in with link https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/RNN.cpp  https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/RNN.cu, There also are CuDNN bindings…",
      "Y ": "There also are CuDNN bindings…"
    },
    {
      "X ": "i want to serve my chatbot, How can I do it? ",
      "Z ": "Its good to see you’re wanting to productionize a model!There’s now a tutorial about writing an API around your model using Flask wth link  https://pytorch.org/tutorials/intermediate/flask_rest_api_tutorial.html,Then your react-node app would simply call it. Bearing in mind, that there’s probably a bit more you want to do before deploying to production, eg serve the Flask app with gunicorn or something similar,Another approach would probably be to look into ONNX - I don’t know too much about that. I typically deploy models using Django or Flask",
      "Y ": "There is now a tutorial on writing an API around your model using Flask, which can be found at https://pytorch.org/tutorials/intermediate/flask rest api tutorial.html. Your react-node app would then simply call it. Keeping in mind that there is probably a bit more you want to do before deploying to production, such as serving the Flask app with gunicorn or something similar, another approach would be to look into ONNX - I don't know much about that. I usually deploy models with Django or Flask."
    },
    {
      "X ": "how to get the minimal subset of libtorch for inferencing in c",
      "Z ": "to the best of my knowledge, this is not possible at the moment. This is mainly because we have a dynamic dispatcher and so which part of the lib will be used is not known at compilation time.,That being said, I think there is work towards this as this would be very useful for the mobile build as well. But I don’t think there is any solution yet",
      "Y ": "To the best of my knowledge, this is not currently possible. This is primarily due to the fact that we have a dynamic dispatcher, and thus which part of the library will be used is unknown at the time of compilation. That being said, I believe there is progress being made in this direction, as it would be very useful for the mobile build as well. However, I do not believe there is a solution at this time."
    },
    {
      "X ": "range of a tensor in c,I’m currently writing some C++ code with torch (to eventually make it a PyTorch extension).",
      "Z ": "see Subtensor add and division in libtorch with link https://discuss.pytorch.org/t/subtensor-add-and-division-in-libtorch/63224 ",
      "Y ": "Subtensor add and division in libtorch"
    },
    {
      "X ": "how can i cuda code to support mixed precision",
      "Z ": "What is your use case, i.e. would you like to perform the computation in FP16 or pseudo-FP16, i.e. FP32 math for FP16 inputs? Also, what kind of operations are you using inside your kernel? have a look   at nvidia/apex for some use cases with link https://github.com/NVIDIA/apex/tree/master/csrc",
      "Y ": "Look at nvidia/apex for some use cases "
    },
    {
      "X ": "using torch grid sampler 2d from c api,I can’t find the values for the interpolation mode and border mode, neither in the docs nor in the code and this doesn’t compile:<br>torch: :grid_sampler_2d(img, flow, torch: :kBilinear, torch: :kBorder, true)",
      "Z ": "Thank you for your reply. Seems like grid_sampler_2d is more like a low-level function, not to be used by “normal” users. I use grid_sample now and it works as expected:namespace F = torch: :nn: :functionalF: :grid_sample(img1_t, map_t, F: :GridSampleFuncOptions().mode(torch: :kBilinear).padding_mode(torch: :kBorder).align_corners(true))",
      "Y ": " I use grid_sample now and it works as expected"
    },
    {
      "X ": "loss goes down on gpu0 but up on other 3 when using distributeddataparallel",
      "Z ": "I checked the model parameters, they were in perfect sync,I had another stupid bug. I used ReduceLROnPlateau when the validation accuracy plateaued, but each process looked at the validation accuracy of its subset of data. 1st process reduced the learning rate first, the others reduced it 1 epoch later, hence the problem.",
      "Y ": "DDP should have kept all model replicas in sync, i.e., all model replicas should have the same parameter values. Could you please check if this is true in your use case, say by using all_gather to collect all model parameters into one rank and compare?"
    },
    {
      "X ": "how to share cpu memory in distributed training,Is there any method that I can train the model with DDP mode?",
      "Z ": "One option is to use torch.multiprocessing.Queue 24 as the shared memory. The main process can prepare multiple queues and then pass one queue to each DDP processes. The main process reads from the file and dispatch data items to the queue, while DDP processes wait on their own queue for that data item.Another option is to split the tar file into multiple smaller pieces and let each DDP process read a different one.Yes, but one caveat is that those input data splits need to generate the same number of input batches for DDP. If, say rank 0 processes 3 batches and rank 1 process 4 batches, rank 1 would hang on the last batch.",
      "Y ": "One possibility is to use a torch. multiprocessing. As the shared memory, use Queue 24. The main process can prepare multiple queues before passing one to each DDP process. The main process reads from the file and sends data items to the queue, while DDP processes wait for that data item on their own queue.Another possibility is to divide the tar file into smaller pieces and have each DDP process read a different one.Yes, but there is one caveat: those input data splits must produce the same number of DDP input batches. If rank 0 processes three batches and rank 1 processes four, rank 1 will hang on the last batch."
    },
    {
      "X ": "distributed pagerank with pytorch,I am trying to implement Pagerank with libtorch.",
      "Z ": "It should be possible. And there are several levels of APIs that you can use, send/recv APIs: with link https://pytorch.org/docs/stable/distributed.html#torch.distributed.send and  collective communication APIs:with  link https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce and RPC api linkshttps://pytorch.org/docs/stable/rpc.html, https://pytorch.org/tutorials/intermediate/rpc_tutorial.html, https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html ",
      "Y ": "It can be possible using APIs like send/recv APIs, collective communication APIs and RPC APIs"
    },
    {
      "X ": "change rank of machines manually,I am trying to deploy a crash-resilient distributed deployment in PyTorch. ",
      "Z ": "if you are looking for elastic training for distributed data parallel, torchelastic is built for this purpose. It will conduct re-rendezvous on living nodes when failure occurs. with link https://github.com/pytorch/elastic",
      "Y ": " torchelastic is built for this purpose"
    },
    {
      "X ": "intersection between 2d tensors , is there a way for PyTorch to calculate the intersection of two 2D tensors?",
      "Z ": "Check (a==b) *a . This should give elements where intersecting and zeros if not. Use result.nonzeor() if you don’t want zeros",
      "Y ": "Check (a==b) *a, result.nonzeor()"
    },
    {
      "X ": "can not execute backward after expand",
      "Z ": "Hi,Some precisions to Alexey's message:Indeed, the problem is not with only expand, but with expand + the inplace. And in particular because expand is a view of a.  This is why you don't see any issue if you remove the inplace or if you replace the expand with a repeat that is not a view (it allocates new memory).it is about ‚unreachable I don't think that is. And the error in an internal bug on our end I think The reason for .data to ‚fix‚ the issue is because it hides the inplace op from the autograd (in a bad way) and so it looks like you do only the view for the autograd.In general, you should never use .data! and the with torch.no_grad() is the right way to do this!The temporary fix here is to remove the print of b between the no_grad block and the line where you use it for the sum.I opened an issue here that explains the exact issue and discusses how to solve it.: Printing should not have (bad) autograd side effects ¬∑ Issue #49756 ¬∑ pytorch/pytorch ¬∑ GitHub",
      "Y ": "The temporary fix here is to remove the print of b between the no_grad block and the line where you use it for the sum."
    },
    {
      "X ": "determinism of gradient accumulation,Am I correct in assuming that adding CPU operations like above makes gradient accumulation (for tensor shared_ctx) non-deterministic?",
      "Z ": "Yes you are correct that when using multiple devices, the accumulation is not forced to run in a specific order.A “simple” solution to fix that is to use a custom function that does the copy to the different devices in the forward and the accumulation in the backward (in a fixed order in your custom backward). Would that work for you?",
      "Y ": "Yes, you are correct that the accumulation is not forced to run in a specific order when using multiple devices. A simple solution is to use a custom function that copies to the various devices in the forward direction and accumulates in the backward direction (in a fixed order in your custom backward). Is that something you'd be interested in?"
    },
    {
      "X ": "is there any way of accessing the output from intermediate layers,is there any way of somehow accessing the intermediate results in the intervening layers, from passing the batch through those layers, to get the output?",
      "Z ": "To reduce memory usage we try very hard not to save all of these no. So you cannot guarantee that they are actually saved.You could use global nn.Module forward hooks to force saving some of the results during the forward pass so that you can access them later.",
      "Y ": "To save memory, we make every effort not to save all of these no. As a result, you cannot guarantee that they are saved.You could make use of global nn.Module forward hooks are used to force the saving of some results during the forward pass so that they can be accessed later"
    },
    {
      "X ": "why do the generated samples have require grad false",
      "Z ": "pytorchkiw: return z You are returning z which is your input and as @mruberry has mentioned, if you need grad for your inputs, then you should enable it. But you have mentioned that you only need grads w.r.t. model parameters (not input). In that case, you should return samples instead of z to see grads. I think you are returning the wrong tensor.PS: You can use torch.normal instead of numpy.Bests",
      "Y ": "You can use torch.normal instead of numpy."
    },
    {
      "X ": "cuda error an illegal memory access when fine tuning groupnorm",
      "Z ": "Could you update to the latest PyTorch release and also try the CUDA10.2 or 11.0 binaries?If you are still hitting the illegal memory access, could you run the script via: CUDA_LAUNCH_BLOCKING=1 python script.py args and post the stack trace here?",
      "Y ": "use CUDA_LAUNCH_BLOCKING=1 python script.py args"
    },
    {
      "X ": "sharing a parameter between multiple loss functions graphs",
      "Z ": "This approach would pass the same param_mat to all optimizers, which would update it with its gradient and I assume that's the use case you are looking for.",
      "Y ": "This solution would pass the same param mat to all optimizers, who would update it with its gradient, and I believe that's the use case you're after."
    },
    {
      "X ": "step activation function,Is there a step activation function in pytorch?",
      "Z ": "You may define this activation function on your own. I think there is a great tutorial on pytorch website,However, I think your function is not differentiable, so you might have to be careful on using this function",
      "Y ": "You are free to create your own activation function. I believe there is a great tutorial on the Pytorch website. However, I believe your function is not differentiable, so you may need to be cautious when using this function."
    },
    {
      "X ": "function which outputs tensor by reference",
      "Z ": "qualix:it indeed duplicates the tensors, which breaks autograd.I am not sure this is true.You don't use detach or no grad so the gradients will be properly tracked through your convmatrix2d function (even though the output matrix does not share its memory with the input, if you backprop, the gradients will flow back all the way to the input.",
      "Y ": "You don't use detach or no grad, therefore your convmatrix2d function will track the gradients correctly"
    },
    {
      "X ": "whats wrong about the grad of softmax when i use just some of the inputs to do softmax",
      "Z ": "The small absolute error is most likely created by the limited numerical precision using flaot32 and you should get a smaller error using float64.Also note, that using torch.log(torch.softmax(...)) is numerically less stable than F.log_softmax, as the latter applies the log-sum-exp trick to increase the stability.",
      "Y ": "Because F.log softmax uses the log-sum-exp method to boost stability, torch.log(torch.softmax(...)) is quantitatively less stable than F.log softmax."
    },
    {
      "X ": "incorrect hook being used in register hook implementation",
      "Z ": "Thanks for the code snippet.The approach using a dict might work, alternatively, you could also use:h = module.weight.register_hook(lambda grad, gradient_mask=gradient_mask: grad.mul_(gradient_mask))Your current code runs into a known Python limitation of using lambda functions, since names in function bodies are evaluated when the function is executed.",
      "Y ": "Use h = module.weight.register_hook(lambda grad, gradient_mask=gradient_mask: grad.mul_(gradient_mask))"
    },
    {
      "X ": "simple categorical cross entropy model not learning",
      "Z ": "Thank you for the extensive answer. My problem was the missing bias unit. Rewriting the forward pass as the following made the model learn non-linear patterns:... b0 = torch.tensor(np.random.randn(64)) b1 = torch.tensor(np.random.randn(3))...out = torch.tensor(X, requires_grad=True).matmul(w0).add(b0).relu().matmul(w1).add(b1) out = torch.nn.functional.log_softmax(out, dim=1)",
      "Y ": "out = torch.tensor(X, requires_grad=True).matmul(w0).add(b0).relu().matmul(w1).add(b1) out = torch.nn.functional.log_softmax(out, dim=1)"
    },
    {
      "X ": "inplace parameter updation without torch no grad",
      "Z ": "Hi, It is illegal because it would mean that the weight update would be done in a differentiable manner. So that means that in the next iteration, when computing the gradients, it will flow back through multiple iterations of your training loop which is most likely not what you want to do !",
      "Y ": "It would be prohibited because the weight update would be done in a differentiable manner.So, when computing the gradients in the next iteration, it will loop back over numerous iterations of your training loop, which is most certainly not what you want!"
    },
    {
      "X ": "how to get the bounding boxes in feature map space,Since width and height have been reduced to 1/16 can I interpolate and estimate the coordinates of the objects in the feature map?",
      "Z ": "I wanted to do something similar. Yes you can do that. There are 2 main operations for this: ROI-pooling and ROI-aligning. Basically, each bounding box is a certain region of interest (ROI), which is first projected onto the feature map. They differ in the way features are computed for a certain ROI.I highly recommend this video with link https://www.youtube.com/watch?v=9AyMR4IhSWQ&amp;list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r&amp;index=16 which explains how both operations work (it starts around the 20th minute). ROI-aligning is actually simpler and performs better than ROI-pooling, since it doesn’t snap the projected region of interest onto the grid cells. Both are implemented in torchvision ,I also recommend  this Stackoverflow post  with link https://stackoverflow.com/a/60062070/7762882 that explains how ROI-align works in  pytroch",
      "Y ": "There are 2 main operations for this: ROI-pooling and ROI-aligning. Basically, each bounding box is a certain region of interest (ROI), which is first projected onto the feature map. They differ in the way features are computed for a certain ROI."
    },
    {
      "X ": "how to downsample an image before feeding to generator",
      "Z ": "your use case sounds similar to the pix2pix models so you could check some implementations and reuse the logic for the generator",
      "Y ": "you can use  pix2pix models and chec  some implementations and reuse the logic for the generator"
    },
    {
      "X ": "why no dropout in last layer of rnn",
      "Z ": "So the dropout in the last layer would be operating on what is the output of the RNN.This means you can do it yourself on the output if needed, an option you don’t have for the inner layers.Note that the dropout implemented by the RNN is not the dropout using one random draw for all timesteps.",
      "Y ": "As a result, the dropout in the final layer would be based on the RNN output. This means you can do it yourself on the output if necessary, which is not an option for the inner layers.It should be noted that the dropout implemented by the RNN is not the dropout that uses a single random draw for all timesteps."
    },
    {
      "X ": "singular value decomposition svd,Computing Singular Value Decomposition (SVD) in Pytorch always give me this error",
      "Z ": "I think my inputs contain nan, thats why I got an invalid svd.",
      "Y ": " inputs contain nan, thats why I got an invalid svd."
    },
    {
      "X ": "using torchtext for a multiclass classification problem",
      "Z ": "for Mutli classification,LABEL = data.LabelField(), for multi classification, we donot need to set dtype as pytorch expects it be as numericalized long tensors,IMO, best place to start with link  https://github.com/bentrevett/pytorch-sentiment-analysis ",
      "Y ": "LABEL = data.LabelField()"
    },
    {
      "X ": "pretrained bert package for windows in anaconda",
      "Z ": "State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch with link  https://pypi.org/project/transformers/#description",
      "Y ": "use State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch"
    },
    {
      "X ": "convert array to tensor in c, I have to give a matrix as input to my network, which is a standard C++ 2d array, and I can’t find a method to transform my data to a tensor.",
      "Z ": "Found the solution. You can use the torch::from_blob() function to read an array and turn it to a tensor. Apparently you also need to call .clone() because of some memory issue",
      "Y ": "use torch::from_blob() function to read an array and turn it to a tensor"
    },
    {
      "X ": "libtorch statics library vs dynamicss library, What is the current state of statics libtorch statics library is it as stable as the dynamics one ?, It is advisable to use statics library instead of dynamics library?,Why there is no option to download the statics library from download menu at pytorch.org website ",
      "Z ": "Static linking libtorch doesn’t work well yet, please see, with link https://github.com/pytorch/pytorch/issues/21737",
      "Y ": "Static linking libtorch doesn’t work well "
    },
    {
      "X ": "converting vector of double to tensor issues,I am trying to convert an std::vector to torch:Tensor with the following code:",
      "Z ": "The module parameters interacting with your input tensor should probably have the same type as the input tensorI think it would be easier to work with torch: :kFloat tensor so you should convert your data to float before creating the input with from_blob",
      "Y ": "The module parameters interacting with your input tensor should probably be of the same type as the input tensor. I believe torch: : would be easier to work with. Because the kFloat tensor is used, you should convert your data to float before using from blob to create the input."
    },
    {
      "X ": "initializing a model instance before loading,would there be a way to pull that information from a saved file prior to loading so I could “dynamically” initialize a UNet object that will have a valid structure for storing the data from the saved file ?",
      "Z ": "You could use some utility funcitions from this topic or load your jitted model directly (which seems to be the recommended way) with link https://discuss.pytorch.org/t/how-to-copy-network-parameters-in-libtorch-c-api/32221/7",
      "Y ": "You could use some of the utility functions from this topic or directly load your jitted model (which seems to be the recommended way)"
    },
    {
      "X ": "using autograd grad function in c",
      "Z ": "I’m not sure this function was already added back then. You should try to upgrade to the latest version.",
      "Y ": "use latest version"
    },
    {
      "X ": "cuda extension install without rebuilding,How do I move a CUDA extension after it has been build?",
      "Z ": "You can specify the build directory if you want and share that directory between your docker images with link https://pytorch.org/docs/stable/cpp_extension.html#torch.utils.cpp_extension.load_inline",
      "Y ": "If you want, you can specify the build directory and share it among your Docker images."
    },
    {
      "X ": "changing specific elements in tensor,I’m looking for some function or any efficient way to change specific elements in a tensor.<b",
      "Z ": "if x is your tensor you can do torch.where(x &gt; 5, x+100, x)",
      "Y ": "torch.where(x &gt; 5, x+100, x)"
    },
    {
      "X ": "multi gpu training on single node with distributeddataparallel,When I train with DistributedDataParallel do I get the functionality of DataParallel, meaning can I assume that on a single node if there is more than one GPU then all GPUs will be utilized on that node?",
      "Z ": "Yep, DistributedDataParallel (DDP) can utilize multiple GPUs on the same node, but it works differently than DataParallel (DP). DDP uses multiple processes, one process per GPU, while DP is single-process multi-thread See this page for the comparison between the two and and this to get started with DDP with link https://pytorch.org/docs/stable/notes/ddp.html ",
      "Y ": "Yes, DistributedDataParallel (DDP) can use multiple GPUs on the same node, but it does so in a different way than DataParallel (DP). DDP employs multiple processes, one for each GPU, whereas DP employs a single multi-threaded process. See this page for a comparison of the two, and this for more information on getting started with DDP."
    },
    {
      "X ": "how to balance gpu memories in ddp,When training a model with DDP, GPU for rank 0 consumes much higher memory than others.<br>Because of that GPU, I cannot increase batch-size for training.",
      "Z ": "I haven’t set explicitly device cuda: 0 at any point.And even in the shows the same unbalanced GPU memory consumption with link https: //github.com/pytorch/examples/blob/master/distributed/ddp/main.py ,I solved the issue by setting torch.cuda.set_device(args.local_rank) which works the same as setting CUDA_VISIBLE_DEVICES.",
      "Y ": "I solved the issue by setting torch.cuda.set_device(args.local_rank) which works the same as setting CUDA_VISIBLE_DEVICES."
    },
    {
      "X ": "strange behavior in pytorch",
      "Z ": "Well, i moved the data from HDD to SSD and it solved my problem. I didn’t know HDD can be such a bottleneck. Anyway, now on SSD the training is very smooth as expected, both on single- and multi-GPU devices",
      "Y ": "i moved the data from HDD to SSD and it solved my problem"
    },
    {
      "X ": "initialize tensor in autograd function",
      "Z ": " cbd:Lets say i want to predict y = 2*x and initial weight = 1 which we want to be 2 as we want y_pred = w * x. Lets say every time we give the input value x=2 and learning_rate = 0.1I‚am sorry I can‚Äôt really follow this. What would be the pair (x, expected_y) that you want? What would be the initial weights w0?It means we should never take ReLU output as our final output ?It is not advised in general no. Mainly because it does not have any benefit (if you‚Äre value is positive, it will just be an identity) and has the drawback of potentially getting your training stuck if the value becomes negative (as you will get 0 gradient and nothing will move anymore).OR should i missed something?Be careful here that for the l1 loss, the gradient can be 1 or -1 depending on the sign of the difference! That will make sure you donôt keep going in one direction forever.",
      "Y ": "Keep in mind that the gradient for the l1 loss can be 1 or -1 depending on the sign of the difference! This will ensure that you do not continue in the same path indefinitely."
    },
    {
      "X ": "custom nn conv2d",
      "Z ": "There seem to be a few minor mistakes in the code: h and w should probably de defined as input.size(2) and input.size(3), respectively.Currently you are assigning the same input value to both. However, since you are passing input directly, they can also be removed completely.As already mentioned, you are passing input, so I‚Äôm not sure what image should be used for.You are also passing the weight to forward, which would make filt unnecessary. Also, you are currently re-initializing filt in each forward pass, which won‚Äôt store the already learned weights, so use weight instead for your operations.re you using the nested loop to compare the outputs of patches * filt or what is it used for?patches.sum(1) calculates the sum over dim1. I‚Äôm currently not sure, which shape patches would have at that point.Sami_Hassan:After that there is addition ‚Äúpatches = patches.sum(1)‚Äù i am not sure what is it doing , I would like to replace the addition as well.How would you like to replace the addition? Also, could you print the shape of patches before calculating sum(1) and name all dimensions, so that we can understand the use case better? ",
      "Y ": "Input.size(2) and input.size(3), respectively, should probably be defined for h and w.You're now giving both the same input value. They can, however, be fully deleted because you are giving information directly.As previously stated, you are passing input, therefore I'm not sure what image you should use.You're also moving the weight forward, which eliminates the need for filt. Additionally, you are now re-initializing filt in each forward pass, which will not save the previously learnt weights, so instead use weight for your operations."
    },
    {
      "X ": "accessing update values,Is there a way to access the update values computed by the optimizer?",
      "Z ": "I’m afraid that in the general case no, there isn’t any other way to get that As a fun fact, some optimizers like lbfgs might actually do the step as multiple “sub-steps”, so in that case, checking before and after is really the only thing you can do",
      "Y ": "I'm afraid that, in most cases, there isn't another way to get that. As an aside, some optimizers, such as lbfgs, may actually perform the step as multiple in which case checking before and after is really the only option."
    },
    {
      "X ": "propagation trough 2 identical networks but do not accumulate gradients w r t the second pass",
      "Z ": "Hm, even if you compute dDistanceLoss(E(x_hat),z.detach()) / dx_hat for your second loss, that may conflict with the reconstruction loss. Using non-detached z for second loss seems more promising to me… Anyway, manipulations suggested above may do what you want, or temporarily disable .requires_grad for all nn.Parameters.",
      "Y ": "Even if you compute dDistanceLoss(E(x hat),z.detach()) / dx hat for your second loss, it could conflict with the reconstruction loss. I believe that using non-detached z for the second loss is more promising... In any case, the above-mentioned manipulations may accomplish your goals or temporarily disable. For all nn.Parameters, requires grad is required."
    },
    {
      "X ": "grad attribute of a non leaf tensor being accessed",
      "Z ": "Hi,This warning only means that you are accessing the .grad field of a Tensor for which pytorch will never populate the .grad field.You can run your code with python -W error your_script.py to make python error out when the warning happens and so show you where it happens exactly.The gist of the problem here is that only leaf Tensors will have their .grad field populated. So if this warning happens, that means that you think something a leaf while it isn't. This usually happens if your perform operations on a Tensor that requires gradients, a common mistake is foo = torch.rand(10, requires_grad=True).to(device). In this case, foo won't be a leaf because it is the result of the .to() operation.",
      "Y ": "This warning only signifies that you're trying to access a Tensor's.grad field, which pytorch will never populate.You can execute your code with python -W error your script.py to have Python throw an error when the warning occurs, allowing you to see exactly where it occurs.The main issue here is that only leaf Tensors will have their.grad field filled in. So if you get this warning, it means you mistook anything for a leaf when it isn't. This typically occurs when you do operations on a Tensor that require gradients, such as foo = torch.rand(10, requires grad=True) (device). Because foo is the result of the.to() operation, it will not be a leaf in this case."
    },
    {
      "X ": "how to train domain adaptive model",
      "Z ": "Both approaches should work.The first one would use less memory and more compute, since you would be using two backward calls (more compute), but Autograd will be able to delete the intermediate tensors from the first forward pass, as they are not needed anymore (gradients were already computed using them)The latter approach would store both computation graphs and would thus use more memory",
      "Y ": "Both approaches should be effective. The first would use less memory and more compute because you'd be making two backward calls (more compute), but Autograd will be able to delete the intermediate tensors from the first forward pass because they're no longer needed (gradients were already computed using them)The latter approach would store both computation graphs, requiring more memory."
    },
    {
      "X ": "how to preprocess input for pre trained networks, when using the pretrained networks of torchvision.models module, what preprocessing should be done on the input images we give them ?",
      "Z ": "All pretrained torchvision models have the same preprocessing, which is to normalize using the following mean/std values with link https://github.com/pytorch/examples/blob/97304e232807082c2e7b54c597615dc0ad8f6173/imagenet/main.py#L197-L198  ",
      "Y ": "All pretrained torchvision models go through the same preprocessing, which is to normalise using the mean/std values ."
    },
    {
      "X ": "very small class weights",
      "Z ": "You could multiply the weights, if you are suspecting rounding errors or any other numerical issues.The weights are used relatively, so you should be able to add a constant offset to the tensor",
      "Y ": "If you suspect rounding errors or other numerical issues, you could multiply the weights. Because the weights are used in a relative manner, you should be able to add a constant offset to the tensor."
    },
    {
      "X ": "error loading state dict gpt2 finetuning",
      "Z ": "issuse already soved with  link https://github.com/huggingface/transformers/issues/4309 ,already or are you still running into these issues?",
      "Y ": "There masked_bias was introduced. "
    },
    {
      "X ": "how to plot saliency map from rnn model in nlp task",
      "Z ": "There’s still a small problem when I avoided to do optimizer step after calling backward() in training mode. the problem is: <strong>for the same input sentence and the same model, the saliencies are different between the first time and second time I called</strong>. If the parameters are totally fixed, the results of saliency should be consistent. I guess that because my model contains a <strong>dropout layer</strong> and the randomness is generated in model.train() from drop out layer. <strong>Any ideas about how to deactivate drop out layer in model.train() mode?,Saliency for the first time calling backward(): with link https://discuss.pytorch.org/uploads/default/original/3X/5/5/55d4cfcc8bfb72e818fc9a51700961e095560a02.png, Saliency for the second time calling backward() with link https://discuss.pytorch.org/uploads/default/original/3X/6/d/6d4b89fed4b3a6817dd5fda3967f1ab352fe7bdb.png,The Saliency for each word is getting bigger.Well actually I figured out how to do this, just set the drop out layer to eval by adding model.dropout.eval() , now the Saliency become consistent.",
      "Y ": "just set the drop out layer to eval by adding model.dropout.eval() "
    },
    {
      "X ": "pytorch cuda speed",
      "Z ": "As described in the tutorial, the next step after writing the extension is to write a custom CUDA kernel and reduce the kernel launch overheads A definite method of speeding things up is therefore to rewrite parts in C++ (or CUDA) and  fuse particular groups of operations. Fusing means combining the implementations of many functions into a single functions, which profits from fewer kernel launches as well as other optimizations we can perform with increased visibility of the global flow of data",
      "Y ": "Following the creation of the extension, the next step is to create a custom CUDA kernel and reduce the kernel launch overheads, as described in the tutorial. Rewriting parts in C++ (or CUDA) and fusing specific groups of operations is thus a definite way of speeding things up. Fusing is the process of combining the implementations of multiple functions into a single function, which benefits from fewer kernel launches as well as other optimizations we can perform with increased visibility of the global data flow."
    },
    {
      "X ": "use of lossclosure in optimizers",
      "Z ": "The test provide with link https://github.com/pytorch/pytorch/blob/856e8cf0288fe3c1701d11fae61b214c08635b9d/test/cpp/api/optim.cpp#L46-L55,Would that help ?",
      "Y ": "auto step = [&](OptimizerClass& optimizer, Sequential model, torch::Tensor inputs, torch::Tensor labels) "
    },
    {
      "X ": "torch from blob memory types,My question now is how does the situation differ if the memory is not linear but instead a volume as a Cuda Array.  This type of structure doesn’t really have a pointer in the same sense and under normal circumstances one would have to use surf3Dwrite() and surf3Dread() in Cuda kernels.  If the input is in this format, what is the recommended way to run inference on it?",
      "Z ": "Unfortuatly, from_blob can only take raw memory pointer (and size/stride informations). So you won’t be able to wrap complex data types inside a torch Tensor.But if one copy is ok, you can dump it into a new contiguous memory buffer and use that",
      "Y ": "Regrettably, from blob can only accept a raw memory pointer (along with the size and stride information). As a result, you won't be able to use a torch Tensor to wrap complex data types.However, if only one copy is acceptable, you can dump it into a new contiguous memory buffer and use it."
    },
    {
      "X ": "link libtorch with make instead of cmake,For a project I am forced to link LibTorch with a Makefile, I can’t use cmake. Now looking into the LibTorch",
      "Z ": "You can do it, but you have to keep in mind that pytorch code changed very fast, symbols might changed their .so, for eg. we might want to split some of the .so to make the size of those .so smaller.We do recommend to use cmake,If you want to use make, check this post with link  https: //discuss.pytorch.org/t/how-do-i-to-create-a-new-project-in-c-to-run-libtorch-without-cmake/79046/5, How do i to create a new project in C++ to run libtorch without cmake?And I believe now you need to add -ltorch_python after -lc10 as well. And if your code has symbols in other .so, you should add them as well. All the .so are in the {your_libtorch_root|/lib, and if you get and missing symbol when linking, you can try finding the missing symbol within the .so in that folder and then add them into your make line",
      "Y ": "You can do it, but keep in mind that the pytroch code changes frequently, and symbols may change. As an example, we may want to split some of the. so as to reduce the size of those.soWe strongly advise you to use cmake."
    },
    {
      "X ": "memory leaks in libtorch",
      "Z ": "Thanks for the analysis and I think that (same as with your last check) it would be worth creating an issue on GitHub to track it",
      "Y ": "Thanks for the analysis, and I believe it would be worthwhile (as with your previous check) to create an issue on GitHub to track it."
    },
    {
      "X ": "c10 macros cmake macros h not exists",
      "Z ": "I found a related post with link https://discuss.pytorch.org/t/compilation-error-in-jit-ir-ir-h/80694/2",
      "Y ": "I have a CPP android project. I’m compiling it using ndk-build (ndk21 and C++14).I have a file named “main.cpp” and I’m trying to #include “torch/script.h” ."
    },
    {
      "X ": "does distributeddataparallel split data at the batch dimension,How the batch will be split? Will the batch first split into two and thus each node will get a batch of data size 32, and finally, each node will split the data among the four GPUs, thus each GPU will get a batch of data size 8? is this the way the will be split in DistributedDataParallel mode?",
      "Z ": "In this case, as each node has 4 GPUs, each node will launch 4 processes with each process creating its own dataloader and DDP instance. So, each node will actually have 4 data loaders.If you would like to run batch size of 64 across 2 node (8 gpus), then each data loader should load data size of 64 / 8 = 8",
      "Y ": "Because each node has four GPUs, each node will launch four processes, each with its own dataloader and DDP instance. As a result, each node will have four data loaders. If you want to run 64 batch sizes across two nodes (8 gpus), each data loader should load 64 / 8 = 8 data sizes."
    },
    {
      "X ": "error occurs in rpc pipeline code in tutorial.AttributeError: ‘torch.distributed.rpc.TensorPipeRpcBackendOptions’ object has no attribute ‘num_send_recv_threads’Could anyone please tell me what possible reasons could be?",
      "Z ": "I solve this after adding this argumentwith link https: //discuss.pytorch.org/uploads/default/original/3X/0/3/037a56585139c67b25735853d41042b89b125842.png ",
      "Y": "backend = rpc.BackendType.TENSORPIPE"
    },
    {
      "X ": "distributed data parallel over the internet,how to ensure before running the training script on each remote node that the 2 remote nodes have access to each other?",
      "Z ": "No, you don’t need to ssh from node-1 to other nodes to launch the script. The ping/ssh I mentioned is only to check what IP would work. If you confirm that the IP of one node is accessible for all other nodes, you can set that node as master. This is only for rendezvous, and all nodes will use the rendezvous process to discover each other automatically.",
      "Y ": "To run the script, you don't need to ssh from node-1 to other nodes. I only mentioned ping/ssh to see what IP would work. You can set a node as master if you confirm that the IP of that node is accessible to all other nodes. This is only for rendezvous, and all nodes will automatically discover each other using the rendezvous process."
    },
    {
      "X ": "why are many batches loaded on each gpu,When I use DDP package to train imagenet, there are always OOM problem.<br>I check the GPU utilization and I found there are many processes on each GPU ？<br>What is the reason and how can I avoid this problem ?",
      "Z ": "This shouldn’t happen and each process should use one GPU and thus create one CUDA context.Are you calling CUDA operations on all devices in the script or did you write device-agnostic code, which only uses a single GPU?",
      "Y ": "This should not occur, and each process should use a single GPU, resulting in a single CUDA context.Is the script calling CUDA operations on all devices, or did you write device-agnostic code that only uses one GPU?"
    },
    {
      "X ": "modifying resnet18 architecture",
      "Z ": "Yes, you can use any valid shape and it shouldn’t break anything. Powers of two are often friendly for memory access pattern etc. so you might see performance plateaus or cliffs.Yes, same as 1,I’m not deeply familiar with the package, but it seems that  multiple inputs are supported with link https://github.com/TylerYep/torchinfo#multiple-inputs-w-different-data-types ",
      "Y ": "Yes, you may use any valid shape as long as it does not break anything. Because powers of two are often friendly for memory access patterns and so on, you may experience performance plateaus or cliffs. Yes, as in 1. I'm not intimately familiar with the package, but it appears that multiple inputs are supported."
    },
    {
      "X ": "freeze part of model parameters cause cuda error",
      "Z ": "ericlormul:Yeah, deg always contains zero elements. This explains why torch.autograd.set_detect_anomaly(True) triggers warning, but this warning shouldn‚Äôt be considered as an error, right?When you enable anomaly detection, the NaNs will trigger an error. If you are not concerned about it, you could disable anomaly detection and handle the NaN values separately.The idea of using this utility is to get a runtime error in order to debug the issue.ericlormul:What confuses me is for the first ~600 iteration, there is no error. Then, nan grad appears, causing certain parameters to be updated to nan. Thus, in next iteration, due to nan parameters, pred tensor contains nan and bce_loss errors. Any insights on how to debug this. Thanks.I guess that the first 600 iterations don‚Äôt contain an exact zero and would thus still create valid gradients. To void the NaN gradient, you could add a small eps value to the pow operation.",
      "Y ": "When you enable anomaly detection, the NaNs will trigger an error. If you are not concerned about it, you could disable anomaly detection and handle the NaN values separately.The idea of using this utility is to get a runtime error in order to debug the issue."
    },
    {
      "X ": "how to use zero grad with gans",
      "Z ": "Remember that the parameters are updated through the optimizer's step method.When the optimiser is initialised you pass it all the parameters it will modify. So yes, there are still some uncleared gradients in the discriminator parameters, however: loss_G.backward() optimizer_G.step() Wont be affected by them, as this will only modify the parameters of the generator based on its gradients wrt loss_G.",
      "Y ": "use loss_G.backward()optimizer_G.step()"
    },
    {
      "X ": "warning nan or inf found in input tensor but input tensors do not contain nan or inf",
      "Z ": "I got the fix here with link https://github.com/lanpa/tensorboardX/pull/469",
      "Y ": "def check_nan(array):tmp = np.sum(array) if np.isnan(tmp) or np.isinf(tmp):print('Warning: NaN or Inf found in input tensor.') return array"
    },
    {
      "X ": "running mean and running stats",
      "Z ": "When using batchnorm in training mode, the running stats are always updated yes.<br>You should be using the eval mode to use these stats and stop updating them when evaluating indeed.",
      "Y ": "The running stats are always updated when using batchnorm in training mode. To use these stats, you should use the eval mode and stop updating them when evaluating."
    },
    {
      "X ": "how could you manually stop backpropagation",
      "Z ": "You can use autograd.grad(loss, input) and it will return gradients for that input and only perform the necessary computations to get that value.If you prefer for the .grad fields to be populated, you can use loss.backward(inputs=input).",
      "Y ": "use autograd.grad(loss, input) and it will return gradients for that input and only perform the necessary computations to get that value.If you prefer for the .grad fields to be populated, you can use loss.backward(inputs=input)"
    },
    {
      "X ": "how to do forward on part of my model",
      "Z ": "I think creating a new custom model with a new forward method would be the cleanest approach. Alternatively, you could also use the modules by directly accessing them via: middle_activation = ... out = model.middle_layer1(middle_activation) out = model.middle_layer2(out)...but this would basically be equivalent to writing a new custom forward.",
      "Y ": "out = model.middle_layer1(middle_activation)out = model.middle_layer2(out)"
    },
    {
      "X ": "how torch calculates the grads for the scalar and non scalar tensors",
      "Z ": "The idea is that it always does a vector Jacobian product. It just happens that when the output is scalar, it is 1D and so the vector is of size 1 and can be replaced with just the value 1. that will give you the full jacobian (and thus gradients)",
      "Y ": "It always performs a vector Jacobian product, according to the theory. When the output is scalar, it is 1D, so the vector is of size 1 and can be replaced with just the value 1. This will provide you with the complete Jacobian (and thus gradients)"
    },
    {
      "X ": "torchvision faster r cnn model input arguments",
      "Z ": "The training only uses boxes and labels and ignores any other keys. As the tutorial says, the image_id, area and iscrowd are used in evaluation (but not during training).",
      "Y ": "The training only uses boxes and labels and ignores any other keys.  the image_id, area and iscrowd are used in evaluation"
    },
    {
      "X ": "typeerror tensor argument device must be torch device not type",
      "Z ": "Thank you…It works now ",
      "Y ": "test_iter = Iterator(tst, batch_size=64, device=torch.device, sort=False, sort_within_batch=False, repeat=False)"
    },
    {
      "X ": "remove padding dimension from output layer",
      "Z ": " I ended up not having a padding token in the vocabulary and then I padded the batch with whatever index was len(TRG.vocab)+1. Whenever I needed the pad_idx (for ignore_idx in the criterion for example) I just used len(TRG.vocab)+1. Note that I built my own Dataset and Loader so this may not work if you’re not padding your batches yourself",
      "Y ": "The workout uses only boxes and labels and does not use other keys. The image id, area and iscrowd are used to assess, as the tutorial states (but not during training)."
    },
    {
      "X ": "tensor to c vector , Is there a way to turn a 2d libtorch torch::Tensor into vector",
      "Z ": "This may not be sufficiently general, because it really relies on contiguous data storage: If you know the data type and sizes of the tensor, and the origin (available through data_ptr()), it’s straightforward to step through memory and build up row vectors, and then vectors of these as column vectors. (I might have it backward, but I’ve been able to get such a procedure to work for me.)",
      "Y ": "This may not be general enough because it really depends on the storage of contiguous data: It is easy to step through the memory and build row vectors and vectors from those as column vectors when you have a knowledge of the type and size of the tensor and the source (which can be accessed via data ptr(). (I could be backward, but I could have had a procedure of that kind to work for me.)"
    },
    {
      "X ": "std vector stores inherited torch module objects,I have created several modules derived from torch::nn::Module. How can I use std::vector to store them? ",
      "Z ": "I am not sure if you have any specific requirement to use vector of modules. In C++ API, we have which could be used for this purpose with link https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/nn/modules/container/modulelist.h ",
      "Y ": "If you have any particular module vector requirements, I'm not sure."
    },
    {
      "X ": "current status of automatic quantization support",
      "Z ": "Yes, graph mode quantization will be ready for testing after with link https://github.com/pytorch/pytorch/pull/32303 ,Next we are going to test more models and fix the issues we see in the process, this might take a few more weeks to a month or so, but in the meantime, feel free to checkout the master and test the graph mode quantization on your model as well",
      "Y ": "Yes, graph mode quantization will be ready for testing"
    },
    {
      "X ": "check if grad is enabled",
      "Z ": "You can check it by accessing torch.Tensor's requires_grad attribute, which returns True if gradient should be calculated on that Tensor. Note that it has a contagious behavior, that is, if A.requires_grad=True for some Tensor A, all Tensors computed from A have the requires_grad attribute True.",
      "Y ": "You may check it out by going to torch. The requires grad attribute of a Tensor returns True if it is necessary to calculate gradient on that Tensor. It's worth noting that it's contagious; if A.requires grad=True for some Tensor A, then all Tensors computed from A have the requires grad property set to True."
    },
    {
      "X ": "non gradient trackable convolutions",
      "Z ": "Yes, that's possible.You could either use the functional API with a kernel specified as a plain tensor (not an nn.Parameter): kernel = torch.randn(...) out = F.conv2d(input, kernel, ...) or set the requires_grad attribute of the weight (and bias) in the conv layer to False.",
      "Y ": "You could either use the functional API with a kernel specified as a plain tensor (not an nn.Parameter):kernel = torch.randn(...) out = F.conv2d(input, kernel, ...)or set the requires_grad attribute of the weight (and bias) in the conv layer to False"
    },
    {
      "X ": "best way to encode latent variable with cnn",
      "Z ": "I think you might need a flatten in 1, too,We have a short discussion of the more general number of channels and width in our book (Stevens/Antiga/Viehmann: Deep Learning with PyTorch, Section “8.3.1 Our network as an nn.Module”).So the idea behind “channels should increase while width should decrease” is mainly “total size should decrease but not too much”. That the channels increase is more a way to achieve “not too much” in general (so halving the spatial resolution + doubling channels halves the number of elements).Aside from the size (196 vs 128), the key difference between 1 and 2 is that the first evicts all explicitly spatial information while the second keeps it but reduces the information “per pixel”. Which is more appropriate is likely task dependent. (And you could mix, too, if you feel like it.)For the VAE, it might make sense to to keep some spatial information as might want to be able reconstruct flipped/rotated images (as a vague intuition), but the typical thing might be to keep more than one channel. But I would recommend to check out the common VAE architectures. It is still good to ask “why” they did it this way, but it’d give you a starting point",
      "Y ": "I think you might need a flatten in 1, too."
    },
    {
      "X ": "custom color mapping in data loader for unet image segmentation",
      "Z ": "Figured it out! The fill tool I was using on GIMP had anti-aliasing which added additional pixel values to the mask boundary in order to smooth the edges. Thank you.",
      "Y ": "The GIMP fill tool I was using had anti-aliasing, which added extra pixel values to the mask boundary to smooth the edges."
    }
  ]
}