
Bad mix with apply in DataParallelTable ? #262

Open
fmassa opened this issue Apr 28, 2016 · 9 comments


fmassa commented Apr 28, 2016

There seems to be a problem (or at least behaviour that is inconsistent with other modules) with apply in DataParallelTable.
The following snippet reproduces the problem:

require 'cunn'
model = nn.DataParallelTable(1)
model:add(nn.SpatialConvolution(3,96,7,7,3,3),1)
model:add(nn.SpatialConvolution(3,96,7,7,3,3),2)
model:cuda()
input = torch.randn(32,3,224,224):cuda()
function f(m)
  local ff = m.updateOutput
  m.updateOutput = function(self, i)
    return ff(self, i)
  end
end
model:apply(f)
model:forward(input);

The error is the following:

/opt/rocks/distro/install/share/lua/5.1/torch/File.lua:141: Unwritable object <function> at <?>.<?>.updateOutput.ff.THNN.getState.ffi.abi
stack traceback:
    [C]: in function 'error'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:141: in function 'writeObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:235: in function 'writeObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:235: in function 'writeObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:235: in function 'writeObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:200: in function 'writeObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:235: in function 'writeObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:235: in function 'writeObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:235: in function 'writeObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:200: in function 'writeObject'
    /opt/rocks/distro/install/share/lua/5.1/torch/File.lua:235: in function 'writeObject'
    ...
    /opt/rocks/distro/install/share/lua/5.1/nn/Module.lua:107: in function 'clone'
    .../distro/install/share/lua/5.1/cunn/DataParallelTable.lua:634: in function 'applyChanges'
    .../distro/install/share/lua/5.1/cunn/DataParallelTable.lua:644: in function 'setup'
    .../distro/install/share/lua/5.1/cunn/DataParallelTable.lua:651: in function 'exec'
    .../distro/install/share/lua/5.1/cunn/DataParallelTable.lua:194: in function 'forward'
    [string "model:forward(input);"]:1: in main chunk
    [C]: in function 'xpcall'
    /opt/rocks/distro/install/share/lua/5.1/trepl/init.lua:669: in function 'repl'
    ...cks/distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:199: in main chunk
    [C]: at 0x00406670  

Here is a related issue: fmassa/optimize-net#12

Is this a misuse of apply that should not be supported?


apaszke commented Apr 28, 2016

Can't reproduce here. The error looks as if you were passing that function to a thread, but you aren't using the threaded implementation in your example.

EDIT: it works for me even with threads enabled


fmassa commented Apr 28, 2016

OK, thanks for trying it out. I'll try this snippet on another machine to see if I manage to reproduce it.

szagoruyko commented

@apaszke are you using Lua 5.2? I can reproduce this on several machines with LuaJIT 2.1 and a freshly installed Torch.


apaszke commented Apr 28, 2016

@szagoruyko It was LuaJIT. I reinstalled everything and I can confirm the issue now.


apaszke commented Apr 28, 2016

I got it to work, but it's a bit of a hack 😜

require 'cunn'
model = nn.DataParallelTable(1)
model:add(nn.SpatialConvolution(3,96,7,7,3,3),1)
model:add(nn.SpatialConvolution(3,96,7,7,3,3),2)
model:cuda()
input = torch.randn(32,3,224,224):cuda()
function f(m)
  m.updateOutput = function(self, i)
     return torch.getmetatable(torch.type(m)).updateOutput(self, i)
  end
end
model:apply(f)
print(model:forward(input):size())

I don't really know why it tries to serialise THNN along with the closure in your example. Also it seems to work with older Torch versions, which is even weirder.


apaszke commented Apr 28, 2016

Still, this is a problem with serialization, not with DPT. Below is a snippet that reproduces it:

require 'nn'
model = nn.Sequential()
model:add(nn.SpatialConvolution(3,96,7,7,3,3),1)
function f(m)
  m._uO = m.updateOutput
  m.updateOutput = function(self, i)
     return m:_uO(i)
  end
end
model:apply(f)
model:clone()
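
For context on why clone hits the serializer at all: nn.Module:clone writes the module to an in-memory file and reads it back, which is why writeObject dominates the stack trace above. A simplified sketch of that mechanism (the real implementation also handles the share(...) arguments):

-- Simplified sketch of what nn.Module:clone does internally
local function cloneSketch(module)
  local f = torch.MemoryFile('rw'):binary()
  f:writeObject(module)   -- walks the module's fields, including any injected closures
  f:seek(1)
  local cloned = f:readObject()
  f:close()
  return cloned
end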

colesbury commented

It serializes THNN because of the reference chain:

m.updateOutput -> ff -> THNN.optionalTensor

i.e.

  1. the original updateOutput is now an upvalue (ff) of the new updateOutput function
  2. the original updateOutput has THNN as an upvalue because it calls into THNN (e.g. THNN.optionalTensor) — see the sketch below
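
A minimal sketch (hypothetical, not from the thread) that makes this chain visible with debug.getupvalue; it assumes a 2016-era nn where SpatialConvolution's updateOutput captures THNN as an upvalue, as described above:

require 'nn'

local m = nn.SpatialConvolution(3, 96, 7, 7, 3, 3)

-- Wrap updateOutput the same way as in the original repro.
local ff = m.updateOutput
m.updateOutput = function(self, i) return ff(self, i) end

-- Upvalues of the wrapper: 'ff' (the original updateOutput) shows up here,
-- so the serializer will follow it when it walks the instance table.
for i = 1, math.huge do
  local name, value = debug.getupvalue(m.updateOutput, i)
  if not name then break end
  print('wrapper upvalue', i, name, value)
end

-- Upvalues of the original updateOutput: this is where THNN appears.
for i = 1, math.huge do
  local name, value = debug.getupvalue(ff, i)
  if not name then break end
  print('original upvalue', i, name, value)
end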


apaszke commented Apr 28, 2016

@colesbury But if you don't do :clone(), then THNN is still an upvalue of the original updateOutput. It doesn't get serialized then, because it's a metatable method, right?
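
A quick way to check that (hypothetical snippet, assuming standard torch class behaviour): before wrapping, updateOutput resolves only through the metatable, so the instance table the serializer walks never contains it; wrapping turns it into a per-instance field:

require 'nn'

local m = nn.SpatialConvolution(3, 96, 7, 7, 3, 3)

print(rawget(m, 'updateOutput'))      -- nil: the method lives on the class metatable
print(getmetatable(m).updateOutput)   -- function: resolved via __index at call time

-- After wrapping, the closure becomes a per-instance field and will be serialized:
local ff = m.updateOutput
m.updateOutput = function(self, i) return ff(self, i) end
print(rawget(m, 'updateOutput'))      -- function: now part of the instance table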


apaszke commented Apr 28, 2016

If so, this is probably a better workaround:

require 'nn'
model = nn.Sequential()
model:add(nn.SpatialConvolution(3,96,7,7,3,3),1)
function f(m)
  local mt = getmetatable(m)
  mt._updateOutput = mt.updateOutput
  m.updateOutput = function(self, i)
     return self:_updateOutput(i)
  end
end
model:apply(f)
model:clone()

The only downside is that it will break if you save the model to a file and load it without _updateOutput set in all modules' metatables. You could override the write method as well (sketched below), but it starts to become a chain of hacks 😕
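
For completeness, a hypothetical sketch of that write/read override, using the per-class write/read hooks that torch.File supports: drop the injected closure when writing and re-inject the wrapper when reading, so a saved-and-loaded model keeps behaving the same. It assumes _updateOutput was set on the metatable as in the snippet above.

-- Hypothetical sketch: per-class write/read hooks so the injected closure never
-- reaches the serializer. Assumes mt._updateOutput was set as shown above.
local Conv = nn.SpatialConvolution

function Conv:write(file)
  -- Serialize all fields except the per-instance wrapper closure.
  local var = {}
  for k, v in pairs(self) do
    if k ~= 'updateOutput' then var[k] = v end
  end
  file:writeObject(var)
end

function Conv:read(file)
  local var = file:readObject()
  for k, v in pairs(var) do self[k] = v end
  -- Re-inject the wrapper after loading.
  self.updateOutput = function(s, i) return s:_updateOutput(i) end
end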
