Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky UnpicklingError in test_engine.py #3149

Closed
kwlzn opened this issue Apr 6, 2016 · 14 comments
Closed

Flaky UnpicklingError in test_engine.py #3149

kwlzn opened this issue Apr 6, 2016 · 14 comments
Assignees
Labels

Comments

@kwlzn
Copy link
Member

kwlzn commented Apr 6, 2016

placeholder issue for CI hang investigation on the test_engine.py suite.

@kwlzn
Copy link
Member Author

kwlzn commented Apr 6, 2016

caught with PANTS_TEST_PYTEST_OPTIONS='-vvs' in https://travis-ci.org/pantsbuild/pants/jobs/120777207:

tests/python/pants_test/engine/exp/test_engine.py::EngineTest::test_rerun_with_cache Process Process-25:
Traceback (most recent call last):
 File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
   self.run()
 File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
   self._target(*self._args, **self._kwargs)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/b06d768230d95db8675f2b397f5d6c843bdc341a/pants/engine/exp/processing.py", line 22, in _stateful_pool_loop
   result = function(states, item)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/b06d768230d95db8675f2b397f5d6c843bdc341a/pants/engine/exp/engine.py", line 159, in _execute_step
   resolved_request = storage.resolve_request(step)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/b06d768230d95db8675f2b397f5d6c843bdc341a/pants/engine/exp/storage.py", line 269, in resolve_request
   dependencies[self._from_key(dep)] = self._from_key(state)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/b06d768230d95db8675f2b397f5d6c843bdc341a/pants/engine/exp/storage.py", line 285, in _from_key
   return self.get(obj)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/b06d768230d95db8675f2b397f5d6c843bdc341a/pants/engine/exp/storage.py", line 210, in get
   return self._assert_type_matches(pickle.load(value), key.type)
UnpicklingError: pickle data was truncated

No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated

@kwlzn
Copy link
Member Author

kwlzn commented Apr 6, 2016

error #2:

tests/python/pants_test/engine/exp/test_engine.py::EngineTest::test_rerun_with_cache Process Process-9:
Traceback (most recent call last):
 File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
   self.run()
 File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
   self._target(*self._args, **self._kwargs)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/0c6332080616d4c8330c1eca31765c0909b53394/pants/engine/exp/processing.py", line 22, in _stateful_pool_loop
   result = function(states, item)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/0c6332080616d4c8330c1eca31765c0909b53394/pants/engine/exp/engine.py", line 159, in _execute_step
   resolved_request = storage.resolve_request(step)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/0c6332080616d4c8330c1eca31765c0909b53394/pants/engine/exp/storage.py", line 269, in resolve_request
   dependencies[self._from_key(dep)] = self._from_key(state)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/0c6332080616d4c8330c1eca31765c0909b53394/pants/engine/exp/storage.py", line 285, in _from_key
   return self.get(obj)
 File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/0c6332080616d4c8330c1eca31765c0909b53394/pants/engine/exp/storage.py", line 210, in get
   return self._assert_type_matches(pickle.load(value), key.type)
UnpicklingError: invalid load key, 'x'.
ivy_resolve executed for (Jar(address=3rdparty/jvm:scala-library),), returned: Classpath(creator=u'ivy_resolve')
scalac executed for (ScalaSources(dependencies=[3rdparty/jvm:scala-library], files=[u'Scrooge.scala'], spec_path=u'src/scala/scrooge', type_alias=u'scala'), [Classpath(creator=u'ivy_resolve')]), returned: Classpath(creator=u'scalac')

No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated

@kwlzn kwlzn changed the title CI Hang: TBD CI hangs in test_engine.py Apr 6, 2016
@peiyuwang peiyuwang self-assigned this Apr 6, 2016
@kwlzn
Copy link
Member Author

kwlzn commented Apr 6, 2016

#3:

 tests/python/pants_test/engine/exp/test_engine.py::EngineTest::test_rerun_with_cache Process Process-49:
 Traceback (most recent call last):
   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
     self.run()
   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
     self._target(*self._args, **self._kwargs)
   File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/559ff7f5812767a0527d5dcd1470ecb1a2d89127/pants/engine/exp/processing.py", line 22, in _stateful_pool_loop
     result = function(states, item)
   File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/559ff7f5812767a0527d5dcd1470ecb1a2d89127/pants/engine/exp/engine.py", line 159, in _execute_step
     resolved_request = storage.resolve_request(step)
   File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/559ff7f5812767a0527d5dcd1470ecb1a2d89127/pants/engine/exp/storage.py", line 269, in resolve_request
     dependencies[self._from_key(dep)] = self._from_key(state)
   File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/559ff7f5812767a0527d5dcd1470ecb1a2d89127/pants/engine/exp/storage.py", line 285, in _from_key
     return self.get(obj)
   File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/559ff7f5812767a0527d5dcd1470ecb1a2d89127/pants/engine/exp/storage.py", line 210, in get
     return self._assert_type_matches(pickle.load(value), key.type)
 UnpicklingError: unpickling stack underflow

peiyuwang added a commit that referenced this issue Apr 6, 2016
We intended to catch all subprocesses exceptions, for example the test case
EngineTest.test_multiprocess_unpickleable checks SerializationError, but some
later added code wasn't protected, including line #159 that caused
#3149 test to hang.

This review fixes this by moving everything into the same try block.

This review does not fix the root cause of #3149
but next time it happens the subprocess won't die and will report the
exception back to the engine.

Testing Done:
https://travis-ci.org/peiyuwang/pants/builds/121229553 passed.

Bugs closed: 3149, 3155

Reviewed at https://rbcommons.com/s/twitter/r/3656/
@peiyuwang
Copy link
Contributor

After https://rbcommons.com/s/twitter/r/3656/ EngineTest continues to fail intermittently on master, it no longer hangs but also not as we would have hoped to return the exception from subprocess.

https://travis-ci.org/pantsbuild/pants/jobs/121272283 simply says FAILURE

                     tests/python/pants_test/engine/exp/test_engine.py .WARN] No .coverage file was found! Skipping coverage reporting.
FAILURE
21:38:46 01:00   [complete]
               FAILURE

@peiyuwang
Copy link
Contributor

Happening again https://travis-ci.org/pantsbuild/pants/jobs/121598230

This time we have assertion failure as expected

                     ==================== FAILURES ====================
                     ___ EngineTest.test_multiprocess_engine_multi ____

                     self = <pants_test.engine.exp.test_engine.EngineTest testMethod=test_multiprocess_engine_multi>

                         def test_multiprocess_engine_multi(self):
                           with self.multiprocessing_engine() as engine:
                     >       self.assert_engine(engine)

                     tests/python/pants_test/engine/exp/test_engine.py:51: 
                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                     tests/python/pants_test/engine/exp/test_engine.py:32: in assert_engine
                         result = engine.execute(self.request(['compile'], self.java))
                     .pants.d/python-setup/chroots/17d386dc1cd74f32935c4fbb76dc141b7645ddd8/pants/engine/exp/engine.py:86: in execute
                         self.reduce(execution_request)
                     .pants.d/python-setup/chroots/17d386dc1cd74f32935c4fbb76dc141b7645ddd8/pants/engine/exp/engine.py:273: in reduce
                         await_one()
                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

                         def await_one():
                           """Await one completed step, and remove it from in_flight."""
                           if not in_flight:
                             raise Exception('Awaited an empty pool!')
                           step_id, result = self._pool.await_one_result()
                           if isinstance(result, Exception):
                     >       raise result
                     E       UnpicklingError: unpickling stack underflow

                     .pants.d/python-setup/chroots/17d386dc1cd74f32935c4fbb76dc141b7645ddd8/pants/engine/exp/engine.py:251: UnpicklingError
                     -------------- Captured stderr call --------------
                     WARN:pants.engine.exp.engine:Traceback (most recent call last):
                       File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/17d386dc1cd74f32935c4fbb76dc141b7645ddd8/pants/engine/exp/engine.py", line 168, in _execute_step
                         return (step_id, execute())
                       File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/17d386dc1cd74f32935c4fbb76dc141b7645ddd8/pants/engine/exp/engine.py", line 160, in execute
                         resolved_request = storage.resolve_request(step)
                       File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/17d386dc1cd74f32935c4fbb76dc141b7645ddd8/pants/engine/exp/storage.py", line 269, in resolve_request
                         dependencies[self._from_key(dep)] = self._from_key(state)
                       File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/17d386dc1cd74f32935c4fbb76dc141b7645ddd8/pants/engine/exp/storage.py", line 285, in _from_key
                         return self.get(obj)
                       File "/home/travis/build/pantsbuild/pants/.pants.d/python-setup/chroots/17d386dc1cd74f32935c4fbb76dc141b7645ddd8/pants/engine/exp/storage.py", line 210, in get
                         return self._assert_type_matches(pickle.load(value), key.type)
                     UnpicklingError: unpickling stack underflow

                      1 failed, 872 passed, 1 xfailed, 2 xpassed, 4 warnings in 344.32 seconds 

@peiyuwang
Copy link
Contributor

@patricklaw saw another failure from test_multiprocess_engine_multi, will skip this one as well.

@jsirois
Copy link
Contributor

jsirois commented Apr 20, 2016

@peiyuwang looks like a skip for test_multiprocess_engine_multi never landed. This hit me just now in master: https://travis-ci.org/pantsbuild/pants/jobs/124332823:

                     ==================== FAILURES ====================
                     ___ EngineTest.test_multiprocess_engine_multi ____

                     self = <pants_test.engine.exp.test_engine.EngineTest testMethod=test_multiprocess_engine_multi>

                         def test_multiprocess_engine_multi(self):
                           with self.multiprocessing_engine() as engine:
                     >       self.assert_engine(engine)

                     tests/python/pants_test/engine/exp/test_engine.py:51: 
                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                     tests/python/pants_test/engine/exp/test_engine.py:32: in assert_engine
                         result = engine.execute(self.request(['compile'], self.java))
                     .pants.d/python-setup/chroots/c8c69da81bb70e263e0c94ed9592106c754a1ff5/pants/engine/exp/engine.py:86: in execute
                         self.reduce(execution_request)
                     .pants.d/python-setup/chroots/c8c69da81bb70e263e0c94ed9592106c754a1ff5/pants/engine/exp/engine.py:273: in reduce
                         await_one()
                     .pants.d/python-setup/chroots/c8c69da81bb70e263e0c94ed9592106c754a1ff5/pants/engine/exp/engine.py:252: in await_one
                         result = self._storage.resolve_result(result)
                     .pants.d/python-setup/chroots/c8c69da81bb70e263e0c94ed9592106c754a1ff5/pants/engine/exp/storage.py:276: in resolve_result
                         return StepResult(state=self._from_key(step_result.state))
                     .pants.d/python-setup/chroots/c8c69da81bb70e263e0c94ed9592106c754a1ff5/pants/engine/exp/storage.py:285: in _from_key
                         return self.get(obj)
                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

                     self = <pants.engine.exp.storage.Storage object at 0x566d110>
                     key = Key(8aca88c6552bf7455757f337e50bd27a29fc37b8:[Waiting(dependencies=[SelectNode...e.exp.examples.planners.ResourceSources'>, variants=None, variant_key=None)])])

                         def get(self, key):
                           """Given a key, return its deserialized content.

                             Note that since this is not a cache, if we do not have the content for the object, this
                             operation fails noisily.
                             """
                           if not isinstance(key, Key):
                             raise InvalidKeyError('Not a valid key: {}'.format(key))

                           value = self._contents.get(key.digest)
                           if isinstance(value, six.binary_type):
                             # loads for string-like values
                             return self._assert_type_matches(pickle.loads(value), key.type)
                           # load for file-like value from buffers
                     >     return self._assert_type_matches(pickle.load(value), key.type)
                     E     UnpicklingError: invalid load key, '
                     E     '.

                     .pants.d/python-setup/chroots/c8c69da81bb70e263e0c94ed9592106c754a1ff5/pants/engine/exp/storage.py:210: UnpicklingError

Logs attached for posterity. I'll send up an RB now skipping `test_multiprocess_engine_multi

.

@peiyuwang peiyuwang changed the title CI hangs in test_engine.py Flaky UnpicklingError in test_engine.py Apr 20, 2016
@peiyuwang
Copy link
Contributor

So I added a catch block and print out the hexlified unpicklable data in hex, and unhexlify them offline:

Key(6ed4db33098fb4e81c1d71aaa0fcc91bd1f2b670:[Noop(msg=u"No source of SelectNode(subject=Dir(path=u'src/thrift/codegen/simple'), product=<class 'pants.engine.exp.fs.PathDirWildcard'>, variants=None, variant_key=None).")])

Decoded unpickable data:

X?Was missing (at least) input SelectNode(subject=Dir(path=u'src/thrift/codegen/simple'), product=<class 'pants.base.specs.DescendantAddresses'>, variants=None, variant_key=None).??}b.?iv?Ӟ
Key(cb6c9166aa17bc9dc81806deee3d6d23c15329fb:[SelectNode(subject=3rdparty/jvm:scala-library, product=<class 'pants.engine.exp.struct.Variants'>, variants=None, variant_key=None)])

Decoded unpickable data:

.examples.planners.ScalaInferredDepsSources'>, variants=None, variant_key=None).??}b.???ϝ<?6?Ңй?W?????cpants.engine.exp.nodes
FilesystemNode
cpants.engine.exp.fs
Dir
Xsrc/scala/scrooge??}bcpan
Key(9f37f6656f1c91fd0df876016735f9d233f0069f:[SelectNode(subject=3rdparty/jvm:guava, product=<class 'pants.engine.exp.struct.Struct'>, variants=None, variant_key=None)])/<class 'pants.engine.exp.nodes.SelectNode'>

Decoded unpickable data:

}?cpants.engine.exp.nodes?  ??c?
SelectNode
(cpants.engine.exp.fs
Links
)??}bcpants.engine.exp.examples.planners
JVMPackageName
????d??n?o␌

Observations:

  • Since we store Key in verbose mode, it appears the unpickleable values are completely irrelevant data for the key.
  • I also tried to unpickle these data after unhexlify, not surprisingly, still unpickable, although in some different UnpicklingError this time.

I remember tried to switch to in memory mode in test and still seeing the error. I will try again just to eliminate this is not lmdb's fault.

@peiyuwang
Copy link
Contributor

Confirmed, not lmdb's fault, tried switched to InMemoryDb(f2810af), still seeing UnpicklingError

key=Key(62cf9446dc7cba7145a55cebdfb2428d9ef331cb:[Return(value=Target(address=src/scala/scrooge))])/<class 'pants.engine.exp.nodes.Return'>

Decoded unpickable data:

Y?;?*???cpants.engine.exp.nodes
Waiting
]cpants.engine.exp.nodes
ProjectionNode
(cpants.build_graph.address
Address
)?}(U_hashI-5142004407247440001
U
 _target_nameXscroogeU
_spec_pathXsrc/scala/scroogeubcpants.engine.exp.fs
FilesContent
Ncpants.engine.exp.fs
PathGlobs
X
path_globs?cpants.engine.exp.examples.planners
ScalaInferredDepsSources
t?}ba??}b.?2Υ }Aڥ*]V{??8V??U?cpants.engine.exp.nodes
Waiting
]cp

@peiyuwang
Copy link
Contributor

Also ruled out is cPickle, switched to pickle makes no difference.

Then cStringIO makes the difference. We do have some code that gets a buffer from cStringIO. (The other place is used only by lmdb so can't be it since I switched to InMemoryDb during investigation) The usage looks alright to me though.

See https://travis-ci.org/peiyuwang/pants/builds/ for pickle/cPickle, StringIO/CStringIO combinations

cStringIO, cPickle: job 457 to 467 all failed except for 1
cStringIO, pickle:  job 469 to 479 all failed 
StringIO, cPickle:  job 480 to 490 all passed
StringIO, pickle:   job 491 to 501 all passed

One possibility is its memory space has been GC-ed and reallocated, the reason it hits travis could be due to 1) multiprocessing has higher memory overhead, 2) we use #cpu to allocate multiprocessing pool size, 3) travis has 32 cpu compared to local mac, so on travis cStringIO GC is more likely to happen.

@peiyuwang
Copy link
Contributor

@kwlzn suggests to use BytesIO, which works for me locally and performance is comparable with cStringIO.

However we have two places using BytesIO, , the other place now failed on travis, this appears to be a 2.7.9 bug, used by travis, I have 2.7.10. Possibly this issue 22113

                           with self._env.begin(db=self._db, buffers=True) as txn:
                             value = txn.get(key)
                             if value is not None:
                     >         return BytesIO(value)
                     E         TypeError: 'buffer' does not have the buffer interface

BytesIO-log.txt

So I ended up using BytesIO for write, but StringIO for read. There is some performance loss but not as big as both using StringIO.

@jsirois
Copy link
Contributor

jsirois commented Apr 25, 2016

From .travis.osx.yml:

before_install:
  # We need a sane python and as of 3/5/2015 on OSX 10.9 the provided python 2.7.5 encounters
  # strange issues with absolute imports in the coverage module whereas the brew python 2.7.9 has
  # no such issues.
  - brew update
  - brew install python

How about a similar approach for the linux build? I personally use pyenv on my linux machine to manage user-specific python interpreters. This is free from a sudo requirement. This would allow install of python 2.7.11 for example, though at a CI shard speed hit. It may be worth an experiment to determine what the CI shard setup speed hit is in practice.

@peiyuwang
Copy link
Contributor

Saw #3271, thanks! @jsirois

@peiyuwang
Copy link
Contributor

Reviewed at https://rbcommons.com/s/twitter/r/3761/
Merged as f22f8cc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants