window_resize_rewalk: traceback #6325

Open
oliver-sanders opened this issue Aug 23, 2024 · 5 comments
Labels
bug: Something is wrong :(
needs reproducing: A bug report that does not yet have a reproducible example
Milestone

Comments

@oliver-sanders
Member

Spotted in the wild:

INFO - Command "set_graph_window_extent" received.    
    set_graph_window_extent(n_edge_distance=2)   
CRITICAL - An uncaught error caused Cylc to shut down.    
    If you think this was an issue in Cylc, please report the following traceback to the developers.    
    https://github.com/cylc/cylc-flow/issues/new?assignees=&labels=bug&template=bug.md&title=;    
ERROR - 'bool' object has no attribute 'flow_nums'    
    Traceback (most recent call last):    
      File "cylc/flow/scheduler.py", line 652, in run_scheduler    
        await self._main_loop()    
      File "cylc/flow/scheduler.py", line 1557, in _main_loop    
        await self.update_data_structure()    
      File "cylc/flow/scheduler.py", line 1639, in update_data_structure
        self.data_store_mgr.update_data_structure()    
      File "cylc/flow/data_store_mgr.py", line 1717, in update_data_structure
        self.window_resize_rewalk()    
      File "cylc/flow/data_store_mgr.py", line 1788, in window_resize_rewalk
        deserialise_set(tproxy.flow_nums)
    AttributeError: 'bool' object has no attribute 'flow_nums'
CRITICAL - Workflow shutting down - 'bool' object has no attribute 'flow_nums'

No set_graph_window_extent command had been issued before this one, so the window extent must have been 1 beforehand.

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Aug 23, 2024
@oliver-sanders oliver-sanders added this to the 8.3.x milestone Aug 23, 2024
@MetRonnie MetRonnie self-assigned this Aug 23, 2024
@MetRonnie MetRonnie modified the milestones: 8.3.x, 8.3.4 Aug 23, 2024
@MetRonnie
Member

MetRonnie commented Aug 23, 2024

for tp_id in self.all_task_pool:
    tokens = Tokens(tp_id)
    tproxy: PbTaskProxy
    _, tproxy = self.store_node_fetcher(tokens)
    self.increment_graph_window(
        tokens,
        get_point(tokens['cycle']),
        deserialise_set(tproxy.flow_nums)
    )

A PbTaskProxy was not found in the store despite the task ID being added to self.all_task_pool; store_node_fetcher returns False in place of the proxy in that case:

def store_node_fetcher(self, tokens: Tokens) -> Tuple[str, Any]:
    """Check that task proxy is in or being added to the store"""
    node_type = {
        'task': TASK_PROXIES,
        'job': JOBS,
    }[tokens.lowest_token]
    node_id = tokens.id
    if node_id in self.added[node_type]:
        return (node_id, self.added[node_type][node_id])
    elif node_id in self.data[self.workflow_id][node_type]:
        return (node_id, self.data[self.workflow_id][node_type][node_id])
    return (node_id, False)
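
To illustrate the failure mode: when the ID is in neither `added` nor `data`, the fetcher hands back `False`, and the caller then reads `.flow_nums` from it. The sketch below is a minimal stand-in, not the Cylc code; `FakeStore`, its flat dict layout and `flow_nums_error` are hypothetical and only mirror the lookup order shown above.

```python
from typing import Any, Dict, Tuple


class FakeStore:
    """Hypothetical stand-in mirroring the lookup order of store_node_fetcher."""

    def __init__(self) -> None:
        self.added: Dict[str, Any] = {}  # nodes queued for addition
        self.data: Dict[str, Any] = {}   # nodes already in the store

    def store_node_fetcher(self, node_id: str) -> Tuple[str, Any]:
        if node_id in self.added:
            return (node_id, self.added[node_id])
        elif node_id in self.data:
            return (node_id, self.data[node_id])
        # "Not found" sentinel: a bool, not a proxy object.
        return (node_id, False)


def flow_nums_error(store: FakeStore, node_id: str) -> str:
    """Return the AttributeError message raised for a missing node, else ''."""
    _, tproxy = store.store_node_fetcher(node_id)
    try:
        tproxy.flow_nums
    except AttributeError as exc:
        return str(exc)
    return ""
```

Calling `flow_nums_error(FakeStore(), "1/foo")` on an empty store yields the same message as the reported traceback, which is consistent with the proxy lookup having missed.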

Do we have a copy of this workflow?

@oliver-sanders
Member Author

Yes, but it's non-trivial, will PM you.

@MetRonnie
Member

I had a quick go at reproducing using a copy of the workflow in sim mode; no luck.

A PbTaskProxy was not found in the store despite the task ID being added to self.all_task_pool

I'm not sure how this happens, or what should be done about it.

@MetRonnie MetRonnie removed their assignment Aug 23, 2024
@oliver-sanders oliver-sanders modified the milestones: 8.3.4, 8.3.x Aug 27, 2024
@dwsutherland
Member

dwsutherland commented Aug 28, 2024

I'm not entirely sure why it's happening, and if I can't reproduce it, it's hard to pinpoint...
self.all_task_pool is populated by the task pool:

def create_data_store_elements(self, itask):
    """Create the node window elements about given task proxy."""
    # Register pool node reference
    self.data_store_mgr.add_pool_node(itask.tdef.name, itask.point)
    # Create new data-store n-distance graph window about this task
    self.data_store_mgr.increment_graph_window(
        itask.tokens,
        itask.point,
        itask.flow_nums,
        is_manual_submit=itask.is_manual_submit,
        itask=itask
    )
    self.data_store_mgr.delta_task_state(itask)
    self.data_store_mgr.delta_task_held(itask)
    self.data_store_mgr.delta_task_queued(itask)
    self.data_store_mgr.delta_task_runahead(itask)

So the mismatch can only arise here if data_store_mgr.increment_graph_window doesn't create the task proxy (which can only happen if it's already in the store).

and removed by:

def remove(self, itask, reason=None):
    """Remove a task from the pool."""
    if itask.state.is_runahead and itask.flow_nums:
        # If removing a parentless runahead-limited task
        # auto-spawn its next instance first.
        self.spawn_if_parentless(
            itask.tdef,
            itask.tdef.next_point(itask.point),
            itask.flow_nums
        )
    msg = "removed from active task pool"
    if reason is None:
        msg += ": completed"
    else:
        msg += f": {reason}"
    if itask.is_xtrigger_sequential:
        self.xtrigger_mgr.sequential_spawn_next.discard(itask.identity)
        self.xtrigger_mgr.sequential_has_spawned_next.discard(
            itask.identity
        )
    try:
        del self.active_tasks[itask.point][itask.identity]
    except KeyError:
        pass
    else:
        self.tasks_removed = True
        self.active_tasks_changed = True
        if not self.active_tasks[itask.point]:
            del self.active_tasks[itask.point]
        self.task_queue_mgr.remove_task(itask)
        if itask.tdef.max_future_prereq_offset is not None:
            self.set_max_future_offset()
        # Notify the data-store manager of their removal
        # (the manager uses window boundary tracking for pruning).
        self.data_store_mgr.remove_pool_node(itask.tdef.name, itask.point)
        # Event-driven final update of task_states table.
        # TODO: same for datastore (still updated by scheduler loop)
        self.workflow_db_mgr.put_update_task_state(itask)
        level = logging.DEBUG
        if itask.state(
            TASK_STATUS_PREPARING,
            TASK_STATUS_SUBMITTED,
            TASK_STATUS_RUNNING,
        ):
            level = logging.WARNING
            msg += " - active job orphaned"
        LOG.log(level, f"[{itask}] {msg}")
        del itask

(If the except branch is hit, i.e. the task was not in active_tasks, then it is removed from neither the store nor self.all_task_pool.)

And the window resize happens before any pruning.

One thing we can say:

  • if it's in the TaskPool then it should be in both self.all_task_pool and the data-store.
  • if it's not in the TaskPool then it shouldn't be in self.all_task_pool.
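
These two expectations amount to a consistency check between the pool, `all_task_pool` and the store. A rough diagnostic sketch follows, assuming plain collections of task ID strings in place of the real TaskPool and data-store structures; the function name, its arguments and the result keys are all hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_inconsistencies(
    pool_ids: Iterable[str],
    all_task_pool: Iterable[str],
    store_ids: Iterable[str],
) -> Dict[str, List[str]]:
    """Report task IDs that break the expected pool/store invariants."""
    pool: Set[str] = set(pool_ids)
    tracked: Set[str] = set(all_task_pool)
    stored: Set[str] = set(store_ids)
    return {
        # In the pool but not registered in all_task_pool.
        'pool_not_tracked': sorted(pool - tracked),
        # In the pool but with no proxy in the store.
        'pool_not_stored': sorted(pool - stored),
        # Registered in all_task_pool but no longer in the pool.
        'tracked_not_in_pool': sorted(tracked - pool),
        # Registered but missing from the store (the traceback case).
        'tracked_not_stored': sorted(tracked - stored),
    }
```

Logging a report like this just before the rewalk might help narrow down which invariant was broken if the problem recurs.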

So we can put in a workaround if needed, but that wouldn't properly "solve" the issue.
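
One possible shape for such a workaround, sketched outside Cylc: skip (and log) any ID whose proxy cannot be fetched instead of dereferencing it. `rewalk`, `fetch` and `walk` are hypothetical names standing in for window_resize_rewalk, store_node_fetcher and increment_graph_window; this only masks the inconsistency, it does not explain it.

```python
import logging
from typing import Any, Callable, Iterable, List, Tuple

LOG = logging.getLogger(__name__)


def rewalk(
    task_ids: Iterable[str],
    fetch: Callable[[str], Tuple[str, Any]],
    walk: Callable[[str, Any], None],
) -> List[str]:
    """Walk every task whose proxy is found; return the IDs skipped."""
    skipped: List[str] = []
    for tp_id in task_ids:
        _, tproxy = fetch(tp_id)
        if tproxy is False:
            # Not in the store: log and move on rather than crash.
            LOG.warning("task proxy not in data store: %s", tp_id)
            skipped.append(tp_id)
            continue
        walk(tp_id, tproxy)
    return skipped
```

Returning the skipped IDs keeps the evidence available for debugging, which seems preferable to silently ignoring them while the root cause is unknown.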

@dwsutherland
Member

It cannot be caused by a reload, because all the data-store attributes are reset (including all_task_pool):

        # Reset attributes/data-store on reload:
        if reloaded:
            self.__init__(self.schd, self.n_edge_distance)
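
As a small illustration of why that reset clears all_task_pool: calling `__init__` again on an existing instance rebinds every attribute that `__init__` assigns. `Manager` below is a hypothetical toy class, not the real data-store manager.

```python
class Manager:
    """Hypothetical toy class; only illustrates re-running __init__."""

    def __init__(self, n_edge_distance: int = 1) -> None:
        self.n_edge_distance = n_edge_distance
        self.all_task_pool: set = set()

    def reset(self) -> None:
        # Re-run __init__ on the same object, keeping the current extent.
        self.__init__(self.n_edge_distance)
```

Note that this pattern only resets attributes assigned inside `__init__`; anything attached to the instance elsewhere would survive the call.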

@oliver-sanders oliver-sanders added needs reproducing A bug report that does not yet have a reproducible example and removed investigation labels Sep 26, 2024