[Bug]: dstack server fails to process a run if it failed during provision by shim #1720

Closed
un-def opened this issue Sep 24, 2024 · 0 comments · Fixed by #1721
Labels
bug Something isn't working

Comments

@un-def
Collaborator

un-def commented Sep 24, 2024

Steps to reproduce

Prerequisites:

  • PostgreSQL (cannot be reproduced with SQLite as SQLite ignores text encoding)
  • VM-based or on-prem instance (that is, the dstack-shim is used)
  1. Submit a run that will fail during the provision step, specifically in the init shell script. For example, use a non-DEB/RPM-based distro (Alpine, which has neither apt-get nor yum) or a distro with a broken package manager (CentOS, where all versions have reached EOL and the repositories are shut down).
  2. Check server logs.

Actual behaviour

The server falls into a loop: each run of the process_running_jobs background task fails with sqlalchemy.dialects.postgresql.asyncpg.AsyncAdapt_asyncpg_dbapi.Error: <class 'asyncpg.exceptions.CharacterNotInRepertoireError'>: invalid byte sequence for encoding "UTF8": 0x00
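
For illustration, here is a minimal sketch of why the UPDATE is rejected: PostgreSQL text columns cannot store the NUL character (0x00), and the reason_message reported by the shim contains several of them. The sample string is the first fragment of the reason_message from the server logs below; the helper names `find_nul_offsets` and `strip_nul` are hypothetical and not part of dstack.

```python
def find_nul_offsets(message: str) -> list[int]:
    """Return the positions of NUL characters, which PostgreSQL text types reject."""
    return [i for i, ch in enumerate(message) if ch == "\x00"]


def strip_nul(message: str) -> str:
    """Drop NUL characters so the string can be stored in a text column."""
    return message.replace("\x00", "")


# First fragment of the reason_message from the shim status in the logs.
reason_message = "\x02\x00\x00\x00\x00\x00\x00$/bin/sh: apt-get: command not found\n"

print(find_nul_offsets(reason_message))
print(repr(strip_nul(reason_message)))
```

Stripping NUL characters would only be a workaround: the remaining bytes of each frame header (such as "\x02" and "$" above) would still end up in the stored message.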

Expected behaviour

No response

dstack version

0.18.15rc1

Server logs

[14:04:35] DEBUG    dstack._internal.server.background.tasks.process_runs:87 run(230d0b)horrible-cobra-1: processing run
[14:04:36] DEBUG    dstack._internal.server.background.tasks.process_running_jobs:227 job(0729b2)horrible-cobra-1-0-0: process pulling job
                    with shim, age=0:00:54.131827
           WARNING  dstack._internal.server.background.tasks.process_running_jobs:446 shim failed to execute job horrible-cobra-1-0-0:
                    CONTAINER_EXITED_WITH_ERROR ($/bin/sh: apt-get: command not found
                    PCentOS Linux 8 - AppStream                      0.0  B/s |   0  B     00:00
                    ?Errors during downloading metadata for repository 'appstream':
                    �  - Curl error (6): Couldn't resolve host name for
                    http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=container [Could not resolve host:
                    mirrorlist.centos.org]
                    Error: Failed to download metadata for repo 'appstream': Cannot prepare internal mirrorlist: Curl error (6): Couldn't
                    resolve host name for http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=container [Could not
                    resolve host: mirrorlist.centos.org])
           DEBUG    dstack._internal.server.background.tasks.process_running_jobs:452 shim status: {'state': 'pending', 'executor_error': '',
                    'container_name': 'horrible-cobra-1-0-0', 'status': 'exited', 'running': False, 'oom_killed': False, 'dead': False,
                    'exit_code': 1, 'error': '', 'result': {'reason': 'CONTAINER_EXITED_WITH_ERROR', 'reason_message':
                    "\x02\x00\x00\x00\x00\x00\x00$/bin/sh: apt-get: command not found\n\x01\x00\x00\x00\x00\x00\x00PCentOS Linux 8 - AppStream
                    0.0  B/s |   0  B     00:00    \n\x02\x00\x00\x00\x00\x00\x00?Errors during downloading metadata for repository
                    'appstream':\n\x02\x00\x00\x00\x00\x00\x00�  - Curl error (6): Couldn't resolve host name for
                    http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=container [Could not resolve host:
                    mirrorlist.centos.org]\n\x02\x00\x00\x00\x00\x00\x01\x0fError: Failed to download metadata for repo 'appstream': Cannot
                    prepare internal mirrorlist: Curl error (6): Couldn't resolve host name for
                    http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=container [Could not resolve host:
                    mirrorlist.centos.org]"}}
           WARNING  dstack._internal.server.background.tasks.process_running_jobs:261 job(0729b2)horrible-cobra-1-0-0: failed because runner
                    is not available or return an error,  age=0:00:54.390122
           ERROR    apscheduler.executors.default:36 Job "process_running_jobs (trigger: interval[0:00:04], next run at: 2024-09-24 14:04:41
                    UTC)" raised an exception
                    Traceback (most recent call last):
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py",
                    line 538, in _prepare_and_execute
                        self._rows = deque(await prepared_stmt.fetch(*parameters))
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/asyncpg/prepared_stmt.py", line 176, in
                    fetch
                        data = await self.__bind_execute(args, 0, timeout)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/asyncpg/prepared_stmt.py", line 241, in
                    __bind_execute
                        data, status, _ = await self.__do_execute(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/asyncpg/prepared_stmt.py", line 230, in
                    __do_execute
                        return await executor(protocol)
                               ^^^^^^^^^^^^^^^^^^^^^^^^
                      File "asyncpg/protocol/protocol.pyx", line 207, in bind_execute
                    asyncpg.exceptions.CharacterNotInRepertoireError: invalid byte sequence for encoding "UTF8": 0x00

                    The above exception was the direct cause of the following exception:

                    Traceback (most recent call last):
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in
                    _exec_single_context
                        self.dialect.do_execute(
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 924, in
                    do_execute
                        cursor.execute(statement, parameters)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py",
                    line 572, in execute
                        self._adapt_connection.await_(
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/util/_concurrency_py3k.py", line
                    132, in await_only
                        return current.parent.switch(awaitable)  # type: ignore # noqa: E501
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/util/_concurrency_py3k.py", line
                    196, in greenlet_spawn
                        value = await result
                                ^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py",
                    line 550, in _prepare_and_execute
                        self._handle_exception(error)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py",
                    line 501, in _handle_exception
                        self._adapt_connection._handle_exception(error)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py",
                    line 784, in _handle_exception
                        raise translated_error from error
                    sqlalchemy.dialects.postgresql.asyncpg.AsyncAdapt_asyncpg_dbapi.Error: <class
                    'asyncpg.exceptions.CharacterNotInRepertoireError'>: invalid byte sequence for encoding "UTF8": 0x00

                    The above exception was the direct cause of the following exception:

                    Traceback (most recent call last):
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/apscheduler/executors/base_py3.py", line
                    30, in run_coroutine_job
                        retval = await job.func(*job.args, **job.kwargs)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/dev/dstack/src/dstack/_internal/server/background/tasks/process_running_jobs.py", line 76, in
                    process_running_jobs
                        await _process_running_job(session=session, job_model=job_model)
                      File "/home/def/dev/dstack/src/dstack/_internal/server/background/tasks/process_running_jobs.py", line 290, in
                    _process_running_job
                        await session.commit()
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/ext/asyncio/session.py", line
                    1009, in commit
                        await greenlet_spawn(self.sync_session.commit)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/util/_concurrency_py3k.py", line
                    203, in greenlet_spawn
                        result = context.switch(value)
                                 ^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2017, in
                    commit
                        trans.commit(_to_root=True)
                      File "<string>", line 2, in commit
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/state_changes.py", line 139,
                    in _go
                        ret_value = fn(self, *arg, **kw)
                                    ^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1302, in
                    commit
                        self._prepare_impl()
                      File "<string>", line 2, in _prepare_impl
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/state_changes.py", line 139,
                    in _go
                        ret_value = fn(self, *arg, **kw)
                                    ^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1277, in
                    _prepare_impl
                        self.session.flush()
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 4341, in
                    flush
                        self._flush(objects)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 4476, in
                    _flush
                        with util.safe_reraise():
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/util/langhelpers.py", line 146,
                    in __exit__
                        raise exc_value.with_traceback(exc_tb)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 4437, in
                    _flush
                        flush_context.execute()
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/unitofwork.py", line 466, in
                    execute
                        rec.execute(self)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/unitofwork.py", line 642, in
                    execute
                        util.preloaded.orm_persistence.save_obj(
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/persistence.py", line 85, in
                    save_obj
                        _emit_update_statements(
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/orm/persistence.py", line 912,
                    in _emit_update_statements
                        c = connection.execute(
                            ^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in
                    execute
                        return meth(
                               ^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/sql/elements.py", line 515, in
                    _execute_on_connection
                        return connection._execute_clauseelement(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in
                    _execute_clauseelement
                        ret = self._execute_context(
                              ^^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in
                    _execute_context
                        return self._exec_single_context(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1986, in
                    _exec_single_context
                        self._handle_dbapi_exception(
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2353, in
                    _handle_dbapi_exception
                        raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in
                    _exec_single_context
                        self.dialect.do_execute(
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 924, in
                    do_execute
                        cursor.execute(statement, parameters)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py",
                    line 572, in execute
                        self._adapt_connection.await_(
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/util/_concurrency_py3k.py", line
                    132, in await_only
                        return current.parent.switch(awaitable)  # type: ignore # noqa: E501
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/util/_concurrency_py3k.py", line
                    196, in greenlet_spawn
                        value = await result
                                ^^^^^^^^^^^^
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py",
                    line 550, in _prepare_and_execute
                        self._handle_exception(error)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py",
                    line 501, in _handle_exception
                        self._adapt_connection._handle_exception(error)
                      File "/home/def/.local/share/virtualenvs/dstack/lib/python3.12/site-packages/sqlalchemy/dialects/postgresql/asyncpg.py",
                    line 784, in _handle_exception
                        raise translated_error from error
                    sqlalchemy.exc.DBAPIError: (sqlalchemy.dialects.postgresql.asyncpg.Error) <class
                    'asyncpg.exceptions.CharacterNotInRepertoireError'>: invalid byte sequence for encoding "UTF8": 0x00
                    [SQL: UPDATE jobs SET last_processed_at=$1::TIMESTAMP WITHOUT TIME ZONE, status=$2::jobstatus,
                    termination_reason=$3::jobterminationreason, termination_reason_message=$4::VARCHAR WHERE jobs.id = $5::UUID]
                    [parameters: (datetime.datetime(2024, 9, 24, 14, 4, 36, 864066), 'TERMINATING', 'CONTAINER_EXITED_WITH_ERROR',
                    "\x02\x00\x00\x00\x00\x00\x00$/bin/sh: apt-get: command not found\n\x01\x00\x00\x00\x00\x00\x00PCentOS Linux 8 - AppStream
                    0.0  B ... (485 characters truncated) ... olve host name for
                    http://mirrorlist.centos.org/?release=8&arch=x86_64&repo=AppStream&infra=container [Could not resolve host:
                    mirrorlist.centos.org]", '0729b23a-b990-4354-8bfa-38c52c223695')]
                    (Background on this error at: https://sqlalche.me/e/20/dbapi)

Additional information

No response

@un-def un-def added the bug Something isn't working label Sep 24, 2024
un-def added a commit that referenced this issue Sep 24, 2024
As the shim creates the container with Tty: false (the default),
Client.ContainerLogs returns the logs in the multiplexed format,
which requires additional processing (demultiplexing).

See ContainerLogs documentation for details.

Fixes: #1720
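
To illustrate the demultiplexing mentioned in the commit message, here is a rough Python sketch (the shim itself is written in Go, so this is not the actual fix). In Docker's multiplexed format, each frame starts with an 8-byte header: one byte for the stream type (0 stdin, 1 stdout, 2 stderr), three zero bytes, and a big-endian uint32 payload size. The function name `demultiplex` is hypothetical. The first frame in the sample is taken from the logs above; the second is made up for the example.

```python
import struct

STREAM_NAMES = {0: "stdin", 1: "stdout", 2: "stderr"}
HEADER_SIZE = 8


def demultiplex(raw: bytes) -> list[tuple[str, bytes]]:
    """Split a Docker multiplexed log stream into (stream name, payload) frames."""
    frames = []
    offset = 0
    while offset + HEADER_SIZE <= len(raw):
        # Header layout: stream type (1 byte), 3 padding bytes, payload size (uint32, big-endian).
        stream_type, size = struct.unpack_from(">BxxxI", raw, offset)
        offset += HEADER_SIZE
        payload = raw[offset:offset + size]
        offset += size
        frames.append((STREAM_NAMES.get(stream_type, "unknown"), payload))
    return frames


raw = (
    # Frame from the server logs: stderr, size 0x24 (36 bytes, shown as "$" in the log).
    b"\x02\x00\x00\x00\x00\x00\x00$/bin/sh: apt-get: command not found\n"
    # Hypothetical stdout frame, added only for this example.
    b"\x01\x00\x00\x00\x00\x00\x00\x06hello\n"
)

for stream, payload in demultiplex(raw):
    print(stream, payload.decode("utf-8", errors="replace"), end="")
```

This also explains the stray characters in the WARNING log line: "$", "P" and "?" are the last byte of each frame's size field rendered as text.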
@un-def un-def closed this as completed in 42b9bcb Sep 25, 2024