
backup: async backup-compact #296

Closed

Conversation


@SimonVrouwe commented Sep 2, 2021

Continues #286

With this PR, calling backup-compact now returns (almost) immediately, without blocking c-lightning. Compaction runs in a separate thread while CL operates as normal.

basic principle (a minimal sketch follows the list):

  1. record the latest data_version of the current backup
  2. compact up to that data_version into a clone (this is the bulk of the work, with thread switching to handle the db_write hook)
  3. block add_change in the main thread (this blocks CL), refresh the latest data_version and
    make the clone catch up with it by adding the missing versions
  4. atomically move clone --> backup and unblock
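
To make the flow concrete, here is a minimal, hypothetical sketch of the pattern in plain Python (as referenced above). The Backup class, the in-memory list and the _squash helper are illustrative stand-ins, not the actual FileBackend code:

import threading

class Backup:
    """Toy append-only backup illustrating the compaction flow (not the real FileBackend)."""

    def __init__(self):
        self.changes = []              # stand-in for the append-only backup file
        self.lock = threading.Lock()   # guards appends and the final swap

    def add_change(self, change):
        # Called from the db_write hook on the main thread; blocks only briefly.
        with self.lock:
            self.changes.append(change)

    def compact(self):
        # Runs in a separate thread.
        # 1. record the latest version we will compact up to
        with self.lock:
            stop = len(self.changes)

        # 2. bulk of the work, done without holding the lock, so add_change
        #    (and hence c-lightning) keeps running in parallel
        clone = [self._squash(self.changes[:stop])]

        # 3. block add_change, let the clone catch up with anything appended
        #    while we were compacting, then 4. swap atomically
        with self.lock:
            clone.extend(self.changes[stop:])
            self.changes = clone       # stand-in for the atomic clone --> backup move

    @staticmethod
    def _squash(changes):
        # Placeholder for collapsing many incremental changes into one snapshot.
        return ("snapshot", len(changes))

Running threading.Thread(target=backup.compact, daemon=True).start() then performs step 2 off the main thread, which is roughly the shape of what the PR does inside FileBackend.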

Before this PR, a backup-compact call would return a dict of stats when compaction completed.
Now it returns immediately with {"result": "compaction started"}, {"result": "compaction still in progress"}, or
{"result": {"backupsize": <size_in_bytes>, "version_count": 2}} when there is nothing to compact.
The stats can be found in the log; no idea who uses these.
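
For illustration, polling the command from Python with pyln-client might look roughly like this (the socket path is an assumption; the responses are the ones listed above):

from pyln.client import LightningRpc

# The socket path below is an assumption; point it at your own lightning-rpc.
rpc = LightningRpc("/home/user/.lightning/bitcoin/lightning-rpc")

print(rpc.call("backup-compact"))  # e.g. {"result": "compaction started"}
print(rpc.call("backup-compact"))  # e.g. {"result": "compaction still in progress"}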

All pretty much self-contained in FileBackend, so it also works with SocketBackend (which uses FileBackend on the server side). The double logging issue #232 is also fixed; maybe this can be improved further, see the task list.
Also includes some minor fixes/cleanups and comments, mostly for my own understanding.

I manually tested compacting + restoring a backup and running sqldiff against the original, and also tested with a remote SocketBackend.
With lightningd versions (v0.10.2+) that support "shutdown", an unfinished compaction (.compacting file) is cleaned up when shutting down. For users running v0.10.2 before fix #4959, the plugin waits to be killed last by the shutdown timeout, which seems prudent.

Also retired the deprecated --backup-destination option.

Todo:

  • add tests
  • make the entire FileBackend thread safe, i.e. also rewind -> in retrospect, this is overkill
  • make all logs in FileBackend be forwarded by SocketBackend -> out of scope, the server should handle these
  • documentation, including an example systemd script for weekly compacting

Other ideas:

  • benchmark backup.py and compare with other backup methods
  • replace threading with multiprocessing
  • autocompact when backup size passes a threshold

@SimonVrouwe

548dd09 fixes the pip error pytest-xdist 2.2.1 requires pytest>=6.0.0, but you have pytest 5.4.3 which is incompatible
when installing requirements-dev.txt, because pyln-testing 0.9.3 needed pytest 5.*

After 548dd09, running python3 -m pip install -r requirements.txt and python3 -m pip install -r requirements-dev.txt,
pip list includes:

pyln-bolt7                    1.0.2.186
pyln-client                   0.10.2.post0
pyln-proto                    0.10.2.post0
pyln-testing                  0.10.1

The failing integration tests (not backup.py) don't seem to be caused by the last commit, but by the newer CL version post190 being checked out (the previous post171 passed all tests).


@SimonVrouwe commented Dec 30, 2021

test_drain_and_refill fails on this, not raising an error:

# refill again with 100% should not be possible in a line_graph circle,
# this is not because of ln routing fees (turned off) but because of
# HTLC commitment tx fee margin that applies for the funder.
with pytest.raises(RpcError, match=r"Outgoing capacity problem"):
    l1.rpc.fill(scid12)

Initially I thought maybe option_zero_htlc_tx_fee got activated, making the above not raise the error.
The test started failing at commit 0aad222c2 in #4972, so I assume that enabled it?

edit: option_zero_htlc_tx_fee is not enabled in l1's node features: bool(0x80008808226aa2 & ((1<<22) | (1<<23))) = False

@SimonVrouwe changed the title from "WIP backup FileBackend: async backup-compact" to "backup: async backup-compact" on Jan 3, 2022
@SimonVrouwe

@cdecker I think this is ready for review/merge (if desired). The failing integration tests are not caused by this PR, see above.

cdecker previously approved these changes Jan 4, 2022

@cdecker left a comment


ACK 548dd09

Just two very minor things about comparing ints and Nones and a clarification for my benefit.

On the design: this makes both the compaction and the RPC result asynchronous, i.e., the RPC command returns useful information only in very few cases (when the compaction is not needed); otherwise it just states "doing something". Wouldn't it be better to make the response synchronous, and report size in bytes and number of changes before and after, which should be more informative? Keeping the request pending even for long times is practically free and we wouldn't have to poll.

Two review comments on backup/filebackend.py (resolved).

@cdecker commented Jan 4, 2022

@Mergifyio rebase

@cdecker force-pushed the 2021-08-26_backup_autocompact branch from 548dd09 to 6577bf0 on January 4, 2022 17:12

mergify bot commented Jan 4, 2022

rebase

✅ Branch has been successfully rebased

@SimonVrouwe

Rebased on master, not sure what the mergify rebase did.

Only added two commits, to address type safety and make it easier to review (I hope!).

Wouldn't it be better to make the response synchronous

I looked into that, but it probably requires yet another thread (to keep handling db_write's) and modifications to SocketBackend also.

Keeping the request pending even for long times is practically free and we wouldn't have to poll.

Is there maybe an example of this with the Plugin class? I don't fully understand how it
works; can the Plugin's io loop return a pending request later?

            if not method.background:
                # Only if this is not an async (background) call do we need to
                # return the result, otherwise the callee will eventually need
                # to call request.set_result or request.set_exception to
                # return a result or raise an exception.
                request.set_result(result)

But the above suggestions are interesting to explore; however, let's first see how this performs in comparison to other backup methods. Any idea how broadly backup.py is used?


@cdecker commented Jan 7, 2022

Rebased on master, not sure what the mergify rebase did.

No worries, mergify just took your changes and rebased them on the current master branch. This is so we don't create merge commits on merge, but rather just fast-forward the changes on the branch.

Only added two commits, to address type safety and make it easier to review (I hope!).

Definitely good to review. Personally I go with fixups, but our review bot helps here: https://bot.bitcoinstats.com/lightningd/plugins/pull/296/range_diff/548dd09a2b00589a8921f3b9b134d49ddc5b870b..df1c96ade1709e31d371a5be4b4579293f656c65

Wouldn't it be better to make the response synchronous

I looked into that, but it probably requires yet another thread (to keep handling db_write's) and modifications to SocketBackend also.

So you are already creating a new thread here:

https://github.com/lightningd/plugins/pull/296/files#diff-c03113fbe9914787cc6f747d5d0ba792c352b16c6cf7dba7f85d40e179acd090R153

Making the response asynchronous would just mean that we pass the request object to compact(). Then compact() proceeds to do its magic, and when it's done it'll call set_result() or set_exception() to report success or failure. The @plugin.async_method decorator basically means "don't use the return value of the annotated function to return the result to the caller, instead defer that to the request object being used". So when the command handler (the annotated function) returns, pyln will not return the result; instead it'll wait until it gets notified via the request handle, freeing the control flow from the simple call-response that @plugin.method has.

Keeping the request pending even for long times is practically free and we wouldn't have to poll.

Is there maybe an example of this with the Plugin class? I don't fully understand how it works; can the Plugin's io loop return a pending request later?

            if not method.background:
                # Only if this is not an async (background) call do we need to
                # return the result, otherwise the callee will eventually need
                # to call request.set_result or request.set_exception to
                # return a result or raise an exception.
                request.set_result(result)

But the above suggestions are interesting to explore; however, let's first see how this performs in comparison to other backup methods. Any idea how broadly backup.py is used?

One example could be the recvmsg method of the noise plugin:

plugins/noise/noise.py, lines 152 to 160 in 6f2b8fb:

@plugin.async_method('recvmsg')
def recvmsg(plugin, request, last_id=None, **kwargs):
    next_id = int(last_id) + 1 if last_id is not None else len(plugin.messages)
    if next_id < len(plugin.messages):
        res = plugin.messages[int(last_id)].to_dict()
        res['total_messages'] = len(plugin.messages)
        request.set_result(res)
    else:
        plugin.receive_waiters.append(request)

It shows both the pseudo-sync flow and the async flow: if we already have a message ready to return when we get called, we set the result immediately, effectively emulating a @plugin.method since we don't defer resolution, but we also don't use the return value of the method as the result:

request.set_result(res)

If on the other hand we don't yet have a message to return, we need to defer returning the result, and free the main thread so we can get hook calls, notifications etc., which are used to fill the message queue. We do this by simply remembering that there is a pending request in a list of subscribers.

plugin.receive_waiters.append(request)

When we later get an htlc_accepted hook call that tells us about an incoming message, we iterate through the waiting requests and return the result:

plugins/noise/noise.py, lines 205 to 212 in 6f2b8fb:

print("Delivering message to {c} waiters".format(
    c=len(plugin.receive_waiters)
))
for r in plugin.receive_waiters:
    m = msg.to_dict()
    m['total_messages'] = len(plugin.messages)
    r.set_result(m)
plugin.receive_waiters = []

This is with just a single thread, because we either return immediately, or we stash the requests away, waiting for a hook call. In your case we'd simply pass the request to the thread doing the compaction and notify the caller when we're done :-)

Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread? Maybe a wrapper function that takes care of driving the compaction thread and handles the request might be a good idea (testing the compaction code is likely easier when it doesn't spawn threads internally).

@SimonVrouwe

Thanks a lot for the explanation.

Making the response asynchronous would just mean that we pass the request object to compact()

In your case we'd simply pass the request to the thread doing the compaction and notify the caller when we're done :-)

For a local FileBackend the above seems possible. But SocketBackend is fully synchronous and also cannot pass a request object to the remote server (where its FileBackend is located).

def compact(self):
    self._send_packet(PacketType.COMPACT, b'')
    (typ, payload) = self._recv_packet()
    assert(typ == PacketType.COMPACT_RES)
    return json.loads(payload.decode())

Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread?

Yes, just lock the backup file when writing data, or when reading metadata.

But all of the above requires SocketBackend's add_change() and compact() methods to be asynchronous. Maybe it can be done with an async module or such, but to me it doesn't seem worth the effort and added complexity.

report size in bytes and number of changes before and after, which should be more informative

Agreed, but who needs this info? As a user (myself) I just want my backup not to fill up my USB or SSD drive, which got to 7GB after a few weeks running clboss. Maybe the backup-compact method can be removed entirely and replaced with an auto-compact option with a size parameter, so that FileBackend starts a compacting process when the backup passes a certain size.
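
A rough sketch of that auto-compact idea, just to make it concrete; the threshold, the maybe_autocompact helper and the backend attributes are all hypothetical, not existing plugin code. It could be called after each add_change:

import os
import threading

AUTO_COMPACT_BYTES = 5 * 1024 * 1024 * 1024   # hypothetical 5 GB threshold

def maybe_autocompact(backend, backup_path):
    """Kick off a background compaction if the backup grew past the threshold."""
    if os.path.getsize(backup_path) < AUTO_COMPACT_BYTES:
        return
    if getattr(backend, "compacting", False):
        return                                 # a compaction is already running
    backend.compacting = True
    threading.Thread(target=backend.compact, daemon=True).start()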


@cdecker commented Jan 9, 2022

In your case we'd simply pass the request to the thread doing the compaction and notify the caller when we're done :-)

For a local FileBackend the above seems possible. But SocketBackend is fully synchronous and also cannot pass a request object to the remote server (where its FileBackend is located).

Good point: since the backend itself controls storage medium access, we need to have it also drive the async compaction; I hadn't thought about that. So the SocketBackend either doesn't support async compaction, or it has to be adapted in order to do so. Still, the client side would just be a thread holding the incoming request open and waiting for the server to signal that it finished the compaction.

Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread?

Yes, just lock the backup file when writing data, or when reading metadata.

But all of the above requires SocketBackend's add_change() and compact() methods to be asynchronous. Maybe it can be done with an async module or such, but to me it doesn't seem worth the effort and added complexity.

That makes the compaction synchronous again, i.e., the db_write hook will get locked out from the file until the compaction is done. I was thinking of having something like this:

class AsyncBackend(Backend):
  def __init__(self, inner: Backend):
    self.inner = inner  # The actual backend we're using to store the data
    self.lock = threading.Lock()  # Allows the compaction to grab control of the file while doing the atomic swap

  def add_change(self, change):
    with self.lock:
      self.inner.add_change(change)

  def compact(self, request):
    def do_compact():
      self.inner.compact()
      request.set_result("All done")
    threading.Thread(target=do_compact, daemon=True).start()

Though maybe the compaction must be split into two steps: prepare_compact(), which can run in parallel while accepting db_writes, and a second, much shorter step compact() that needs exclusive control over the storage medium, i.e., the atomic file swap in the FileBackend case.
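
A minimal sketch of what that split might look like, extending the AsyncBackend idea above; TwoStepBackend, prepare_compact and finish_compact are illustrative names and assumptions, not existing plugin code:

import threading

class TwoStepBackend:
    """Illustrative only: compaction split into a long lock-free phase and a short locked swap."""

    def __init__(self, inner):
        self.inner = inner                         # the real backend doing the storage
        self.lock = threading.Lock()

    def add_change(self, change):
        with self.lock:                            # db_write path: only a short append
            self.inner.add_change(change)

    def compact(self, request):
        def run():
            clone = self.inner.prepare_compact()   # long-running, no lock held,
                                                   # so db_writes keep flowing
            with self.lock:                        # short critical section:
                self.inner.finish_compact(clone)   # catch-up + atomic file swap
            request.set_result("compaction done")
        threading.Thread(target=run, daemon=True).start()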

Anyway, I like discussing these more architectural things, but happy to merge as is to unblock others waiting on this ^^

report size in bytes and number of changes before and after, which should be more informative

Agreed, but who needs this info? As a user (myself) I just want my backup not to fill up my USB or SSD drive, which got to 7GB after a few weeks running clboss. Maybe the backup-compact method can be removed entirely and replaced with an auto-compact option with a size parameter, so that FileBackend starts a compacting process when the backup passes a certain size.

Yeah, I just thought it'd be nice for the user to see progress (there's the custom notification support that can report progress to the CLI), or to indicate that the compaction is still pending. I get very suspicious if my laptop burns through its battery even though the command I'd blame for it returned in just a second. Just cosmetics at this point, really :-)

@SimonVrouwe

Any chance we could keep the compact() method agnostic of being run in a thread or on the main thread?

Yes, just lock the backup file when writing data, or when reading metadata.

That makes the compaction synchronous again, i.e., the db_write hook will get locked out from the file until the compaction is done.

That is incorrect; the above comment should be read more precisely, looking at the code inside the with self.lock: context.

lock the backup file when writing data,

with self.lock:  # _compact_async thread has concurrent access
    with open(self.url.path, 'ab') as f:
        f.seek(self.offsets[0])
        f.write(length)
        f.write(version)
        f.write(typ)
        f.write(payload)
        self.prev_version, self.offsets[1] = self.version, self.offsets[0]
        self.version = entry.version
        self.offsets[0] += 9 + len(payload)
        self.version_count += 1
        self.write_metadata()

, or when reading metadata.

with self.lock:
    self.read_metadata()
    stop_ver = self.version
    stop_offset = self.offsets[1]  # keep for catch-up
    stats['before'] = self.stats()

But I forgot to mention the last part where it also locks:

  • when adding catch-up changes to the clone that were added (to the backup) while compacting
  • during atomic swap
    # Refresh view of latest state so our clone can catch-up, then atomically
    # move clone-->self, all while blocking on_db_write
    # FIXME: depending on the backlog, catch-up (blocking c-lightning) can still
    # take long, the critical section can probably be reduced.
    with self.lock:
        log = []
        # fast-forward to where we left cloning and add versions we missed while compacting
        for change in self.stream_changes(offset=stop_offset):

Though maybe the compaction must be split into two steps

I think it already does roughly what you suggest here :-)

happy to merge as is to unblock others waiting on this ^^

As you wish, I already run this on my own node. But I was a little worried about users running the current backup.py on v0.10.2 pre #4945 maybe getting "backup is out of date" errors, but I haven't seen these yet. The ones in issue 4994 seem caused by other reasons (power failure).

…n restore, cleanup

of some redundant init values in FileBackend, improve comments and error msg
…tBackend

Minor fixes:
- fix duplicate log entries, modify the existing root handler instead of adding one
- remove obsolete flush, i.e. leaving `with` context already ensures close + flush
- fix typos, improve comments

Async/concurrent `compact` method using threading, all contained in the FileBackend class.
Some wrapping and a minor API change (see README.md) were needed to keep it compatible with
SocketBackend (which uses FileBackend on its remote side).

A lock is used to make backup file access thread-safe. The main thread handles db_writes, which
modify metadata but otherwise only _append_ to the backup file. The compact thread only
reads, up to what it knows from the metadata.

Calculation of the stats diff `before` vs `after` is removed; it doesn't make sense with
moving parts.

An early version check is done to maybe subscribe to the "shutdown" notification (v0.10.2+) to
clean up a running compaction when we shut down, but older versions (v0.9.2+) still work.
A single Change can contain a snapshot and a transaction
@SimonVrouwe

Rebased on latest master (nothing else).
There always seems to be some problem with the testing system; this time a network error?

Collecting pip
  WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01e6d0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
  WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01e820>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
  WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01e970>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
  WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01eac0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
  WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01ec10>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl
ERROR: Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Max retries exceeded with url: /packages/83/b5/df8640236faa5a3cb80bfafd68e9fb4b22578208b8398c032ccff803f9e0/pip-22.0.2-py3-none-any.whl (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fea8c01ed60>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Error: Process completed with exit code 1.

@chrisguida

Needs rebase, CI should be working now

@chrisguida

@cdecker @SimonVrouwe are we still interested in merging this? definitely seems useful...

@chrisguida

Closing for now as it seems dead.

Feel free to reopen if work resumes!

@chrisguida closed this on Sep 27, 2024