Replace MirrorBrain by MirrorCache or Mirrorbits #239

Open
kelson42 opened this issue Jan 29, 2023 · 20 comments

@kelson42
Contributor

MirrorBrain is deprecated and there is a replacement: http://www.mirrorcache.org/. We should probably migrate our architecture.

@rgaudin
Member

rgaudin commented Mar 21, 2023

This is becoming more important: one of our mirrors (https://mirror.accum.se/mirror/kiwix.org/) hosts our files on multiple servers (it is itself a mirror frontend) and relies on redirections, which mirrorbrain does not support.

I don't know whether mirrorcache supports that, but I know mb is not worth it.

In the meantime, I've duplicated the mirror entry so we point independently to the two offloaders I've seen files in. This wastes a lot of requests in the scanMirror step but at least we can use the mirror…

@rgaudin
Member

rgaudin commented Apr 27, 2023

See https://github.com/etix/mirrorbits as well

@kelson42
Contributor Author

kelson42 commented Oct 1, 2023

I tested MirrorCache a long time ago and, without remembering the details, it was really too short on features.

Mirrorbits seems more mature and probably deserves a try.

Here are the features we like or rely on:

  • Metalink (Metalink headers)
  • Bittorrent files
  • Magnet links
  • Mirror mgmt via ftp/http/rsync
  • Prioritization of mirrors
  • Auto choice of mirrors based on client geo location
  • Multiple hashes of files
  • Easy update of mirrors file database (at file/directory level)
  • Support of very large files >100GB
  • IPv6 support
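
To illustrate the "multiple hashes" and "very large files" items above, here is a minimal sketch of computing several digests in a single pass over a file. It is not taken from MirrorBrain, MirrorCache or Mirrorbits; the function name `multi_hash`, the default algorithms and the chunk size are assumptions made for the example.

```python
import hashlib


def multi_hash(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for the file at `path`, reading it once."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read fixed-size chunks so memory use does not depend on file size.
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Because the file is streamed chunk by chunk, the same code should work for files well over 100GB; only the run time grows with the file size.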

@benoit74 @rgaudin Do you see other features which are important to us?

Now what needs to be decided is when and how we will proceed to move forward with this POC with Mirrorbits.

@kelson42 changed the title from "Replace MirrorBrain by MirrorCache" to "Replace MirrorBrain by MirrorCache or Mirrorbits" on Oct 1, 2023
@kelson42 pinned this issue on Oct 1, 2023

@benoit74
Collaborator

benoit74 commented Oct 2, 2023

I have very little experience on this part of the stack.

One thing we currently struggle with is scanning mirrors to refresh the status of individual assets. This process has to run one mirror at a time; it cannot run in parallel (at least, we failed to make it do so). The new solution must be able to run this scan in parallel, otherwise it does not scale: as the number of mirrors grows, so does the time to scan all of them, and our refresh period is getting longer and longer.

Currently the refresh period is getting pretty high, more than 2 hours at least: https://kiwixorg.grafana.net/d/bb0f0990-04c5-4314-8afc-6185ac49c668/mirrorbrain?orgId=1&from=1695625815425&to=1696230615425
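
For illustration only, the kind of parallel scan described above could be sketched as follows, assuming each mirror can be scanned independently. This is not how MirrorBrain, MirrorCache or Mirrorbits implement it; `scan_all` and the `scan_mirror` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_all(mirrors, scan_mirror, max_workers=8):
    """Scan every mirror concurrently; return {mirror: result or exception}."""
    unique_mirrors = list(dict.fromkeys(mirrors))  # drop duplicates, keep order
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per distinct mirror, so a mirror is never scanned twice at once.
        futures = {pool.submit(scan_mirror, mirror): mirror for mirror in unique_mirrors}
        for future in as_completed(futures):
            mirror = futures[future]
            try:
                results[mirror] = future.result()
            except Exception as exc:  # a failing mirror must not stop the others
                results[mirror] = exc
    return results
```

With enough workers, the total scan time would tend towards that of the slowest mirror rather than the sum of all scans, provided the database layer tolerates concurrent updates.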

@rgaudin
Member

rgaudin commented Oct 2, 2023

We've decided that @benoit74 will assess mirrorbits in regard to our needs. What we want to know is:

  • whether mirrorbits is suited for the job
  • which mb features (that we are relying on) are missing from it:
    • if we could bring them back
    • how (what kind of effort)

The goal is to be able to assess whether it's doable with our resources.

@benoit74
Collaborator

benoit74 commented Oct 3, 2023

This is my comparison chart so far.

❌ Not Supported, bad thing
✅ Supported
❓ Unknown (meaning probably not)

Feature | MirrorCache | MirrorBits
Metalink (Metalink headers) ❓ JSON file mentioned, but not compatible with aria obviously
Bittorrent files
Magnet links
Mirror mgmt via ftp/http/rsync | HTTP only ❓ (no access to file) | FTP and RSYNC only
Prioritization of mirrors
Auto choice of mirrors based on client geo location | Geo only | Geo + AS number + custom rules
Multiple hashes of files
Easy update of mirrors file database (at file/directory level) | ✅ Mirrorcache has been designed to fix Mirrorbrain issues around parallel scans and scans taking ages to update the DB |
Support of very large files >100GB | ✅ Probably | ✅ Probably
IPv6 support ✅
Documentation | ❌ Too limited | ❌ Too limited
Programming Language | Perl | Go
Database | PostgreSQL | Redis (with persistence)
Project liveness | Project updated regularly; multiple PRs closed on a regular basis, including in the last days / weeks | No update for at least one year, no code change since 2020, many very simple pending PRs without responses, still based on Golang 1.13 (Sept 2019)
Developers | One main dev (Andrii Nikitin), working at openSUSE (project supporter); another person helped a bit in the past | One single dev (etix), based in Paris, former Videolan Ops + developer, no more activity on Github / personal blog / twitter
Usage | openSUSE only? | Many websites mentioned, including some which have stopped using it

I'm really not convinced by those two solutions. I would probably prefer to stay with MirrorBrain for now until we find a better solution.

If we are forced to choose one now, I will try MirrorCache for:

  • its new design, which solves some of our scan issues
  • its liveness and hence ability for us to fix things / submit patches (if anyone is willing to contribute in Perl; I am not)
  • its PostgreSQL DB: I'm still absolutely not convinced by NoSQL databases, and using one probably means we would have to learn how to install it properly, back it up, maintain it, ...

The effort to adopt MirrorCache, given all the missing features, is however probably significant (1 month?). I have too little experience with Bittorrent / Magnet links to say anything very pertinent on that point. And since it is written in Perl, we would probably need to hire an external developer to do the work.

@rgaudin
Member

rgaudin commented Oct 3, 2023

Thank you ; very useful 👍

In this case, we're probably better off keeping mirrorbrain until we're forced out. The main concern is obviously security. Our data is not completely safe, as we mount the downloads folder read-write (rw) so that the update-mb-db job can write the mirrors.html file. We should find a way around that.

More concerning would be the possibility of altering mirrorbrain's response to inject redirections to our users.

Should we close this ticket for now?

@rgaudin
Member

rgaudin commented Oct 3, 2023

Couple notes:

  • mirrorbits does have hashes. See the JSON payload
  • mirrorbits supports parallel scans (only one scan per mirror at a time, obviously). Both rsync and FTP are efficient: rsync works off the list of files returned by rsync (it uses the rsync binary) and FTP recursively runs CWD and ls in all folders.
  • Is this project still maintained? etix/mirrorbits#138
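
As a rough sketch of the "works off the list of files returned by rsync" idea above, the snippet below turns a listing in the style printed by `rsync --list-only` (permissions, size, date, time, path) into a path-to-size mapping. It is not mirrorbits' actual code; the function name `parse_rsync_listing` is hypothetical and the exact column layout is an assumption that can vary between rsync versions.

```python
def parse_rsync_listing(output):
    """Map path -> size in bytes for regular files in an rsync-style listing."""
    files = {}
    for line in output.splitlines():
        parts = line.split(None, 4)  # permissions, size, date, time, path
        if len(parts) != 5 or not parts[0].startswith("-"):
            continue  # skip directories, symlinks, blank or unexpected lines
        size = parts[1].replace(",", "")  # some versions print 1,234,567
        if not size.isdigit():
            continue
        files[parts[4]] = int(size)
    return files
```

A scanner could then compare this mapping with the reference file list to decide which files a mirror actually holds, without issuing one request per file.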

@benoit74
Collaborator

benoit74 commented Oct 3, 2023

I had not noticed that last issue, which shows that jbkempf is keeping mirrorbits alive; that is indeed quite important information.
And your other points are important as well.
I'm really puzzled by all this information.

@kelson42
Contributor Author

kelson42 commented Oct 3, 2023

We should gather the problems/challenges we have with mb to be able to complete the comparison.

@benoit74
Collaborator

benoit74 commented Oct 5, 2023

❌ Not Supported, bad thing
✅ Supported
❓ Unknown (meaning probably not)

Feature | MirrorCache | MirrorBits | MirrorBrain
HTML list of mirrors
Metalink (Metalink headers) ❓ JSON file mentioned, but not compatible with aria obviously
Bittorrent files ✅ (but only torrent creation, not announced to tracker to validate torrent file - working only thanks to our "custom" tracker)
Magnet links ❌ (supported but buggy)
Mirror mgmt via ftp/http/rsync | HTTP only ❓ (no access to file) | FTP and RSYNC only | FTP, RSYNC and HTTP
Prioritization of mirrors
Auto choice of mirrors based on client geo location | Geo only | Geo + AS number + custom rules | Geo + AS number
Multiple hashes of files ✅ (found in JSON file)
Easy update of mirrors file database (at file/directory level) | ✅ Mirrorcache has been designed to fix Mirrorbrain issues around parallel scans and scans taking ages to update the DB | ✅ mirrorbits supports parallel scans (only one scan per mirror at a time, obviously). Both rsync and FTP are efficient: rsync works off the list of files returned by rsync (it uses the rsync binary) and FTP recursively runs CWD and ls in all folders. | ❌ (no parallel scan, lock issue)
Support of very large files >100GB | ✅ Probably | ✅ Probably |
IPv6 support ✅
Documentation | ❌ Too limited | ❌ Too limited |
Programming Language | Perl | Go | Python (admin/management) + C (runtime HTTP)
Database | PostgreSQL | Redis (with persistence) | PostgreSQL
Project liveness | Project updated regularly; multiple PRs closed on a regular basis, including in the last days / weeks | No update for at least one year, no code change since 2020, many very simple pending PRs without responses, still based on Golang 1.13 (Sept 2019), but some oversight by jbkempf (VLC) + some potential contributions from the Jenkins team | Dead
Developers | One main dev (Andrii Nikitin), working at openSUSE (project supporter); another person helped a bit in the past | One single dev (etix), based in Paris, former Videolan Ops + developer, no more activity on Github / personal blog / twitter | No more
Usage | openSUSE only? | Many websites mentioned, including some which have stopped using it | ?

@benoit74
Collaborator

benoit74 commented Oct 5, 2023

Just updated with a MirrorBrain column + fixes to the Mirrorbits details + a new line regarding the HTML home page

@kelson42
Contributor Author

@benoit74 Thank you very much for this analysis. Looking at the results, it tends to confirm my first opinion that the easiest path would be to continue (by fixing a few details) with Mirrorbrain (at least for the moment).
@rgaudin What is your analysis and proposal?

@rgaudin
Member

rgaudin commented Oct 14, 2023

As discussed with @benoit74, my opinion is to continue with MB until we're forced out. At that point, should the environment be the same, I support patching mirrorbits to add metalink support and hashes on the same paths (both very easy). As for BT, it's relatively easy as well, but whether it would be integrated upstream is another question.

@kelson42
Contributor Author

OK, then I guess this ticket is implemented (at least for the short term); we will need to fork Mirrorbrain to fix the most urgent stuff.

@rgaudin
Member

rgaudin commented Oct 14, 2023

I think we can just patch a couple of things in our image without adding the burden of a fork. This guy's patch is a single line in a Perl script.

@kelson42 unpinned this issue on Oct 14, 2023

@lemeurherve

lemeurherve commented Oct 26, 2023

@benoit74 FWIW and while etix/mirrorbits#138 is in progress, we're using mirrorbits on Jenkins Infrastructure, with our own docker image and helm chart, that might interest you:

@benoit74
Collaborator

Thanks a lot @lemeurherve for the pointers

@rgaudin
Member

rgaudin commented Oct 7, 2024

@kelson42 please take another look

@rgaudin reopened this on Oct 7, 2024

@benoit74
Collaborator

benoit74 commented Oct 8, 2024

And have a look especially at etix/mirrorbits#138 and etix/mirrorbits#179, which show that maintenance of mirrorbits is "getting better"
