Support for WasmEdge models? #1
I had tasked @Coregod360 with translating the Python ipfs_huggingface model manager into Node.js, starting from a very rough GPT-4 translation that I made quickly. However, I ended up changing the model manager to integrate an OrbitDB database, and he didn't feel like rewriting the code twice, so he went off to work on one of his front-end ideas. endomorphosis/ipfs_model_manager_js@74823fb

The idea is that the OrbitDB integration will store the results from the Hugging Face model scraper, and that I can reuse it later when I encapsulate the Hugging Face transformers.js library, which I believe is a GPU-accelerated ONNX pipeline like the one you have described.

The intention is then to have each of the peers on this libp2p subnet (or, in the case of Fireproof DB, Cloudflare's PartyKit) advertise which models they have in their local cache and offer to execute jobs for others when they are not busy, somewhat like BitTorrent. https://github.com/endomorphosis/ipfs_model_manager_js

Moreover, when a user uses the Hugging Face Agents library, the tool list is defined by call functions pointing to API endpoints, but ultimately I want that list populated with the models that people in the peer-to-peer swarm have listed as cached locally and are willing to run inference on. Peers do that by using libp2p pubsub to subscribe to a topic on which they broadcast their requests, and agents listening on that topic can use dynamic batching to maximize their seed/leech ratio.

OrbitDB is one of the CRDT databases I mentioned to you; it uses libp2p/IPFS for peer discovery and block storage, whereas Fireproof DB relies more on S3/Cloudflare infrastructure, because they already run the physical nodes, rather than a BYOB P2P infrastructure. Both projects run in Node.js and vanilla browser JavaScript runtimes, and both benefit from the "local-first" database paradigm.

The reason I have chosen libp2p/IPFS is that I want a decentralized-first approach with no vendor lock-in. I was focusing on people who want to use a local-first large language model as a "tool-using agent" to execute multiple chains of ML model inference, and on how to arbitrarily compose those sets of actions based on the pool of data about what is available to run on the Hugging Face API itself. Maybe you already have ten models downloaded, or maybe you want to call a friend (in agentic terms) to help the agent answer the question.
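To make the advertisement step concrete, here is a minimal sketch (Python, standard library only) of the kind of message a peer might broadcast on a pubsub topic to announce its cached models. The topic name, field names, and publish call are illustrative assumptions, not the actual ipfs_model_manager wire format:

```python
import json
from dataclasses import dataclass, asdict
from typing import List

# Hypothetical topic on which peers advertise their local model cache.
ADVERTISE_TOPIC = "ipfs-model-manager/advertise"

@dataclass
class ModelAdvertisement:
    peer_id: str              # libp2p peer id of the advertising node
    cached_models: List[str]  # e.g. Hugging Face repo ids held in the local cache
    accepting_jobs: bool      # node is idle and willing to run inference for others

ad = ModelAdvertisement(
    peer_id="12D3KooWExamplePeerId",  # placeholder peer id
    cached_models=["google/gemma-2b", "microsoft/phi-2"],
    accepting_jobs=True,
)
payload = json.dumps(asdict(ad)).encode("utf-8")
# pubsub.publish(ADVERTISE_TOPIC, payload)  # actual publish call depends on the libp2p binding in use
```

Agents that subscribe to the same topic can then match incoming job requests against the models they advertised.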
ok, i really like this direction! and from what limited understanding i have of ipfs it sounds like a wise choice. is there a design doc or diagram to help understand the architecture? as long as it is compatible with:
then i think we are good to go. this is very vague for me, so any "explain it like i am five" summaries would help. i've also been going off of initial first passes from gpt4, but have found @anthropics' first-pass code to be much better (and the context limit longer than chatgpt-4o's). i use this to help speed it up: https://gist.github.com/jaanli/5def01b7bd674efd6d9008cf1125986d in case it helps, alongside lots of use of https://github.com/p0deje/Maccy or similar software. not sure how best to organize this given the ambitious scope (not a bad thing! really cool work, and necessary for the bottom billion / low- and middle-income country health systems that do not use english, for which training data will not be available. more here). so perhaps a diagram using https://app.diagrams.net/?src=about or mermaidjs (or figma, happy to send you a link to one) is a good start?
okay one concrete goal for us:
does that make sense as a concrete use case? trying to read between the lines here, have not had time to dig into background. if there is a blog post or something framing this work it would help us @onefact understand :)
I will try to work on getting better documentation out there, but I am normally focused on building features, and I didn't anticipate anyone other than our trio of devs contributing to the codebase.

Hugging Face transformers.js uses the ONNX Runtime, which does support WebGPU, and it runs both in the browser and in Node.js clients. Whatever UXL does is more about operating-system internals than about the user-space code that calls it, for example the PyTorch library. WasmEdge claims that it compiles Node.js / npm functions to Wasm, so that should also be fine, but I don't know how large the compiled binaries end up being.

I see that you listed 100 MB as the maximum size allowed by GitHub / WasmEdge; I don't actually understand whether the model size you intended to use is constrained by the infrastructure you are using, by the limits of edge-device inference, or both. I will note that 100 MB is the maximum size of the CAR archive standard for IPFS, and files larger than this are normally chunked into 100 MB pieces; likewise, people also split their models into parts, although the maximum file size for Hugging Face git repositories is 50 GB (using git LFS).

The intended use case of the ipfs_agent module is to have a local LLM (larger than 100 MB), for example the Gemma / Phi models, generate a tool plan and make calls to other "tools" that are advertised as available over the P2P network. The intention of the ipfs_model_manager, by contrast, is to source the AI models / datasets themselves from the fastest source (IPFS, S3, Hugging Face, local) and to help host those models / datasets on the IPFS network. So I don't think you will hit a functional limitation from the infrastructure side, but perhaps from the client hardware side, which can also be alleviated if your client software can make a libp2p request for the inference from other peers, provided there is a peer on the network that trusts the client enough (because of reputation, or because its identity has been added to a list of trusted peers).

With regards to every hospital needing its own model, that can be achieved either by creating LoRA layers (which can often be around 100 MB), or alternatively by having every hospital download one of these models, train it on whatever private data is sitting in their S3 bucket, and then publish their model weights to Hugging Face / IPFS.
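As a rough illustration of the LoRA option, here is a minimal sketch using the Hugging Face peft library; the base model, target modules, and hyperparameters are placeholders, not recommendations:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Load a small base model (placeholder; a hospital would pick whatever fits its hardware).
base = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# LoRA adds small trainable adapter matrices instead of fine-tuning all weights,
# so the artifact each hospital publishes can stay in the ~100 MB range.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-specific
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# ... train on the hospital's private data, then publish only the adapter:
model.save_pretrained("my-hospital-lora-adapter")
```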
I updated this readme.md to reflect the latest pushes that I've made.

Monorepo:

Other repositories touched:
https://github.com/endomorphosis/ipfs_model_manager
https://github.com/endomorphosis/ipfs_model_manager_js
https://github.com/endomorphosis/orbitdb_kit
https://github.com/endomorphosis/ipfs_kit

New repositories created:
@endomorphosis - still not fully grokking it, but will leave thoughts here. It’s a lot of info so appreciate your patience! Quick thoughts:
Analogies help! My 2 cents right now (just woke up) is to use data classes in Python for most of it. Fairseq is good for this, as is Mosaic's code for their libraries like MosaicBERT, as is torchtune. Happy to elaborate if this isn't clear, but I would prioritize this a lot before doing any UX research.
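For example, a minimal sketch of what that might look like for a model-manager entry (field names are purely illustrative, not taken from any of the repositories above):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical illustration of the suggestion above: describe a model's
# serving requirements as a plain dataclass rather than loose dicts.
@dataclass
class ModelEntry:
    repo_id: str                       # e.g. a Hugging Face repo id
    formats: List[str] = field(default_factory=lambda: ["safetensors"])
    min_vram_gb: float = 0.0
    sources: List[str] = field(default_factory=lambda: ["local", "ipfs", "s3", "huggingface"])

entry = ModelEntry(repo_id="google/gemma-2b", min_vram_gb=6.0)
```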
The code that I have written here consists of extensions to the Hugging Face libraries, including transformers, transformers.js, datasets, and faiss.

One repository, called the "model manager", maintains the list of models and their metadata for MLOps (hardware requirements, API parameters, and the locations of the files needed to run inference). The places the model manager can get files from (e.g. code, models, etc.) are: a) local, b) Hugging Face, c) S3, d) IPFS, and it will try to download the README.md to find which source is the fastest to load the model from, by overloading the Hugging Face transformers methods with updated parameters.

There is another module (ipfs_datasets) that downloads Hugging Face datasets in a similar fashion, and another that loads the datasets into an OrbitDB database so that k-nearest-neighbor search can be performed (ipfs_faiss).

The intent is that each computer with the Hugging Face model/dataset manager can network with other computers using libp2p; the computers can then offer to help other peers execute any model that is in their local filesystem cache. Once this is done, a user agent can collect all the information about the model lists and the services offered by the other peers to generate a list of "tools" that are available to the Hugging Face Agents library (ipfs_agents). A local LLM takes that tool list, put into its system prompt (each tool is actually just connected to a Python call function to some remote API service), and generates a plan for how it wants to use those tools to accomplish the task the user asked it to do. The user's agent then executes the plan by making libp2p calls to other peers that have already downloaded the models the specific tool needs, and it awaits the results before returning them to the user.
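For the k-nearest-neighbor piece, the plain faiss side of that looks roughly like this (a minimal sketch with random vectors standing in for dataset embeddings; the OrbitDB-backed loading in ipfs_faiss is not shown):

```python
import numpy as np
import faiss

dim = 384                                                    # embedding dimensionality (placeholder)
embeddings = np.random.rand(10_000, dim).astype("float32")   # stand-in for dataset embeddings

index = faiss.IndexFlatL2(dim)   # exact L2 nearest-neighbor index
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)   # 5 nearest neighbors of the query vector
print(ids[0], distances[0])
```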
I do have some type checking for calls that go between models. In one respect, the Hugging Face Agents library defines the interfaces between the agent and the tool involved; in addition, I have a more traditional MLOps infrastructure, which is not open source, where I define each of the skills (e.g. llama_cpp, diffusion, etc.) in terms of an input/output schema, including type definitions.
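Something along these lines, as a hypothetical illustration of one such skill schema (the field names here are illustrative, not the ones in the closed-source infrastructure):

```python
from typing import List, TypedDict

# Hypothetical input/output schema for a llama_cpp text-generation "skill".
class LlamaCppRequest(TypedDict):
    prompt: str
    max_tokens: int
    temperature: float
    stop: List[str]

class LlamaCppResponse(TypedDict):
    text: str
    tokens_generated: int
    model: str   # which cached model actually served the request
```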
The Hugging Face library is supposed to abstract most of this away. There are instances where a brand-new model architecture has just dropped, but normally a Hugging Face repository includes a config.json that gives the Hugging Face AutoModel library the information it needs to choose which other library to call, e.g. whether to invoke a diffusion model or BERT, and what configuration parameters to pass to those libraries, which are themselves responsible for calling PyTorch, llama_cpp, ONNX, etc. For example: https://huggingface.co/jondurbin/airoboros-70b-3.3/blob/main/config.json
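The dispatch that config.json drives can be seen from the transformers side like this (a minimal sketch; the printed values are what one would expect for a Llama-based repository, not verified output):

```python
from transformers import AutoConfig

# Reads only config.json; no weights are downloaded.
config = AutoConfig.from_pretrained("jondurbin/airoboros-70b-3.3")

print(config.model_type)      # e.g. "llama" -> AutoModel routes to the Llama classes
print(config.architectures)   # e.g. ["LlamaForCausalLM"]

# AutoModelForCausalLM.from_pretrained(...) would use the same config to pick the
# concrete implementation and pass it the remaining hyperparameters.
```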
https://github.com/endomorphosis/ipfs_transformers/blob/main/config/config%20template.toml
I haven't yet built a Docker container around this. So far, this specific package, ipfs_transformers, works more like a Python library that you call: you change a line or two in your existing script, and it will just choose to download the model from the fastest available source (local, IPFS, S3, Hugging Face).

I have not finished turning the model manager into a service, including the agentic functions of being able to ask other peers to do computation. After that is done, I was intending to build Docker containers and a package that will deploy peers along with a pre-chosen identity; that Docker container will then always trust every other peer with the same identity and improve its reputation amongst the swarm of peers, for when an agent bearing that identity needs some bursty ML inference traffic.

I had some continuous integration hooked up to a more traditional MLOps package that I built, but I have not yet included a continuous-integration pipeline for this. I did begin, but I decided I wanted a better idea of the scope of each package, and something with stable interfaces to test against.
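As a sketch of the "change a line or two" idea (hypothetical: the exact ipfs_transformers entry points and class names may differ from what is shown here):

```python
# Before: the stock Hugging Face import.
# from transformers import AutoModelForCausalLM

# After: the wrapper import; call sites stay the same. The wrapper is assumed to
# probe the local cache, IPFS, S3, and the Hugging Face Hub, and to pull the
# weights from whichever source responds fastest.
from ipfs_transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
```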
incredibly helpful @endomorphosis tysm!! will review. i agree on stable interfaces being preferable, especially those that are backed by large amounts of capital (like huggingface, or the linux foundation-backed pytorch, over jax, etc). one first step could be documenting the interface of methods you plan to support? e.g. a friend wrote https://arxiv.org/abs/2404.02258 recently, but this is likely hard to support. so clearly delineating what architectures, nonlinearities, and modes (fully-sharded data parallel, quantization, etc) are or will be supported could be helpful. not sure what use cases you have in mind; we have been focused on health care, as it is 20% of the GDP in the US, and hospitals elsewhere have little to no access to AI that is easy to run in-browser.
I would if there were some immediate benefit over writing features to get all the code working, i.e. if there were others contributing to the code, or the code were nearing a polished state. The scope of the project is to build decentralized compute infrastructure, something similar to BOINC (which provided compute for SETI@home), but instead for Hugging Face agents / libraries, to provide failover for the non-decentralized infrastructure that I've already been building, and to use that to power 3D animated avatar chatbots. I hope I can bring the latency from the chains of inference down further by moving from a hub-and-spoke networking model to a peer-to-peer model.
This is incredibly helpful!! I am finally starting to get it lol, thank you for bearing with me (slow on the uptake, busy week with travel to Sweden and back). For the 3D animated avatar chatbots, what are the requirements? Both in terms of the avatars (resolution: measured in voxels? derived from photogrammetry, an STL file, or an app like poly.cam?) and in terms of the chatbots (tokens/second latency, perplexity on a web-scale corpus, etc)... For writing new features, I have been seeding with this: https://gist.github.com/jaanli/5def01b7bd674efd6d9008cf1125986d and mainly rely on Zed for editing (it has Claude support), and claude.ai otherwise. Are there open issues in the repositories you linked to that are high priority? I think it is indeed too early for me to think of writing assessment/continuous integration code (like GitHub Actions but decentralized), but it really, really helps to start understanding what it would take to scale this to hospitals in low- and middle-income countries (that may or may not have access to cloud environments).
https://youtu.be/TDitkDKbqbk This is the demo of the avatar. The issue is that the traditional MLOps infrastructure was hub-and-spoke, with the master server as the intermediary to the workers, and in addition the latency is a result of running many models in serial, e.g. speech-to-text -> RAG -> LLM inference -> text-to-speech -> speech-to-animation. Right now the skeleton is a VRM model, and the rendering is done in three.js entirely on the client side.
in today's news
Related: jupyterlab/jupyter-ai#822 (comment)
I recently started refactoring so that I can break up all the monorepos into individual GitHub repositories, which will let you better decide which modules make sense for your needs, and today I'm going to start thoroughly checking the integration of all of the new modules in a new branch: https://github.com/endomorphosis/ipfs_transformers/tree/dependency-fix

Once I have finished bug-testing all of the behavior of these branches, I will push those repositories to main, and this will import the following modules, each of which can be run separately:

'orbitdb_kit@git+https://github.com/endomorphosis/orbitdb_kit.git',

Moreover, @Coregod360 has been working on a photo album program, and he has been bug-checking code that was ported over from Python to Node.js: https://github.com/endomorphosis/ipfs_kit_js/commits/main/ipfs_kit_js
I have refactored these libraries as promised and am now working on the Node.js parts of the libraries. In addition, I have been collaborating with LAION on a project to create a better voice assistant.
That's amazing @endomorphosis! Here is what we have been focused on to try to help your efforts with the database technology needed: We have also started potentially collaborating with usb.club (@whosnorman specifically), who need what you are building too. We chatted about it briefly, and I'm happy to write a more formal design doc or hackmd.io file. Would a hackmd.io file help to understand how these many moving components fit together?
Do you have Discord? I would like to get you in touch with the LAION AI group. I think that you and Christoph would really get along, and I think that the neural 3D avatar I built some months ago would be great for doctors as well.
I have ported over https://github.com/endomorphosis/ipfs_kit_js and it passed initial unit tests. The next module that I will be porting over is https://github.com/endomorphosis/ipfs_model_manager_js/, then subsequently https://github.com/endomorphosis/ipfs_transformers_js
This will be very fun to get working with WebAssembly :) If we can help @onefact please let us know!!
If you are feeling enthusiastic, I have added you to the repository for client-side JavaScript. I intended to follow the motif that I had previously established, but instead of command-line interfaces to manage system files and services, I was going to rely on these libraries to manage them as objects and REST APIs: https://github.com/nftstorage/ipfs-cluster
Thank you @endomorphosis! Have you seen this library yet? https://github.com/Econify/generative-ts Curious to hear your take on what compatibility would take, if you think that might be feasible; may be able to get some resources to make it interoperate with IPFS and your work :) just let me know!
It looks like that is a library that makes API calls to external APIs in TypeScript. You can run these in Node.js, along with the rest of the Node.js code that I am writing. But there is no need for a "model manager" and for storing files on IPFS if you are just running API services. Rather, what I'm aiming to create is a peer-to-peer swarm inference system for the edge, based on libp2p/IPFS, in lieu of using cloud service providers that would charge money.
@endomorphosis what help do you need? What headcount or FTE across the stacks you operate in? Let me know. I can add it to the deck: https://www.figma.com/slides/a7mcpAmv2qTomxt8NK2WfQ
Yeah, sorry that I have been slacking on the commits here lately. I had a meeting with Dave Grantham at libp2p about two days ago, and otherwise my time has been spent managing college students for the LAION.AI + Intel Center for Excellence, to get their cloud infrastructure to a place where meaningful work is getting done and the college students stop burning $30k/month in cloud bills while half of the servers are bricked (I'm not joking). We are working on processing 2 petabytes of YouTube videos so they can process the datasets and train Llama 3.1 8B to take in audio tokens and produce audio tokens (like GPT-4o), and otherwise I am working on processing Wikipedia and the case law by reimplementing the back end of Microsoft GraphRAG with OPEA.

Last night I ran a job quantizing Llama 3.1 8B / 70B to FP8, and tonight I will try to quantize it with one box but use parameter offloading (let's hope it works), because I was unable to get a second Gaudi2 box to spread the weights while it runs the quantization script. If that works, I will then be able to use speculative decoding with Llama 3.1 405B and Llama 8B on Gaudi for Microsoft GraphRAG.

I will take time this next week to work on this project; thanks for reminding me that I need to get back to this.
@endomorphosis i won't have time to review, but someone @onefact might. Can you please share a UML or markdown diagram over hackmd (or diagrams.net, or figma.com - I like figma best)? That would help us understand how we can transition from our @google TPU grant to IPFS fastest. Happy to help; should I make a draft for the talk slides?
I had two partners (a sr dev and a jr dev); the sr dev got cancer and the jr dev got a new job with edgerunner.AI (an edge compute project that will probably incorporate some of this code eventually), and I started working with LAION so I could get the 3D avatar stuff fully polished with low latency. Part of that low-latency strategy is to reduce the number of models (mixing audio tokens into the Llama training), and another part of it was this new MLOps stack I've been working on. It would be nice to have another two engineers working on the MLOps stack, so I can continue to manage what is going on at LAION and get their cloud infrastructure to a point where the data keeps moving through the processing / training pipeline.
Yeah, let me try to do this over the weekend, in addition to the Llama 405B model quantization.
Got it, happy to help! Re: quantization, I recommend using @leycec's
I have finished my tasks for OPEA at the Linux Foundation. I talked to @Mwni and told him what I have in mind; I don't know how much work he is going to be able to do on this, but he claims he feels "1% better every day". I am going to take a nap, do some laundry, and probably work on the UML diagrams we discussed from the hours of 10pm-4am PST.
omg. 😞
omg. 🤗 Let me know if @beartype can do anything to support everybody's exciting MLops IPFS use cases that I understand absolutely nothing about. All of this sounds like a revolutionary miracle in AI model delivery. I support this miracle.

Our upcoming @beartype 0.19.0 release supports a crude form of type-hinting "AI" I colloquially dub BeartypeAI™. Tongue-in-cheek, of course. It's all just caveman-style ad-hoc heuristics and undocumented algorithms I made up a few weeks ago. Basically, @beartype can now write your tensor type hints for you if you'd rather programmatically hand that off to an API, e.g.:

```python
# Define the greatest NumPy array that has ever existed.
>>> from numpy import asarray
>>> best_array_is_best = asarray((1, 0, 3, 5, 2, 6, 4, 9, 2, 3, 8, 4, 1, 3, 7, 7, 5, 0,))

# Create a type hint validating that array, @beartype! Look. Just do it.
>>> from beartype.door import infer_hint
>>> infer_hint(best_array_is_best)
typing.Annotated[numpy.ndarray[typing.Any, numpy.dtype[dtype('int64')]], IsAttr['ndim', IsEqual[1]]]  # <-- wtf, @beartype
```

And... that's the type hint. Nobody's writing that sort of gruelling bracket hell on their own. Not even @leycec. Now, just let somebody else do all the suffering for you. That somebody is @beartype. Who knew? 🤔
Here is half of the framework that I have already implemented (though some things have yet to be ported and refactored in Node.js / client-side JS). I have an idea of how to implement the libp2p / HF Agents / HF Accelerate parts, whereby other peers are discovered, they report what models they have and their identity, and compute can be offloaded to the network using some sort of trust-based system, but I have not yet started to design and test the implementation.
@Mwni decided to start on this repository for the libp2p-based task inference. It's like 2 days old, so expect it to be a WIP. I'll do another UML diagram tomorrow for the k-nearest-neighbors module; I have a very good idea of how it needs to be implemented. It's not that complicated, but I haven't gotten around to it yet, beyond some simple prototyping of some of the components.
https://discord.gg/hugging-face-879548962464493619 The Hugging Face Discord, transformers.js channel.
https://github.com/endomorphosis/ipfs_kit_js/tree/main/tests This module is fully tested, with continuous integration testing done; ipfs_model_manager_js is next.
at @onefact we have been using wasm, but this won't work for the encoder-only or encoder-decoder models i've built (e.g. http://arxiv.org/abs/1904.05342). that's because the wasm vm is cpu-only (it has simd but no gpu access). breaking out of the memory sandbox means we can get near-native performance with libraries like:
https://github.com/LlamaEdge/LlamaEdge
https://github.com/GaiaNet-AI/gaianet-node
any plans to support these? i have no background here so sorry if this is already supported.
s3 + cloudfront is amazing infrastructure for us: we published 4000+ hospitals' price sheets in this manner and only pay a few dollars a month to maintain it - https://data.payless.health/#hospital_price_transparency/
we will be doing the same for a leaderboard for LLMs applied to health care use cases.