Search #50
Comments
@sshivaditya2019 rfc |
This can be easily accomplished with embeddings and a capable LLM that has a substantial context. The main challenge will be to maintain a large vector database where comments and conversations are stored and readily available. Instead of simply dumping entire message chains, these would need to be selectively curated. |
What's a good time estimate? |
If the comments have to be cleaned and cherry-picked for good context retrieval, it should take around one week. |
/start |
@0x4007 Could you share the comments corpus or text? How will this work? |
One consideration: I just realized we have to resolve a design problem first. @Keyrxng is building a plugin where we can ask ChatGPT (with full linked issue conversation and pull context) any questions with the same syntax. Perhaps it makes the most sense to also look up similar conversations (embeddings) and append their text to the LLM context window. If there's a high match, append it; if there's no high-percentage match, don't. I think this should make the user experience seamless when asking questions. In any case, technically this should not be a new plugin. Look for the I can't transfer this issue to that repository easily from my phone; I will need to do it from my computer in a bit. |
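The "append only on a high match" rule described above could be sketched roughly as follows. This is a minimal illustration, not the plugin's actual logic: the function names, the `0.8` threshold, and the `limit` of three snippets are all assumptions chosen for the example, and the toy two-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_context(query_vec, stored, threshold=0.8, limit=3):
    """Return the texts whose embedding clears the (hypothetical) threshold.

    `stored` is a list of (text, embedding) pairs. Nothing is returned when
    no stored embedding is similar enough, so nothing gets appended.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    matches = [pair for pair in scored if pair[0] >= threshold]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in matches[:limit]]


stored = [
    ("LP tokens were moved for security reasons", [1.0, 0.0]),
    ("unrelated chatter", [0.0, 1.0]),
]
print(select_context([0.9, 0.1], stored))
print(select_context([1.0, 1.0], stored))
```

With a gate like this, a question that has no strong historical match simply goes to the LLM with its normal issue context and nothing extra.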
If that's the case then I think this is blocked by /gpt PR. But, this could be a separate plugin as it deals with large troves of textual data, and probably could be extended for chats over other platforms, not related to code like |
Well, perhaps we can have you take it over midweek if it's not done. They are supposed to be focusing on the Telegram bridge plugin as a top priority, and that also isn't done.
This is the philosophy we are taking with these plugins, especially the ones that focus on working with text. For example, my vision is to have our Then UbiquityOS can be context aware of every work input to the organization, which makes it more generally intelligent of everything happening. |
I think they are almost done with it, so I can probably focus on the textual content/corpus required for this task in the meantime.
Could you please share the links to the old issues and conversation threads? I can write a script to seed them into the database. |
Most of everything is within the @ubiquity organization (created in 2020), but we did break off our recent efforts into the new orgs @ubiquity-os and @ubiquity-os-marketplace. Or, if it's easier, please use the aggregated issues JSON in our directory. It contains, at least, all the URLs to all the issues that we are monitoring for tasks/proposals. It does not include all their conversation contexts, though. I suppose the script can extract those URLs and query the GitHub REST API for the conversations within each. |
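As a rough sketch of the URL extraction step suggested above, the snippet below turns issue page URLs into the GitHub REST API endpoint that lists an issue's comments (`GET /repos/{owner}/{repo}/issues/{issue_number}/comments`). The shape of the aggregated JSON is not given in this thread, so the `html_url` key and the `records` list are assumptions; the actual HTTP requests, authentication, and pagination are left out.

```python
import re

# Matches an issue page URL such as https://github.com/<owner>/<repo>/issues/<n>
ISSUE_URL = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/issues/(\d+)/?$")


def comments_endpoint(issue_url):
    """Map an issue page URL to its REST comments endpoint, or None if it is not an issue URL."""
    match = ISSUE_URL.match(issue_url.strip())
    if match is None:
        return None
    owner, repo, number = match.groups()
    return f"https://api.github.com/repos/{owner}/{repo}/issues/{number}/comments"


def endpoints_from_records(records, url_key="html_url"):
    """Collect unique comment endpoints from parsed JSON records.

    `url_key` is a hypothetical field name; adjust it to the real file.
    """
    seen = []
    for record in records:
        endpoint = comments_endpoint(record.get(url_key, ""))
        if endpoint is not None and endpoint not in seen:
            seen.append(endpoint)
    return seen


records = [
    {"html_url": "https://github.com/ubiquity/ubiquity-dollar/issues/939"},
    {"html_url": "https://github.com/ubiquity/ubiquity-dollar/issues/939"},
]
print(endpoints_from_records(records))
```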
Both are held back by review only. As I understand this task:
So this task really involves two rather simple parts:
|
You’re correct that we are backfilling the database with conversation threads from various organizations. However, we will be selectively choosing content that is relevant to a central theme, which will be extracted using n-gram frequency analysis and topic modeling techniques. This approach is necessary to avoid including random discussions that don’t pertain to the actual concepts or topics, thereby helping to prevent model hallucination.
Pretty much that, except we need to apply stemming and lemmatization to ensure the LLM captures the context. We should also apply re-ranking techniques, as I am almost certain there will be overlapping contexts, so a reranking method like BM25 would be required.
It would involve fetching the issues -> processing them -> dumping them into a SQL file -> and then executing the migrations.
And incorporate some form of re-ranker and context distillation techniques to improve overall response quality while reducing model costs. |
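The "process, then dump into a SQL file" step of that pipeline might look something like the sketch below. The table name `issue_comments` and its columns are hypothetical, since the real schema is not shown here, and the quote-doubling escape is only meant for generating a static seed file (parameterized queries would be the safer choice for anything run against live input).

```python
def sql_literal(value):
    """Render a Python value as a SQL literal, doubling single quotes in strings."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def to_insert_statements(rows, table="issue_comments", columns=("issue_url", "author", "body")):
    """Build one INSERT statement per row (table and column names are assumptions)."""
    column_list = ", ".join(columns)
    statements = []
    for row in rows:
        values = ", ".join(sql_literal(row.get(col)) for col in columns)
        statements.append(f"INSERT INTO {table} ({column_list}) VALUES ({values});")
    return "\n".join(statements)


rows = [
    {"issue_url": "https://github.com/a/b/issues/1", "author": "alice", "body": "It's done"},
]
print(to_insert_statements(rows))
```

The resulting text can be written to a `.sql` file and applied alongside the migrations.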
I think this would function very effectively as a separate plugin with a higher request or execution limit. It is likely to be used more frequently than the |
Much better UX to consolidate #50 (comment) |
Could you please explain further? I’m not able to understand what you mean by "better UX." |
Refer to that part of the spec. Their
Imagine asking a senior colleague any question on a pull request or an issue. They will have context on all the other historical issues/pulls, as well as general knowledge from other projects. My vision consolidates both into a single natural interface of tagging the colleague and asking your question.
|
This sounds like a whole new classification/approach to the current single-comment-body embeddings that are happening. Does this imply that the curated text (and the embeddings for that whole body of aggregated chat) is going to be an amalgamation of only the relevant text from across the issue minus any noise? So the embedding will paint a more well rounded picture of more info or will text be chunked as it is now on a per-comment basis?
I'd have thought that the task is the theme for any given body of text, then its parent theme would be the repository, with the parentmost theme being the org to which it belongs.
Why do we need to lobotomize and generalize the text when models are more than capable of comprehension without having to squash things into layman's terms? I fear that this would not be ideal given the highly nuanced and specialized topics that get discussed across tasks and PRs. I don't think we should do that.
What is it you are re-ranking? I do not understand that, sorry. Do you mean re-ranking the relevance of the set of embeddings you obtained on your first search, or obtaining a new set from the DB? I looked into BM25 and bag-of-words, and I understand the basics of the concepts and how they'd be beneficial here, actually. It's very different from the current embeddings approach, though. Is the idea ultimately to perform the search across the single-body embeddings as-is, or only on the amalgamated "blocks"/"themes"?
Currently ubiquity-os-marketplace/text-vector-embeddings#16 This PR consolidates everything into a single table which we can use to search across all embeddings at once or we can use both Will this new style of embedding you are using have its own sort of classification for use? Is that what you meant by themes? Last question I promise;
Are you manually processing all 700 tracked tasks (and however many associated PRs)? 👀👀 Or are you automating somehow?
Srsly last one. Are only old conversations going to receive this treatment, or are we going to apply the same treatment to our current/future tasks as well? |
The single-comment-body embeddings are the same; the only change is in how they are selected. It’s more appropriate to view this as a temporary step in the process. Rather than dumping everything into the database, we are carefully selecting conversations that are genuinely relevant, which is crucial for preventing model hallucinations.
A single issue or PR thread may have multiple themes that may not directly relate to the task at hand. For instance, a PR thread might include discussions about best practices that aren't relevant to the specific task or organization. Therefore, we need to adopt a nuanced approach to textual context rather than relying solely on code, as there is often significant implicit context and meaning involved.
In stemming and lemmatization, we reduce words to their root forms, which decreases the overall number of tokens without losing context. For instance, "running" becomes "run" through stemming, while lemmatization would convert both "running" and "ran" to "run." This process is particularly helpful for improving relevance scoring. I assume there may be conversations in other languages, but if that’s not the case, then this approach might not be very useful for English conversations.
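To make the distinction concrete, here is a deliberately tiny toy: a suffix-stripping "stemmer" and a dictionary-backed "lemmatizer". Neither is a real algorithm such as Porter stemming or a WordNet lemmatizer; the suffix list and the `IRREGULAR_LEMMAS` table are assumptions made up for this illustration only.

```python
# Hypothetical lookup of irregular forms; a real lemmatizer uses a full lexicon.
IRREGULAR_LEMMAS = {"ran": "run", "was": "be", "were": "be"}


def toy_stem(word):
    """Strip one common suffix, then undo a doubled final consonant ("runn" -> "run")."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiouls":
                w = w[:-1]
            break
    return w


def toy_lemmatize(word):
    """Use the irregular-form table first, then fall back to the toy stemmer."""
    w = word.lower()
    return IRREGULAR_LEMMAS.get(w, toy_stem(w))


for token in ("running", "ran", "comments"):
    print(token, toy_stem(token), toy_lemmatize(token))
```

The stemmer leaves "ran" untouched because it only manipulates suffixes, while the lemmatizer maps it to "run" through vocabulary knowledge, which is the difference described above.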
When comments are ranked based on cosine similarity, we can enhance retrieval performance by using rerankers. These rerankers simply reorder the results to help the LLM focus on the most important concepts. I’ve attached a reference[1] that highlights the significance of rerankers in RAG applications.
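A minimal sketch of that reranking step, using BM25 (mentioned earlier in the thread) as the second-stage scorer over candidates that a cosine-similarity search has already returned. The `k1` and `b` values are commonly used defaults, the tokenizer is a simplification, and the candidate texts are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n if n else 0.0
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores


def rerank(query, candidates):
    """Reorder first-stage candidates by BM25 score, highest first."""
    scores = bm25_scores(query, candidates)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]


candidates = [
    "We discussed best practices for reviews",
    "The LP tokens were moved to a new pool",
    "Tokens",
]
print(rerank("Why were LP tokens moved?", candidates))
```

Only the order of the candidates changes; the set retrieved by the vector search stays the same.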
Everything is exactly the same; the themes serve merely as a way to refine the data, acting as a new feature extraction stage. The themes will not be utilized or included in the database.
There won’t be any manual processing; it will be automated, but I will need to review at least some parts of it. This is a two-step process: first, we group the comment-issue-body into topics. While some manual work is necessary to ensure the tags are correct, this could potentially be automated using an LLM.
Only the old conversations will be used, although they can also be applied to newer tasks if needed. Essentially, this is a stable corpus that can serve as a source of truth, which is why "this treatment" is being implemented here. [1] https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/ |
So we're filtering out irrelevant comments and embedding only those related to the task spec. And would these be embedded as individual comments or combined as an overview based on the "theme", like a focused md doc for each task kind of?
We previously agreed that GitHub is primarily English-based. We’ve added translation support in the plugin-template, but that’s the only time we’ve addressed multilingual support afaik.
This seems more like a hammer when we really need a scalpel, but I’m curious to see how it works, especially given our large dataset.
I’m concerned about managing multiple AI providers. Using a provider that aggregates different AI models might simplify things, as we’d avoid juggling multiple API keys. We used to rely on GPT to summarize and rank everything. The latest models are more than capable of handling entire raw text task convos (multiple actually), PRs, and diffs, generating structured, consistent Markdown summaries. We could make a structured template for a MD doc for each task, improving RAG, and help us build a clean dataset for fine-tuning/training our own model, which should be our end goal ultimately.
I’m unclear here but will wait for the implementation since you have a clear direction.
If this creates a clear, focused overview of tasks, we should apply it to all tasks, not just old ones, as it seems more efficient than embedding every comment like we do now. We should do something like this https://chatgpt.com/share/66fbe1e7-9240-8000-aa8d-f2b68f9ca142 in my opinion and treat each task like a document in the knowledgebase of UbiquityOS, the same as big companies do with their in-house AI models like SalesForce, TelCom, etc.
|
I just reviewed the updated database schema and noticed we could add a nullable topic column. This would allow us to create a topic for queries, match it, and use the results to generate context for the LLM. However, with this approach, we’re primarily filtering out irrelevant comments and embeddings. Once we retrieve data from the database, we could summarize it, but I don’t think that’s necessary.
This would be very useful for n-gram models or topic models like LDA. Several papers have noted improvements in LLM retrieval performance as well [1]. However, this is optional and would depend on the specific context and the information regarding its performance.
We could utilize something like
We can use GPT, which processes context strictly in the order it's provided, without applying any intelligent reasoning. As a result, the responses will rely heavily on the sequence of information retrieved through cosine similarity-based vector search. [1] Optimizing LLM Queries in Relational Workloads |
afaik queries are not saved to the DB, only our embeddings, which would mean we'd need to determine a topic for every embedding we create to be able to match it, right? This is in addition to classifying it like
I feel like we have no current baseline for our own comparison as we do not even have a simple RAG chatbot yet. Also afaik n-gram and topic models need quality embeddings/source docs to begin with, they work with documents/bodies of text but our embedding system right now is so granular that our embedding content is likely the literal strings and we are going to have 10s/100s of thousands of these embeddings in a very short time. What I thought you were doing is improving the embeddings being stored so they are of higher quality to work with.
I'd be in favour of this as we'll be drowning in AI API keys in no time if we intend to continue to introduce lots of other models.
I believe that these modern models bring their own intelligent reasoning and that's why they are so much better than only a couple of years ago. My suggestion was that we use GPT to create a stable corpus of literal MD documents that cover the entire contents of a task succinctly but effectively, ranging back to 2020, and continue to do so when a task is completed. This way we are creating a very structured and uniform dataset (good for RAG, good for fine-tuning, good for training) which is not some kind of blackbox (as these models all are) and which we can actually review, read, and manually edit if necessary (highly doubtful). So we structure a template, feed in the entire task and all of its contents, and have GPT do this for one repo or 30-40 tasks. They can be very easily QA'd, prompt refined, and chatbot tested, and then we do it for every tracked task (via batch API, as that'll cost a lot using the best models). The DB type is 9/10 RAG chatbots all operate on documentized datasets, so rather than try to reinvent the wheel for our V1 chatbot, let's do what is normally done and documentize our org via tasks to create a chronological timeline of evolution for our org as the foundation of our chatbot's knowledge base. Won't these n-gram models and topic models perform most optimally with a better collection of base embeddings to begin with? Additionally, with these complete task doc summaries, we could begin to fine-tune a model right away and even offer it as a service to partners. https://community.openai.com/t/scaling-rag-chatbot-system-to-millions-of-documents/615386/2 https://chatgpt.com/share/66fd61e4-b41c-8000-bde7-ecbdbff51b2e - 1 question 1 answer convo
We intend on embedding our entire codebases so when we need to handle technical queries about code we will pull from there and chances are we will never ask about a specific comment but more likely the "why" or "how" which would pertain to a task/codebase/repo etc, contextually a document covering the evolution of the task would contain the "why" and "how". |
Our aim should be improving the effectiveness of our embeddings so we can reduce the number of embeddings we store, because that makes them easier to work with, with less overhead and improved responses, at least for our use case. Take an average of 15 comments per task (which understates it; including PRs and reviews it's more like 50-100): 700 x 15 = 10,500 embeddings. That's wild, and contextually they do not include which repo or task they belong to, etc. They are just whatever the comment says, out of context of the task, repo and org. 1 doc per task = 700 embeddings. All contain the why, what, where, who and how. We could even chunk each doc into 4, which would improve our granularity and still keep us under 3k embeddings while capturing all the context that we actually need. rfc all |
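The arithmetic behind that comparison, written out as a small helper. The figures (700 tasks, 15 comments per task, 4 chunks per document) are the ones quoted above; the function and key names are just for illustration.

```python
def embedding_counts(tasks, comments_per_task, chunks_per_doc):
    """Compare how many embeddings each storage strategy would produce."""
    return {
        "per_comment": tasks * comments_per_task,  # one embedding per comment
        "per_task_doc": tasks,                     # one summary document per task
        "per_doc_chunk": tasks * chunks_per_doc,   # each document split into chunks
    }


counts = embedding_counts(tasks=700, comments_per_task=15, chunks_per_doc=4)
print(counts)
```

At the quoted figures this gives 10,500 per-comment embeddings versus 700 per-task documents, or 2,800 when each document is split into four chunks.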
I've compiled a comprehensive list of all tracked issues and their corresponding PRs, totaling nearly 1,752 comments. Here are the issues I've identified:
I believe we can remove the empty issues. I'm currently checking if specific issues are mentioned in the PRs to determine whether to include them in the context. This may lead to overlaps if a single PR is linked to more than one issue. The next step would be to remove overlapping issues and add the further processing part. Should pull requests that don’t reference a specific issue using the format "Resolves #<Issue_number>" or "#<link_to_issue>" be kept or removed? |
In recent times, we've been pretty good about this so just take whatever uses the proper keywords and only include whatever is linked to a single issue because that is also a rule that we enforce. Toss out anything that isn't perfect. I think we have a lot of good sample data. |
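One way the "proper keywords, linked to exactly one issue" filter could be approximated is shown below. It looks for GitHub's closing keywords (close, fix, resolve and their variants) followed by either `#<number>` or a full issue URL in the pull request body. The function names and the decision to treat the body text alone as the source of truth are assumptions of this sketch.

```python
import re

# Closing keyword followed by "#123" or a full issue URL.
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\b:?\s+"
    r"(#\d+|https://github\.com/[^/\s]+/[^/\s]+/issues/\d+)",
    re.IGNORECASE,
)


def linked_issues(pr_body):
    """Return the set of issue references that the PR body closes via a keyword."""
    return set(CLOSING_REF.findall(pr_body or ""))


def keep_pull_request(pr_body):
    """Keep a PR only when it is linked to exactly one issue."""
    return len(linked_issues(pr_body)) == 1


print(keep_pull_request("Resolves #50"))
print(keep_pull_request("Fixes #1 and closes #2"))
print(keep_pull_request("Related to #50"))
```

Pull requests that mention an issue without a closing keyword, or that close several issues at once, are dropped, which matches the "toss out anything that isn't perfect" guidance.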
@0x4007, I just saw the updated issue spec. Are we building on the command-ask plugin, or making a new one? Is the /search feature part of command-ask, or is it its own plugin? |
Whatever is easier. The idea is that you can get started on making the logic and we can worry about consolidating the user interface later |
Note This output has been truncated due to the comment length limit.
|
| View | Contribution | Count | Reward |
|---|---|---|---|
| Issue | Task | 1 | 800 |
| Issue | Comment | 12 | 0 |
| Review | Comment | 25 | 0 |
[ 105.066 WXDAI ]
@0x4007
Contributions Overview
| View | Contribution | Count | Reward |
|---|---|---|---|
| Issue | Specification | 1 | 55.41 |
| Issue | Comment | 9 | 21.216 |
| Review | Comment | 38 | 28.44 |
@sshivaditya2019 can you paste the plugin configs under the completed issues so we can install and test? |
! Failed to run comment evaluation. Relevance / Comment length mismatch! |
Description:
We have two goals that are closely aligned:
Embedding-Based Search for Prior Conversations:
We want a plugin that enables us to naturally ask questions related to previous conversations on GitHub using embeddings. This should ideally work across multiple GitHub organizations and repositories. For example, if someone asks, “What was the original reason for moving the LP tokens?”, the system should be able to search through all conversations and provide relevant information.
This feature should include org-wide default search (with options to extend the search to multiple organizations as arguments).
Example Context:
Next Step:
Temporary Slash Command for Context-Aware Search:
As a stepping stone to the above, we propose a dedicated slash command /search to help contributors quickly search through existing threads and add value. The logic of this command would mimic that of the natural language embedding search, and eventually, it will merge into the @UbiquityOS question-based syntax.
Temporary Fix:
Future Vision:
References
Originally posted by @rndquu in ubiquity/ubiquity-dollar#939 (comment)
Originally posted by @gentlementlegen in ubiquity-os-marketplace/text-conversation-rewards#132 (comment)