Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor db #16

Closed

Conversation

Keyrxng
Copy link
Member

@Keyrxng Keyrxng commented Sep 18, 2024

Resolves #13
Resolves #15

  • combined into a single table for embedding storage
  • combine adapters into one single embeddings entity
  • Added mentlegen's proxy with my types to get rid of lots the need for lots of typeguards for each webhook callback
  • no longer storing body > title, so to make up for that I simSearch using the title and body separately
  • refactored tests to use the mswjs DB and handlers like other plugins tend to do (mocking modules kills me for some reason, but the db and handlers I understand 😂)

QA:

ignore the row id 8 with the "::", I added body > title logic and removed it immediately.

I have another pull with the setup_instructions logic included, left out as it was not in scope for this PR.

image

Copy link
Member

@gentlementlegen gentlementlegen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with the code changes themselves, but I am not sure what to test here. @Keyrxng could you give me a use case?

@Keyrxng
Copy link
Member Author

Keyrxng commented Sep 21, 2024

I am fine with the code changes themselves, but I am not sure what to test here. @Keyrxng could you give me a use case?

I'm unsure what you mean a use case? Like to practically work with the embeddings in unison with an LLM?

If I was testing this PR I'd ensure that the comments and specs are being stored properly. You could create embeddings for a spec or a readme then create a query embedding and see far fruitful it is with that alone. Although chatbot/embedding effectiveness I'd think is sort of out of scope for this. The intention of the PR was to make it easier to work with the embeddings from more than one angle.

The class system should make it easier for us to refine our embedding scope before we compare, that way we improve it's accuracy because we can remove entire "sections" of data and have it only consider what we tell it to. The infra isn't in place yet of course, but the class system for e.g we install a chatbot on ubq.fi, easy to do and good value for money, it should be scoped to high level org stuff and guard railed.

Moreover, the class system can be used as a weighted tier system for prioritizing certain sections over others under certain conditions.

#18 introduces logic that'll gather readmes which would give us something that we could effectively test in terms of having a well-rounded sort of chatbot with actionable milestones for confirming it effectiveness.

I'm not sure I answered your question or not please let me know

@gentlementlegen
Copy link
Member

@Keyrxng thank you for the clarification. To ask simply, I wondered what I should be looking for when testing, the comments that contain HTML should be stripped from their tags and stored as raw text in the db is that correct?

@0x4007
Copy link
Member

0x4007 commented Sep 21, 2024

I think we should store with full source code and then stripped just text and compare the results.

But eventually we should settle on one way and I'm pretty sure it's gonna end up being with the full markup

src/adapters/utils/markdown-to-plaintext.ts Outdated Show resolved Hide resolved
src/handlers/create-comment-embedding.ts Outdated Show resolved Hide resolved
@Keyrxng
Copy link
Member Author

Keyrxng commented Sep 21, 2024

I think we should store with full source code and then stripped just text

I agree, I think we ought to refer to others in the space and see how they are doing it.

Data preparation
Chunking the application sources into smaller parts is a non-trivial task. In general, functions, class methods, structs, enums, and all the other language-specific constructs are good candidates for chunks. They are big enough to contain some meaningful information, but small enough to be processed by embedding models with a limited context window. You can also use docstrings, comments, and other metadata can be used to enrich the chunks with additional information.

Parsing the codebase
While our example uses Rust, you can use our approach with any other language. You can parse code with a Language Server Protocol (LSP) compatible tool. You can use an LSP to build a graph of the codebase, and then extract chunks. We did our work with the rust-analyzer. We exported the parsed codebase into the LSIF format, a standard for code intelligence data. Next, we used the LSIF data to navigate the codebase and extract the chunks. For details, see our code search demo.

Everyone does something similar but different although they all chunk their codebase, split it into semantically effective sections. So we need to determine how we are going to prep our codebase(s) for chunking, if we'll do it all manually or involve some other recommended tooling or whatever.

I've mentioned this before but we need to effectively parse multiple different programming languages. We have our top 5 that we use within our org so we should aim to cover at least those with whatever implementation we go with, others should not be considered a priority imo unless whatever tools we use easily support multiple languages.

@Keyrxng
Copy link
Member Author

Keyrxng commented Sep 30, 2024

image


I can't help but notice the last two features did not come with tests included despite them being quite pricey in terms of reward. I thought that we were enforcing that new features extend test suites to at least cover the basic added functionality?

I'd appreciate being included in code review for AI plugins/features going forward if that is alright with the team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle HTML input Store issue if not stored already
3 participants