Efficiently process large code repositories to enable effective code reviews with GPT. The tool streamlines the task by filtering relevant content and preparing it in manageable, tokenized chunks for GPT analysis.

GPT Code Review Tokenizer

GPT Code Review Tokenizer efficiently processes and tokenizes large code repositories for code review with GPT. It consists of two main components: load_repository.py and tokenizer.py. The first scans a specified repository, filters files against ignore patterns, and writes their contents to a single file. The second splits that output into smaller, manageable chunks sized for GPT code reviews.

Features

  • Repository Scanning: Traverses every file in a specified repository path.
  • Ignore Pattern Filtering: Optionally ignores files matching specified patterns.
  • Output Consolidation: Aggregates repository contents into a single file.
  • Tokenization: Splits the repository's contents into tokenized chunks.
  • Customizable Tokenization: Set the maximum number of tokens per chunk.
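The two-stage pipeline above can be sketched in a few lines of Python. This is an illustrative simplification, not the tool's actual code: the function names mirror the two scripts, and a whitespace "token" stands in for a real GPT tokenizer.

```python
# Illustrative sketch of the two-stage pipeline (not the tool's actual code).
# Stage 1: scan a repository and concatenate file contents, skipping files
# that match any ignore pattern.
# Stage 2: split the combined text into chunks of at most max_tokens
# whitespace-separated words (a rough stand-in for GPT tokenization).
import fnmatch
import os

def load_repository(repo_path, ignore_patterns=()):
    parts = []
    for root, _dirs, files in os.walk(repo_path):
        for name in files:
            if any(fnmatch.fnmatch(name, p) for p in ignore_patterns):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                parts.append(f"--- {path} ---\n{f.read()}")
    return "\n".join(parts)

def tokenize(text, max_tokens=1000):
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

Each chunk returned by tokenize can then be pasted into a GPT conversation one at a time, keeping every message under the model's context limit.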

Installation

  1. Clone the repository: git clone https://github.com/adithyan-ak/GPT-Code-Review-Tokenizer.git
  2. Navigate to the project directory: cd GPT-Code-Review-Tokenizer
  3. Ensure Python 3.x is installed.
  4. Install required packages: pip3 install -r requirements.txt

Usage

  1. Configure the config.yaml file according to your requirements.
    • Set the repo_path to the target repository.
    • Optionally set ignore_file_path to specify patterns to ignore. By default, .gptignore contains commonly ignored files.
    • Configure tokenizer_config for tokenization settings.
  2. Run python3 load_repository.py to process the repository and generate the output.txt file.
  3. Run python3 tokenizer.py to tokenize the processed output.
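A config.yaml following the steps above might look like the sketch below. The top-level keys (repo_path, ignore_file_path, tokenizer_config) come from the steps above; the nested field and values are illustrative assumptions, so check the shipped config.yaml for the exact schema.

```yaml
# Hypothetical config.yaml layout — top-level keys are from the Usage steps,
# nested fields are assumed; consult the repository's config.yaml for the
# authoritative schema.
repo_path: /path/to/target-repo
ignore_file_path: .gptignore
tokenizer_config:
  max_tokens: 4000   # assumed field name: maximum tokens per chunk
```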

Contributions

We welcome contributions! If you have a feature to add or find a bug, please submit a pull request or open an issue.

The following are some of the features we are working on and would love help with:

  • Configurable Output Formats: Allow users to choose different output formats (e.g., JSON, XML) for the tokenized data, catering to various use cases.
  • Error Handling and Logging: Improve error handling and add comprehensive logging to help users troubleshoot issues during repository processing and tokenization.
  • Testing and Quality Assurance: Develop a suite of tests (unit, integration, performance) to ensure code quality and facilitate safe refactoring.
  • Documentation Enhancements: Improve and expand the documentation, including detailed setup guides, use case examples, and FAQs.
  • Dockerization: Containerize the application using Docker for easy deployment and execution across different environments.

Authors

Acknowledgements
