Efficiently process large code repositories to enable effective code reviews with GPT. The tool streamlines the task by filtering relevant content and preparing it in manageable, tokenized chunks for GPT analysis.

GPT Code Review Tokenizer

GPT Code Review Tokenizer efficiently processes and tokenizes large code repositories for code review with GPT. It consists of two main components: load_repository.py and tokenizer.py. The first scans a specified repository, filters files against ignore patterns, and writes their contents to a single file. The second splits that output into smaller, manageable chunks sized for GPT code reviews.

Features

  • Repository Scanning: Traverses every file in a specified repository path.
  • Ignore Pattern Filtering: Optionally ignores files matching specified patterns.
  • Output Consolidation: Aggregates repository contents into a single file.
  • Tokenization: Splits the repository's contents into tokenized chunks.
  • Customizable Tokenization: Set the maximum number of tokens per chunk.
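The two-stage pipeline above can be sketched in a few lines of Python. This is an illustrative simplification, not the tool's actual code: the function names mirror the two scripts, and a whitespace "token" stands in for a real GPT tokenizer.

```python
# Illustrative sketch of the two-stage pipeline (not the tool's actual code).
# Stage 1: scan a repository and concatenate file contents, skipping files
# that match any ignore pattern.
# Stage 2: split the combined text into chunks of at most max_tokens
# whitespace-separated words (a rough stand-in for GPT tokenization).
import fnmatch
import os

def load_repository(repo_path, ignore_patterns=()):
    parts = []
    for root, _dirs, files in os.walk(repo_path):
        for name in files:
            if any(fnmatch.fnmatch(name, p) for p in ignore_patterns):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                parts.append(f"--- {path} ---\n{f.read()}")
    return "\n".join(parts)

def tokenize(text, max_tokens=1000):
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

Each chunk returned by tokenize can then be pasted into a GPT conversation one at a time, keeping every message under the model's context limit.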

Installation

  1. Clone the repository: git clone https://github.com/adithyan-ak/GPT-Code-Review-Tokenizer.git
  2. Navigate to the project directory: cd GPT-Code-Review-Tokenizer
  3. Ensure Python 3.x is installed.
  4. Install required packages: pip3 install -r requirements.txt

Usage

  1. Configure the config.yaml file according to your requirements.
    • Set the repo_path to the target repository.
    • Optionally set ignore_file_path to specify patterns to ignore. By default, .gptignore contains commonly ignored files.
    • Configure tokenizer_config for tokenization settings.
  2. Run python3 load_repository.py to process the repository and generate the output.txt file.
  3. Run python3 tokenizer.py to tokenize the processed output.
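A config.yaml following the steps above might look like the sketch below. The top-level keys (repo_path, ignore_file_path, tokenizer_config) come from the steps above; the nested field and values are illustrative assumptions, so check the shipped config.yaml for the exact schema.

```yaml
# Hypothetical config.yaml layout — top-level keys are from the Usage steps,
# nested fields are assumed; consult the repository's config.yaml for the
# authoritative schema.
repo_path: /path/to/target-repo
ignore_file_path: .gptignore
tokenizer_config:
  max_tokens: 4000   # assumed field name: maximum tokens per chunk
```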

Contributions

We welcome contributions! If you have a feature to add or find a bug, please submit a pull request or open an issue.

The following are some of the features we are working on and would love help with:

  • Configurable Output Formats: Allow users to choose different output formats (e.g., JSON, XML) for the tokenized data, catering to various use cases.
  • Error Handling and Logging: Improve error handling and add comprehensive logging to help users troubleshoot issues during repository processing and tokenization.
  • Testing and Quality Assurance: Develop a suite of tests (unit, integration, performance) to ensure code quality and facilitate safe refactoring.
  • Documentation Enhancements: Improve and expand the documentation, including detailed setup guides, use case examples, and FAQs.
  • Dockerization: Containerize the application using Docker for easy deployment and execution across different environments.

Authors

Acknowledgements
