GPT Query

Crawl content from a website and all its subpages and store it in a database. Then use GPT to create your own custom GPT and generate new content based on the crawled content.

screenshots:

Demo Link

https://chat.openai.com/g/g-RskOOlLFp-sumato-assistant

How to use

Prerequisites

Python 3.11

Setup

Clone this repo: git clone https://github.com/sonpython/GPTSiteCrawler
Create a virtual environment: python3 -m venv venv
Activate the virtual environment: source venv/bin/activate
Install dependencies: pip install -r requirements.txt
Run the crawler: python src/main.py https://example.com --selectors .main --annotate-size 2

Usage

> python src/main.py -h
usage: main.py [-h] [--output OUTPUT] [--stats STATS] [--selectors SELECTORS [SELECTORS ...]] [--max-links MAX_LINKS] [--annotate-size ANNOTATE_SIZE] url

Asynchronous web crawler

positional arguments:
  url                   The starting URL for the crawl

options:
  -h, --help            show this help message and exit
  --output OUTPUT       Output file for crawled data
  --stats STATS         Output file for crawl statistics
  --selectors SELECTORS [SELECTORS ...]
                        List of CSS selectors to extract text
  --max-links MAX_LINKS
                        Maximum number of visited links to allow
  --annotate-size ANNOTATE_SIZE
                        Chunk data.json to this file size in MB
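
For readers who want to script against the crawler, the help text above can be mirrored with argparse. This is a minimal sketch reconstructed from the help output only, not the actual contents of src/main.py; the function name build_parser and the int/float argument types are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options mirror the help text shown above."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Asynchronous web crawler"
    )
    parser.add_argument("url", help="The starting URL for the crawl")
    parser.add_argument("--output", help="Output file for crawled data")
    parser.add_argument("--stats", help="Output file for crawl statistics")
    parser.add_argument(
        "--selectors", nargs="+", help="List of CSS selectors to extract text"
    )
    # Assumption: the help text does not show types, so int/float are guesses.
    parser.add_argument(
        "--max-links", type=int, help="Maximum number of visited links to allow"
    )
    parser.add_argument(
        "--annotate-size", type=float, help="Chunk data.json to this file size in MB"
    )
    return parser


if __name__ == "__main__":
    # Same arguments as the example command in the Setup section.
    args = build_parser().parse_args(
        ["https://example.com", "--selectors", ".main", "--annotate-size", "2"]
    )
    print(args.url, args.selectors, args.annotate_size)
```

Because --selectors accepts one or more values, putting the URL first (as in the Setup example) keeps it from being read as a selector.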

Chart

I've created a chart to help you understand how the crawler works. It is a simplification, but it covers the basics. Run python src/chart.py in another terminal window to see a real-time chart that updates as the crawl progresses.

Docker

Environment variables

CRAWLER_URL=https://example.com 
CRAWLER_SELECTOR=.main 
CRAWLER_CHUNK_SIZE=2 # in MB
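
The variables above appear to correspond to the url, --selectors, and --annotate-size arguments of the CLI. How the container actually consumes them is not documented here, so the following is only a sketch of one plausible mapping; the helper name argv_from_env is hypothetical.

```python
import os


def argv_from_env(env=None):
    """Translate CRAWLER_* variables into main.py-style arguments.

    Assumption: CRAWLER_SELECTOR maps to --selectors and
    CRAWLER_CHUNK_SIZE maps to --annotate-size.
    """
    env = os.environ if env is None else env
    url = env.get("CRAWLER_URL")
    if not url:
        raise ValueError("CRAWLER_URL must be set")
    argv = [url]
    selector = env.get("CRAWLER_SELECTOR")
    if selector:
        # Passed through as a single selector; CSS selectors may contain spaces.
        argv += ["--selectors", selector]
    chunk_size = env.get("CRAWLER_CHUNK_SIZE")
    if chunk_size:
        argv += ["--annotate-size", chunk_size]
    return argv


if __name__ == "__main__":
    example = {
        "CRAWLER_URL": "https://example.com",
        "CRAWLER_SELECTOR": ".main",
        "CRAWLER_CHUNK_SIZE": "2",
    }
    print(argv_from_env(example))
```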

Build

docker build -t gpt-site-crawler .

Run

docker run -it --rm gpt-site-crawler

(The docs below are borrowed from @BuilderIO.)

Upload your data to OpenAI

The crawl will generate a file called output.json at the root of this project. Upload that to OpenAI to create your custom assistant or custom GPT.
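
If the generated file is too large to upload in one piece, the --annotate-size option chunks the output by size in MB. The crawler's own chunking logic is not shown in this README, so the sketch below only illustrates the general idea; the function name chunk_records and the record shape are hypothetical.

```python
import json


def chunk_records(records, max_mb):
    """Group records into lists whose JSON encoding stays under max_mb each.

    A single record larger than the limit still gets a chunk of its own.
    """
    limit = int(max_mb * 1024 * 1024)
    chunks, current, size = [], [], 2  # 2 bytes for the enclosing "[]"
    for record in records:
        encoded = len(json.dumps(record, ensure_ascii=False).encode("utf-8"))
        # +2 covers the ", " separator json.dumps writes between list items.
        if current and size + encoded + 2 > limit:
            chunks.append(current)
            current, size = [], 2
        current.append(record)
        size += encoded + 2
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Hypothetical records; the real crawl output may use different fields.
    pages = [{"url": f"https://example.com/{i}", "text": "x" * 500} for i in range(10)]
    for index, chunk in enumerate(chunk_records(pages, max_mb=0.002)):
        print(index, len(chunk))
```

Each chunk could then be written with json.dump to its own file and uploaded separately.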

Create a custom GPT

Use this option for UI access to your generated knowledge that you can easily share with others

Note: you may need a paid ChatGPT plan to create and use custom GPTs right now

Go to https://chat.openai.com/
Click your name in the bottom left corner
Choose "My GPTs" in the menu
Choose "Create a GPT"
Choose "Configure"
Under "Knowledge" choose "Upload a file" and upload the file you generated

Create a custom assistant

Use this option for API access to your generated knowledge that you can integrate into your product.

Go to https://platform.openai.com/assistants
Click "+ Create"
Choose "upload" and upload the file you generated
