DocsScraper sub repo #5
base: main
Conversation
Great start! It's very clear. I just had a couple of stylistic comments - let me know what you think.
It would also be good to add some minimal tests in a test folder. It would be good to cover at least the main pathways of the core functions.
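A minimal sketch of what such a test could look like; the `parse_url` call and the expected return shape are assumptions taken from the usage described later in this PR:

```julia
# Minimal test sketch -- `parse_url` and its return type are assumed from the
# PR description; adjust to the actual API.
using Test
using DocsScraper

@testset "parse_url: core pathway" begin
    parsed = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")
    @test parsed isa Vector                       # expected: a Vector of Dicts
    @test !isempty(parsed)
    @test all(block -> block isa Dict, parsed)
end
```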
@splendidbug How are you thinking about the scope of this PR before we merge it? I think we can have it open for a while, but I'd like to see the following prior to merge:
It doesn't have to cover all edge cases, but it would be good to cover all these steps, because it will force us to design a good interface/API from the start (and not change it soon). Eg, in separate PRs we should consider checking anti-scraping measures, eg, robots.txt, and make sure we adhere to it. But that's extra and not the core scope of this PR. WDYT?
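As a very rough sketch of the kind of robots.txt check that could come in one of those later PRs -- nothing here is part of this PR, the helper name and logic are hypothetical, and a real implementation would need fuller rule handling:

```julia
# Naive robots.txt check: fetch the file and test the URL path against the
# Disallow rules for the matching user agent. Illustrative only.
using HTTP, URIs

function allowed_by_robots(url::AbstractString; user_agent::AbstractString = "*")
    uri = URI(url)
    robots_url = "$(uri.scheme)://$(uri.host)/robots.txt"
    resp = try
        HTTP.get(robots_url; status_exception = false)
    catch
        return true  # robots.txt unreachable: assume allowed
    end
    resp.status == 200 || return true
    disallowed = String[]
    active = false
    for raw in split(String(resp.body), '\n')
        line = strip(first(split(raw, '#')))  # drop comments and whitespace
        isempty(line) && continue
        lower = lowercase(line)
        if startswith(lower, "user-agent:")
            agent = strip(line[sizeof("user-agent:")+1:end])
            active = (agent == "*" || agent == user_agent)
        elseif active && startswith(lower, "disallow:")
            rule = strip(line[sizeof("disallow:")+1:end])
            isempty(rule) || push!(disallowed, String(rule))
        end
    end
    return !any(rule -> startswith(uri.path, rule), disallowed)
end
```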
Sounds good, and I agree that keeping the PR open to add other functionalities is a good idea. Also, when we start implementing the crawler, we'll have to take care of memory overflow, right?
Yes, we will always need to have some filter. The value of having it user-provided when required is that you will be able to capture multidoc sites, ie, sites that cover multiple packages (so you would want to filter against a list of domains). An example is sciml.ai, I think.
Or did you have a different concern?
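As a rough illustration of the kind of user-provided domain filter meant here (the function name and signature are hypothetical, not part of this PR):

```julia
# Hypothetical domain filter for the crawler: keep only URLs whose host falls
# under one of the allowed domains (useful for multi-package sites like sciml.ai).
using URIs

function filter_urls(urls::Vector{String}, allowed_domains::Vector{String})
    return filter(urls) do url
        host = URI(url).host
        any(d -> host == d || endswith(host, "." * d), allowed_domains)
    end
end

filter_urls(["https://docs.sciml.ai/Overview/stable/", "https://example.com/page"],
            ["sciml.ai"])  # keeps only the sciml.ai URL
```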
Separately, could you please go through the review and click resolve on the feedback you tackled already? Also, there were some suggestions - in the future, you can just accept those and it will make the changes for you :)
That was my concern. Thanks!
Good stuff! Some minor changes requested to add more modular functions and to simplify the logic
Btw as mentioned on Slack, an easy way to accumulate strings is to pass around an `IOBuffer`.
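A minimal sketch of that pattern, assuming an `IOBuffer` accumulator (the fragments below are just illustrative):

```julia
# Accumulate text fragments into an IOBuffer instead of concatenating Strings.
io = IOBuffer()
for fragment in ("Heading", " -- ", "some body text")
    print(io, fragment)          # append without creating intermediate strings
end
accumulated = String(take!(io))  # "Heading -- some body text"
```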
As discussed, I haven't looked at the parser yet. One small observation: DocsScraper should be its own module and only import/export things -- no code definitions in there.

```julia
module DocsScraper

using AbstractTrees
using Gumbo
using HTTP
using URIs
export x,y,z
include("parser.jl")
export x,y,z
include("...")
end
```
On a separate note, I took it for a spin and parsed docs across several packages -- I haven't verified all in detail, but at least it runs across several doc site types (it required small tweaks). It integrated nicely into the PromptingTools RAGTools:

```julia
## Load up all Makie docs
dirs = ["makie/Makie.jl-gh-pages/dev",
"makie/AlgebraOfGraphics.jl-gh-pages/dev",
"makie/GeoMakie.jl-gh-pages/dev",
"makie/GraphMakie.jl-gh-pages/dev",
"makie/MakieThemes.jl-gh-pages/dev",
"makie/TopoPlots.jl-gh-pages/dev",
"makie/Tyler.jl-gh-pages/dev"
]
output_chunks = Vector{SubString{String}}()
output_sources = Vector{String}()
for dir in dirs
    @info ">> Directory: $dir"
    files = mapreduce(x -> joinpath.(Ref(x[1]), x[3]), vcat, walkdir(dir))
    files = filter(x -> endswith(x, ".html"), files)
    chunks, sources = RT.get_chunks(DocParserChunker(), files)
    append!(output_chunks, chunks)
    append!(output_sources, sources)
end
length(output_chunks), length(output_sources)
```

I can share the full script of methods added, but what's relevant is probably this:

```julia
## HTML parser from txt -> vector of dict (ie, skips the download)
"Parses an HTML string into a vector of Dicts with text and metadata. Returns: `parsed_blocks` and `title` of the document."
function parse_html_to_blocks(txt::String)
    parsed_blocks = Vector{Dict{String,Any}}()
    heading_hierarchy = Dict{Symbol,Any}()
    r_parsed = parsehtml(txt)
    # Getting title of the document
    title = [el
             for el in AbstractTrees.PreOrderDFS(r_parsed.root)
             if el isa HTMLElement && tag(el) == :title] .|> text |> Base.Fix2(join, " / ")
    # Content markers:
    # Documenter: div:docs-main, article: content (within div:#documenter)
    # Franklin: div:main -> div:franklin-content (within div:#main)
    # Vitepress: div:#VPContent
    ## Look for element ID (for Vitepress only)
    content_ = [el
                for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                if el isa HTMLElement && getattr(el, "id", nothing) in ["VPContent"]]
    if length(content_) == 0
        ## Fallback, looking for a class
        content_ = [el
                    for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                    if el isa HTMLElement && getattr(el, "class", nothing) in ["content", "franklin-content"]]
    end
    if length(content_) > 0
        process_node!(only(content_), heading_hierarchy, parsed_blocks)
    end
    return parsed_blocks, title
end
```

This is not a reference implementation, just a quick hack to test it out. I'm saving it here for future reference.
Gotcha, will make changes
This is a new subrepo dedicated to crawling, scraping, parsing and chunking of Julia's documentation (GSOC project).
DocsScraper contains the DocsScraper.jl file, where the parser code is implemented.
Usage:

```julia
parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")
```
Returns:
A Vector of Dicts, each containing heading/text/code along with a Dict of the respective metadata.
Requirements:
HTTP, Gumbo, AbstractTrees, URIs
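For illustration, downstream code might consume the result like this; the dictionary keys below are assumptions, and the parser defines the actual ones:

```julia
# Hypothetical consumption of the parser output; the "text" key is illustrative only.
using DocsScraper

parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")
for block in parsed_text
    haskey(block, "text") && println(block["text"])
end
```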