DocsScraper sub repo #5

Open
splendidbug wants to merge 8 commits into main

Conversation

@splendidbug

This is a new sub-repo dedicated to crawling, scraping, parsing, and chunking Julia's documentation (GSoC project).

DocsScraper contains the file DocsScraper.jl, where the parser code is implemented.

Usage:
parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")

Returns:
A Vector of Dicts, each containing a Heading/Text/Code entry along with a Dict of its metadata

Requirements:
HTTP, Gumbo, AbstractTrees, URIs
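
For illustration, a minimal sketch of inspecting the returned blocks (it makes no assumptions about the exact keys inside each Dict):

# Sketch only: iterate over the blocks returned by parse_url and inspect them.
parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")
for block in parsed_text
    # each `block` is a Dict holding the extracted heading/text/code plus its metadata
    @show keys(block)
end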

@svilupp self-requested a review on March 19, 2024 at 13:15
@svilupp (Owner) commented Mar 19, 2024

Great start! It's very clear.

I just had a couple of stylistic comments - let me know what you think.

@svilupp (Owner) commented Mar 19, 2024

It would also be good to add some minimal tests in the test/ folder. The entry point is usually test/runtests.jl, and you'll need to add an [extras] section to your Project.toml - see the parent project for an example.

It would be good to cover at least the main pathways of the core functions.
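
A minimal sketch of what test/runtests.jl could look like (the specific checks are illustrative assumptions, not requirements); Test would also need to be listed under [extras] and in a [targets] test entry in Project.toml:

# test/runtests.jl -- minimal sketch; the exact assertions are illustrative only
using DocsScraper
using Test

@testset "DocsScraper.jl" begin
    parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")
    @test !isempty(parsed_text)
    @test all(block -> block isa Dict, parsed_text)
end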

Review threads on DocsScraper/Project.toml and DocsScraper/src/DocsScraper.jl (resolved/outdated)
@svilupp (Owner) commented Mar 20, 2024

@splendidbug How are you thinking about the scope of this PR before we merge it?

I think we can have it open for a while, but I'd like to see the following prior to merge:

  • functionality:

    • download the web page (in memory, or saved to a local file for future reference/re-parsing; it should reflect the webpage structure; HTML files only)
    • parsing the HTML page
    • extracting links from the HTML page
    • filtering pages to only the same domain (this should probably be an argument)
    • top-level functionality to run the scrape -> links -> parse loop and repeat it on the other links in scope (see the sketch below)
  • tests for the main behaviours

It doesn't have to cover all edge cases, but it would be good to cover all these steps, because it will force us to design a good interface/API from the start (and not change it soon).
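
A rough sketch of what that top-level loop could look like (all names other than parse_url are hypothetical placeholders, not this PR's API):

# Hypothetical top-level crawl loop: scrape -> extract links -> parse, repeated in scope.
# `extract_links` and `same_domain` do not exist yet; they stand in for the
# link-extraction and domain-filtering steps listed above.
function crawl(start_url::String)
    visited = Set{String}()
    queue = [start_url]
    results = Dict{String,Any}()
    while !isempty(queue)
        url = pop!(queue)
        url in visited && continue
        push!(visited, url)
        results[url] = parse_url(url)                 # download + parse one page
        for link in extract_links(url)                # hypothetical helper
            same_domain(link, start_url) || continue  # hypothetical filter (could be an argument)
            link in visited || push!(queue, link)
        end
    end
    return results
end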

E.g., in separate PRs we should consider checking anti-scraping measures, such as robots.txt, and make sure we adhere to them. But that's extra and not part of the core scope of this PR.

WDYT?

@splendidbug (Author) commented Mar 20, 2024

> [quoted @svilupp's scope list from the previous comment]

Sounds good, and I agree that keeping the PR open to add the other functionality is a good idea.
Regarding filtering pages to only the same domain: isn't that necessary anyway to avoid a virtually infinite set of URLs? Is there a use case where it's beneficial to go out of domain?

Also, when we start implementing the crawler, we'll have to take care of memory overflow, right?

@svilupp (Owner) commented Mar 22, 2024

> Regarding filtering pages to only the same domain: isn't that necessary anyway to avoid a virtually infinite set of URLs? Is there a use case where it's beneficial to go out of domain?

Yes, we will always need to have some filter. The value in having it user-provided when required is that you can capture multi-doc sites, i.e., sites that cover multiple packages (so you would want to filter against a list of domains). An example is sciml.ai, I think.
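
For example, the user-provided filter could be as simple as an allow-list of hosts (a sketch only; none of these names are part of this PR):

using URIs

# Hypothetical allow-list filter: keep only links whose host is on the list.
allowed_hosts = ["docs.sciml.ai", "sciml.ai"]
is_allowed(url) = URI(url).host in allowed_hosts
filter(is_allowed, ["https://docs.sciml.ai/Overview/stable/", "https://example.com/page"])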

> Also, when we start implementing the crawler, we'll have to take care of memory overflow, right?

Interesting. Why is that a concern for you?
Assuming we scrape one site at a time, we shouldn't get anywhere near RAM limits on a standard laptop (I'd expect MBs of data at most).
We could add a serialization step if we ever ran the scraper over tens of thousands of websites in a loop, but that won't happen here, and users can do it themselves if they have such a huge task.

Or did you have a different concern?

@svilupp (Owner) commented Mar 22, 2024

Separately, could you please go through the review and click "Resolve" on the feedback you've already tackled? Also, there were some suggestions - in the future, you can just accept those and GitHub will make the changes for you :)

@splendidbug (Author)

> [quoted @svilupp's answers from the previous comment]

That was my concern. Thanks!

@svilupp (Owner) left a comment

Good stuff! Some minor changes requested to add more modular functions and to simplify the logic.

Review threads on DocsScraper/src/DocsScraper.jl (resolved/outdated)
@svilupp (Owner) commented Mar 24, 2024

Btw, as mentioned on Slack, an easy way to accumulate strings is to pass around an io=IOBuffer() (the io::IO argument in the process_node function) that the child nodes can write into (eg, print(io, …) or write(io, …)). I somehow prefer print - it avoids having to instantiate all the intermediate strings.
You extract the result from io via str = String(take!(io)), which resets the position in io / removes all its content, ie, you can do it only once.
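
A minimal, standalone sketch of the pattern (not the actual process_node code):

# Accumulate pieces of text into one buffer instead of concatenating strings.
io = IOBuffer()
print(io, "first piece, ")
print(io, "second piece")
str = String(take!(io))  # "first piece, second piece"; take! empties the buffer, so call it only once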

@svilupp (Owner) commented Apr 9, 2024

As discussed, I haven't looked at the parser yet.

One small observation - DocsScraper.jl should only define the module, which imports/exports things and includes the source files; no code definitions in there.
Ie,

module DocsScraper

using AbstractTrees
using Gumbo
using HTTP
using URIs

export x,y,z
include("parser.jl")

export x,y,z
include("...")

end

@svilupp (Owner) commented Apr 9, 2024

On a separate note, I took it for a spin and parsed docs across several packages -- I haven't verified all in detail, but at least it runs across several doc site types (it required small tweaks).

It integrated nicely into the PromptingTools RAGTools:

## Load up all Makie docs
dirs = ["makie/Makie.jl-gh-pages/dev",
    "makie/AlgebraOfGraphics.jl-gh-pages/dev",
    "makie/GeoMakie.jl-gh-pages/dev",
    "makie/GraphMakie.jl-gh-pages/dev",
    "makie/MakieThemes.jl-gh-pages/dev",
    "makie/TopoPlots.jl-gh-pages/dev",
    "makie/Tyler.jl-gh-pages/dev"
]
output_chunks = Vector{SubString{String}}()
output_sources = Vector{String}()

for dir in dirs
    @info ">> Directory: $dir"
    files = mapreduce(x -> joinpath.(Ref(x[1]), x[3]), vcat, walkdir(dir))
    files = filter(x -> endswith(x, ".html"), files)
    chunks, sources = RT.get_chunks(DocParserChunker(), files)
    append!(output_chunks, chunks)
    append!(output_sources, sources)
end

length(output_chunks), length(output_sources)

I can share the full script of the methods I added, but what's relevant is probably this.
You'll probably recognize your HTML parser. I had to add support for Documenter, Franklin, and VitePress (each one exposes a slightly different HTML element as the "content" node). Documented in the comments.

## HTML parser from txt -> vector of dict (ie, skips the download)
## Assumes `using Gumbo, AbstractTrees` and the `process_node!` helper from DocsScraper
"Parses an HTML string into a vector of Dicts with text and metadata. Returns: `parsed_blocks` and `title` of the document."
function parse_html_to_blocks(txt::String)
    parsed_blocks = Vector{Dict{String,Any}}()
    heading_hierarchy = Dict{Symbol,Any}()

    r_parsed = parsehtml(txt)

    # Getting title of the document 
    title = [el
             for el in AbstractTrees.PreOrderDFS(r_parsed.root)
             if el isa HTMLElement && tag(el) == :title] .|> text |> Base.Fix2(join, " / ")

    # Content markers:
    # Documenter: div:docs-main, article: content (within div:#documenter)
    # Franklin: div:main -> div:franklin-content (within div:#main)
    # Vitepress: div:#VPContent

    ## Look for element ID (for Vitepress only)
    content_ = [el
                for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                if el isa HTMLElement && getattr(el, "id", nothing) in ["VPContent"]]
    if length(content_) == 0
        ## Fallback, looking for a class
        content_ = [el
                    for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                    if el isa HTMLElement && getattr(el, "class", nothing) in ["content", "franklin-content"]]
    end

    if length(content_) > 0
        process_node!(only(content_), heading_hierarchy, parsed_blocks)
    end

    return parsed_blocks, title
end

This is not a reference implementation, just a quick hack to test it out. I'm saving it here for future reference.

@splendidbug (Author)

> [quoted @svilupp's module-structure suggestion from the previous comment]

Gotcha, will make changes
