DocsScraper sub repo #5

Open
splendidbug wants to merge 8 commits into main

Conversation

@splendidbug

This is a new sub-repo dedicated to crawling, scraping, parsing, and chunking Julia's documentation (GSoC project).

DocsScraper contains the file DocsScraper.jl, where the parser code is implemented.

Usage:
parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")

Returns:
A Vector of Dicts, each containing a Heading/Text/Code entry along with a Dict of its metadata

Requirements:
HTTP, Gumbo, AbstractTrees, URIs
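
For illustration, a minimal sketch of inspecting the returned blocks (it makes no assumptions about the exact keys inside each Dict):

# Sketch only: iterate over the blocks returned by parse_url and inspect them.
parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")
for block in parsed_text
    # each `block` is a Dict holding the extracted heading/text/code plus its metadata
    @show keys(block)
end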

@svilupp self-requested a review on March 19, 2024 at 13:15
@svilupp (Owner) commented Mar 19, 2024

Great start! It's very clear.

I just had a couple of stylistic comments - let me know what you think.

@svilupp (Owner) commented Mar 19, 2024

It would also be good to add some minimal tests in the test/ folder. The entry point is usually test/runtests.jl, and you'll need to add an [extras] section to your Project.toml - see the parent project for an example.

It would be good to cover at least the main pathways of the core functions.
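
A minimal sketch of what test/runtests.jl could look like (the specific checks are illustrative assumptions, not requirements); Test would also need to be listed under [extras] and in a [targets] test entry in Project.toml:

# test/runtests.jl -- minimal sketch; the exact assertions are illustrative only
using DocsScraper
using Test

@testset "DocsScraper.jl" begin
    parsed_text = parse_url("https://docs.julialang.org/en/v1/base/multi-threading/")
    @test !isempty(parsed_text)
    @test all(block -> block isa Dict, parsed_text)
end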

Review threads on DocsScraper/Project.toml and DocsScraper/src/DocsScraper.jl (resolved/outdated)
@svilupp (Owner) commented Mar 20, 2024

@splendidbug How are you thinking about the scope of this PR before we merge it?

I think we can have it open for a while, but I'd like to see the following prior to merge:

  • functionality:

    • download the web page (in memory, or saved to a local file for future reference/re-parsing; it should reflect the webpage structure; HTML files only)
    • parsing the HTML page
    • extracting links from the HTML page
    • filtering pages to only the same domain (this should probably be an argument)
    • top-level functionality to run the scrape -> links -> parse loop and repeat it on the other links in scope (see the sketch below)
  • tests for the main behaviours

It doesn't have to cover all edge cases, but it would be good to cover all these steps, because it will force us to design a good interface/API from the start (and not change it soon).
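
A rough sketch of what that top-level loop could look like (all names other than parse_url are hypothetical placeholders, not this PR's API):

# Hypothetical top-level crawl loop: scrape -> extract links -> parse, repeated in scope.
# `extract_links` and `same_domain` do not exist yet; they stand in for the
# link-extraction and domain-filtering steps listed above.
function crawl(start_url::String)
    visited = Set{String}()
    queue = [start_url]
    results = Dict{String,Any}()
    while !isempty(queue)
        url = pop!(queue)
        url in visited && continue
        push!(visited, url)
        results[url] = parse_url(url)                 # download + parse one page
        for link in extract_links(url)                # hypothetical helper
            same_domain(link, start_url) || continue  # hypothetical filter (could be an argument)
            link in visited || push!(queue, link)
        end
    end
    return results
end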

E.g., in separate PRs we should consider checking anti-scraping measures, such as robots.txt, and make sure we adhere to them. But that's extra and not part of the core scope of this PR.

WDYT?

@splendidbug (Author) commented Mar 20, 2024

> [quoted @svilupp's scope list from the previous comment]

Sounds good, and I agree that keeping the PR open to add the other functionality is a good idea.
Regarding filtering pages to only the same domain: isn't that necessary anyway to avoid a virtually infinite set of URLs? Is there a use case where it's beneficial to go out of domain?

Also, when we start implementing the crawler, we'll have to take care of memory overflow, right?

@svilupp (Owner) commented Mar 22, 2024

> Regarding filtering pages to only the same domain: isn't that necessary anyway to avoid a virtually infinite set of URLs? Is there a use case where it's beneficial to go out of domain?

Yes, we will always need to have some filter. The value in having it user-provided when required is that you can capture multi-doc sites, i.e., sites that cover multiple packages (so you would want to filter against a list of domains). An example is sciml.ai, I think.
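
For example, the user-provided filter could be as simple as an allow-list of hosts (a sketch only; none of these names are part of this PR):

using URIs

# Hypothetical allow-list filter: keep only links whose host is on the list.
allowed_hosts = ["docs.sciml.ai", "sciml.ai"]
is_allowed(url) = URI(url).host in allowed_hosts
filter(is_allowed, ["https://docs.sciml.ai/Overview/stable/", "https://example.com/page"])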

> Also, when we start implementing the crawler, we'll have to take care of memory overflow, right?

Interesting. Why is that a concern for you?
Assuming we scrape one site at a time, we shouldn't get anywhere near RAM limits on a standard laptop (I'd expect MBs of data at most).
We could add a serialization step if we ever ran the scraper over tens of thousands of websites in a loop, but that won't happen here, and users can do it themselves if they have such a huge task.

Or did you have a different concern?

@svilupp (Owner) commented Mar 22, 2024

Separately, could you please go through the review and click "Resolve" on the feedback you've already tackled? Also, there were some suggestions - in the future, you can just accept those and GitHub will make the changes for you :)

@splendidbug (Author)

> [quoted @svilupp's answers from the previous comment]

That was my concern. Thanks!

@svilupp (Owner) left a comment

Good stuff! Some minor changes requested to add more modular functions and to simplify the logic.

Review threads on DocsScraper/src/DocsScraper.jl (resolved/outdated)
@svilupp (Owner) commented Mar 24, 2024

Btw, as mentioned on Slack, an easy way to accumulate strings is to pass around an io=IOBuffer() (the io::IO argument in the process_node function) that the child nodes can write into (eg, print(io, …) or write(io, …)). I somehow prefer print - it avoids having to instantiate all the intermediate strings.
You extract the result from io via str = String(take!(io)), which resets the position in io / removes all its content, ie, you can do it only once.
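
A minimal, standalone sketch of the pattern (not the actual process_node code):

# Accumulate pieces of text into one buffer instead of concatenating strings.
io = IOBuffer()
print(io, "first piece, ")
print(io, "second piece")
str = String(take!(io))  # "first piece, second piece"; take! empties the buffer, so call it only once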

@svilupp (Owner) commented Apr 9, 2024

As discussed, I haven't looked at the parser yet.

One small observation - DocsScraper.jl should only define the module, which imports/exports things and includes the source files; no code definitions in there.
Ie,

module DocsScraper

using AbstractTrees
using Gumbo
using HTTP
using URIs

export x,y,z
include("parser.jl")

export x,y,z
include("...")

end

@svilupp (Owner) commented Apr 9, 2024

On a separate note, I took it for a spin and parsed docs across several packages -- I haven't verified all in detail, but at least it runs across several doc site types (it required small tweaks).

It integrated nicely into the PromptingTools RAGTools:

## Load up all Makie docs
dirs = ["makie/Makie.jl-gh-pages/dev",
    "makie/AlgebraOfGraphics.jl-gh-pages/dev",
    "makie/GeoMakie.jl-gh-pages/dev",
    "makie/GraphMakie.jl-gh-pages/dev",
    "makie/MakieThemes.jl-gh-pages/dev",
    "makie/TopoPlots.jl-gh-pages/dev",
    "makie/Tyler.jl-gh-pages/dev"
]
output_chunks = Vector{SubString{String}}()
output_sources = Vector{String}()

for dir in dirs
    @info ">> Directory: $dir"
    files = mapreduce(x -> joinpath.(Ref(x[1]), x[3]), vcat, walkdir(dir))
    files = filter(x -> endswith(x, ".html"), files)
    chunks, sources = RT.get_chunks(DocParserChunker(), files)
    append!(output_chunks, chunks)
    append!(output_sources, sources)
end

length(output_chunks), length(output_sources)

I can share the full script of the methods I added, but what's relevant is probably this.
You'll probably recognize your HTML parser. I had to add support for Documenter, Franklin, and VitePress (each one exposes a slightly different HTML element as the "content" node). Documented in the comments.

## HTML parser from txt -> vector of dict (ie, skips the download)
## Assumes `using Gumbo, AbstractTrees` and the `process_node!` helper from DocsScraper
"Parses an HTML string into a vector of Dicts with text and metadata. Returns: `parsed_blocks` and `title` of the document."
function parse_html_to_blocks(txt::String)
    parsed_blocks = Vector{Dict{String,Any}}()
    heading_hierarchy = Dict{Symbol,Any}()

    r_parsed = parsehtml(txt)

    # Getting title of the document 
    title = [el
             for el in AbstractTrees.PreOrderDFS(r_parsed.root)
             if el isa HTMLElement && tag(el) == :title] .|> text |> Base.Fix2(join, " / ")

    # Content markers:
    # Documenter: div:docs-main, article: content (within div:#documenter)
    # Franklin: div:main -> div:franklin-content (within div:#main)
    # Vitepress: div:#VPContent

    ## Look for element ID (for Vitepress only)
    content_ = [el
                for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                if el isa HTMLElement && getattr(el, "id", nothing) in ["VPContent"]]
    if length(content_) == 0
        ## Fallback, looking for a class
        content_ = [el
                    for el in AbstractTrees.PreOrderDFS(r_parsed.root)
                    if el isa HTMLElement && getattr(el, "class", nothing) in ["content", "franklin-content"]]
    end

    if length(content_) > 0
        process_node!(only(content_), heading_hierarchy, parsed_blocks)
    end

    return parsed_blocks, title
end

This is not a reference implementation, just a quick hack to test it out. I'm saving it here for future reference.

@splendidbug (Author)

> [quoted @svilupp's module-structure suggestion from the previous comment]

Gotcha, will make changes
