Faster HTML tokenization #109

Open
fimad opened this issue Jul 15, 2024 · 0 comments

fimad commented Jul 15, 2024

While debugging a slow scraper recently, I found that the vast majority of the time was spent in TagSoup.parseTags. Naively swapping in the fast-tagsoup package gives a ~20x speedup for the case I was debugging.

Example of using the fast-tagsoup parser (note that it only works with strict ByteStrings):

import Text.HTML.Scalpel.Core
import qualified Text.HTML.TagSoup.Fast as TagSoupFast
import qualified Data.ByteString as BS

-- Like scrapeStringLikeT, but tokenizes with fast-tagsoup instead of TagSoup.parseTags.
scrapeByteStringT :: Monad m => BS.ByteString -> ScraperT BS.ByteString m a -> m (Maybe a)
scrapeByteStringT html scraper = scrapeT scraper tags
    where
        -- fast-tagsoup only accepts strict ByteStrings and yields [Tag ByteString].
        tags = TagSoupFast.parseTags html
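
For a quick sanity check, the same idea can be tried in pure form. This is only a minimal sketch, assuming the scalpel-core, fast-tagsoup, and bytestring packages are installed; the name scrapeByteString and the sample markup are made up for illustration and are not part of scalpel:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel.Core (Scraper, scrape, texts)
import qualified Text.HTML.TagSoup.Fast as TagSoupFast
import qualified Data.ByteString as BS

-- Hypothetical pure counterpart of scrapeByteStringT (not a scalpel function).
scrapeByteString :: BS.ByteString -> Scraper BS.ByteString a -> Maybe a
scrapeByteString html scraper = scrape scraper (TagSoupFast.parseTags html)

main :: IO ()
main =
    -- Collect the text of every <li> in a small, made-up document.
    print (scrapeByteString "<ul><li>one</li><li>two</li></ul>" (texts "li"))
```

The scraper itself is unchanged; only the tokenizer that produces the tag list differs.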

Given this ridiculous speedup, it probably makes sense to try to make fast-tagsoup the default HTML parser for scalpel. Doing that would require figuring out how to work it into the existing API, where scrapers are parameterized by the underlying string type.
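
One possible shape for this, purely as a sketch and not a proposal for the final API: a small type class that says how to tokenize each supported string type through fast-tagsoup. The names FastParseable, fastParseTags, and scrapeFast are hypothetical. The Text instance round-trips through UTF-8, which is an assumption about the input encoding and costs extra conversions that may eat into the speedup:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)
import Text.HTML.Scalpel.Core (Scraper, scrape, text)
import Text.HTML.TagSoup (Tag)
import qualified Text.HTML.TagSoup.Fast as TagSoupFast
import Text.StringLike (StringLike)

-- Hypothetical class: string types that can be tokenized via fast-tagsoup.
class FastParseable str where
    fastParseTags :: str -> [Tag str]

-- Strict ByteStrings go straight to the fast parser.
instance FastParseable BS.ByteString where
    fastParseTags = TagSoupFast.parseTags

-- Text is encoded to UTF-8, parsed, then decoded tag by tag
-- (Tag is a Functor, so fmap converts names, attributes, and text).
instance FastParseable T.Text where
    fastParseTags =
        map (fmap (TE.decodeUtf8With lenientDecode))
            . TagSoupFast.parseTags
            . TE.encodeUtf8

-- Hypothetical entry point that keeps the string-type parameter.
scrapeFast :: (FastParseable str, StringLike str) => str -> Scraper str a -> Maybe a
scrapeFast html scraper = scrape scraper (fastParseTags html)

main :: IO ()
main = print (scrapeFast ("<p>hello</p>" :: T.Text) (text "p"))
```

Whether the conversions are acceptable for String and lazy types would need measuring before committing to something like this.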

Maybe we do a breaking API change and finally do away with all the StringLike code once and for all.
