Class for things that can be parsed by Scalpel #78

Open
tysonzero opened this issue Mar 6, 2019 · 5 comments

Comments

@tysonzero

This would be very useful for things like interacting with Servant. I also like the way Aeson uses this for things like polymorphic .:, so that could be worth looking into.
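
For context, the polymorphism of .: in Aeson looks roughly like this (a standalone sketch, not code from this project; the field names are made up):

{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (Object, (.:))
import Data.Aeson.Types (Parser)

-- The same operator extracts any FromJSON result type; the instance is chosen
-- by the type expected at the use site.
nameAndCount :: Object -> Parser (String, Int)
nameAndCount o = (,) <$> o .: "name" <*> o .: "count"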

See this for the current class we are using for this purpose and this for the way we are integrating it with Servant.

fimad commented Mar 6, 2019

For something analogous to .: in Aeson, are you envisioning a function with a type like Scalpable a => Selector -> Scraper a that can be used in place of existing primitives like text and attr?

This sounds useful to me; if you are interested in implementing this, feel free to fire up a PR.

tysonzero commented Mar 6, 2019

Yes that is basically what I was thinking.

After thinking about it some more, there is a key difference between Aeson and Scalpel that makes this change in primitives hard:

In Aeson, a Parser (as a side note, the name is rather misleading) has already had its input applied, so it is approximately Maybe a. In Scalpel, a Scraper takes in its input and is more like Html -> Maybe a, whereas string parsing libraries like parsec are more like String -> Maybe (String, a), since they need to move along the string.
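
As a rough sketch of the three shapes being compared (simplified stand-in types, not the real library definitions):

newtype AesonStyle       a = AesonStyle   (Maybe a)            -- input already applied
newtype ScalpelStyle inp a = ScalpelStyle (inp -> Maybe a)     -- a reader over the input
newtype ParsecStyle  s   a = ParsecStyle  (s -> Maybe (s, a))  -- stateful: consumes input as it goes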

String parsing libraries are more or less stuck with the State-like approach, but JSON/HTML/XML etc. libraries get to choose between the two: apply the input up front (the Aeson approach) or keep the parser as a function from input to result (the current Scalpel approach).

The advantage of the former approach is that it is significantly more powerful: you can build essentially everything on top of polymorphic primitives like .:, as long as you have a suitable instance that just returns the input as-is (instance FromJSON Value) along with a Monad instance for the "Parser".

The advantage of the latter is that, because of its reduced power, you can potentially implement optimizations that would not otherwise be possible, and allow for things like printing out the structure of the parser itself (e.g. in BNF form).

So with that said, I think a class would be nice either way, but .: equivalents are probably only worthwhile if Scalpel switched to the Aeson approach and allowed you to directly pass around the Html (TagSpec-ish) object and apply various parsers to it at any time.

This would IMO be a very nice and intuitive interface. SerialScraper would probably still have a monadic interface since, unlike Scraper, it is stateful, just like a Parsec parser.

tysonzero commented Mar 7, 2019

One potential interface could be something like this:

data Node str  -- single node
data Html str  -- zero or more nodes

class StringLike str => FromNode str a where
    fromNode :: Node str -> Maybe a

class FromNode str a => FromHtml str a where
    fromHtml :: Html str -> Maybe a
    -- fromHtml =<< fromNode n = fromHtml n

-- always succeed
instance FromNode str (Node str)
instance FromNode str (Html str)
instance FromHtml str (Html str)

-- wouldn't always succeed, and would probably be better to intentionally leave off
-- instance FromHtml str (Node str)

prepare :: StringLike str => [Tag str] -> Html str

attr :: StringLike str => String -> Node str -> Maybe str

text :: StringLike str => Node str -> str

select :: FromNode str a => Selector -> Html str -> [a]

inSerial :: StringLike str => Serial str a -> Html str -> Maybe a

stepNext :: FromNode str a => Serial str a

seekNext :: FromHtml str b => (Node str -> Maybe a) -> Serial str (a, b)
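
As a rough usage sketch building on the hypothetical interface above (Comment, the data-author attribute, and the div.comment selector are all made-up examples):

{-# LANGUAGE MultiParamTypeClasses, OverloadedStrings #-}
import Data.Text (Text)

data Comment = Comment { commentAuthor :: Text, commentBody :: Text }

-- A hand-written instance assembled from the proposed primitives.
instance FromNode Text Comment where
    fromNode node = Comment <$> attr "data-author" node <*> pure (text node)

-- select then handles iteration: every matching node becomes a Comment.
comments :: Html Text -> [Comment]
comments = select ("div" @: [hasClass "comment"])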

fimad commented Mar 10, 2019

I might be misunderstanding .:, but it seems like the main benefit it provides is that it allows you to extract more complicated values than inner text or attributes without having to resort to explicitly chroot'ing into a sub-tree.

For example:

class Scrapable str a where
    scraper :: Scraper str a

extract :: (Scrapable str a, StringLike str) => Selector -> Scraper str a
extract selector = chroot selector scraper

extracts :: (Scrapable str a, StringLike str) => Selector -> Scraper str [a]
extracts selector = chroots selector scraper
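
For instance, a hypothetical instance for a made-up Post type (using the class above together with scalpel's existing anySelector, @:, and hasClass combinators):

{-# LANGUAGE MultiParamTypeClasses, OverloadedStrings #-}
import Data.Text (Text)
import Text.HTML.Scalpel

data Post = Post { postAuthor :: Text, postBody :: Text }

instance Scrapable Text Post where
    scraper = Post <$> attr "data-author" anySelector <*> text anySelector

-- One call per matching element, with no explicit chroots at the call site.
allPosts :: Scraper Text [Post]
allPosts = extracts ("div" @: [hasClass "post"])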

Internally, scalpel basically uses an interface like the one you propose; the context is just passed implicitly via a Monad in the public API. It seems like you could get something similar by partially applying the existing scraping functions and passing around a Scraper str a -> Maybe a.

Right now I think the parsing will probably be redone for each application, but you could probably restructure the internals so that it is only performed once.
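
A rough sketch of that partial-application idea, using the existing scrapeStringLike entry point (onPage is an invented name; RankNTypes just keeps the result type polymorphic):

{-# LANGUAGE RankNTypes #-}
import Text.HTML.Scalpel

-- A reusable "run any scraper against this page" function; note that, as
-- mentioned above, the markup would currently be re-parsed on each call.
onPage :: StringLike str => str -> (forall a. Scraper str a -> Maybe a)
onPage page = \s -> scrapeStringLike page s

Passing around onPage page would then play a role similar to passing an Html value around explicitly.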

@tysonzero

> I might be misunderstanding .:, but it seems like the main benefit it provides is that it allows you to extract more complicated values than inner text or attributes without having to resort to explicitly chroot'ing into a sub-tree.

That is essentially true. However, what makes this benefit so substantial in Aeson is that .: allows you to parse both fully defined end objects (like your extract above) and intermediate Value/Object values that can then be parsed further or passed around. So Aeson does not need equivalents of both extract and chroot; it just needs .:.
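
Concretely, in Aeson (a standalone sketch; the field names are made up):

{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (Object, Value, (.:))
import Data.Aeson.Types (Parser)

-- The same operator yields either a finished value ("name") or an untouched
-- Value ("payload") that can be handed to another parser later on.
nameAndPayload :: Object -> Parser (String, Value)
nameAndPayload o = (,) <$> o .: "name" <*> o .: "payload"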

Currently, adding functions like the above to scalpel would not allow any existing functions to be removed, as they do not supersede anything. So while they are nice convenience functions, they don't really simplify the interface or provide any composability benefits.

> Internally, scalpel basically uses an interface like the one you propose; the context is just passed implicitly via a Monad in the public API. It seems like you could get something similar by partially applying the existing scraping functions and passing around a Scraper str a -> Maybe a.

To keep both possible APIs as clean and performant as possible, one option could be to combine the Html str -> Maybe a approach with something like ReaderT layered on top for when you want to perform a series of operations over the same context.
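
A minimal sketch of that combination, assuming the Html type from the earlier proposal (ScraperM and liftHtml are invented names):

import Control.Monad.Trans.Reader (ReaderT (..))

-- The direct functions (Html str -> Maybe a) stay the primitives; wrapping them
-- in ReaderT recovers a monadic interface when several scrapes share one context.
type ScraperM str a = ReaderT (Html str) Maybe a

liftHtml :: (Html str -> Maybe a) -> ScraperM str a
liftHtml = ReaderT

runScraperM :: ScraperM str a -> Html str -> Maybe a
runScraperM = runReaderT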
