blogspot article text scrape? #2789

Closed
XeonG opened this issue Jul 28, 2022 · 3 comments

Comments

@XeonG
XeonG commented Jul 28, 2022

Is it possible to also get the blog text that was posted? By default it just downloads the images/videos, but it would be good to get the article text as well.

@kattjevfel
Contributor

The metadata post processor is probably what you are looking for, more or less. It dumps the text into the "content" section, but at least in this example the encoding seems to break:
"content": "Icy Moon Rise        Checked the photographer's Ephemeris to see when the moon would be rising and setting a few days ago for the full moon, very helpful program i might add, set up a few shots at the Northport Lighthouse on the tip of the Leelanau Peninsula, arrived an hour before the moon was set to rise, composed the first image i wanted and waited for it to rise in front of me over Lake Michigan, very otherworldly sight, it was simply gorgeous, I hope you enjoy!",

Something like this should work:

"postprocessors": [
{
    "name": "metadata",
    "mode": "json",
    "whitelist": ["blogger"]
}
],
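Once the metadata files exist, the article text still has to be pulled out of them. A minimal sketch of that step, assuming each metadata file is a JSON object with the text either under a top-level "content" key or nested under "post" (both layouts are assumptions here, as are the function names):

```python
import json
from pathlib import Path
from typing import List


def extract_content(metadata_path: Path) -> Path:
    """Write the "content" text of one metadata JSON file to a .txt file."""
    data = json.loads(metadata_path.read_text(encoding="utf-8"))

    # Assumption: the text sits under data["post"]["content"] or, failing
    # that, under a top-level "content" key. Missing keys give empty text.
    post = data.get("post")
    if isinstance(post, dict) and "content" in post:
        text = post["content"]
    else:
        text = data.get("content", "")

    # "image.jpg.json" becomes "image.jpg.txt", next to the metadata file.
    out_path = metadata_path.with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path


def extract_all(download_dir: Path) -> List[Path]:
    """Process every *.json file found below download_dir."""
    return [extract_content(p) for p in sorted(download_dir.rglob("*.json"))]
```

Calling `extract_all(Path("gallery-dl"))` would then leave one .txt file beside each metadata file; the directory name is only a placeholder for wherever the downloads end up.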

@XeonG
Author

XeonG commented Jul 29, 2022

How can I use that? Can it be converted to a command-line argument?

@mikf
Owner

mikf commented Jul 31, 2022

You need a config file to properly use post processors, that's not something that can really be done with just command-line arguments.
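For anyone unsure how to get such settings into a config file, here is a rough sketch that writes one with Python. It assumes post processor settings belong under the top-level "extractor" section of gallery-dl's JSON config; the helper names and the output path are made up for illustration.

```python
import json
from pathlib import Path

# The post processor settings suggested earlier in this thread.
POSTPROCESSORS = [
    {"name": "metadata", "mode": "json", "whitelist": ["blogger"]},
]


def build_config(postprocessors):
    """Wrap a list of post processors in a config dictionary.

    Assumption: gallery-dl looks for "postprocessors" inside "extractor".
    """
    return {"extractor": {"postprocessors": postprocessors}}


def write_config(path: Path, postprocessors) -> Path:
    """Serialize the config as indented JSON and write it to path."""
    config = build_config(postprocessors)
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical file name; any location gallery-dl reads would do.
    print(write_config(Path("blogger-text.json"), POSTPROCESSORS))
```

The resulting file could then be handed to gallery-dl with its `--config` option or merged into an existing config file; the project's configuration docs list the default locations for each platform.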

The settings from #2789 (comment) are equivalent to --write-metadata, but that 1) writes a lot more than just the text content and 2) does not include text-only posts.

To get only the text, including for text-only posts, you need the changes from 5038893 and something like the following as a post processor:

{
    "postprocessors": [
        {
            "name": "metadata",
            "mode": "custom",
            "format": "{post[content]}",
            "event": "post",
            "filename": "{post[date]:%Y-%m-%d} {post[title]}.txt"
        }
    ]
}
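To see what the "format" and "filename" strings above would produce, the sketch below expands them with Python's built-in `str.format`. This is only an approximation: gallery-dl uses its own formatter, and the metadata values here are invented for illustration.

```python
from datetime import datetime

# Hypothetical metadata for a single post; real values come from the extractor.
metadata = {
    "post": {
        "title": "Icy Moon Rise",
        "date": datetime(2022, 7, 28),
        "content": "Example article text.",
    }
}

# The two format strings from the post processor settings above.
FILENAME_FORMAT = "{post[date]:%Y-%m-%d} {post[title]}.txt"
CONTENT_FORMAT = "{post[content]}"

# str.format resolves post[date] and post[title] as dictionary lookups and
# passes "%Y-%m-%d" to the datetime object as a strftime pattern.
filename = FILENAME_FORMAT.format(**metadata)
content = CONTENT_FORMAT.format(**metadata)

print(filename)  # 2022-07-28 Icy Moon Rise.txt
print(content)   # Example article text.
```

So each post would end up as one text file named after its date and title, containing only the post body.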

mikf closed this as completed Aug 3, 2022