allow users to gather tweets from a shared user timeline #99

Open
us3r1d opened this issue Nov 25, 2022 · 20 comments
Labels
enhancement New feature or request

Comments

@us3r1d

us3r1d commented Nov 25, 2022

OK, this one may be problematic. Feel free to tell me if it isn't workable. :-)

I'm hoping to minimize latency between the time a tweet is posted and the time it's mirrored into a fediverse post, so I'm hitting the 300-requests-per-15-minutes rate limit pretty often. I'm guessing this is because the bot makes a separate tweets request to Twitter for each configured account; I have 45 in there at the moment and would like to be able to support way more than that.
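
As a rough sketch of the arithmetic behind that guess: assuming exactly one API request per configured account per pass (an assumption; the bot may well issue more, for example for retweets or media), the 300-request window bounds how often a full pass can run. The function name and parameters below are hypothetical.

```python
import math


def min_seconds_between_passes(accounts, requests_per_account=1,
                               window_limit=300, window_seconds=15 * 60):
    """Smallest polling interval that keeps all passes inside one rate-limit window."""
    requests_per_pass = accounts * requests_per_account
    if requests_per_pass > window_limit:
        raise ValueError("a single pass already exceeds the window limit")
    # How many complete passes fit into the request budget of one window.
    passes_per_window = window_limit // requests_per_pass
    return math.ceil(window_seconds / passes_per_window)


print(min_seconds_between_passes(45))     # one request per account
print(min_seconds_between_passes(45, 2))  # two requests per account
```

Under these assumptions the interval grows quickly as accounts or per-account requests are added, which is why a single shared query per pass is attractive.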

(This is for https://twitter.oksocial.net/about; that page describes the service.)

I'm pretty sure Twitter's API lets you give it a list of accounts to pull tweets from on each request, rather than just a single account, so it should be possible for the bot to batch all the accounts for which it does not have specific login info into a single request?

I suspect that's much more complicated than adding the bot attribute was, but thanks for considering it. :-)

-robin

@robertoszek
Owner

It is indeed quite a drastic change that would need a bit of work and time to integrate nicely.

We could optimize the gathering of tweets a little more, but even then, once you scale up to a certain number of accounts you'll eventually run into Twitter's API rate limits again.

I'm wondering if in the meantime you could mitigate it somewhat by using RSS feeds as the account source for tweets instead of Twitter's API:
https://github.com/robertoszek/pleroma-bot/blob/develop/docs/gettingstarted/usage.md#using-an-rss-feed
This feature is in the rc release but hasn't been rolled into stable yet.
It has some limitations, though: you won't be able to mirror polls or know which tweets are pinned, and it won't work for accounts that have their tweets protected.

@robertoszek robertoszek added the enhancement New feature or request label Nov 25, 2022
@us3r1d
Author

us3r1d commented Nov 25, 2022

This project won't be connecting to any protected accounts anyway, and polls may not even matter since there's no way for that data to get back onto Twitter, so those are probably no biggie; I'll look at the RSS option.

If the API still supports batching, that would obviously be more desirable in the long term. I haven't seen their API directly in something like 10 years though, so my ideas on how it works are way out of date.

(The temporary solution I looked at first is from my other issue today: running multiple bot instances connected to different API apps. It looks like you've already done that one, so yay! :-)

@us3r1d
Author

us3r1d commented Nov 25, 2022

Ugh; it looks like the standard API doesn't have a way to make a batch request.

Search might work, but a simpler approach would be:

a) have a global setting in the bot config naming a user whose timeline is fetched for tweets before the configured users are processed (since "fetch a user timeline" is a single query); the fetched tweets would then be broken out by author, for each author that matches a user in the bot config

b) have a per-user setting in the bot config that says whether to fetch this user's tweets separately or to use tweets from the globally configured timeline

That way I could set up a single Twitter account that follows all the accounts I want batched.

It would also maybe minimize the impact on the bot's processing path; it makes me do the work of setting up a timeline to scrape, so the optimization workload is on me instead of you. :-)
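
A minimal sketch of the "broken out by what user tweeted them" step in (a), assuming each tweet from the shared timeline is a dict with a hypothetical `author` key. The tweet shape and the function name are illustrative only and are not taken from the bot's code.

```python
from collections import defaultdict


def split_timeline_by_user(timeline_tweets, configured_usernames):
    """Group tweets from one shared timeline by author.

    Only authors present in the bot config are kept; matching is
    case-insensitive because Twitter usernames are.
    """
    wanted = {name.lower() for name in configured_usernames}
    per_user = defaultdict(list)
    for tweet in timeline_tweets:
        author = tweet["author"].lower()
        if author in wanted:
            per_user[author].append(tweet)
    return dict(per_user)


timeline = [
    {"id": 1, "author": "User1", "text": "hello"},
    {"id": 2, "author": "SomeoneElse", "text": "not mirrored"},
    {"id": 3, "author": "user1", "text": "again"},
]
result = split_timeline_by_user(timeline, ["User1", "User2"])
print(result)
```

Tweets from accounts that the timeline user follows but that are not configured in the bot are simply dropped, so the one timeline request stands in for one request per batched account.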

@robertoszek
Owner

Oh, I forgot to mention: if you're gonna try the RSS feature, maybe do so on the latest rc version (1.1.1rc29). It includes some improvements to it, plus multithreading when processing the tweets present in the RSS feed.

@robertoszek
Owner

robertoszek commented Nov 26, 2022

Search might work, but a simpler approach would be:
[...]

Hmmm, I'm a little torn about this.
On one hand it would help relieve the request load on Twitter's API, but on the other hand I feel like this would further overcomplicate an already confusing and hard-to-understand bot and config (my bad on that front 😅).

What do you envision a config looking like with the timeline approach you suggest? Something along the lines of this?

pleroma_base_url: https://pleroma.instance
max_tweets: 40
timeline_user: TwitterUserFollowingAccounts
twitter_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
users:
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  use_timeline: true
- twitter_username: User2
  pleroma_username: MyPleromaUser2
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
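
For illustration, here is a sketch of how a config shaped like the one above could be split into users served by the shared timeline and users fetched individually. It assumes the YAML has already been loaded into a plain dict; `timeline_user` and `use_timeline` are the tentative key names from this proposal, and `partition_users` is a hypothetical helper, not part of the bot.

```python
def partition_users(config):
    """Return (shared, individual) lists of Twitter usernames.

    A user is served from the shared timeline only if a global
    timeline_user is set AND the user opts in with use_timeline.
    """
    shared, individual = [], []
    has_timeline = bool(config.get("timeline_user"))
    for user in config.get("users", []):
        if has_timeline and user.get("use_timeline", False):
            shared.append(user["twitter_username"])
        else:
            individual.append(user["twitter_username"])
    return shared, individual


config = {
    "timeline_user": "TwitterUserFollowingAccounts",
    "users": [
        {"twitter_username": "User1", "use_timeline": True},
        {"twitter_username": "User2"},
    ],
}
shared, individual = partition_users(config)
print(shared, individual)
```

Defaulting `use_timeline` to false keeps existing configs working unchanged, which seems to be the intent of making it a per-user opt-in.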

@us3r1d
Author

us3r1d commented Nov 26, 2022

Yeah, that's pretty much what I was thinking. "shared_timeline_user" and "use_shared_timeline" might be more clear?

@robertoszek
Owner

robertoszek commented Nov 26, 2022

Excellent. Yeah, the names are subject to change, just wanted to make sure I understood what you were going for.

On a related note, I've also been experimenting with Guest Tokens as another way of circumventing Twitter's API rate limits (and for people who don't want to apply for a dev account):
74def6f

If you have no twitter_token in your config or set the guest mapping to true (globally or per-user), it will generate a Guest Token for every user:

pleroma_base_url: https://pleroma.instance
twitter_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
users:
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  guest: true # <---

It's limited to 20 tweets (or I haven't figured out how to force it to paginate with the cursor yet).
But if you're looking to decrease latency between tweet and mirroring, maybe it's worth looking at, as I haven't run into rate limits no matter how many users I put in the config (as it generates a fresh token for each one).

You can try it for yourself by installing 1.1.1rc30:
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc30
Fair warning, it may be broken in a few different ways I haven't found yet.

@us3r1d
Author

us3r1d commented Nov 28, 2022

If that works, it should do the trick.

I converted my config to guest and ran with no rate-limiter hits, though I do seem to have gotten re-posts of recent tweets on some accounts. For example:

https://twitter.oksocial.net/loresjoberg
https://twitter.oksocial.net/HAL9000_

(In total, it looks like maybe 10 out of 46 accounts ended up with a re-post.)

@us3r1d
Author

us3r1d commented Nov 28, 2022

That did not go well. :-)

I'm running with a script that rebuilds the bot config files, runs the bot then sleeps 5 minutes; in that setup, it ran one pass successfully as guest, then all subsequent runs got this for all accounts:

Error log
ℹ 2022-11-28 10:00:35,631 - pleroma_bot - INFO - ====================================== 
INFO:pleroma_bot:======================================
ℹ 2022-11-28 10:00:35,631 - pleroma_bot - INFO - Processing user:	adamconover 
INFO:pleroma_bot:Processing user:	adamconover
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.twitter.com:443
DEBUG:urllib3.connectionpool:https://api.twitter.com:443 "POST /1.1/guest/activate.json HTTP/1.1" 429 69
✖ 2022-11-28 10:00:35,775 - pleroma_bot - ERROR - Exception occurred for user, skipping... (cli.py:700) 
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 539, in main
    user = User(user_item, config, base_path, posts_ids)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 205, in __init__
    guest_token, headers = self._get_guest_token_header()
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 1085, in _get_guest_token_header
    guest_token = json_resp['guest_token']
KeyError: 'guest_token'
ERROR:pleroma_bot:Exception occurred for user, skipping...
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 539, in main
    user = User(user_item, config, base_path, posts_ids)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 205, in __init__
    guest_token, headers = self._get_guest_token_header()
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 1085, in _get_guest_token_header
    guest_token = json_resp['guest_token']
KeyError: 'guest_token'

@robertoszek
Owner

Looks like it hit a 429 when requesting a guest token:
DEBUG:urllib3.connectionpool:https://api.twitter.com:443 "POST /1.1/guest/activate.json HTTP/1.1" 429 69
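
The `KeyError: 'guest_token'` in the traceback follows from indexing the JSON body of that 429 response as if the request had succeeded. A minimal sketch of a more defensive parse is below; `GuestTokenError` and `extract_guest_token` are hypothetical names, and the shape of the 429 body is an assumption (an empty dict is used here).

```python
class GuestTokenError(Exception):
    """Raised when a guest token could not be obtained."""


def extract_guest_token(status_code, json_resp):
    """Return the guest token, or raise a descriptive error instead of a KeyError."""
    if status_code == 429:
        raise GuestTokenError(
            "rate limited (HTTP 429) while requesting a guest token"
        )
    token = json_resp.get("guest_token")
    if token is None:
        raise GuestTokenError(
            "no guest_token in response (HTTP {})".format(status_code)
        )
    return token


print(extract_guest_token(200, {"guest_token": "abc123"}))
try:
    extract_guest_token(429, {})
except GuestTokenError as exc:
    print("failed:", exc)
```

Raising a dedicated exception lets the caller decide whether to skip the user, wait, or retry, rather than failing on a missing key.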

How many users would you say you run it with in the span of 15min? I may try to replicate it on my side too.

@us3r1d
Author

us3r1d commented Nov 28, 2022

There are 46 accounts on it at the moment, so that'd presumably be 92 to 138 attempts depending on how the timing goes?

@robertoszek
Owner

robertoszek commented Nov 28, 2022

After some testing, if I randomize the user agent slightly I'm getting 1000 requests for a new guest token before getting rate limited.

In addition to that, I've also added retrying with proxies once you've hit a 429. This really only helps when using guest tokens (with an app token your request count goes up no matter what the source IP happens to be):
7f062d7

They are configurable with the proxy_pool mapping, but if it's not present some free proxies will be used instead (and you can disable it completely by setting proxy to false):

proxy_pool:
- 128.199.221.6:443
- 164.62.72.90:80
- 178.128.121.196:443
pleroma_base_url: https://pleroma.instance
users:
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
- twitter_username: User2
  pleroma_username: MyPleromaUser2
  proxy: false
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
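
The retry behaviour described above can be sketched roughly as follows. This is not the bot's implementation: `fetch` is a hypothetical callable that performs the request through the given proxy (or directly when passed `None`) and returns a `(status_code, json_body)` tuple, so the sketch runs without any network access.

```python
def fetch_with_proxy_fallback(fetch, proxy_pool):
    """Try a direct request first; on HTTP 429, retry through each proxy in turn.

    Returns the first non-429 result, or the last 429 if every attempt
    was rate limited.
    """
    status, body = fetch(None)
    if status != 429:
        return status, body
    for proxy in proxy_pool:
        status, body = fetch(proxy)
        if status != 429:
            return status, body
    return status, body


def fake_fetch(proxy):
    # Stand-in for a real HTTP call: only the second proxy is not rate limited.
    if proxy == "164.62.72.90:80":
        return 200, {"guest_token": "abc123"}
    return 429, {}


pool = ["128.199.221.6:443", "164.62.72.90:80", "178.128.121.196:443"]
status, body = fetch_with_proxy_fallback(fake_fetch, pool)
print(status, body)
```

With a real HTTP client the proxy would be passed through that client's own proxy option; the control flow stays the same.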

Hopefully that will help alleviate your rate limit issue a bit. These changes are included in 1.1.1rc35:
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc35

@us3r1d
Author

us3r1d commented Nov 28, 2022

Thanks; I'll give that rc a try.

@robertoszek
Owner

Oh, and by the way, if I had to guess, the re-posts were probably due to some timestamps not being converted correctly to UTC.
So timezones were probably wrongly offsetting the start and now dates:
9957708

This change is included in 1.1.1rc37:
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple pleroma-bot==1.1.1rc37

@us3r1d
Author

us3r1d commented Nov 29, 2022

That didn't work; I left it running unattended for an hour and it doesn't seem to have hit the rate limiter but also didn't post anything. When I pulled guest:true out of the config it caught up with what it had missed.

This was with rc35, so I'll re-try in a bit with rc37.

Thanks.

@robertoszek
Owner

Timezones are fun. I had to force it to UTC; otherwise it would use the local timezone when parsing the start date into a UTC epoch timestamp:
62502f1
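
To illustrate the pitfall in general terms (this is a sketch, not the code from the commit): in Python, calling `.timestamp()` on a naive `datetime` interprets it in the machine's local timezone, so the resulting epoch value shifts with the host's settings. Attaching `timezone.utc` explicitly avoids that. The timestamp format and the function name are assumptions.

```python
from datetime import datetime, timezone


def to_utc_epoch(timestamp):
    """Parse a 'YYYY-MM-DDTHH:MM:SS' string as UTC and return epoch seconds."""
    naive = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
    # Without replace(tzinfo=...), timestamp() would assume local time.
    return int(naive.replace(tzinfo=timezone.utc).timestamp())


print(to_utc_epoch("2022-01-01T00:00:00"))  # 1640995200
```

On a host that is not set to UTC, the naive conversion would be off by the local UTC offset, which is consistent with both symptoms seen earlier in this thread (re-posted tweets and tweets not being picked up).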

If it still happens on 1.1.1rc38 let me know.

@us3r1d
Author

us3r1d commented Nov 29, 2022

rc38 is doing better; it doesn't seem to be missing tweets.

I see it doing the rollover to public proxies:

⚠ 2022-11-29 09:07:13,423 - pleroma_bot - WARNING - Rate limit exceeded when getting guest token. Retrying with a proxy. (_utils.py:1095)

That's a neat feature, but for my project I'm not happy about depending on someone else's proxy; I wouldn't want to cause anyone else trouble. That's my problem to deal with, though. :-)

This seems to be viable for running every 5 minutes at the moment.

I do think that batching tweets from a user timeline is a better strategy in the long run, but this fix is working for now.

Thanks.

@robertoszek
Owner

For sure, this was meant just as a stopgap for your use case, because batching and user timelines will take me a while to implement. (And I also happened to be investigating guest tokens anyway, for people who would rather not apply for a dev account.)

I still agree the timeline approach is something we want to pursue and would be a nice option when using the bot. I'll change the title of the issue to reflect that if you're ok with that.

Oh, just one last remark: if you happen to have access to or run private proxies, putting them into the proxy_pool mapping will force the bot to use only the ones listed there, if you don't want to rely on the public ones.

@robertoszek changed the title from "batch pulling from non-logged-in twitter accounts" to "allow users to gather tweets from a shared user timeline" Nov 29, 2022
@us3r1d
Author

us3r1d commented Nov 29, 2022

It did just crash out with this error:

Error log
✖ 2022-11-29 12:57:33,853 - pleroma_bot - ERROR - Exception occurred for user, skipping... (cli.py:707) 
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 125, in process_tweets
    _get_rt_media_url(self, tweet, media)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 264, in _get_rt_media_url
    tweet_rt = self._get_tweets("v2", tweet_id)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 411, in _get_tweets
    tweet_id=tweet_id, start_time=start_time, t_user=t_user, pbar=pbar
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 478, in _get_tweets_v2
    params=params
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 37, in twitter_api_request
    "Rate limit exceeded. 0 out of {} requests remaining until {}"
TypeError: 'list' object is not callable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 643, in main
    tweets, user, threads
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 121, in process_parallel
    p.imap_unordered(user.process_tweets, tweets_chunked)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
TypeError: 'list' object is not callable
ERROR:pleroma_bot:Exception occurred for user, skipping...
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 125, in process_tweets
    _get_rt_media_url(self, tweet, media)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 264, in _get_rt_media_url
    tweet_rt = self._get_tweets("v2", tweet_id)
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 411, in _get_tweets
    tweet_id=tweet_id, start_time=start_time, t_user=t_user, pbar=pbar
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 478, in _get_tweets_v2
    params=params
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 37, in twitter_api_request
    "Rate limit exceeded. 0 out of {} requests remaining until {}"
TypeError: 'list' object is not callable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 643, in main
    tweets, user, threads
  File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 121, in process_parallel
    p.imap_unordered(user.process_tweets, tweets_chunked)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
TypeError: 'list' object is not callable

(Occasional crashes don't bother me much, but I figured you'd like to know.)

@robertoszek
Owner

File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 37, in twitter_api_request
"Rate limit exceeded. 0 out of {} requests remaining until {}"
TypeError: 'list' object is not callable

Ah, of course, the requests using the guest tokens don't contain the same rate limiting headers as the proper API when hitting a 429 (for whatever reason).
I changed the structure around a bit to account for that:
db104cd
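
One way to tolerate missing rate-limit headers, sketched under assumptions rather than taken from the commit: read the headers with defaults and fall back to a generic message. `x-rate-limit-limit` and `x-rate-limit-reset` are the header names used by Twitter's standard API; `rate_limit_message` is a hypothetical helper, and the message text mirrors the one in the traceback.

```python
from datetime import datetime, timezone


def rate_limit_message(headers):
    """Build a rate-limit message that still works when headers are absent."""
    limit = headers.get("x-rate-limit-limit")
    reset = headers.get("x-rate-limit-reset")
    if limit is None or reset is None:
        # Guest-token responses may not carry these headers at all.
        return "Rate limit exceeded. No rate limit headers in the response."
    reset_at = datetime.fromtimestamp(int(reset), tz=timezone.utc)
    return "Rate limit exceeded. 0 out of {} requests remaining until {}".format(
        limit, reset_at.isoformat()
    )


print(rate_limit_message({}))
print(rate_limit_message(
    {"x-rate-limit-limit": "300", "x-rate-limit-reset": "1640995200"}
))
```

A plain dict is case-sensitive, so the keys here must be lowercase; HTTP client libraries typically expose headers through a case-insensitive mapping.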

It should be included in 1.1.1rc40.

2 participants