Bug/security fixes
- Django upgraded to 3.2.18 (supported until 2024)
Support for Twitter API v.2
See sfm-twitter-harvester
- Added support for v.2 API credentials, including the bearer token (recommended) and the combination of consumer key/secret and access token/secret
- Added support (with twarc2) for harvesting and exporting from v.2 endpoints
- Due to changes in the Twitter API access model, only the v.2 search_recent and user_timeline endpoints (accessible on the new Basic Access tier) are available in production. A new environment variable, TWITTER_COLLECTION_TYPES, specifies which of the supported Twitter API endpoints are available in the app.
- Twitter v.1.1 endpoints have been disabled, but collections previously created via these endpoints are still available for export.
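As a rough illustration of how an app might read TWITTER_COLLECTION_TYPES, here is a minimal sketch. It assumes the variable holds a comma-separated list; the function name enabled_collection_types, the default values, and the value names are hypothetical and may not match what SFM actually uses, so check the sfm-twitter-harvester documentation for the real format.

```python
import os

# Hypothetical default: the two v.2 endpoints named in the notes above.
# The real default and the real value names used by SFM may differ.
DEFAULT_TYPES = ("search_recent", "user_timeline")


def enabled_collection_types(environ=None):
    """Return the collection types listed in TWITTER_COLLECTION_TYPES.

    Assumes a comma-separated value. Falls back to DEFAULT_TYPES when the
    variable is unset or empty.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("TWITTER_COLLECTION_TYPES", "")
    # Drop surrounding whitespace and ignore empty entries such as "a,,b".
    types = [item.strip() for item in raw.split(",") if item.strip()]
    return types or list(DEFAULT_TYPES)
```

Passing a plain dict instead of os.environ makes the parsing easy to check without touching the real environment.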
Outstanding issues
Streaming API
- Streaming rules are handled as seeds; because the Streaming API supports multiple rules per request, an SFM stream collection can have multiple seeds. However, the functionality to limit exports to a subset of active/deleted seeds does not work for these collections. (The logic in SFM for seed-based export applies only to user-timeline collections.)
- During testing, a long-running stream harvest encountered a "Read timed out" error from the Twitter API. No further Tweets could be collected until the harvest was voided in the UI and restarted. We consulted with the twarc developers; the cause of the error remains unclear, but it may be related to the following:
- Streaming harvests involve a periodic restart of the twarc.stream() process (every 30 minutes). This logic is designed to prevent excessively large WARC files (since a new WARC is created only at the start of the twarc.stream() process).
- The twarc developers posit that this regular interruption of the twarc stream could cause problems, since the stream is designed to run continuously. The v.2 API is apparently less responsive than the v.1 API, so it may return a timeout error if the previous connection hasn't fully closed by the time twarc tries to open a new one.
- If that is the problem – and it's hard to know for sure – then introducing a sleep before restarting could be effective; however, that could result in missed Tweets (a risk already posed by restarting the stream every 30 minutes).
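The restart-with-pause idea described above could be sketched roughly as follows. This is not the SFM or twarc implementation: run_stream is a hypothetical callable standing in for whatever opens the stream, collects for one segment, and closes the connection, and the pause_seconds value is an arbitrary assumption.

```python
import time


def run_with_restarts(run_stream, segment_seconds=30 * 60, pause_seconds=5,
                      max_segments=None, sleep=time.sleep):
    """Call run_stream(segment_seconds) repeatedly, pausing between calls.

    run_stream is a hypothetical callable that opens the stream, collects for
    up to segment_seconds, closes the connection, and then returns.
    Returns the number of segments that were run.
    """
    segments = 0
    while max_segments is None or segments < max_segments:
        run_stream(segment_seconds)
        segments += 1
        if max_segments is not None and segments >= max_segments:
            break
        # Give the previous connection time to close before reconnecting.
        # Tweets posted during this pause are not collected.
        sleep(pause_seconds)
    return segments
```

The sleep function is passed in as a parameter so the loop can be exercised without real delays. A longer pause lowers the chance of reconnecting too early but widens the window in which Tweets are missed.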
Processing container
- The processing container needs to be upgraded. The image fails to build because of dependency conflicts with the new versions of certain libraries in sfm-utils. We did not take on this work in this release because it will probably also involve upgrading the Python and Ubuntu versions used in the image. Since the processing container doesn't interact directly with other components, the 2.5.0 image should be fine to use for now with legacy collections, etc. To use it with collections harvested from the v.2 API, however, an upgrade will be necessary.