Lazybug is a free and open source project for learning Chinese from TV and movies. Massive comprehensible (and compelling) input is one of the best ways to learn a language. Beginner material is rarely massive or compelling, while TV is compelling and massive but not always comprehensible to beginner and intermediate learners. This project aims to make TV more comprehensible through interactive subtitles tailored to the individual learner, and to make all video available for study, no matter which website it's on.
The project consists of algorithms for extracting subtitles from video and processing the text, as well as a web app and browser extension for viewing and customizing them, and saving/exporting vocabulary. In the future, exercises and SRS (Spaced Repetition System) will be integrated.
The app (and browser extension) works more like a mobile app than a web app in that it stores all data locally in the browser and only syncs to the cloud on the user's request. The app is fully functional without a backend server, except for the cloud syncing functionality and the forum. In the future, the app will also be usable offline.
- Remain free and open source by minimizing hosting costs
Do as much processing as possible on the front-end rather than on servers. For example:
- Keep all logic client side (except account and social functionality), including ML inference (if possible)
- Store user data client side and sync it to cheaper object storage rather than using SQL database server
- Run heavy processing offline on local machines rather than in the cloud, such as extracting subtitles from video
- Single Language (Chinese)
Supporting multiple languages creates a "jack of all trades, master of none" situation. Chinese is vastly different from other languages and requires very specific solutions (e.g. for word segmentation and OCR) that in general cannot be generalized to other languages.
- Open Data and Format
The user should own their data, and it should be easy to export.
Work towards an open data format for Chinese learning activity (impressions and SRS) to enable interoperability between apps. Today, a major inefficiency for learners is that their data is split between apps. A unified format and place for this data would enable much more efficient scheduling for Spaced Repetition Systems, ranking of content, etc.
- Works Offline
The parts of the app that don't depend on a network (e.g. viewing videos) should work offline.
- Copyright
This project provides content based on fair use doctrine:
- "the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;" Interpretation: The full functionality of Lazybug is freely available and open source and could thus be seen as non-commercial educational in nature
- "the nature of the copyrighted work;" Interpretation: the content is merely a written down and augmented version of what the user hears and sees (through hard or soft-subs) on the page. Google already does this when translating whole websites for end users.
- "the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and" Interpretation: videos themselves are not hosted nor copied, only the subtitles
- "the effect of the use upon the potential market for or value of the copyrighted work." Interpretation: the project does not include subtitles for work that is illegally provided, e.g. only on legal sites like Youtube.com and Bilibili.com. The project does not interfere with ads or paid subscriptions. It also does not include subtitles where they could affect the creator financially, such as when PDF scripts are provided for Patreon supporters or available after paid subscription.
If you own the copyright to content hosted on lazybug.ai and it is being used inappropriately, please notify us at [email protected].
If you're interested in contributing to the project, here are some possible areas:
- Processing shows - Finding good shows
- Frontend (Vue.js / Quasar) - Design improvements, code cleanup and bug fixes very welcome
- Machine Learning (Python, PyTorch, Segmentation, OCR, NLP) - There are plenty of interesting ML projects and improvements for processing videos and subtitles
Some work items may be found in the Github issues of this project, but before starting on anything please reach out to [email protected] with details of your skillset and what you'd like to help out on.
If you have a Github account you can open an issue here, or you can create a new topic on the forum (requires Lazybug account) with the category "Site Feedback and Bugs".
NOTE: this document is currently incomplete; I'm adding to it as I revisit the different parts of the codebase.
This project consists of a number of different high-level parts:
- Algorithms to extract Chinese captions from video, and process them in various ways, like performing sentence segmentation and word disambiguation. These algorithms are mostly implemented in Python.
- A web app to watch videos, interact with and learn vocabulary, discuss and ask questions about content. The frontend is written in JS and Vue.js, the backend in Python.
- A browser extension for viewing videos that cannot be embedded into the web app. The codebase is shared with the JS web app.
The processing pipeline for a video can vary depending on:
- The platform: Youtube videos can be downloaded, but videos on other platforms cannot
- Videos may have Chinese hard subs or soft subs
- Videos may have English hard subs or soft subs, or no English subs
As of writing, the code only supports Youtube videos that can be downloaded, with or without Chinese/English subtitles, but implementation of a way to process any video accessible in the browser is under way.
The general process for a Youtube video (or whole show) with hard subs, i.e. using OCR, is roughly as follows:
- Using the browser extension dev-tool, create a caption bounding box to be used for processing, as well as a JSON file with metadata and the ids of all episodes
- Download the videos and any subtitles using the provided script
- Extract captions from all the videos using the provided script
- Process translations for all Chinese sentences (using deepl.com and the browser extension). This has to be done even if the show already has English subs.
- Using the Chinese and English subs, run a script to segment the Chinese sentences and determine individual word meanings
- Upload the final subtitle file to Backblaze
Each of these steps is detailed below.
Each show (or movie) is defined in a JSON file at data/git/shows/{show_id}.json that contains metadata and information about all seasons, episodes and caption bounding boxes.
To help produce the JSON file, you can use the dev-tool in the browser extension. Once the extension is installed, go to a Youtube video or playlist and press <ctrl>+<shift>+<alt>. This should open up a menu at the top of the screen.
In the dev-tool you can create bounding boxes around embedded hard captions. To do this, find a place in the video where a hard caption is visible and pause, click "Add Caption", then click and drag a box around the text. It's important that the top and bottom edges follow the top and bottom of the Chinese characters as closely as possible, see image below:
The left and right edges are optional and can be removed before processing (in which case the whole width of the video is processed), but setting them tighter makes processing faster.
The "Import Playlist" button uses the playlist query selector specified in the input box and populates a new season with new episodes and their ids.
The "Import Episode" creates a new episode and adds the current video id as its id.
- `name` - the name of the show; if a dict, then English/Hanzi/Pinyin variants are all specified
- `date_added` - used to sort new shows on the front end
- `type` - "tv" or "movie"
- `genres` - list of genre strings such as "drama", "sci-fi" etc
- `synopsis` - very short description
- `year` - year released
- `caption_source` - "soft" or "hard"
- `translation_source` - "human" or "machine"
- `douban` - Douban score
- `released` - if false, this show is a work in progress and will not be included when baking the final shows file
- `discourse_topic_id` and `discourse_topic_slug` - added by another processing step when creating Discourse topics for the show
- `ocr_params` - caption bounding box(es) applied to all seasons and episodes:
    - `type` - "hanzi", "english" or "pinyin"
    - `caption_top` - top y value (e.g. 0.9 = 90% down from the top)
    - `caption_bottom` - bottom y value
    - `caption_left` [optional] - left x value, can be set to speed up processing
    - `caption_right` [optional] - right x value
    - `caption_font_height` [optional] - in case the caption position is not static, set caption_top/bottom to cover the full area and set caption_font_height to the observed font height to frame height ratio (e.g. 3% -> 0.03), so that the image can be scaled correctly for OCR
    - `start_time` [optional] - time in seconds at which to start processing
    - `end_time` [optional] - time in seconds at which to end processing
    - `refine_bounding_rect` [optional] - if true, first does a refining pass to improve the caption bounding box (expensive)
    - `filter_out_too_many_low_prob_chars` [optional] - use if there are many noisy lines
    - `ocr_engine` [optional] - "cnocr" (default) or "easyocr"
    - `use_bert_prior` [optional] - if true, uses a Chinese BERT model as a word prior to correct errors
    - `extra_height_buffer_px` [optional] - useful if the caption moves around vertically by some small amount
    - `id` [optional] - an id for this specific caption (for dependent captions), e.g. "hanzi-top"/"hanzi-bottom"
    - `depends_on` [optional] - id of another caption this one depends on; if set, OCR is only done on this bounding box when the other caption is present
    - `action` [optional] - action to take when the `depends_on` caption occurs:
        - `type` - "append", "prepend" or "assign"
        - `join` - string used to join the dependee and dependent if type is "append" or "prepend"
- `seasons` - a list of seasons; a movie has just one virtual season:
    - `playlist_id` [optional] - set if this season has a playlist (e.g. on Youtube)
    - `ocr_params` - same as above; overwrites values for all videos below it in the hierarchy
    - `episodes` - list of episodes:
        - `id` - the caption id, e.g. "youtube-RRczNO40Zww"
        - `ocr_params` [optional] - same as above; overwrites values for this video if set above
        - `timings` [optional] - synced timings between the recorded video and the actual video (only for manually recorded videos)
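To make the structure concrete, here is a minimal, hypothetical example written out from Python. All ids and values are made up, and details such as the exact keys of the `name` dict or whether `ocr_params` is a list are assumptions based on the descriptions above, not copied from a real show file:

```python
import json

# Hypothetical show definition mirroring data/git/shows/{show_id}.json.
# Field names follow the descriptions above; the nested keys of "name"
# and the list shape of "ocr_params" are assumptions for illustration.
example_show = {
    "name": {"english": "Example Show", "hanzi": "示例剧", "pinyin": "Shili Ju"},
    "date_added": "2023-01-01",
    "type": "tv",
    "genres": ["drama"],
    "synopsis": "A made-up show used only to illustrate the file format.",
    "year": 2020,
    "caption_source": "hard",
    "translation_source": "machine",
    "douban": 8.0,
    "released": False,  # still a work in progress, excluded from the baked shows file
    "ocr_params": [
        {"type": "hanzi", "caption_top": 0.88, "caption_bottom": 0.95},
    ],
    "seasons": [
        {
            "playlist_id": "PLxxxxxxxxxxxx",  # placeholder Youtube playlist id
            "episodes": [
                {"id": "youtube-RRczNO40Zww"},
            ],
        },
    ],
}

print(json.dumps(example_show, ensure_ascii=False, indent=2))
```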
Once you've specified a show on Youtube in the above JSON format, you can use this command to download all the episodes and their soft-subs (if they exist):
make download-yt show=$SHOW_ID out=../videos
This assumes there's a folder one level up in the file tree to put the videos in. This command and further processing require Python 3.9 and the packages listed in requirements.txt; you can install them with pip install -r requirements.txt.
The first step in the processing is extracting the hard captions from the video using these commands:
make pull-models
make process-video-captions show=$SHOW_ID videos=../videos
The first command downloads the pre-trained segmentation model from storage.
After this you need to translate all the Chinese captions to English. This is needed whether or not the video has English soft subs, because the translations are used for matching word translations, and soft subs may differ significantly from the Chinese captions in content. We use deepl.com to translate sentences as it's currently the highest quality translation service. Instead of copy-pasting the script for each show into the web UI, you can use the browser extension to automate this. For this you need to build and install the extension locally (not from the Chrome store). Once you've installed the extension, run this command to push episode scripts to it:
make process-translations show=$SHOW_ID
Note that while deepl.com is free to use, it has usage limits which make it difficult to use for this purpose unless you pay for a subscription. If you plan to process whole shows, be sure to support deepl.com by buying a subscription.
Next it's time to run the segmentation and alignment processing. This segments the Chinese sentences into separate words and attempts to find in-context translations for them given the full sentence translation:
make process-segmentation-alignments show=$SHOW_ID
This command pushes another script to be translated by deepl.com specifically for word translation disambiguation. Sentences are segmented and fed in like this:
怎么就这么走了
1. 怎么 2. 就 3. 这么 4. 走
This tricks deepl into translating each word in-context.
Finally, words for which no in-context translation has been found are passed through deepl individually.
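A rough sketch of how such a prompt can be built and read back (illustrative only; the function names and the exact parsing are not taken from the actual processing scripts):

```python
import re

def make_word_prompt(sentence: str, words: list[str]) -> str:
    """Build the prompt sent to deepl: the full sentence on the first line,
    followed by the segmented words as a numbered list."""
    numbered = " ".join(f"{i}. {w}" for i, w in enumerate(words, start=1))
    return f"{sentence}\n{numbered}"

def parse_word_translations(translated: str) -> list[str]:
    """Split the translated numbered line back into per-word translations,
    assuming deepl preserves the '1. ... 2. ...' numbering."""
    numbered_line = translated.splitlines()[1]
    parts = re.split(r"\s*\d+\.\s*", numbered_line)
    return [p.strip() for p in parts if p.strip()]

print(make_word_prompt("怎么就这么走了", ["怎么", "就", "这么", "走"]))
# 怎么就这么走了
# 1. 怎么 2. 就 3. 这么 4. 走
```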
Finished subtitles are saved in data/remote/public/subtitles/
NOTE: For now this step can only be taken by an approved developer since you need API keys to upload.
First run
make pre-public-sync
This creates:
- A file containing show metadata for all shows that is used by the frontend
- A video list containing all the video ids that have been processed. The extension checks against this list to determine if a video has processed subtitles
- A cedict dictionary file for frontend consumption
- Versioned files for various resources such as HSK words
Then sync the public files to Backblaze:
make push-public
The Python backend is meant to be simple and do mainly lightweight work in order to keep costs down. It's used for a few things:
- Serve the static frontend files (js, html and images)
- User accounts and authentication
- Managing the syncing of personal data to Backblaze B2
- Discourse Single Sign On
The server can be run with the command make run-server
for local development, or make run-server-prod
on the production server.
On prod, the server can be conveniently killed with make kill-server.
The server needs a certificate at data/local/ssl_keys/{privkey,fullchain}.pem. Create one in Cloudflare and copy the files here.
To release new frontend changes do:
ssh root@$LAZYBUG_SERVER_IP
cd lazybug && make update-server
# this kills the server, pulls changes, builds the frontend and runs the server again

The previous command purges the Cloudflare cache, but this sometimes doesn't work; in that case, go to Cloudflare and do "Purge Cache -> Purge Everything".
These environment variables need to be set for the server if LOCAL is not set:
- DISCOURSE_SECRET: for Discourse SSO. The secret is created in the Discourse UI when setting up DiscourseConnect
- B2_ENDPOINT, B2_APPLICATION_KEY_ID, B2_APPLICATION_KEY: keys for integrating with Backblaze B2, for creating upload and download links for personal data
After having built the frontend with make frontend-clean
or make local
, you can install the extension in Chrome by going to "Manage Extensions", enabling "Developer mode" and then clicking "Load unpacked", pointing it to the {lazybug_dir}/frontend/dist
folder. Note that every time you make a change and rebuild the frontend (including the extension), you have to reload the extension in Chrome. The easiest way is to right-click the extension icon -> Manage Extension -> click the Update button.
In order to use the local backend server with the browser extension, you need a self-signed certificate for SSL. The reason for this is that in the browser extension we inject an iframe pointing at https://localhost/static/iframe.html
. We then communicate with this iframe using message passing in order to load and save data. This way we don't have a synchronization problem between the extension and the website when the user uses both.
To generate a local certificate authority and the certificate and keys, run:
make local-ssl-cert
Any passphrase such as "1234" is fine, since it's for local use only.
After this command you need to import the CA.pem in the browser you use. In Chrome it's under "Settings"->"Security And Privacy"->"Security"->"Manage Certificates"->"Authorities"->"Import", then select lazybug/data/local/ssl_cert/CA.pem.
To run a local only version of the frontend and server, run:
# this, like `make frontend-clean` builds the browser extension and web frontend,
# but with URLs pointing to localhost and other online features turned off
make local
# This runs the server for local development, in SSL mode
make run-server-local
All data (besides Discourse and the backend user auth data), including things like captions and user databases, is stored as plain files in Backblaze B2 (S3 compatible) and served through the Cloudflare CDN. Transferring data from Backblaze to Cloudflare is free due to an agreement between them, and the Cloudflare CDN is free, which makes this a very cost-efficient combination. Backblaze storage costs half as much as Amazon S3, download fees are a tenth of S3's (outside of Cloudflare), and there are no upload fees.
Captions, show data, dictionaries and other public data are stored in a public bucket accessible at https://cdn.lazybug.ai/file/lazybug-public/. Personal user data, such as the interaction log and user settings, is stored in a private bucket at https://cdn.lazybug.ai/file/lazybug-accounts/. Other data, which we don't want or need to serve to users, is stored in another private bucket.
The accounts bucket, being private, cannot be served through the CDN, so we have to pay data fees when users download their personal database (uploads are free). Downloads cost $0.01/GB and should only happen when the user switches clients or logs out and back in. 10k downloads per day at 5 MB each is roughly 50 GB/day, or 1,500 GB/month, which amounts to about $15/month. By the time this becomes expensive we can switch to something more fine-grained than syncing the entire database every time.
For captions and other data, we want to keep old versions of the same file in case it's updated but references to it remain elsewhere in user data. We also want to reduce the number of requests and the amount of data we pull from Cloudflare (and Backblaze), even if it's free (minimizing waste is the decent thing to do). Therefore we make files immutable, so that Cloudflare doesn't have to keep fetching from Backblaze to check whether the data has changed:
- We store a `{filename}.hash` file with the SHA256 content hash of the file, and the contents at `{filename}-{hash}.{ext}`
- When the file changes, we save the new version in a new file with the hash in the filename and update the .hash file to point to it
- We then upload the new files with the `make push-public` command
- We then use the `purge_cache` Cloudflare API to purge the .hash files from the cache, so that clients will download the new one. This can be done for all files in `data/remote/public` by calling the `make purge-cloudflare-public` command
We also cache files in the IndexedDB for lazybug.ai in order to minimize unnecessary fetching from Cloudflare, as the browser cache can be unreliable.
User data is stored in the browser IndexedDB of lazybug.ai and is synced to the private lazybug-accounts bucket on Backblaze via the Python backend. This is done by exporting the database file and calling the /api/signed-upload-link/{size} endpoint, which in turn authenticates the user and calls the B2 API to get a signed upload link for that user's database file. When the frontend receives this URL, which is valid for a limited time, it simply uploads the file directly to it. This avoids having to send the data to our backend first, which saves on bandwidth. The same goes for downloads, but using the api/signed-download-link endpoint instead.
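Sketched in Python for clarity (the real client is the JS frontend; the HTTP method and the response field name are assumptions):

```python
import requests

API = "https://lazybug.ai/api"

def sync_up(db_bytes: bytes, cookies: dict) -> None:
    """Upload the exported database straight to B2 via a signed URL."""
    # 1. Ask the backend for a signed upload link sized for this export
    resp = requests.get(f"{API}/signed-upload-link/{len(db_bytes)}", cookies=cookies)
    upload_url = resp.json()["url"]  # response field name assumed
    # 2. Upload directly to Backblaze B2, bypassing the Lazybug backend
    requests.put(upload_url, data=db_bytes)

def sync_down(cookies: dict) -> bytes:
    """Fetch the user's database via a signed download link."""
    resp = requests.get(f"{API}/signed-download-link", cookies=cookies)
    return requests.get(resp.json()["url"]).content
```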
Cache header for lazybug-public: {"cache-control":"max-age=31536000"}
Update CORS rules for the lazybug-accounts bucket to allow signed upload/download URLs:
b2 update-bucket --corsRules '[{"corsRuleName":"uploadDownloadFromAnyOrigin", "allowedOrigins": ["*"], "allowedHeaders": ["*"], "allowedOperations": ["s3_delete", "s3_get", "s3_head", "s3_post", "s3_put"], "maxAgeSeconds": 3600}]' lazybug-accounts allPublic
- SSL/TLS mode should be set to Full (strict)
- SSL/TLS -> Origin Server: The Python backend needs a certificate for serving HTTPS, which can be created/downloaded here
- SSL/TLS -> Edge Certificates -> Always Use HTTPS (ON)
- Rules -> Page Rules:
    - `discourse.lazybug.ai/*` | Disable Performance (https://meta.discourse.org/t/how-do-you-setup-cloudflare/32258/22?page=2)
    - `lazybug.ai/static/*` | Browser Cache TTL: 2 minutes, Cache Level: Cache Everything
    - `cdn.lazybug.ai/file/lazybug-public/*` | Cache Level: Cache Everything
Lazybug integrates comments and discussion with a separate Discourse instance.
Guide how to set up Discourse on a DigitalOcean droplet
We use the accounts on the main site (lazybug.ai) as an SSO provider for Discourse. Here is the authentication flow:
- User enters `lazybug.ai` and logs in. When logged in, a JWT is stored in a cookie on this domain
- We embed a hidden iframe pointing to `discourse.lazybug.ai`
- When the iframe loads, the Discourse server makes a request back to `lazybug.ai/api/discourse/sso` with an sso and sig token as query parameters (see the `discourse_sso` endpoint)
- The Lazybug server validates the sso and sig using the key in the `DISCOURSE_SECRET` env variable (created by the admin on Discourse)
- The Lazybug server uses the JWT stored in the cookie to extract the user id and email, and returns this info to Discourse
- Discourse sets session tokens in the cookies for discourse.lazybug.ai
- The user is now logged in using their email on Discourse
- When the user logs out from Lazybug, we call the Python server endpoint at `/api/discourse/logout`, which calls the Discourse API to log the user out. This can't be done from the frontend because it requires a CSRF token.
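The sso/sig handling in the middle steps follows the standard DiscourseConnect scheme; here is a minimal sketch of what the validation and signed response look like (not the actual endpoint code, and the user identity is passed in directly rather than extracted from the JWT):

```python
import base64
import hashlib
import hmac
import os
from urllib.parse import parse_qs, urlencode

DISCOURSE_SECRET = os.environ["DISCOURSE_SECRET"]

def handle_discourse_sso(sso: str, sig: str, user_id: str, email: str) -> str:
    """Validate the payload from Discourse and build the signed return URL."""
    # 1. Verify the signature: sig is the hex HMAC-SHA256 of the sso payload
    expected = hmac.new(DISCOURSE_SECRET.encode(), sso.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        raise ValueError("Invalid SSO signature")
    # 2. Decode the payload to recover the nonce Discourse sent
    nonce = parse_qs(base64.b64decode(sso).decode())["nonce"][0]
    # 3. Build and sign the return payload with the user's identity
    payload = base64.b64encode(
        urlencode({"nonce": nonce, "external_id": user_id, "email": email}).encode()
    )
    return_sig = hmac.new(DISCOURSE_SECRET.encode(), payload, hashlib.sha256).hexdigest()
    return (
        "https://discourse.lazybug.ai/session/sso_login?"
        + urlencode({"sso": payload.decode(), "sig": return_sig})
    )
```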
See this article for enabling SSO on Discourse.
When a user wants to ask a question about a specific caption, we link to the topic with a special query parameter "reply_quote", for example https://discourse.lazybug.ai/t/lud-ngji-the-deer-and-the-cauldron/71?reply_quote=%3E+hello+world. This prefills a reply with some text, where we can put the context of the specific caption (a link back to the video, the Chinese, etc.), so that it's easier to ask a question about it. This context is generated on the lazybug.ai frontend and supplied through the query parameter.
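Building such a link is just a matter of URL-encoding a quoted context string; for example (the exact context text generated by the frontend is not reproduced here):

```python
from urllib.parse import quote_plus

def caption_question_link(topic_slug: str, topic_id: int, context: str) -> str:
    """Link to a show's Discourse topic with the reply prefilled with a quote."""
    quote_text = "> " + context  # "> ..." renders as a quote block in Discourse
    return (
        f"https://discourse.lazybug.ai/t/{topic_slug}/{topic_id}"
        f"?reply_quote={quote_plus(quote_text)}"
    )

print(caption_question_link("lud-ngji-the-deer-and-the-cauldron", 71, "hello world"))
# https://discourse.lazybug.ai/t/lud-ngji-the-deer-and-the-cauldron/71?reply_quote=%3E+hello+world
```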
This ability to prefill a reply doesn't come with Discourse by default; it is implemented using a simple plugin, which the admin installs on the Discourse server.
In order for users to be able to comment on a specific video and show, one topic has to be generated for each show. This is done in make_discourse_topics.py
by going through all the shows in data/git/shows/*.json
, checking whether a topic for this show id exists and, if not, creating it through the Discourse API and saving the topic slug and id in the show json. Note that the DISCOURSE_API_KEY environment variable has to be set. This script also saves the internal Discourse topic id in the show json, so that the frontend can link to it.
Note that make_discourse_topics.py
is run as part of the show-list-full
make command, which bakes all the individual show files into one, creates bloom filters for the vocabulary etc.
Comments on a show topic are accessed directly from the frontend via the API at discourse.lazybug.ai/t/{topic_id}.json. Linking to a topic or a comment requires the topic slug as well as the id; the URL template is discourse.lazybug.ai/t/${topic_slug}/${topic_id}/${post_number}. To enable this, https://lazybug.ai has to be added to the acceptable CORS origins in the Discourse Admin UI. The value for "same site cookies" under Security also needs to be set to "None".
There are two docker images: one contains everything needed to build the frontend and run the server (martindbp/lazybug-server), and one is for other processing, such as extracting captions from videos (martindbp/lazybug-processing). To download one of the images, run docker pull martindbp/lazybug-{server,processing}, or run make docker-pull to download both.
To run a make command (e.g. frontend
) inside a lazybug container, run make docker-server COMMAND=frontend
. This maps a volume from the current directory to the container so that it can access anything in the repository.