A collection of small scripts to mass-process audio files from the Local Audio Server for Yomichan Anki Add-on.
Goals of this repo:
- Normalize the audio, in order to standardize the volume across the entire collection.
- Remove most silence from the front and back of audio.
- Convert all audio to opus / mp3 (opus for more efficient storage of files, mp3 for better compatibility with AnkiMobile).
- Remove broken files:
- Remove JPod audio that has different readings mapping to the same audio file.
- Remove a few audio files that simply didn't have the correct word audio in the first place.
- Remove duplicate files:
- A lot of the JPod audio were simply duplicate files. With
jpod_index.py
, we simply removed these duplicate files and provided a more efficient way of mapping the data to the files.
- A lot of the JPod audio were simply duplicate files. With
- Create
jmdict_forms.json
for post processing work after creating the entire database. This json file allows us to use the JMdict word variants data to fill in missing gaps with words that had the same reading and were simply spelt differently.
NON-Goals:
- Reduce the quality of the audio for the sake of more efficient storage
(repo-root)
L input
L forvo_files
L jpod_files
L jpod_alt_files
L nhk16_files
L shinmeikai8_files
L output
L opus
L user_files
L ...
L mp3
L user_files
L ...
L ...
This requires a Unix system to run. On Windows, you can use WSL.
Dependencies:
- ffmpeg >= 6.0 (one that can decode aac/mp3, and can encode mp3/opus)
- python (3.11+)
You'll also need the unprocessed audio files to run this script on.
Download the files from this torrent. Alternatively, use the magnet link below:
Magnet link
magnet:?xt=urn:btih:e80dc4b9dd837257a181dd378325e6b433f55071&dn=local-yomichan-audio-collection-2023-06-11-original.tar.xz&tr=http%3a%2f%2fanidex.moe%3a6969%2fannounce&tr=http%3a%2f%2fnyaa.tracker.wf%3a7777%2fannounce&tr=udp%3a%2f%2fexodus.desync.com%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.opentrackr.org%3a1337%2fannounce&tr=udp%3a%2f%2fopen.stealth.si%3a80%2fannounce&tr=udp%3a%2f%2ftracker.tiny-vps.com%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.moeking.me%3a6969%2fannounce&tr=udp%3a%2f%2fopentracker.i2p.rocks%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.torrent.eu.org%3a451%2fannounce&tr=udp%3a%2f%2fexplodie.org%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.zerobytes.xyz%3a1337%2fannounce
- Download the original audio files.
- Run the following command:
./build-collection.sh
- (done) run ffmpegmulti on all forvo audio and shinmeikai8 audio
- (done) remove exact duplicates within the
jpod
andjpod_alternate
sources, unify it as one, and build an index (done viajpod_index.py
) - (done?) make all scripts use the unified file directory structure
- (done) run
ffmpegmulti --no-silence-remove
onjpod
andjpod_alternate
- (done) convert
nhk16_files
to opus and to mp3. Decide if they need normalization and/or silence removal (norm + silence it is) - (done) remove broken files:
- (done) skent/解く - just rm from the build script
- (done) broken jpod files (https://discord.com/channels/617136488840429598/1074057444365443205/1113679859609260062)
- (done) Filter these out in the
jpod_index
script
- (done) Filter these out in the
- (done) use jmdict word alternatives to map audio to more words (i.e. all of 手すり・手摺り・手摺 should have the same audio)
- the alternatives data should be available to the add-on, not to this repo (as it would be part of creating the main database?)
- see
yomichan_import
/ JMdict forms dictionary for reference on parsing the original xml
- (done) main build script (
build_collection.sh
) - (done) rename repo
- (done) investigate how much
opus
actually reduces file size for all sources! SMK8 might not have much of an effectfor shinmeikai, i initially converted all the files to opus, but decided to keep them in the original format after measuring file sizes and noticing no improvement in size.
- It should suffice to just compare the mp3 and opus outputs when it's all done
- (done) get more data for normalization (sentence audio data, compare with output)
None of this would have even existed if it wasn't for the ideas and work from these fantastic people:
- @Mansive:
- Original author of
ffmpegmulti.py
- Jump-started this project entirely by noting that Forvo audio can be improved via mass processed
- Original author of
- @tsweet64:
- Added
opus
andmp3
support - Created the general build script
- Pointed out that most JPod files were duplicates
- Added
There's a lot more stuff that these two did that isn't mentioned here, either because I forgot or it would make the list a bit too long. Thanks again!