# Quality Filtering

  1. Load the raw web scrapes in /data with collate_pdfs.py to collect all valid PDF journals into indo_journals.

  2. sample_and_split.py splits indo_journals into subsets of 10K PDFs, which should average roughly 2B LLaMA tokens per subset (based on an estimate); see the splitting sketch after this list.

  3. filter_pdfs.py builds the initial raw DataFrame and filters it with an Anthropic prompt, because some journals are actually in English and therefore not useful for Indonesian linguistic adaptation; powered by Spark.
     a. The initial OCR uses pypdf because it runs on CPU compute, which is sufficient for the LLM to do this first pass of filtering.

  4. Use Mathpix to OCR the PDFs into Markdown with structured tags for regex processing.

  5. Write the regexes and remaining data processing to extract continuous, dominantly Indonesian text for model training.
     a. Some contamination from other languages such as English will remain, which is normal.

  6. Load everything into a DataFrame.

  7. Run the filtering rules with .apply(); see the filtering sketch after this list.
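
A minimal sketch of the sample_and_split.py step, assuming indo_journals is a flat directory of PDFs; the manifest-file output is illustrative, not the script's actual behavior:

```python
from pathlib import Path

# Split the collected PDFs into subsets of 10K files each (roughly 2B LLaMA
# tokens per subset, by the estimate above). Paths and output format are
# assumptions for illustration.
CHUNK_SIZE = 10_000

pdfs = sorted(Path("indo_journals").glob("*.pdf"))
subsets = [pdfs[i:i + CHUNK_SIZE] for i in range(0, len(pdfs), CHUNK_SIZE)]

for idx, subset in enumerate(subsets):
    manifest = Path(f"subset_{idx:03d}.txt")                # one PDF path per line
    manifest.write_text("\n".join(str(p) for p in subset))
```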

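A minimal sketch of steps 6-7, assuming the cleaned text from step 5 has been written to a JSONL file with "path" and "text" fields; the filename, stopword list, and thresholds are illustrative rather than the pipeline's actual values:

```python
import pandas as pd

# Common Indonesian function words used as a crude language signal.
ID_STOPWORDS = {"dan", "yang", "di", "dengan", "untuk", "pada", "dari", "ini", "itu", "adalah"}

def stopword_ratio(text: str) -> float:
    """Fraction of whitespace tokens that are common Indonesian function words."""
    tokens = text.lower().split()
    return sum(t in ID_STOPWORDS for t in tokens) / len(tokens) if tokens else 0.0

def keep(row: pd.Series) -> bool:
    """Rule-based keep/drop decision for one document."""
    text = row["text"]
    return (
        len(text) > 2_000                                             # drop very short extractions
        and stopword_ratio(text) > 0.05                               # crude "dominantly Indonesian" check
        and sum(c.isalpha() for c in text) / max(len(text), 1) > 0.6  # drop symbol-heavy OCR noise
    )

df = pd.read_json("cleaned_text.jsonl", lines=True)  # step 6: load everything into a DataFrame
df = df[df.apply(keep, axis=1)]                      # step 7: run the filtering rules with .apply()
```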
## Naive filtering

  1. Filter for PDFs longer than 2 pages; see the page-count sketch after this list.
  2. Throw away everything except continuous text.
     a. Remove titles and newline characters.
     b. Use an LLM to write filtering logic based on an analysis of the Mathpix format (authors, tables, page numbers, exhibits, diagrams, special symbols).
     c. Use Gemini for language identification (LID) and throw away anything containing English.

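A minimal sketch of the page-count filter above, using pypdf since it only needs CPU compute; the directory name and threshold come from the notes, while the error handling is an assumption (scraped PDFs are often corrupt):

```python
from pathlib import Path

from pypdf import PdfReader

def page_count(path: Path) -> int:
    """Return the number of pages, treating unreadable PDFs as empty."""
    try:
        return len(PdfReader(path).pages)
    except Exception:
        return 0

pdf_paths = sorted(Path("indo_journals").glob("*.pdf"))
kept = [p for p in pdf_paths if page_count(p) > 2]  # keep PDFs longer than 2 pages
```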
## Pre-OCR: quality filtering and initial dedup (web-page-style sources can be explored later)

  1. Write some rule-based filters (and perhaps an LLM pass) to get unique document names for dedup.
  2. Filter for PDFs with more than 5 pages.
  3. Use a VLM to do language detection on the first 10K characters; see the language-detection sketch after this list.
  4. Possibly add additive quality filtering for scoring.

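A minimal sketch of the language check on the first 10K characters. It uses Gemini via the google-generativeai package, as in the naive-filtering notes; the model name, prompt wording, and the choice to pass plain extracted text (rather than page images to a VLM) are all assumptions:

```python
import google.generativeai as genai  # assumes the google-generativeai package
from pypdf import PdfReader

genai.configure(api_key="...")  # read from the environment in practice
model = genai.GenerativeModel("gemini-1.5-flash")

def looks_indonesian(pdf_path: str) -> bool:
    """Cheaply extract text with pypdf, truncate to 10K chars, and ask Gemini for the language."""
    text = ""
    for page in PdfReader(pdf_path).pages:
        text += page.extract_text() or ""
        if len(text) >= 10_000:
            break
    prompt = (
        "Answer with one word: what is the dominant language of this text?\n\n"
        + text[:10_000]
    )
    reply = model.generate_content(prompt).text.strip().lower()
    return "indonesian" in reply or "bahasa" in reply
```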
## Post-OCR: which text is meaningful to keep and actually improves the model?

  1. Strip Markdown formatting (### headings and similar); see the cleanup sketch after this list.
  2. Check how tables are handled: are they just removed completely?
  3. Decide whether math formulas and special symbols should be removed.
  4. Remove titles.
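
A minimal sketch of one answer to the questions above, stripping Markdown structure from Mathpix output wholesale; whether tables and formulas should really be dropped (rather than converted) is still the open question, and the regexes are illustrative:

```python
import re

def strip_markdown(md: str) -> str:
    """Reduce Mathpix Markdown to continuous plain text."""
    text = re.sub(r"^#{1,6}\s.*$", "", md, flags=re.MULTILINE)    # headings / titles
    text = re.sub(r"^\|.*\|\s*$", "", text, flags=re.MULTILINE)   # pipe-table rows
    text = re.sub(r"\$\$.*?\$\$", "", text, flags=re.DOTALL)      # display math
    text = re.sub(r"\$[^$\n]+\$", "", text)                       # inline math
    text = re.sub(r"[^\w\s.,;:()\-\"']", " ", text)               # stray special symbols
    text = re.sub(r"\n{3,}", "\n\n", text)                        # collapse blank runs
    return text.strip()
```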