A scraping tool for MSE Daily Reports, which are uploaded to the MSE site. The tool walks through the daily reports (according to the configuration), downloads the PDFs, converts them to CSV, and finally uploads the data to a Postgres database. What you do with the data is up to you! Just make sure you have a VPN when scraping.
We use PDF Tables to convert the PDFs to CSV, but you may use your own converter, in which case some of the logic in the downloader will need to be tweaked.
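For illustration, here is a minimal Go sketch of one conversion call. It assumes the public PDF Tables HTTP API (a multipart POST of the PDF under field `f` to `https://pdftables.com/api`); the paths and the `convertPDF` helper are hypothetical and not taken from this repo.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"mime/multipart"
	"net/http"
	"os"
)

// convertPDF posts a local PDF to the PDF Tables API and writes the
// returned CSV to csvPath.
func convertPDF(apiKey, pdfPath, csvPath string) error {
	pdf, err := os.Open(pdfPath)
	if err != nil {
		return err
	}
	defer pdf.Close()

	// Build a multipart body with the PDF under field "f",
	// as shown in the PDF Tables API examples.
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("f", pdfPath)
	if err != nil {
		return err
	}
	if _, err := io.Copy(part, pdf); err != nil {
		return err
	}
	w.Close()

	url := fmt.Sprintf("https://pdftables.com/api?key=%s&format=csv", apiKey)
	resp, err := http.Post(url, w.FormDataContentType(), &body)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("pdftables: unexpected status %s", resp.Status)
	}

	out, err := os.Create(csvPath)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Hypothetical paths for a single file; the real tool iterates over
	// the configured PDF number range.
	err := convertPDF(os.Getenv("PDFTABLES_API_KEY"), "raw/0001.pdf", "csv/0001.csv")
	if err != nil {
		log.Fatal(err)
	}
}
```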
To run this project, you will need to add the following environment variables to your `.env` file:

- `MSE_URL`: the absolute URL where a PDF is to be found, without the PDF number
- `PDF_START_NO`: PDF start number
- `PDF_END_NO`: PDF end number
- `RAW_PDF_PATH`: relative project path where you want to save PDFs
- `RAW_CSV_PATH`: relative project path where you want to save uncleaned CSVs
- `ERROR_FILE_PATH`: relative project path where you want to save errors
- `CLEANED_CSV_PATH`: relative project path where you want to save cleaned CSVs
- `QUEUE_SIZE`: maximum queue size of the worker pool
- `WORKER_NUM`: number of workers in the pool
- `PDFTABLES_API_KEY`: PDF Tables API key
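For reference, a filled-in `.env` might look like the following. Every value below is a placeholder, not a real MSE URL or API key:

```
MSE_URL=https://example.com/daily-reports/report_
PDF_START_NO=1
PDF_END_NO=100
RAW_PDF_PATH=data/raw_pdf
RAW_CSV_PATH=data/raw_csv
ERROR_FILE_PATH=data/errors
CLEANED_CSV_PATH=data/cleaned_csv
QUEUE_SIZE=100
WORKER_NUM=4
PDFTABLES_API_KEY=your-api-key-here
```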
After cloning this repo, make sure you copy `example.env` to `.env` and replace all the values in there with sensible configurations.
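For example, from the repo root:

```sh
cp example.env .env
```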
You may then build the program and run it in each mode with the following commands:
```sh
go build -o scraper
./scraper -mode download
# wait for completion
./scraper -mode clean
# wait for completion
./scraper -mode save
# wait for completion
```
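For context, here is a minimal sketch of how a `-mode` flag like this is commonly dispatched with Go's standard `flag` package. The comments describe the three stages; the actual structure of this repo's `main` may differ.

```go
package main

import (
	"flag"
	"log"
)

func main() {
	mode := flag.String("mode", "", "one of: download, clean, save")
	flag.Parse()

	switch *mode {
	case "download":
		// fetch the PDFs and convert them to raw CSVs
	case "clean":
		// normalise the raw CSVs into cleaned CSVs
	case "save":
		// load the cleaned CSVs into Postgres
	default:
		log.Fatalf("unknown mode %q", *mode)
	}
}
```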
Any errors incurred will be both logged to the terminal and recorded in the error path you provide. You may handle the errors however you see fit, including manually converting and saving the affected files.