An awesome (not really) tool that scrapes Marvel's API to get the leaks for WandaVision (this is a joke).
- Implement the APIs as simply as possible, with minimal setup needed.
- Pass the exam
- Clone the repo: `git clone https://github.com/Fatal-Errol/simple-marvel-web-scraper.git`
- Go to the project root: `cd simple-marvel-web-scraper`
- Create your .env file using the contents of .env.sample as a template: `cp .env.sample .env`
- Add your Marvel API credentials; you can leave the rest as is.
  - MARVEL_API_PUBLIC_KEY
  - MARVEL_API_PRIVATE_KEY
- Run `yarn install`
- To run the tests, use `yarn test`
- Run `yarn start` to start the app
- Open http://localhost:8080
- Setting NODE_ENV to production will hide the stack trace value when an error occurs.
- You can tweak some of the in-memory cache configuration in src/config/cache.js
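The two keys added to .env are what Marvel's server-side authentication needs: each request carries a `ts` value, the public key as `apikey`, and `hash = md5(ts + privateKey + publicKey)`. A minimal sketch of that signing step follows. The function name `signParams` and the placeholder fallbacks are illustrative and not taken from this repo, and it assumes the app has already loaded .env into `process.env`.

```javascript
const crypto = require('crypto');

// Build the query parameters Marvel expects on server-side requests:
// ts (a string that changes per request), apikey (the public key) and
// hash = md5(ts + privateKey + publicKey).
// signParams is a hypothetical helper name, not part of this project.
function signParams(publicKey, privateKey, ts = Date.now().toString()) {
  const hash = crypto
    .createHash('md5')
    .update(ts + privateKey + publicKey)
    .digest('hex');
  return { ts, apikey: publicKey, hash };
}

// Usage: read the keys from the environment (placeholders are used
// here only so the sketch runs without a .env file).
const params = signParams(
  process.env.MARVEL_API_PUBLIC_KEY || 'public-key-placeholder',
  process.env.MARVEL_API_PRIVATE_KEY || 'private-key-placeholder'
);
console.log(new URLSearchParams(params).toString());
```

The resulting query string can be appended to any Marvel API URL. The private key itself is never sent, only the hash.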
- `/`: Returns the basic details about the app
- `/docs`: Shows the documentation page
- `/characters`: Returns the ids of Marvel characters
- `/characters/:id`: Returns additional information about the Marvel character with the given id
The /characters endpoint uses asynchronous calls to get all the Marvel character ids in less than 30 seconds. The result is stored in an in-memory cache that automatically expires after a day. Any subsequent call to this endpoint will use the cache.
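One way to picture the asynchronous fetch: request the first page to learn the total, then request every remaining page concurrently. The sketch below is not the code in this repo. `fetchPage(offset, limit)` is a hypothetical function standing in for one call to the Marvel API, and the page size of 100 is an assumption.

```javascript
// fetchPage(offset, limit) is a hypothetical function that resolves to
// { total, ids } for one page of characters.
async function fetchAllCharacterIds(fetchPage, limit = 100) {
  // The first page tells us how many characters exist in total.
  const first = await fetchPage(0, limit);

  // Compute the offsets of the remaining pages.
  const offsets = [];
  for (let offset = limit; offset < first.total; offset += limit) {
    offsets.push(offset);
  }

  // Request the remaining pages concurrently instead of one by one.
  const rest = await Promise.all(offsets.map((o) => fetchPage(o, limit)));

  // Promise.all preserves order, so the ids stay in page order.
  return first.ids.concat(...rest.map((page) => page.ids));
}

// Usage with a fake page source standing in for the Marvel API.
const fakeIds = Array.from({ length: 250 }, (_, i) => 1000 + i);
const fakeFetchPage = async (offset, limit) => ({
  total: fakeIds.length,
  ids: fakeIds.slice(offset, offset + limit),
});

fetchAllCharacterIds(fakeFetchPage).then((ids) => console.log(ids.length));
```

With sequential requests the total time grows with the number of pages. With concurrent requests it is closer to the time of the slowest page, which is what makes a full scan in well under 30 seconds plausible.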
Once the cache expires, the script calls the third-party API again to refresh it. We can assume that Marvel characters are not typically updated multiple times a day. If new characters are added to their database, it is probably done periodically and in bulk. Updating the cache once a day should be enough, but a configurable expiry time and a manual trigger should cover other edge cases (src/config/cache.js).
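The expiry-plus-manual-trigger idea can be sketched in a few lines. This is an illustration only: `createCache`, `loader`, `ttlMs` and `now` are made-up names, and the real settings live in src/config/cache.js, whose contents are not shown here.

```javascript
// Minimal in-memory cache with a time-to-live and a manual refresh.
// loader is any async function that produces the value to cache.
// now is injectable so the expiry logic can be exercised without waiting.
function createCache(loader, ttlMs = 24 * 60 * 60 * 1000, now = Date.now) {
  let value;
  let expiresAt = 0; // 0 means "nothing cached yet"

  // Manual trigger: reload the value and restart the expiry clock.
  async function refresh() {
    value = await loader();
    expiresAt = now() + ttlMs;
    return value;
  }

  // Serve from memory until the entry expires, then reload.
  async function get() {
    return now() < expiresAt ? value : refresh();
  }

  return { get, refresh };
}

// Usage: the loader runs once, the second get() is served from memory.
let loaderCalls = 0;
const idCache = createCache(async () => {
  loaderCalls += 1;
  return [1, 2, 3]; // placeholder ids
});

idCache
  .get()
  .then(() => idCache.get())
  .then(() => console.log(loaderCalls));
```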
A semaphore is created when an API call triggers the refresh command. This prevents other concurrent calls from executing multiple refresh commands: each one checks whether the semaphore exists. While the cache is being updated, the other concurrent calls wait and check the semaphore every 5 seconds. When the semaphore is released, those calls can read the cache.
If the other concurrent calls wait for more than 25 seconds, the API errors out with a "retry again" message. The refresh usually takes 10-15 seconds, so it should not normally come to this.
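A rough sketch of the semaphore behavior described above, using a boolean flag and polling. It is not the code in this repo: `createGuardedCache` and its option names are hypothetical, cache expiry is left out to keep the focus on the locking, and this version gives up once the wait reaches `timeoutMs`.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wraps an async refresh function so that only one refresh runs at a time.
// Defaults mirror the numbers above: poll every 5 s, give up after 25 s.
function createGuardedCache(refresh, { pollMs = 5000, timeoutMs = 25000 } = {}) {
  let value; // cached result, undefined until the first refresh
  let refreshing = false; // the "semaphore": true while a refresh runs

  return async function get() {
    if (value !== undefined) return value;

    if (!refreshing) {
      // The first caller takes the semaphore and performs the refresh.
      refreshing = true;
      try {
        value = await refresh();
        return value;
      } finally {
        refreshing = false; // always release, even if the refresh fails
      }
    }

    // Other callers poll until the semaphore is released or time runs out.
    let waited = 0;
    while (refreshing) {
      if (waited >= timeoutMs) {
        throw new Error('Cache is still refreshing, please retry');
      }
      await sleep(pollMs);
      waited += pollMs;
    }
    if (value === undefined) {
      throw new Error('Cache refresh failed, please retry');
    }
    return value;
  };
}

// Usage: three concurrent calls trigger a single refresh.
let refreshCalls = 0;
const getIds = createGuardedCache(
  async () => {
    refreshCalls += 1;
    await sleep(30); // stands in for the slow third-party call
    return [1, 2, 3]; // placeholder ids
  },
  { pollMs: 10, timeoutMs: 100 }
);

Promise.all([getIds(), getIds(), getIds()]).then((results) => {
  console.log(refreshCalls, results.length);
});
```

Polling is simple but adds up to one poll interval of latency for waiting callers. Sharing the in-flight promise between callers would avoid that, at the cost of drifting from the semaphore design described here.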
It doesn't add value since the Marvel API returns fast. We might consider this if we have a lot of traffic and our daily limit of 3000 calls is being hit frequently. I'm confident that this will not happen, but who knows the future.
I want this to depend on as few third-party services as possible. An in-memory cache works for this kind of data since the characters aren't updated frequently, so the numbers should not jump significantly even after a few years (unless Ant-Man finds an ant queen and creates superhero ant babies).
- Initial POC release
- Make it a full-blown scraper that competes with any other scraper out there (ETA: in another life maybe)
- Haven't completed the unit tests since I ran out of time. Hopefully this is enough. Thanks