If you read this file, you were successful in the behavioral interview. Well done! 👏 👏 👏
🚀 The next step to join the Data Science team of xtream is this assignment. You will find several datasets: please choose only one. For each dataset, we propose several challenges. You do not need to complete all of them, but rather only the ones you feel comfortable with or the ones that interest you.
✨ Choose what really makes you shine!
⌚ We estimate it should take less than 8 hours to solve the challenges for a dataset, and we give you 10 days to submit a solution, so that you can move at your own pace.
❗ Important: you might feel the tasks are too broad, or that the requirements are not fully elicited. This is done on purpose: we want to leave you free to choose your own way of extracting value from the data and of developing your own solutions.
Please fork this repository and work on it as if you were taking on a real-world project. On the deadline, we will check out your work.
❗ Important: At the end of this README, you will find a blank "How to run" section. Please write there instructions on how to run your code.
Your work will be assessed according to several criteria, for instance:
- Method
- Understanding of the data
- Completeness and clarity of the results
- Code quality
- Work quality (use of git, dataset management, workflow, tests, ...)
- Documentation
❗ Important: this is not a Kaggle competition; we do not care about model performance. There is no need to get the best possible model: focus on showing your method and why you would be able to get there, given enough time and support.
Problem type: regression
Dataset description: Diamonds readme
Don Francesco runs a jewelry business. He is a very rich fellow, but his past is shady: be sure not to make him angry. Over the years, he collected data from 5000 diamonds. The dataset provides physical features of the stones, as well as their value, as estimated by a respected expert.
Francesco wants to know which factors influence the value of a diamond: he is not an expert, so he wants simple and clear messages. However, he trusts no one, and he hired another data scientist to get a second opinion on your work. Create a Jupyter notebook to explain what Francesco should look at and why. Your code should be understandable by a data scientist, but your text should be clear for a layman.
Then, Francesco tells you that the expert providing him with the stone valuations disappeared. He wants you to develop a model to predict the value of a new diamond given its characteristics. He insists on a point: his customers are not easy-going, so he wants to know why a stone is given a certain value. Create a Jupyter notebook to meet Francesco's request.
Francesco likes your model! Now he wants to use it. To improve the model, Francesco is open to hiring a new expert and letting him value more stones. Create an automatic pipeline capable of training a new instance of your model from the raw dataset.
Finally, Francesco wants to embed your model in a web application, to allow for easy use by his employees. Develop a REST API to expose the model predictions.
In the `challenge1` folder, the Jupyter notebook `main.ipynb` addresses the challenge. It runs in a virtual environment with `requirements.txt` installed (alternatively, it can be run in Google Colab by correctly pointing the notebook to the text files it reads). A plot is saved in the `figs` folder for future use.
In the `challenge2` folder, the Jupyter notebook `main.ipynb` addresses the challenge. It runs in a virtual environment with `requirements.txt` installed (alternatively, it can be run in Google Colab by correctly pointing the notebook to the text files it reads).
The solution to this challenge can be found in the `challenge3` folder. For it, we created two Python scripts:

- The first one, `pipeline_training.py`, reads the CSV file `diamonds.csv`, processes the columns "carat", "color" and "clarity", and then trains the model against the ground truth given in the column "price". The transformations and the model training are encoded in an object of the `sklearn.pipeline.Pipeline()` class and, after training, the `Pipeline()` object is saved to the `my_pipeline.joblib` file (a sketch is given right after this list).
- The second one, `diamond_pricing.py`, is an interactive I/O Python script which asks for the features of a given diamond and returns the estimated price, using the `Pipeline()` object loaded from the `my_pipeline.joblib` file (a second sketch follows further below).
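A minimal sketch of the training script, assuming for illustration an ordinal encoding of the two categorical grades and a plain linear regression (the actual preprocessing steps and estimator used in `pipeline_training.py` may differ):

```python
# Illustrative sketch of pipeline_training.py; the encoder and estimator are assumptions.
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Read the raw dataset and keep the three features plus the target.
df = pd.read_csv("diamonds.csv")
X = df[["carat", "color", "clarity"]]
y = df["price"]

# Encode the categorical grades and pass "carat" through unchanged.
preprocess = ColumnTransformer(
    transformers=[("grades", OrdinalEncoder(), ["color", "clarity"])],
    remainder="passthrough",
)

# Bundle preprocessing and model so they are fitted and saved together.
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipeline.fit(X, y)

# Persist the fitted pipeline for later use by the pricing script.
joblib.dump(pipeline, "my_pipeline.joblib")
```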
Besides those files, there is a `requirements.txt` which lists the libraries needed for these scripts.
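A similarly minimal sketch of the interactive pricing script; the prompts and input handling here are illustrative assumptions:

```python
# Illustrative sketch of diamond_pricing.py; the exact prompts are assumptions.
import joblib
import pandas as pd

# Load the pipeline fitted by the training script.
pipeline = joblib.load("my_pipeline.joblib")

# Ask the user for the three features the model expects.
carat = float(input("Carat: "))
color = input("Color grade: ").strip().upper()
clarity = input("Clarity grade: ").strip().upper()

# Build a one-row frame with the same column names used at training time.
stone = pd.DataFrame([{"carat": carat, "color": color, "clarity": clarity}])

# The pipeline applies the encoding and the model in a single call.
print(f"Estimated price: {pipeline.predict(stone)[0]:.2f}")
```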
It should be said, regarding the training of the model, that since Francesco and his clients need a model simple enough for them to grasp how it works, we chose a model whose algorithm cannot learn incrementally. This means that if the expert hired by Francesco values additional diamonds, the model has to be retrained on the combination of old and new data; this can be done by simply appending the new data to the `diamonds.csv` file and re-running `pipeline_training.py`, as sketched below.
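A minimal sketch of that retraining step, assuming the new valuations arrive as a CSV with the same columns as `diamonds.csv` (the file name `new_valuations.csv` is hypothetical):

```python
# Append newly valued stones to the raw dataset; new_valuations.csv is a hypothetical name.
import pandas as pd

old = pd.read_csv("diamonds.csv")
new = pd.read_csv("new_valuations.csv")

# Combine old and new data and overwrite the raw dataset,
# then re-run pipeline_training.py to obtain a freshly trained pipeline.
pd.concat([old, new], ignore_index=True).to_csv("diamonds.csv", index=False)
```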
For this challenge we developed an API and deployed it in a dockerized ECS instance on the AWS cloud. It can be reached at http://18.234.175.187/. The API is called with a GET request whose URL parameters follow the structure http://18.234.175.187/pricing/{carat}/{color}/{clarity}, and it responds with a JSON containing the features and the price computed by the model developed in the previous challenges. Simple documentation can be found at the root http://18.234.175.187/, while a more technical one is available at http://18.234.175.187/docs.
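As an illustration, a stone can be priced from Python with a single GET request (the feature values below are made up; only the endpoint structure comes from the description above):

```python
# Example client call; the carat/color/clarity values are arbitrary examples.
import requests

response = requests.get("http://18.234.175.187/pricing/0.5/E/VS1")
print(response.json())  # JSON with the submitted features and the predicted price
```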
For a more detailed description of the API and of the code used both locally and in the cloud, please refer to the Challenge 4 readme.