Data on penguin populations are limited because most monitored colonies are near permanent research stations and other sites are surveyed only sporadically. Because the data are so patchy and the time series relatively short, it has been difficult to build statistical models that can explain past dynamics or provide reliable future predictions.
Your goal is to create better models to estimate populations for hard-to-reach sites in the Antarctic, and thereby greatly improve our ability to use penguins to monitor the health of the Southern Ocean!
This repository contains code volunteered by leading competitors in the Random Walk of the Penguins DrivenData challenge. Code for all winning solutions is open source under the MIT License.
Winning code for other DrivenData competitions is available in the competition-winners repository.
Place | Team or User | Public Score | Private Score | Summary of Model |
---|---|---|---|---|
1 | ambarishg | 4.4193 | 4.8127 | Before running the model, the data were imputed for each combination of site and penguin type. The imputations were applied in the following order: (1) Stine interpolation for the R models and linear interpolation for the Python models; (2) Last Observation Carried Forward, for the R models only; (3) Next Observation Carried Backward, for both the R and Python models; (4) Replace by Zero, for both the R and Python models. |
2 | TomBolton | 4.5915 | 4.8274 | My method involved a mix of persistence, auto-regressive models, linear regressions and exponential models. As the challenge involved multiple time series, the particular method applied depended on my judgement after manual inspection of the data. Tailoring the approach to each time series individually proved more successful than applying a single technique across all of them. The behaviour of the time series varied massively between sites and species. One time series might show a linear increase due to increased food availability, while a neighbouring site might exhibit an exponential decrease due to humans entering the region. With so many potential factors affecting the inter-annual variability of nest counts, the challenge was particularly difficult. |
3 | aaronr | 4.4953 | 4.8495 | The first is an estimate of the mean, taken from the last observation (it could also be an average of the last N observations). The second is the recent trend, i.e. changes in recent years. The third is the trend of the whole time series. This approach was partially adapted from Facebook's Prophet blog post. I also wanted to learn from the public leaderboard set (at the risk of overfitting) whether there was an additional general trend of increase or decrease in those years. Eventually I combined the constant estimate with a minor general trend and a linear fit that was computed differently for each year. |
4 | oleg.panichev | 5.2106 | 4.9605 | My approach is based on Benchmark: Simple Linear Models, which is a great starting point for this problem. The first thing I did was replace Linear Regression with XGBoost and do some minor parameter tuning. After that, I looked at the preprocessing part and decided to improve the function for replacing NaNs in the dataset. That didn't give any improvement by itself, but it allowed me to combine different models into an ensemble. The submission generated by ensembling previous submissions scored 5.2132 on the public leaderboard and 4.9605 on the private one. |
5 | bicarrio | 4.3782 | 4.9872 | My thought was that populations are in a delicate equilibrium between competitors, prey, and predators, which can be written down using differential equations in which the absolute rate of change between breeding seasons (birth rate minus death rate) depends in some complex way on the other species' populations. |
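
As a rough illustration of the imputation order described in the first-place summary, here is a minimal sketch of the Python-side chain (linear interpolation, then Next Observation Carried Backward, then Replace by Zero) for a single site and penguin type series. It is not the winner's code: the function names and the use of `None` for missing counts are assumptions made for this example, and it assumes that interpolation only fills gaps between two observed values, so any trailing gap falls through to zero.

```python
from typing import List, Optional

Series = List[Optional[float]]


def linear_interpolate(values: Series) -> Series:
    """Fill gaps that lie between two observed values with a straight line."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            out[i] = out[left] + frac * (out[right] - out[left])
    return out


def carry_backward(values: Series) -> Series:
    """Next Observation Carried Backward: fill a gap with the next observed value."""
    out = list(values)
    next_seen = None
    for i in range(len(out) - 1, -1, -1):
        if out[i] is None:
            out[i] = next_seen  # stays None if nothing is observed later
        else:
            next_seen = out[i]
    return out


def replace_with_zero(values: Series) -> List[float]:
    """Last resort: anything still missing becomes zero."""
    return [0.0 if v is None else v for v in values]


def impute_counts(values: Series) -> List[float]:
    """Apply the three steps in the order given in the summary (hypothetical helper)."""
    return replace_with_zero(carry_backward(linear_interpolate(values)))


if __name__ == "__main__":
    nest_counts = [None, 10.0, None, 20.0, None]  # made-up example series
    print(impute_counts(nest_counts))
```

Each step only touches values that the previous step left missing, which is why the order matters: interpolation handles interior gaps, backward carrying handles gaps at the start of a series, and zero is used only when nothing else applies.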
The 2017 Prediction Bonus prize was awarded to loweew.
Interview with winners: Random Walk of the Penguins—Guest Post by Dr. Grant Humphries
Benchmark Blog Post: "Simple Linear Models"