Skip to content

Latest commit

 

History

History
17 lines (15 loc) · 1.22 KB

README.md

File metadata and controls

17 lines (15 loc) · 1.22 KB

beachwaterqualitytimeseries

Time Series Forecasting of E.Coli counts using Python and Data Science Tools (Pandas, matplotlib, statsmodels etc)

Tasks completed thus far:

  • Extracted and cleaned raw XML weather data and e. Coli count data (~1000 data points)
  • Analyzed e. Coli counts in the past 10 years over 11 beaches around Toronto (~920 datapoints)
  • Performed day-by-day differencing to check for stationarity in the data before modeling
  • Predicted single data point forecast and compared against expanded forecast using the statsmodels ARIMA model with parameters (AR=6, D=1, MA=1)
  • Programmed the ARIMA model to forecast using 30% of the dataset based on a training dataset of 70%
  • Calculated RMSE values of the ARIMA model for both forecasts
  • Results: RMSE IQR doubled after including more datapoints; RMSE median tripled as well.

TASKS TODO:

  • Investigate why the IQR and median values are increasing with increasing number of datapoints in the forecast
  • Conduct log/sqrt transformations for the E.coli counts of each beach
  • Check whether transformations change the data to fit them better with a normal distribution
  • Change the parameters of the ARIMA model, leveraging MA model more (this is currently set only to q=1)