Repository for the PyData DC 2016 tutorial
The objective of this project is to review the functions and methods available in the scipy.stats module for running common frequentist statistical tests, including how to format the data and interpret the results. The tests are run on the iris data set. Some common data-handling commands in pandas, along with plotting in Matplotlib and Seaborn, are also mentioned. The following statistical tests are covered:
- Normality testing
- Homogeneity of variance testing
- Comparing two samples of a continuous measure: t-tests, Cohen's d, Wilcoxon rank-sum test, Mann-Whitney U test, Wilcoxon signed-rank test
- Comparing multiple groups: ANOVA, Kruskal-Wallis H
- Contingency tables: chi-square test, Fisher's exact test
- Correlation: Pearson's correlation coefficient r, Spearman's rank-order correlation coefficient rho, point-biserial correlation coefficient, Kendall's tau
- Linear regression (requires the statsmodels library)
- Logistic regression (requires the statsmodels library)
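
As a rough orientation, several of the tests listed above map onto scipy.stats calls as sketched below. This is a minimal illustration and not the notebook's code: the samples are synthetic draws from NumPy's random generator (not the iris measurements), the contingency table holds made-up counts, and all variable names are hypothetical. Cohen's d is computed by hand here using the pooled standard deviation.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the iris measurements): three groups of 50 values.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=0.4, size=50)
group_b = rng.normal(loc=5.9, scale=0.5, size=50)
group_c = rng.normal(loc=6.6, scale=0.6, size=50)

# Normality testing: Shapiro-Wilk test on one sample.
w_stat, p_normal = stats.shapiro(group_a)

# Homogeneity of variance: Levene's test across the groups.
lev_stat, p_levene = stats.levene(group_a, group_b, group_c)

# Two independent samples: Welch's t-test and the Mann-Whitney U test.
t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=False)
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Cohen's d, computed manually from the pooled standard deviation.
n_a, n_b = len(group_a), len(group_b)
pooled_sd = np.sqrt(
    ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1))
    / (n_a + n_b - 2)
)
cohens_d = (group_a.mean() - group_b.mean()) / pooled_sd

# Multiple groups: one-way ANOVA and the Kruskal-Wallis H test.
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
h_stat, p_kruskal = stats.kruskal(group_a, group_b, group_c)

# Contingency table with made-up counts: chi-square and Fisher's exact test.
table = np.array([[12, 5], [3, 10]])
chi2_result = stats.chi2_contingency(table)
chi2_stat, p_chi2 = chi2_result[0], chi2_result[1]
fisher_result = stats.fisher_exact(table)
p_fisher = fisher_result[1]

# Correlation: a second variable built to be positively related to group_a.
y = 0.5 * group_a + rng.normal(scale=0.1, size=50)
r, p_r = stats.pearsonr(group_a, y)
rho, p_rho = stats.spearmanr(group_a, y)
tau, p_tau = stats.kendalltau(group_a, y)

print(f"Welch t-test p={p_t:.3g}, Cohen's d={cohens_d:.2f}, Pearson r={r:.2f}")
```

Each call returns a test statistic together with a p-value; how to read those values for a given test is what the tutorial itself walks through.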
Files in the repository:
- The iris data set in CSV format. This is the same data set that ships with the scikit-learn library
- The Python 3 Jupyter Notebook with the code
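
A minimal sketch of the kind of pandas handling that typically precedes the tests is shown below. The column names and the inline rows are assumptions made for illustration; the CSV in this repository may use different headers, so adjust the names to match the actual file. The text is read from an in-memory string so that the snippet runs without the file.

```python
import io

import pandas as pd

# Hypothetical excerpt in the usual iris layout; the headers are assumed,
# not taken from the repository's CSV.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

# With the real file, pass its path to pd.read_csv instead of the StringIO object.
iris = pd.read_csv(io.StringIO(csv_text))

# Per-species summary of one continuous measure.
group_means = iris.groupby("species")["petal_length"].mean()

# Extract one group as a NumPy array, the shape scipy.stats functions expect.
setosa_sepal_length = iris.loc[iris["species"] == "setosa", "sepal_length"].to_numpy()

print(group_means)
```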
The video of the presentation is available at: