March Madness datasets from 2002-2019, 2021. Datasets curated by [Chris Toukmaji](https://www.toukmaji.com/); data published by [Ken Pomeroy](https://en.wikipedia.org/wiki/Ken_Pomeroy) via kenpom.com. Please read the rest of this explanation so that the structure of the rows makes sense.
Because the tournament is played as a bracket, the dataset is laid out the same way. First and foremost, the order of the rows matters in the raw dataset, so be cautious if and when you partition your data. For the first region (the first set of 16 teams), each consecutive pair of teams plays each other, and the winner is appended after those 16 teams. The winners are then paired off in the same way, and the process continues until only 1 winner remains from these 16 teams. The same process is then repeated for the remaining three regions. Finally, we are left with four teams (one team per region). These four teams are appended, and the same process continues with them.
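To make the row order concrete, here is a minimal sketch of how one block of rows could be generated under my reading of the description above. The function name `play_out`, the `pick_winner` callback, and the team labels are hypothetical and are not part of the dataset; the sketch also assumes the number of teams is a power of two.

```python
from typing import Callable, List


def play_out(teams: List[str], pick_winner: Callable[[str, str], str]) -> List[str]:
    """Return rows in bracket order: the starting teams, then each round's winners.

    Assumes len(teams) is a power of two (16 for a region, 4 for the last block).
    """
    rows = list(teams)      # the starting teams come first
    current = list(teams)   # teams still alive in this block
    while len(current) > 1:
        # Each consecutive pair plays; only the winner advances.
        winners = [
            pick_winner(current[i], current[i + 1])
            for i in range(0, len(current), 2)
        ]
        rows.extend(winners)  # winners are appended after the existing rows
        current = winners
    return rows


# Hypothetical example: 16 placeholder teams, the first team of each pair always wins.
region = [f"T{i}" for i in range(16)]
region_rows = play_out(region, lambda a, b: a)
```

Under these assumptions, rows 0-15 of a region are the starting teams, the next 8 rows are the first-round winners, and so on. A split that cuts through such a block separates teams from the opponents they are paired with, which is why partitioning needs care.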
How to restructure the data is left to you and depends on how you want to set up your data and model. Initially, I one-hot encoded each round, but I have since realized the drawbacks of that approach in this use case. For example, it is impossible for a team to win a 2nd round game without winning a 1st round game (since each team is eliminated after one loss), so these features become useless for most teams. Another option is to count the number of games a given team wins. However, both of these approaches are unnatural and base a prediction on overall performance, not on performance against a specific opponent (as in a true bracket). To keep the context of both teams, I am considering some form of transformation applied to each pair of consecutive teams, possibly subtracting one team's data from the other's, so that we train on the difference between the performance of two teams rather than on the average performance of a team against an arbitrary opponent.
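The pairwise-difference idea could be sketched as follows. This is only an illustration of one possible transformation, not a method the dataset prescribes: the function name `pairwise_differences` is hypothetical, the numbers are made up, and the sketch assumes it is given the rows of a single round (an even number of rows, with opponents adjacent).

```python
from typing import List, Sequence


def pairwise_differences(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Subtract the second team's features from the first team's, pair by pair."""
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows (one round of matchups)")
    diffs = []
    for i in range(0, len(rows), 2):
        first, second = rows[i], rows[i + 1]
        # One output row per matchup: feature-wise difference first - second.
        diffs.append([x - y for x, y in zip(first, second)])
    return diffs


# Made-up feature values for four teams, i.e. two matchups.
round_rows = [
    [10.0, 2.0],
    [4.0, 5.0],
    [1.5, 0.5],
    [1.0, 1.0],
]
matchup_features = pairwise_differences(round_rows)
```

Each output row then describes a matchup rather than a single team. Note that the sign of the difference depends on which team is listed first, so a model trained this way needs a label that follows the same convention (for example, whether the first-listed team won).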