
Application Case 2.4

Predicting NCAA Bowl Game Outcomes

  Predicting the outcome of a college football game (or any sports game, for that matter) is an interesting and challenging problem. Therefore, challenge-seeking researchers from both academia and industry have spent a great deal of effort on forecasting the outcomes of sporting events. Large quantities of historical data about the structure and outcomes of sporting events exist in different media outlets (often publicly available), in the form of a variety of numerically or symbolically represented factors that are assumed to contribute to those outcomes.

  The end-of-season bowl games are very important to colleges both financially (bringing in millions of dollars of additional revenue) and reputationally, helping them recruit quality students and highly regarded high school athletes for their athletic programs (Freeman & Brewer, 2016). Teams that are selected to compete in a given bowl game split a purse, the size of which depends on the specific bowl (some bowls are more prestigious and have higher payouts for the two teams); therefore, securing an invitation to a bowl game is the main goal of any Division I-A college football program. The decision makers of the bowl games are given the authority to select and invite bowl-eligible teams (those with six wins against Division I-A opponents in that season) that are successful (as per the ratings and rankings), will play an exciting and competitive game, attract fans of both schools, and keep the remaining fans tuned in via a variety of media outlets for advertising.

  In a recent data mining study, Delen, Cogdell, and Kasap (2012) used 8 years of bowl game data along with three popular data mining techniques (decision trees, neural networks, and support vector machines) to predict both the classification-type outcome of a game (win versus loss) and the regression-type outcome (the projected point difference between the scores of the two opponents). What follows is a condensed description of their study.

  Methodology


FIGURE 2.16 The Graphical Illustration of the Methodology Employed in the Study.

  In this research, Delen and his colleagues followed a popular data mining methodology called CRISP-DM (Cross-Industry Standard Process for Data Mining), which is a six-step process. This methodology, which is covered in detail in Chapter 4, provided them with a systematic and structured way to conduct the underlying data mining study and hence improved the likelihood of obtaining accurate and reliable results. To objectively assess the predictive power of the different model types, they used a cross-validation methodology called k-fold cross-validation. Details on k-fold cross-validation can also be found in Chapter 4. Figure 2.16 graphically illustrates the methodology employed by the researchers.

  Data Acquisition and Data Preprocessing


TABLE 2.5 Description of the Variables Used in the Study.

  The sample data for this study were collected from a variety of sports databases available on the Web, including jhowel.net, ESPN.com, Covers.com, ncaa.org, and rauzulusstreet.com. The data set included 244 bowl games, representing the complete set of eight seasons of college football bowl games played between 2002 and 2009. The researchers also included an out-of-sample data set (2010–2011 bowl games) for additional validation purposes. Exercising one of the popular data mining rules of thumb, they included as much relevant information in the model as possible. After an in-depth variable identification and collection process, they ended up with a data set of 36 variables: the first 6 were identifying variables (i.e., the name and year of the bowl game, the home and away team names, and their athletic conferences; see variables 1–6 in Table 2.5); the next 28 were input variables (delineating a team’s seasonal statistics on offense and defense, game outcomes, team composition characteristics, athletic conference characteristics, and how the team fared against the odds; see variables 7–34 in Table 2.5); and the last 2 were output variables (i.e., ScoreDiff, the score difference between the home team and the away team represented as an integer, and WinLoss, whether the home team won or lost the bowl game, represented as a nominal label).

  In the formulation of the data set, each row (a.k.a. tuple, case, sample, example, etc.) represented a bowl game, and each column stood for a variable (i.e., an identifier, input, or output type). To represent the game-related comparative characteristics of the two opposing teams, the input variables were calculated as the differences between the measures of the home and away teams; all of these values are computed from the home team’s perspective. For instance, the variable PPG (average number of points a team scored per game) represents the difference between the home team’s PPG and the away team’s PPG. The output variables represent whether the home team won or lost the bowl game: if the ScoreDiff variable takes a positive integer value, the home team is expected to win the game by that margin; if it takes a negative value, the home team is expected to lose by that margin. In the case of WinLoss, the output variable is a binary label, “Win” or “Loss,” indicating the outcome of the game for the home team.
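  To make this representation concrete, the following minimal sketch (written in Python with pandas, which the study does not specify as its tooling; the raw column names are illustrative) builds one difference-based input variable and the two output variables for a pair of hypothetical games.

```python
# A minimal sketch of the row-per-game representation described above.
# Raw column names (home_ppg, away_ppg, etc.) are illustrative, not the study's.
import pandas as pd

raw = pd.DataFrame({
    "bowl":       ["Fiesta", "Sugar"],
    "home_ppg":   [34.2, 28.9],   # home team's average points per game
    "away_ppg":   [30.1, 31.4],   # away team's average points per game
    "home_score": [31, 20],       # actual bowl game scores
    "away_score": [24, 27],
})

games = pd.DataFrame({
    # Input: difference computed from the home team's perspective
    "PPG": raw["home_ppg"] - raw["away_ppg"],
    # Output 1: integer score difference (positive means the home team won)
    "ScoreDiff": raw["home_score"] - raw["away_score"],
})
# Output 2: the nominal Win/Loss label derived from ScoreDiff
games["WinLoss"] = games["ScoreDiff"].apply(lambda d: "Win" if d > 0 else "Loss")
print(games)
```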

  Results and Evaluation

  In this study, three popular prediction techniques were used to build models (and to compare them to one another): artificial neural networks, decision trees, and support vector machines. These techniques were selected based on their ability to model both classification- and regression-type prediction problems and on their popularity in the recently published data mining literature. More details about these popular data mining methods can be found in Chapter 4.
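  As a rough illustration of this dual capability, the scikit-learn classes below pair a classifier and a regressor for each of the three model families; these are assumed stand-ins, since the study’s actual software and hyperparameter settings are not described here.

```python
# Illustrative scikit-learn stand-ins for the three model families, each of
# which supports both the classification (WinLoss) and regression (ScoreDiff)
# formulations of the prediction problem.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.svm import SVC, SVR

classifiers = {  # for the binary-nominal WinLoss output
    "decision tree":          DecisionTreeClassifier(random_state=0),
    "neural network":         MLPClassifier(max_iter=2000, random_state=0),
    "support vector machine": SVC(),
}
regressors = {   # for the numerical ScoreDiff output
    "decision tree":          DecisionTreeRegressor(random_state=0),
    "neural network":         MLPRegressor(max_iter=2000, random_state=0),
    "support vector machine": SVR(),
}
```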

  To compare the predictive accuracy of the models to one another, the researchers used a stratified k-fold cross-validation methodology. In the stratified version of k-fold cross-validation, the folds are created so that each contains approximately the same proportion of class labels as the original data set. In this study, the value of k was set to 10 (i.e., the complete set of 244 samples was split into 10 subsets, each having about 25 samples), which is common practice in predictive data mining applications. A graphical depiction of 10-fold cross-validation was shown earlier in this chapter. To compare the prediction models developed with the aforementioned three data mining techniques, the researchers chose three common performance criteria: accuracy, sensitivity, and specificity. The simple formulas for these metrics were also explained earlier in this chapter.
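  A stratified 10-fold cross-validation loop with these three performance criteria might look like the sketch below, assuming a NumPy feature matrix X and a label vector y of “Win”/“Loss” strings; this illustrates the procedure and is not the researchers’ code.

```python
# A sketch of stratified 10-fold cross-validation reporting accuracy,
# sensitivity, and specificity; X (features) and y (Win/Loss labels) are
# assumed to be NumPy arrays built from the 244-game data set.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

def cross_validate(model, X, y, k=10):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    acc, sens, spec = [], [], []
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        # "Win" is treated as the positive class
        tn, fp, fn, tp = confusion_matrix(
            y[test_idx], pred, labels=["Loss", "Win"]).ravel()
        acc.append((tp + tn) / (tp + tn + fp + fn))
        sens.append(tp / (tp + fn))   # sensitivity = TP / (TP + FN)
        spec.append(tn / (tn + fp))   # specificity = TN / (TN + FP)
    return np.mean(acc), np.mean(sens), np.mean(spec)
```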


TABLE 2.6 Prediction Results for the Direct Classification Methodology.


TABLE 2.7 Prediction Results for the Regression-Based Classification Methodology.

  The prediction results of the three modeling techniques are presented in Tables 2.6 and 2.7. Table 2.6 presents the 10-fold cross-validation results of the direct classification methodology, in which the three data mining techniques were formulated with a binary-nominal output variable (i.e., WinLoss). Table 2.7 presents the 10-fold cross-validation results of the regression-based classification methodology, in which the techniques were formulated with a numerical output variable (i.e., ScoreDiff). In the regression-based classification prediction, the numerical output of the models is converted to a classification type by labeling positive predicted score differences as “Win” and negative ones as “Loss,” and then tabulating the labels in confusion matrices. From the confusion matrices, the overall prediction accuracy, sensitivity, and specificity of each model type are calculated and presented in the two tables. As the results indicate, the direct classification methods performed better than the regression-based classification methodology. Among the three data mining techniques, classification and regression trees produced the best prediction accuracy under both methodologies. Overall, the classification and regression tree models achieved a 10-fold cross-validation accuracy of 86.48%, followed by support vector machines (with a 10-fold cross-validation accuracy of 79.51%) and neural networks (with a 10-fold cross-validation accuracy of 75.00%). Using a t-test, the researchers found that these accuracy values were significantly different at the 0.05 alpha level; that is, the decision tree was a significantly better predictor in this domain than the neural network and the support vector machine, and the support vector machine was a significantly better predictor than the neural network.
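  The two mechanical steps in this comparison, mapping a numeric ScoreDiff prediction onto a Win/Loss label and testing per-fold accuracies with a paired t-test, can be sketched as follows; the helper names and the use of SciPy are assumptions for illustration, not the study’s implementation.

```python
# Regression-based classification: convert numeric ScoreDiff predictions to
# nominal labels, and compare two models' per-fold accuracies with a paired
# t-test (as the researchers did, at the 0.05 alpha level).
import numpy as np
from scipy import stats

def to_win_loss(score_diff_pred):
    """Label positive predicted score differences as home-team wins."""
    return np.where(np.asarray(score_diff_pred) > 0, "Win", "Loss")

def significantly_different(fold_acc_a, fold_acc_b, alpha=0.05):
    """Paired t-test on the per-fold accuracies of two models."""
    t_stat, p_value = stats.ttest_rel(fold_acc_a, fold_acc_b)
    return p_value < alpha

# Example use: labels = to_win_loss(regressor.predict(X_test))
```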

  The results of the study showed that the classification-type models predicted the game outcomes better than the regression-based classification models. Even though these results are specific to the application domain and the data used in this study, and therefore should not be generalized beyond its scope, they are exciting because decision trees were not only the most accurate predictors but also the easiest to understand and deploy, compared to the other two machine-learning techniques employed in this study. More details about this study can be found in Delen et al. (2012).

QUESTIONS FOR DISCUSSION
1. What are the foreseeable challenges in predicting sporting event outcomes (e.g., college bowl games)?
2. How did the researchers formulate/design the prediction problem (i.e., what were the inputs and output, and what was the representation of a single sample—row of data)?
3. How successful were the prediction results? What else can they do to improve the accuracy?