
由浅入深理解主成分分析 (A Step-by-Step Explanation of Principal Component Analysis) #6231

Conversation

Ultrasteve
Contributor

Translation finished, resolves #6192

@tsonglew
Contributor

Claiming this for proofreading @leviding

@fanyijihua
Collaborator

@kasheemlew Sure thing 🍺

@TrWestdoor
Contributor

@leviding Claiming this for proofreading

@fanyijihua
Collaborator

@TrWestdoor You got it 🍻


@TrWestdoor TrWestdoor left a comment


The translation is very good. I have suggested a few sentence adjustments to make it read more naturally in Chinese.
PS: The translator rendered some occurrences of "reduce the number of variables" as 减少数据维度 ("reduce the data's dimensionality"). The intent is the same as reducing the number of variables, but since "variable" is translated as 变量 throughout the context, and Chinese blog posts also mostly describe this as 变量数目 ("the number of variables"), I suggest the translator take this into account; I have not marked every occurrence.


PCA is actually a widely covered method on the web, and there are some great articles about it, but only few of them go straight to the point and explain how it works without diving too much into the technicalities and the ‘why’ of things. That’s the reason why i decided to make my own post to present it in a simplified way.
There are already many articles on the web introducing PCA, some of them of high quality, but few of them get straight to the point and explain how it works; instead they tend to get overly bogged down in the techniques and principles behind PCA. So I decided, in my own way and in simple, easy-to-understand terms, to introduce PCA to everyone.

Suggested change
There are already many articles on the web introducing PCA, some of them of high quality, but few of them get straight to the point and explain how it works; instead they tend to get overly bogged down in the techniques and principles behind PCA. So I decided, in my own way and in simple, easy-to-understand terms, to introduce PCA to everyone.
There are already many articles on the web introducing PCA, some of them of high quality, but few of them get straight to the point and explain how it works; instead they tend to get overly bogged down in the techniques and principles behind PCA. So I decided, in my own way, to give everyone a simple, easy-to-understand introduction to PCA.


Before getting to the explanation, this post provides logical explanations of what PCA is doing in each step and simplifies the mathematical concepts behind it, as standardization, covariance, eigenvectors and eigenvalues without focusing on how to compute them.
Before explaining PCA, this article first walks logically through what PCA does at each step, while simplifying the mathematical concepts behind it. We will cover standardization, covariance, eigenvectors, and eigenvalues, but we will not explain how to compute them.

Suggested change
Before explaining PCA, this article first walks logically through what PCA does at each step, while simplifying the mathematical concepts behind it. We will cover standardization, covariance, eigenvectors, and eigenvalues, but we will not explain how to compute them
Before explaining PCA, this article first walks logically through what PCA does at each step, while simplifying the mathematical concepts behind it. We will cover standardization, covariance, eigenvectors, and eigenvalues, but we will not focus on how to compute them


## So what is Principal Component Analysis ?
## What is PCA?

Suggested change
## What is PCA
## What is Principal Component Analysis


Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
PCA (Principal Component Analysis) is a dimensionality-reduction method, often used to reduce the dimensionality of data sets whose dimensionality is very high. It transforms the variables in a large data set into lower-dimensional variables while retaining most of the information in those variables.

Suggested change
PCA (Principal Component Analysis) is a dimensionality-reduction method, often used to reduce the dimensionality of data sets whose dimensionality is very high. It transforms the variables in a large data set into lower-dimensional variables while retaining most of the information in those variables
PCA (Principal Component Analysis) is a dimensionality-reduction method, often used to reduce the dimensionality of high-dimensional data sets. It transforms a large set of variables into a smaller set of variables while retaining most of the information in the larger set
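
To make this concrete for readers of the thread, here is a minimal sketch (my own illustration, not part of the translated article; the synthetic data and variable names are made up) of reducing ten correlated variables to three principal components while keeping most of the information:

```python
# Sketch (not from the article): compress 10 correlated variables into 3 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))                  # 3 underlying factors
mixing = rng.normal(size=(3, 10))                   # spread them across 10 observed variables
X = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)                    # 500 x 3 instead of 500 x 10
print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())          # close to 1: little information lost
```

When the summed explained-variance ratio is close to 1, almost no information was lost by keeping only three of the ten variables.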


Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller data sets are easier to explore and visualize and make analyzing data much easier and faster for machine learning algorithms without extraneous variables to process.
Reducing the dimensionality of the data naturally sacrifices some accuracy, but remarkably, in dimensionality-reduction algorithms the loss of accuracy is not large. This is because lower-dimensional data is easier to explore and visualize, and in data analysis and machine learning algorithms we no longer have to process extraneous variables, which makes the whole process more efficient.

Suggested change
Reducing the dimensionality of the data naturally sacrifices some accuracy, but remarkably, in dimensionality-reduction algorithms the loss of accuracy is not large. This is because lower-dimensional data is easier to explore and visualize, and in data analysis and machine learning algorithms we no longer have to process extraneous variables, which makes the whole process more efficient.
Reducing the number of variables in a data set generally sacrifices some accuracy, but remarkably, dimensionality reduction trades a small amount of accuracy for simplicity. This is because lower-dimensional data is easier to explore and visualize, and in data analysis and machine learning algorithms we no longer have to process extraneous variables, which makes the whole process more efficient.


The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them. Because sometimes, variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
The goal of this step is to understand how the variables in the data set vary from the mean, and what relationships the different features have with one another. In other words, we want to see whether there is some relationship between the features. In fact, features often carry some redundant information, which sometimes makes them highly correlated. To understand this relationship, we need to compute the covariance matrix.

Suggested change
The goal of this step is to understand how the variables in the data set vary from the mean, and what relationships the different features have with one another. In other words, we want to see whether there is some relationship between the features. In fact, features often carry some redundant information, which sometimes makes them highly correlated. To understand this relationship, we need to compute the covariance matrix.
The goal of this step is to understand how the variables in the data set vary from the mean, and what relationships the different features have with one another. In other words, we want to see whether there is some relationship between the features. Because sometimes, features carry some redundant information, which makes them highly correlated. To understand this relationship, we need to compute the covariance matrix.
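
As a quick illustration of that step (my own sketch, not from the article; the data is synthetic), the covariance matrix of a small data set can be computed directly:

```python
# Sketch (not from the article): covariance matrix of 3 variables (rows = observations).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # varies together with x -> large positive covariance
z = rng.normal(size=200)             # unrelated to x and y -> covariance near zero

data = np.column_stack([x, y, z])    # 200 observations of 3 variables
cov_matrix = np.cov(data, rowvar=False)
print(cov_matrix)                    # 3 x 3 covariance matrix
```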


The covariance matrix is a **p** × **p** symmetric matrix (where **p** is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables **x**, **y**, and **z**, the covariance matrix is a 3×3 matrix of this form:
The covariance matrix is a **p** × **p** symmetric matrix (**p** being the number of dimensions) that contains the covariances of all pairs of the data set's initial variables. For example, for a data set with three variables **x**, **y**, **z** and three dimensions, the covariance matrix will be a 3 × 3 matrix:

The bold formatting here looks wrong. My current network cannot open the original link; if the translator can open it, please adjust the bold formatting according to the English original.
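
For reference, the 3 × 3 covariance matrix that this paragraph describes, which is not reproduced in this thread, has the standard form:

```latex
\begin{bmatrix}
\operatorname{Cov}(x,x) & \operatorname{Cov}(x,y) & \operatorname{Cov}(x,z) \\
\operatorname{Cov}(y,x) & \operatorname{Cov}(y,y) & \operatorname{Cov}(y,z) \\
\operatorname{Cov}(z,x) & \operatorname{Cov}(z,y) & \operatorname{Cov}(z,z)
\end{bmatrix}
```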


Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main diagonal (Top left to bottom right) we actually have the variances of each initial variable. And since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
Since the covariance of a variable with itself equals its variance (Cov(a,a)=Var(a)), on the main diagonal (top left to bottom right) we have already computed the variance of each initial variable. And since covariance is commutative (Cov(a,b)=Cov(b,a)), every tuple of the covariance matrix is symmetric about the main diagonal, which means the upper and lower triangular portions are equal.

Suggested change
Since the covariance of a variable with itself equals its variance (Cov(a,a)=Var(a)), on the main diagonal (top left to bottom right) we have already computed the variance of each initial variable. And since covariance is commutative (Cov(a,b)=Cov(b,a)), every tuple of the covariance matrix is symmetric about the main diagonal, which means the upper and lower triangular portions are equal.
Since the covariance of a variable with itself equals its variance (Cov(a,a)=Var(a)), on the main diagonal (top left to bottom right) we have already computed the variance of each initial variable. And since covariance is commutative (Cov(a,b)=Cov(b,a)), every element of the covariance matrix is symmetric about the main diagonal, which means the upper and lower triangular portions are equal.
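
Both properties are easy to verify numerically (my own sketch, not part of the article):

```python
# Sketch: the diagonal holds the variances, and the matrix is symmetric.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(200, 3))
cov = np.cov(data, rowvar=False)

print(np.allclose(np.diag(cov), data.var(axis=0, ddof=1)))  # True: diagonal = variances
print(np.allclose(cov, cov.T))                               # True: Cov(a, b) == Cov(b, a)
```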


**What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?**
**How do the covariances, as the tuples of the matrix, tell us about the relationships between the variables?**

Suggested change
**How do the covariances, as the tuples of the matrix, tell us about the relationships between the variables**
**What do the covariance elements of the matrix tell us about the relationships between the variables**


Without further ado, it is eigenvectors and eigenvalues who are behind all the magic explained above, because the eigenvectors of the Covariance matrix are actually **the** **directions of the axes where there is the most variance** (most information) and that we call Principal Components. And eigenvalues are simply the coefficients attached to eigenvectors, which give the **amount of variance carried in each Principal Component**.
To come straight to the point: eigenvectors and eigenvalues are the secret behind principal component analysis. The eigenvectors of the covariance matrix are in fact a set of axes; after projecting the data onto these axes we obtain **the largest variance** (which means more information), and these are exactly the principal components we are looking for. The eigenvalues are simply the coefficients of the eigenvectors, and they represent **how much information** each eigenvector carries.

开门见山的说 ("to come straight to the point") ==> 简单地说 ("simply put")
开门见山 does not fit the context here.
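
For readers who want to see this step in practice, here is a minimal sketch (my own, not the article's code; the data is synthetic) that extracts the principal directions and their variances from the covariance matrix:

```python
# Sketch: principal components as eigenvectors of the covariance matrix.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=300)])

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: symmetric matrix, ascending eigenvalues

order = np.argsort(eigenvalues)[::-1]            # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvectors[:, 0])  # first principal component: direction of maximal variance
print(eigenvalues)         # variance carried by each principal component
```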

@TrWestdoor
Contributor

@Ultrasteve @fanyijihua Proofreading complete

@Ultrasteve
Contributor Author

The translation is very good. I have suggested a few sentence adjustments to make it read more naturally in Chinese.
PS: The translator rendered some occurrences of "reduce the number of variables" as 减少数据维度 ("reduce the data's dimensionality"). The intent is the same as reducing the number of variables, but since "variable" is translated as 变量 throughout the context, and Chinese blog posts also mostly describe this as 变量数目 ("the number of variables"), I suggest the translator take this into account; I have not marked every occurrence.

Thanks, and thanks for your hard work!

tsonglew
tsonglew previously approved these changes Jul 29, 2019

@tsonglew tsonglew left a comment


@Ultrasteve @leviding I have left some suggestions for the translator to consider


Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller data sets are easier to explore and visualize and make analyzing data much easier and faster for machine learning algorithms without extraneous variables to process.
Reducing the dimensionality of the data naturally sacrifices some accuracy, but remarkably, in dimensionality-reduction algorithms the loss of accuracy is not large. This is because lower-dimensional data is easier to explore and visualize, and in data analysis and machine learning algorithms we no longer have to process extraneous variables, which makes the whole process more efficient.

"but remarkably, in dimensionality-reduction algorithms the loss of accuracy is not large" => "but the trick of dimensionality reduction is to trade a little accuracy for simplicity"


More specifically, the reason why it is critical to perform standardization prior to PCA, is that the latter is quite sensitive regarding the variances of the initial variables. That is, if there are large differences between the ranges of initial variables, those variables with larger ranges will dominate over those with small ranges (For example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.
More specifically, the reason for standardizing the data before PCA is that the subsequent results are very sensitive to the variances of the data. That is, dimensions with larger variance will have a greater impact on the result than dimensions with smaller variance (for example, a dimension that varies between 1 and 100 affects the result more than one that varies between 0 and 1), which leads to a strongly biased result. So transforming the data onto comparable scales can prevent this problem.

"dimensions with larger variance will have a greater impact on the result than dimensions with smaller variance" => "dimensions with larger value ranges will have a greater impact than those with relatively smaller ranges"
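
A minimal sketch of that standardization step (my own illustration, assuming the usual z-score standardization the article describes; the data is made up):

```python
# Sketch: z-score standardization so variables on very different scales
# (e.g. 0-100 vs 0-1) contribute comparably to PCA.
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([
    rng.uniform(0, 100, size=200),  # large-range variable
    rng.uniform(0, 1, size=200),    # small-range variable
])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6))  # ~0 for both variables
print(X_std.std(axis=0))            # ~1 for both variables
```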


The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them. Because sometimes, variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
The goal of this step is to understand how the variables in the data set vary from the mean, and what relationships the different features have with one another. In other words, we want to see whether there is some relationship between the features. In fact, features often carry some redundant information, which sometimes makes them highly correlated. To understand this relationship, we need to compute the covariance matrix.

"In fact, features often carry some redundant information, which sometimes makes them highly correlated" => "Sometimes features are highly correlated, and therefore carry some redundant information"


The covariance matrix is a **p** × **p** symmetric matrix (where **p** is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables **x**, **y**, and **z**, the covariance matrix is a 3×3 matrix of this form:
The covariance matrix is a **p** × **p** symmetric matrix (**p** being the number of dimensions) that contains the covariances of all pairs of the data set's initial variables. For example, for a data set with three variables **x**, **y**, **z** and three dimensions, the covariance matrix will be a 3 × 3 matrix:

In this paragraph, the formatting of the Juejin article differs slightly from the original.


**What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?**
**How do the covariances, as the tuples of the matrix, tell us about the relationships between the variables?**

"How do the covariances, as the tuples of the matrix, tell us about the relationships between the variables?" => "What kind of relationships between the variables do the elements of the covariance matrix tell us about?"


Geometrically speaking, principal components represent the directions of the data that explain a **maximal amount of variance**, that is to say, the lines that capture most information of the data. The relationship between variance and information here, is that, the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more the information it has. To put all this simply, just think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between the observations are better visible.
Theoretically speaking, the principal components represent the directions that carry **the most variable information**. For a principal component, the larger the variance of the variables, the more scattered the points are in space; and the more scattered the points along it, the more information it contains. Simply put, a principal component is a new axis that presents the data's information better, so that differences are easier to observe from it.

"the most variable information" => "the maximal variance"
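
To connect variance and information numerically, here is a small sketch (my own, not from the article): projecting centered data onto the eigenvectors and measuring the spread along each axis reproduces the eigenvalues, so the first principal component is indeed the direction with the largest dispersion:

```python
# Sketch: the dispersion of the data along each principal component
# equals that component's eigenvalue (its share of the information).
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=400)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=400)])
X = X - X.mean(axis=0)                  # center the data

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

projections = X @ eigenvectors          # coordinates of each point along each component
print(projections.var(axis=0, ddof=1))  # dispersion along each axis...
print(eigenvalues)                      # ...matches the eigenvalues (ascending order here)
```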

@tsonglew
Contributor

@Ultrasteve @fanyijihua Proofreading complete

@leviding leviding added the enhancement and 等待译者修改 (awaiting translator revision) labels and removed the 正在校对 (proofreading in progress) label Jul 29, 2019
A Step-by-Step Explanation of Principal Component Analysis (proofreading complete)
@Ultrasteve
Contributor Author

@kasheemlew @TrWestdoor @leviding Revisions are done; if there are no issues I'll submit it

@JackEggie JackEggie added the 标注 待管理员 Review (awaiting admin review) label and removed the enhancement and 等待译者修改 (awaiting translator revision) labels Jul 31, 2019
sunui
sunui previously approved these changes Jul 31, 2019
@sunui sunui added the 翻译完成 (translation complete) label and removed the 标注 待管理员 Review (awaiting admin review) label Jul 31, 2019
@Ultrasteve
Contributor Author

@sunui @kasheemlew @TrWestdoor
https://juejin.im/post/5d41321df265da03c926d65a
Published. Thank you, everyone.

@leviding leviding merged commit 2fba2b5 into xitu:master Jul 31, 2019
@leviding
Member

@Ultrasteve It's merged! Please publish it to Juejin soon, then reply in this PR with the article link so that points can be added promptly.

The Juejin Translation Project also has its own Zhihu column; you're welcome to submit to it as well. We also recommend a handy plugin.
Column address: https://zhuanlan.zhihu.com/juejinfanyi

pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019
* How data sharding works in a distributed SQL database

Translation complete; thanks to the proofreaders for their hard work

* back

* A Step-by-Step Explanation of Principal Component Analysis

A Step-by-Step Explanation of Principal Component Analysis

* A Step-by-Step Explanation of Principal Component Analysis (proofreading complete)

A Step-by-Step Explanation of Principal Component Analysis (proofreading complete)

* Update a-step-by-step-explanation-of-principal-component-analysis.md

* Update a-step-by-step-explanation-of-principal-component-analysis.md

* Update a-step-by-step-explanation-of-principal-component-analysis.md
pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019
pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019
@lsvih
Member

lsvih commented Oct 18, 2019
