
由浅入深理解主成分分析 (A Step-by-Step Explanation of Principal Component Analysis) #6231

Conversation

Ultrasteve
Contributor

Translation finished, resolves #6192

@tsonglew
Contributor

Claiming this for proofreading @leviding

@fanyijihua
Collaborator

@kasheemlew Sure thing 🍺

@TrWestdoor
Contributor

@leviding Claiming this for proofreading

@fanyijihua
Collaborator

@TrWestdoor You got it 🍻


@TrWestdoor TrWestdoor left a comment


The translation is very good. I have suggested a few sentence adjustments to make it read more naturally in Chinese.
PS: The translator rendered some occurrences of "reduce the number of variables" as 减少数据维度 ("reduce the data's dimensionality"). The intent is the same as reducing the number of variables, but since "variable" is translated as 变量 throughout the context, and Chinese blog posts also mostly describe this as 变量数目 ("the number of variables"), I suggest the translator take this into account; I have not marked every occurrence.


PCA is actually a widely covered method on the web, and there are some great articles about it, but only few of them go straight to the point and explain how it works without diving too much into the technicalities and the ‘why’ of things. That’s the reason why i decided to make my own post to present it in a simplified way.
There are already many articles on the web introducing PCA, some of them of high quality, but few of them get straight to the point and explain how it works; instead they tend to get overly bogged down in the techniques and principles behind PCA. So I decided, in my own way and in simple, easy-to-understand terms, to introduce PCA to everyone.

Suggested change
There are already many articles on the web introducing PCA, some of them of high quality, but few of them get straight to the point and explain how it works; instead they tend to get overly bogged down in the techniques and principles behind PCA. So I decided, in my own way and in simple, easy-to-understand terms, to introduce PCA to everyone.
There are already many articles on the web introducing PCA, some of them of high quality, but few of them get straight to the point and explain how it works; instead they tend to get overly bogged down in the techniques and principles behind PCA. So I decided, in my own way, to give everyone a simple, easy-to-understand introduction to PCA.


Before getting to the explanation, this post provides logical explanations of what PCA is doing in each step and simplifies the mathematical concepts behind it, as standardization, covariance, eigenvectors and eigenvalues without focusing on how to compute them.
Before explaining PCA, this article first walks logically through what PCA does at each step, while simplifying the mathematical concepts behind it. We will cover standardization, covariance, eigenvectors, and eigenvalues, but we will not explain how to compute them.

Suggested change
Before explaining PCA, this article first walks logically through what PCA does at each step, while simplifying the mathematical concepts behind it. We will cover standardization, covariance, eigenvectors, and eigenvalues, but we will not explain how to compute them
Before explaining PCA, this article first walks logically through what PCA does at each step, while simplifying the mathematical concepts behind it. We will cover standardization, covariance, eigenvectors, and eigenvalues, but we will not focus on how to compute them


## So what is Principal Component Analysis ?
## What is PCA?

Suggested change
## What is PCA
## What is Principal Component Analysis


Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
PCA (Principal Component Analysis) is a dimensionality-reduction method, often used to reduce the dimensionality of data sets whose dimensionality is very high. It transforms the variables in a large data set into lower-dimensional variables while retaining most of the information in those variables.

Suggested change
PCA (Principal Component Analysis) is a dimensionality-reduction method, often used to reduce the dimensionality of data sets whose dimensionality is very high. It transforms the variables in a large data set into lower-dimensional variables while retaining most of the information in those variables
PCA (Principal Component Analysis) is a dimensionality-reduction method, often used to reduce the dimensionality of high-dimensional data sets. It transforms a large set of variables into a smaller set of variables while retaining most of the information in the larger set
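
To make this concrete for readers of the thread, here is a minimal sketch (my own illustration, not part of the translated article; the synthetic data and variable names are made up) of reducing ten correlated variables to three principal components while keeping most of the information:

```python
# Sketch (not from the article): compress 10 correlated variables into 3 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))                  # 3 underlying factors
mixing = rng.normal(size=(3, 10))                   # spread them across 10 observed variables
X = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)                    # 500 x 3 instead of 500 x 10
print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())          # close to 1: little information lost
```

When the summed explained-variance ratio is close to 1, almost no information was lost by keeping only three of the ten variables.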


Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller data sets are easier to explore and visualize and make analyzing data much easier and faster for machine learning algorithms without extraneous variables to process.
Reducing the dimensionality of the data naturally sacrifices some accuracy, but remarkably, in dimensionality-reduction algorithms the loss of accuracy is not large. This is because lower-dimensional data is easier to explore and visualize, and in data analysis and machine learning algorithms we no longer have to process extraneous variables, which makes the whole process more efficient.

Suggested change
Reducing the dimensionality of the data naturally sacrifices some accuracy, but remarkably, in dimensionality-reduction algorithms the loss of accuracy is not large. This is because lower-dimensional data is easier to explore and visualize, and in data analysis and machine learning algorithms we no longer have to process extraneous variables, which makes the whole process more efficient.
Reducing the number of variables in a data set generally sacrifices some accuracy, but remarkably, dimensionality reduction trades a small amount of accuracy for simplicity. This is because lower-dimensional data is easier to explore and visualize, and in data analysis and machine learning algorithms we no longer have to process extraneous variables, which makes the whole process more efficient.


The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them. Because sometimes, variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
The goal of this step is to understand how the variables in the data set vary from the mean, and what relationships the different features have with one another. In other words, we want to see whether there is some relationship between the features. In fact, features often carry some redundant information, which sometimes makes them highly correlated. To understand this relationship, we need to compute the covariance matrix.

Suggested change
The goal of this step is to understand how the variables in the data set vary from the mean, and what relationships the different features have with one another. In other words, we want to see whether there is some relationship between the features. In fact, features often carry some redundant information, which sometimes makes them highly correlated. To understand this relationship, we need to compute the covariance matrix.
The goal of this step is to understand how the variables in the data set vary from the mean, and what relationships the different features have with one another. In other words, we want to see whether there is some relationship between the features. Because sometimes, features carry some redundant information, which makes them highly correlated. To understand this relationship, we need to compute the covariance matrix.
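
As a quick illustration of that step (my own sketch, not from the article; the data is synthetic), the covariance matrix of a small data set can be computed directly:

```python
# Sketch (not from the article): covariance matrix of 3 variables (rows = observations).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)   # varies together with x -> large positive covariance
z = rng.normal(size=200)             # unrelated to x and y -> covariance near zero

data = np.column_stack([x, y, z])    # 200 observations of 3 variables
cov_matrix = np.cov(data, rowvar=False)
print(cov_matrix)                    # 3 x 3 covariance matrix
```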


The covariance matrix is a **p** × **p** symmetric matrix (where **p** is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables **x**, **y**, and **z**, the covariance matrix is a 3×3 matrix of this form:
The covariance matrix is a **p** × **p** symmetric matrix (**p** being the number of dimensions) that contains the covariances of all pairs of the data set's initial variables. For example, for a data set with three variables **x**, **y**, **z** and three dimensions, the covariance matrix will be a 3 × 3 matrix:

The bold formatting here looks wrong. My current network cannot open the original link; if the translator can open it, please adjust the bold formatting according to the English original.
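
For reference, the 3 × 3 covariance matrix that this paragraph describes, which is not reproduced in this thread, has the standard form:

```latex
\begin{bmatrix}
\operatorname{Cov}(x,x) & \operatorname{Cov}(x,y) & \operatorname{Cov}(x,z) \\
\operatorname{Cov}(y,x) & \operatorname{Cov}(y,y) & \operatorname{Cov}(y,z) \\
\operatorname{Cov}(z,x) & \operatorname{Cov}(z,y) & \operatorname{Cov}(z,z)
\end{bmatrix}
```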


Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main diagonal (Top left to bottom right) we actually have the variances of each initial variable. And since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
Since the covariance of a variable with itself equals its variance (Cov(a,a)=Var(a)), on the main diagonal (top left to bottom right) we have already computed the variance of each initial variable. And since covariance is commutative (Cov(a,b)=Cov(b,a)), every tuple of the covariance matrix is symmetric about the main diagonal, which means the upper and lower triangular portions are equal.

Suggested change
Since the covariance of a variable with itself equals its variance (Cov(a,a)=Var(a)), on the main diagonal (top left to bottom right) we have already computed the variance of each initial variable. And since covariance is commutative (Cov(a,b)=Cov(b,a)), every tuple of the covariance matrix is symmetric about the main diagonal, which means the upper and lower triangular portions are equal.
Since the covariance of a variable with itself equals its variance (Cov(a,a)=Var(a)), on the main diagonal (top left to bottom right) we have already computed the variance of each initial variable. And since covariance is commutative (Cov(a,b)=Cov(b,a)), every element of the covariance matrix is symmetric about the main diagonal, which means the upper and lower triangular portions are equal.
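
Both properties are easy to verify numerically (my own sketch, not part of the article):

```python
# Sketch: the diagonal holds the variances, and the matrix is symmetric.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(200, 3))
cov = np.cov(data, rowvar=False)

print(np.allclose(np.diag(cov), data.var(axis=0, ddof=1)))  # True: diagonal = variances
print(np.allclose(cov, cov.T))                               # True: Cov(a, b) == Cov(b, a)
```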


**What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?**
**How do the covariances, as the tuples of the matrix, tell us about the relationships between the variables?**

Suggested change
**How do the covariances, as the tuples of the matrix, tell us about the relationships between the variables**
**What do the covariance elements of the matrix tell us about the relationships between the variables**


Without further ado, it is eigenvectors and eigenvalues who are behind all the magic explained above, because the eigenvectors of the Covariance matrix are actually **the** **directions of the axes where there is the most variance** (most information) and that we call Principal Components. And eigenvalues are simply the coefficients attached to eigenvectors, which give the **amount of variance carried in each Principal Component**.
To come straight to the point: eigenvectors and eigenvalues are the secret behind principal component analysis. The eigenvectors of the covariance matrix are in fact a set of axes; after projecting the data onto these axes we obtain **the largest variance** (which means more information), and these are exactly the principal components we are looking for. The eigenvalues are simply the coefficients of the eigenvectors, and they represent **how much information** each eigenvector carries.

开门见山的说 ("to come straight to the point") ==> 简单地说 ("simply put")
开门见山 does not fit the context here.
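
For readers who want to see this step in practice, here is a minimal sketch (my own, not the article's code; the data is synthetic) that extracts the principal directions and their variances from the covariance matrix:

```python
# Sketch: principal components as eigenvectors of the covariance matrix.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=300)])

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: symmetric matrix, ascending eigenvalues

order = np.argsort(eigenvalues)[::-1]            # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvectors[:, 0])  # first principal component: direction of maximal variance
print(eigenvalues)         # variance carried by each principal component
```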

@TrWestdoor
Contributor

@Ultrasteve @fanyijihua Proofreading complete

@Ultrasteve
Contributor Author

The translation is very good. I have suggested a few sentence adjustments to make it read more naturally in Chinese.
PS: The translator rendered some occurrences of "reduce the number of variables" as 减少数据维度 ("reduce the data's dimensionality"). The intent is the same as reducing the number of variables, but since "variable" is translated as 变量 throughout the context, and Chinese blog posts also mostly describe this as 变量数目 ("the number of variables"), I suggest the translator take this into account; I have not marked every occurrence.

Thanks, and thanks for your hard work!

tsonglew
tsonglew previously approved these changes Jul 29, 2019

@tsonglew tsonglew left a comment


@Ultrasteve @leviding I have left some suggestions for the translator to consider


Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller data sets are easier to explore and visualize and make analyzing data much easier and faster for machine learning algorithms without extraneous variables to process.
Reducing the dimensionality of the data naturally sacrifices some accuracy, but remarkably, in dimensionality-reduction algorithms the loss of accuracy is not large. This is because lower-dimensional data is easier to explore and visualize, and in data analysis and machine learning algorithms we no longer have to process extraneous variables, which makes the whole process more efficient.

"but remarkably, in dimensionality-reduction algorithms the loss of accuracy is not large" => "but the trick of dimensionality reduction is to trade a little accuracy for simplicity"


More specifically, the reason why it is critical to perform standardization prior to PCA, is that the latter is quite sensitive regarding the variances of the initial variables. That is, if there are large differences between the ranges of initial variables, those variables with larger ranges will dominate over those with small ranges (For example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.
More specifically, the reason for standardizing the data before PCA is that the subsequent results are very sensitive to the variances of the data. That is, dimensions with larger variance will have a greater impact on the result than dimensions with smaller variance (for example, a dimension that varies between 1 and 100 affects the result more than one that varies between 0 and 1), which leads to a strongly biased result. So transforming the data onto comparable scales can prevent this problem.

"dimensions with larger variance will have a greater impact on the result than dimensions with smaller variance" => "dimensions with larger value ranges will have a greater impact than those with relatively smaller ranges"
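
A minimal sketch of that standardization step (my own illustration, assuming the usual z-score standardization the article describes; the data is made up):

```python
# Sketch: z-score standardization so variables on very different scales
# (e.g. 0-100 vs 0-1) contribute comparably to PCA.
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([
    rng.uniform(0, 100, size=200),  # large-range variable
    rng.uniform(0, 1, size=200),    # small-range variable
])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6))  # ~0 for both variables
print(X_std.std(axis=0))            # ~1 for both variables
```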


The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them. Because sometimes, variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
The goal of this step is to understand how the variables in the data set vary from the mean, and what relationships the different features have with one another. In other words, we want to see whether there is some relationship between the features. In fact, features often carry some redundant information, which sometimes makes them highly correlated. To understand this relationship, we need to compute the covariance matrix.

"In fact, features often carry some redundant information, which sometimes makes them highly correlated" => "Sometimes features are highly correlated, and therefore carry some redundant information"


The covariance matrix is a **p** × **p** symmetric matrix (where **p** is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables **x**, **y**, and **z**, the covariance matrix is a 3×3 matrix of this form:
The covariance matrix is a **p** × **p** symmetric matrix (**p** being the number of dimensions) that contains the covariances of all pairs of the data set's initial variables. For example, for a data set with three variables **x**, **y**, **z** and three dimensions, the covariance matrix will be a 3 × 3 matrix:

In this paragraph, the formatting of the Juejin article differs slightly from the original.


**What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?**
**How do the covariances, as the tuples of the matrix, tell us about the relationships between the variables?**

"How do the covariances, as the tuples of the matrix, tell us about the relationships between the variables?" => "What kind of relationships between the variables do the elements of the covariance matrix tell us about?"


Geometrically speaking, principal components represent the directions of the data that explain a **maximal amount of variance**, that is to say, the lines that capture most information of the data. The relationship between variance and information here, is that, the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more the information it has. To put all this simply, just think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between the observations are better visible.
Theoretically speaking, the principal components represent the directions that carry **the most variable information**. For a principal component, the larger the variance of the variables, the more scattered the points are in space; and the more scattered the points along it, the more information it contains. Simply put, a principal component is a new axis that presents the data's information better, so that differences are easier to observe from it.

"the most variable information" => "the maximal variance"
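
To connect variance and information numerically, here is a small sketch (my own, not from the article): projecting centered data onto the eigenvectors and measuring the spread along each axis reproduces the eigenvalues, so the first principal component is indeed the direction with the largest dispersion:

```python
# Sketch: the dispersion of the data along each principal component
# equals that component's eigenvalue (its share of the information).
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=400)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=400)])
X = X - X.mean(axis=0)                  # center the data

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

projections = X @ eigenvectors          # coordinates of each point along each component
print(projections.var(axis=0, ddof=1))  # dispersion along each axis...
print(eigenvalues)                      # ...matches the eigenvalues (ascending order here)
```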

@tsonglew
Contributor

@Ultrasteve @fanyijihua Proofreading complete

@leviding leviding added the enhancement and 等待译者修改 (awaiting translator revision) labels and removed the 正在校对 (proofreading in progress) label Jul 29, 2019
A Step-by-Step Explanation of Principal Component Analysis (proofreading complete)
@Ultrasteve
Contributor Author

@kasheemlew @TrWestdoor @leviding Revisions are done; if there are no issues I'll submit it

@JackEggie JackEggie added the 标注 待管理员 Review (awaiting admin review) label and removed the enhancement and 等待译者修改 (awaiting translator revision) labels Jul 31, 2019
sunui
sunui previously approved these changes Jul 31, 2019
@sunui sunui added the 翻译完成 (translation complete) label and removed the 标注 待管理员 Review (awaiting admin review) label Jul 31, 2019
@Ultrasteve
Contributor Author

@sunui @kasheemlew @TrWestdoor
https://juejin.im/post/5d41321df265da03c926d65a
Published. Thank you, everyone.

@leviding leviding merged commit 2fba2b5 into xitu:master Jul 31, 2019
@leviding
Member

@Ultrasteve It's merged! Please publish it to Juejin soon, then reply in this PR with the article link so that points can be added promptly.

The Juejin Translation Project also has its own Zhihu column; you're welcome to submit to it as well. We also recommend a handy plugin.
Column address: https://zhuanlan.zhihu.com/juejinfanyi

pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019
* How data sharding works in a distributed SQL database

Translation complete; thanks to the proofreaders for their hard work

* back

* A Step-by-Step Explanation of Principal Component Analysis

A Step-by-Step Explanation of Principal Component Analysis

* A Step-by-Step Explanation of Principal Component Analysis (proofreading complete)

A Step-by-Step Explanation of Principal Component Analysis (proofreading complete)

* Update a-step-by-step-explanation-of-principal-component-analysis.md

* Update a-step-by-step-explanation-of-principal-component-analysis.md

* Update a-step-by-step-explanation-of-principal-component-analysis.md
pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019
pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019
@lsvih
Member

lsvih commented Oct 18, 2019
