Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

什么时候需要进行数据的标准化? 为什么? #6214

Merged

Conversation

Ultrasteve
Copy link
Contributor

译文翻译完成,resolve #6193

什么时候需要进行数据的标准化? 为什么?
when-to-standardize-your-data.md
什么时候需要进行数据的标准化? 为什么?

### Z-score

Z-score is one of the most popular methods to standardize data, and can be done by subtracting the mean and dividing by the standard deviation for each value of each feature.
`Z-score`是最受欢迎的数据标准化方法之一,在这种方法中,我们对每一项数据减去它的平均值并除以它的方差。
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`Z-score`是最受欢迎的数据标准化方法之一,在这种方法中,我们对每一项数据减去它的平均值并除以它的方差。
`Z-score` 是最受欢迎的数据标准化方法之一,在这种方法中,我们对每一项数据减去它的平均值并除以它的方差。


**1- Before PCA:**
**1- 主成分分析:**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**1- 主成分分析:**
**1- 主成分分析**

Copy link
Member

@leviding leviding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

类似的问题,全文检查下哈


### References:
### 参考文献:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
### 参考文献:
### 参考文献

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好的

@leviding leviding added the AI label Jul 25, 2019
@TBLGSn
Copy link

TBLGSn commented Jul 27, 2019

@leviding 校对认领

@fanyijihua
Copy link
Collaborator

@TBLGSn 好的呢 🍺

Copy link

@TBLGSn TBLGSn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 .部分专业术语错误
2. 部分标点符号或者格式错误(数字前后应当使用空格)
3. 部分地方用词不当。如:多次出现的”防止“
4.部分语句表达没有达到“口语化”
5.部分翻译为严格按照原文
例外:

  1. 文章字体从64行开始便为斜体了.
  2. 22 行引申出的“预测结果”,52 行第一句,56 行引申出来的“测量结果”,60 行引申出来的“计算结果”
    值得一起商榷.
  3. 60,66,70 行,fitting一词是否使用专业术语“拟合”代替”使用“?


Some ML developers tend to standardize their data blindly before “every” Machine Learning model without taking the effort to understand why it must be used, or even if it’s needed or not. So the goal of this post is to explain how, why and when to standardize data.
一些机器学习工程师通常在使用所有机器学习模型之前,会盲目地对他们的数据进行标准化,然而,其实他们并不清楚数据标准化的理由,更不知道什么情况下使用这一技术是必要的,什么时候不是。因此,在这篇文章中,我们会阐述数据标准化的原因,及什么时候需要这样做,做法是什么?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. 会盲目地对他们的数据进行标准化 => 倾向于盲目地对他们的数据进行标准化
  2. 在这篇文章中,我们会阐述数据标准化的原因,及什么时候需要这样做,做法是什么? => 因此,这篇文章的目标是解释如何,为什么以及何时标准化数据。


Standardization comes into picture when features of input data set have large differences between their ranges, or simply when they are measured in different measurement units (e.g., Pounds, Meters, Miles … etc).
当输入数据的变化范围很大,或者它们各自使用的单位不同时(比如说一些用米,一些用厘米),我们会想到对数据进行标准化。
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

当输入数据的变化范围很大 => 当输入数据集的特征在它们的范围之间具有大差异时


These differences in the ranges of initial features causes trouble to many machine learning models. For example, for the models that are based on distance computation, if one of the features has a broad range of values, the distance will be governed by this particular feature.
像这种初始特征变化范围较大的数据,会在我们使用许多机器学习模型时造成麻烦。例如,有一个模型是基于距离的,当其中一个特征值变化范围较大时,那么预测结果很大程度上就会受到它的影响。
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

建议 : 像这种初始特征变化范围较大的数据,会在我们使用许多机器学习模型时造成麻烦。
=> 这些初始特征范围的差异,会给许多机器学习模型带来不必要的麻烦


These differences in the ranges of initial features causes trouble to many machine learning models. For example, for the models that are based on distance computation, if one of the features has a broad range of values, the distance will be governed by this particular feature.
像这种初始特征变化范围较大的数据,会在我们使用许多机器学习模型时造成麻烦。例如,有一个模型是基于距离的,当其中一个特征值变化范围较大时,那么预测结果很大程度上就会受到它的影响。
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有一个模型是基于距离的 => 对于基于距离计算的模型来说


To illustrate this with an example: say we have a 2-dimensional data set with two features, Height in Meters and Weight in Pounds, that range respectively from [1 to 2] Meters and [10 to 200] Pounds. No matter what distance based model you perform on this data set, the Weight feature will dominate over the Height feature and will have more contribution to the distance computation, just because it has bigger values compared to the Height. So, to prevent this problem, transforming features to comparable scales using standardization is the solution.
我们这里举一个例子。现在我们有一个二维的数据集,它有两个特征,以米为单位的高度(范围是1到2米)和以磅为单位的重量(范围是10到200磅)。不论你使用什么基于距离的模型,重量特征对结果的影响都会大大的高于高度特征,因为它的数据变化范围相对更大。因此,为了防止这种问题发生,我们会在这里用到数据标准化来约束重量特征的数据变化范围。
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不论你使用什么基于距离的模型 => 无论你在这个数据集上使用什么基于距离的模型


To illustrate this with an example: say we have a 2-dimensional data set with two features, Height in Meters and Weight in Pounds, that range respectively from [1 to 2] Meters and [10 to 200] Pounds. No matter what distance based model you perform on this data set, the Weight feature will dominate over the Height feature and will have more contribution to the distance computation, just because it has bigger values compared to the Height. So, to prevent this problem, transforming features to comparable scales using standardization is the solution.
我们这里举一个例子。现在我们有一个二维的数据集,它有两个特征,以米为单位的高度(范围是1到2米)和以磅为单位的重量(范围是10到200磅)。不论你使用什么基于距离的模型,重量特征对结果的影响都会大大的高于高度特征,因为它的数据变化范围相对更大。因此,为了防止这种问题发生,我们会在这里用到数据标准化来约束重量特征的数据变化范围。
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

数字前后应该有一个空格


![](https://cdn-images-1.medium.com/max/NaN/0*AgmY9auxftS9BI73.png)

Once the standardization is done, all the features will have a mean of zero, a standard deviation of one, and thus, the same scale.
一旦完成了数据标准化, 所有特征对应的数据平均值变为0, 方差变为1, 因此, 所有特征的数据变化范围现在是一致的.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

标点符号应用全角符号


You can measure variable importance in regression analysis, by fitting a regression model using the **standardized** independent variables and comparing the absolute value of their standardized coefficients. But, if the independent variables are not standardized, comparing their coefficients becomes meaningless.
你可以在回归分析中测量变量的重要程度。首先使用标准化过后的独立变量来训练模型,然后计算它们对应的标准化系数的绝对值差就能得出结论。然而,如果独立变量是未经标准化的,那比较它们的系数将毫无意义。
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

standardized 应 加粗


As seen above, for distance based models, standardization is performed to prevent features with wider ranges from dominating the distance metric. But the reason we standardize data is not the same for all machine learning models, and differs from one model to another.
就如上面所说,在基于距离的模型中,数据标准化用于防止大范围的特征对预测结果进行较大的影响。不过数据标准化的理由不仅仅只有这一个,对于不同的模型会有不同的理由。
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

建议 :不过数据标准化的理由不仅仅只有这一个,对于不同的模型会有不同的理由。
=> 不过使用标准化的原因不仅仅只有这一个,对于不同的模型会有不同的原因。


So before which ML models and methods you have to standardize your data and why?
那么,在使用什么机器学习算法和模型之前,我们需要进行数据标准化呢?原因又是什么?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

建议遵照原文:
算法 => 方法

@TBLGSn
Copy link

TBLGSn commented Jul 27, 2019

@Ultrasteve @leviding OK了

@Ultrasteve
Copy link
Contributor Author

@Ultrasteve @leviding OK了

幸苦啦,等有两个人校对完了我再修改

@leviding leviding added the enhancement 等待译者修改 label Jul 28, 2019
什么时候需要进行数据的标准化? 为什么? (校对完成)
@Ultrasteve
Copy link
Contributor Author

@TBLGSn @leviding 我改好了,没问题就交了

@leviding leviding added 标注 待管理员 Review and removed enhancement 等待译者修改 校对认领 正在校对 labels Jul 31, 2019
@leviding leviding merged commit 012e8c9 into xitu:master Jul 31, 2019
@leviding
Copy link
Member

@Ultrasteve 已经 merge 啦~ 快快麻溜发布到 掘金,然后在本 PR 下回复文章链接,方便及时添加积分哟。

掘金翻译计划有自己的知乎专栏,你也可以投稿哈,推荐使用一个好用的插件
专栏地址:https://zhuanlan.zhihu.com/juejinfanyi

@leviding leviding added 翻译完成 and removed 标注 待管理员 Review labels Jul 31, 2019
@Ultrasteve
Copy link
Contributor Author

文章已发布
https://juejin.im/post/5d41a46bf265da03d727f85d

pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019
* 什么时候需要进行数据的标准化? 为什么?

什么时候需要进行数据的标准化? 为什么?

* when-to-standardize-your-data.md

when-to-standardize-your-data.md

* 什么时候需要进行数据的标准化? 为什么?

什么时候需要进行数据的标准化? 为什么?

* 什么时候需要进行数据的标准化? 为什么? (校对完成)

什么时候需要进行数据的标准化? 为什么? (校对完成)

* Update when-to-standardize-your-data.md
pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019
pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

什么时候需要进行数据的标准化? 为什么?
4 participants