When should you standardize your data, and why? #6214

Ultrasteve · 2019-07-25T03:00:43Z

Translation complete, resolve #6193

When should you standardize your data, and why?

when-to-standardize-your-data.md

When should you standardize your data, and why?

leviding · 2019-07-25T03:02:01Z

TODO1/when-to-standardize-your-data.md


 ### Z-score

-Z-score is one of the most popular methods to standardize data, and can be done by subtracting the mean and dividing by the standard deviation for each value of each feature.
+`Z-score`是最受欢迎的数据标准化方法之一，在这种方法中，我们对每一项数据减去它的平均值并除以它的方差。


Suggested change

`Z-score`是最受欢迎的数据标准化方法之一，在这种方法中，我们对每一项数据减去它的平均值并除以它的方差。

`Z-score` 是最受欢迎的数据标准化方法之一，在这种方法中，我们对每一项数据减去它的平均值并除以它的方差。
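
For reference while reviewing this hunk: the English source divides by the standard deviation, whereas the Chinese line uses 方差 (variance). The sketch below is only an illustration of the difference, using Python's standard `statistics` module; the function name `z_scores` and the sample values are hypothetical and do not come from the article.

```python
from statistics import fmean, pstdev, pvariance

def z_scores(values):
    """Standardize one feature: subtract its mean, divide by its standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation = sqrt(variance)
    if std == 0:
        raise ValueError("a constant feature has no spread to standardize")
    return [(v - mean) / std for v in values]

# Hypothetical weights in pounds, chosen only for illustration.
weights = [10.0, 50.0, 120.0, 200.0]

scaled = z_scores(weights)

# Dividing by the variance instead does NOT give unit spread:
mean = fmean(weights)
by_variance = [(v - mean) / pvariance(weights) for v in weights]

print(fmean(scaled), pstdev(scaled))  # mean close to 0, standard deviation close to 1
print(pstdev(by_variance))            # equals 1 / pstdev(weights), not 1
```

Dividing by the standard deviation is what gives the result a mean of zero and a standard deviation of one, which is the property the article relies on later.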

leviding · 2019-07-25T03:02:22Z

TODO1/when-to-standardize-your-data.md


-**1- Before PCA:**
+**1- 主成分分析:**


Suggested change

**1- 主成分分析:**

**1- 主成分分析：**

leviding

There are similar issues elsewhere; please check the whole article.

leviding · 2019-07-25T03:02:39Z

TODO1/when-to-standardize-your-data.md


-### References:
+### 参考文献:


Suggested change

### 参考文献:

### 参考文献：

TBLGSn · 2019-07-27T06:58:07Z

@leviding Claiming the proofreading

fanyijihua · 2019-07-27T06:58:09Z

@TBLGSn Sure thing 🍺

TBLGSn

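1. Some technical terms are wrong.
2. Some punctuation or formatting is wrong (numbers should have a space before and after them).
3. Some word choices are inappropriate, e.g. the repeated use of "防止" ("prevent").
4. Some sentences do not read as natural, conversational Chinese.
5. Some passages do not follow the original closely.
In addition:

From line 64 onward the article text is rendered in italics.
The "预测结果" ("prediction result") introduced on line 22, the first sentence of line 52, the "测量结果" ("measurement result") introduced on line 56, and the "计算结果" ("calculation result") introduced on line 60 are worth discussing together.
On lines 60, 66 and 70, should "fitting" be rendered with the technical term "拟合" ("fit") rather than "使用" ("use")?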

TBLGSn · 2019-07-27T07:20:55Z

TODO1/when-to-standardize-your-data.md


-Some ML developers tend to standardize their data blindly before “every” Machine Learning model without taking the effort to understand why it must be used, or even if it’s needed or not. So the goal of this post is to explain how, why and when to standardize data.
+一些机器学习工程师通常在使用所有机器学习模型之前，会盲目地对他们的数据进行标准化，然而，其实他们并不清楚数据标准化的理由，更不知道什么情况下使用这一技术是必要的，什么时候不是。因此，在这篇文章中，我们会阐述数据标准化的原因，及什么时候需要这样做，做法是什么？


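会盲目地对他们的数据进行标准化 => 倾向于盲目地对他们的数据进行标准化 (i.e. "blindly standardize their data" => "tend to blindly standardize their data")

在这篇文章中，我们会阐述数据标准化的原因，及什么时候需要这样做，做法是什么？ => 因此，这篇文章的目标是解释如何，为什么以及何时标准化数据。 (i.e. "in this post we explain the reasons for standardization, when it is needed, and how is it done?" => "so the goal of this post is to explain how, why and when to standardize data.")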

TBLGSn · 2019-07-27T07:34:50Z

TODO1/when-to-standardize-your-data.md


-Standardization comes into picture when features of input data set have large differences between their ranges, or simply when they are measured in different measurement units (e.g., Pounds, Meters, Miles … etc).
+当输入数据的变化范围很大，或者它们各自使用的单位不同时（比如说一些用米，一些用厘米），我们会想到对数据进行标准化。


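当输入数据的变化范围很大 => 当输入数据集的特征在它们的范围之间具有大差异时 (i.e. "when the input data varies over a large range" => "when the features of the input data set differ greatly in their ranges")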

TBLGSn · 2019-07-27T07:36:45Z

TODO1/when-to-standardize-your-data.md


-These differences in the ranges of initial features causes trouble to many machine learning models. For example, for the models that are based on distance computation, if one of the features has a broad range of values, the distance will be governed by this particular feature.
+像这种初始特征变化范围较大的数据，会在我们使用许多机器学习模型时造成麻烦。例如，有一个模型是基于距离的，当其中一个特征值变化范围较大时，那么预测结果很大程度上就会受到它的影响。


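Suggestion: 像这种初始特征变化范围较大的数据，会在我们使用许多机器学习模型时造成麻烦。
=> 这些初始特征范围的差异，会给许多机器学习模型带来不必要的麻烦 (i.e. "data whose initial features vary over a large range causes trouble when we use many machine learning models" => "these differences in the ranges of the initial features cause unnecessary trouble for many machine learning models")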

TBLGSn · 2019-07-27T07:42:13Z

TODO1/when-to-standardize-your-data.md


-These differences in the ranges of initial features causes trouble to many machine learning models. For example, for the models that are based on distance computation, if one of the features has a broad range of values, the distance will be governed by this particular feature.
+像这种初始特征变化范围较大的数据，会在我们使用许多机器学习模型时造成麻烦。例如，有一个模型是基于距离的，当其中一个特征值变化范围较大时，那么预测结果很大程度上就会受到它的影响。


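有一个模型是基于距离的 => 对于基于距离计算的模型来说 (i.e. "there is a model that is based on distance" => "for models that are based on distance computation")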

TBLGSn · 2019-07-27T07:46:07Z

TODO1/when-to-standardize-your-data.md


-To illustrate this with an example: say we have a 2-dimensional data set with two features, Height in Meters and Weight in Pounds, that range respectively from [1 to 2] Meters and [10 to 200] Pounds. No matter what distance based model you perform on this data set, the Weight feature will dominate over the Height feature and will have more contribution to the distance computation, just because it has bigger values compared to the Height. So, to prevent this problem, transforming features to comparable scales using standardization is the solution.
+我们这里举一个例子。现在我们有一个二维的数据集，它有两个特征，以米为单位的高度（范围是1到2米）和以磅为单位的重量（范围是10到200磅）。不论你使用什么基于距离的模型，重量特征对结果的影响都会大大的高于高度特征，因为它的数据变化范围相对更大。因此，为了防止这种问题发生，我们会在这里用到数据标准化来约束重量特征的数据变化范围。


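不论你使用什么基于距离的模型 => 无论你在这个数据集上使用什么基于距离的模型 (i.e. "no matter which distance-based model you use" => "no matter which distance-based model you use on this data set")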
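
To make the height/weight example in this hunk concrete, here is a minimal sketch of how the raw weight column can dominate a Euclidean distance. The three data points and the helper `standardize_columns` are hypothetical and chosen only to stay inside the ranges quoted in the source ([1 to 2] meters, [10 to 200] pounds).

```python
import math
from statistics import fmean, pstdev

# Hypothetical people as (height in meters, weight in pounds).
people = [(1.0, 100.0), (2.0, 101.0), (1.1, 150.0)]

def standardize_columns(rows):
    """Z-score each column (feature) independently and return the rows again."""
    scaled_columns = []
    for column in zip(*rows):
        mean, std = fmean(column), pstdev(column)
        scaled_columns.append([(v - mean) / std for v in column])
    return list(zip(*scaled_columns))

a, b, c = people
raw_ab = math.dist(a, b)  # heights differ by the full 1 m range, weights by 1 lb
raw_ac = math.dist(a, c)  # heights nearly equal, weights differ by 50 lb

sa, sb, sc = standardize_columns(people)
scaled_ab = math.dist(sa, sb)
scaled_ac = math.dist(sa, sc)

print(raw_ab, raw_ac)        # the 50 lb gap makes a-c look far larger than a-b
print(scaled_ab, scaled_ac)  # after standardization the two distances are comparable
```

On the raw values the weight gap swamps the height gap even though the heights span their whole range; after standardization both features contribute on a similar scale. `math.dist` requires Python 3.8 or later.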

TBLGSn · 2019-07-27T09:09:12Z

TODO1/when-to-standardize-your-data.md


-To illustrate this with an example: say we have a 2-dimensional data set with two features, Height in Meters and Weight in Pounds, that range respectively from [1 to 2] Meters and [10 to 200] Pounds. No matter what distance based model you perform on this data set, the Weight feature will dominate over the Height feature and will have more contribution to the distance computation, just because it has bigger values compared to the Height. So, to prevent this problem, transforming features to comparable scales using standardization is the solution.
+我们这里举一个例子。现在我们有一个二维的数据集，它有两个特征，以米为单位的高度（范围是1到2米）和以磅为单位的重量（范围是10到200磅）。不论你使用什么基于距离的模型，重量特征对结果的影响都会大大的高于高度特征，因为它的数据变化范围相对更大。因此，为了防止这种问题发生，我们会在这里用到数据标准化来约束重量特征的数据变化范围。


Numbers should have a space before and after them.

TBLGSn · 2019-07-27T09:14:47Z

TODO1/when-to-standardize-your-data.md


 ![](https://cdn-images-1.medium.com/max/NaN/0*AgmY9auxftS9BI73.png)

-Once the standardization is done, all the features will have a mean of zero, a standard deviation of one, and thus, the same scale.
+一旦完成了数据标准化, 所有特征对应的数据平均值变为0, 方差变为1, 因此, 所有特征的数据变化范围现在是一致的.


Punctuation should use full-width characters.

TBLGSn · 2019-07-27T09:19:20Z

TODO1/when-to-standardize-your-data.md


-You can measure variable importance in regression analysis, by fitting a regression model using the **standardized** independent variables and comparing the absolute value of their standardized coefficients. But, if the independent variables are not standardized, comparing their coefficients becomes meaningless.
+你可以在回归分析中测量变量的重要程度。首先使用标准化过后的独立变量来训练模型，然后计算它们对应的标准化系数的绝对值差就能得出结论。然而，如果独立变量是未经标准化的，那比较它们的系数将毫无意义。


"standardized" should be in bold.
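
As background for this hunk, the sketch below illustrates why coefficients are only comparable once the independent variables are standardized. Everything in it is an assumption made for illustration: the data are hypothetical and noise-free (the response is built as `10 * height + 0.1 * weight`), the helper names are invented, and only the predictors are standardized (rescaling the response would not change the ranking).

```python
from statistics import fmean, pstdev

def center(values):
    mean = fmean(values)
    return [v - mean for v in values]

def standardize(values):
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

def fit_two_predictors(x1, x2, y):
    """Least-squares slopes for y ~ x1 + x2; centering absorbs the intercept."""
    x1, x2, y = center(x1), center(x2), center(y)
    s11 = sum(p * p for p in x1)
    s22 = sum(q * q for q in x2)
    s12 = sum(p * q for p, q in zip(x1, x2))
    s1y = sum(p * r for p, r in zip(x1, y))
    s2y = sum(q * r for q, r in zip(x2, y))
    det = s11 * s22 - s12 * s12  # zero only if the predictors are collinear
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)

# Hypothetical data: height in meters, weight in pounds.
heights = [1.0, 1.2, 1.5, 1.7, 2.0]
weights = [150.0, 10.0, 200.0, 60.0, 120.0]
response = [10 * h + 0.1 * w for h, w in zip(heights, weights)]

raw_coefs = fit_two_predictors(heights, weights, response)
std_coefs = fit_two_predictors(standardize(heights), standardize(weights), response)

print(raw_coefs)  # height looks far more important on the raw scale
print(std_coefs)  # per standard deviation, weight has the larger effect here
```

In this constructed example the raw coefficients mostly reflect the units (meters versus pounds), while the standardized ones compare the effect of a one-standard-deviation change in each variable, which is the comparison the source paragraph describes.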

TBLGSn · 2019-07-27T09:35:06Z

TODO1/when-to-standardize-your-data.md


-As seen above, for distance based models, standardization is performed to prevent features with wider ranges from dominating the distance metric. But the reason we standardize data is not the same for all machine learning models, and differs from one model to another.
+就如上面所说，在基于距离的模型中，数据标准化用于防止大范围的特征对预测结果进行较大的影响。不过数据标准化的理由不仅仅只有这一个，对于不同的模型会有不同的理由。


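Suggestion: 不过数据标准化的理由不仅仅只有这一个，对于不同的模型会有不同的理由。
=> 不过使用标准化的原因不仅仅只有这一个，对于不同的模型会有不同的原因。 (i.e. replace "数据标准化的理由" with "使用标准化的原因": "but this is not the only reason for using standardization; different models have different reasons.")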

TBLGSn · 2019-07-27T09:45:44Z

TODO1/when-to-standardize-your-data.md


-So before which ML models and methods you have to standardize your data and why?
+那么，在使用什么机器学习算法和模型之前，我们需要进行数据标准化呢？原因又是什么？


Suggest following the original:
算法 ("algorithms") => 方法 ("methods")

TBLGSn · 2019-07-27T09:47:19Z

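@Ultrasteve @leviding Done

Ultrasteve · 2019-07-27T10:11:44Z

@Ultrasteve @leviding Done

Thanks for the hard work. I'll make the changes once two people have finished proofreading.

When should you standardize your data, and why? (proofreading complete)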

Ultrasteve · 2019-07-30T14:07:15Z

@TBLGSn @leviding I've made the changes; if there are no problems, I'll submit it.

leviding · 2019-07-31T14:08:52Z

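@Ultrasteve Merged~ Please publish it on Juejin soon, then reply with the article link in this PR so that your points can be added promptly.

The Juejin Translation Project has its own Zhihu column, and you are welcome to submit there too; we recommend using a handy plugin.
Column URL: https://zhuanlan.zhihu.com/juejinfanyi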

Ultrasteve · 2019-07-31T14:26:35Z

The article has been published
https://juejin.im/post/5d41a46bf265da03d727f85d


Ultrasteve added 3 commits July 25, 2019 10:49

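When should you standardize your data, and why?

b24d7cf

When should you standardize your data, and why?

when-to-standardize-your-data.md

5b122fe

when-to-standardize-your-data.md

When should you standardize your data, and why?

900a0a0

When should you standardize your data, and why?

fanyijihua added the 校对认领 (proofreading claim) label Jul 25, 2019

fanyijihua mentioned this pull request Jul 25, 2019

When should you standardize your data, and why? #6193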

Closed

leviding reviewed Jul 25, 2019

View reviewed changes

leviding added the AI label Jul 25, 2019

fanyijihua added the 正在校对 (proofreading in progress) label Jul 27, 2019

TBLGSn suggested changes Jul 27, 2019

View reviewed changes

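leviding added the enhancement 等待译者修改 (awaiting translator's revisions) label Jul 28, 2019

When should you standardize your data, and why? (proofreading complete)

6bd8b71

When should you standardize your data, and why? (proofreading complete)

leviding added 标注待管理员 Review (awaiting admin review) and removed enhancement 等待译者修改, 校对认领, 正在校对 labels Jul 31, 2019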

Update when-to-standardize-your-data.md

5673000

leviding approved these changes Jul 31, 2019

View reviewed changes

leviding merged commit 012e8c9 into xitu:master Jul 31, 2019

leviding added 翻译完成 (translation complete) and removed 标注待管理员 Review labels Jul 31, 2019

pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019

Revert "When should you standardize your data, and why? (xitu#6214)"

93e3f24

This reverts commit 941f4d2.

pingren pushed a commit to pingren/gold-miner that referenced this pull request Jul 31, 2019

Revert "When should you standardize your data, and why? (xitu#6214)"

28becd6

This reverts commit 941f4d2.
