diff --git a/.history/zh-tw/cs-229-deep-learning_20191006134707.md b/.history/zh-tw/cs-229-deep-learning_20191006134707.md new file mode 100644 index 000000000..9ab9bbad2 --- /dev/null +++ b/.history/zh-tw/cs-229-deep-learning_20191006134707.md @@ -0,0 +1,321 @@ +1. **Deep Learning cheatsheet** + +⟶ +深度學習參考手冊 +
+ +2. **Neural Networks** + +⟶ +神經網路 +
+ +3. **Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** + +⟶ +神經網路是一種透過 layer 來建構的模型。經常被使用的神經網路模型包括了卷積神經網路 (CNN) 和遞迴式神經網路 (RNN)。 +
+ +4. **Architecture ― The vocabulary around neural networks architectures is described in the figure below:** + +⟶ +架構 - 神經網路架構所需要用到的詞彙描述如下: +
+ +5. **[Input layer, hidden layer, output layer]** + +⟶ +[輸入層、隱藏層、輸出層] +
+ +6. **By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** + +⟶ +我們使用 i 來代表網路的第 i 層、j 來代表該層中第 j 個隱藏神經元，我們可以得到下面的等式: +<br>
+ +7. **where we note w, b, z the weight, bias and output respectively.** + +⟶ +其中,我們分別使用 w 來代表權重、b 代表偏差項、z 代表輸出的結果。 +
+ +8. **Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** + +⟶ +Activation function - Activation function 是為了在每一層尾端的神經元帶入非線性轉換而設計的。底下是一些常見 Activation function: +
+ +9. **[Sigmoid, Tanh, ReLU, Leaky ReLU]** + +⟶ +[Sigmoid, Tanh, ReLU, Leaky ReLU] +
+ +10. **Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** + +⟶ +交叉熵損失 - 在神經網路中，我們經常使用交叉熵損失函式 L(z,y)，其定義如下: +<br>
+ +11. **Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** + +⟶ +學習速率 - 學習速率通常用 α 或 η 來表示,目的是用來控制權重更新的速度。學習速度可以是一個固定值,或是隨著訓練的過程改變。現在最熱門的最佳化方法叫作 Adam,是一種隨著訓練過程改變學習速率的最佳化方法。 +
+ +12. **Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** + +⟶ +反向傳播演算法 - 反向傳播演算法是一種在神經網路中用來更新權重的方法,更新的基準是根據神經網路的實際輸出值和期望輸出值之間的關係。權重的導數是根據連鎖律 (chain rule) 來計算,通常會表示成下面的形式: +
+ +13. **As a result, the weight is updated as follows:** + +⟶ +因此,權重會透過以下的方式來更新: +
+ +14. **Updating weights ― In a neural network, weights are updated as follows:** + +⟶ +更新權重 - 在神經網路中,權重的更新會透過以下步驟進行: +
+ +15. **Step 1: Take a batch of training data.** + +⟶ +步驟一:取出一個批次 (batch) 的資料 +
+ +16. **Step 2: Perform forward propagation to obtain the corresponding loss.** + +⟶ +步驟二:執行前向傳播演算法 (forward propagation) 來得到對應的損失值 +
+ +17. **Step 3: Backpropagate the loss to get the gradients.** + +⟶ +步驟三:將損失值透過反向傳播演算法來得到梯度 +
+ +18. **Step 4: Use the gradients to update the weights of the network.** + +⟶ +步驟四:使用梯度來更新網路的權重 +
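The four update steps above can be made concrete with a small example. Below is a minimal NumPy sketch of one such iteration for a toy one-hidden-layer network with sigmoid activations and a mean-squared-error loss; the network shape, the toy batch and all variable names are assumptions made for this illustration, not something specified in the cheatsheet.

```python
import numpy as np

# Toy batch (assumed): 4 samples, 3 features, one target each
X = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.], [1., 1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output-layer weights and biases
alpha = 0.1                                     # learning rate

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Step 1: take a batch of training data (X, y above)
# Step 2: forward propagation and the corresponding loss
a1 = sigmoid(X @ W1 + b1)
y_hat = sigmoid(a1 @ W2 + b2)
loss = np.mean((y_hat - y) ** 2)

# Step 3: backpropagate the loss to get the gradients (chain rule)
d_out = 2 * (y_hat - y) / len(X) * y_hat * (1 - y_hat)   # dL/d(pre-activation of output)
dW2, db2 = a1.T @ d_out, d_out.sum(axis=0)
d_hid = d_out @ W2.T * a1 * (1 - a1)                     # dL/d(pre-activation of hidden)
dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)

# Step 4: use the gradients to update the weights
W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
W2, b2 = W2 - alpha * dW2, b2 - alpha * db2
print(round(loss, 4))
```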
+ +19. **Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p** + +⟶ +Dropout - Dropout 是一種透過丟棄神經網路中部分神經元，來避免過度擬合訓練資料的技巧。在實務上，每個神經元會以機率 p 被丟棄，或以機率 1−p 被保留 +<br>
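As a hedged illustration of the technique just described, here is a small NumPy sketch of "inverted" dropout, where units are dropped with probability p and the survivors are rescaled by 1/(1−p); the function name and the test tensor are assumptions for this example.

```python
import numpy as np

def dropout(activations, p, training=True, rng=None):
    """Inverted dropout: drop each unit with probability p, keep it with 1-p.

    Scaling the kept units by 1/(1-p) preserves the expected activation,
    so no extra rescaling is needed at test time.
    """
    if not training or p == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    keep_mask = rng.random(activations.shape) >= p   # True with probability 1-p
    return activations * keep_mask / (1.0 - p)

h = np.ones((2, 5))
print(dropout(h, p=0.5, rng=np.random.default_rng(0)))  # ~half zeroed, survivors scaled by 2
```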
+ +20. **Convolutional Neural Networks** + +⟶ +卷積神經網絡 +
+ +21. **Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** + +⟶ +卷積層的條件 - 我們使用 W 來表示輸入資料的維度大小、F 代表卷積層神經元 (filter) 的尺寸、P 代表補零 (zero padding) 的數量、S 代表步長 (stride)，則該維度下可容納的神經元數量 N 滿足: +<br>
+ +22. **Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** + +⟶ +批次正規化 (Batch normalization) - 它是一個藉由 γ,β 兩個超參數來正規化每個批次 {xi} 的過程。每一次正規化的過程,我們使用 μB,σ2B 分別代表平均數和變異數。請參考以下公式: +
+ +23. **It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** + +⟶ +批次正規化的動作通常在全連接層/卷積層之後、在非線性層之前進行。目的在於允許使用較高的學習速率，並減少模型對初始化的強烈依賴。 +<br>
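A minimal sketch of the batch-normalization step described in the two entries above, assuming a batch laid out as (batch, features); the learnable γ, β are passed in explicitly, and running statistics for inference are deliberately omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization of x (shape: batch x features)."""
    mu = x.mean(axis=0)                    # μB: per-feature batch mean
    var = x.var(axis=0)                    # σ²B: per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize the batch
    return gamma * x_hat + beta            # scale and shift with the hyperparameters γ, β

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3)) * 5 + 2
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ≈ 0 and ≈ 1 per feature
```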
+ +24. **Recurrent Neural Networks** + +⟶ +遞歸神經網路 (RNN) +
+ +25. **Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** + +⟶ +閘的種類 - 在傳統的遞歸神經網路中,你會遇到幾種閘: +
+ +26. **[Input gate, forget gate, gate, output gate]** + +⟶ +[輸入閘、遺忘閘、閘、輸出閘] +<br>
+ +27. **[Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** + +⟶ +要不要將資料寫入到記憶區塊中?要不要將存在在記憶區塊中的資料清除?要寫多少資料到記憶區塊?要不要將資料從記憶區塊中取出? +
+ +28. **LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** + +⟶ +長短期記憶模型 - 長短期記憶模型是一種遞歸神經網路,藉由導入遺忘閘的設計來避免梯度消失的問題 +
+ +29. **Reinforcement Learning and Control** + +⟶ +強化學習及控制 +
+ +30. **The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** + +⟶ +強化學習的目標就是為了讓代理 (agent) 能夠學習在環境中進化 +
+ +31. **Definitions** + +⟶ +定義 +
+ +32. **Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** + +⟶ +馬可夫決策過程 - 一個馬可夫決策過程 (MDP) 包含了五個元素: +
+ +33. **S is the set of states** + +⟶ +S 是一組狀態的集合 +
+ +34. **A is the set of actions** + +⟶ +A 是一組行為的集合 +
+ +35. **{Psa} are the state transition probabilities for s∈S and a∈A** + +⟶ +{Psa} 指的是,當 s∈S、a∈A 時,狀態轉移的機率 +
+ +36. **γ∈[0,1[ is the discount factor** + +⟶ +γ∈[0,1[ 是衰減係數 +
+ +37. **R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** + +⟶ +R:S×A⟶R 或 R:S⟶R 指的是獎勵函數,也就是演算法想要去最大化的目標函數 +
+ +38. **Policy ― A policy π is a function π:S⟶A that maps states to actions.** + +⟶ +策略 - 一個策略 π 指的是一個函數 π:S⟶A,這個函數會將狀態映射到行為 +
+ +39. **Remark: we say that we execute a given policy π if given a state a we take the action a=π(s).** + +⟶ +注意:我們會說,我們給定一個策略 π,當我們給定一個狀態 s 我們會採取一個行動 a=π(s) +
+ +40. **Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** + +⟶ +價值函數 - 給定一個策略 π 和狀態 s,我們定義價值函數 Vπ 為: +
+ +41. **Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:** + +⟶ +貝爾曼方程 - 最佳貝爾曼方程描述了最佳策略 π∗ 所對應的價值函數 Vπ∗: +<br>
+ +42. **Remark: we note that the optimal policy π∗ for a given state s is such that:** + +⟶ +注意:對於給定一個狀態 s,最佳的策略 π∗ 是: +
+ +43. **Value iteration algorithm ― The value iteration algorithm is in two steps:** + +⟶ +價值迭代演算法 - 價值迭代演算法包含兩個步驟: +
+ +44. **1) We initialize the value:** + +⟶ +1) 針對價值初始化: +
+ +45. **2) We iterate the value based on the values before:** + +⟶ +2) 根據先前的值，對價值進行迭代更新: +<br>
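The two steps of value iteration can be sketched on a small, made-up MDP; the transition probabilities, rewards and discount factor below are assumptions chosen only to make the loop runnable.

```python
import numpy as np

# Toy MDP (assumed): 3 states, 2 actions
n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.default_rng(0).dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]])                                     # R[s, a]

V = np.zeros(n_states)                      # 1) initialize the value
for _ in range(200):                        # 2) iterate using the previous values
    Q = R + gamma * P @ V                   # Q[s, a] = R(s,a) + γ Σ_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)                   # greedy policy w.r.t. the converged values
print(V.round(3), policy)
```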
+ +46. **Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** + +⟶ +最大概似估計 - 針對狀態轉移機率的最大概似估計為: +
+ +47. **times took action a in state s and got to s′** + +⟶ +在狀態 s 採取行動 a 並轉移到狀態 s′ 的次數 +<br>
+ +48. **times took action a in state s** + +⟶ +在狀態 s 採取行動 a 的次數 +<br>
+ +49. **Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** + +⟶ +Q-learning 演算法 - Q-learning 演算法是針對 Q 的一個 model-free 的估計，如下: +<br> + +50. **View PDF version on GitHub** + +⟶ +前往 GitHub 閱讀 PDF 版本 +<br>
+ +51. **[Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** + +⟶ +[神經網路, 架構, Activation function, 反向傳播演算法, Dropout] +
+ +52. **[Convolutional Neural Networks, Convolutional layer, Batch normalization]** + +⟶ +[卷積神經網絡, 卷積層, 批次正規化] +
+ +53. **[Recurrent Neural Networks, Gates, LSTM]** + +⟶ +[遞歸神經網路 (RNN), 閘, 長短期記憶模型] +
+ +54. **[Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** + +⟶ +[強化學習, 馬可夫決策過程, 價值/策略迭代, 近似動態規劃, 策略搜尋] \ No newline at end of file diff --git a/.history/zh-tw/cs-229-deep-learning_20191006140209.md b/.history/zh-tw/cs-229-deep-learning_20191006140209.md new file mode 100644 index 000000000..ee64d7556 --- /dev/null +++ b/.history/zh-tw/cs-229-deep-learning_20191006140209.md @@ -0,0 +1,321 @@ +1. **Deep Learning cheatsheet** + +⟶ +深度學習參考手冊 +
+ +2. **Neural Networks** + +⟶ +神經網路 +
+ +3. **Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.** + +⟶ +神經網路是一種透過 layer 來建構的模型。經常被使用的神經網路模型包括了卷積神經網路 (CNN) 和遞迴式神經網路 (RNN)。 +
+ +4. **Architecture ― The vocabulary around neural networks architectures is described in the figure below:** + +⟶ +架構 - 神經網路架構所需要用到的詞彙描述如下: +
+ +5. **[Input layer, hidden layer, output layer]** + +⟶ +[輸入層、隱藏層、輸出層] +
+ +6. **By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** + +⟶ +我們使用 i 來代表網路的第 i 層、j 來代表該層中第 j 個隱藏神經元, 我們可以得到下面的等式: +<br>
+ +7. **where we note w, b, z the weight, bias and output respectively.** + +⟶ +其中, 我們分別使用 w 來代表權重、b 代表偏差項、z 代表輸出的結果。 +
+ +8. **Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** + +⟶ +Activation function - Activation function 是為了在每一層尾端的神經元帶入非線性轉換而設計的。底下是一些常見 Activation function: +
+ +9. **[Sigmoid, Tanh, ReLU, Leaky ReLU]** + +⟶ +[Sigmoid, Tanh, ReLU, Leaky ReLU] +
+ +10. **Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:** + +⟶ +交叉熵損失 - 在神經網路中, 我們經常使用交叉熵損失函式 L(z,y), 其定義如下: +<br>
+ +11. **Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** + +⟶ +學習速率 - 學習速率通常用 α 或 η 來表示, 目的是用來控制權重更新的速度。學習速度可以是一個固定值, 或是隨著訓練的過程改變。現在最熱門的最佳化方法叫作 Adam, 是一種隨著訓練過程改變學習速率的最佳化方法。 +
+ +12. **Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** + +⟶ +反向傳播演算法 - 反向傳播演算法是一種在神經網路中用來更新權重的方法, 更新的基準是根據神經網路的實際輸出值和期望輸出值之間的關係。權重的導數是根據連鎖律 (chain rule) 來計算, 通常會表示成下面的形式: +
+ +13. **As a result, the weight is updated as follows:** + +⟶ +因此, 權重會透過以下的方式來更新: +
+ +14. **Updating weights ― In a neural network, weights are updated as follows:** + +⟶ +更新權重 - 在神經網路中, 權重的更新會透過以下步驟進行: +
+ +15. **Step 1: Take a batch of training data.** + +⟶ +步驟一:取出一個批次 (batch) 的資料 +
+ +16. **Step 2: Perform forward propagation to obtain the corresponding loss.** + +⟶ +步驟二:執行前向傳播演算法 (forward propagation) 來得到對應的損失值 +
+ +17. **Step 3: Backpropagate the loss to get the gradients.** + +⟶ +步驟三:將損失值透過反向傳播演算法來得到梯度 +
+ +18. **Step 4: Use the gradients to update the weights of the network.** + +⟶ +步驟四:使用梯度來更新網路的權重 +
+ +19. **Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p** + +⟶ +Dropout - Dropout 是一種透過丟棄神經網路中部分神經元, 來避免過度擬合訓練資料的技巧。在實務上, 每個神經元會以機率 p 被丟棄, 或以機率 1−p 被保留 +<br>
+ +20. **Convolutional Neural Networks** + +⟶ +卷積神經網絡 +
+ +21. **Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** + +⟶ +卷積層的條件 - 我們使用 W 來表示輸入資料的維度大小、F 代表卷積層神經元 (filter) 的尺寸、P 代表補零 (zero padding) 的數量、S 代表步長 (stride), 則該維度下可容納的神經元數量 N 滿足: +<br>
+ +22. **Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** + +⟶ +批次正規化 (Batch normalization) - 它是一個藉由 γ,β 兩個超參數來正規化每個批次 {xi} 的過程。每一次正規化的過程, 我們使用 μB,σ2B 分別代表平均數和變異數。請參考以下公式: +
+ +23. **It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** + +⟶ +批次正規化的動作通常在全連接層/卷積層之後、在非線性層之前進行。目的在於允許使用較高的學習速率, 並減少模型對初始化的強烈依賴。 +<br>
+ +24. **Recurrent Neural Networks** + +⟶ +遞歸神經網路 (RNN) +
+ +25. **Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** + +⟶ +閘的種類 - 在傳統的遞歸神經網路中, 你會遇到幾種閘: +
+ +26. **[Input gate, forget gate, gate, output gate]** + +⟶ +[輸入閘、遺忘閘、閘、輸出閘] +<br>
+ +27. **[Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]** + +⟶ +要不要將資料寫入到記憶區塊中?要不要將存在在記憶區塊中的資料清除?要寫多少資料到記憶區塊?要不要將資料從記憶區塊中取出? +
+ +28. **LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** + +⟶ +長短期記憶模型 - 長短期記憶模型是一種遞歸神經網路, 藉由導入遺忘閘的設計來避免梯度消失的問題 +
+ +29. **Reinforcement Learning and Control** + +⟶ +強化學習及控制 +
+ +30. **The goal of reinforcement learning is for an agent to learn how to evolve in an environment.** + +⟶ +強化學習的目標就是為了讓代理 (agent) 能夠學習在環境中進化 +
+ +31. **Definitions** + +⟶ +定義 +
+ +32. **Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:** + +⟶ +馬可夫決策過程 - 一個馬可夫決策過程 (MDP) 包含了五個元素: +
+ +33. **S is the set of states** + +⟶ +S 是一組狀態的集合 +
+ +34. **A is the set of actions** + +⟶ +A 是一組行為的集合 +
+ +35. **{Psa} are the state transition probabilities for s∈S and a∈A** + +⟶ +{Psa} 指的是, 當 s∈S、a∈A 時, 狀態轉移的機率 +
+ +36. **γ∈[0,1[ is the discount factor** + +⟶ +γ∈[0,1[ 是衰減係數 +
+ +37. **R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** + +⟶ +R:S×A⟶R 或 R:S⟶R 指的是獎勵函數, 也就是演算法想要去最大化的目標函數 +
+ +38. **Policy ― A policy π is a function π:S⟶A that maps states to actions.** + +⟶ +策略 - 一個策略 π 指的是一個函數 π:S⟶A, 這個函數會將狀態映射到行為 +
+ +39. **Remark: we say that we execute a given policy π if given a state a we take the action a=π(s).** + +⟶ +注意:我們會說, 我們給定一個策略 π, 當我們給定一個狀態 s 我們會採取一個行動 a=π(s) +
+ +40. **Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** + +⟶ +價值函數 - 給定一個策略 π 和狀態 s, 我們定義價值函數 Vπ 為: +
+ +41. **Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:** + +⟶ +貝爾曼方程 - 最佳貝爾曼方程描述了最佳策略 π∗ 所對應的價值函數 Vπ∗: +<br>
+ +42. **Remark: we note that the optimal policy π∗ for a given state s is such that:** + +⟶ +注意:對於給定一個狀態 s, 最佳的策略 π∗ 是: +
+ +43. **Value iteration algorithm ― The value iteration algorithm is in two steps:** + +⟶ +價值迭代演算法 - 價值迭代演算法包含兩個步驟: +
+ +44. **1) We initialize the value:** + +⟶ +1) 針對價值初始化: +
+ +45. **2) We iterate the value based on the values before:** + +⟶ +2) 根據先前的值, 對價值進行迭代更新: +<br>
+ +46. **Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** + +⟶ +最大概似估計 - 針對狀態轉移機率的最大概似估計為: +
+ +47. **times took action a in state s and got to s′** + +⟶ +在狀態 s 採取行動 a 並轉移到狀態 s′ 的次數 +<br>
+ +48. **times took action a in state s** + +⟶ +在狀態 s 採取行動 a 的次數 +<br>
+ +49. **Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** + +⟶ +Q-learning 演算法 - Q-learning 演算法是針對 Q 的一個 model-free 的估計, 如下: +<br> + +50. **View PDF version on GitHub** + +⟶ +前往 GitHub 閱讀 PDF 版本 +<br>
+ +51. **[Neural Networks, Architecture, Activation function, Backpropagation, Dropout]** + +⟶ +[神經網路, 架構, Activation function, 反向傳播演算法, Dropout] +
+ +52. **[Convolutional Neural Networks, Convolutional layer, Batch normalization]** + +⟶ +[卷積神經網絡, 卷積層, 批次正規化] +
+ +53. **[Recurrent Neural Networks, Gates, LSTM]** + +⟶ +[遞歸神經網路 (RNN), 閘, 長短期記憶模型] +
+ +54. **[Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]** + +⟶ +[強化學習, 馬可夫決策過程, 價值/策略迭代, 近似動態規劃, 策略搜尋] \ No newline at end of file diff --git a/.history/zh-tw/cs-229-linear-algebra_20191006134707.md b/.history/zh-tw/cs-229-linear-algebra_20191006134707.md new file mode 100644 index 000000000..36d4cef5d --- /dev/null +++ b/.history/zh-tw/cs-229-linear-algebra_20191006134707.md @@ -0,0 +1,338 @@ +1. **Linear Algebra and Calculus refresher** + +⟶ +線性代數與微積分回顧 +
+ +2. **General notations** + +⟶ +通用符號 +
+ +3. **Definitions** + +⟶ +定義 +
+ +4. **Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** + +⟶ +向量 - 我們定義 x∈Rn 是一個向量,包含 n 維元素,xi∈R 是第 i 維元素: +
+ +5. **Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** + +⟶ +矩陣 - 我們定義 A∈Rm×n 是一個 m 列 n 行的矩陣,Ai,j∈R 代表位在第 i 列第 j 行的元素: +
+ +6. **Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** + +⟶ +注意:上述定義的向量 x 可以視為 nx1 的矩陣,或是更常被稱為行向量 +
+ +7. **Main matrices** + +⟶ +主要的矩陣 +
+ +8. **Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** + +⟶ +單位矩陣 - 單位矩陣 I∈Rn×n 是一個方陣,其主對角線皆為 1,其餘皆為 0 +
+ +9. **Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** + +⟶ +注意:對於所有矩陣 A∈Rn×n,我們有 A×I=I×A=A +
+ +10. **Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** + +⟶ +對角矩陣 - 對角矩陣 D∈Rn×n 是一個方陣,其主對角線為非 0,其餘皆為 0 +
+ +11. **Remark: we also note D as diag(d1,...,dn).** + +⟶ +注意:我們令 D 為 diag(d1,...,dn) +
+ +12. **Matrix operations** + +⟶ +矩陣運算 +
+ +13. **Multiplication** + +⟶ +乘法 +
+ +14. **Vector-vector ― There are two types of vector-vector products:** + +⟶ +向量-向量 - 有兩種類型的向量-向量相乘: +
+ +15. **inner product: for x,y∈Rn, we have:** + +⟶ +內積:對於 x,y∈Rn,我們可以得到: +
+ +16. **outer product: for x∈Rm,y∈Rn, we have:** + +⟶ +外積:對於 x∈Rm,y∈Rn,我們可以得到: +
+ +17. **Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** + +⟶ +矩陣-向量 - 矩陣 A∈Rm×n 和向量 x∈Rn 的乘積是一個大小為 Rm 的向量，使得: +<br>
+ +18. **where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** + +⟶ +其中 aTr,i 是 A 的列向量、ac,j 是 A 的行向量、xi 是 x 的元素 +
+ +19. **Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** + +⟶ +矩陣-矩陣 - 矩陣 A∈Rm×n 和 B∈Rn×p 的乘積為一個大小 Rm×p 的矩陣，使得: +<br>
+ +20. **where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** + +⟶ +其中,aTr,i,bTr,i 和 ac,j,bc,j 分別是 A 和 B 的列向量與行向量 +
+ +21. **Other operations** + +⟶ +其他操作 +
+ +22. **Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** + +⟶ +轉置 - 一個矩陣的轉置矩陣 A∈Rm×n,記作 AT,指的是其中元素的翻轉: +
+ +23. **Remark: for matrices A,B, we have (AB)T=BTAT** + +⟶ +注意:對於矩陣 A、B,我們有 (AB)T=BTAT +
+ +24. **Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** + +⟶ +反矩陣 - 一個可逆方陣 A 的反矩陣記作 A−1，它是唯一滿足下式的矩陣: +<br>
+ +25. **Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** + +⟶ +注意:並非所有的方陣都是可逆的。同樣的,對於矩陣 A、B 來說,我們有 (AB)−1=B−1A−1 +
+ +26. **Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** + +⟶ +跡 - 一個方陣 A 的跡，記作 tr(A)，指的是主對角線元素之和: +<br>
+ +27. **Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** + +⟶ +注意:對於矩陣 A、B 來說,我們有 tr(AT)=tr(A) 及 tr(AB)=tr(BA) +
+ +28. **Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** + +⟶ +行列式 - 一個方陣 A∈Rn×n 的行列式,記作|A| 或 det(A),可以透過 A∖i,∖j 來遞迴表示,它是一個沒有第 i 列和第 j 行的矩陣 A: +
+ +29. **Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** + +⟶ +注意:A 是一個可逆矩陣,若且唯若 |A|≠0。同樣的,|AB|=|A||B| 且 |AT|=|A| +
+ +30. **Matrix properties** + +⟶ +矩陣的性質 +
+ +31. **Definitions** + +⟶ +定義 +
+ +32. **Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** + +⟶ +對稱分解 - 給定一個矩陣 A,它可以透過其對稱和反對稱的部分表示如下: +
+ +33. **[Symmetric, Antisymmetric]** + +⟶ +[對稱, 反對稱] +
+ +34. **Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** + +⟶ +範數 - 範數指的是一個函式 N:V⟶[0,+∞[,其中 V 是一個向量空間,且對於所有 x,y∈V,我們有: +
+ +35. **N(ax)=|a|N(x) for a scalar** + +⟶ +對一個純量來說,我們有 N(ax)=|a|N(x) +
+ +36. **if N(x)=0, then x=0** + +⟶ +若 N(x)=0 時,則 x=0 +
+ +37. **For x∈V, the most commonly used norms are summed up in the table below:** + +⟶ +對於 x∈V,最常用的範數總結如下表: +
+ +38. **[Norm, Notation, Definition, Use case]** + +⟶ +[範數, 表示法, 定義, 使用情境] +
+ +39. **Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** + +⟶ +線性相關 - 當集合中的一個向量可以被定義為集合中其他向量的線性組合時，則稱此集合的向量為線性相關 +<br>
+ +40. **Remark: if no vector can be written this way, then the vectors are said to be linearly independent** + +⟶ +注意:如果沒有向量可以如上表示時,則稱此集合的向量彼此為線性獨立 +
+ +41. **Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** + +⟶ +矩陣的秩 - 一個矩陣 A 的秩記作 rank(A)，指的是其行向量所生成的向量空間的維度，等價於 A 中線性獨立行向量的最大數量 +<br>
+ +42. **Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** + +⟶ +半正定矩陣 - 當以下成立時,一個矩陣 A∈Rn×n 是半正定矩陣 (PSD),且記作A⪰0: +
+ +43. **Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** + +⟶ +注意:同樣的,一個矩陣 A 是一個半正定矩陣 (PSD),且滿足所有非零向量 x,xTAx>0 時,稱之為正定矩陣,記作 A≻0 +
+ +44. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** + +⟶ +特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,當存在一個向量 z∈Rn∖{0} 時,此向量被稱為特徵向量,λ 稱之為 A 的特徵值,且滿足: +
+ +45. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** + +⟶ +譜分解 - 令 A∈Rn×n,如果 A 是對稱的,則 A 可以被一個實數正交矩陣 U∈Rn×n 給對角化。令 Λ=diag(λ1,...,λn),我們得到: +
+ +46. **diagonal** + +⟶ +對角線 +
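As a quick numerical check of the spectral theorem above (a sketch, with an assumed 2×2 symmetric matrix), NumPy's `eigh` returns an orthogonal U and eigenvalues Λ with A = U Λ Uᵀ:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])      # assumed symmetric example matrix
eigvals, U = np.linalg.eigh(A)              # eigh: routine for symmetric/Hermitian matrices
Lambda = np.diag(eigvals)

print(np.allclose(U @ Lambda @ U.T, A))     # A = U Λ Uᵀ
print(np.allclose(U.T @ U, np.eye(2)))      # U is orthogonal
```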
+ +47. **Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** + +⟶ +奇異值分解 - 對於給定維度為 mxn 的矩陣 A,其奇異值分解指的是一種因子分解技巧,保證存在 mxm 的單式矩陣 U、對角線矩陣 Σ m×n 和 nxn 的單式矩陣 V,滿足: +
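A similar sketch for the singular-value decomposition, with an assumed 2×3 matrix; the singular values returned by NumPy are embedded into an m×n diagonal Σ to verify A = U Σ Vᵀ:

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)            # assumed 2x3 example matrix
U, s, Vt = np.linalg.svd(A)                 # s holds the singular values
Sigma = np.zeros((2, 3))
Sigma[:2, :2] = np.diag(s)                  # embed them into an m x n "diagonal" Σ

print(np.allclose(U @ Sigma @ Vt, A))       # A = U Σ Vᵀ
```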
+ +48. **Matrix calculus** + +⟶ +矩陣導數 +
+ +49. **Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** + +⟶ +梯度 - 令 f:Rm×n→R 是一個函式,且 A∈Rm×n 是一個矩陣。f 相對於 A 的梯度是一個 mxn 的矩陣,記作 ∇Af(A),滿足: +
+ +50. **Remark: the gradient of f is only defined when f is a function that returns a scalar.** + +⟶ +注意:f 的梯度僅在 f 為一個函數且該函數回傳一個純量時有效 +
+ +51. **Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** + +⟶ +海森 - 令 f:Rn→R 是一個函式,且 x∈Rn 是一個向量,則一個 f 的海森對於向量 x 是一個 nxn 的對稱矩陣,記作 ∇2xf(x),滿足: +
+ +52. **Remark: the hessian of f is only defined when f is a function that returns a scalar** + +⟶ +注意:f 的海森僅在 f 為一個函數且該函數回傳一個純量時有效 +
+ +53. **Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** + +⟶ +梯度運算 - 對於矩陣 A、B、C，下列的梯度性質值得牢牢記住: +<br> + +54. **[General notations, Definitions, Main matrices]** + +⟶ +[通用符號, 定義, 主要矩陣] +<br>
+ +55. **[Matrix operations, Multiplication, Other operations]** + +⟶ +[矩陣運算, 矩陣乘法, 其他運算] +
+ +56. **[Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** + +⟶ +[矩陣性質, 範數, 特徵值/特徵向量, 奇異值分解] +
+ +57. **[Matrix calculus, Gradient, Hessian, Operations]** + +⟶ +[矩陣導數, 梯度, 海森, 運算] \ No newline at end of file diff --git a/.history/zh-tw/cs-229-linear-algebra_20191006140209.md b/.history/zh-tw/cs-229-linear-algebra_20191006140209.md new file mode 100644 index 000000000..8466a6644 --- /dev/null +++ b/.history/zh-tw/cs-229-linear-algebra_20191006140209.md @@ -0,0 +1,338 @@ +1. **Linear Algebra and Calculus refresher** + +⟶ +線性代數與微積分回顧 +
+ +2. **General notations** + +⟶ +通用符號 +
+ +3. **Definitions** + +⟶ +定義 +
+ +4. **Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** + +⟶ +向量 - 我們定義 x∈Rn 是一個向量, 包含 n 維元素, xi∈R 是第 i 維元素: +
+ +5. **Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** + +⟶ +矩陣 - 我們定義 A∈Rm×n 是一個 m 列 n 行的矩陣, Ai,j∈R 代表位在第 i 列第 j 行的元素: +
+ +6. **Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** + +⟶ +注意:上述定義的向量 x 可以視為 nx1 的矩陣, 或是更常被稱為行向量 +
+ +7. **Main matrices** + +⟶ +主要的矩陣 +
+ +8. **Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** + +⟶ +單位矩陣 - 單位矩陣 I∈Rn×n 是一個方陣, 其主對角線皆為 1, 其餘皆為 0 +
+ +9. **Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** + +⟶ +注意:對於所有矩陣 A∈Rn×n, 我們有 A×I=I×A=A +
+ +10. **Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** + +⟶ +對角矩陣 - 對角矩陣 D∈Rn×n 是一個方陣, 其主對角線為非 0, 其餘皆為 0 +
+ +11. **Remark: we also note D as diag(d1,...,dn).** + +⟶ +注意:我們令 D 為 diag(d1,...,dn) +
+ +12. **Matrix operations** + +⟶ +矩陣運算 +
+ +13. **Multiplication** + +⟶ +乘法 +
+ +14. **Vector-vector ― There are two types of vector-vector products:** + +⟶ +向量-向量 - 有兩種類型的向量-向量相乘: +
+ +15. **inner product: for x,y∈Rn, we have:** + +⟶ +內積:對於 x,y∈Rn, 我們可以得到: +
+ +16. **outer product: for x∈Rm,y∈Rn, we have:** + +⟶ +外積:對於 x∈Rm,y∈Rn, 我們可以得到: +
+ +17. **Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:** + +⟶ +矩陣-向量 - 矩陣 A∈Rm×n 和向量 x∈Rn 的乘積是一個大小為 Rm 的向量, 使得: +<br>
+ +18. **where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** + +⟶ +其中 aTr,i 是 A 的列向量、ac,j 是 A 的行向量、xi 是 x 的元素 +
+ +19. **Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:** + +⟶ +矩陣-矩陣 - 矩陣 A∈Rm×n 和 B∈Rn×p 的乘積為一個大小 Rm×p 的矩陣, 使得: +<br>
+ +20. **where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** + +⟶ +其中, aTr,i,bTr,i 和 ac,j,bc,j 分別是 A 和 B 的列向量與行向量 +
+ +21. **Other operations** + +⟶ +其他操作 +
+ +22. **Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** + +⟶ +轉置 - 一個矩陣的轉置矩陣 A∈Rm×n, 記作 AT, 指的是其中元素的翻轉: +
+ +23. **Remark: for matrices A,B, we have (AB)T=BTAT** + +⟶ +注意:對於矩陣 A、B, 我們有 (AB)T=BTAT +
+ +24. **Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** + +⟶ +反矩陣 - 一個可逆方陣 A 的反矩陣記作 A−1, 它是唯一滿足下式的矩陣: +<br>
+ +25. **Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** + +⟶ +注意:並非所有的方陣都是可逆的。同樣的, 對於矩陣 A、B 來說, 我們有 (AB)−1=B−1A−1 +
+ +26. **Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** + +⟶ +跡 - 一個方陣 A 的跡, 記作 tr(A), 指的是主對角線元素之和: +<br>
+ +27. **Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** + +⟶ +注意:對於矩陣 A、B 來說, 我們有 tr(AT)=tr(A) 及 tr(AB)=tr(BA) +
+ +28. **Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** + +⟶ +行列式 - 一個方陣 A∈Rn×n 的行列式, 記作|A| 或 det(A), 可以透過 A∖i,∖j 來遞迴表示, 它是一個沒有第 i 列和第 j 行的矩陣 A: +
+ +29. **Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** + +⟶ +注意:A 是一個可逆矩陣, 若且唯若 |A|≠0。同樣的, |AB|=|A||B| 且 |AT|=|A| +
+ +30. **Matrix properties** + +⟶ +矩陣的性質 +
+ +31. **Definitions** + +⟶ +定義 +
+ +32. **Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** + +⟶ +對稱分解 - 給定一個矩陣 A, 它可以透過其對稱和反對稱的部分表示如下: +
+ +33. **[Symmetric, Antisymmetric]** + +⟶ +[對稱, 反對稱] +
+ +34. **Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** + +⟶ +範數 - 範數指的是一個函式 N:V⟶[0,+∞[, 其中 V 是一個向量空間, 且對於所有 x,y∈V, 我們有: +
+ +35. **N(ax)=|a|N(x) for a scalar** + +⟶ +對一個純量來說, 我們有 N(ax)=|a|N(x) +
+ +36. **if N(x)=0, then x=0** + +⟶ +若 N(x)=0 時, 則 x=0 +
+ +37. **For x∈V, the most commonly used norms are summed up in the table below:** + +⟶ +對於 x∈V, 最常用的範數總結如下表: +
+ +38. **[Norm, Notation, Definition, Use case]** + +⟶ +[範數, 表示法, 定義, 使用情境] +
+ +39. **Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** + +⟶ +線性相關 - 當集合中的一個向量可以被定義為集合中其他向量的線性組合時, 則稱此集合的向量為線性相關 +<br>
+ +40. **Remark: if no vector can be written this way, then the vectors are said to be linearly independent** + +⟶ +注意:如果沒有向量可以如上表示時, 則稱此集合的向量彼此為線性獨立 +
+ +41. **Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** + +⟶ +矩陣的秩 - 一個矩陣 A 的秩記作 rank(A), 指的是其行向量所生成的向量空間的維度, 等價於 A 中線性獨立行向量的最大數量 +<br>
+ +42. **Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** + +⟶ +半正定矩陣 - 當以下成立時, 一個矩陣 A∈Rn×n 是半正定矩陣 (PSD), 且記作A⪰0: +
+ +43. **Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** + +⟶ +注意:同樣的, 一個矩陣 A 是一個半正定矩陣 (PSD), 且滿足所有非零向量 x, xTAx>0 時, 稱之為正定矩陣, 記作 A≻0 +
+ +44. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** + +⟶ +特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n, 當存在一個向量 z∈Rn∖{0} 時, 此向量被稱為特徵向量, λ 稱之為 A 的特徵值, 且滿足: +
+ +45. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** + +⟶ +譜分解 - 令 A∈Rn×n, 如果 A 是對稱的, 則 A 可以被一個實數正交矩陣 U∈Rn×n 給對角化。令 Λ=diag(λ1,...,λn), 我們得到: +
+ +46. **diagonal** + +⟶ +對角線 +
+ +47. **Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** + +⟶ +奇異值分解 - 對於給定維度為 mxn 的矩陣 A, 其奇異值分解指的是一種因子分解技巧, 保證存在 mxm 的單式矩陣 U、對角線矩陣 Σ m×n 和 nxn 的單式矩陣 V, 滿足: +
+ +48. **Matrix calculus** + +⟶ +矩陣導數 +
+ +49. **Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** + +⟶ +梯度 - 令 f:Rm×n→R 是一個函式, 且 A∈Rm×n 是一個矩陣。f 相對於 A 的梯度是一個 mxn 的矩陣, 記作 ∇Af(A), 滿足: +
+ +50. **Remark: the gradient of f is only defined when f is a function that returns a scalar.** + +⟶ +注意:f 的梯度僅在 f 為一個函數且該函數回傳一個純量時有效 +
+ +51. **Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** + +⟶ +海森 - 令 f:Rn→R 是一個函式, 且 x∈Rn 是一個向量, 則一個 f 的海森對於向量 x 是一個 nxn 的對稱矩陣, 記作 ∇2xf(x), 滿足: +
+ +52. **Remark: the hessian of f is only defined when f is a function that returns a scalar** + +⟶ +注意:f 的海森僅在 f 為一個函數且該函數回傳一個純量時有效 +
+ +53. **Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** + +⟶ +梯度運算 - 對於矩陣 A、B、C, 下列的梯度性質值得牢牢記住: +<br> + +54. **[General notations, Definitions, Main matrices]** + +⟶ +[通用符號, 定義, 主要矩陣] +<br>
+ +55. **[Matrix operations, Multiplication, Other operations]** + +⟶ +[矩陣運算, 矩陣乘法, 其他運算] +
+ +56. **[Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]** + +⟶ +[矩陣性質, 範數, 特徵值/特徵向量, 奇異值分解] +
+ +57. **[Matrix calculus, Gradient, Hessian, Operations]** + +⟶ +[矩陣導數, 梯度, 海森, 運算] \ No newline at end of file diff --git a/.history/zh-tw/cs-229-probability_20191006134707.md b/.history/zh-tw/cs-229-probability_20191006134707.md new file mode 100644 index 000000000..0db481cf5 --- /dev/null +++ b/.history/zh-tw/cs-229-probability_20191006134707.md @@ -0,0 +1,382 @@ +1. **Probabilities and Statistics refresher** + +⟶ +機率和統計回顧 +
+ +2. **Introduction to Probability and Combinatorics** + +⟶ +機率與組合數學介紹 +<br>
+ +3. **Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.** + +⟶ +樣本空間 - 一個實驗的所有可能結果的集合稱之為這個實驗的樣本空間,記做 S +
+ +4. **Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.** + +⟶ +事件 - 樣本空間的任何子集合 E 被稱之為一個事件。也就是說，一個事件是由實驗的可能結果所組成的集合。如果實驗的結果包含在 E 中，我們稱 E 發生了 +<br>
+ +5. **Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.** + +⟶ +機率公理。對於每個事件 E,我們用 P(E) 表示事件 E 發生的機率 +
+ +6. **Axiom 1 ― Every probability is between 0 and 1 included, i.e:** + +⟶ +公理 1 - 每一個機率值介於 0 到 1 之間,包含兩端點。即: +
+ +7. **Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:** + +⟶ +公理 2 - 至少一個基本事件出現在整個樣本空間中的機率是 1。即: +
+ +8. **Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:** + +⟶ +公理 3 - 對於任何互斥的事件 E1,...,En,我們定義如下: +
+ +9. **Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:** + +⟶ +排列 - 排列指的是從 n 個相異的物件中,取出 r 個物件按照固定順序重新安排,這樣安排的數量用 P(n,r) 來表示,定義為: +
+ +10. **Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:** + +⟶ +組合 - 組合指的是從 n 個物件中,取出 r 個物件,但不考慮他的順序。這樣組合要考慮的數量用 C(n,r) 來表示,定義為: +
+ +11. **Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)** + +⟶ +注意:對於 0⩽r⩽n,我們會有 P(n,r)⩾C(n,r) +
+ +12. **Conditional Probability** + +⟶ +條件機率 +
+ +13. **Bayes' rule ― For events A and B such that P(B)>0, we have:** + +⟶ +貝氏定理 - 對於事件 A 和 B 滿足 P(B)>0 時,我們定義如下: +
+ +14. **Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)** + +⟶ +注意:P(A∩B)=P(A)P(B|A)=P(A|B)P(B) +
+ +15. **Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:** + +⟶ +分割 - 令 {Ai,i∈[[1,n]]} 對所有的 i,Ai≠∅,我們說 {Ai} 是一個分割,當底下成立時: +
+ +16. **Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).** + +⟶ +注意:對於任何在樣本空間的事件 B 來說,P(B)=n∑i=1P(B|Ai)P(Ai) +
+ +17. **Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:** + +⟶ +貝氏定理的擴展 - 令 {Ai,i∈[[1,n]]} 為樣本空間的一個分割,我們定義: +
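A short worked example of the law of total probability and the extended Bayes' rule; the two-event partition and all probabilities are assumed numbers for illustration only.

```python
# Partition of the sample space: A1 = "has the condition", A2 = "does not".
p_a = [0.01, 0.99]                  # P(A1), P(A2)
p_b_given_a = [0.95, 0.05]          # P(B | A1), P(B | A2), with B = "test is positive"

p_b = sum(pa * pb for pa, pb in zip(p_a, p_b_given_a))   # law of total probability
posterior = p_a[0] * p_b_given_a[0] / p_b                # extended Bayes' rule: P(A1 | B)
print(round(p_b, 4), round(posterior, 4))                # 0.059 and ~0.161
```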
+ +18. **Independence ― Two events A and B are independent if and only if we have:** + +⟶ +獨立 - 當以下條件滿足時,兩個事件 A 和 B 為獨立事件: +
+ +19. **Random Variables** + +⟶ +隨機變數 +
+ +20. **Definitions** + +⟶ +定義 +
+ +21. **Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.** + +⟶ +隨機變數 - 一個隨機變數 X,它是一個將樣本空間中的每個元素映射到實數域的函數 +
+ +22. **Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:** + +⟶ +累積分佈函數 (CDF) - 累積分佈函數 F 是單調非遞減的函數，其 limx→−∞F(x)=0 且 limx→+∞F(x)=1，定義如下: +<br>
+ +23. **Remark: we have P(a<X⩽b)=F(b)−F(a)** + +⟶ +注意:P(a<X⩽b)=F(b)−F(a) +<br> + +24. **Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.** + +⟶ +機率密度函數 - 機率密度函數 f 是隨機變數 X 在兩個相鄰的實數值附近取值的機率 +<br>
+ +25. **Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.** + +⟶ +機率密度函數和累積分佈函數的關係 - 底下是一些關於離散 (D) 和連續 (C) 的情況下的重要屬性 +
+ +26. **[Case, CDF F, PDF f, Properties of PDF]** + +⟶ +[情況, 累積分佈函數 F, 機率密度函數 f, 機率密度函數的屬性] +
+ +27. **Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:** + +⟶ +分佈的期望值和動差 - 底下是期望值 E[X]、一般期望值 E[g(X)]、第 k 個動差和特徵函數 ψ(ω) 在離散和連續的情況下的表示式: +
+ +28. **Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:** + +⟶ +變異數 - 隨機變數的變異數通常表示為 Var(X) 或 σ2,用來衡量一個分佈離散程度的指標。其表示如下: +
+ +29. **Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:** + +⟶ +標準差 - 一個隨機變數的標準差通常表示為 σ,用來衡量一個分佈離散程度的指標,其單位和實際的隨機變數相容,表示如下: +
+ +30. **Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:** + +⟶ +隨機變數的轉換 - 令變數 X 和 Y 由某個函式連結在一起。我們定義 fX 和 fY 是 X 和 Y 的分佈函式,可以得到: +
+ +31. **Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:** + +⟶ +萊布尼茲積分法則 - 令 g 為 x 和 c 的函數，a 和 b 是可能依賴於 c 的邊界，我們得到: +<br>
+ +32. **Probability Distributions** + +⟶ +機率分佈 +
+ +33. **Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:** + +⟶ +柴比雪夫不等式 - 令 X 是一隨機變數,期望值為 μ。對於 k, σ>0,我們有以下不等式: +
+ +34. **Main distributions ― Here are the main distributions to have in mind:** + +⟶ +主要的分佈 - 底下是我們需要熟悉的幾個主要的分佈: +<br>
+ +35. **[Type, Distribution]** + +⟶ +[種類, 分佈] +
+ +36. **Jointly Distributed Random Variables** + +⟶ +聯合分佈隨機變數 +
+ +37. **Marginal density and cumulative distribution ― From the joint density probability function fXY , we have** + +⟶ +邊緣密度和累積分佈 - 從聯合密度機率函數 fXY 中我們可以得到: +
+ +38. **[Case, Marginal density, Cumulative function]** + +⟶ +[種類, 邊緣密度函數, 累積函數] +
+ +39. **Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:** + +⟶ +條件密度 - X 對於 Y 的條件密度,通常用 fX|Y 表示如下: +
+ +40. **Independence ― Two random variables X and Y are said to be independent if we have:** + +⟶ +獨立 - 當滿足以下條件時,我們稱隨機變數 X 和 Y 互相獨立: +
+ +41. **Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:** + +⟶ +共變異數 - 我們定義隨機變數 X 和 Y 的共變異數為 σ2XY 或 Cov(X,Y) 如下: +
+ +42. **Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:** + +⟶ +相關性 - 我們定義 σX、σY 為 X 和 Y 的標準差,而 X 和 Y 的相關係數 ρXY 定義如下: +
+ +43. **Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].** + +⟶ +注意一:對於任何隨機變數 X 和 Y 來說,ρXY∈[−1,1] 成立 +
+ +44. **Remark 2: If X and Y are independent, then ρXY=0.** + +⟶ +注意二:當 X 和 Y 獨立時,ρXY=0 +
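A brief sketch of estimating the covariance and correlation defined above from simulated samples; the data-generating process below is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x + 0.6 * rng.normal(size=1000)   # y is correlated with x by construction

cov_xy = np.cov(x, y)[0, 1]                 # sample covariance Cov(X, Y)
rho_xy = np.corrcoef(x, y)[0, 1]            # sample correlation ρXY, always in [-1, 1]
print(round(cov_xy, 3), round(rho_xy, 3))
```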
+ +45. **Parameter estimation** + +⟶ +參數估計 +
+ +46. **Definitions** + +⟶ +定義 +
+ +47. **Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.** + +⟶ +隨機抽樣 - 隨機抽樣指的是 n 個隨機變數 X1,...,Xn 和 X 獨立且同分佈的集合 +
+ +48. **Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.** + +⟶ +估計量 - 估計量是一個資料的函數,用來推斷在統計模型中未知參數的值 +
+ +49. **Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:** + +⟶ +偏差 - 一個估計量的偏差 ^θ 定義為 ^θ 分佈期望值和真實值之間的差距: +
+ +50. **Remark: an estimator is said to be unbiased when we have E[^θ]=θ.** + +⟶ +注意:當 E[^θ]=θ 時,我們稱為不偏估計量 +
+ +51. **Estimating the mean** + +⟶ +預估平均數 +
+ +52. **Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯X and is defined as follows:** + +⟶ +樣本平均 - 一個隨機樣本的樣本平均是用來預估一個分佈的真實平均 μ,通常我們用 ¯X 來表示,定義如下: +
+ +53. **Remark: the sample mean is unbiased, i.e E[¯X]=μ.** + +⟶ +注意:當 E[¯X]=μ 時,則為不偏樣本平均 +
+ +54. **Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:** + +⟶ +中央極限定理 - 當我們有一個隨機樣本 X1,...,Xn 滿足一個給定的分佈,其平均數為 μ,變異數為 σ2,我們有: +
+ +55. **Estimating the variance** + +⟶ +估計變異數 +
+ +56. **Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:** + +⟶ +樣本變異數 - 一個隨機樣本的樣本變異數是用來估計一個分佈的真實變異數 σ2,通常使用 s2 或 ^σ2 來表示,定義如下: +
+ +57. **Remark: the sample variance is unbiased, i.e E[s2]=σ2.** + +⟶ +注意:當 E[s2]=σ2 時,稱之為不偏樣本變異數 +
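A small simulation sketch of the two estimators above; the true mean and variance of the generating distribution are assumptions, and `ddof=1` gives the unbiased (n−1) sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=3.0, size=10_000)   # assumed true μ = 2, true σ² = 9

x_bar = sample.mean()                # sample mean: unbiased estimator of μ
s2 = sample.var(ddof=1)              # sample variance with the n-1 divisor: unbiased for σ²
print(round(x_bar, 3), round(s2, 3))
```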
+ +58. **Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:** + +⟶ +與樣本變異數的卡方關聯 - 令 s2 是一個隨機樣本的樣本變異數,我們可以得到: +
+ +**59. [Introduction, Sample space, Event, Permutation]** + +⟶ +[介紹, 樣本空間, 事件, 排列] +
+ +**60. [Conditional probability, Bayes' rule, Independence]** + +⟶ +[條件機率, 貝氏定理, 獨立性] +
+ +**61. [Random variables, Definitions, Expectation, Variance]** + +⟶ +[隨機變數, 定義, 期望值, 變異數] +
+ +**62. [Probability distributions, Chebyshev's inequality, Main distributions]** + +⟶ +[機率分佈, 柴比雪夫不等式, 主要分佈] +
+ +**63. [Jointly distributed random variables, Density, Covariance, Correlation]** + +⟶ +[聯合分佈隨機變數, 密度, 共變異數, 相關] +
+ +**64. [Parameter estimation, Mean, Variance]** + +⟶ +[參數估計, 平均數, 變異數] \ No newline at end of file diff --git a/.history/zh-tw/cs-229-probability_20191006140209.md b/.history/zh-tw/cs-229-probability_20191006140209.md new file mode 100644 index 000000000..bd4353351 --- /dev/null +++ b/.history/zh-tw/cs-229-probability_20191006140209.md @@ -0,0 +1,382 @@ +1. **Probabilities and Statistics refresher** + +⟶ +機率和統計回顧 +
+ +2. **Introduction to Probability and Combinatorics** + +⟶ +機率與組合數學介紹 +<br>
+ +3. **Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.** + +⟶ +樣本空間 - 一個實驗的所有可能結果的集合稱之為這個實驗的樣本空間, 記做 S +
+ +4. **Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.** + +⟶ +事件 - 樣本空間的任何子集合 E 被稱之為一個事件。也就是說, 一個事件是由實驗的可能結果所組成的集合。如果實驗的結果包含在 E 中, 我們稱 E 發生了 +<br>
+ +5. **Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.** + +⟶ +機率公理。對於每個事件 E, 我們用 P(E) 表示事件 E 發生的機率 +
+ +6. **Axiom 1 ― Every probability is between 0 and 1 included, i.e:** + +⟶ +公理 1 - 每一個機率值介於 0 到 1 之間, 包含兩端點。即: +
+ +7. **Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:** + +⟶ +公理 2 - 至少一個基本事件出現在整個樣本空間中的機率是 1。即: +
+ +8. **Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:** + +⟶ +公理 3 - 對於任何互斥的事件 E1,...,En, 我們定義如下: +
+ +9. **Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:** + +⟶ +排列 - 排列指的是從 n 個相異的物件中, 取出 r 個物件按照固定順序重新安排, 這樣安排的數量用 P(n,r) 來表示, 定義為: +
+ +10. **Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:** + +⟶ +組合 - 組合指的是從 n 個物件中, 取出 r 個物件, 但不考慮他的順序。這樣組合要考慮的數量用 C(n,r) 來表示, 定義為: +
+ +11. **Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)** + +⟶ +注意:對於 0⩽r⩽n, 我們會有 P(n,r)⩾C(n,r) +
+ +12. **Conditional Probability** + +⟶ +條件機率 +
+ +13. **Bayes' rule ― For events A and B such that P(B)>0, we have:** + +⟶ +貝氏定理 - 對於事件 A 和 B 滿足 P(B)>0 時, 我們定義如下: +
+ +14. **Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)** + +⟶ +注意:P(A∩B)=P(A)P(B|A)=P(A|B)P(B) +
+ +15. **Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:** + +⟶ +分割 - 令 {Ai,i∈[[1,n]]} 對所有的 i, Ai≠∅, 我們說 {Ai} 是一個分割, 當底下成立時: +
+ +16. **Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).** + +⟶ +注意:對於任何在樣本空間的事件 B 來說, P(B)=n∑i=1P(B|Ai)P(Ai) +
+ +17. **Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:** + +⟶ +貝氏定理的擴展 - 令 {Ai,i∈[[1,n]]} 為樣本空間的一個分割, 我們定義: +
+ +18. **Independence ― Two events A and B are independent if and only if we have:** + +⟶ +獨立 - 當以下條件滿足時, 兩個事件 A 和 B 為獨立事件: +
+ +19. **Random Variables** + +⟶ +隨機變數 +
+ +20. **Definitions** + +⟶ +定義 +
+ +21. **Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.** + +⟶ +隨機變數 - 一個隨機變數 X, 它是一個將樣本空間中的每個元素映射到實數域的函數 +
+ +22. **Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:** + +⟶ +累積分佈函數 (CDF) - 累積分佈函數 F 是單調非遞減的函數, 其 limx→−∞F(x)=0 且 limx→+∞F(x)=1, 定義如下: +<br>
+ +23. **Remark: we have P(a<X⩽b)=F(b)−F(a)** + +⟶ +注意:P(a<X⩽b)=F(b)−F(a) +<br> + +24. **Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.** + +⟶ +機率密度函數 - 機率密度函數 f 是隨機變數 X 在兩個相鄰的實數值附近取值的機率 +<br>
+ +25. **Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.** + +⟶ +機率密度函數和累積分佈函數的關係 - 底下是一些關於離散 (D) 和連續 (C) 的情況下的重要屬性 +
+ +26. **[Case, CDF F, PDF f, Properties of PDF]** + +⟶ +[情況, 累積分佈函數 F, 機率密度函數 f, 機率密度函數的屬性] +
+ +27. **Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:** + +⟶ +分佈的期望值和動差 - 底下是期望值 E[X]、一般期望值 E[g(X)]、第 k 個動差和特徵函數 ψ(ω) 在離散和連續的情況下的表示式: +
+ +28. **Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:** + +⟶ +變異數 - 隨機變數的變異數通常表示為 Var(X) 或 σ2, 用來衡量一個分佈離散程度的指標。其表示如下: +
+ +29. **Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:** + +⟶ +標準差 - 一個隨機變數的標準差通常表示為 σ, 用來衡量一個分佈離散程度的指標, 其單位和實際的隨機變數相容, 表示如下: +
+ +30. **Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:** + +⟶ +隨機變數的轉換 - 令變數 X 和 Y 由某個函式連結在一起。我們定義 fX 和 fY 是 X 和 Y 的分佈函式, 可以得到: +
+ +31. **Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:** + +⟶ +萊布尼茲積分法則 - 令 g 為 x 和 c 的函數, a 和 b 是可能依賴於 c 的邊界, 我們得到: +<br>
+ +32. **Probability Distributions** + +⟶ +機率分佈 +
+ +33. **Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:** + +⟶ +柴比雪夫不等式 - 令 X 是一隨機變數, 期望值為 μ。對於 k, σ>0, 我們有以下不等式: +
+ +34. **Main distributions ― Here are the main distributions to have in mind:** + +⟶ +主要的分佈 - 底下是我們需要熟悉的幾個主要的分佈: +<br>
+ +35. **[Type, Distribution]** + +⟶ +[種類, 分佈] +
+ +36. **Jointly Distributed Random Variables** + +⟶ +聯合分佈隨機變數 +
+ +37. **Marginal density and cumulative distribution ― From the joint density probability function fXY , we have** + +⟶ +邊緣密度和累積分佈 - 從聯合密度機率函數 fXY 中我們可以得到: +
+ +38. **[Case, Marginal density, Cumulative function]** + +⟶ +[種類, 邊緣密度函數, 累積函數] +
+ +39. **Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:** + +⟶ +條件密度 - X 對於 Y 的條件密度, 通常用 fX|Y 表示如下: +
+ +40. **Independence ― Two random variables X and Y are said to be independent if we have:** + +⟶ +獨立 - 當滿足以下條件時, 我們稱隨機變數 X 和 Y 互相獨立: +
+ +41. **Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:** + +⟶ +共變異數 - 我們定義隨機變數 X 和 Y 的共變異數為 σ2XY 或 Cov(X,Y) 如下: +
+ +42. **Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:** + +⟶ +相關性 - 我們定義 σX、σY 為 X 和 Y 的標準差, 而 X 和 Y 的相關係數 ρXY 定義如下: +
+ +43. **Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].** + +⟶ +注意一:對於任何隨機變數 X 和 Y 來說, ρXY∈[−1,1] 成立 +
+ +44. **Remark 2: If X and Y are independent, then ρXY=0.** + +⟶ +注意二:當 X 和 Y 獨立時, ρXY=0 +
+ +45. **Parameter estimation** + +⟶ +參數估計 +
+ +46. **Definitions** + +⟶ +定義 +
+ +47. **Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.** + +⟶ +隨機抽樣 - 隨機抽樣指的是 n 個隨機變數 X1,...,Xn 和 X 獨立且同分佈的集合 +
+ +48. **Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.** + +⟶ +估計量 - 估計量是一個資料的函數, 用來推斷在統計模型中未知參數的值 +
+ +49. **Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:** + +⟶ +偏差 - 一個估計量的偏差 ^θ 定義為 ^θ 分佈期望值和真實值之間的差距: +
+ +50. **Remark: an estimator is said to be unbiased when we have E[^θ]=θ.** + +⟶ +注意:當 E[^θ]=θ 時, 我們稱為不偏估計量 +
+ +51. **Estimating the mean** + +⟶ +預估平均數 +
+ +52. **Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯X and is defined as follows:** + +⟶ +樣本平均 - 一個隨機樣本的樣本平均是用來預估一個分佈的真實平均 μ, 通常我們用 ¯X 來表示, 定義如下: +
+ +53. **Remark: the sample mean is unbiased, i.e E[¯X]=μ.** + +⟶ +注意:當 E[¯X]=μ 時, 則為不偏樣本平均 +
+ +54. **Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:** + +⟶ +中央極限定理 - 當我們有一個隨機樣本 X1,...,Xn 滿足一個給定的分佈, 其平均數為 μ, 變異數為 σ2, 我們有: +
+ +55. **Estimating the variance** + +⟶ +估計變異數 +
+ +56. **Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:** + +⟶ +樣本變異數 - 一個隨機樣本的樣本變異數是用來估計一個分佈的真實變異數 σ2, 通常使用 s2 或 ^σ2 來表示, 定義如下: +
+ +57. **Remark: the sample variance is unbiased, i.e E[s2]=σ2.** + +⟶ +注意:當 E[s2]=σ2 時, 稱之為不偏樣本變異數 +
+ +58. **Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:** + +⟶ +與樣本變異數的卡方關聯 - 令 s2 是一個隨機樣本的樣本變異數, 我們可以得到: +
+ +**59. [Introduction, Sample space, Event, Permutation]** + +⟶ +[介紹, 樣本空間, 事件, 排列] +
+ +**60. [Conditional probability, Bayes' rule, Independence]** + +⟶ +[條件機率, 貝氏定理, 獨立性] +
+ +**61. [Random variables, Definitions, Expectation, Variance]** + +⟶ +[隨機變數, 定義, 期望值, 變異數] +
+ +**62. [Probability distributions, Chebyshev's inequality, Main distributions]** + +⟶ +[機率分佈, 柴比雪夫不等式, 主要分佈] +
+ +**63. [Jointly distributed random variables, Density, Covariance, Correlation]** + +⟶ +[聯合分佈隨機變數, 密度, 共變異數, 相關] +
+ +**64. [Parameter estimation, Mean, Variance]** + +⟶ +[參數估計, 平均數, 變異數] \ No newline at end of file diff --git a/.history/zh-tw/cs-229-supervised-learning_20191006134707.md b/.history/zh-tw/cs-229-supervised-learning_20191006134707.md new file mode 100644 index 000000000..0b329e8db --- /dev/null +++ b/.history/zh-tw/cs-229-supervised-learning_20191006134707.md @@ -0,0 +1,352 @@ +1. **Supervised Learning cheatsheet** + +⟶ 監督式學習參考手冊 + +2. **Introduction to Supervised Learning** + +⟶ 監督式學習介紹 + +3. **Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.** + +⟶ 給定一組資料點 {x(1),...,x(m)},以及對應的一組輸出 {y(1),...,y(m)},我們希望建立一個分類器,用來學習如何從 x 來預測 y + +4. **Type of prediction ― The different types of predictive models are summed up in the table below:** + +⟶ 預測的種類 - 根據預測的種類不同,我們將預測模型分為底下幾種: + +5. **[Regression, Classifier, Outcome, Examples]** + +⟶ [迴歸, 分類器, 結果, 範例] + +6. **[Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]** + +⟶ [連續, 類別, 線性迴歸, 邏輯迴歸, 支援向量機 (SVM) , 單純貝式分類器] + +7. **Type of model ― The different models are summed up in the table below:** + +⟶ 模型種類 - 不同種類的模型歸納如下表: + +8. **[Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]** + +⟶ [判別模型, 生成模型, 目標, 學到什麼, 示意圖, 範例] + +9. **[Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]** + +⟶ [直接估計 P(y|x), 先估計 P(x|y),然後推論出 P(y|x), 決策分界線, 資料的機率分佈, 迴歸, 支援向量機 (SVM), 高斯判別分析 (GDA), 單純貝氏 (Naive Bayes)] + +10. **Notations and general concepts** + +⟶ 符號及一般概念 + +11. **Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).** + +⟶ 假設 - 我們使用 hθ 來代表所選擇的模型,對於給定的輸入資料 x(i),模型預測的輸出是 hθ(x(i)) + +12. **Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:** + +⟶ 損失函數 - 損失函數是一個函數 L:(z,y)∈R×Y⟼L(z,y)∈R, +目的在於計算預測值 z 和實際值 y 之間的差距。底下是一些常見的損失函數: + +13. **[Least squared error, Logistic loss, Hinge loss, Cross-entropy]** + +⟶ [最小平方法, Logistic 損失函數, Hinge 損失函數, 交叉熵] + +14. **[Linear regression, Logistic regression, SVM, Neural Network]** + +⟶ [線性迴歸, 邏輯迴歸, 支援向量機 (SVM), 神經網路] + +15. **Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:** + +⟶ 代價函數 - 代價函數 J 通常用來評估一個模型的表現,它可以透過損失函數 L 來定義: + +16. **Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:** + +⟶ 梯度下降 - 使用 α∈R 表示學習速率,我們透過學習速率和代價函數來使用梯度下降的方法找出網路參數更新的方法可以表示為: + +17. **Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.** + +⟶ 注意:隨機梯度下降法 (SGD) 使用每一個訓練資料來更新參數。而批次梯度下降法則是透過一個批次的訓練資料來更新參數。 + +18. **Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:** + +⟶ 概似估計 - 在給定參數 θ 的條件下,一個模型 L(θ) 的概似估計的目的是透過最大概似估計法來找到最佳的參數。實務上,我們會使用對數概似估計函數 (log-likelihood) ℓ(θ)=log(L(θ)),會比較容易最佳化。如下: + +19. 
**Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:** + +⟶ 牛頓演算法 - 牛頓演算法是一個數值方法,目的在於找到一個 θ,讓 ℓ′(θ)=0。其更新的規則為: + +20. **Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:** + +⟶ 注意:多維度正規化的方法,或又被稱之為牛頓-拉弗森 (Newton-Raphson) 演算法,是透過以下的規則更新: + +21. **Linear models** + +⟶ 線性模型 + +22. **Linear regression** + +⟶ 線性迴歸 + +23. **We assume here that y|x;θ∼N(μ,σ2)** + +⟶ 我們假設 y|x;θ∼N(μ,σ2) + +24. **Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:** + +⟶ 正規方程法 - 我們使用 X 代表矩陣,讓代價函數最小的 θ 值有一個封閉解,如下: + +25. **LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:** + +⟶ 最小均方演算法 (LMS) - 我們使用 α 表示學習速率,針對 m 個訓練資料,透過最小均方演算法的更新規則,或是叫做 Widrow-Hoff 學習法如下: + +26. **Remark: the update rule is a particular case of the gradient ascent.** + +⟶ 注意:這個更新的規則是梯度上升的一種特例 + +27. **LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:** + +⟶ 局部加權迴歸 ,又稱為 LWR,是線性洄歸的變形,通過w(i)(x) 對其成本函數中的每個訓練樣本進行加權,其中參數 τ∈R 定義為: + +28. **Classification and logistic regression** + +⟶ 分類與邏輯迴歸 + +29. **Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:** + +⟶ Sigmoid 函數 - Sigmoid 函數 g,也可以稱為邏輯函數定義如下: + +30. **Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:** + +⟶ 邏輯迴歸 - 我們假設 y|x;θ∼Bernoulli(ϕ),請參考以下: + +31. **Remark: there is no closed form solution for the case of logistic regressions.** + +⟶ 注意:對於這種情況的邏輯迴歸,並沒有一個封閉解 + +32. **Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:** + +⟶ Softmax 迴歸 - Softmax 迴歸又稱做多分類邏輯迴歸,目的是用在超過兩個以上的分類時的迴歸使用。按照慣例,我們設定 θK=0,讓每一個類別的 Bernoulli 參數 ϕi 等同於: + +33. **Generalized Linear Models** + +⟶ 廣義線性模型 + +34. **Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:** + +⟶ 指數族分佈 - 一個分佈如果可以透過自然參數 (或稱之為正準參數或連結函數) η、充分統計量 T(y) 和對數區分函數 (log-partition function) a(η) 來表示時,我們就稱這個分佈是屬於指數族分佈。該分佈可以表示如下: + +35. **Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.** + +⟶ 注意:我們經常讓 T(y)=y,同時,exp(−a(η)) 可以看成是一個正規化的參數,目的在於讓機率總和為一。 + +36. **Here are the most common exponential distributions summed up in the following table:** + +⟶ 底下是最常見的指數分佈: + +37. **[Distribution, Bernoulli, Gaussian, Poisson, Geometric]** + +⟶ [分佈, 白努利 (Bernoulli), 高斯 (Gaussian), 卜瓦松 (Poisson), 幾何 (Geometric)] + +38. **Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:** + +⟶ 廣義線性模型的假設 - 廣義線性模型 (GLM) 的目的在於,給定 x∈Rn+1,要預測隨機變數 y,同時它依賴底下三個假設: + +39. 
**Remark: ordinary least squares and logistic regression are special cases of generalized linear models.** + +⟶ 注意:最小平方法和邏輯迴歸是廣義線性模型的一種特例 + +40. **Support Vector Machines** + +⟶ 支援向量機 + +41. **The goal of support vector machines is to find the line that maximizes the minimum distance to the line.** + +⟶ 支援向量機的目的在於找到一條決策邊界和資料樣本之間最大化最小距離的線 + +42. **Optimal margin classifier ― The optimal margin classifier h is such that:** + +⟶ 最佳的邊界分類器 - 最佳的邊界分類器可以表示為: + +43. **where (w,b)∈Rn×R is the solution of the following optimization problem:** + +⟶ 其中,(w,b)∈Rn×R 是底下最佳化問題的答案: + +44. **such that** + +⟶ 使得 + +45. **support vectors** + +⟶ 支援向量 + +46. **Remark: the line is defined as wTx−b=0.** + +⟶ 注意:該條直線定義為 wTx−b=0 + +47. **Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:** + +⟶ Hinge 損失函數 - Hinge 損失函數用在支援向量機上,定義如下: + +48. **Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:** + +⟶ 核(函數) - 給定特徵轉換 ϕ,我們定義核(函數) K 為: + +49. **In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.** + +⟶ 實務上,K(x,z)=exp(−||x−z||22σ2) 定義的核(函數) K,一般稱作高斯核(函數)。這種核(函數)經常被使用 + +50. **[Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]** + +⟶ [非線性可分, 使用核(函數)進行映射, 原始空間中的決策邊界] + +51. **Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.** + +⟶ 注意:我們使用 "核(函數)技巧" 來計算代價函數時,不需要真正的知道映射函數 ϕ,這個函數非常複雜。相反的,我們只需要知道 K(x,z) 的值即可。 + +52. **Lagrangian ― We define the Lagrangian L(w,b) as follows:** + +⟶ Lagrangian - 我們將 Lagrangian L(w,b) 定義如下: + +53. **Remark: the coefficients βi are called the Lagrange multipliers.** + +⟶ 注意:係數 βi 稱為 Lagrange 乘數 + +54. **Generative Learning** + +⟶ 生成學習 + +55. **A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.** + +⟶ 生成模型嘗試透過預估 P(x|y) 來學習資料如何生成,而我們可以透過貝氏定理來預估 P(y|x) + +56. **Gaussian Discriminant Analysis** + +⟶ 高斯判別分析 + +57. **Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:** + +⟶ 設定 - 高斯判別分析針對 y、x|y=0 和 x|y=1 進行以下假設: + +58. **Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:** + +⟶ 估計 - 底下的表格總結了我們在最大概似估計時的估計值: + +59. **Naive Bayes** + +⟶ 單純貝氏 + +60. **Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:** + +⟶ 假設 - 單純貝氏模型會假設每個資料點的特徵都是獨立的。 + +61. **Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]** + +⟶ 解決方法 - 最大化對數概似估計來給出以下解答,k∈{0,1},l∈[[1,L]] + +62. **Remark: Naive Bayes is widely used for text classification and spam detection.** + +⟶ 注意:單純貝氏廣泛應用在文字分類和垃圾信件偵測上 + +63. **Tree-based and ensemble methods** + +⟶ 基於樹狀結構的學習和整體學習 + +64. **These methods can be used for both regression and classification problems.** + +⟶ 這些方法可以應用在迴歸或分類問題上 + +65. **CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.** + +⟶ CART - 分類與迴歸樹 (CART),通常稱之為決策數,可以被表示為二元樹。它的優點是具有可解釋性。 + +66. **Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. 
Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.** + +⟶ 隨機森林 - 這是一個基於樹狀結構的方法,它使用大量經由隨機挑選的特徵所建構的決策樹。與單純的決策樹不同,它通常具有高度不可解釋性,但它的效能通常很好,所以是一個相當流行的演算法。 + +67. **Remark: random forests are a type of ensemble methods.** + +⟶ 注意:隨機森林是一種整體學習方法 + +68. **Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:** + +⟶ 增強學習 (Boosting) - 增強學習方法的概念是結合數個弱學習模型來變成強學習模型。主要的分類如下: + +69. **[Adaptive boosting, Gradient boosting]** + +⟶ [自適應增強, 梯度增強] + +70. **High weights are put on errors to improve at the next boosting step** + +⟶ 在下一輪的提升步驟中,錯誤的部分會被賦予較高的權重 + +71. **Weak learners trained on remaining errors** + +⟶ 弱學習器會負責訓練剩下的錯誤 + +72. **Other non-parametric approaches** + +⟶ 其他非參數方法 + +73. **k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.** + +⟶ k-最近鄰 - k-最近鄰演算法,又稱之為 k-NN,是一個非參數的方法,其中資料點的決定是透過訓練集中最近的 k 個鄰居而決定。它可以用在分類和迴歸問題上。 + +74. **Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.** + +⟶ 注意:參數 k 的值越大,偏差越大。k 的值越小,變異越大。 + +75. **Learning Theory** + +⟶ 學習理論 + +76. **Union bound ― Let A1,...,Ak be k events. We have:** + +⟶ 聯集上界 - 令 A1,...,Ak 為 k 個事件,我們有: + +77. **Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:** + +⟶ 霍夫丁不等式 - 令 Z1,..,Zm 為 m 個從參數 ϕ 的白努利分佈中抽出的獨立同分佈 (iid) 的變數。令 ˆϕ 為其樣本平均、固定 γ>0,我們可以得到: + +78. **Remark: this inequality is also known as the Chernoff bound.** + +⟶ 注意:這個不等式也被稱之為 Chernoff 界線 + +79. **Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:** + +⟶ 訓練誤差 - 對於一個分類器 h,我們定義訓練誤差為 ˆϵ(h),也可以稱為經驗風險或經驗誤差。定義如下: + +80. **Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: ** + +⟶ 可能近似正確 (PAC) - PAC 是一個框架,有許多學習理論都證明其有效性。它包含以下假設: + +81: **the training and testing sets follow the same distribution** + +⟶ 訓練和測試資料集具有相同的分佈 + +82. **the training examples are drawn independently** + +⟶ 訓練資料集之間彼此獨立 + +83. **Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:** + +⟶ 打散 (Shattering) - 給定一個集合 S={x(1),...,x(d)} 以及一組分類器的集合 H,如果對於任何一組標籤 {y(1),...,y(d)},H 都能打散 S,定義如下: + +84. **Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:** + +⟶ 上限定理 - 令 H 是一個有限假設類別,使 |H|=k 且令 δ 和樣本大小 m 固定,結著,在機率至少為 1−δ 的情況下,我們得到: + +85. **VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.** + +⟶ VC 維度 - 一個有限假設類別的 Vapnik-Chervonenkis (VC) 維度 VC(H) 指的是 H 最多能夠打散的數量 + +86. **Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.** + +⟶ 注意:H={2 維的線性分類器} 的 VC 維度為 3 + +87. **Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. 
With probability at least 1−δ, we have:** + +⟶ 理論 (Vapnik) - 令 H 已給定,VC(H)=d 且 m 是訓練資料級的數量,在機率至少為 1−δ 的情況下,我們得到: + +88. **Known as Adaboost** + +⟶ 被稱為 Adaboost diff --git a/.history/zh-tw/cs-229-supervised-learning_20191006140209.md b/.history/zh-tw/cs-229-supervised-learning_20191006140209.md new file mode 100644 index 000000000..28c064279 --- /dev/null +++ b/.history/zh-tw/cs-229-supervised-learning_20191006140209.md @@ -0,0 +1,352 @@ +1. **Supervised Learning cheatsheet** + +⟶ 監督式學習參考手冊 + +2. **Introduction to Supervised Learning** + +⟶ 監督式學習介紹 + +3. **Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.** + +⟶ 給定一組資料點 {x(1),...,x(m)}, 以及對應的一組輸出 {y(1),...,y(m)}, 我們希望建立一個分類器, 用來學習如何從 x 來預測 y + +4. **Type of prediction ― The different types of predictive models are summed up in the table below:** + +⟶ 預測的種類 - 根據預測的種類不同, 我們將預測模型分為底下幾種: + +5. **[Regression, Classifier, Outcome, Examples]** + +⟶ [迴歸, 分類器, 結果, 範例] + +6. **[Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]** + +⟶ [連續, 類別, 線性迴歸, 邏輯迴歸, 支援向量機 (SVM) , 單純貝式分類器] + +7. **Type of model ― The different models are summed up in the table below:** + +⟶ 模型種類 - 不同種類的模型歸納如下表: + +8. **[Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]** + +⟶ [判別模型, 生成模型, 目標, 學到什麼, 示意圖, 範例] + +9. **[Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]** + +⟶ [直接估計 P(y|x), 先估計 P(x|y), 然後推論出 P(y|x), 決策分界線, 資料的機率分佈, 迴歸, 支援向量機 (SVM), 高斯判別分析 (GDA), 單純貝氏 (Naive Bayes)] + +10. **Notations and general concepts** + +⟶ 符號及一般概念 + +11. **Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).** + +⟶ 假設 - 我們使用 hθ 來代表所選擇的模型, 對於給定的輸入資料 x(i), 模型預測的輸出是 hθ(x(i)) + +12. **Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:** + +⟶ 損失函數 - 損失函數是一個函數 L:(z,y)∈R×Y⟼L(z,y)∈R, +目的在於計算預測值 z 和實際值 y 之間的差距。底下是一些常見的損失函數: + +13. **[Least squared error, Logistic loss, Hinge loss, Cross-entropy]** + +⟶ [最小平方法, Logistic 損失函數, Hinge 損失函數, 交叉熵] + +14. **[Linear regression, Logistic regression, SVM, Neural Network]** + +⟶ [線性迴歸, 邏輯迴歸, 支援向量機 (SVM), 神經網路] + +15. **Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:** + +⟶ 代價函數 - 代價函數 J 通常用來評估一個模型的表現, 它可以透過損失函數 L 來定義: + +16. **Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:** + +⟶ 梯度下降 - 使用 α∈R 表示學習速率, 我們透過學習速率和代價函數來使用梯度下降的方法找出網路參數更新的方法可以表示為: + +17. **Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.** + +⟶ 注意:隨機梯度下降法 (SGD) 使用每一個訓練資料來更新參數。而批次梯度下降法則是透過一個批次的訓練資料來更新參數。 + +18. **Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. 
We have:** + +⟶ 概似估計 - 在給定參數 θ 的條件下, 一個模型 L(θ) 的概似估計的目的是透過最大概似估計法來找到最佳的參數。實務上, 我們會使用對數概似估計函數 (log-likelihood) ℓ(θ)=log(L(θ)), 會比較容易最佳化。如下: + +19. **Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:** + +⟶ 牛頓演算法 - 牛頓演算法是一個數值方法, 目的在於找到一個 θ, 讓 ℓ′(θ)=0。其更新的規則為: + +20. **Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:** + +⟶ 注意:多維度正規化的方法, 或又被稱之為牛頓-拉弗森 (Newton-Raphson) 演算法, 是透過以下的規則更新: + +21. **Linear models** + +⟶ 線性模型 + +22. **Linear regression** + +⟶ 線性迴歸 + +23. **We assume here that y|x;θ∼N(μ,σ2)** + +⟶ 我們假設 y|x;θ∼N(μ,σ2) + +24. **Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:** + +⟶ 正規方程法 - 我們使用 X 代表矩陣, 讓代價函數最小的 θ 值有一個封閉解, 如下: + +25. **LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:** + +⟶ 最小均方演算法 (LMS) - 我們使用 α 表示學習速率, 針對 m 個訓練資料, 透過最小均方演算法的更新規則, 或是叫做 Widrow-Hoff 學習法如下: + +26. **Remark: the update rule is a particular case of the gradient ascent.** + +⟶ 注意:這個更新的規則是梯度上升的一種特例 + +27. **LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:** + +⟶ 局部加權迴歸 , 又稱為 LWR, 是線性洄歸的變形, 通過w(i)(x) 對其成本函數中的每個訓練樣本進行加權, 其中參數 τ∈R 定義為: + +28. **Classification and logistic regression** + +⟶ 分類與邏輯迴歸 + +29. **Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:** + +⟶ Sigmoid 函數 - Sigmoid 函數 g, 也可以稱為邏輯函數定義如下: + +30. **Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:** + +⟶ 邏輯迴歸 - 我們假設 y|x;θ∼Bernoulli(ϕ), 請參考以下: + +31. **Remark: there is no closed form solution for the case of logistic regressions.** + +⟶ 注意:對於這種情況的邏輯迴歸, 並沒有一個封閉解 + +32. **Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:** + +⟶ Softmax 迴歸 - Softmax 迴歸又稱做多分類邏輯迴歸, 目的是用在超過兩個以上的分類時的迴歸使用。按照慣例, 我們設定 θK=0, 讓每一個類別的 Bernoulli 參數 ϕi 等同於: + +33. **Generalized Linear Models** + +⟶ 廣義線性模型 + +34. **Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:** + +⟶ 指數族分佈 - 一個分佈如果可以透過自然參數 (或稱之為正準參數或連結函數) η、充分統計量 T(y) 和對數區分函數 (log-partition function) a(η) 來表示時, 我們就稱這個分佈是屬於指數族分佈。該分佈可以表示如下: + +35. **Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.** + +⟶ 注意:我們經常讓 T(y)=y, 同時, exp(−a(η)) 可以看成是一個正規化的參數, 目的在於讓機率總和為一。 + +36. **Here are the most common exponential distributions summed up in the following table:** + +⟶ 底下是最常見的指數分佈: + +37. **[Distribution, Bernoulli, Gaussian, Poisson, Geometric]** + +⟶ [分佈, 白努利 (Bernoulli), 高斯 (Gaussian), 卜瓦松 (Poisson), 幾何 (Geometric)] + +38. 
**Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:** + +⟶ 廣義線性模型的假設 - 廣義線性模型 (GLM) 的目的在於, 給定 x∈Rn+1, 要預測隨機變數 y, 同時它依賴底下三個假設: + +39. **Remark: ordinary least squares and logistic regression are special cases of generalized linear models.** + +⟶ 注意:最小平方法和邏輯迴歸是廣義線性模型的一種特例 + +40. **Support Vector Machines** + +⟶ 支援向量機 + +41. **The goal of support vector machines is to find the line that maximizes the minimum distance to the line.** + +⟶ 支援向量機的目的在於找到一條決策邊界和資料樣本之間最大化最小距離的線 + +42. **Optimal margin classifier ― The optimal margin classifier h is such that:** + +⟶ 最佳的邊界分類器 - 最佳的邊界分類器可以表示為: + +43. **where (w,b)∈Rn×R is the solution of the following optimization problem:** + +⟶ 其中, (w,b)∈Rn×R 是底下最佳化問題的答案: + +44. **such that** + +⟶ 使得 + +45. **support vectors** + +⟶ 支援向量 + +46. **Remark: the line is defined as wTx−b=0.** + +⟶ 注意:該條直線定義為 wTx−b=0 + +47. **Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:** + +⟶ Hinge 損失函數 - Hinge 損失函數用在支援向量機上, 定義如下: + +48. **Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:** + +⟶ 核(函數) - 給定特徵轉換 ϕ, 我們定義核(函數) K 為: + +49. **In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.** + +⟶ 實務上, K(x,z)=exp(−||x−z||22σ2) 定義的核(函數) K, 一般稱作高斯核(函數)。這種核(函數)經常被使用 + +50. **[Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]** + +⟶ [非線性可分, 使用核(函數)進行映射, 原始空間中的決策邊界] + +51. **Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.** + +⟶ 注意:我們使用 "核(函數)技巧" 來計算代價函數時, 不需要真正的知道映射函數 ϕ, 這個函數非常複雜。相反的, 我們只需要知道 K(x,z) 的值即可。 + +52. **Lagrangian ― We define the Lagrangian L(w,b) as follows:** + +⟶ Lagrangian - 我們將 Lagrangian L(w,b) 定義如下: + +53. **Remark: the coefficients βi are called the Lagrange multipliers.** + +⟶ 注意:係數 βi 稱為 Lagrange 乘數 + +54. **Generative Learning** + +⟶ 生成學習 + +55. **A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.** + +⟶ 生成模型嘗試透過預估 P(x|y) 來學習資料如何生成, 而我們可以透過貝氏定理來預估 P(y|x) + +56. **Gaussian Discriminant Analysis** + +⟶ 高斯判別分析 + +57. **Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:** + +⟶ 設定 - 高斯判別分析針對 y、x|y=0 和 x|y=1 進行以下假設: + +58. **Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:** + +⟶ 估計 - 底下的表格總結了我們在最大概似估計時的估計值: + +59. **Naive Bayes** + +⟶ 單純貝氏 + +60. **Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:** + +⟶ 假設 - 單純貝氏模型會假設每個資料點的特徵都是獨立的。 + +61. **Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]** + +⟶ 解決方法 - 最大化對數概似估計來給出以下解答, k∈{0,1},l∈[[1,L]] + +62. **Remark: Naive Bayes is widely used for text classification and spam detection.** + +⟶ 注意:單純貝氏廣泛應用在文字分類和垃圾信件偵測上 + +63. **Tree-based and ensemble methods** + +⟶ 基於樹狀結構的學習和整體學習 + +64. **These methods can be used for both regression and classification problems.** + +⟶ 這些方法可以應用在迴歸或分類問題上 + +65. **CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. 
They have the advantage to be very interpretable.** + +⟶ CART - 分類與迴歸樹 (CART), 通常稱之為決策數, 可以被表示為二元樹。它的優點是具有可解釋性。 + +66. **Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.** + +⟶ 隨機森林 - 這是一個基於樹狀結構的方法, 它使用大量經由隨機挑選的特徵所建構的決策樹。與單純的決策樹不同, 它通常具有高度不可解釋性, 但它的效能通常很好, 所以是一個相當流行的演算法。 + +67. **Remark: random forests are a type of ensemble methods.** + +⟶ 注意:隨機森林是一種整體學習方法 + +68. **Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:** + +⟶ 增強學習 (Boosting) - 增強學習方法的概念是結合數個弱學習模型來變成強學習模型。主要的分類如下: + +69. **[Adaptive boosting, Gradient boosting]** + +⟶ [自適應增強, 梯度增強] + +70. **High weights are put on errors to improve at the next boosting step** + +⟶ 在下一輪的提升步驟中, 錯誤的部分會被賦予較高的權重 + +71. **Weak learners trained on remaining errors** + +⟶ 弱學習器會負責訓練剩下的錯誤 + +72. **Other non-parametric approaches** + +⟶ 其他非參數方法 + +73. **k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.** + +⟶ k-最近鄰 - k-最近鄰演算法, 又稱之為 k-NN, 是一個非參數的方法, 其中資料點的決定是透過訓練集中最近的 k 個鄰居而決定。它可以用在分類和迴歸問題上。 + +74. **Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.** + +⟶ 注意:參數 k 的值越大, 偏差越大。k 的值越小, 變異越大。 + +75. **Learning Theory** + +⟶ 學習理論 + +76. **Union bound ― Let A1,...,Ak be k events. We have:** + +⟶ 聯集上界 - 令 A1,...,Ak 為 k 個事件, 我們有: + +77. **Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:** + +⟶ 霍夫丁不等式 - 令 Z1,..,Zm 為 m 個從參數 ϕ 的白努利分佈中抽出的獨立同分佈 (iid) 的變數。令 ˆϕ 為其樣本平均、固定 γ>0, 我們可以得到: + +78. **Remark: this inequality is also known as the Chernoff bound.** + +⟶ 注意:這個不等式也被稱之為 Chernoff 界線 + +79. **Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:** + +⟶ 訓練誤差 - 對於一個分類器 h, 我們定義訓練誤差為 ˆϵ(h), 也可以稱為經驗風險或經驗誤差。定義如下: + +80. **Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: ** + +⟶ 可能近似正確 (PAC) - PAC 是一個框架, 有許多學習理論都證明其有效性。它包含以下假設: + +81: **the training and testing sets follow the same distribution** + +⟶ 訓練和測試資料集具有相同的分佈 + +82. **the training examples are drawn independently** + +⟶ 訓練資料集之間彼此獨立 + +83. **Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:** + +⟶ 打散 (Shattering) - 給定一個集合 S={x(1),...,x(d)} 以及一組分類器的集合 H, 如果對於任何一組標籤 {y(1),...,y(d)}, H 都能打散 S, 定義如下: + +84. **Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:** + +⟶ 上限定理 - 令 H 是一個有限假設類別, 使 |H|=k 且令 δ 和樣本大小 m 固定, 結著, 在機率至少為 1−δ 的情況下, 我們得到: + +85. **VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.** + +⟶ VC 維度 - 一個有限假設類別的 Vapnik-Chervonenkis (VC) 維度 VC(H) 指的是 H 最多能夠打散的數量 + +86. 
**Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.** + +⟶ 注意:H={2 維的線性分類器} 的 VC 維度為 3 + +87. **Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:** + +⟶ 理論 (Vapnik) - 令 H 已給定, VC(H)=d 且 m 是訓練資料級的數量, 在機率至少為 1−δ 的情況下, 我們得到: + +88. **Known as Adaboost** + +⟶ 被稱為 Adaboost diff --git a/.history/zh-tw/cs-229-unsupervised-learning_20191006134707.md b/.history/zh-tw/cs-229-unsupervised-learning_20191006134707.md new file mode 100644 index 000000000..0f6d5ee34 --- /dev/null +++ b/.history/zh-tw/cs-229-unsupervised-learning_20191006134707.md @@ -0,0 +1,298 @@ +1. **Unsupervised Learning cheatsheet** + +⟶ +非監督式學習參考手冊 +
+ +2. **Introduction to Unsupervised Learning** + +⟶ +非監督式學習介紹 +
+ +3. **Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** + +⟶ +動機 - 非監督式學習的目的是要找出未標籤資料 {x(1),...,x(m)} 之間的隱藏模式 +
+ +4. **Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** + +⟶ +Jensen's 不等式 - 令 f 為一個凸函數、X 為一個隨機變數,我們可以得到底下這個不等式: +
+ +5. **Clustering** + +⟶ +分群 +
+ +6. **Expectation-Maximization** + +⟶ +最大期望值 +
+ +7. **Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** + +⟶ +潛在變數 (Latent variables) - 潛在變數指的是隱藏/沒有觀察到的變數,這會讓問題的估計變得困難,我們通常使用 z 來代表它。底下是潛在變數的常見設定: +
+ +8. **[Setting, Latent variable z, Comments]** + +⟶ +[設定, 潛在變數 z, 評論] +
+ +9. **[Mixture of k Gaussians, Factor analysis]** + +⟶ +[k 元高斯模型, 因素分析] +
+
+10. **Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+⟶
+演算法 - 最大期望演算法 (EM Algorithm) 透過重複建構概似函數的下界 (E-step) 並最佳化該下界 (M-step)，提供了一種以最大概似估計法來估計參數 θ 的高效率方法:
+<br>
+ +11. **E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** + +⟶ +E-step: 評估後驗機率 Qi(z(i)),其中每個資料點 x(i) 來自於一個特定的群集 z(i),如下: +
+ +12. **M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** + +⟶ +M-step: 使用後驗機率 Qi(z(i)) 作為資料點 x(i) 在群集中特定的權重,用來分別重新估計每個群集,如下: +
+ +13. **[Gaussians initialization, Expectation step, Maximization step, Convergence]** + +⟶ +[高斯分佈初始化, E-Step, M-Step, 收斂] +
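The E-step/M-step loop in the entries above can be made concrete with a small sketch. Below is a minimal NumPy illustration (not part of the original cheatsheet) of EM for a one-dimensional mixture of two Gaussians; the toy data, initial values and variable names are made up for this example.

```python
import numpy as np

# Illustrative EM for a 1D mixture of two Gaussians (toy example only).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])  # synthetic data

phi = np.array([0.5, 0.5])      # mixing weights
mu = np.array([-1.0, 1.0])      # initial means
sigma2 = np.array([1.0, 1.0])   # initial variances

for _ in range(50):
    # E-step: posterior probability Q_i(z) of each cluster for every data point
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    q = phi * dens
    q /= q.sum(axis=1, keepdims=True)

    # M-step: re-estimate each cluster, using the posteriors as per-point weights
    nk = q.sum(axis=0)
    phi = nk / len(x)
    mu = (q * x[:, None]).sum(axis=0) / nk
    sigma2 = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(phi, mu, sigma2)
```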
+ +14. **k-means clustering** + +⟶ +k-means 分群法 +
+ +15. **We note c(i) the cluster of data point i and μj the center of cluster j.** + +⟶ +我們使用 c(i) 表示資料 i 屬於某群,而 μj 則是群 j 的中心 +
+ +16. **Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** + +⟶ +演算法 - 在隨機初始化群集中心點 μ1,μ2,...,μk∈Rn 後,k-means 演算法重複以下步驟直到收斂: +
+ +17. **[Means initialization, Cluster assignment, Means update, Convergence]** + +⟶ +[中心點初始化, 指定群集, 更新中心點, 收斂] +
+ +18. **Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** + +⟶ +畸變函數 - 為了確認演算法是否收斂,我們定義以下的畸變函數: +
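As a rough illustration of the assignment/update loop and the distortion function above, here is a short NumPy sketch; `kmeans` is a hypothetical helper written only for this example (it does not handle empty clusters or early stopping).

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Plain k-means: alternate cluster assignment and centroid update."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # cluster assignment: index of the closest centroid for every point
        c = np.argmin(np.linalg.norm(x[:, None, :] - mu, axis=2), axis=1)
        # means update: each centroid becomes the mean of its assigned points
        mu = np.array([x[c == j].mean(axis=0) for j in range(k)])
    # distortion J: sum of squared distances to the assigned centroids
    distortion = sum(np.linalg.norm(x[i] - mu[c[i]]) ** 2 for i in range(len(x)))
    return c, mu, distortion

x = np.random.default_rng(1).normal(size=(200, 2))
c, mu, J = kmeans(x, k=3)
print(mu, J)
```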
+ +19. **Hierarchical clustering** + +⟶ +階層式分群法 +
+ +20. **Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.** + +⟶ +演算法 - 階層式分群法是透過一種階層架構的方式,將資料建立為一種連續層狀結構的形式。 +
+ +21. **Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** + +⟶ +類型 - 底下是幾種不同類型的階層式分群法,差別在於要最佳化的目標函式的不同,請參考底下: +
+ +22. **[Ward linkage, Average linkage, Complete linkage]** + +⟶ +[Ward 鏈結距離, 平均鏈結距離, 完整鏈結距離] +
+ +23. **[Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]** + +⟶ +[最小化群內距離, 最小化各群彼此的平均距離, 最小化各群彼此的最大距離] +
+ +24. **Clustering assessment metrics** + +⟶ +分群衡量指標 +
+ +25. **In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** + +⟶ +在非監督式學習中,通常很難去評估一個模型的好壞,因為我們沒有擁有像在監督式學習任務中正確答案的標籤 +
+ +26. **Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** + +⟶ +輪廓係數 (Silhouette coefficient) - 我們指定 a 為一個樣本點和相同群集中其他資料點的平均距離、b 為一個樣本點和下一個最接近群集其他資料點的平均距離,輪廓係數 s 對於此一樣本點的定義為: +
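The definition above translates almost directly into code. The sketch below is illustrative only (it assumes Euclidean distances and at least two points per cluster) and computes s = (b − a) / max(a, b) per sample with NumPy.

```python
import numpy as np

def silhouette_samples(x, labels):
    """Silhouette s = (b - a) / max(a, b) for each sample, straight from the definition."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)   # pairwise distances
    s = np.zeros(len(x))
    for i in range(len(x)):
        same = labels == labels[i]
        # a: mean distance to the other points of the same cluster
        a = d[i, same & (np.arange(len(x)) != i)].mean()
        # b: smallest mean distance to the points of any other cluster
        b = min(d[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

x = np.random.default_rng(2).normal(size=(60, 2))
labels = (x[:, 0] > 0).astype(int)          # toy labels
print(silhouette_samples(x, labels).mean())
```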
+ +27. **Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** + +⟶ +Calinski-Harabaz 指標 - 定義 k 是群集的數量,Bk 和 Wk 分別是群內和群集之間的離差矩陣 (dispersion matrices): +
+ +28. **the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** + +⟶ +Calinski-Harabaz 指標 s(k) 指出分群模型的好壞,此指標的值越高,代表分群模型的表現越好。定義如下: +
+ +29. **Dimension reduction** + +⟶ +維度縮減 +
+ +30. **Principal component analysis** + +⟶ +主成份分析 +
+
+31. **It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
+
+⟶
+這是一種維度縮減的技巧，目的在於找出能使投影後資料變異數最大的方向
+<br>
+ +32. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** + +⟶ +特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,我們說 λ 是 A 的特徵值,當存在一個特徵向量 z∈Rn∖{0},使得: +
+
+33. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶
+譜定理 - 令 A∈Rn×n，如果 A 是對稱的，則 A 可以透過實正交矩陣 U∈Rn×n 對角化。當 Λ=diag(λ1,...,λn)，我們得到：
+<br>
+ +34. **diagonal** + +⟶ +對角線 +
+
+35. **Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
+
+⟶
+注意：與最大特徵值所關聯的特徵向量稱為矩陣 A 的主特徵向量
+<br>
+ +36. **Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:** + +⟶ +演算法 - 主成份分析 (PCA) 是一種維度縮減的技巧,它會透過尋找資料最大變異的方式,將資料投影在 k 維空間上: +
+ +37. **Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** + +⟶ +第一步:正規化資料,讓資料平均為 0,變異數為 1 +
+
+38. **Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
+
+⟶
+第二步：計算 Σ=1mm∑i=1x(i)x(i)T∈Rn×n，它是具有實特徵值的對稱矩陣
+<br>
+
+39. **Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
+
+⟶
+第三步：計算 Σ 的 k 個正交主特徵向量 u1,...,uk∈Rn，也就是對應 k 個最大特徵值的正交特徵向量
+<br>
+
+40. **Step 4: Project the data on spanR(u1,...,uk).**
+
+⟶
+第四步：將資料投影到 spanR(u1,...,uk)
+<br>
+ +41. **This procedure maximizes the variance among all k-dimensional spaces.** + +⟶ +這個步驟會最大化所有 k 維空間的變異數 +
+ +42. **[Data in feature space, Find principal components, Data in principal components space]** + +⟶ +[資料在特徵空間, 尋找主成分, 資料在主成分空間] +
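Steps 1 to 4 above can be sketched in a few lines of NumPy. The helper `pca` below is a toy illustration (not a reference implementation): it normalizes the data, forms the empirical covariance, takes the top-k eigenvectors with `numpy.linalg.eigh`, and projects.

```python
import numpy as np

def pca(x, k):
    """PCA following steps 1-4: normalize, covariance, top-k eigenvectors, project."""
    # Step 1: normalize to zero mean and unit standard deviation
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    # Step 2: empirical covariance, symmetric with real eigenvalues
    sigma = (x.T @ x) / len(x)
    # Step 3: orthogonal eigenvectors of the k largest eigenvalues
    eigval, eigvec = np.linalg.eigh(sigma)          # eigh returns ascending eigenvalues
    u = eigvec[:, np.argsort(eigval)[::-1][:k]]
    # Step 4: project the data on span(u1, ..., uk)
    return x @ u

x = np.random.default_rng(3).normal(size=(100, 5))
print(pca(x, k=2).shape)   # (100, 2)
```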
+ +43. **Independent component analysis** + +⟶ +獨立成分分析 +
+ +44. **It is a technique meant to find the underlying generating sources.** + +⟶ +這是用來尋找潛在生成來源的技巧 +
+ +45. **Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** + +⟶ +假設 - 我們假設資料 x 是從 n 維的來源向量 s=(s1,...,sn) 產生,si 為獨立變數,透過一個混合與非奇異矩陣 A 產生如下: +
+ +46. **The goal is to find the unmixing matrix W=A−1.** + +⟶ +目的在於找到一個 unmixing 矩陣 W=A−1 +
+ +47. **Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** + +⟶ +Bell 和 Sejnowski 獨立成份分析演算法 - 此演算法透過以下步驟來找到 unmixing 矩陣: +
+ +48. **Write the probability of x=As=W−1s as:** + +⟶ +紀錄 x=As=W−1s 的機率如下: +
+ +49. **Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** + +⟶ +在給定訓練資料 {x(i),i∈[[1,m]]} 的情況下,其對數概似估計函數與定義 g 為 sigmoid 函數如下: +
+
+50. **Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
+
+⟶
+因此，隨機梯度上升學習規則對每個訓練樣本 x(i) 來說，我們透過以下方法來更新 W:
diff --git a/.history/zh-tw/cs-229-unsupervised-learning_20191006140209.md b/.history/zh-tw/cs-229-unsupervised-learning_20191006140209.md
new file mode 100644
index 000000000..54044637a
--- /dev/null
+++ b/.history/zh-tw/cs-229-unsupervised-learning_20191006140209.md
@@ -0,0 +1,298 @@
+1. **Unsupervised Learning cheatsheet**
+
+⟶
+非監督式學習參考手冊
+<br>
+ +2. **Introduction to Unsupervised Learning** + +⟶ +非監督式學習介紹 +
+ +3. **Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.** + +⟶ +動機 - 非監督式學習的目的是要找出未標籤資料 {x(1),...,x(m)} 之間的隱藏模式 +
+ +4. **Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** + +⟶ +Jensen's 不等式 - 令 f 為一個凸函數、X 為一個隨機變數, 我們可以得到底下這個不等式: +
+ +5. **Clustering** + +⟶ +分群 +
+ +6. **Expectation-Maximization** + +⟶ +最大期望值 +
+ +7. **Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** + +⟶ +潛在變數 (Latent variables) - 潛在變數指的是隱藏/沒有觀察到的變數, 這會讓問題的估計變得困難, 我們通常使用 z 來代表它。底下是潛在變數的常見設定: +
+ +8. **[Setting, Latent variable z, Comments]** + +⟶ +[設定, 潛在變數 z, 評論] +
+ +9. **[Mixture of k Gaussians, Factor analysis]** + +⟶ +[k 元高斯模型, 因素分析] +
+
+10. **Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+⟶
+演算法 - 最大期望演算法 (EM Algorithm) 透過重複建構概似函數的下界 (E-step) 並最佳化該下界 (M-step), 提供了一種以最大概似估計法來估計參數 θ 的高效率方法:
+<br>
+ +11. **E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** + +⟶ +E-step: 評估後驗機率 Qi(z(i)), 其中每個資料點 x(i) 來自於一個特定的群集 z(i), 如下: +
+ +12. **M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** + +⟶ +M-step: 使用後驗機率 Qi(z(i)) 作為資料點 x(i) 在群集中特定的權重, 用來分別重新估計每個群集, 如下: +
+ +13. **[Gaussians initialization, Expectation step, Maximization step, Convergence]** + +⟶ +[高斯分佈初始化, E-Step, M-Step, 收斂] +
+ +14. **k-means clustering** + +⟶ +k-means 分群法 +
+ +15. **We note c(i) the cluster of data point i and μj the center of cluster j.** + +⟶ +我們使用 c(i) 表示資料 i 屬於某群, 而 μj 則是群 j 的中心 +
+ +16. **Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** + +⟶ +演算法 - 在隨機初始化群集中心點 μ1,μ2,...,μk∈Rn 後, k-means 演算法重複以下步驟直到收斂: +
+ +17. **[Means initialization, Cluster assignment, Means update, Convergence]** + +⟶ +[中心點初始化, 指定群集, 更新中心點, 收斂] +
+ +18. **Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** + +⟶ +畸變函數 - 為了確認演算法是否收斂, 我們定義以下的畸變函數: +
+ +19. **Hierarchical clustering** + +⟶ +階層式分群法 +
+ +20. **Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.** + +⟶ +演算法 - 階層式分群法是透過一種階層架構的方式, 將資料建立為一種連續層狀結構的形式。 +
+ +21. **Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** + +⟶ +類型 - 底下是幾種不同類型的階層式分群法, 差別在於要最佳化的目標函式的不同, 請參考底下: +
+ +22. **[Ward linkage, Average linkage, Complete linkage]** + +⟶ +[Ward 鏈結距離, 平均鏈結距離, 完整鏈結距離] +
+ +23. **[Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]** + +⟶ +[最小化群內距離, 最小化各群彼此的平均距離, 最小化各群彼此的最大距離] +
+ +24. **Clustering assessment metrics** + +⟶ +分群衡量指標 +
+ +25. **In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** + +⟶ +在非監督式學習中, 通常很難去評估一個模型的好壞, 因為我們沒有擁有像在監督式學習任務中正確答案的標籤 +
+ +26. **Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** + +⟶ +輪廓係數 (Silhouette coefficient) - 我們指定 a 為一個樣本點和相同群集中其他資料點的平均距離、b 為一個樣本點和下一個最接近群集其他資料點的平均距離, 輪廓係數 s 對於此一樣本點的定義為: +
+ +27. **Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** + +⟶ +Calinski-Harabaz 指標 - 定義 k 是群集的數量, Bk 和 Wk 分別是群內和群集之間的離差矩陣 (dispersion matrices): +
+ +28. **the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** + +⟶ +Calinski-Harabaz 指標 s(k) 指出分群模型的好壞, 此指標的值越高, 代表分群模型的表現越好。定義如下: +
+ +29. **Dimension reduction** + +⟶ +維度縮減 +
+ +30. **Principal component analysis** + +⟶ +主成份分析 +
+
+31. **It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
+
+⟶
+這是一種維度縮減的技巧, 目的在於找出能使投影後資料變異數最大的方向
+<br>
+ +32. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** + +⟶ +特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n, 我們說 λ 是 A 的特徵值, 當存在一個特徵向量 z∈Rn∖{0}, 使得: +
+
+33. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+⟶
+譜定理 - 令 A∈Rn×n, 如果 A 是對稱的, 則 A 可以透過實正交矩陣 U∈Rn×n 對角化。當 Λ=diag(λ1,...,λn), 我們得到：
+<br>
+ +34. **diagonal** + +⟶ +對角線 +
+
+35. **Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
+
+⟶
+注意：與最大特徵值所關聯的特徵向量稱為矩陣 A 的主特徵向量
+<br>
+ +36. **Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:** + +⟶ +演算法 - 主成份分析 (PCA) 是一種維度縮減的技巧, 它會透過尋找資料最大變異的方式, 將資料投影在 k 維空間上: +
+ +37. **Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** + +⟶ +第一步:正規化資料, 讓資料平均為 0, 變異數為 1 +
+
+38. **Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.**
+
+⟶
+第二步：計算 Σ=1mm∑i=1x(i)x(i)T∈Rn×n, 它是具有實特徵值的對稱矩陣
+<br>
+
+39. **Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
+
+⟶
+第三步：計算 Σ 的 k 個正交主特徵向量 u1,...,uk∈Rn, 也就是對應 k 個最大特徵值的正交特徵向量
+<br>
+
+40. **Step 4: Project the data on spanR(u1,...,uk).**
+
+⟶
+第四步：將資料投影到 spanR(u1,...,uk)
+<br>
+ +41. **This procedure maximizes the variance among all k-dimensional spaces.** + +⟶ +這個步驟會最大化所有 k 維空間的變異數 +
+ +42. **[Data in feature space, Find principal components, Data in principal components space]** + +⟶ +[資料在特徵空間, 尋找主成分, 資料在主成分空間] +
+ +43. **Independent component analysis** + +⟶ +獨立成分分析 +
+ +44. **It is a technique meant to find the underlying generating sources.** + +⟶ +這是用來尋找潛在生成來源的技巧 +
+ +45. **Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** + +⟶ +假設 - 我們假設資料 x 是從 n 維的來源向量 s=(s1,...,sn) 產生, si 為獨立變數, 透過一個混合與非奇異矩陣 A 產生如下: +
+ +46. **The goal is to find the unmixing matrix W=A−1.** + +⟶ +目的在於找到一個 unmixing 矩陣 W=A−1 +
+ +47. **Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:** + +⟶ +Bell 和 Sejnowski 獨立成份分析演算法 - 此演算法透過以下步驟來找到 unmixing 矩陣: +
+ +48. **Write the probability of x=As=W−1s as:** + +⟶ +紀錄 x=As=W−1s 的機率如下: +
+ +49. **Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** + +⟶ +在給定訓練資料 {x(i),i∈[[1,m]]} 的情況下, 其對數概似估計函數與定義 g 為 sigmoid 函數如下: +
+
+50. **Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
+
+⟶
+因此, 隨機梯度上升學習規則對每個訓練樣本 x(i) 來說, 我們透過以下方法來更新 W:
diff --git a/.history/zh/cs-229-supervised-learning_20191006134707.md b/.history/zh/cs-229-supervised-learning_20191006134707.md
new file mode 100644
index 000000000..4a7f4bbb9
--- /dev/null
+++ b/.history/zh/cs-229-supervised-learning_20191006134707.md
@@ -0,0 +1,567 @@
+1. **Supervised Learning cheatsheet**
+
+⟶ 监督学习简明指南
+
+<br>
+ +2. **Introduction to Supervised Learning** + +⟶ 监督学习简介 + +
+ +3. **Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.** + +⟶ 给定一组数据点 {x(1),...,x(m)} 和与其对应的输出 {y(1),...,y(m)} , 我们想要建立一个分类器,学习如何从 x 预测 y。 + +
+ +4. **Type of prediction ― The different types of predictive models are summed up in the table below:** + +⟶ 预测类型 - 不同类型的预测模型总结如下表: + +
+ +5. **[Regression, Classifier, Outcome, Examples]** + +⟶ [回归,分类,输出,例子] + +
+ +6. **[Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]** + +⟶ [连续,类,线性回归,Logistic回归,SVM,朴素贝叶斯] + +
+
+7. **Type of model ― The different models are summed up in the table below:**
+
+⟶ 模型类型 - 不同模型总结如下表:
+
+<br>
+ +8. **[Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]** + +⟶ [判别模型,生成模型,目标,所学内容,例图,示例] + +
+ +9. **[Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]** + +⟶ [直接估计P(y|x),估计P(x|y) 然后推导 P(y|x),决策边界,数据的概率分布,回归,SVMs,GDA,朴素贝叶斯] + +
+ +10. **Notations and general concepts** + +⟶ 符号和一般概念 + +
+ +11. **Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).** + +⟶ 假设 - 假设我们选择的模型是hθ 。 对于给定的输入数据 x(i),模型预测输出是 hθ(x(i))。 + +
+ +12. **Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:** + +⟶ 损失函数 - 损失函数是一个 L:(z,y)∈R×Y⟼L(z,y)∈R 的函数,其将真实数据值 y 和其预测值 z 作为输入,输出它们的不同程度。 常见的损失函数总结如下表: + +
+ +13. **[Least squared error, Logistic loss, Hinge loss, Cross-entropy]** + +⟶ [最小二乘误差,Logistic损失,铰链损失,交叉熵] + +
+ +14. **[Linear regression, Logistic regression, SVM, Neural Network]** + +⟶ [线性回归,Logistic回归,SVM,神经网络] + +
+ +15. **Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:** + +⟶ 成本函数 - 成本函数 J 通常用于评估模型的性能,使用损失函数 L 定义如下: + +
+ +16. **Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:** + +⟶ 梯度下降 - 记学习率为 α∈R,梯度下降的更新规则使用学习率和成本函数 J 表示如下: + +
+ +17. **Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.** + +⟶ 备注:随机梯度下降(SGD)是根据每个训练样本进行参数更新,而批量梯度下降是在一批训练样本上进行更新。 + +
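To make the update rule θ := θ − α∇J(θ) and the batch-versus-stochastic remark above concrete, here is a hedged NumPy sketch on a least-squares cost with synthetic data; the step size and iteration counts are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]      # design matrix with intercept
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + 0.1 * rng.normal(size=100)

alpha, theta = 0.1, np.zeros(3)

# Batch gradient descent: one update per pass over the whole training set
for _ in range(200):
    grad = X.T @ (X @ theta - y) / len(y)               # gradient of the least-squares cost
    theta -= alpha * grad

# Stochastic gradient descent: one update per training example
theta_sgd = np.zeros(3)
for _ in range(20):
    for i in rng.permutation(len(y)):
        theta_sgd -= alpha * (X[i] @ theta_sgd - y[i]) * X[i]

print(theta, theta_sgd)
```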
+ +18. **Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:** + +⟶ 似然 - 给定参数 θ 的模型 L(θ)的似然性用于通过最大化似然性来找到最佳参数θ。 在实践中,我们使用更容易优化的对数似然 ℓ(θ)=log(L(θ)) 。我们有 + +
+ +19. **Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:** + +⟶ 牛顿算法 - 牛顿算法是一种数值方法,目的是找到一个 θ 使得 ℓ′(θ)=0. 其更新规则如下: + +
+ +20. **Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:** + +⟶ 备注:多维泛化,也称为 Newton-Raphson 方法,具有以下更新规则: + +
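A small worked example of the update θ := θ − ℓ′(θ)/ℓ′′(θ): maximizing a Bernoulli log-likelihood, whose optimum is known to be h/(h+t). The numbers are an illustration only, not taken from the cheatsheet.

```python
# Newton's method for ℓ'(θ)=0 on ℓ(θ) = h·log(θ) + t·log(1−θ); the MLE is h/(h+t).
h, t = 7, 3              # 7 heads, 3 tails in a toy coin-flip data set
theta = 0.2              # initial guess

for _ in range(10):
    dl = h / theta - t / (1 - theta)                    # ℓ'(θ)
    d2l = -h / theta**2 - t / (1 - theta) ** 2          # ℓ''(θ)
    theta -= dl / d2l                                   # Newton update

print(theta)             # converges to 0.7
```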
+ +21. **Linear models** + +⟶ 线性模型 + +
+ +22. **Linear regression** + +⟶ 线性回归 + +
+ +23. **We assume here that y|x;θ∼N(μ,σ2)** + +⟶ 我们假设 y|x;θ∼N(μ,σ2) + +
+ +24. **Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:** + +⟶ 正规方程 - 通过设计 X 矩阵,使得最小化成本函数时 θ 有闭式解: + +
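The closed-form solution can be checked directly with NumPy on synthetic data. This is only a sketch; in practice `numpy.linalg.solve` is numerically preferable to forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]    # design matrix with intercept column
y = X @ np.array([0.5, 1.5, -2.0]) + 0.05 * rng.normal(size=50)

# Normal equations: θ = (XᵀX)⁻¹ Xᵀ y minimizes the least-squares cost
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)
```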
+ +25. **LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:** + +⟶ LMS算法 - 通过 α 学习率,训练集中 m 个数据的最小均方(LMS)算法的更新规则也称为Widrow-Hoff学习规则,如下 + +
+ +26. **Remark: the update rule is a particular case of the gradient ascent.** + +⟶ 备注:更新规则是梯度上升的特定情况。 + +
+ +27. **LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:** + +⟶ LWR - 局部加权回归,也称为LWR,是线性回归的变体,通过 w(i)(x) 对其成本函数中的每个训练样本进行加权,其中参数 τ∈R 定义为 + +
+ +28. **Classification and logistic regression** + +⟶ 分类和逻辑回归 + +
+ +29. **Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:** + +⟶ Sigmoid函数 - sigmoid 函数 g,也称为逻辑函数,定义如下: + +
+ +30. **Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:** + +⟶ 逻辑回归 - 我们假设 y|x;θ∼Bernoulli(ϕ) 。 我们有以下形式: + +
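Since there is no closed form (see the remark below), the parameters are typically found by maximizing the log-likelihood iteratively. A minimal NumPy sketch on synthetic data, using plain gradient ascent with gradient Xᵀ(y − hθ(x)); all names and constants are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]
theta_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ theta_true)).astype(float)   # Bernoulli labels

# Gradient ascent on the log-likelihood of logistic regression
theta, alpha = np.zeros(3), 0.1
for _ in range(500):
    theta += alpha * X.T @ (y - sigmoid(X @ theta)) / len(y)

print(theta)
```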
+ +31. **Remark: there is no closed form solution for the case of logistic regressions.** + +⟶ 备注:对于逻辑回归的情况,没有闭式解。 + +
+ +32. **Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:** + +⟶ Softmax回归 - 当存在超过2个结果类时,使用softmax回归(也称为多类逻辑回归)来推广逻辑回归。 按照惯例,我们设置 θK=0,使得每个类 i 的伯努利参数 ϕi 等于: + +
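A short sketch of the softmax mapping from class scores to probabilities; the max-shift is a standard numerical-stability detail added here for the example, not something stated in the cheatsheet.

```python
import numpy as np

def softmax(scores):
    """Class probabilities exp(score_i) / Σ_j exp(score_j), shifted for stability."""
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities that sum to 1
```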
+ +33. **Generalized Linear Models** + +⟶ 广义线性模型 + +
+ +34. **Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:** + +⟶ 指数分布族 - 如果可以用自然参数 η,也称为规范参数或链接函数,充分统计量 T(y) 和对数分割函数a(η)来表示,则称一类分布在指数分布族中, 函数如下: + +
+ +35. **Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.** + +⟶ 备注:我们经常会有 T(y)=y。 此外,exp(−a(η)) 可以看作是归一化参数,确保概率总和为1 + +
+ +36. **Here are the most common exponential distributions summed up in the following table:** + +⟶ 下表中是总结的最常见的指数分布: + +
+ +37. **[Distribution, Bernoulli, Gaussian, Poisson, Geometric]** + +⟶ [分布,伯努利,高斯,泊松,几何] + +
+ +38. **Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:** + +⟶ GLM的假设 - 广义线性模型(GLM)是旨在将随机变量 y 预测为 x∈Rn+1 的函数,并依赖于以下3个假设: + +
+ +39. **Remark: ordinary least squares and logistic regression are special cases of generalized linear models.** + +⟶ 备注:普通最小二乘法和逻辑回归是广义线性模型的特例 + +
+ +40. **Support Vector Machines** + +⟶ 支持向量机 + +
+ +41. **The goal of support vector machines is to find the line that maximizes the minimum distance to the line.** + +⟶ 支持向量机的目标是找到使决策界和训练样本之间最大化最小距离的线。 + +
+ +42. **Optimal margin classifier ― The optimal margin classifier h is such that:** + +⟶ 最优间隔分类器 - 最优间隔分类器 h 是这样的: + +
+ +43. **where (w,b)∈Rn×R is the solution of the following optimization problem:** + +⟶ 其中 (w,b)∈Rn×R 是以下优化问题的解决方案: + +
+ +44. **such that** + +⟶ 使得 + +
+ +45. **support vectors** + +⟶ 支持向量 + +
+ +46. **Remark: the line is defined as wTx−b=0.** + +⟶ 备注:该线定义为 wTx−b=0。 + +
+ +47. **Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:** + +⟶ 合页损失 - 合页损失用于SVM,定义如下: + +
+ +48. **Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:** + +⟶ 核 - 给定特征映射 ϕ,我们定义核 K 为: + +
+ +49. **In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.** + +⟶ 在实践中,由 K(x,z)=exp(−||x−z||22σ2) 定义的核 K 被称为高斯核,并且经常使用这种核。 + +
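Reading the flattened formula above as K(x,z) = exp(−‖x−z‖² / (2σ²)), a tiny NumPy sketch is shown below; per the kernel trick, only these values are ever needed, never the explicit mapping ϕ.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / (2 σ^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(gaussian_kernel(x, z))
```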
+ +50. **[Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]** + +⟶ [非线性可分性,核映射的使用,原始空间中的决策边界] + +
+ +51. **Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.** + +⟶ 备注:我们说我们使用“核技巧”来计算使用核的成本函数,因为我们实际上不需要知道显式映射φ,通常,这非常复杂。 相反,只需要 K(x,z) 的值。 + +
+ +52. **Lagrangian ― We define the Lagrangian L(w,b) as follows:** + +⟶ 拉格朗日 - 我们将拉格朗日 L(w,b) 定义如下: + +
+ +53. **Remark: the coefficients βi are called the Lagrange multipliers.** + +⟶ 备注:系数 βi 称为拉格朗日乘子。 + +
+ +54. **Generative Learning** + +⟶ 生成学习 + +
+ +55. **A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.** + +⟶ 生成模型首先尝试通过估计 P(x|y) 来模仿如何生成数据,然后我们可以使用贝叶斯法则来估计 P(y|x) + +
+ +56. **Gaussian Discriminant Analysis** + +⟶ 高斯判别分析 + +
+ +57. **Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:** + +⟶ 设置 - 高斯判别分析假设 y 和 x|y=0 且 x|y=1 如下: + +
+ +58. **Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:** + +⟶ 估计 - 下表总结了我们在最大化似然时的估计值: + +
+ +59. **Naive Bayes** + +⟶ 朴素贝叶斯 + +
+ +60. **Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:** + +⟶ 假设 - 朴素贝叶斯模型假设每个数据点的特征都是独立的: + +
+ +61. **Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]** + +⟶ 解决方案 - 最大化对数似然给出以下解,k∈{0,1},l∈[[1,L]] + +
+ +62. **Remark: Naive Bayes is widely used for text classification and spam detection.** + +⟶ 备注:朴素贝叶斯广泛用于文本分类和垃圾邮件检测。 + +
+ +63. **Tree-based and ensemble methods** + +⟶ 基于树的方法和集成方法 + +
+ +64. **These methods can be used for both regression and classification problems.** + +⟶ 这些方法可用于回归和分类问题。 + +
+ +65. **CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.** + +⟶ CART - 分类和回归树(CART),通常称为决策树,可以表示为二叉树。它们具有可解释性的优点。 + +
+ +66. **Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.** + +⟶ 随机森林 - 这是一种基于树模型的技术,它使用大量的由随机选择的特征集构建的决策树。 与简单的决策树相反,它是高度无法解释的,但其普遍良好的表现使其成为一种流行的算法。 + +
+ +67. **Remark: random forests are a type of ensemble methods.** + +⟶ 备注:随机森林是一种集成方法。 + +
+ +68. **Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:** + +⟶ 提升 - 提升方法的思想是将一些弱学习器结合起来形成一个更强大的学习器。 主要内容总结在下表中: + +
+ +69. **[Adaptive boosting, Gradient boosting]** + +⟶ [自适应增强, 梯度提升] + +
+ +70. **High weights are put on errors to improve at the next boosting step** + +⟶ 在下一轮提升步骤中,错误的会被置于高权重 + +
+ +71. **Weak learners trained on remaining errors** + +⟶ 弱学习器训练剩余的错误 + +
+ +72. **Other non-parametric approaches** + +⟶ 其他非参数方法 + +
+ +73. **k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.** + +⟶ k-最近邻 - k-最近邻算法,通常称为k-NN,是一种非参数方法,其中数据点的判决由来自训练集中与其相邻的k个数据的性质确定。 它可以用于分类和回归。 + +
+ +74. **Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.** + +⟶ 备注:参数 k 越高,偏差越大,参数 k 越低,方差越大。 + +
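A compact NumPy sketch of k-NN in the classification setting (majority vote among the k closest training points); `knn_predict` is a hypothetical helper written for this example and assumes Euclidean distance and integer class labels.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Majority vote among the k training points closest to x."""
    d = np.linalg.norm(X_train - x, axis=1)     # distances to all training points
    nearest = np.argsort(d)[:k]                 # indices of the k nearest neighbors
    return np.bincount(y_train[nearest]).argmax()

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2)) + np.repeat([[0, 0], [3, 3]], 50, axis=0)
y_train = np.repeat([0, 1], 50)
print(knn_predict(X_train, y_train, np.array([2.5, 2.5]), k=5))   # expected: 1
```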
+ +75. **Learning Theory** + +⟶ 学习理论 + +
+
+76. **Union bound ― Let A1,...,Ak be k events. We have:**
+
+⟶ 联合界 - 设 A1,…,Ak 为 k 个事件。我们有:
+
+<br>
+ +77. **Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:** + +⟶ Hoeffding不等式 - 设Z1,...,Zm是从参数 φ 的伯努利分布中提取的 m iid 变量。 设 φ 为其样本均值,固定 γ> 0。 我们有: + +
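The bound can be sanity-checked numerically. The small simulation below (parameters chosen only for illustration) compares the empirical frequency of |ϕ̂ − ϕ| > γ with the bound 2·exp(−2γ²m).

```python
import numpy as np

# Empirical check of Hoeffding's bound P(|ϕ − ϕ̂| > γ) ≤ 2·exp(−2γ²m) for Bernoulli samples.
rng = np.random.default_rng(0)
phi, m, gamma, trials = 0.3, 200, 0.05, 20000

phi_hat = rng.binomial(m, phi, size=trials) / m      # sample means over many repetitions
empirical = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)

print(empirical, bound)    # the empirical frequency should not exceed the bound
```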
+
+78. **Remark: this inequality is also known as the Chernoff bound.**
+
+⟶ 备注：这个不等式也被称为 Chernoff 界限。
+
+<br>
+ +79. **Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:** + +⟶ 训练误差 - 对于给定的分类器 h,我们定义训练误差 ϵ(h),也称为经验风险或经验误差,如下: + +
+ +80. **Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:** + +⟶ 可能近似正确 (PAC) - PAC是一个框架,在该框架下证明了许多学习理论的结果,并具有以下假设: + +
+ +81. **the training and testing sets follow the same distribution** + +⟶ 训练和测试集遵循相同的分布 + +
+ +82. **the training examples are drawn independently** + +⟶ 训练样本是相互独立的 + +
+ +83. **Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:** + +⟶ 打散 - 给定一个集合 S={x(1),...,x(d)} 和一组分类器 H,如果对于任意一组标签 {y(1),...,y(d)} 都能对分,我们称 H 打散 S ,我们有: + +
+ +84. **Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:** + +⟶ 上限定理 - 设 H 是有限假设类,使得 |H|=k 并且使 δ 和样本大小 m 固定。 然后,在概率至少为 1-δ 的情况下,我们得到: + +
+
+85. **VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
+
+⟶ VC维 - 给定无限假设类 H 的 Vapnik-Chervonenkis(VC) 维，记作 VC(H)，是能被 H 打散的最大集合的大小。
+
+<br>
+ +86. **Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.** + +⟶ 备注:H = {2维线性分类器集} 的 VC 维数为3。 + +
+ +87. **Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:** + +⟶ 定理 (Vapnik) - 设H,VC(H)=d ,m 为训练样本数。 概率至少为 1-δ,我们有: + +
+ +88. **[Introduction, Type of prediction, Type of model]** + +⟶ [简介,预测类型,模型类型] + +
+ +89. **[Notations and general concepts, loss function, gradient descent, likelihood]** + +⟶ [符号和一般概念,损失函数,梯度下降,似然] + +
+ +90. **[Linear models, linear regression, logistic regression, generalized linear models]** + +⟶ [线性模型,线性回归,逻辑回归,广义线性模型] + +
+ +91. **[Support vector machines, Optimal margin classifier, Hinge loss, Kernel]** + +⟶ [支持向量机,最优间隔分类器,合页损失,核] + +
+ +92. **[Generative learning, Gaussian Discriminant Analysis, Naive Bayes]** + +⟶ [生成学习,高斯判别分析,朴素贝叶斯] + +
+
+93. **[Trees and ensemble methods, CART, Random forest, Boosting]**
+
+⟶ [树和集成方法，CART，随机森林，提升]
+
+<br>
+ +94. **[Other methods, k-NN]** + +⟶ [其他方法,k-NN] + +
+ +95. **[Learning theory, Hoeffding inequality, PAC, VC dimension]** + +⟶ [学习理论,Hoeffding不等式,PAC,VC维] diff --git a/.history/zh/cs-229-supervised-learning_20191006140209.md b/.history/zh/cs-229-supervised-learning_20191006140209.md new file mode 100644 index 000000000..8f5015c49 --- /dev/null +++ b/.history/zh/cs-229-supervised-learning_20191006140209.md @@ -0,0 +1,567 @@ +1. **Supervised Learning cheatsheet** + +⟶ 监督学习简明指南 + +
+ +2. **Introduction to Supervised Learning** + +⟶ 监督学习简介 + +
+ +3. **Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.** + +⟶ 给定一组数据点 {x(1),...,x(m)} 和与其对应的输出 {y(1),...,y(m)} , 我们想要建立一个分类器, 学习如何从 x 预测 y。 + +
+ +4. **Type of prediction ― The different types of predictive models are summed up in the table below:** + +⟶ 预测类型 - 不同类型的预测模型总结如下表: + +
+ +5. **[Regression, Classifier, Outcome, Examples]** + +⟶ [回归, 分类, 输出, 例子] + +
+ +6. **[Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]** + +⟶ [连续, 类, 线性回归, Logistic回归, SVM, 朴素贝叶斯] + +
+
+7. **Type of model ― The different models are summed up in the table below:**
+
+⟶ 模型类型 - 不同模型总结如下表:
+
+<br>
+ +8. **[Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]** + +⟶ [判别模型, 生成模型, 目标, 所学内容, 例图, 示例] + +
+ +9. **[Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]** + +⟶ [直接估计P(y|x), 估计P(x|y) 然后推导 P(y|x), 决策边界, 数据的概率分布, 回归, SVMs, GDA, 朴素贝叶斯] + +
+ +10. **Notations and general concepts** + +⟶ 符号和一般概念 + +
+ +11. **Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).** + +⟶ 假设 - 假设我们选择的模型是hθ 。 对于给定的输入数据 x(i), 模型预测输出是 hθ(x(i))。 + +
+ +12. **Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:** + +⟶ 损失函数 - 损失函数是一个 L:(z,y)∈R×Y⟼L(z,y)∈R 的函数, 其将真实数据值 y 和其预测值 z 作为输入, 输出它们的不同程度。 常见的损失函数总结如下表: + +
+ +13. **[Least squared error, Logistic loss, Hinge loss, Cross-entropy]** + +⟶ [最小二乘误差, Logistic损失, 铰链损失, 交叉熵] + +
+ +14. **[Linear regression, Logistic regression, SVM, Neural Network]** + +⟶ [线性回归, Logistic回归, SVM, 神经网络] + +
+ +15. **Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:** + +⟶ 成本函数 - 成本函数 J 通常用于评估模型的性能, 使用损失函数 L 定义如下: + +
+ +16. **Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:** + +⟶ 梯度下降 - 记学习率为 α∈R, 梯度下降的更新规则使用学习率和成本函数 J 表示如下: + +
+ +17. **Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.** + +⟶ 备注:随机梯度下降(SGD)是根据每个训练样本进行参数更新, 而批量梯度下降是在一批训练样本上进行更新。 + +
+ +18. **Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:** + +⟶ 似然 - 给定参数 θ 的模型 L(θ)的似然性用于通过最大化似然性来找到最佳参数θ。 在实践中, 我们使用更容易优化的对数似然 ℓ(θ)=log(L(θ)) 。我们有 + +
+ +19. **Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:** + +⟶ 牛顿算法 - 牛顿算法是一种数值方法, 目的是找到一个 θ 使得 ℓ′(θ)=0. 其更新规则如下: + +
+ +20. **Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:** + +⟶ 备注:多维泛化, 也称为 Newton-Raphson 方法, 具有以下更新规则: + +
+ +21. **Linear models** + +⟶ 线性模型 + +
+ +22. **Linear regression** + +⟶ 线性回归 + +
+ +23. **We assume here that y|x;θ∼N(μ,σ2)** + +⟶ 我们假设 y|x;θ∼N(μ,σ2) + +
+ +24. **Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:** + +⟶ 正规方程 - 通过设计 X 矩阵, 使得最小化成本函数时 θ 有闭式解: + +
+ +25. **LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:** + +⟶ LMS算法 - 通过 α 学习率, 训练集中 m 个数据的最小均方(LMS)算法的更新规则也称为Widrow-Hoff学习规则, 如下 + +
+ +26. **Remark: the update rule is a particular case of the gradient ascent.** + +⟶ 备注:更新规则是梯度上升的特定情况。 + +
+ +27. **LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:** + +⟶ LWR - 局部加权回归, 也称为LWR, 是线性回归的变体, 通过 w(i)(x) 对其成本函数中的每个训练样本进行加权, 其中参数 τ∈R 定义为 + +
+ +28. **Classification and logistic regression** + +⟶ 分类和逻辑回归 + +
+ +29. **Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:** + +⟶ Sigmoid函数 - sigmoid 函数 g, 也称为逻辑函数, 定义如下: + +
+ +30. **Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:** + +⟶ 逻辑回归 - 我们假设 y|x;θ∼Bernoulli(ϕ) 。 我们有以下形式: + +
+ +31. **Remark: there is no closed form solution for the case of logistic regressions.** + +⟶ 备注:对于逻辑回归的情况, 没有闭式解。 + +
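Since there is no closed-form solution, the parameters are typically found iteratively; below is a minimal gradient-ascent sketch on the log-likelihood (editorial addition, with placeholder hyperparameters).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, n_epochs=1000):
    """Fit theta by gradient ascent on the log-likelihood (no closed form exists)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        grad = X.T @ (y - sigmoid(X @ theta))   # gradient of the log-likelihood
        theta += alpha * grad / m
    return theta
```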
+ +32. **Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:** + +⟶ Softmax回归 - 当存在超过2个结果类时, 使用softmax回归(也称为多类逻辑回归)来推广逻辑回归。 按照惯例, 我们设置 θK=0, 使得每个类 i 的伯努利参数 ϕi 等于: + +
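A small sketch of the class probabilities under the θK=0 convention (editorial addition); `thetas` is assumed to hold θ1,…,θK−1 as rows.

```python
import numpy as np

def softmax_probabilities(thetas, x):
    """phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x), with theta_K fixed to 0."""
    thetas = np.vstack([thetas, np.zeros(thetas.shape[1])])  # append theta_K = 0
    scores = thetas @ x
    scores -= scores.max()                                   # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```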
+ +33. **Generalized Linear Models** + +⟶ 广义线性模型 + +
+ +34. **Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:** + +⟶ 指数分布族 - 如果一类分布可以用自然参数 η (也称为规范参数或链接函数)、充分统计量 T(y) 和对数配分函数 a(η) 写成如下形式, 则称其属于指数分布族: + +
+ +35. **Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.** + +⟶ 备注:我们经常会有 T(y)=y。 此外, exp(−a(η)) 可以看作是归一化参数, 确保概率总和为1 + +
+ +36. **Here are the most common exponential distributions summed up in the following table:** + +⟶ 下表中是总结的最常见的指数分布: + +
+ +37. **[Distribution, Bernoulli, Gaussian, Poisson, Geometric]** + +⟶ [分布, 伯努利, 高斯, 泊松, 几何] + +
+ +38. **Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:** + +⟶ GLM的假设 - 广义线性模型(GLM)是旨在将随机变量 y 预测为 x∈Rn+1 的函数, 并依赖于以下3个假设: + +
+ +39. **Remark: ordinary least squares and logistic regression are special cases of generalized linear models.** + +⟶ 备注:普通最小二乘法和逻辑回归是广义线性模型的特例 + +
+ +40. **Support Vector Machines** + +⟶ 支持向量机 + +
+ +41. **The goal of support vector machines is to find the line that maximizes the minimum distance to the line.** + +⟶ 支持向量机的目标是找到使决策界和训练样本之间最大化最小距离的线。 + +
+ +42. **Optimal margin classifier ― The optimal margin classifier h is such that:** + +⟶ 最优间隔分类器 - 最优间隔分类器 h 是这样的: + +
+ +43. **where (w,b)∈Rn×R is the solution of the following optimization problem:** + +⟶ 其中 (w,b)∈Rn×R 是以下优化问题的解决方案: + +
+ +44. **such that** + +⟶ 使得 + +
+ +45. **support vectors** + +⟶ 支持向量 + +
+ +46. **Remark: the line is defined as wTx−b=0.** + +⟶ 备注:该线定义为 wTx−b=0。 + +
+ +47. **Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:** + +⟶ 合页损失 - 合页损失用于SVM, 定义如下: + +
+ +48. **Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:** + +⟶ 核 - 给定特征映射 ϕ, 我们定义核 K 为: + +
+ +49. **In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.** + +⟶ 在实践中, 由 K(x,z)=exp(−||x−z||²/(2σ²)) 定义的核 K 被称为高斯核, 这种核很常用。 + +
+ +50. **[Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]** + +⟶ [非线性可分性, 核映射的使用, 原始空间中的决策边界] + +
+ +51. **Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.** + +⟶ 备注:我们之所以说使用“核技巧”来计算成本函数, 是因为我们实际上不需要知道通常非常复杂的显式映射 ϕ, 而只需要 K(x,z) 的值。 + +
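To make the kernel trick concrete, here is a short sketch (editorial addition) of the Gaussian kernel and of a kernel (Gram) matrix built purely from K(x,z) values, without ever forming the mapping ϕ.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

def kernel_matrix(X, kernel=gaussian_kernel):
    """Kernel trick: only K(x, z) values are needed, never the explicit mapping phi."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
```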
+ +52. **Lagrangian ― We define the Lagrangian L(w,b) as follows:** + +⟶ 拉格朗日 - 我们将拉格朗日 L(w,b) 定义如下: + +
+ +53. **Remark: the coefficients βi are called the Lagrange multipliers.** + +⟶ 备注:系数 βi 称为拉格朗日乘子。 + +
+ +54. **Generative Learning** + +⟶ 生成学习 + +
+ +55. **A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.** + +⟶ 生成模型首先尝试通过估计 P(x|y) 来学习数据是如何生成的, 然后我们可以使用贝叶斯法则来估计 P(y|x)。 + +
+ +56. **Gaussian Discriminant Analysis** + +⟶ 高斯判别分析 + +
+ +57. **Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:** + +⟶ 设置 - 高斯判别分析假设 y 和 x|y=0 且 x|y=1 如下: + +
+ +58. **Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:** + +⟶ 估计 - 下表总结了我们在最大化似然时的估计值: + +
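The maximum-likelihood estimates in the table can be computed directly; a hedged NumPy sketch follows (editorial addition), assuming binary labels `y` in {0,1}.

```python
import numpy as np

def gda_fit(X, y):
    """MLE for Gaussian Discriminant Analysis: prior phi, class means mu0/mu1,
    and a covariance Sigma shared by both classes."""
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)
    sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, sigma
```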
+ +59. **Naive Bayes** + +⟶ 朴素贝叶斯 + +
+ +60. **Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:** + +⟶ 假设 - 朴素贝叶斯模型假设每个数据点的特征都是独立的: + +
+ +61. **Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]** + +⟶ 解决方案 - 最大化对数似然给出以下解, k∈{0,1}, l∈[[1,L]] + +
+ +62. **Remark: Naive Bayes is widely used for text classification and spam detection.** + +⟶ 备注:朴素贝叶斯广泛用于文本分类和垃圾邮件检测。 + +
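A minimal Bernoulli Naive Bayes sketch in the spirit of text/spam classification (editorial addition; the Laplace +1/+2 smoothing is a common practical touch, not part of the solutions above).

```python
import numpy as np

def naive_bayes_fit(X, y):
    """Estimate P(y=k) and P(x_j=1 | y=k) from a binary (m, n) matrix X and 0/1 labels y,
    assuming the features are conditionally independent given the class."""
    priors, likelihoods = {}, {}
    for k in (0, 1):
        Xk = X[y == k]
        priors[k] = len(Xk) / len(X)
        likelihoods[k] = (Xk.sum(axis=0) + 1) / (len(Xk) + 2)  # Laplace smoothing
    return priors, likelihoods

def naive_bayes_predict(x, priors, likelihoods):
    """Pick the class maximizing log P(y=k) + sum_j log P(x_j | y=k)."""
    scores = {}
    for k in (0, 1):
        p = likelihoods[k]
        scores[k] = np.log(priors[k]) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(scores, key=scores.get)
```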
+ +63. **Tree-based and ensemble methods** + +⟶ 基于树的方法和集成方法 + +
+ +64. **These methods can be used for both regression and classification problems.** + +⟶ 这些方法可用于回归和分类问题。 + +
+ +65. **CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.** + +⟶ CART - 分类和回归树(CART), 通常称为决策树, 可以表示为二叉树。它们具有可解释性的优点。 + +
+ +66. **Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.** + +⟶ 随机森林 - 这是一种基于树模型的技术, 它使用大量由随机选择的特征集构建的决策树。与简单的决策树相反, 其可解释性较差, 但普遍良好的性能使其成为一种流行的算法。 + +
+ +67. **Remark: random forests are a type of ensemble methods.** + +⟶ 备注:随机森林是一种集成方法。 + +
+ +68. **Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:** + +⟶ 提升 - 提升方法的思想是将一些弱学习器结合起来形成一个更强大的学习器。 主要内容总结在下表中: + +
+ +69. **[Adaptive boosting, Gradient boosting]** + +⟶ [自适应增强, 梯度提升] + +
+ +70. **High weights are put on errors to improve at the next boosting step** + +⟶ 对分类错误的样本赋予较高权重, 以便在下一轮提升步骤中加以改进 + +
+ +71. **Weak learners trained on remaining errors** + +⟶ 在剩余的误差上训练弱学习器 + +
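To illustrate "high weights are put on errors at the next boosting step", here is a compact adaptive-boosting sketch with decision stumps (editorial addition; labels are assumed to be in {−1,+1} and the stump fitter is a deliberately naive illustrative helper).

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: threshold a single feature to minimize weighted error."""
    best_err, best_j, best_thr, best_sign = np.inf, 0, 0.0, 1
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best_err, best_j, best_thr, best_sign = err, j, thr, sign
    pred = np.where(X[:, best_j] >= best_thr, best_sign, -best_sign)
    return (best_j, best_thr, best_sign), pred

def adaboost_fit(X, y, n_rounds=10):
    """Adaptive boosting: misclassified examples get larger weights in the next round."""
    m = len(y)
    w = np.full(m, 1.0 / m)                     # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        stump, pred = fit_stump(X, y, w)
        err = np.clip(np.sum(w[pred != y]), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak learner
        w *= np.exp(-alpha * y * pred)          # up-weight errors, down-weight correct ones
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble
```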
+ +72. **Other non-parametric approaches** + +⟶ 其他非参数方法 + +
+ +73. **k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.** + +⟶ k-最近邻 - k-最近邻算法, 通常称为k-NN, 是一种非参数方法, 其中数据点的判决由来自训练集中与其相邻的k个数据的性质确定。 它可以用于分类和回归。 + +
+ +74. **Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.** + +⟶ 备注:参数 k 越高, 偏差越大, 参数 k 越低, 方差越大。 + +
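A short k-NN sketch (editorial addition) that decides the label of a query point by majority vote of its k nearest training points; per the remark above, larger k gives higher bias and smaller k gives higher variance.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Majority vote among the k closest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]
```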
+ +75. **Learning Theory** + +⟶ 学习理论 + +
+ +76. **Union bound ― Let A1,...,Ak be k events. We have:** + +⟶ 联合界 (Union bound) - 设 A1,...,Ak 为 k 个事件, 我们有: + +
+ +77. **Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:** + +⟶ Hoeffding不等式 - 设 Z1,...,Zm 是从参数为 ϕ 的伯努利分布中抽取的 m 个独立同分布 (iid) 变量。设 ˆϕ 为其样本均值, 并固定 γ>0。我们有: + +
+ +78. **Remark: this inequality is also known as the Chernoff bound.** + +⟶ 备注:这个不等式也被称为 Chernoff 界。 + +
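The bound can be sanity-checked numerically; below is a small simulation sketch (editorial addition) comparing the empirical tail probability with 2exp(−2γ²m) for Bernoulli samples.

```python
import numpy as np

def hoeffding_check(phi=0.3, m=500, gamma=0.05, n_trials=100_000, seed=0):
    """Compare P(|phi_hat - phi| > gamma) estimated by simulation with the
    Hoeffding/Chernoff bound 2 * exp(-2 * gamma^2 * m)."""
    rng = np.random.default_rng(seed)
    phi_hat = rng.binomial(m, phi, size=n_trials) / m   # sample means of m Bernoulli draws
    empirical = np.mean(np.abs(phi_hat - phi) > gamma)
    bound = 2 * np.exp(-2 * gamma ** 2 * m)
    return empirical, bound
```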
+ +79. **Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:** + +⟶ 训练误差 - 对于给定的分类器 h, 我们定义训练误差 ˆϵ(h) (也称为经验风险或经验误差) 如下: + +
+ +80. **Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:** + +⟶ 可能近似正确 (PAC) - PAC是一个框架, 在该框架下证明了许多学习理论的结果, 并具有以下假设: + +
+ +81. **the training and testing sets follow the same distribution** + +⟶ 训练和测试集遵循相同的分布 + +
+ +82. **the training examples are drawn independently** + +⟶ 训练样本是相互独立的 + +
+ +83. **Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:** + +⟶ 打散 - 给定一个集合 S={x(1),...,x(d)} 和一组分类器 H, 如果对于任意一组标签 {y(1),...,y(d)} 都能对分, 我们称 H 打散 S , 我们有: + +
+ +84. **Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:** + +⟶ 上限定理 - 设 H 是有限假设类, 使得 |H|=k 并且使 δ 和样本大小 m 固定。 然后, 在概率至少为 1-δ 的情况下, 我们得到: + +
+ +85. **VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.** + +⟶ VC维 - 给定无限假设类 H 的 Vapnik-Chervonenkis (VC) 维, 记作 VC(H), 是能被 H 打散的最大集合的大小。 + +
+ +86. **Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.** + +⟶ 备注:H = {2维线性分类器集} 的 VC 维数为3。 + +
+ +87. **Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:** + +⟶ 定理 (Vapnik) - 给定假设类 H, 其中 VC(H)=d, m 为训练样本数。则在概率至少为 1−δ 的情况下, 我们有: + +
+ +88. **[Introduction, Type of prediction, Type of model]** + +⟶ [简介, 预测类型, 模型类型] + +
+ +89. **[Notations and general concepts, loss function, gradient descent, likelihood]** + +⟶ [符号和一般概念, 损失函数, 梯度下降, 似然] + +
+ +90. **[Linear models, linear regression, logistic regression, generalized linear models]** + +⟶ [线性模型, 线性回归, 逻辑回归, 广义线性模型] + +
+ +91. **[Support vector machines, Optimal margin classifier, Hinge loss, Kernel]** + +⟶ [支持向量机, 最优间隔分类器, 合页损失, 核] + +
+ +92. **[Generative learning, Gaussian Discriminant Analysis, Naive Bayes]** + +⟶ [生成学习, 高斯判别分析, 朴素贝叶斯] + +
+ +93. **[Trees and ensemble methods, CART, Random forest, Boosting]** + +⟶ [树和集成方法, CART, 随机森林, 提升] + +
+ +94. **[Other methods, k-NN]** + +⟶ [其他方法, k-NN] + +
+ +95. **[Learning theory, Hoeffding inequality, PAC, VC dimension]** + +⟶ [学习理论, Hoeffding不等式, PAC, VC维] diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191005222943.md b/.history/zh/cs-230-recurrent-neural-networks_20191005222943.md new file mode 100644 index 000000000..ea2cbb6c2 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191005222943.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
**循环神经网络简明指南** + + +**2. CS 230 - Deep Learning** + +⟶ + +
**CS 230 - 深度学习** + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
**[概述, 网络结构, RNN的应用, 损失函数, 反向传播]** + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
**[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN]** + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
**[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe]** + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
**[词比较, 余弦相似度, t-SNE]** + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
**[语言模型, n-gram, 困惑度]** + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
**[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数]** + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
**[注意力机制, 注意力模型, 注意力权重]** + + +**10. Overview** + +⟶ + +
**概述** + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
**传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式:** + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
**对于每一个时间步t,激活值a和输出y可表示如下:** + + +**13. and** + +⟶ + +
**并且** + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
**其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相关的激活函数。** + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
**[一个典型的RNN体系结构的优点和缺点可概括如下表:]** + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
**[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享]** + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
**[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响]** + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
**RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景:** + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
**[RNN的类型, 图形表示, 示例]** + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
**[一对一, 一对多, 多对一, 多对多]** + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
**[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译]** + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
**损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下:** + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
**随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下:** + + +**24. Handling long term dependencies** + +⟶ + +
**解决长时间依赖问题** + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
**常用的激活函数 - 在RNN模型中常用的激活函数如下所示:** + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
**[Sigmoid, Tanh, RELU]** + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
**梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。** + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
**梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。** + + +**29. clipped** + +⟶ + +
**裁剪** + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
**门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ:** + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
**其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下:** + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
**[门类型, 角色, 被用于]** + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
**[更新门, 关联门, 遗忘门, 输出门]** + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
**[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]** + + +**35. [LSTM, GRU]** + +⟶ + +
**[LSTM, GRU]** + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
+ + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006135323.md b/.history/zh/cs-230-recurrent-neural-networks_20191006135323.md new file mode 100644 index 000000000..2720352c0 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006135323.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
+ + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006135801.md b/.history/zh/cs-230-recurrent-neural-networks_20191006135801.md new file mode 100644 index 000000000..395fcbbc4 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006135801.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中 LSTM 是 GRU 的一种推广。下表总结了每种结构的特征方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
+ + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006135826.md b/.history/zh/cs-230-recurrent-neural-networks_20191006135826.md new file mode 100644 index 000000000..7ce0567a2 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006135826.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是一种用于应对反向传播过程中有时出现的梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +<br>
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 这些门通常有明确的目的。它们通常记作Γ, 其值等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
+ + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
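A tiny NumPy sketch of this mapping, with an arbitrary toy vocabulary size and embedding dimension (both made up here); multiplying E by the 1-hot vector ow simply selects one column of E, which is the embedding ew:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim = 10, 4                  # toy sizes, chosen arbitrarily
E = rng.normal(size=(emb_dim, vocab_size))   # embedding matrix (emb_dim x |V|)

w = 7                                        # index of the word in the vocabulary
o_w = np.zeros(vocab_size)
o_w[w] = 1.0                                 # 1-hot representation o_w

e_w = E @ o_w                                # embedding via the matrix-vector product
assert np.allclose(e_w, E[:, w])             # equivalent to reading column w of E
```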
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
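A minimal NumPy sketch of this softmax probability, with made-up parameter vectors θ (one per target word) and made-up context embeddings; it only illustrates the formula, not the training loop, and the full-vocabulary sum in the denominator is why the remark below calls the model expensive:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 1000, 50

theta = rng.normal(size=(vocab_size, emb_dim))  # one parameter vector θ_t per target word
E = rng.normal(size=(vocab_size, emb_dim))      # context embeddings e_c

def p_target_given_context(t, c):
    """Softmax probability P(t|c) = exp(θ_t · e_c) / Σ_j exp(θ_j · e_c)."""
    scores = theta @ E[c]                       # θ_j · e_c for every word j in the vocabulary
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[t]

print(p_target_given_context(t=3, c=42))
```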
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
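A one-line check of this prediction, assuming a made-up parameter vector θt for the target word and a made-up embedding ec for the context word; the output is σ(θt·ec), the predicted probability that (c, t) is a true context/target pair:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
emb_dim = 50
theta_t = rng.normal(size=emb_dim)   # made-up parameter vector for the target word t
e_c = rng.normal(size=emb_dim)       # made-up embedding of the context word c

print(sigmoid(theta_t @ e_c))        # P(y=1 | c, t)
```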
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
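A short NumPy sketch of this formula on two made-up embedding vectors; values close to 1 indicate similar words:

```python
import numpy as np

def cosine_similarity(e1, e2):
    """cos(θ) between two word embeddings: e1·e2 / (||e1|| ||e2||)."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

e_king, e_queen = np.array([0.9, 0.1, 0.4]), np.array([0.85, 0.2, 0.35])
print(cosine_similarity(e_king, e_queen))   # close to 1 for similar words
```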
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
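A small Python sketch of this metric, computed in log space for numerical stability from made-up per-token probabilities; lower is better:

```python
import numpy as np

def perplexity(token_probs):
    """PP = (Π_t 1/P(y_t))^(1/T), computed here as exp(-mean log P(y_t))."""
    token_probs = np.asarray(token_probs)
    return float(np.exp(-np.mean(np.log(token_probs))))

# made-up per-token probabilities assigned by some language model
print(perplexity([0.2, 0.5, 0.1, 0.4]))   # ≈ 3.98, lower is better
```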
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
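A toy Python sketch of these three steps, assuming a hypothetical next_token_probs(prefix) model that returns a distribution over a tiny vocabulary; hypotheses are ranked by cumulative log-probability and a hypothesis stops being expanded once it emits the stop word:

```python
import numpy as np

def beam_search(next_token_probs, vocab, B=3, max_len=5, stop_token="<eos>"):
    """Toy beam search: keep the B most likely partial sentences at each step."""
    beams = [([], 0.0)]                              # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == stop_token:
                candidates.append((tokens, score))   # finished hypothesis is kept as is
                continue
            probs = next_token_probs(tokens)
            for i in np.argsort(probs)[::-1][:B]:    # expand only the B best continuations
                candidates.append((tokens + [vocab[i]], score + np.log(probs[i])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams

# tiny hand-made example: the "model" prefers "a" first, then the stop word
vocab = ["a", "b", "<eos>"]
model = lambda prefix: np.array([0.2, 0.1, 0.7]) if prefix else np.array([0.6, 0.3, 0.1])
print(beam_search(model, vocab, B=2))
```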
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
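A small Python sketch of this normalized objective, with the softener α and some made-up per-token log-probabilities; dividing by Ty^α penalizes long hypotheses less harshly than a plain sum of log-probabilities would:

```python
import numpy as np

def normalized_log_likelihood(token_log_probs, alpha=0.7):
    """Objective (1/Ty^α) · Σ_t log p(y<t> | x, y<1>, ..., y<t-1>) used to rescore hypotheses."""
    token_log_probs = np.asarray(token_log_probs)
    Ty = len(token_log_probs)
    return float(token_log_probs.sum() / (Ty ** alpha))

# made-up per-token log-probabilities of two candidate translations
print(normalized_log_likelihood([-0.1, -0.3, -0.2]))
print(normalized_log_likelihood([-0.1, -0.3, -0.2, -0.05, -0.05]))
```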
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
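A short Python sketch of the clipped n-gram precision pn on a toy candidate/reference pair; it assumes a single reference and ignores the brevity penalty, both of which are simplifications of the full metric:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision p_n between a candidate and a single reference."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    if not cand:
        return 0.0
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand_counts = Counter(cand)
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / len(cand)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print([ngram_precision(cand, ref, n) for n in (1, 2)])   # [0.833..., 0.6]
```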
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
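A compact NumPy sketch of this computation on made-up attention scores and encoder activations: the weights α⟨t,t'⟩ are the softmax of the scores, and the context c⟨t⟩ is the α-weighted sum of the activations a⟨t'⟩:

```python
import numpy as np

def attention_context(scores, activations):
    """α = softmax over the scores e<t,t'>, context c<t> = Σ_t' α<t,t'> · a<t'>."""
    scores = np.asarray(scores, dtype=float)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                          # attention weights sum to 1
    return alpha, alpha @ np.asarray(activations)

# made-up scores over Tx=3 encoder activations of dimension 2
alpha, c = attention_context([1.2, 0.3, -0.5], [[0.1, 0.9], [0.4, 0.2], [0.7, 0.5]])
print(alpha, c)
```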
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006140028.md b/.history/zh/cs-230-recurrent-neural-networks_20191006140028.md new file mode 100644 index 000000000..32f2d848d --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006140028.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间维度上被整个网络共享的系数;g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是一种用于应对反向传播过程中有时出现的梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +<br>
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 这些门通常有明确的目的。它们通常记作Γ, 其值等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
+ + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006140054.md b/.history/zh/cs-230-recurrent-neural-networks_20191006140054.md new file mode 100644 index 000000000..7c2a197e8 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006140054.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间维度上被整个网络共享的系数;g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是一种用于应对反向传播过程中有时出现的梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +<br>
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 这些门通常有明确的目的。它们通常记作Γ, 其值等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
+ + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006140209.md b/.history/zh/cs-230-recurrent-neural-networks_20191006140209.md new file mode 100644 index 000000000..7c2a197e8 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006140209.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间维度上被整个网络共享的系数;g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是一种用于应对反向传播过程中有时出现的梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +<br>
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 这些门通常有明确的目的。它们通常记作Γ, 其值等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
+ + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006140226.md b/.history/zh/cs-230-recurrent-neural-networks_20191006140226.md new file mode 100644 index 000000000..7c2a197e8 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006140226.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间维度上被整个网络共享的系数;g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
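As a minimal sketch, one common variant clips the gradients by their global norm; the threshold max_norm below is an arbitrary value chosen for illustration (element-wise capping of each entry is another option).

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# Example: two exploding gradient arrays get rescaled to a global norm of 5
clipped = clip_gradients([np.full(3, 100.0), np.full(2, -80.0)])
```
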
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
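As a small sketch, a gate of this form is just a sigmoid of an affine combination of the current input and the previous activation; the argument shapes are left implicit and are assumptions of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x_t, a_prev, W, U, b):
    """Gamma = sigmoid(W x<t> + U a<t-1> + b): a vector of values in (0, 1)."""
    return sigmoid(W @ x_t + U @ a_prev + b)
```
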
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常记作Γ, 其表达式如下:


**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**

⟶

<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]


**35. [LSTM, GRU]**

⟶

<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
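As an illustration, the sketch below wires the update gate Γu and the relevance gate Γr into one GRU step; the parameter layout (a dict of per-gate W, U, b triples) and the tanh candidate activation are assumptions of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, a_prev, params):
    """One GRU step: c~ = tanh(Wc x + Uc (Gamma_r * a_prev) + bc), c = Gamma_u * c~ + (1 - Gamma_u) * a_prev."""
    Wu, Uu, bu = params["u"]                                     # update gate parameters
    Wr, Ur, br = params["r"]                                     # relevance gate parameters
    Wc, Uc, bc = params["c"]                                     # candidate parameters
    gamma_u = sigmoid(Wu @ x_t + Uu @ a_prev + bu)               # update gate
    gamma_r = sigmoid(Wr @ x_t + Ur @ a_prev + br)               # relevance gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ (gamma_r * a_prev) + bc)   # candidate state
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * a_prev           # mix of new and previous state
    return c_t                                                   # for the GRU, a<t> = c<t>
```
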
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:


**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**

⟶

<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]


**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
+ + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
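As a toy sketch of this mapping, with E stored so that ew = E ow, the multiplication simply selects the column of E associated with w; the vocabulary and the embedding size below are assumptions made for the example.

```python
import numpy as np

vocab = ["teddy", "bear", "book", "soft"]                  # toy vocabulary (assumed)
E = np.random.default_rng(0).normal(size=(3, len(vocab)))  # embedding matrix, one column per word

def one_hot(word):
    o_w = np.zeros(len(vocab))
    o_w[vocab.index(word)] = 1.0
    return o_w

e_w = E @ one_hot("book")                                  # e_w = E o_w
assert np.allclose(e_w, E[:, vocab.index("book")])         # same as picking the column directly
```
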
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
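A sketch of this softmax probability, assuming θ is stored as a matrix with one row θj per vocabulary word and e_c is the embedding of the context word:

```python
import numpy as np

def p_target_given_context(theta, e_c, t):
    """P(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c), summed over the whole vocabulary."""
    scores = theta @ e_c                           # one score per vocabulary word
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[t]
```
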
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
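The corresponding binary prediction for one (context, target) pair reduces to a single sigmoid; the training loop over k negative examples and 1 positive example is omitted in this sketch.

```python
import numpy as np

def p_positive(theta_t, e_c):
    """P(y = 1 | c, t) = sigmoid(theta_t . e_c)."""
    return 1.0 / (1.0 + np.exp(-theta_t @ e_c))
```
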
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
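A direct sketch of this formula applied to two embedding vectors:

```python
import numpy as np

def cosine_similarity(e_w1, e_w2):
    """cos(theta) = (w1 . w2) / (||w1|| ||w2||), a value in [-1, 1]."""
    return float(e_w1 @ e_w2 / (np.linalg.norm(e_w1) * np.linalg.norm(e_w2)))
```
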
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
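A sketch computing PP from the probability the model assigns to each of the T observed words, evaluated in log space for stability; the example probabilities are made up.

```python
import numpy as np

def perplexity(word_probs):
    """PP = (prod_t 1/p_t)^(1/T), computed as exp(-mean(log p_t)); lower is better."""
    word_probs = np.asarray(word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(word_probs))))

pp = perplexity([0.2, 0.5, 0.1, 0.4])   # probabilities of the T = 4 observed words
```
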
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
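A compact sketch of these steps; next_step_probs is a hypothetical callable returning a dict of next-word probabilities given the input x and the words decoded so far, and scores are normalized by length^α as in the normalized objective described further below.

```python
import math

def beam_search(x, next_step_probs, vocab, B=3, max_len=20, alpha=0.7, stop="<eos>"):
    """Keep the B best partial sentences at each step, scored by length-normalized log-likelihood."""
    beams = [([], 0.0)]                                  # (words so far, sum of log-probabilities)
    for _ in range(max_len):
        candidates = []
        for words, logp in beams:
            if words and words[-1] == stop:              # finished sentences are carried over
                candidates.append((words, logp))
                continue
            probs = next_step_probs(x, words)            # hypothetical model call
            for w in vocab:
                candidates.append((words + [w], logp + math.log(probs[w])))
        candidates.sort(key=lambda c: c[1] / (len(c[0]) ** alpha), reverse=True)
        beams = candidates[:B]                           # keep the top B combinations
        if all(ws[-1] == stop for ws, _ in beams):       # stop once every beam has ended
            break
    return beams[0][0]
```
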
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
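A simplified sketch of pn (modified n-gram precision against a single reference) and of the geometric mean over n = 1..4; the brevity penalty mentioned in the remark below is omitted here.

```python
import math
from collections import Counter

def ngram_precision(candidate, reference, n):
    """p_n: fraction of the candidate's n-grams matched (with clipped counts) in the reference."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    matched = sum(min(count, ref[g]) for g, count in cand.items())
    return matched / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=4):
    """Geometric mean of p_1..p_max_n (brevity penalty left out in this sketch)."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

# The short candidate scores 1.0 here because every one of its n-grams matches;
# the brevity penalty of the full BLEU definition would lower this.
score = bleu("a cute teddy bear is reading".split(),
             "a cute teddy bear is reading Persian literature".split())
```
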
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
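A sketch of the softmax over attention scores and of the resulting context vector; how the scores e are produced (for example by a small dense layer) is left out, and the shapes in the toy example are assumptions.

```python
import numpy as np

def attention(scores, activations):
    """alpha = softmax(scores); context c = sum_t' alpha<t'> a<t'>."""
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, summing to 1
    context = alpha @ activations                  # weighted sum of the activations a<t'>
    return alpha, context

# Toy example: 4 encoder activations of size 3
a = np.random.default_rng(0).normal(size=(4, 3))
alpha, c = attention(np.array([0.1, 2.0, -1.0, 0.5]), a)
```
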
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006193030.md b/.history/zh/cs-230-recurrent-neural-networks_20191006193030.md new file mode 100644 index 000000000..503782043 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006193030.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常记作Γ, 其表达式如下:


**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**

⟶

<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]


**35. [LSTM, GRU]**

⟶

<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:


**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**

⟶

<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]


**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
+ + +**43. Motivation and notations** + +⟶ + +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006193242.md b/.history/zh/cs-230-recurrent-neural-networks_20191006193242.md new file mode 100644 index 000000000..9621bdcc5 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006193242.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常记作Γ, 其表达式如下:


**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**

⟶

<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]


**35. [LSTM, GRU]**

⟶

<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:


**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**

⟶

<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]


**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
+ + +**46. [teddy bear, book, soft]** + +⟶ + +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006193413.md b/.history/zh/cs-230-recurrent-neural-networks_20191006193413.md new file mode 100644 index 000000000..01e9a1cde --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006193413.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常记作Γ, 其表达式如下:


**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**

⟶

<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]


**35. [LSTM, GRU]**

⟶

<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:


**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**

⟶

<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]


**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006193533.md b/.history/zh/cs-230-recurrent-neural-networks_20191006193533.md new file mode 100644 index 000000000..7cdac85f9 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006193533.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
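
A minimal sketch of gradient clipping as described above, assuming the gradient is a plain NumPy array and using an arbitrary cap c=1; clipping by global norm is a common variant of the same idea.

```python
import numpy as np

def clip_gradient(grad, c=1.0):
    """Element-wise clipping: every component of the gradient is capped to the range [-c, c]."""
    return np.clip(grad, -c, c)

# A gradient that exploded to +/-50 is capped back to +/-1
g = np.array([50.0, -50.0, 0.3])
print(clip_gradient(g))   # [ 1.  -1.   0.3]
```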
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了缓解梯度消失问题, 在某些类型的RNN中使用了特定的门, 这些门通常有明确的作用。它们通常记作Γ, 其值等于:
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
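
To make the gate equations above concrete, here is a hypothetical single GRU step in NumPy. It uses the gate form Γ = σ(Wx + Ua + b) from the definition above; separate input/recurrent matrices replace the concatenated notation, and all toy sizes and initializations are assumptions for the demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, a_prev, params):
    """One GRU step: relevance gate Γr, update gate Γu, candidate state, interpolation."""
    Wr, Ur, br = params["r"]
    Wu, Uu, bu = params["u"]
    Wc, Uc, bc = params["c"]
    gamma_r = sigmoid(Wr @ x_t + Ur @ a_prev + br)              # relevance gate: drop previous information?
    gamma_u = sigmoid(Wu @ x_t + Uu @ a_prev + bu)              # update gate: how much past matters now?
    c_tilde = np.tanh(Wc @ x_t + Uc @ (gamma_r * a_prev) + bc)  # candidate state (* is the element-wise product)
    a_t = gamma_u * c_tilde + (1.0 - gamma_u) * a_prev          # blend old state and candidate
    return a_t

# Toy sizes (assumed): input 3, hidden 4
rng = np.random.default_rng(1)
n_x, n_a = 3, 4
params = {k: (rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)), np.zeros(n_a)) for k in "ruc"}
a = np.zeros(n_a)
for x in rng.normal(size=(5, n_x)):
    a = gru_step(x, a, params)
```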
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机与符号表示
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
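
A small sketch of the mapping ew = E ow described above: multiplying the embedding matrix by a 1-hot vector just selects one column, which in practice is implemented as a table lookup. The sizes are assumed toy values.

```python
import numpy as np

vocab_size, embed_dim = 6, 3                    # assumed toy sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(embed_dim, vocab_size))    # embedding matrix E

w = 4                                           # index of word w in the vocabulary
o_w = np.zeros(vocab_size)
o_w[w] = 1.0                                    # 1-hot representation ow
e_w = E @ o_w                                   # embedding ew = E ow
assert np.allclose(e_w, E[:, w])                # equivalent to reading column w of E directly
```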
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
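
A sketch of the skip-gram probability above, P(t|c) = exp(θt·ec) / Σj exp(θj·ec), with made-up toy parameters; a real model learns θ and e jointly, and the full-vocabulary sum in the denominator is what makes it expensive.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 8, 4                          # assumed toy sizes
theta = rng.normal(size=(vocab_size, dim))      # one θ vector per target word
e = rng.normal(size=(vocab_size, dim))          # one embedding per context word

def p_target_given_context(t, c):
    """Softmax over the whole vocabulary, which motivates negative sampling / CBOW."""
    scores = theta @ e[c]                       # θj·ec for every word j
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[t]

print(p_target_given_context(t=3, c=5))
```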
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
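
A sketch of the negative-sampling prediction: a single sigmoid σ(θt·ec) per (context, target) pair instead of a full softmax. The toy parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 8, 4
theta = rng.normal(size=(vocab_size, dim))      # target-word parameters (assumed toy values)
e = rng.normal(size=(vocab_size, dim))          # context-word embeddings

def p_positive(t, c):
    """Binary prediction that context c and target t co-occur."""
    return 1.0 / (1.0 + np.exp(-theta[t] @ e[c]))

print(p_positive(t=3, c=5))
```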
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
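
A minimal NumPy sketch of the cosine similarity above; the toy word vectors are made up for the example.

```python
import numpy as np

def cosine_similarity(e1, e2):
    """cos(θ) between two word vectors: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

# Assumed toy embeddings: two related words and one unrelated word
teddy = np.array([0.9, 0.1, 0.0])
bear = np.array([0.8, 0.2, 0.1])
matrix = np.array([0.0, 0.1, 0.9])
print(cosine_similarity(teddy, bear), cosine_similarity(teddy, matrix))
```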
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
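
In practice t-SNE is usually called from an existing library; a minimal sketch with scikit-learn's TSNE, assuming scikit-learn is installed and `embeddings` stands in for real word vectors.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 64))          # assumed: 50 word vectors of dimension 64

# Project to 2D for visualization; perplexity must be smaller than the number of samples
coords_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
print(coords_2d.shape)                          # (50, 2)
```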
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
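
A sketch of the perplexity defined above, computed as the exponential of the average negative log-probability over the T words; the per-word probabilities are made-up values for the example.

```python
import numpy as np

def perplexity(token_probs):
    """PP = exp(-(1/T) Σ log P(w_t)): the lower, the better the language model fits the data."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(token_probs))))

# Assumed probabilities assigned by some language model to each word of a 5-word sentence
print(perplexity([0.2, 0.1, 0.25, 0.05, 0.3]))   # ~6.7: on average the model hesitates among ~7 words
```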
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
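
A compact sketch of the three beam-search steps above over a toy vocabulary; `next_word_probs` is a hypothetical stand-in for the decoder's conditional distribution P(y⟨k⟩|x,y⟨1⟩,...,y⟨k-1⟩), and `<eos>` plays the role of the stop word.

```python
import numpy as np

VOCAB = ["<eos>", "a", "cute", "teddy", "bear"]

def next_word_probs(prefix):
    """Hypothetical decoder: returns a fixed pseudo-random distribution for each prefix."""
    rng = np.random.default_rng(abs(hash(tuple(prefix))) % (2**32))
    p = rng.random(len(VOCAB))
    return p / p.sum()

def beam_search(B=3, max_len=5):
    beams = [([], 0.0)]                                    # (partial sentence, log-probability)
    for _ in range(max_len):
        candidates = []
        for words, logp in beams:
            if words and words[-1] == "<eos>":             # finished hypotheses are carried over unchanged
                candidates.append((words, logp))
                continue
            probs = next_word_probs(words)                 # step 2: conditional probabilities
            for i, w in enumerate(VOCAB):
                candidates.append((words + [w], logp + np.log(probs[i])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]   # steps 1/3: keep top B
    return beams[0]

print(beam_search(B=3))
```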
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
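
A sketch of the standard BLEU formulation discussed above: clipped n-gram precisions pn, their geometric mean, and a brevity penalty for short candidates. The bigram limit and the example sentences are arbitrary choices for the demo.

```python
import numpy as np
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """p_n: clipped n-gram precision of the candidate against the reference."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=2):
    """Geometric mean of p_1..p_max_n, multiplied by a brevity penalty when the candidate is too short."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(candidate) > len(reference) else np.exp(1 - len(reference) / len(candidate))
    return float(bp * np.exp(np.mean(np.log(precisions))))

ref = "a cute teddy bear is reading".split()
cand = "a teddy bear is reading".split()
print(bleu(cand, ref))
```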
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
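
A sketch of the attention computation above: the weights α are a softmax of the scores e⟨t,t'⟩, and the context is the α-weighted sum of the activations a⟨t'⟩. All toy values are assumptions; the quadratic cost in Tx comes from needing such a weight vector at every output position.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(scores, activations):
    """α = softmax of the scores; context c<t> = Σ_t' α<t,t'> a<t'>."""
    alpha = softmax(scores)              # attention weights over the Tx input positions
    return alpha, alpha @ activations    # weighted sum of the encoder activations

# Assumed toy values: Tx = 4 activations of dimension 3, with scores favouring position 2
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3))
alpha, c = attention_context(np.array([0.1, 0.2, 2.0, -1.0]), a)
print(alpha.round(3), c)
```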
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006193653.md b/.history/zh/cs-230-recurrent-neural-networks_20191006193653.md new file mode 100644 index 000000000..29dceb85e --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006193653.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 符号表示, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe]
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相应的激活函数。
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了缓解梯度消失问题, 在某些类型的RNN中使用了特定的门, 这些门通常有明确的作用。它们通常记作Γ, 其值等于:
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机与符号表示
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息]
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006193823.md b/.history/zh/cs-230-recurrent-neural-networks_20191006193823.md new file mode 100644 index 000000000..f5a933770 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006193823.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 符号表示, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe]
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相应的激活函数。
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了缓解梯度消失问题, 在某些类型的RNN中使用了特定的门, 这些门通常有明确的作用。它们通常记作Γ, 其值等于:
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机与符号表示
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 嵌入矩阵E + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006193940.md b/.history/zh/cs-230-recurrent-neural-networks_20191006193940.md new file mode 100644 index 000000000..8b1ea8ca2 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006193940.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 符号表示, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe]
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相应的激活函数。
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了缓解梯度消失问题, 在某些类型的RNN中使用了特定的门, 这些门通常有明确的作用。它们通常记作Γ, 其值等于:
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机与符号表示
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E可表示为: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006194013.md b/.history/zh/cs-230-recurrent-neural-networks_20191006194013.md new file mode 100644 index 000000000..cd2e9e220 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006194013.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 符号表示, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe]
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相应的激活函数。
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
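A small illustrative sketch (not from the cheatsheet) of gradient clipping by capping the gradient's norm; the threshold of 5 is an arbitrary assumption:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so that its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # keep the direction, cap the magnitude
    return grad

g = np.array([30.0, -40.0])               # "exploding" gradient with norm 50
print(clip_gradient(g))                   # -> [ 3. -4.], norm capped at 5
```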
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
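To make the characterizing equations concrete, here is a minimal NumPy sketch of one GRU step (update gate Γu, relevance gate Γr, candidate state, and the element-wise ⋆ mixing); the weights and sizes are illustrative assumptions, and an LSTM adds forget and output gates on top of this scheme:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev, p):
    """One GRU step: gates Γ = σ(W x<t> + U c<t-1> + b), candidate c~, new state c<t>."""
    gamma_u = sigmoid(p["Wu"] @ x_t + p["Uu"] @ c_prev + p["bu"])       # update gate
    gamma_r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ c_prev + p["br"])       # relevance gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ (gamma_r * c_prev) + p["bc"])
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev                 # element-wise (⋆) mix

n_x, n_c = 3, 4                                                          # illustrative sizes
rng = np.random.default_rng(1)
p = {k: rng.normal(size=(n_c, n_x)) for k in ("Wu", "Wr", "Wc")}
p.update({k: rng.normal(size=(n_c, n_c)) for k in ("Uu", "Ur", "Uc")})
p.update({k: np.zeros(n_c) for k in ("bu", "br", "bc")})

c = gru_step(rng.normal(size=n_x), np.zeros(n_c), p)                     # c<1> from c<0> = 0
print(c.shape)                                                           # -> (4,)
```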
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
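A tiny sketch of the mapping above: multiplying E by the 1-hot vector ow simply selects the column of E associated with w (in practice this is implemented as a lookup); the sizes are illustrative assumptions:

```python
import numpy as np

V, n_e = 6, 3                        # illustrative vocabulary size and embedding dimension
rng = np.random.default_rng(2)
E = rng.normal(size=(n_e, V))        # embedding matrix E

w = 4                                # index of the word w in the vocabulary
o_w = np.zeros(V)                    # 1-hot representation o_w
o_w[w] = 1.0

e_w = E @ o_w                        # e_w = E o_w ...
print(np.allclose(e_w, E[:, w]))     # ... which is just the w-th column of E -> True
```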
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
+ + +**50. Word embeddings** + +⟶ + +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
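An illustrative NumPy sketch of the softmax probability above, P(t|c) ∝ exp(θt·ec), over a toy vocabulary; the vectors and sizes are random assumptions, and the sum over the whole vocabulary in the denominator is the expensive part noted in the next remark:

```python
import numpy as np

V, n_e = 8, 4                                   # toy vocabulary and embedding sizes
rng = np.random.default_rng(3)
theta = rng.normal(size=(V, n_e))               # one parameter vector θ_t per target word t
e_c = rng.normal(size=n_e)                      # embedding of the context word c

logits = theta @ e_c                            # θ_t · e_c for every candidate target t
p = np.exp(logits - logits.max())
p /= p.sum()                                    # P(t|c): softmax over the whole vocabulary
print(p.sum(), p.argmax())                      # the full-vocabulary sum is the costly part
```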
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
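By contrast, the negative-sampling prediction above only needs one sigmoid per (context, target) pair; a minimal illustrative sketch with assumed sizes and random vectors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n_e, k = 4, 3                                   # embedding size, k negative examples
theta_t = rng.normal(size=n_e)                  # target-word parameters θ_t
e_c = rng.normal(size=n_e)                      # context-word embedding e_c

p_positive = sigmoid(theta_t @ e_c)             # prediction for the true (c, t) pair
negatives = rng.normal(size=(k, n_e))           # parameters of k sampled negative targets
p_negatives = sigmoid(negatives @ e_c)          # pushed towards 0 during training
print(p_positive, p_negatives)
```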
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
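A compact sketch of the cost J above, using the weighting function f(x) = min(x/xmax, 1)^α from the GloVe paper so that Xi,j = 0 ⟹ f(Xi,j) = 0; the counts, vectors and bias terms here are illustrative assumptions:

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function; pairs with X_ij = 0 contribute nothing."""
    return np.minimum(x / x_max, 1.0) ** alpha

rng = np.random.default_rng(5)
V, n_e = 6, 3                                        # toy vocabulary and embedding sizes
X = rng.integers(0, 5, size=(V, V)).astype(float)    # co-occurrence counts X_ij
theta = rng.normal(size=(V, n_e))                    # target-word vectors θ_i
e = rng.normal(size=(V, n_e))                        # context-word vectors e_j
b, b_prime = np.zeros(V), np.zeros(V)                # bias terms

i, j = np.nonzero(X)                                 # skip pairs with X_ij = 0 (f maps them to 0)
err = (theta[i] * e[j]).sum(axis=1) + b[i] + b_prime[j] - np.log(X[i, j])
J = 0.5 * np.sum(f_weight(X[i, j]) * err ** 2)       # cost J
print(J)
# By the θ/e symmetry, the final embedding can be taken as e_w^(final) = (e_w + θ_w) / 2.
```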
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
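A small worked example of the cosine similarity formula above on made-up embedding vectors (the vectors are assumptions, not real word embeddings):

```python
import numpy as np

def cosine_similarity(e1, e2):
    """cos(θ) = (e1 · e2) / (||e1|| ||e2||), in [-1, 1]."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

e_teddy = np.array([0.9, 0.1, 0.8])             # made-up embeddings, not real word vectors
e_bear = np.array([0.8, 0.2, 0.9])
e_book = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(e_teddy, e_bear))       # close to 1: similar words
print(cosine_similarity(e_teddy, e_book))       # noticeably smaller
```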
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
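As a worked illustration (not the cheatsheet's exact notation), the perplexity can be computed as the geometric mean of the inverse probabilities the model assigns to the actual words, evaluated in log space for stability:

```python
import numpy as np

def perplexity(word_probs):
    """PP = (Π_t 1 / P(word_t))^(1/T), computed in log space for stability."""
    word_probs = np.asarray(word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(word_probs))))

# Probabilities the model assigned to the actual words of one sentence (illustrative values)
print(perplexity([0.2, 0.1, 0.25, 0.5]))   # lower is better; uniform guessing over |V| words gives PP = |V|
```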
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
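A simplified beam-search sketch tying together the steps and the beam width B above; the scoring function is a toy stand-in for the decoder network (an assumption for illustration), and setting B = 1 recovers the naive greedy search:

```python
import numpy as np

def beam_search(step_log_probs, B=3, max_len=4, eos=0):
    """Keep the B most likely partial sentences at each step; B=1 is greedy search."""
    beams = [((), 0.0)]                                  # (word sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:             # stop word reached: keep the beam as-is
                candidates.append((prefix, score))
                continue
            for w, lp in enumerate(step_log_probs(prefix)):
                candidates.append((prefix + (w,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams

# Toy scoring function: fixed log-probabilities over a 4-word vocabulary (word 0 = stop word)
table = np.log(np.array([0.1, 0.5, 0.3, 0.1]))
print(beam_search(lambda prefix: table, B=2))
```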
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
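A self-contained sketch of the idea behind the score above: clipped n-gram precisions pn combined by a geometric mean, together with the brevity penalty from the remark. This follows the standard BLEU recipe rather than the cheatsheet's exact formula, and the example sentences are made up:

```python
from collections import Counter
import numpy as np

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision p_n of the candidate against the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / max(sum(cand.values()), 1)

def bleu(candidate, reference, N=4):
    """Geometric mean of p_1..p_N, times a brevity penalty for short candidates."""
    p = [ngram_precision(candidate, reference, n) for n in range(1, N + 1)]
    if min(p) == 0:
        return 0.0
    brevity_penalty = min(1.0, np.exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * np.exp(np.mean(np.log(p)))

reference = "a cute teddy bear is reading persian literature".split()
candidate = "a cute teddy bear reads persian literature".split()
print(round(bleu(candidate, reference, N=2), 3))
```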
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006194143.md b/.history/zh/cs-230-recurrent-neural-networks_20191006194143.md new file mode 100644 index 000000000..81c541b43 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006194143.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +<br>
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006194303.md b/.history/zh/cs-230-recurrent-neural-networks_20191006194303.md new file mode 100644 index 000000000..f9d6315c6 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006194303.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006194503.md b/.history/zh/cs-230-recurrent-neural-networks_20191006194503.md new file mode 100644 index 000000000..eac30d914 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006194503.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006194626.md b/.history/zh/cs-230-recurrent-neural-networks_20191006194626.md new file mode 100644 index 000000000..c75b433a5 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006194626.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项 + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过 + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006194814.md b/.history/zh/cs-230-recurrent-neural-networks_20191006194814.md new file mode 100644 index 000000000..ccf704c1e --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006194814.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项 + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习框架。记 + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006194959.md b/.history/zh/cs-230-recurrent-neural-networks_20191006194959.md new file mode 100644 index 000000000..4cd24aed6 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006194959.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项 + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习框架。记与时间相关的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006195149.md b/.history/zh/cs-230-recurrent-neural-networks_20191006195149.md new file mode 100644 index 000000000..14a18aac1 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006195149.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项 + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习框架。记与时间t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注: + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
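A rough single-reference sketch of entries 81-83, assuming already-tokenized sentences: clipped n-gram precisions for n = 1..4 are combined by a geometric mean and multiplied by a brevity penalty when the candidate is shorter than the reference.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """candidate, reference: token lists (toy, single-reference version)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())  # clipped matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / max(len(candidate), 1))
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)
```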
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
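A compact sketch of entries 85-90 for one output step: raw alignment scores are turned into weights α via a softmax over the Tx input positions, and the context is the α-weighted sum of the activations; scoring every (output, input) pair is what makes the cost quadratic in Tx.

```python
import numpy as np

def attention_context(scores, activations):
    """scores: shape (Tx,), raw alignment scores for one output step;
    activations: shape (Tx, n_a), encoder activations a<t'>."""
    weights = np.exp(scores - np.max(scores))      # softmax, numerically stable
    weights = weights / weights.sum()              # attention weights, sum to 1
    context = weights @ activations                # weighted sum of activations
    return weights, context
```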
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006200831.md b/.history/zh/cs-230-recurrent-neural-networks_20191006200831.md new file mode 100644 index 000000000..f3ae70e29 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006200831.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
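As a concrete illustration of the recurrence above, a numpy sketch of one time step, assuming compatible parameter shapes and taking g1 = tanh and g2 = identity as placeholder activation functions.

```python
import numpy as np

# One RNN time step: a<t> = g1(Waa @ a_prev + Wax @ x_t + ba),
#                    y<t> = g2(Wya @ a<t> + by).
def rnn_step(x_t, a_prev, Waa, Wax, Wya, ba, by,
             g1=np.tanh, g2=lambda z: z):
    a_t = g1(Waa @ a_prev + Wax @ x_t + ba)
    y_t = g2(Wya @ a_t + by)
    return a_t, y_t
```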
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
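A small sketch of one common way to cap the gradient as described above (clipping by norm); the threshold C is a hyperparameter chosen by the practitioner.

```python
import numpy as np

# If the gradient norm exceeds the cap C, rescale it so its norm equals C.
def clip_gradient(grad: np.ndarray, C: float = 5.0) -> np.ndarray:
    norm = np.linalg.norm(grad)
    return grad if norm <= C else grad * (C / norm)
```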
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
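To make the gate equations concrete, a numpy sketch of a single GRU step in the cheatsheet's notation (Γu the update gate, Γr the relevance gate); the weight matrices and biases are assumed to have compatible shapes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, a_prev, Wu, Uu, bu, Wr, Ur, br, Wc, Uc, bc):
    gamma_u = sigmoid(Wu @ x_t + Uu @ a_prev + bu)              # update gate
    gamma_r = sigmoid(Wr @ x_t + Ur @ a_prev + br)              # relevance gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ (gamma_r * a_prev) + bc)  # candidate state
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * a_prev          # element-wise (⋆) mix
    return c_t                                                  # a<t> = c<t> in a GRU
```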
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
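A tiny sketch of the lookup, under the common convention that E stores one column per vocabulary word, so multiplying E by the one-hot vector o_w simply selects that column.

```python
import numpy as np

vocab_size, embed_dim = 10000, 300
E = np.random.rand(embed_dim, vocab_size)      # hypothetical embedding matrix

w_index = 42                                   # index of word w in the vocabulary
o_w = np.zeros(vocab_size)
o_w[w_index] = 1.0                             # 1-hot representation o_w

e_w = E @ o_w                                  # embedding e_w, same as E[:, w_index]
assert np.allclose(e_w, E[:, w_index])
```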
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
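A sketch of the softmax above, assuming theta is stored as a |V|×d matrix of target-word parameters and e_c is the embedding of the context word; subtracting the maximum score is only for numerical stability.

```python
import numpy as np

def p_target_given_context(theta: np.ndarray, e_c: np.ndarray, t: int) -> float:
    """P(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)."""
    scores = theta @ e_c
    scores = scores - scores.max()              # does not change the ratio
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[t])
```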
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合, 旨在评估给定的上下文词与给定的目标词同时出现的可能性, 这些模型在k个负样本和1个正样本的集合上进行训练。给定上下文词c和目标词t, 预测可表示为: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
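A sketch of the negative sampling prediction in entry 56, under the usual convention (the formula itself is shown as an image in the cheatsheet) that it is a logistic regression on the dot product θt·ec.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# P(y = 1 | c, t) = sigmoid(theta_t . e_c): probability that target word t and
# context word c genuinely co-occur, rather than t being one of the k sampled negatives.
def negative_sampling_prediction(theta_t: np.ndarray, e_c: np.ndarray) -> float:
    return float(sigmoid(theta_t @ e_c))
```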
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006201004.md b/.history/zh/cs-230-recurrent-neural-networks_20191006201004.md new file mode 100644 index 000000000..bb988eb0c --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006201004.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合, 旨在评估给定的上下文词与给定的目标词同时出现的可能性, 这些模型在k个负样本和1个正样本的集合上进行训练。给定上下文词c和目标词t, 预测可表示为: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006201522.md b/.history/zh/cs-230-recurrent-neural-networks_20191006201522.md new file mode 100644 index 000000000..31e3f62dc --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006201522.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合, 旨在评估给定的上下文词与给定的目标词同时出现的可能性, 这些模型在k个负样本和1个正样本的集合上进行训练。给定上下文词c和目标词t, 预测可表示为: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006201550.md b/.history/zh/cs-230-recurrent-neural-networks_20191006201550.md new file mode 100644 index 000000000..76ef81943 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006201550.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合, 旨在评估给定的上下文词与给定的目标词同时出现的可能性, 这些模型在k个负样本和1个正样本的集合上进行训练。给定上下文词c和目标词t, 预测可表示为: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006201647.md b/.history/zh/cs-230-recurrent-neural-networks_20191006201647.md new file mode 100644 index 000000000..42d2896ad --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006201647.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
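下面给出上述递推公式的一个最小前向计算示意(假设使用NumPy,g1 取 tanh、g2 取恒等映射;维度、时间步数与随机数据均为示意):

```python
import numpy as np

n_x, n_a, n_y, T = 3, 5, 2, 4          # 输入、隐藏、输出维度与时间步数(示意值)
Wax = np.random.randn(n_a, n_x)
Waa = np.random.randn(n_a, n_a)
Wya = np.random.randn(n_y, n_a)
ba, by = np.zeros(n_a), np.zeros(n_y)  # 在各时间步之间共享的系数

x = np.random.randn(T, n_x)            # 输入序列(示意数据)
a = np.zeros(n_a)                      # 初始激活值 a<0>

for t in range(T):
    a = np.tanh(Waa @ a + Wax @ x[t] + ba)   # a<t> = g1(Waa·a<t-1> + Wax·x<t> + ba)
    y = Wya @ a + by                         # y<t> = g2(Wya·a<t> + by),此处 g2 取恒等
    print(t, y)
```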
其中Wax,Waa,Wya,ba,by是在不同时间步之间共享的系数;g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
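下面给出按范数进行梯度裁剪的一个最小示意(假设使用NumPy;阈值 max_norm 为示意值,思路与常见深度学习框架中的按范数裁剪一致):

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """若梯度的范数超过 max_norm,则按比例缩放,使其范数恰为 max_norm。"""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.random.randn(100) * 50          # 一个范数很大的梯度(示意数据)
g_clipped = clip_by_norm(g, max_norm=5.0)
print(np.linalg.norm(g), np.linalg.norm(g_clipped))   # 裁剪后范数不超过 5.0
```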
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
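作为补充,下面给出一个简化的GRU单元前向计算示意(假设使用NumPy;只演示更新门Γu、关联门Γr与候选状态的计算,权重为随机示意数据,省略训练过程):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_x, n_a = 3, 4                            # 输入与隐藏状态维度(示意值)
rng = np.random.default_rng(0)
Wu, Uu, bu = rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)), np.zeros(n_a)
Wr, Ur, br = rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)), np.zeros(n_a)
Wc, Uc, bc = rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)), np.zeros(n_a)

def gru_step(x_t, a_prev):
    gamma_u = sigmoid(Wu @ x_t + Uu @ a_prev + bu)               # 更新门 Γu
    gamma_r = sigmoid(Wr @ x_t + Ur @ a_prev + br)               # 关联门 Γr
    c_tilde = np.tanh(Wc @ x_t + Uc @ (gamma_r * a_prev) + bc)   # 候选状态 c~<t>
    return gamma_u * c_tilde + (1.0 - gamma_u) * a_prev          # c<t> = Γu ⋆ c~ + (1-Γu) ⋆ c<t-1>

a = np.zeros(n_a)
for x_t in rng.normal(size=(5, n_x)):      # 5 个时间步的示意输入
    a = gru_step(x_t, a)
print(a)
```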
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM可视为GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定的上下文词与给定的目标词同时出现的可能性;模型在由k个反例和1个正例组成的集合上进行训练。给定上下文词c和目标词t,其预测可表示为: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006201845.md b/.history/zh/cs-230-recurrent-neural-networks_20191006201845.md new file mode 100644 index 000000000..97b2e2f96 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006201845.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在不同时间步之间共享的系数;g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
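下面用一个多对多任务的小例子示意该定义:总损失等于各时间步损失之和(假设使用NumPy,以逐时间步的交叉熵为单步损失,预测与标签均为随机示意数据):

```python
import numpy as np

T, C = 4, 3                                  # 时间步数与类别数(示意值)
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, C))
y_hat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # 每个时间步的预测分布
y = np.array([0, 2, 1, 0])                   # 每个时间步的真实标签(示意)

loss_per_step = -np.log(y_hat[np.arange(T), y])   # 单步损失 L(y_hat<t>, y<t>)
total_loss = loss_per_step.sum()                  # L = Σt L(y_hat<t>, y<t>)
print(loss_per_step, total_loss)
```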
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM可视为GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定的上下文词与给定的目标词同时出现的可能性;模型在由k个反例和1个正例组成的集合上进行训练。给定上下文词c和目标词t,其预测可表示为: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006201927.md b/.history/zh/cs-230-recurrent-neural-networks_20191006201927.md new file mode 100644 index 000000000..8e2e5f2c8 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006201927.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在不同时间步之间共享的系数;g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
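下面用一个小实验示意这种乘法效应:反向传播时梯度被同一矩阵(的转置)反复相乘,其范数会随时间步数近似按指数衰减或增长(假设使用NumPy;为突出现象,这里用正交矩阵乘以缩放因子来构造权重,并忽略激活函数导数项):

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, T = 8, 50                                 # 隐藏维度与时间步数(示意值)
grad = rng.normal(size=n_a)                    # 末端传回的梯度(示意)

for scale in (0.5, 1.5):                       # 权重“偏小”/“偏大”两种情形
    W = scale * np.linalg.qr(rng.normal(size=(n_a, n_a)))[0]   # 正交矩阵 × 缩放因子
    g = grad.copy()
    for _ in range(T):
        g = W.T @ g                            # 每回传一个时间步就乘一次 W 的转置
    print(scale, np.linalg.norm(g))            # scale<1 时梯度消失,scale>1 时梯度爆炸
```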
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM可视为GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定的上下文词与给定的目标词同时出现的可能性;模型在由k个反例和1个正例组成的集合上进行训练。给定上下文词c和目标词t,其预测可表示为: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称,是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示目标词i与上下文词j共同出现的次数。其代价函数J如下: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +<br>
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006202043.md b/.history/zh/cs-230-recurrent-neural-networks_20191006202043.md new file mode 100644 index 000000000..362c95734 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006202043.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在不同时间步之间共享的系数;g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM可视为GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
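
A small sketch of the mapping ew = E ow; the vocabulary size, embedding dimension and word index are made-up example values:

```python
import numpy as np

vocab_size, emb_dim = 10000, 300              # example sizes (assumption)
E = np.random.randn(emb_dim, vocab_size)      # embedding matrix, one column per word

w_index = 42                                  # hypothetical index of word w in V
o_w = np.zeros(vocab_size)
o_w[w_index] = 1.0                            # 1-hot representation o_w
e_w = E @ o_w                                 # e_w = E o_w, i.e. the column E[:, w_index]

assert np.allclose(e_w, E[:, w_index])
```
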
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
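
The probability P(t|c) is a softmax over the scores θt⋅ec, which can be sketched as follows (the matrix layout and function name are our assumptions):

```python
import numpy as np

def skipgram_probs(theta, e_c):
    """P(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c) for every t.

    theta: (|V|, n) matrix with one parameter vector per target word.
    e_c:   (n,) embedding of the context word c."""
    scores = theta @ e_c
    scores -= scores.max()          # numerical stability
    p = np.exp(scores)
    return p / p.sum()
```
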
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:


**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
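
In code, the negative-sampling prediction reduces to a sigmoid of a dot product; the names below are illustrative only:

```python
import numpy as np

def negative_sampling_prob(theta_t, e_c):
    """P(y=1 | c, t) = sigmoid(theta_t . e_c): probability that the
    (context c, target t) pair is a true co-occurrence rather than one
    of the k sampled negatives."""
    return 1.0 / (1.0 + np.exp(-theta_t @ e_c))
```
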
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
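
A direct, unoptimized sketch of the cost J; the weighting function uses the common (x/xmax)^α form with xmax=100 and α=3/4, a typical choice rather than something prescribed here:

```python
import numpy as np

def glove_cost(X, theta, e, b, b_prime, x_max=100.0, alpha=0.75):
    """J = 1/2 * sum_{i,j} f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2,
    with f(0) = 0 so that empty co-occurrence cells drop out."""
    J = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            if X[i, j] == 0:
                continue                                   # f(X_ij) = 0
            f = min((X[i, j] / x_max) ** alpha, 1.0)
            diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            J += 0.5 * f * diff ** 2
    return J
```
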
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为:


**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0.
Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:**

⟶

<br>
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006202356.md b/.history/zh/cs-230-recurrent-neural-networks_20191006202356.md new file mode 100644 index 000000000..fce6d03a1 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006202356.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
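
To make the recurrence concrete, here is a minimal NumPy sketch of one forward step; taking g1 = tanh and g2 = softmax is an assumption of this example, and the weight names follow the notation introduced below:

```python
import numpy as np

def rnn_step(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """a<t> = g1(Waa a<t-1> + Wax x<t> + ba),  y<t> = g2(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)    # g1 assumed to be tanh
    z = Wya @ a_t + by
    z -= z.max()                                    # numerical stability
    y_t = np.exp(z) / np.exp(z).sum()               # g2 assumed to be softmax
    return a_t, y_t
```
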
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ, 且等于:


**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**

⟶

<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?]


**35. [LSTM, GRU]**

⟶

<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:


**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**

⟶

<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]


**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V表示词汇表,用|V|表示其大小。


**43. Motivation and notations**

⟶

<br>
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:


**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。


**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.**

⟶

<br>
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006202456.md b/.history/zh/cs-230-recurrent-neural-networks_20191006202456.md new file mode 100644 index 000000000..9d146defa --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006202456.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ, 且等于:


**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**

⟶

<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?]


**35. [LSTM, GRU]**

⟶

<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:


**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**

⟶

<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]


**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V表示词汇表,用|V|表示其大小。


**43. Motivation and notations**

⟶

<br>
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:


**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。 + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
+ + +**60. Comparing words** + +⟶ + +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006202550.md b/.history/zh/cs-230-recurrent-neural-networks_20191006202550.md new file mode 100644 index 000000000..88367bc8d --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006202550.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ, 且等于:


**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**

⟶

<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示单元的多少信息?]


**35. [LSTM, GRU]**

⟶

<br>
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:


**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**

⟶

<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]


**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V表示词汇表,用|V|表示其大小。


**43. Motivation and notations**

⟶

<br>
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:


**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:学习到的词嵌入的各个分量不一定是可解释的。


**60. Comparing words**

⟶

<br>
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006202716.md b/.history/zh/cs-230-recurrent-neural-networks_20191006202716.md new file mode 100644 index 000000000..35858ac2e --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006202716.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且这些门通常有明确的用途。它们通常记作Γ, 其值等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇表,用|V|来表示词汇表的大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器, 旨在评估给定上下文词与给定目标词同时出现的可能性, 模型在由k个负样本和1个正样本组成的集合上进行训练。对于一个给定的上下文单词c和目标单词t, 其预测可由以下表达式表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
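As a quick numerical companion to items 61-62 above, the sketch below computes the cosine similarity directly from its definition; the 4-dimensional vectors are made-up embeddings used purely for illustration, not learned word vectors.

```python
import numpy as np

def cosine_similarity(w1, w2):
    """cos(theta) = (w1 . w2) / (||w1|| * ||w2||), a value in [-1, 1]."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

# Hypothetical embeddings: similar words should give a value close to 1.
e_teddy_bear = np.array([0.8, 0.1, 0.0, 0.3])
e_soft       = np.array([0.7, 0.2, 0.1, 0.4])
e_persian    = np.array([-0.2, 0.9, 0.4, 0.0])

print(cosine_similarity(e_teddy_bear, e_soft))     # high similarity, small angle theta
print(cosine_similarity(e_teddy_bear, e_persian))  # lower similarity, larger angle
```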
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
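To make item 63 concrete, here is a minimal sketch of visualizing word vectors in 2D with t-SNE, assuming scikit-learn is installed; the 50-dimensional `embeddings` are random placeholders standing in for learned word vectors, and in practice one would scatter-plot the resulting coordinates.

```python
import numpy as np
from sklearn.manifold import TSNE   # assumes scikit-learn is available

words = ["literature", "art", "book", "poem", "teddy bear", "soft", "cute", "hug"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(words), 50))   # placeholder 50-d word vectors

# Reduce to 2 dimensions for visualization; `perplexity` here is the t-SNE
# hyperparameter (it must be smaller than the number of points).
coords = TSNE(n_components=2, perplexity=5, init="random",
              random_state=0).fit_transform(embeddings)

for word, (x, y) in zip(words, coords):
    print(f"{word:12s} -> ({x:7.2f}, {y:7.2f})")
```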
+ + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
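Items 72-75 can be illustrated with a toy decoder. The sketch below is a simplified beam search over a constant next-word distribution (a made-up stand-in for P(y⟨t⟩|x, y⟨1⟩,…,y⟨t−1⟩)); with B=1 it reduces to the naive greedy search of the remark above.

```python
import math

# Toy stand-in for the decoder's conditional distribution; a real model would
# condition on the encoded input x and on the prefix generated so far.
def next_word_probs(prefix):
    return {"a": 0.4, "cute": 0.3, "bear": 0.2, "<eos>": 0.1}

def beam_search(B=3, max_len=4):
    beams = [((), 0.0)]                                   # (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            if prefix and prefix[-1] == "<eos>":          # finished hypothesis
                candidates.append((prefix, logp))
                continue
            for word, p in next_word_probs(prefix).items():
                candidates.append((prefix + (word,), logp + math.log(p)))
        # Keep only the B most likely (partial) sentences at every step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams

for sentence, logp in beam_search(B=3):
    print(" ".join(sentence), round(logp, 2))
```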
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
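A small sketch of the normalized log-likelihood objective of items 76-77 (not the cheatsheet's code): without the 1/Ty^α factor, longer candidates are systematically penalized because every additional log-probability term is negative; the per-token probabilities below are made up for illustration.

```python
import math

def normalized_score(token_probs, alpha=0.7):
    """(1 / Ty**alpha) * sum_t log P(y<t> | x, y<1>, ..., y<t-1>)."""
    Ty = len(token_probs)
    return sum(math.log(p) for p in token_probs) / (Ty ** alpha)

short_candidate = [0.50, 0.50]                       # 2 tokens
long_candidate  = [0.60, 0.60, 0.60, 0.60, 0.60]     # 5 tokens

for alpha in (0.0, 0.7, 1.0):                        # alpha is usually in [0.5, 1]
    print(alpha,
          round(normalized_score(short_candidate, alpha), 3),
          round(normalized_score(long_candidate, alpha), 3))
```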
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
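The error-analysis table of items 78-80 boils down to one comparison, sketched below with hypothetical log-probabilities: if the model scores the good reference translation y* above its own prediction ŷ, the search is at fault; otherwise the RNN itself is.

```python
def diagnose(logp_y_star, logp_y_hat):
    """Attribute a bad translation y_hat to beam search or to the RNN,
    given the model's log-probabilities of the reference y* and of y_hat."""
    if logp_y_star > logp_y_hat:
        return "beam search faulty -> increase beam width B"
    return "RNN faulty -> try another architecture, regularize, or get more data"

# Hypothetical scores from a trained model:
print(diagnose(logp_y_star=-10.2, logp_y_hat=-12.9))   # search missed a better sentence
print(diagnose(logp_y_star=-14.6, logp_y_hat=-12.9))   # the model itself prefers y_hat
```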
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
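Below is a deliberately simplified sentence-level sketch of the bleu score of items 81-83 (single reference, clipped n-gram counts, crude smoothing of empty counts); production code would rely on an existing implementation rather than this toy version.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions p_n times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(max(overlap, 1e-9) / max(sum(cand.values()), 1))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "a cute teddy bear is reading".split()
reference = "a cute teddy bear reads persian literature".split()
print(round(bleu(candidate, reference), 3))
```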
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006203151.md b/.history/zh/cs-230-recurrent-neural-networks_20191006203151.md new file mode 100644 index 000000000..07bbbf6e6 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006203151.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且这些门通常有明确的用途。它们通常记作Γ, 其值等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇表,用|V|来表示词汇表的大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器, 旨在评估给定上下文词与给定目标词同时出现的可能性, 模型在由k个负样本和1个正样本组成的集合上进行训练。对于一个给定的上下文单词c和目标单词t, 其预测可由以下表达式表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,] + + +**65. Language model** + +⟶ + +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006203425.md b/.history/zh/cs-230-recurrent-neural-networks_20191006203425.md new file mode 100644 index 000000000..4f7d429ed --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006203425.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且这些门通常有明确的用途。它们通常记作Γ, 其值等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇表,用|V|来表示词汇表的大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器, 旨在评估给定上下文词与给定目标词同时出现的可能性, 模型在由k个负样本和1个正样本组成的集合上进行训练。对于一个给定的上下文单词c和目标单词t, 其预测可由以下表达式表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006203522.md b/.history/zh/cs-230-recurrent-neural-networks_20191006203522.md new file mode 100644 index 000000000..de4147df4 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006203522.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相应的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且这些门通常有明确的用途。它们通常记作Γ, 其值等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇表,用|V|来表示词汇表的大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器, 旨在评估给定上下文词与给定目标词同时出现的可能性, 模型在由k个负样本和1个正样本组成的集合上进行训练。对于一个给定的上下文单词c和目标单词t, 其预测可由以下表达式表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型是一种朴素方法, 其通过统计某个表达式在训练数据中出现的次数, 来量化该表达式在语料库中出现的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
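As a companion to items 67-68, the sketch below fits a count-based bigram model on a tiny made-up corpus and evaluates the perplexity of a test sentence; the corpus, the bigram order, and the absence of smoothing are all simplifications chosen purely for illustration.

```python
import math
from collections import Counter

corpus = "a cute teddy bear is reading a book about a teddy bear".split()

# n-gram model (here n=2): estimate P(w_t | w_{t-1}) by counting occurrences
# in the training data.
bigram_counts  = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def p_bigram(prev, word):
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# Perplexity of a test sentence (lower is better):
# PP = ( prod_t 1 / P(w_t | w_{t-1}) ) ** (1 / T)
test = "a teddy bear is reading".split()
log_prob = sum(math.log(p_bigram(prev, word)) for prev, word in zip(test, test[1:]))
T = len(test) - 1
print("perplexity:", round(math.exp(-log_prob / T), 3))
```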
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
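A simplified sketch of the bleu score of items 81-83, using uniform weights over 1- to 4-gram precisions and the usual exponential brevity penalty; this is an illustration under those assumptions, not a reference implementation:

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU = BP * exp(mean of log clipped n-gram precisions p_n)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: penalize candidates shorter than the reference (item 83)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "a cute teddy bear is reading".split()
reference = "a cute teddy bear is reading persian literature".split()
print(bleu(candidate, reference))   # ≈ 0.72: perfect n-gram precision, but penalized for brevity
```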
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
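A small numpy sketch of items 85-89: the attention weights α are a softmax over scores, and the context is the α-weighted sum of the activations. The dot-product scoring used here is an assumption for illustration only:

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    return np.exp(x) / np.sum(np.exp(x))

def attention_context(a, s_prev):
    """a: (Tx, n) encoder activations; s_prev: (n,) current decoder state."""
    scores = a @ s_prev          # e<t,t'>: one score per input position
    alphas = softmax(scores)     # attention weights, non-negative and summing to 1
    context = alphas @ a         # c<t> = sum_t' alpha<t,t'> * a<t'>
    return alphas, context

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 8))      # Tx = 5 input positions, n = 8 hidden units
alphas, c = attention_context(a, rng.normal(size=8))
print(alphas.round(3), c.shape)  # weights over the Tx positions, context of size n
```

Since every output step attends over all Tx positions, the cost grows quadratically in Tx, as item 90 notes.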
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006203834.md b/.history/zh/cs-230-recurrent-neural-networks_20191006203834.md new file mode 100644 index 000000000..aff0b03ed --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006203834.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
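A minimal numpy sketch of the per-timestep equations of items 12-14, with tanh and softmax standing in for g1 and g2 and purely illustrative shapes:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    return np.exp(z) / np.sum(np.exp(z))

def rnn_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """a<t> = g1(Waa a<t-1> + Wax x<t> + ba);  y<t> = g2(Wya a<t> + by)"""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_t = softmax(Wya @ a_t + by)
    return a_t, y_t

# Toy dimensions: 3-dimensional input, 5 hidden units, 4 output classes
rng = np.random.default_rng(0)
Wax, Waa, Wya = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(4, 5))
ba, by = np.zeros(5), np.zeros(4)
a_t, y_t = rnn_step(rng.normal(size=3), np.zeros(5), Wax, Waa, Wya, ba, by)
print(a_t.shape, y_t.sum())   # (5,) and output probabilities summing to 1
```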
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
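One common way to realize the clipping described in item 28 is to rescale the gradient whenever its norm exceeds a threshold; a hedged sketch with an arbitrary threshold value:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so that ||grad|| never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])   # exploding gradient, norm = 50
print(clip_gradient(g))       # [ 3. -4.], clipped back to norm 5
```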
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常记为Γ, 且等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
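For concreteness, a hedged numpy sketch of one common formulation of the GRU equations summarized in items 36-37 (Γu is the update gate, Γr the relevance gate; all weight shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, c_prev, Wc, Wu, Wr, bc, bu, br):
    """c<t> = Γu ⋆ c̃<t> + (1 - Γu) ⋆ c<t-1>, with a<t> = c<t>."""
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)                                   # update gate
    gamma_r = sigmoid(Wr @ concat + br)                                   # relevance gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)  # candidate cell
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

# Toy shapes: 3-dimensional input, 4 hidden units (so each W has shape (4, 7))
rng = np.random.default_rng(0)
Wc, Wu, Wr = (rng.normal(size=(4, 7)) for _ in range(3))
c_t = gru_step(rng.normal(size=3), np.zeros(4), Wc, Wu, Wr,
               np.zeros(4), np.zeros(4), np.zeros(4))
print(c_t.shape)   # (4,)
```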
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
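The mapping of item 48 is a matrix-vector product: multiplying E by the 1-hot vector o_w selects one column of E. A tiny sketch with made-up sizes:

```python
import numpy as np

vocab_size, embed_dim = 6, 3                   # toy sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(embed_dim, vocab_size))   # embedding matrix E

w_index = 2                                    # position of word w in the vocabulary
o_w = np.zeros(vocab_size)
o_w[w_index] = 1.0                             # 1-hot representation o_w

e_w = E @ o_w                                  # e_w = E o_w
assert np.allclose(e_w, E[:, w_index])         # in practice: a direct column lookup
```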
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
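To tie items 54-56 together, a hedged sketch of the two prediction rules: the skip-gram softmax P(t|c), which sums over the whole vocabulary, and the cheaper negative-sampling sigmoid for a single (context, target) pair. All parameters below are toy values:

```python
import numpy as np

def skip_gram_prob(theta, e_c, t):
    """P(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c) over the vocabulary."""
    scores = theta @ e_c                     # one score per vocabulary word
    scores = scores - np.max(scores)         # numerical stability
    probs = np.exp(scores) / np.sum(np.exp(scores))
    return probs[t]

def negative_sampling_prob(theta_t, e_c):
    """P(y=1 | c, t) = sigmoid(theta_t . e_c): no sum over the vocabulary."""
    return 1.0 / (1.0 + np.exp(-theta_t @ e_c))

rng = np.random.default_rng(0)
theta = rng.normal(size=(10, 4))             # |V| = 10 target-word vectors
e_c = rng.normal(size=4)                     # context embedding
print(skip_gram_prob(theta, e_c, t=3), negative_sampling_prob(theta[3], e_c))
```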
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度 - 语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量T进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
+ + +**70. Machine translation** + +⟶ + +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006204128.md b/.history/zh/cs-230-recurrent-neural-networks_20191006204128.md new file mode 100644 index 000000000..e68207da7 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006204128.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常记为Γ, 且等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006204246.md b/.history/zh/cs-230-recurrent-neural-networks_20191006204246.md new file mode 100644 index 000000000..41d54048c --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006204246.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常记为Γ, 且等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006204406.md b/.history/zh/cs-230-recurrent-neural-networks_20191006204406.md new file mode 100644 index 000000000..f146fc5a3 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006204406.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常记为Γ, 且等于: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
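A small NumPy sketch of the lookup e_w = E o_w described in item 48; the toy vocabulary and embedding size are made up for illustration:

```python
import numpy as np

vocab = ["teddy", "bear", "book", "soft"]          # toy vocabulary, |V| = 4
V, d = len(vocab), 3                               # d: assumed embedding dimension
E = np.random.default_rng(0).normal(size=(d, V))   # embedding matrix, one column per word

w = vocab.index("book")
o_w = np.zeros(V)
o_w[w] = 1.0                                       # 1-hot representation o_w
e_w = E @ o_w                                      # e_w = E o_w ...
assert np.allclose(e_w, E[:, w])                   # ... which is just column w of E
```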
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
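A hedged sketch of the skip-gram softmax P(t|c) = exp(θt⊤ec) / ∑j exp(θj⊤ec); the random θ and e arrays stand in for learned parameters, and the full-vocabulary sum in the denominator is exactly the costly part noted in the remark that follows:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                      # assumed vocabulary size and embedding dimension
theta = rng.normal(size=(V, d))      # one parameter vector θ_t per target word t
e = rng.normal(size=(V, d))          # one embedding e_c per context word c

def p_target_given_context(t, c):
    scores = theta @ e[c]                          # θ_j · e_c for every word j in V
    scores -= scores.max()                         # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the whole vocabulary
    return probs[t]
```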
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在评估给定上下文词和给定目标词同时出现的可能性,模型在由k个反例和1个正例组成的集合上训练。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
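A sketch of the corresponding negative-sampling objective for one (context, target) pair with k sampled negatives, using the σ(θt⊤ec) prediction above; the parameter arrays are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(theta, e, c, t, negatives):
    pos = np.log(sigmoid(theta[t] @ e[c]))                           # the 1 positive example
    neg = sum(np.log(sigmoid(-theta[j] @ e[c])) for j in negatives)  # the k negative examples
    return -(pos + neg)       # only k+1 logistic terms instead of a |V|-wide softmax
```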
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
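A sketch of a GloVe-style cost computed from the co-occurrence matrix X; the 1/2 factor, the bias terms and the weighting f(x) = min(1, (x/x_max)^α) follow the original GloVe paper and are assumptions here, not a transcription of the cheatsheet's formula:

```python
import numpy as np

def glove_cost(theta, e, b, b_tilde, X, x_max=100.0, alpha=0.75):
    J = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:
                continue                           # f(0) = 0: empty co-occurrences contribute nothing
            f = min(1.0, (X[i, j] / x_max) ** alpha)
            err = theta[i] @ e[j] + b[i] + b_tilde[j] - np.log(X[i, j])
            J += 0.5 * f * err ** 2
    return J

# Using the e/θ symmetry, the final embedding of word w is (e[w] + theta[w]) / 2.
```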
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
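The cosine-similarity formula above translates directly into code; a minimal sketch:

```python
import numpy as np

def cosine_similarity(w1, w2):
    """cos(θ) = (w1 · w2) / (‖w1‖ ‖w2‖), in [-1, 1]."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ≈ 0.707
```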
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
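A small sketch of perplexity computed from per-word probabilities assigned by a language model; the probability values below are made up for illustration:

```python
import numpy as np

def perplexity(word_probs):
    # PP = (∏_t 1/p_t)^(1/T), computed in log space for numerical stability
    return float(np.exp(-np.mean(np.log(word_probs))))

print(perplexity([0.2, 0.1, 0.25, 0.05]))   # lower is better
```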
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最相似的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保持最相似的B个组合x,y<1>,...,y,在停止词汇处结束进程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +<br>
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006204643.md b/.history/zh/cs-230-recurrent-neural-networks_20191006204643.md new file mode 100644 index 000000000..1d9637e26 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006204643.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
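A toy NumPy sketch of one RNN timestep using the notation above; choosing tanh for g1 and softmax for g2 is an assumption made only for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def rnn_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # a<t> = g1(Waa a<t-1> + Wax x<t> + ba)
    y_t = softmax(Wya @ a_t + by)                  # y<t> = g2(Wya a<t> + by)
    return a_t, y_t
```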
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决反向传播过程中有时出现的梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +<br>
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在评估给定上下文词和给定目标词同时出现的可能性,模型在由k个反例和1个正例组成的集合上训练。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最相似的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保持最相似的B个组合x,y<1>,...,y,在停止词汇处结束进程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +<br>
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006204821.md b/.history/zh/cs-230-recurrent-neural-networks_20191006204821.md new file mode 100644 index 000000000..d2e235b82 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006204821.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决反向传播过程中有时出现的梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +<br>
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在评估给定上下文词和给定目标词同时出现的可能性,模型在由k个反例和1个正例组成的集合上训练。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最相似的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保持最相似的B个组合x,y<1>,...,y,在停止词汇处结束进程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +<br>
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006204952.md b/.history/zh/cs-230-recurrent-neural-networks_20191006204952.md new file mode 100644 index 000000000..ac05afa0a --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006204952.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决反向传播过程中有时出现的梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +<br>
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决梯度消失问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +<br>
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +<br>
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在评估给定上下文词和给定目标词同时出现的可能性,模型在由k个反例和1个正例组成的集合上训练。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +<br>
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最相似的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保持最相似的B个组合x,y<1>,...,y,在停止词汇处结束进程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
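A toy sketch of the three steps above; `next_word_log_probs` is a hypothetical stand-in for the decoder network's conditional distribution, and setting `beam_width=1` reduces this to the greedy search mentioned in the next remark:

```python
import heapq

def beam_search(next_word_log_probs, start, beam_width=3, max_len=20, stop="<eos>"):
    beams = [(0.0, [start])]                          # (log-probability, partial sentence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == stop:                       # end process at a stop word
                candidates.append((score, seq))
                continue
            for word, logp in next_word_log_probs(seq):           # step 2: score next words
                candidates.append((score + logp, seq + [word]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])  # step 3: keep top B
        if all(seq[-1] == stop for _, seq in beams):
            break
    return beams
```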
注:如果波束搜索的宽度设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006205346.md b/.history/zh/cs-230-recurrent-neural-networks_20191006205346.md new file mode 100644 index 000000000..8334c7ab6 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006205346.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词和给定目标词同时出现的可能性;这些模型在由k个负样本和1个正样本组成的集合上进行训练。对于给定的上下文单词c和目标单词t,其预测可由以下表达式表示:
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 在停止词处结束该过程]
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下:
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006205440.md b/.history/zh/cs-230-recurrent-neural-networks_20191006205440.md new file mode 100644 index 000000000..158c9ae7c --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006205440.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词和给定目标词同时出现的可能性;这些模型在由k个负样本和1个正样本组成的集合上进行训练。对于给定的上下文单词c和目标单词t,其预测可由以下表达式表示:
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 在停止词处结束该过程]
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下:
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006205605.md b/.history/zh/cs-230-recurrent-neural-networks_20191006205605.md new file mode 100644 index 000000000..1142ed859 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006205605.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词和给定目标词同时出现的可能性;这些模型在由k个负样本和1个正样本组成的集合上进行训练。对于给定的上下文单词c和目标单词t,其预测可由以下表达式表示:
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 在停止词处结束该过程]
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
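To make the search procedure concrete, here is a hedged Python sketch that keeps the B best-scoring partial sentences at every step; `step_probs` is an assumed callback standing in for the model's conditional distribution, and the toy distribution at the end only exercises the bookkeeping.

```python
import math

def beam_search(step_probs, beam_width=3, max_len=10, stop_token="<eos>"):
    """Keep the beam_width partial sentences with the highest summed log-probability."""
    beams = [([], 0.0)]                                   # (tokens so far, sum of log-probabilities)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == stop_token:       # finished hypotheses are carried over unchanged
                candidates.append((tokens, score))
                continue
            for tok, prob in step_probs(tokens).items():  # expand with every candidate next word
                candidates.append((tokens + [tok], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

toy = lambda prefix: {"a": 0.5, "b": 0.3, "<eos>": 0.2}   # placeholder distribution, same at every step
print(beam_search(toy, beam_width=2, max_len=4)[0])
```

Setting beam_width=1 reduces this to the greedy search mentioned in the remark above.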
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可视为一个软化因子,其值通常在0.5到1之间。
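A small sketch of the normalized log-likelihood score, the sum of the token log-probabilities divided by T to the power alpha; the alpha value and the toy probabilities below are illustrative only.

```python
import math

def normalized_log_likelihood(token_probs, alpha=0.7):
    """Score = (1 / T**alpha) * sum(log p_t); alpha between 0.5 and 1 softens the length penalty."""
    return sum(math.log(p) for p in token_probs) / (len(token_probs) ** alpha)

short, longer = [0.5, 0.5], [0.5, 0.5, 0.9, 0.9]
# raw sums are about -1.39 vs -1.60, so the longer hypothesis loses; after normalization it scores higher
print(normalized_log_likelihood(short), normalized_log_likelihood(longer))
```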
+ + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006205747.md b/.history/zh/cs-230-recurrent-neural-networks_20191006205747.md new file mode 100644 index 000000000..1e0d57d84 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006205747.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。
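To make the recurrence concrete, here is a minimal NumPy sketch of one time step, a<t>=g1(Waa a<t-1>+Wax x<t>+ba) and y<t>=g2(Wya a<t>+by). The toy dimensions and the choice of tanh and softmax for g1 and g2 are assumptions for illustration, not something fixed by the cheatsheet.

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One RNN time step: returns the new hidden state a_t and the output y_t."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)             # a<t> = g1(Waa a<t-1> + Wax x<t> + ba), with g1 = tanh (assumed)
    z = Wya @ a_t + by                                       # pre-activation of the output
    y_t = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))  # g2 = softmax (assumed)
    return a_t, y_t

# toy sizes: input 3, hidden 4, output 2
rng = np.random.default_rng(0)
Wax, Waa, Wya = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
ba, by = np.zeros((4, 1)), np.zeros((2, 1))
a1, y1 = rnn_cell_forward(rng.normal(size=(3, 1)), np.zeros((4, 1)), Wax, Waa, Wya, ba, by)
```

Unrolling this cell over the input sequence, feeding each a_t back in as a_prev, gives the full forward pass over the sequence.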
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
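As a rough sketch of capping the gradient, the snippet below rescales a list of gradient arrays whenever their global L2 norm exceeds a threshold; the threshold value and the list-of-arrays layout are assumptions for illustration.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale the gradient arrays so that their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# example: a combined norm of roughly 100 is scaled down to 5, leaving the direction unchanged
clipped = clip_gradients([np.array([3.0, 4.0]), np.array([100.0])])
```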
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
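For concreteness, here is a hedged sketch of a single GRU step built from an update gate and a relevance (reset) gate. The split into separate W and U matrices and the parameter names are assumptions of this sketch rather than the cheatsheet's exact notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, a_prev, p):
    """One GRU step; p is an assumed dict holding the gate weights and biases."""
    gamma_u = sigmoid(p["Wu"] @ x_t + p["Uu"] @ a_prev + p["bu"])   # update gate: how much past should matter now
    gamma_r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ a_prev + p["br"])   # relevance gate: drop previous information?
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ (gamma_r * a_prev) + p["bc"])
    return gamma_u * c_tilde + (1.0 - gamma_u) * a_prev             # convex combination of candidate and previous state

# toy parameters: input size 3, hidden size 4
rng = np.random.default_rng(1)
p = {k: rng.normal(size=(4, 3)) for k in ("Wu", "Wr", "Wc")}
p.update({k: rng.normal(size=(4, 4)) for k in ("Uu", "Ur", "Uc")})
p.update({k: np.zeros((4, 1)) for k in ("bu", "br", "bc")})
a1 = gru_cell(rng.normal(size=(3, 1)), np.zeros((4, 1)), p)
```

An LSTM additionally keeps a separate cell state and uses forget and output gates, which is the sense in which it generalizes the GRU.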
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
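A small sketch of this lookup: with an assumed orientation of one row per vocabulary word, multiplying the 1-hot vector o_w by E is nothing more than selecting row w, which is how the lookup is implemented in practice.

```python
import numpy as np

vocab_size, embed_dim = 10, 4                 # toy sizes, assumed for illustration
rng = np.random.default_rng(2)
E = rng.normal(size=(vocab_size, embed_dim))  # embedding matrix: one row per word

w = 7                                         # index of the word in the vocabulary
o_w = np.zeros(vocab_size)
o_w[w] = 1.0                                  # 1-hot representation o_w
e_w = o_w @ E                                 # the matrix product with the 1-hot vector ...
assert np.allclose(e_w, E[w])                 # ... is just a row lookup
```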
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:
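As a hedged sketch of the softmax probability P(t|c) ∝ exp(θt⋅ec): the toy vocabulary and embedding size below are purely illustrative, and the explicit sum over the whole vocabulary in the denominator is exactly what the next remark flags as expensive.

```python
import numpy as np

def skipgram_prob(theta, e_c):
    """Softmax over all candidate target words; theta holds one parameter row per vocabulary word."""
    scores = theta @ e_c                            # theta_t . e_c for every target t
    scores -= scores.max()                          # shift for numerical stability
    return np.exp(scores) / np.sum(np.exp(scores))  # denominator sums over the whole vocabulary

rng = np.random.default_rng(3)
theta = rng.normal(size=(10, 4))                    # toy vocabulary of 10 words, embeddings of size 4
p = skipgram_prob(theta, rng.normal(size=4))        # p sums to 1 over the 10 candidate targets
```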
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
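Under the same notation, a sketch of the negative-sampling prediction for a single (context, target) pair, which replaces the full softmax above with a sigmoid:

```python
import numpy as np

def negative_sampling_prediction(theta_t, e_c):
    """P(y=1|c,t) = sigmoid(theta_t . e_c) for one (context, target) pair."""
    return 1.0 / (1.0 + np.exp(-(theta_t @ e_c)))
```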
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
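A sketch of the GloVe cost J and of the final embedding, assuming the weighting function f(x) = min(1, (x/xmax)^α) from the original GloVe paper; the xmax and α defaults below come from that paper, not from this cheatsheet.

```python
import numpy as np

def glove_cost(X, theta, e, b, b_tilde, x_max=100.0, alpha=0.75):
    """J = sum_{i,j} f(Xij) * (theta_i . e_j + b_i + b~_j - log Xij)**2, with f(0) = 0."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):      # Xij = 0 terms vanish since f(0) = 0
        f = min(1.0, (X[i, j] / x_max) ** alpha)
        J += f * (theta[i] @ e[j] + b[i] + b_tilde[j] - np.log(X[i, j])) ** 2
    return J

def final_embedding(e_w, theta_w):
    """By the symmetry of e and theta, e_w^(final) = (e_w + theta_w) / 2."""
    return (e_w + theta_w) / 2.0
```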
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
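A one-function numpy version of the cosine similarity defined above:

```python
import numpy as np

def cosine_similarity(e_w1, e_w2):
    """similarity = (w1 . w2) / (||w1|| * ||w2||) = cos(θ)."""
    return (e_w1 @ e_w2) / (np.linalg.norm(e_w1) * np.linalg.norm(e_w2))
```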
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
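A sketch of the perplexity computation, assuming we already have the model's probability for each of the T words in the dataset:

```python
import numpy as np

def perplexity(word_probs):
    """PP = (prod_t 1/p_t)**(1/T) = exp(-(1/T) * sum_t log p_t); lower is better."""
    word_probs = np.asarray(word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(word_probs))))
```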
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y,在遇到停止词时结束该过程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
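A sketch of the normalized log-likelihood objective used to score one beam-search candidate, where step_log_probs holds the log-probability of each generated word given x and the previously generated words; α = 0.7 is a hypothetical value inside the usual 0.5–1 range.

```python
def normalized_log_likelihood(step_log_probs, alpha=0.7):
    """Score = (1 / Ty**alpha) * sum over t of the per-step log-probabilities."""
    Ty = len(step_log_probs)
    return sum(step_log_probs) / (Ty ** alpha)
```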
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当所预测得到的翻译ˆy很差时,可以通过执行以下误差分析来探究为什么我们没有得到一个好的翻译y∗: + + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006205839.md b/.history/zh/cs-230-recurrent-neural-networks_20191006205839.md new file mode 100644 index 000000000..c92018260 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006205839.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
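A minimal numpy sketch of one forward timestep using these coefficients, assuming g1 = tanh and g2 = softmax (a common choice for illustration, not mandated by the cheatsheet):

```python
import numpy as np

def rnn_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """a_t = tanh(Waa @ a_prev + Wax @ x_t + ba);  y_t = softmax(Wya @ a_t + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    z = Wya @ a_t + by
    e = np.exp(z - z.max())               # softmax over the output scores
    y_t = e / e.sum()
    return a_t, y_t
```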
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
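A sketch of gradient clipping by L2 norm; the threshold value below is a hypothetical choice.

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so that its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```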
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y,在遇到停止词时结束该过程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当所预测得到的翻译ˆy很差时,可以通过执行以下误差分析来探究为什么我们没有得到一个好的翻译y∗: + + +**79. [Case, Root cause, Remedies]** + +⟶ + +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006205907.md b/.history/zh/cs-230-recurrent-neural-networks_20191006205907.md new file mode 100644 index 000000000..332b1bd1c --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006205907.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y,在遇到停止词时结束该过程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当所预测得到的翻译ˆy很差时,可以通过执行以下误差分析来探究为什么我们没有得到一个好的翻译y∗: + + +**79. [Case, Root cause, Remedies]** + +⟶ + +
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006210000.md b/.history/zh/cs-230-recurrent-neural-networks_20191006210000.md new file mode 100644 index 000000000..d06709510 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006210000.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba是相关的系数矩阵, 在时间尺度上被整个网络共享;g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y,在遇到停止词时结束该过程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当所预测得到的翻译ˆy很差时,可以通过执行以下误差分析来探究为什么我们没有得到一个好的翻译y∗: + + +**79. [Case, Root cause, Remedies]** + +⟶ + +
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006210107.md b/.history/zh/cs-230-recurrent-neural-networks_20191006210107.md new file mode 100644 index 000000000..03a653f61 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006210107.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项 + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作:


**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最相似的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保持最相似的B个组合x,y<1>,...,y,在停止词汇处结束进程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来探究为什么没有得到一个好的翻译y∗:


**79. [Case, Root cause, Remedies]**

⟶

<br>
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
+ + +**84. Attention** + +⟶ + +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006210336.md b/.history/zh/cs-230-recurrent-neural-networks_20191006210336.md new file mode 100644 index 000000000..1340609fb --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006210336.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项 + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作:


**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最相似的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保持最相似的B个组合x,y<1>,...,y,在停止词汇处结束进程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来探究为什么没有得到一个好的翻译y∗:


**79. [Case, Root cause, Remedies]**

⟶

<br>
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
其中pn是n-gram上的bleu分数,定义如下: + + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
注:对过短的预测翻译可施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。


**84. Attention**

⟶

<br>
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
+ + +**86. with** + +⟶ + +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006211205.md b/.history/zh/cs-230-recurrent-neural-networks_20191006211205.md new file mode 100644 index 000000000..1de3238f8 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006211205.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]


**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作:


**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文与给定目标词同时出现的可能性,模型在k个反例和1个正例的集合上进行训练。对于给定的上下文单词c和目标单词t,其预测可由以下表达式表示:


**57. Remark: this method is less computationally expensive than the skip-gram model.**

⟶

<br>
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]


**65. Language model**

⟶

<br>
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。


**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**

⟶

<br>
[第1步:寻找可能性最大的B个单词y<1>, 第2步:计算条件概率y<k>|x,y<1>,...,y<k−1>, 第3步:保留可能性最大的B个组合x,y<1>,...,y<k>, 在停止词处结束进程]


**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.**

⟶

<br>
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
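A one-function sketch of the normalized log-likelihood objective described in item 76, assuming NumPy and per-token log-probabilities already computed by the decoder; α=0.7 is only an example of a softener value in the usual 0.5-1 range, not a value prescribed by the cheatsheet.

```python
import numpy as np

def normalized_log_likelihood(log_probs, alpha=0.7):
    """Length-normalized beam-search objective:
    (1 / Ty**alpha) * sum_t log P(y<t> | x, y<1..t-1>).
    log_probs: log-probabilities of the Ty tokens in one candidate translation."""
    Ty = len(log_probs)
    return float(np.sum(log_probs) / (Ty ** alpha))
```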
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来探究为什么没有得到一个好的翻译y∗:


**79. [Case, Root cause, Remedies]**

⟶

<br>
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
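An illustrative single-reference sketch of the bleu computation in items 81-82: clipped n-gram precisions p_1..p_N combined by a geometric mean. The brevity penalty of item 83 is omitted, and this toy version is not a full BLEU implementation.

```python
from collections import Counter
import numpy as np

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision p_n of a candidate token list against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu(candidate, reference, N=4):
    """Geometric mean of p_1..p_N (brevity penalty not applied here)."""
    ps = [ngram_precision(candidate, reference, n) for n in range(1, N + 1)]
    if min(ps) == 0:
        return 0.0
    return float(np.exp(np.mean(np.log(ps))))
```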
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
其中pn是n-gram上的bleu分数,定义如下: + + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
注:对过短的预测翻译可施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。


**84. Attention**

⟶

<br>
注意力机制 + + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
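A small sketch of the attention computation in items 85 and 89: softmax attention weights α over the encoder activations and the resulting context vector c⟨t⟩. The raw score vector is assumed to be given by some alignment model; shapes are illustrative assumptions.

```python
import numpy as np

def attention_context(scores, activations):
    """alpha<t,t'> = softmax(scores) over the Tx encoder steps,
    c<t> = sum_t' alpha<t,t'> * a<t'>.
    scores: (Tx,) alignment scores, activations: (Tx, n) encoder activations."""
    e = scores - scores.max()                  # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()        # attention weights, sum to 1
    return alpha, alpha @ activations          # (weights, context vector c<t>)
```

As item 90 notes, computing these scores for every decoder step against every encoder step makes the cost quadratic in Tx.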
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应给予激活a的注意力大小,c为时刻t的上下文,则有:


**86. with**

⟶

<br>
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006211312.md b/.history/zh/cs-230-recurrent-neural-networks_20191006211312.md new file mode 100644 index 000000000..ba7e22ea7 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006211312.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度]


**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**

⟶

<br>
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
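The formulas referred to in item 12 are a⟨t⟩ = g1(Waa a⟨t−1⟩ + Wax x⟨t⟩ + ba) and y⟨t⟩ = g2(Wya a⟨t⟩ + by). Below is a minimal NumPy sketch of that forward pass; the tanh/softmax choices for g1/g2 and all shapes are illustrative assumptions, not part of the original cheatsheet.

```python
import numpy as np

def rnn_forward(x_seq, Waa, Wax, Wya, ba, by):
    """One forward pass of a plain RNN:
    a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y<t> = softmax(Wya a<t> + by)."""
    n_a = Waa.shape[0]
    a = np.zeros((n_a, 1))                 # initial hidden state a<0>
    outputs = []
    for x_t in x_seq:                      # x_seq: list of column vectors x<t>
        a = np.tanh(Waa @ a + Wax @ x_t + ba)
        z = Wya @ a + by
        y = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax output
        outputs.append(y)
    return a, outputs
```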
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。


**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
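A minimal sketch of the gradient-clipping idea in item 28, assuming the gradients are NumPy arrays. Rescaling by the global L2 norm is shown; element-wise capping with np.clip is another common variant, and the threshold 5.0 is an arbitrary example value.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm never exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```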
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]


**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
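A toy illustration of the lookup e_w = E o_w from item 48, using a made-up four-word vocabulary and a random embedding matrix (assumptions for demonstration only).

```python
import numpy as np

vocab = ["teddy", "bear", "book", "soft"]     # toy vocabulary (assumption)
V, n_e = len(vocab), 3                        # |V| words, embedding size 3
E = np.random.randn(n_e, V)                   # embedding matrix E (n_e x |V|)

w = vocab.index("book")
o_w = np.zeros((V, 1)); o_w[w] = 1.0          # 1-hot representation o_w
e_w = E @ o_w                                 # embedding e_w = E o_w
assert np.allclose(e_w.ravel(), E[:, w])      # identical to reading column w of E
```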
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
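The softmax probability P(t|c) in item 54 can be sketched as below; theta is a hypothetical matrix stacking one parameter vector θt per vocabulary word, and e_c is the context-word embedding. This is a toy sketch of the formula, not a trained model.

```python
import numpy as np

def skipgram_prob(theta, e_c):
    """P(t|c) for every target word t: softmax over the scores theta_t . e_c.
    theta: (|V|, n) target-word parameters, e_c: (n,) context embedding."""
    scores = theta @ e_c
    scores -= scores.max()          # numerical stability
    p = np.exp(scores)
    return p / p.sum()              # vector of probabilities over the vocabulary
```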
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作:


**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
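A minimal sketch of the negative-sampling objective in item 56, assuming one positive pair (c, t) and k sampled negative target vectors; the sigmoid form σ(θtᵀ e_c) follows the prediction formula stated above. Names and shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(theta_t, e_c, theta_negs):
    """Logistic loss for one positive (c, t) pair and k negative targets.
    theta_t, e_c: (n,) vectors; theta_negs: (k, n) matrix of sampled negatives."""
    pos = -np.log(sigmoid(theta_t @ e_c))                 # push P(y=1|c,t) up
    neg = -np.sum(np.log(sigmoid(-(theta_negs @ e_c))))   # push negatives down
    return pos + neg
```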
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
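
注:作为参考,GloVe代价函数的一种常见写法如下,其中θi、ej为待学习的词向量,bi、b′j为偏置项;常数因子与具体记号以原版讲义为准:

```latex
J(\theta) = \frac{1}{2} \sum_{i,j=1}^{|V|} f(X_{ij}) \big(\theta_i^{\top} e_j + b_i + b'_j - \log X_{ij}\big)^2
```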
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
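
注:余弦相似度即 similarity = (w1·w2)/(‖w1‖‖w2‖) = cos(θ)。下面是一个基于NumPy的最小示例,仅作示意,其中的词向量为随意构造的假设数据:

```python
import numpy as np

def cosine_similarity(w1: np.ndarray, w2: np.ndarray) -> float:
    """计算两个词向量的余弦相似度:(w1·w2) / (‖w1‖·‖w2‖)。"""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

# 示意用的假设词向量(实际中应取自训练好的嵌入矩阵E)
e_teddy_bear = np.array([0.8, 0.1, 0.6])
e_soft = np.array([0.7, 0.2, 0.5])
print(cosine_similarity(e_teddy_bear, e_soft))  # 越接近1表示语义越相近
```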
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度 - 语言模型通常使用困惑度来评估,其也被称为PP,可以解释为由词数T归一化后的数据集逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
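
注:按上述定义,困惑度的一种等价常见写法是数据集概率的倒数再开T次方(T为词数):

```latex
PP = \Big(\frac{1}{P(y^{(1)}, \ldots, y^{(T)})}\Big)^{\frac{1}{T}}
```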
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:找出可能性最大的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留可能性最大的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
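
注:下面是束搜索的一个极简Python草图,仅作示意;其中的score接口、词表和序列长度均为假设。当束宽B=1时即退化为朴素的贪婪搜索。

```python
def beam_search(score, vocab, max_len, B=3, stop_token="<eos>"):
    """极简束搜索示意。score(prefix, word) 返回 log P(word | x, prefix),此处接口为假设。"""
    beams = [([], 0.0)]  # 每个束为 (已生成的词序列, 累积对数概率)
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            if prefix and prefix[-1] == stop_token:  # 已到停止词的序列不再扩展
                candidates.append((prefix, logp))
                continue
            for w in vocab:  # 逐词扩展候选序列
                candidates.append((prefix + [w], logp + score(prefix, w)))
        # 只保留累积对数概率最高的B个组合;B=1 时即为贪婪搜索
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams[0][0]

# 示意用法:词表与打分函数均为随意构造的假设数据(真实场景中由RNN给出条件对数概率)
vocab = ["a", "cute", "teddy", "bear", "<eos>"]
print(beam_search(lambda prefix, w: -0.5 if w == "<eos>" else -1.0, vocab, max_len=4, B=2))
```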
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
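
注:该归一化对数似然目标的一种常见写法如下,其中Ty为输出句子的长度,α为下文提到的软化参数:

```latex
\text{Objective} = \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log p\big(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>}\big)
```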
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当预测得到的翻译ˆy很差时,可以通过执行以下误差分析来探究为什么我们没有得到一个好的翻译y∗: + + +**79. [Case, Root cause, Remedies]** + +⟶ + +
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
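
注:作为参考,bleu分数基于n-gram精度pn的一种常见写法是对p1,…,p4取几何平均(具体形式以原版讲义为准):

```latex
\text{bleu score} = \exp\Big(\frac{1}{4} \sum_{n=1}^{4} \log p_n\Big)
```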
其中pn是n-gram上的bleu分数,定义如下: + + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
注:对过短的预测翻译可施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。 + + +**84. Attention** + +⟶ + +
注意力机制 + + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应给予激活值a的注意力大小,c为时刻t的上下文,我们有: + + +**86. with** + +⟶ + +
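
注:作为参考,注意力机制的一种标准写法如下,其中α<t,t′>为输出y<t>对激活a<t′>的注意力权重,c<t>为时刻t的上下文向量,e<t,t′>为打分函数的输出;此处记号为常见约定:

```latex
c^{<t>} = \sum_{t'=1}^{T_x} \alpha^{<t,t'>}\, a^{<t'>}, \qquad
\alpha^{<t,t'>} = \frac{\exp\big(e^{<t,t'>}\big)}{\sum_{t''=1}^{T_x} \exp\big(e^{<t,t''>}\big)}
```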
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006211937.md b/.history/zh/cs-230-recurrent-neural-networks_20191006211937.md new file mode 100644 index 000000000..1c86421c9 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006211937.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项 + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度 - 语言模型通常使用困惑度来评估,其也被称为PP,可以解释为由词数T归一化后的数据集逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:找出可能性最大的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留可能性最大的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当预测得到的翻译ˆy很差时,可以通过执行以下误差分析来探究为什么我们没有得到一个好的翻译y∗: + + +**79. [Case, Root cause, Remedies]** + +⟶ + +
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
其中pn是n-gram上的bleu分数,定义如下: + + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
注:对过短的预测翻译可施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。 + + +**84. Attention** + +⟶ + +
注意力机制 + + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应给予激活值a的注意力大小,c为时刻t的上下文,我们有: + + +**86. with** + +⟶ + +
和 + + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
注:注意力分数常用于图像字幕和机器翻译。 + + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
一只可爱的泰迪熊正在阅读波斯文学书。 + + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
注意力权重 - + + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006212242.md b/.history/zh/cs-230-recurrent-neural-networks_20191006212242.md new file mode 100644 index 000000000..77f196830 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006212242.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项 + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度 - 语言模型通常使用困惑度来评估,其也被称为PP,可以解释为由词数T归一化后的数据集逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:找出可能性最大的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留可能性最大的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当预测得到的翻译ˆy很差时,可以通过执行以下误差分析来探究为什么我们没有得到一个好的翻译y∗: + + +**79. [Case, Root cause, Remedies]** + +⟶ + +
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
其中pn是n-gram上的bleu分数,定义如下: + + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
注:对过短的预测翻译可施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。 + + +**84. Attention** + +⟶ + +
注意力机制 + + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应给予激活值a的注意力大小,c为时刻t的上下文,我们有: + + +**86. with** + +⟶ + +
和 + + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
注:注意力分数常用于图像字幕和机器翻译。 + + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
一只可爱的泰迪熊正在阅读波斯文学书。 + + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: + + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
注: + + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
+ +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006212346.md b/.history/zh/cs-230-recurrent-neural-networks_20191006212346.md new file mode 100644 index 000000000..c9acee31a --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006212346.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数, g1,g2是相关的激活函数。 + + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ + +
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项 + + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ + +
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作: + + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ + +
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度 - 语言模型通常使用困惑度来评估,其也被称为PP,可以解释为由词数T归一化后的数据集逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:找出可能性最大的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留可能性最大的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当预测得到的翻译ˆy很差时,可以通过执行以下误差分析来探究为什么我们没有得到一个好的翻译y∗: + + +**79. [Case, Root cause, Remedies]** + +⟶ + +
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
其中pn是n-gram上的bleu分数,定义如下: + + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
注:对过短的预测翻译可施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。 + + +**84. Attention** + +⟶ + +
注意力机制 + + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应给予激活值a的注意力大小,c为时刻t的上下文,我们有: + + +**86. with** + +⟶ + +
和 + + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
注:注意力分数常用于图像字幕和机器翻译。 + + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
一只可爱的泰迪熊正在阅读波斯文学书。 + + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: + + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
注:计算复杂度是Tx的平方。 + + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
深度学习简明指南已 + +**92. Original authors** + +⟶ + +
+ +**93. Translated by X, Y and Z** + +⟶ + +
+ +**94. Reviewed by X, Y and Z** + +⟶ + +
+ +**95. View PDF version on GitHub** + +⟶ + +
+ +**96. By X and Y** + +⟶ + +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006212623.md b/.history/zh/cs-230-recurrent-neural-networks_20191006212623.md new file mode 100644 index 000000000..28c0034c7 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006212623.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
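As a concrete reading of the per-timestep equations above, here is a minimal NumPy sketch of one forward step. The choices of tanh for g1, softmax for g2 and the toy dimensions n_x, n_a, n_y are assumptions for the illustration, not part of the cheatsheet:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / e.sum()

def rnn_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One timestep: a<t> = g1(Waa a<t-1> + Wax x<t> + ba), y<t> = g2(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # hidden state, shape (n_a, 1)
    y_t = softmax(Wya @ a_t + by)                  # output distribution, shape (n_y, 1)
    return a_t, y_t

# toy dimensions, assumed for the example
n_x, n_a, n_y = 3, 5, 4
rng = np.random.default_rng(0)
Wax = rng.normal(size=(n_a, n_x))
Waa = rng.normal(size=(n_a, n_a))
Wya = rng.normal(size=(n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))
a1, y1 = rnn_step(rng.normal(size=(n_x, 1)), np.zeros((n_a, 1)), Wax, Waa, Wya, ba, by)
```

The same weight matrices are reused at every timestep, which is exactly the temporal sharing described above.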
其中Wax,Waa,Wya,ba,by是在时间上被整个网络共享的系数, g1,g2是相应的激活函数。

**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
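A minimal sketch of the clipping idea, assuming the gradients are stored in a dict of NumPy arrays and taking 5.0 as an illustrative cap (element-wise clipping is shown; rescaling by the global norm is another common variant):

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    """Element-wise clipping: cap every gradient entry to [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in grads.items()}

grads = {"dWaa": np.array([[12.0, -0.3], [0.7, -40.0]]),
         "dba": np.array([[2.0], [-9.0]])}
clipped = clip_gradients(grads)   # largest magnitudes are now +/- 5.0
```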
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]

**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
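The characterizing equations of the table are given as a figure; as an illustration only, the sketch below follows the standard GRU step (update gate Γu, relevance gate Γr), with ⋆ taken as the element-wise product of the remark above. Parameter names and sizes are assumptions of the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, a_prev, params):
    """One GRU step: gates Γu, Γr, candidate ã, new state a<t> (⋆ = element-wise product)."""
    concat = np.vstack([a_prev, x_t])
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])     # update gate
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])     # relevance gate
    concat_r = np.vstack([gamma_r * a_prev, x_t])
    a_tilde = np.tanh(params["Wa"] @ concat_r + params["ba"])   # candidate state
    return gamma_u * a_tilde + (1.0 - gamma_u) * a_prev         # a<t>

n_x, n_a = 3, 4
rng = np.random.default_rng(1)
params = {k: rng.normal(size=(n_a, n_a + n_x)) for k in ("Wu", "Wr", "Wa")}
params.update({b: np.zeros((n_a, 1)) for b in ("bu", "br", "ba")})
a_t = gru_step(rng.normal(size=(n_x, 1)), np.zeros((n_a, 1)), params)
```

An LSTM adds a separate cell state and forget/output gates on top of this pattern.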
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
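A small NumPy illustration of the mapping ew = E ow, with a toy vocabulary and an assumed embedding size n_e; it just shows that multiplying E by a 1-hot vector selects one column of the matrix:

```python
import numpy as np

vocab = ["teddy", "bear", "book", "soft"]     # toy vocabulary, |V| = 4
n_e, V = 3, len(vocab)
rng = np.random.default_rng(2)
E = rng.normal(size=(n_e, V))                 # embedding matrix: maps 1-hot -> embedding

w = vocab.index("book")
o_w = np.zeros((V, 1)); o_w[w] = 1.0          # 1-hot representation o_w
e_w = E @ o_w                                 # e_w = E o_w
assert np.allclose(e_w[:, 0], E[:, w])        # the product is just a column lookup
```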
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
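A minimal sketch of the softmax probability P(t|c) above, with toy sizes standing in for |V| and the embedding dimension; the sum over the whole vocabulary in the denominator is exactly what the following remark flags as expensive:

```python
import numpy as np

def p_target_given_context(theta, e_c, t):
    """Softmax of θj · e_c over all words j, evaluated at the target index t."""
    scores = theta @ e_c                        # θj · e_c for every word j, shape (|V|,)
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[t])

V, n_e = 1000, 50                               # toy vocabulary and embedding sizes
rng = np.random.default_rng(3)
theta = rng.normal(size=(V, n_e))               # one parameter vector θj per word
e_c = rng.normal(size=n_e)                      # embedding of the context word c
p = p_target_given_context(theta, e_c, t=42)    # P(t|c) for target word index 42
```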
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:

**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
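A short logistic sketch of the negative-sampling prediction σ(θt·ec); the randomly initialized toy vectors stand in for learned parameters, and in training this prediction is compared against 1 positive and k negative labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_cooccurrence(theta_t, e_c):
    """Probability that target t and context c occur together: σ(θt · e_c)."""
    return float(sigmoid(theta_t @ e_c))

rng = np.random.default_rng(4)
theta_t, e_c = rng.normal(size=50), rng.normal(size=50)
p_positive = predict_cooccurrence(theta_t, e_c)
```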
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
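An illustrative implementation of the cost J together with one commonly used choice of weighting function f (the cap x_max, the exponent and the 1/2 factor are assumptions of this sketch); the last line uses the e/θ symmetry for the final embedding:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """A common choice of weighting function f, with f(0) = 0 as required."""
    return 0.0 if x == 0 else min((x / x_max) ** alpha, 1.0)

def glove_cost(X, theta, e, b, b_prime):
    """J = 1/2 * sum_ij f(Xij) * (θi·ej + bi + b'j - log Xij)^2 (terms with Xij = 0 drop out)."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):
        err = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
        J += 0.5 * glove_weight(X[i, j]) * err ** 2
    return J

V, n_e = 5, 3
rng = np.random.default_rng(5)
X = rng.integers(0, 4, size=(V, V)).astype(float)   # toy co-occurrence counts
theta, e = rng.normal(size=(V, n_e)), rng.normal(size=(V, n_e))
b, b_prime = np.zeros(V), np.zeros(V)
J = glove_cost(X, theta, e, b, b_prime)
e_final = (e + theta) / 2.0                          # final word embeddings
```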
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
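A direct NumPy translation of the cosine similarity formula above, shown on two toy word vectors:

```python
import numpy as np

def cosine_similarity(w1, w2):
    """cos(θ) = (w1 · w2) / (||w1||2 ||w2||2), in [-1, 1]."""
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

a = np.array([0.2, 1.0, -0.5])
b = np.array([0.1, 0.9, -0.4])
print(cosine_similarity(a, b))   # close to 1: the two vectors point in a similar direction
```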
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
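For the 2D visualization use case, a minimal sketch using scikit-learn's TSNE (assumed to be available); the word vectors here are random stand-ins for real embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(6)
embeddings = rng.normal(size=(50, 100))   # 50 word vectors of dimension 100 (toy data)

# project the high-dimensional embeddings down to 2D for plotting
coords_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
print(coords_2d.shape)                    # (50, 2)
```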
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
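A minimal sketch of the perplexity definition, assuming the model's per-token probabilities over a T-word dataset are already available:

```python
import numpy as np

def perplexity(token_probs):
    """PP = (prod_t 1/P(y_t))^(1/T), computed in log space; lower is better."""
    T = len(token_probs)
    return float(np.exp(-np.sum(np.log(token_probs)) / T))

print(perplexity([0.2, 0.5, 0.1, 0.4]))   # smaller probabilities (a weaker model) give a higher PP
```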
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
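A compact sketch of the three steps listed above, assuming a hypothetical `next_log_probs(prefix)` scorer that returns log P(next word | x, prefix) over the vocabulary; with B=1 it degenerates to greedy search, as the following remark notes:

```python
import numpy as np

def beam_search(next_log_probs, B=3, max_len=20, stop_token=0):
    """Keep the B highest-scoring partial sentences at each step."""
    beams = [((), 0.0)]                                    # (word indices so far, summed log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == stop_token:        # already ended at the stop word
                candidates.append((prefix, score))
                continue
            logp = next_log_probs(prefix)                  # log P(next word | x, prefix)
            for w in np.argsort(logp)[-B:]:                # top-B continuations of this beam
                candidates.append((prefix + (int(w),), score + float(logp[w])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
        if all(p and p[-1] == stop_token for p, _ in beams):
            break
    return beams                                           # best hypotheses, highest score first

rng = np.random.default_rng(0)
toy_scorer = lambda prefix: np.log(rng.dirichlet(np.ones(10)))   # hypothetical 10-word vocabulary
print(beam_search(toy_scorer, B=2, max_len=5)[0])
```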
[第1步:寻找最相似的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保持最相似的B个组合x,y<1>,...,y,在停止词汇处结束进程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
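A small sketch of the normalized log-likelihood objective, assuming the per-step log-probabilities of a candidate translation are given and taking α=0.7 as an illustrative value:

```python
import numpy as np

def normalized_log_likelihood(step_log_probs, alpha=0.7):
    """(1 / Ty**alpha) * sum_t log p(y<t> | x, y<1..t-1>); alpha softens the length penalty."""
    Ty = len(step_log_probs)
    return float(np.sum(step_log_probs) / (Ty ** alpha))

short = [-1.0, -1.2]                       # 2-token candidate
long_ = [-1.0, -1.2, -0.9, -1.1, -1.0]     # 5-token candidate
print(normalized_log_likelihood(short), normalized_log_likelihood(long_))
```

Without the 1/Ty^α factor, beam search would systematically prefer shorter sentences, since every extra token adds a negative log-probability.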
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当预测得到的翻译ˆy较差时,可以通过执行以下误差分析来探究为什么没有得到好的翻译y∗:

**79. [Case, Root cause, Remedies]**

⟶

<br>
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
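An illustrative computation of the clipped n-gram precision pn and of a geometric-mean combination of p1..pN (the brevity penalty mentioned in the following remark is omitted here); the tokenization and the choice of N are assumptions of the example:

```python
from collections import Counter
import numpy as np

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def p_n(candidate, reference, n):
    """Clipped n-gram precision: matched candidate n-grams / total candidate n-grams."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(len(ngrams(candidate, n)), 1)

def bleu(candidate, reference, N=4):
    """Geometric mean of p_1..p_N (no brevity penalty in this sketch)."""
    precisions = [p_n(candidate, reference, n) for n in range(1, N + 1)]
    if min(precisions) == 0:
        return 0.0
    return float(np.exp(np.mean(np.log(precisions))))

ref = "a cute teddy bear is reading persian literature".split()
cand = "a cute teddy bear reads persian literature".split()
print(round(bleu(cand, ref, N=2), 3))
```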
其中pn是n-gram上的bleu分数,定义如下: + + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。

**84. Attention**

⟶

<br>
注意力机制 + + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实际中提高所得模型的性能。记α为输出y应给予激活量a的注意力大小、c为时间t的上下文,我们有:

**86. with**

⟶

<br>
和 + + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
注:注意力分数常用于图像字幕和机器翻译。 + + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
一只可爱的泰迪熊正在阅读波斯文学书。 + + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
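A minimal sketch of the attention weights and context vector, assuming the scores e<t,t'> for one output position are given: the weights are a softmax over the Tx input positions and the context is their weighted sum. Computing them for every (t, t') pair is what makes the cost quadratic in Tx, as remarked further below:

```python
import numpy as np

def attention_context(scores, activations):
    """α<t,t'> = softmax over t' of e<t,t'>; c<t> = Σ_t' α<t,t'> a<t'>."""
    e = scores - scores.max()                 # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()       # attention weights, sum to 1 over the Tx inputs
    context = activations.T @ alpha           # weighted sum of the input activations
    return alpha, context

Tx, n_a = 6, 8                                # toy sizes: input length and activation size
rng = np.random.default_rng(8)
scores = rng.normal(size=Tx)                  # e<t,t'> for one output position t
activations = rng.normal(size=(Tx, n_a))      # a<1> ... a<Tx>
alpha, c_t = attention_context(scores, activations)
```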
注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: + + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
注:计算复杂度是Tx的平方。 + + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
现已提供[目标语言]版本的深度学习简明指南。 + +**92. Original authors** + +⟶ + +
原作者 + +**93. Translated by X, Y and Z** + +⟶ + +
由X、Y和Z翻译

**94. Reviewed by X, Y and Z**

⟶

<br>
由X、Y和Z审阅

**95. View PDF version on GitHub**

⟶

<br>
在Github上查看PDF版本 + +**96. By X and Y** + +⟶ + +
由X和Y diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006212635.md b/.history/zh/cs-230-recurrent-neural-networks_20191006212635.md new file mode 100644 index 000000000..28c0034c7 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006212635.md @@ -0,0 +1,677 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
循环神经网络中文翻译 + +**1. Recurrent Neural Networks cheatsheet** + +⟶ + +
循环神经网络简明指南 + + +**2. CS 230 - Deep Learning** + +⟶ + +
CS 230 - 深度学习 + + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ + +
[概述, 网络结构, RNN的应用, 损失函数, 反向传播] + + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ + +
[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] + + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ + +
[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] + + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ + +
[词比较, 余弦相似度, t-SNE] + + +**7. [Language model, n-gram, Perplexity]** + +⟶ + +
[语言模型, n-gram, 困惑度] + + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ + +
[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] + + +**9. [Attention, Attention model, Attention weights]** + +⟶ + +
[注意力机制, 注意力模型, 注意力权重] + + +**10. Overview** + +⟶ + +
概述 + + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ + +
传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: + + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ + +
对于每一个时间步t,激活值a和输出y可表示如下: + + +**13. and** + +⟶ + +
并且 + + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ + +
其中Wax,Waa,Wya,ba,by是在时间上被整个网络共享的系数, g1,g2是相应的激活函数。

**15. The pros and cons of a typical RNN architecture are summed up in the table below:**

⟶

<br>
一个典型的RNN体系结构的优点和缺点可概括如下表: + + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ + +
[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] + + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ + +
[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] + + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ + +
RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: + + +**19. [Type of RNN, Illustration, Example]** + +⟶ + +
[RNN的类型, 图形表示, 示例] + + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ + +
[一对一, 一对多, 多对一, 多对多] + + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ + +
[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] + + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ + +
损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: + + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ + +
随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: + + +**24. Handling long term dependencies** + +⟶ + +
解决长时间依赖问题 + + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ + +
常用的激活函数 - 在RNN模型中常用的激活函数如下所示: + + +**26. [Sigmoid, Tanh, RELU]** + +⟶ + +
[Sigmoid, Tanh, RELU] + + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ + +
梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 + + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ + +
梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 + + +**29. clipped** + +⟶ + +
裁剪 + + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ + +
门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: + + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ + +
其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: + + +**32. [Type of gate, Role, Used in]** + +⟶ + +
[门类型, 角色, 被用于] + + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ + +
[更新门, 关联门, 遗忘门, 输出门] + + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ + +
[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] + + +**35. [LSTM, GRU]** + +⟶ + +
[LSTM, GRU] + + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]

**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:

**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最相似的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保持最相似的B个组合x,y<1>,...,y,在停止词汇处结束进程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当预测得到的翻译ˆy较差时,可以通过执行以下误差分析来探究为什么没有得到好的翻译y∗:

**79. [Case, Root cause, Remedies]**

⟶

<br>
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
其中pn是n-gram上的bleu分数,定义如下: + + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。

**84. Attention**

⟶

<br>
注意力机制 + + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实际中提高所得模型的性能。记α为输出y应给予激活量a的注意力大小、c为时间t的上下文,我们有:

**86. with**

⟶

<br>
和 + + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
注:注意力分数常用于图像字幕和机器翻译。 + + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
一只可爱的泰迪熊正在阅读波斯文学书。 + + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: + + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
注:计算复杂度是Tx的平方。 + + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
现已提供[目标语言]版本的深度学习简明指南。 + +**92. Original authors** + +⟶ + +
原作者 + +**93. Translated by X, Y and Z** + +⟶ + +
由X、Y和Z翻译

**94. Reviewed by X, Y and Z**

⟶

<br>
由X、Y和Z审阅

**95. View PDF version on GitHub**

⟶

<br>
在Github上查看PDF版本 + +**96. By X and Y** + +⟶ + +
由X和Y diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006214850.md b/.history/zh/cs-230-recurrent-neural-networks_20191006214850.md new file mode 100644 index 000000000..78142cec0 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006214850.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
其中Wax,Waa,Wya,ba,by是在时间上被整个网络共享的系数, g1,g2是相应的激活函数。
<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[LSTM, GRU] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ + +
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: + + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ + +
[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项]

**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**

⟶

<br>
注:符号⋆表示两个向量之间的元素相乘。 + + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ + +
RNN模型的变种 - 下表列出了其他常用的RNN结构: + + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ + +
[双向RNN(Bidirectional RNN, BRNN), 深度RNN(Deep RNN, DRNN)] + + +**41. Learning word representation** + +⟶ + +
词表示学习 + + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ + +
在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 + + +**43. Motivation and notations** + +⟶ + +
动机和注解 + + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ + +
表示技术 - 两种主要的词表示方法的总结如下表所示: + + +**45. [1-hot representation, Word embedding]** + +⟶ + +
[独热表示(one-hot), 词嵌入(word embedding)] + + +**46. [teddy bear, book, soft]** + +⟶ + +
[泰迪熊, 书, 柔软的] + + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ + +
[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] + + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ + +
嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: + + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ + +
注:使用目标/上下文似然模型可以学习嵌入矩阵。 + + +**50. Word embeddings** + +⟶ + +
词嵌入 + + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ + +
Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 + + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ + +
[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] + + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ + +
[通过代理任务训练网络, 提取高级表示, 计算词嵌入] + + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ + +
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 概率P(t|c)可写作:

**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**

⟶

<br>
注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 + + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ + +
负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: + + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ + +
注:该模型相比skip-gram模型而言,其计算代价更小。 + + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ + +
GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: + + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ + +
其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: + + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ + +
注:所学单词的嵌入表示的各个部分不一定是可解释的。 + + +**60. Comparing words** + +⟶ + +
词比较 + + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ + +
余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: + + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ + +
注:θ是词w1和w2之间的夹角。 + + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ + +
t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 + + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ + +
[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] + + +**65. Language model** + +⟶ + +
语言模型 + + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ + +
概述 - 语言模型的目标在于估计句子的概率P(y) + + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ + +
n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 + + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ + +
困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: + + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ + +
注:PP常用于t-SNE模型中。 + + +**70. Machine translation** + +⟶ + +
机器翻译 + + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ + +
概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: + + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ + +
波束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 + + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ + +
[第1步:寻找最相似的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保持最相似的B个组合x,y<1>,...,y,在停止词汇处结束进程] + + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ + +
注:如果束宽设置为1,则其与朴素贪婪搜索等价。 + + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ + +
束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 + + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ + +
长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: + + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ + +
注:参数α可看做软化器,其值在0.5 ~ 1之间。 + + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ + +
误差分析 - 当预测得到的翻译ˆy较差时,可以通过执行以下误差分析来探究为什么没有得到好的翻译y∗:

**79. [Case, Root cause, Remedies]**

⟶

<br>
[具体情况、根本原因、补救措施] + + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ + +
[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] + + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ + +
bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: + + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ + +
其中pn是n-gram上的bleu分数,定义如下: + + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ + +
注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。

**84. Attention**

⟶

<br>
注意力机制 + + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ + +
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实际中提高所得模型的性能。记α为输出y应给予激活量a的注意力大小、c为时间t的上下文,我们有:

**86. with**

⟶

<br>
和 + + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ + +
注:注意力分数常用于图像字幕和机器翻译。 + + +**88. A cute teddy bear is reading Persian literature.** + +⟶ + +
一只可爱的泰迪熊正在阅读波斯文学书。 + + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ + +
注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: + + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ + +
注:计算复杂度是Tx的平方。 + + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ + +
现已提供[目标语言]版本的深度学习简明指南。 + +**92. Original authors** + +⟶ + +
原作者 + +**93. Translated by X, Y and Z** + +⟶ + +
由X、Y和Z翻译

**94. Reviewed by X, Y and Z**

⟶

<br>
由X、Y和Z审阅

**95. View PDF version on GitHub**

⟶

<br>
在Github上查看PDF版本 + +**96. By X and Y** + +⟶ + +
由X和Y diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006215618.md b/.history/zh/cs-230-recurrent-neural-networks_20191006215618.md new file mode 100644 index 000000000..487aa6e8e --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006215618.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
其中Wax,Waa,Wya,ba,by是在时间上被整个网络共享的系数, g1,g2是相应的激活函数。
<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度神经网络(Deep RNN, DRNN)] +
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: +
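A small sketch of the mapping above (illustrative sizes only): multiplying E by the 1-hot vector ow just selects one column of E, which is how embedding lookups are implemented in practice.

```python
import numpy as np

V, n_emb = 10000, 300                   # |V| and embedding size (illustrative)
E = np.random.randn(n_emb, V) * 0.01    # embedding matrix

w_index = 1234                          # index of word w in the vocabulary
o_w = np.zeros(V)
o_w[w_index] = 1.0                      # 1-hot representation ow
e_w = E @ o_w                           # ew = E ow
assert np.allclose(e_w, E[:, w_index])  # equivalent to (much cheaper) column indexing
```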
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 则概率P(t|c)可写作: +
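The softmax below is a minimal NumPy sketch of the probability P(t|c) just defined, with a toy vocabulary; the array shapes are illustrative.

```python
import numpy as np

def skipgram_probs(theta, e_c):
    """P(t|c) for every candidate target t: softmax over the scores theta_t . e_c."""
    scores = theta @ e_c          # one score per vocabulary word
    scores -= scores.max()        # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(5, 3))   # one parameter vector theta_t per word (|V| = 5, dim 3)
e_c = rng.normal(size=3)          # embedding of the context word c
p = skipgram_probs(theta, e_c)
print(p.sum())                    # 1.0: a distribution over the vocabulary
```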
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词与给定目标词同时出现的可能性,这些模型在由k个负样本和1个正样本组成的集合上训练。给定上下文词c和目标词t,其预测可表示为: +
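A minimal sketch of the prediction just described: each (context, target) pair gets its own logistic score, and during training one positive pair is contrasted with k sampled negative pairs (the shapes and k are illustrative).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cooccurrence_prob(theta_t, e_c):
    """P(y=1|c,t) = sigmoid(theta_t . e_c): probability that c and t appear together."""
    return sigmoid(theta_t @ e_c)

rng = np.random.default_rng(0)
e_c = rng.normal(size=3)                       # context embedding
positive = rng.normal(size=3)                  # parameters of the true target word
negatives = rng.normal(size=(5, 3))            # k = 5 sampled negative targets
print(cooccurrence_prob(positive, e_c))        # pushed towards 1 during training
print([cooccurrence_prob(n, e_c) for n in negatives])  # pushed towards 0
```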
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
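Below is a hedged NumPy sketch of the cost J and of the symmetric final embedding, following the standard GloVe objective with bias terms; the weighting function uses the commonly quoted choices xmax=100 and α=3/4, which are assumptions here rather than values stated in the cheatsheet.

```python
import numpy as np

def glove_cost(X, theta, e, b_theta, b_e, x_max=100.0, alpha=0.75):
    """J = 1/2 * sum_ij f(Xij) * (theta_i.e_j + b_i + b'_j - log Xij)^2, with f(0) = 0."""
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0) * (X > 0)
    log_X = np.log(np.where(X > 0, X, 1.0))          # log Xij, unused where Xij = 0
    err = theta @ e.T + b_theta[:, None] + b_e[None, :] - log_X
    return 0.5 * float(np.sum(f * err ** 2))

def final_embedding(theta, e):
    """e_w^(final) = (e_w + theta_w) / 2, exploiting the symmetry between e and theta."""
    return (e + theta) / 2
```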
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
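A one-line implementation of the similarity above (sketch with made-up vectors):

```python
import numpy as np

def cosine_similarity(e_w1, e_w2):
    """(e_w1 . e_w2) / (||e_w1|| * ||e_w2||) = cos(theta) between the two word vectors."""
    return float(e_w1 @ e_w2 / (np.linalg.norm(e_w1) * np.linalg.norm(e_w2)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707 (45 degrees)
```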
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
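In practice this is usually done with an off-the-shelf implementation; the snippet below is a sketch assuming scikit-learn is available and that `embeddings` is an (n_words, d) array of word vectors (random placeholders here).

```python
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(200, 50)              # placeholder word vectors
coords_2d = TSNE(n_components=2, perplexity=30.0,
                 init="pca", random_state=0).fit_transform(embeddings)
print(coords_2d.shape)                             # (200, 2): ready to scatter-plot
```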
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
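A toy sketch of the counting approach just described (the naive relative-frequency estimate, with an invented mini corpus):

```python
from collections import Counter

def ngram_probability(expression, corpus_tokens):
    """Naive estimate: count(expression) / number of n-grams of the same length in the corpus."""
    expr = tuple(expression)
    n = len(expr)
    grams = [tuple(corpus_tokens[i:i + n]) for i in range(len(corpus_tokens) - n + 1)]
    return Counter(grams)[expr] / max(len(grams), 1)

corpus = "the cat sat on the mat and the cat slept".split()
print(ngram_probability(["the", "cat"], corpus))   # 2 of the 9 bigrams -> ~0.22
```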
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
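Equivalently, PP is the exponential of the average negative log-probability per word; a small sketch (the probability list is made up):

```python
import numpy as np

def perplexity(token_probs):
    """PP = exp(-(1/T) * sum_t log p_t), i.e. the T-th root of the inverse dataset probability."""
    T = len(token_probs)
    return float(np.exp(-np.sum(np.log(token_probs)) / T))

print(perplexity([0.25, 0.25, 0.25, 0.25]))   # 4.0: as confused as a uniform choice among 4 words
```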
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出可能性最大的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留可能性最大的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
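Putting the beam-search steps, the beam width and the normalized objective together, here is a toy sketch; `next_log_probs` is an assumed callback returning log P(y&lt;t+1&gt;|x, y&lt;1&gt;,...,y&lt;t&gt;) over the vocabulary, and `B`, `alpha` and the stop-token id are illustrative choices.

```python
import numpy as np

def beam_search(next_log_probs, B=3, alpha=0.7, eos=0, max_len=20):
    """Keep the B best partial sentences at each step, then rank finished ones
    by the normalized log-likelihood objective (divide by T^alpha)."""
    beams, finished = [([], 0.0)], []          # (sequence, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_p = next_log_probs(seq)
            for w in np.argsort(log_p)[-B:]:   # B most likely next words
                candidates.append((seq + [int(w)], score + float(log_p[w])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:B]:      # keep the top B combinations overall
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:                          # every kept beam has ended at the stop token
            break
    scored = [(seq, score / len(seq) ** alpha) for seq, score in (finished or beams)]
    return max(scored, key=lambda c: c[1])
```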
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来探究为什么没有得到好的翻译y∗: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络架构, 正则化, 获取更多数据] +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:对较短的预测翻译可施加简短惩罚(brevity penalty),以防止bleu分数被人为地夸大。 +
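A single-reference sketch of the score (real BLEU is computed at corpus level and over several references, so treat this only as an illustration of the n-gram precisions pn and of the brevity penalty):

```python
import numpy as np
from collections import Counter

def ngram_precision(cand, ref, n):
    """Clipped n-gram precision p_n of a candidate translation against one reference."""
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

def bleu(cand, ref, N=4):
    """BP * exp(mean_n log p_n); BP < 1 penalizes candidates shorter than the reference."""
    p = [ngram_precision(cand, ref, n) for n in range(1, N + 1)]
    if min(p) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else float(np.exp(1 - len(ref) / len(cand)))
    return bp * float(np.exp(np.mean(np.log(p))))

print(bleu("the cat sits on the mat".split(), "the cat sits on a mat".split()))  # ~0.54
```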
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实际中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小,c为时刻t的上下文,则有: +
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
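A minimal sketch of the weights and the resulting context vector: the attention weights are a softmax over the Tx scores, and c&lt;t&gt; is the corresponding weighted sum of the activations (shapes are illustrative).

```python
import numpy as np

def attention_weights(scores):
    """alpha<t,t'> = exp(e<t,t'>) / sum_t'' exp(e<t,t''>): a softmax over the Tx scores."""
    w = np.exp(scores - scores.max())
    return w / w.sum()

def context_vector(activations, scores):
    """c<t> = sum_t' alpha<t,t'> * a<t'> for activations of shape (Tx, d)."""
    return attention_weights(scores) @ activations

rng = np.random.default_rng(0)
a = rng.normal(size=(7, 4))                 # Tx = 7 encoder activations of dimension 4
e = rng.normal(size=7)                      # unnormalized attention scores
print(attention_weights(e).sum())           # 1.0
print(context_vector(a, e).shape)           # (4,)
```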
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191006215639.md b/.history/zh/cs-230-recurrent-neural-networks_20191006215639.md new file mode 100644 index 000000000..487aa6e8e --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191006215639.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 符号说明, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相应的激活函数。 +
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 难以考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了缓解梯度消失问题, 某些类型的RNN中使用了特定的门, 这些门通常有明确的用途。它们通常记为Γ, 且等于: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V表示词汇表,用|V|表示词汇表的大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机与符号说明 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 则概率P(t|c)可写作: +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词与给定目标词同时出现的可能性,这些模型在由k个负样本和1个正样本组成的集合上训练。给定上下文词c和目标词t,其预测可表示为: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出可能性最大的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留可能性最大的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来探究为什么没有得到好的翻译y∗: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络架构, 正则化, 获取更多数据] +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:对较短的预测翻译可施加简短惩罚(brevity penalty),以防止bleu分数被人为地夸大。 +
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实际中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小,c为时刻t的上下文,则有: +
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007130535.md b/.history/zh/cs-230-recurrent-neural-networks_20191007130535.md new file mode 100644 index 000000000..aadc5908a --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007130535.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 符号说明, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相应的激活函数。 +
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了缓解梯度消失问题, 某些类型的RNN中使用了特定的门, 这些门通常有明确的用途。它们通常记为Γ, 且等于: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V表示词汇表,用|V|表示词汇表的大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机与符号说明 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 则概率P(t|c)可写作: +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词与给定目标词同时出现的可能性,这些模型在由k个负样本和1个正样本组成的集合上训练。给定上下文词c和目标词t,其预测可表示为: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出可能性最大的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留可能性最大的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当所预测得到的翻译ˆy很差时,有人会想,为什么我们没有通过执行以下错误分析得到一个好的翻译y: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:简洁的惩罚项可以应用于短预测翻译,以防止人为夸大bleu分数。 +
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实际中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小,c为时刻t的上下文,则有: +
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007130740.md b/.history/zh/cs-230-recurrent-neural-networks_20191007130740.md new file mode 100644 index 000000000..0b1767f64 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007130740.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 符号说明, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相应的激活函数。 +
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小增加, 计算考虑历史信息, 权重在时间尺度上被整个网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了缓解梯度消失问题, 某些类型的RNN中使用了特定的门, 这些门通常有明确的用途。它们通常记为Γ, 且等于: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V表示词汇表,用|V|表示词汇表的大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机与符号说明 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 则概率P(t|c)可写作: +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词与给定目标词同时出现的可能性,这些模型在由k个负样本和1个正样本组成的集合上训练。给定上下文词c和目标词t,其预测可表示为: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出可能性最大的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留可能性最大的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来探究为什么没有得到好的翻译y∗: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[波束搜索故障,RNN故障,增加波束宽度,尝试不同架构,正则化,获取更多数据] +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。 +
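A rough sketch of a bleu-style score against a single reference, including the brevity penalty mentioned above; real implementations differ (multiple references, smoothing conventions), so the details here are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped n-gram matches
        precisions.append(max(overlap, 1e-9) / max(sum(cand.values()), 1))
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "a cute teddy bear is reading persian literature".split()
candidate = "a teddy bear is reading persian literature".split()
print(round(bleu(candidate, reference), 3))
```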
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小,c为时刻t的上下文,则有: +
+ + +**86. with** + +⟶ +其中 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
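A minimal sketch of turning attention scores into weights α and a context vector c; the dot-product scoring of a previous decoder state against each activation is an assumption here, since the cheatsheet leaves the scoring network unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, n_a = 5, 8                        # input length and hidden size (made-up)
a = rng.normal(size=(Tx, n_a))        # activations a<1>, ..., a<Tx>
s_prev = rng.normal(size=(n_a,))      # previous decoder state (assumption)

e = a @ s_prev                        # scores e<t,t'> (dot-product assumption)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                  # attention weights: softmax over Tx positions
c = alpha @ a                         # context: attention-weighted sum of activations

print(alpha.round(3), c.shape)        # weights sum to 1, c has shape (n_a,)
```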
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007130831.md b/.history/zh/cs-230-recurrent-neural-networks_20191007130831.md new file mode 100644 index 000000000..360911351 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007130831.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间维度上共享的系数;g1,g2是相应的激活函数。 +
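A minimal NumPy sketch of the recurrence above; picking g1=tanh and g2=softmax, as well as the toy dimensions and random weights, are illustrative assumptions rather than choices made by the cheatsheet.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 3, 5, 2, 4                 # toy dimensions (assumptions)
Wax = rng.normal(size=(n_a, n_x))
Waa = rng.normal(size=(n_a, n_a))
Wya = rng.normal(size=(n_y, n_a))
ba, by = np.zeros(n_a), np.zeros(n_y)
xs = rng.normal(size=(T, n_x))                # input sequence x<1>, ..., x<T>

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

a = np.zeros(n_a)
for x in xs:
    a = np.tanh(Waa @ a + Wax @ x + ba)       # a<t> = g1(Waa a<t-1> + Wax x<t> + ba)
    y = softmax(Wya @ a + by)                 # y<t> = g2(Wya a<t> + by)
    print(y.round(3))
```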
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 该方法是用于解决进行反向传播时时而出现梯度爆炸问题的技术。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
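A small sketch of one common way to cap gradients; clipping by global L2 norm (rather than element-wise by value) and the threshold of 5 are assumptions, not prescriptions from the cheatsheet.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # Rescale all gradients so their global L2 norm never exceeds max_norm.
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([30.0, -40.0]), np.array([[1.0, 2.0]])]
print(clip_gradients(grads))   # same direction, norm capped at 5
```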
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可以缓解传统RNN中遇到的梯度消失问题,其中LSTM可视为GRU的推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
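A sketch of a single GRU step following the update-gate / relevance-gate equations summarized above; the weight shapes, random initialization and concatenated [a<t-1>, x<t>] layout are assumptions made only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_a = 3, 4                                        # toy sizes (assumptions)
Wu, Wr, Wc = (rng.normal(size=(n_a, n_a + n_x)) for _ in range(3))
bu, br, bc = np.zeros(n_a), np.zeros(n_a), np.zeros(n_a)

def gru_step(a_prev, x):
    z = np.concatenate([a_prev, x])
    gamma_u = sigmoid(Wu @ z + bu)                     # update gate
    gamma_r = sigmoid(Wr @ z + br)                     # relevance gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * a_prev, x]) + bc)
    return gamma_u * c_tilde + (1 - gamma_u) * a_prev  # c<t> = a<t> in a GRU

a = np.zeros(n_a)
for x in rng.normal(size=(5, n_x)):
    a = gru_step(a, x)
print(a.round(3))
```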
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
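A tiny sketch showing that multiplying the embedding matrix E by a 1-hot vector ow simply selects the corresponding column ew; the vocabulary and matrix values are placeholders.

```python
import numpy as np

V = ["book", "soft", "teddy"]                               # toy vocabulary (assumption)
E = np.arange(4 * len(V), dtype=float).reshape(4, len(V))   # embedding matrix, n_e x |V|

w = "soft"
o_w = np.zeros(len(V))
o_w[V.index(w)] = 1.0                                       # 1-hot representation o_w
e_w = E @ o_w                                               # e_w = E o_w
print(np.array_equal(e_w, E[:, V.index(w)]))                # the product is a column lookup
```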
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一种监督学习任务,通过评估任意给定目标词t与上下文词c共同出现的可能性来学习词嵌入。记与目标词t相关联的参数为θt,则概率P(t|c)由下式给出: +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定的上下文词与给定的目标词同时出现的可能性,模型在由k个负样本和1个正样本组成的集合上训练。给定上下文词c和目标词t,其预测可表示为: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
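The sketch below contrasts the skip-gram softmax P(t|c) with the negative-sampling logistic score for a single (context, target) pair; the parameter matrices are random placeholders and the training loop over k sampled negative words is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_e = 6, 4                          # toy sizes (assumptions)
theta = rng.normal(size=(vocab_size, n_e))      # target-word parameters theta_t
E = rng.normal(size=(vocab_size, n_e))          # context-word embeddings e_c

def skipgram_softmax(t, c):
    # P(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c): the sum over the
    # whole vocabulary is what makes this expensive.
    scores = theta @ E[c]
    scores = np.exp(scores - scores.max())
    return (scores / scores.sum())[t]

def negative_sampling_score(t, c):
    # Logistic score sigma(theta_t . e_c), trained against k negative examples.
    return 1.0 / (1.0 + np.exp(-theta[t] @ E[c]))

print(skipgram_softmax(2, 0), negative_sampling_score(2, 0))
```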
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
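As a sketch of the pieces above: one commonly used choice of the weighting function f (taken from the original GloVe paper, not from this cheatsheet) and the averaging of e and θ into the final embedding; all numbers are placeholders.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(x) = (x / x_max)^alpha if x < x_max else 1, with f(0) = 0,
    # so that X_ij = 0 contributes nothing to the cost J.
    return 0.0 if x == 0 else min(1.0, (x / x_max) ** alpha)

# Because e and theta play symmetric roles, the final embedding averages them.
e_w, theta_w = np.array([0.2, -0.5]), np.array([0.4, -0.1])
e_final_w = (e_w + theta_w) / 2

print(glove_weight(0.0), glove_weight(10.0), e_final_w)
```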
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
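A one-line sketch of the cosine similarity formula above, evaluated on made-up 2-D embeddings.

```python
import numpy as np

def cosine_similarity(e1, e2):
    # cos(theta) = (e1 . e2) / (||e1|| ||e2||), in [-1, 1]
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

e_teddy, e_bear, e_book = np.array([0.9, 0.1]), np.array([0.8, 0.3]), np.array([0.1, 0.9])
print(cosine_similarity(e_teddy, e_bear), cosine_similarity(e_teddy, e_book))
```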
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学、艺术、书籍、文化、诗歌、阅读、知识、娱乐、惹人爱的、童年、善良的、泰迪熊、柔软的、拥抱、可爱的、讨人喜欢的] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度 - 语言模型通常使用困惑度(也称为PP)来评估,它可以被解释为用词数T归一化后的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来探究为什么我们没有得到好的翻译y∗: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 进行正则化, 获取更多数据] +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。 +
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小,c为时刻t的上下文,则有: +
+ + +**86. with** + +⟶ +其中 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007130928.md b/.history/zh/cs-230-recurrent-neural-networks_20191007130928.md new file mode 100644 index 000000000..9c855f936 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007130928.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间维度上共享的系数;g1,g2是相应的激活函数。 +
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可以缓解传统RNN中遇到的梯度消失问题,其中LSTM可视为GRU的推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一种监督学习任务,通过评估任意给定目标词t与上下文词c共同出现的可能性来学习词嵌入。记与目标词t相关联的参数为θt,则概率P(t|c)由下式给出: +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定的上下文词与给定的目标词同时出现的可能性,模型在由k个负样本和1个正样本组成的集合上训练。给定上下文词c和目标词t,其预测可表示为: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学、艺术、书籍、文化、诗歌、阅读、知识、娱乐、惹人爱的、童年、善良的、泰迪熊、柔软的、拥抱、可爱的、讨人喜欢的] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度 - 语言模型通常使用困惑度(也称为PP)来评估,它可以被解释为用词数T归一化后的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来探究为什么我们没有得到好的翻译y∗: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 进行正则化, 获取更多数据] +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的好坏。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。 +
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小,c为时刻t的上下文,则有: +
+ + +**86. with** + +⟶ +其中 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007131018.md b/.history/zh/cs-230-recurrent-neural-networks_20191007131018.md new file mode 100644 index 000000000..0da66ac04 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007131018.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间维度上共享的系数;g1,g2是相应的激活函数。 +
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可以缓解传统RNN中遇到的梯度消失问题,其中LSTM可视为GRU的推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一种监督学习任务,通过评估任意给定目标词t与上下文词c共同出现的可能性来学习词嵌入。记与目标词t相关联的参数为θt,则概率P(t|c)由下式给出: +
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定的上下文词与给定的目标词同时出现的可能性,模型在由k个负样本和1个正样本组成的集合上训练。给定上下文词c和目标词t,其预测可表示为: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学、艺术、书籍、文化、诗歌、阅读、知识、娱乐、惹人爱的、童年、善良的、泰迪熊、柔软的、拥抱、可爱的、讨人喜欢的] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度 - 语言模型通常使用困惑度(也称为PP)来评估,它可以被解释为用词数T归一化后的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来探究为什么我们没有得到好的翻译y∗: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 进行正则化, 获取更多数据] +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替补(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为夸大。 +
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小,c为时刻t的上下文,则有: +
+ + +**86. with** + +⟶ +其中 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007131137.md b/.history/zh/cs-230-recurrent-neural-networks_20191007131137.md new file mode 100644 index 000000000..af32a9482 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007131137.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间维度上共享的系数;g1,g2是相应的激活函数。 +
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +

**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**

⟶
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNN中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
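
As a rough illustration of how the gates above combine in a GRU cell, the sketch below implements one GRU step in NumPy; the concatenated-input parameterization and all shapes are assumptions chosen to mirror the summarized equations, not a definitive implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, a_prev, Wr, Wu, Wc, br, bu, bc):
    """One GRU step: relevance gate Γr, update gate Γu, candidate c~, new state a<t> = c<t>."""
    xa = np.concatenate([a_prev, x_t])
    gamma_r = sigmoid(Wr @ xa + br)                                       # relevance gate
    gamma_u = sigmoid(Wu @ xa + bu)                                       # update gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * a_prev, x_t]) + bc)  # candidate state
    return gamma_u * c_tilde + (1.0 - gamma_u) * a_prev                   # Γu ⋆ c~ + (1−Γu) ⋆ c<t-1>

n_x, n_a = 4, 6                                                           # toy sizes
rng = np.random.default_rng(0)
Wr, Wu, Wc = (rng.normal(size=(n_a, n_a + n_x)) for _ in range(3))
a_t = gru_step(rng.normal(size=n_x), np.zeros(n_a), Wr, Wu, Wc,
               np.zeros(n_a), np.zeros(n_a), np.zeros(n_a))
```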
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +

**40. [Bidirectional (BRNN), Deep (DRNN)]**

⟶
[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)]
<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +

**42. In this section, we note V the vocabulary and |V| its size.**

⟶
在本节中,我们用V表示词汇表,用|V|表示其大小。
<br>
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 将该词汇的one-hot表示ow映射至词嵌入表示ew的嵌入矩阵E满足下式: +
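
The mapping E·ow = ew can be sketched directly; the toy vocabulary and the random embedding matrix below are made-up placeholders.

```python
import numpy as np

vocab = ["teddy", "bear", "book", "soft"]                    # toy vocabulary, |V| = 4
E = np.random.default_rng(0).normal(size=(3, len(vocab)))    # toy 3-dimensional embeddings

w = "book"
o_w = np.eye(len(vocab))[vocab.index(w)]                     # 1-hot representation o_w
e_w = E @ o_w                                                # picks out the column of E for w
assert np.allclose(e_w, E[:, vocab.index(w)])
```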
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +

**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**

⟶
Skip-gram ― skip-gram word2vec模型是一个监督学习任务,它通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入。记与目标词t相关联的参数为θt, 概率P(t|c)由下式给出:
<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
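
A sketch of the skip-gram softmax P(t|c) ∝ exp(θt·ec); the vocabulary size, dimensionality, and the randomly initialized Theta and E are assumptions (in practice they would be learned). The denominator sums over the whole vocabulary, which is exactly the cost mentioned in the remark above.

```python
import numpy as np

V, d = 1000, 50                       # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
Theta = rng.normal(size=(V, d))       # one parameter vector θ_j per vocabulary word
E = rng.normal(size=(V, d))           # word embeddings e_j

def p_target_given_context(t, c):
    """P(t|c) = exp(θ_t · e_c) / Σ_j exp(θ_j · e_c); the sum runs over the whole vocabulary."""
    scores = Theta @ E[c]
    scores -= scores.max()            # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[t]

print(p_target_given_context(t=42, c=7))
```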
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
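
A one-function sketch of cosine similarity between two embedding vectors, assuming NumPy arrays.

```python
import numpy as np

def cosine_similarity(e1, e2):
    """cos(θ) = e1 · e2 / (||e1|| ||e2||), in [-1, 1]."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # 1.0: same direction
```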
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +

**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
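
A sketch of perplexity computed from per-word probabilities assigned by some language model; the probabilities below are made-up numbers, and logs are used for numerical stability.

```python
import math

def perplexity(word_probs):
    """PP = (Π p_i)^(-1/T), computed as exp(-(1/T) Σ log p_i); lower is better."""
    T = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / T)

print(perplexity([0.2, 0.1, 0.25, 0.05]))   # made-up per-word probabilities
```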
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +

**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**

⟶
[第1步:找出最有可能的B个单词y<1>, 第2步:计算条件概率y<k>|x,y<1>,...,y<k−1>, 第3步:保留最有可能的B个组合x,y<1>,...,y<k>, 遇到停止词时结束该过程]
<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
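
A toy beam-search sketch that keeps the top-B partial sentences by log-probability at each step; next_probs, the start/stop tokens, and the tiny transition table in the usage line are all hypothetical stand-ins for a real decoder.

```python
import math
from heapq import nlargest

def beam_search(next_probs, start, eos, B=3, max_len=10):
    """next_probs(seq) -> {token: P(token | seq)}; start/eos are the start and stop tokens."""
    beams = [([start], 0.0)]                                  # (partial sentence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq[-1] == eos:                                # finished sentences carry over as-is
                candidates.append((seq, lp))
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((seq + [tok], lp + math.log(p)))
        beams = nlargest(B, candidates, key=lambda c: c[1])   # keep the top B combinations
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams

# toy usage with a fixed, hypothetical next-token table
table = {"<s>": {"a": 0.6, "b": 0.4}, "a": {"</s>": 1.0}, "b": {"</s>": 1.0}}
print(beam_search(lambda seq: table[seq[-1]], "<s>", "</s>", B=2))
```

With B=1 the same loop degenerates into the naive greedy search mentioned in the remark above.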
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
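
A sketch of the normalized log-likelihood objective used to re-score candidate translations, with the per-token probabilities and α=0.7 as illustrative assumptions.

```python
import math

def normalized_log_likelihood(token_probs, alpha=0.7):
    """(1 / Ty^α) Σ_t log p(y<t> | x, y<1>, ..., y<t-1>); α between 0.5 and 1 softens the length term."""
    Ty = len(token_probs)
    return sum(math.log(p) for p in token_probs) / (Ty ** alpha)

print(normalized_log_likelihood([0.4, 0.3, 0.5, 0.9]))   # made-up per-token probabilities
```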

**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**

⟶
误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗:
<br>
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +

**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
[束搜索有误, RNN有误, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据]
<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:简洁的惩罚项可以应用于短预测翻译,以防止人为夸大bleu分数。 +
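
A simplified single-reference bleu sketch with clipped n-gram counts, uniform weights and the brevity penalty described above; real scorers (e.g. sacrebleu) additionally handle corpora, multiple references and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, N=4):
    precisions = []
    for n in range(1, N + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())          # clipped n-gram counts
        precisions.append(max(overlap, 1e-9) / max(sum(cand.values()), 1))
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / N)      # geometric mean of p1..pN

print(bleu("the cat sat on a mat".split(), "the cat sat on the mat".split()))
```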
+ + +**84. Attention** + +⟶ +注意力机制 +

**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小, c为时间t的上下文, 我们有:
<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
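
A sketch of turning alignment scores e<t,t'> into attention weights α<t,t'> via a softmax and forming the context c<t>; the scores and encoder activations below are random placeholders.

```python
import numpy as np

def attention_context(scores, activations):
    """scores: (Tx,) alignment scores e<t,t'>; activations: (Tx, n_a) encoder states a<t'>."""
    w = np.exp(scores - scores.max())
    alpha = w / w.sum()                  # attention weights α<t,t'>, non-negative and summing to 1
    return alpha, alpha @ activations    # context c<t> = Σ_t' α<t,t'> a<t'>

Tx, n_a = 5, 8                           # toy sizes
rng = np.random.default_rng(0)
alpha, c_t = attention_context(rng.normal(size=Tx), rng.normal(size=(Tx, n_a)))
```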
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007131506.md b/.history/zh/cs-230-recurrent-neural-networks_20191007131506.md new file mode 100644 index 000000000..8a451aa00 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007131506.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +

**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
其中Wax,Waa,Wya,ba,by是在各时间步之间共享的系数;g1,g2是激活函数。
<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +

**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**

⟶
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNN中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +

**40. [Bidirectional (BRNN), Deep (DRNN)]**

⟶
[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)]
<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +

**42. In this section, we note V the vocabulary and |V| its size.**

⟶
在本节中,我们用V表示词汇表,用|V|表示其大小。
<br>
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和CBOW(Continuous Bag-of-Words Model)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +

**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**

⟶
Skip-gram ― skip-gram word2vec模型是一个监督学习任务,它通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入。记与目标词t相关联的参数为θt, 概率P(t|c)由下式给出:
<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +

**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +

**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**

⟶
[第1步:找出最有可能的B个单词y<1>, 第2步:计算条件概率y<k>|x,y<1>,...,y<k−1>, 第3步:保留最有可能的B个组合x,y<1>,...,y<k>, 遇到停止词时结束该过程]
<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +

**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**

⟶
误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗:
<br>
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +

**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
[束搜索有误, RNN有误, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据]
<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:简洁的惩罚项可以应用于短预测翻译,以防止人为夸大bleu分数。 +
+ + +**84. Attention** + +⟶ +注意力机制 +

**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小, c为时间t的上下文, 我们有:
<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007132100.md b/.history/zh/cs-230-recurrent-neural-networks_20191007132100.md new file mode 100644 index 000000000..543c90955 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007132100.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +

**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
其中Wax,Waa,Wya,ba,by是在各时间步之间共享的系数;g1,g2是激活函数。
<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +

**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**

⟶
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNN中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +

**40. [Bidirectional (BRNN), Deep (DRNN)]**

⟶
[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)]
<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +

**42. In this section, we note V the vocabulary and |V| its size.**

⟶
在本节中,我们用V表示词汇表,用|V|表示其大小。
<br>
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +

**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**

⟶
Skip-gram ― skip-gram word2vec模型是一个监督学习任务,它通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入。记与目标词t相关联的参数为θt, 概率P(t|c)由下式给出:
<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +

**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +

**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**

⟶
[第1步:找出最有可能的B个单词y<1>, 第2步:计算条件概率y<k>|x,y<1>,...,y<k−1>, 第3步:保留最有可能的B个组合x,y<1>,...,y<k>, 遇到停止词时结束该过程]
<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +

**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**

⟶
误差分析 - 当得到的预测翻译ˆy较差时,可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗:
<br>
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +

**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
[束搜索有误, RNN有误, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据]
<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:简洁的惩罚项可以应用于短预测翻译,以防止人为夸大bleu分数。 +
+ + +**84. Attention** + +⟶ +注意力机制 +

**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应分配给激活值a的注意力大小, c为时间t的上下文, 我们有:
<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007132504.md b/.history/zh/cs-230-recurrent-neural-networks_20191007132504.md new file mode 100644 index 000000000..77d04bd79 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007132504.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +

**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
其中Wax,Waa,Wya,ba,by是在各时间步之间共享的系数;g1,g2是激活函数。
<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +

**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**

⟶
GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNN中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +

**40. [Bidirectional (BRNN), Deep (DRNN)]**

⟶
[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)]
<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +

**42. In this section, we note V the vocabulary and |V| its size.**

⟶
在本节中,我们用V表示词汇表,用|V|表示其大小。
<br>
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +

**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**

⟶
Skip-gram ― skip-gram word2vec模型是一个监督学习任务,它通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入。记与目标词t相关联的参数为θt, 概率P(t|c)由下式给出:
<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +

**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +多元组(n-gram)模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]**

⟶
[第1步:寻找最有可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最有可能的B个组合x,y<1>,...,y, 在遇到终止符(stop word)时结束该过程]
<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
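+
+Editor's illustration (not part of the original cheatsheet): a compact beam-search sketch. The hypothetical `step_fn` stands in for the decoder network and is assumed to return the next-token distribution for a given prefix; with `beam_width=1` the procedure reduces to the greedy search mentioned in the remark above.
+
+```python
+import math
+
+def beam_search(step_fn, beam_width, max_len, eos="<eos>"):
+    """Keep the top-B partial hypotheses by cumulative log-probability."""
+    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
+    for _ in range(max_len):
+        candidates = []
+        for seq, score in beams:
+            if seq and seq[-1] == eos:      # finished hypotheses are kept as-is
+                candidates.append((seq, score))
+                continue
+            for token, prob in step_fn(seq):
+                candidates.append((seq + [token], score + math.log(prob)))
+        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
+    return beams[0]
+
+# Toy next-token distribution, for illustration only.
+def toy_step(prefix):
+    return [("hello", 0.6), ("world", 0.3), ("<eos>", 0.1)]
+
+print(beam_search(toy_step, beam_width=2, max_len=3))
+```
+<br>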
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
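+
+Editor's illustration (not part of the original cheatsheet): a sketch of the normalized log-likelihood objective, assuming the per-token log-probabilities of a candidate translation are given; α is the softener from the remark above.
+
+```python
+import numpy as np
+
+def normalized_log_likelihood(log_probs, alpha=0.7):
+    """(1 / Ty**alpha) * sum(log P(y<t> | x, y<1>, ..., y<t-1>)); alpha usually in [0.5, 1]."""
+    log_probs = np.asarray(log_probs)
+    Ty = len(log_probs)
+    return float(log_probs.sum() / (Ty ** alpha))
+
+# Hypothetical per-token log-probabilities of two candidate translations.
+short_candidate = [-0.2, -0.3]
+long_candidate = [-0.2, -0.3, -0.25, -0.3, -0.2]
+print(normalized_log_likelihood(short_candidate))
+print(normalized_log_likelihood(long_candidate))  # longer outputs are less penalized than with raw sums
+```
+<br>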
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当获得较差的预测翻译ˆy时,可以通过执行以下错误分析来思考为什么我们没有得到好的翻译y: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据]
<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**

⟶
注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为地夸大。
<br>
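+
+Editor's illustration (not part of the original cheatsheet): a simplified single-reference bleu sketch with clipped n-gram precisions and a brevity penalty. Production implementations additionally handle smoothing and multiple references.
+
+```python
+import math
+from collections import Counter
+
+def ngram_counts(tokens, n):
+    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
+
+def bleu(candidate, reference, max_n=4):
+    """Clipped n-gram precisions, geometric mean, brevity penalty (toy version)."""
+    precisions = []
+    for n in range(1, max_n + 1):
+        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
+        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
+        total = max(sum(cand.values()), 1)
+        precisions.append(max(overlap, 1e-9) / total)   # avoid log(0) in this sketch
+    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
+    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
+
+ref = "a cute teddy bear is reading persian literature".split()
+cand = "a teddy bear is reading persian literature".split()
+print(bleu(cand, ref))
+```
+<br>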
+ + +**84. Attention** + +⟶ +注意力机制 +
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应给予激活量a的注意力大小、c为时刻t的上下文,则有:
<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
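+
+Editor's illustration (not part of the original cheatsheet): a NumPy sketch of turning raw attention scores e<t,t'> into softmax weights α<t,t'> and a context vector c<t>; the scores and encoder activations here are random placeholders.
+
+```python
+import numpy as np
+
+def attention_context(scores, activations):
+    """Softmax the scores into weights alpha<t,t'> and form c<t> = sum_t' alpha<t,t'> a<t'>."""
+    scores = np.asarray(scores, dtype=float)
+    weights = np.exp(scores - scores.max())
+    weights /= weights.sum()                      # alpha<t,t'>, sums to 1 over t'
+    context = weights @ np.asarray(activations)   # context vector c<t>
+    return weights, context
+
+# Hypothetical scores over Tx = 3 input positions and 4-dimensional activations.
+e = [2.0, 0.5, -1.0]
+a = np.random.default_rng(0).normal(size=(3, 4))
+alpha, c = attention_context(e, a)
+print(alpha, c)
+```
+<br>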
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007132544.md b/.history/zh/cs-230-recurrent-neural-networks_20191007132544.md new file mode 100644 index 000000000..78cb38e70 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007132544.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, 多元组(n-gram), 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相应的激活函数。
<br>
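+
+Editor's illustration (not part of the original cheatsheet): a minimal NumPy sketch of the single-timestep equations above, assuming tanh and softmax as g1 and g2; the toy dimensions and random parameters are placeholders.
+
+```python
+import numpy as np
+
+def rnn_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
+    """a<t> = tanh(Waa a<t-1> + Wax x<t> + ba),  y<t> = softmax(Wya a<t> + by)."""
+    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
+    z = Wya @ a_t + by
+    y_t = np.exp(z - z.max())
+    y_t /= y_t.sum()
+    return a_t, y_t
+
+# Toy sizes (hidden = 5, input = 3, output = 4), random parameters for illustration.
+rng = np.random.default_rng(0)
+Wax, Waa = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
+Wya, ba, by = rng.normal(size=(4, 5)), np.zeros(5), np.zeros(4)
+a, y = rnn_step(rng.normal(size=3), np.zeros(5), Wax, Waa, Wya, ba, by)
+```
+<br>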
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
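+
+Editor's illustration (not part of the original cheatsheet): a short sketch of clipping by norm, one common way to cap the gradient as described above; the threshold 5.0 is an arbitrary illustrative choice.
+
+```python
+import numpy as np
+
+def clip_gradient(grad, max_norm=5.0):
+    """Rescale the gradient so its norm never exceeds max_norm (clipping by norm)."""
+    norm = np.linalg.norm(grad)
+    if norm > max_norm:
+        grad = grad * (max_norm / norm)
+    return grad
+
+g = np.array([30.0, -40.0])          # exploding gradient (norm 50)
+print(clip_gradient(g, max_norm=5))  # rescaled to norm 5, direction preserved
+```
+<br>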
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
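+
+Editor's illustration (not part of the original cheatsheet): a single-step GRU sketch following the gate equations summarized above (update gate Γu and relevance gate Γr, each of the form Γ = σ(W x + U a<t-1> + b)); the parameter shapes and values are toy placeholders.
+
+```python
+import numpy as np
+
+def sigmoid(x):
+    return 1.0 / (1.0 + np.exp(-x))
+
+def gru_step(x_t, a_prev, params):
+    """One GRU step with update gate, relevance gate and candidate state."""
+    Wu, Uu, bu, Wr, Ur, br, Wc, Uc, bc = params
+    gamma_u = sigmoid(Wu @ x_t + Uu @ a_prev + bu)          # update gate
+    gamma_r = sigmoid(Wr @ x_t + Ur @ a_prev + br)          # relevance gate
+    c_tilde = np.tanh(Wc @ x_t + Uc @ (gamma_r * a_prev) + bc)
+    return gamma_u * c_tilde + (1.0 - gamma_u) * a_prev     # a<t> = c<t> in a GRU
+
+# Toy sizes: hidden = 4, input = 3; random parameters, for illustration only.
+rng = np.random.default_rng(0)
+params = tuple(rng.normal(size=s) for s in
+               [(4, 3), (4, 4), (4,), (4, 3), (4, 4), (4,), (4, 3), (4, 4), (4,)])
+a_t = gru_step(rng.normal(size=3), np.zeros(4), params)
+```
+<br>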
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**

⟶
[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)]
<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
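+
+Editor's illustration (not part of the original cheatsheet): a tiny sketch showing that ew = E·ow amounts to selecting one column of the embedding matrix; the vocabulary and matrix values are made up.
+
+```python
+import numpy as np
+
+vocab = ["teddy", "bear", "book", "soft"]                  # toy vocabulary, |V| = 4
+E = np.random.default_rng(0).normal(size=(3, len(vocab)))  # embedding matrix (3 x |V|)
+
+w = vocab.index("book")
+o_w = np.eye(len(vocab))[w]   # 1-hot representation of "book"
+e_w = E @ o_w                 # equivalent to simply taking column w of E
+assert np.allclose(e_w, E[:, w])
+```
+<br>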
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**

⟶
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作:
<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
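+
+Editor's illustration (not part of the original cheatsheet): a sketch of the skip-gram softmax P(t|c), which also illustrates the remark above; the denominator sums over the whole (here 10,000-word, randomly initialized) vocabulary.
+
+```python
+import numpy as np
+
+def skipgram_prob(theta, e_c, t):
+    """P(t | c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)."""
+    scores = theta @ e_c
+    scores -= scores.max()                          # numerical stability
+    probs = np.exp(scores) / np.exp(scores).sum()   # the costly |V|-sized softmax
+    return probs[t]
+
+rng = np.random.default_rng(0)
+theta = rng.normal(size=(10000, 50))   # toy parameters for a 10,000-word vocabulary
+e_c = rng.normal(size=50)              # toy context-word embedding
+print(skipgram_prob(theta, e_c, t=42))
+```
+<br>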
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**

⟶
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词与给定目标词同时出现的可能性,模型在由k个负样本和1个正样本组成的集合上进行训练。对于给定的上下文单词c和目标单词t,其预测可表示为:
<br>
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +多元组(n-gram)模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]**

⟶
[第1步:寻找最有可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最有可能的B个组合x,y<1>,...,y, 在遇到终止符(stop word)时结束该过程]
<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当获得较差的预测翻译ˆy时,可以通过执行以下错误分析来思考为什么我们没有得到好的翻译y: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据]
<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**

⟶
注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为地夸大。
<br>
+ + +**84. Attention** + +⟶ +注意力机制 +
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应给予激活量a的注意力大小、c为时刻t的上下文,则有:
<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007133140.md b/.history/zh/cs-230-recurrent-neural-networks_20191007133140.md new file mode 100644 index 000000000..42ce14c33 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007133140.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, 多元组(n-gram), 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相应的激活函数。
<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**

⟶
[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)]
<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**

⟶
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作:
<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**

⟶
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词与给定目标词同时出现的可能性,模型在由k个负样本和1个正样本组成的集合上进行训练。对于给定的上下文单词c和目标单词t,其预测可表示为:
<br>
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]**

⟶
[第1步:寻找最有可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最有可能的B个组合x,y<1>,...,y, 在遇到终止符(stop word)时结束该过程]
<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当获得较差的预测翻译ˆy时,可以通过执行以下错误分析来思考为什么我们没有得到好的翻译y: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据]
<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**

⟶
注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为地夸大。
<br>
+ + +**84. Attention** + +⟶ +注意力机制 +
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应给予激活量a的注意力大小、c为时刻t的上下文,则有:
<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007133202.md b/.history/zh/cs-230-recurrent-neural-networks_20191007133202.md new file mode 100644 index 000000000..543c90955 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007133202.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, GRU/LSTM, 门类型, 双向RNN, 深度RNN] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相应的激活函数。
<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNNs中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**

⟶
[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)]
<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**

⟶
Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作:
<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**

⟶
负采样 - 它是一组基于逻辑回归的二分类器,旨在评估给定上下文词与给定目标词同时出现的可能性,模型在由k个负样本和1个正样本组成的集合上进行训练。对于给定的上下文单词c和目标单词t,其预测可表示为:
<br>
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]**

⟶
[第1步:寻找最有可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最有可能的B个组合x,y<1>,...,y, 在遇到终止符(stop word)时结束该过程]
<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当获得较差的预测翻译ˆy时,可以通过执行以下错误分析来思考为什么我们没有得到好的翻译y: +
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据]
<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**

⟶
注:可对过短的预测翻译施加简短惩罚(brevity penalty),以防止bleu分数被人为地夸大。
<br>
+ + +**84. Attention** + +⟶ +注意力机制 +
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分,从而在实践中提高所得模型的性能。记α为输出y应给予激活量a的注意力大小、c为时刻t的上下文,则有:
<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007133308.md b/.history/zh/cs-230-recurrent-neural-networks_20191007133308.md new file mode 100644 index 000000000..3b3bc6466 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007133308.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, RNN的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, 门控循环单元(GRU)/长短时记忆网络(LSTM), 门类型, 双向循环神经网络, 深度循环神经网络] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相应的激活函数。
<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
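One common way to cap the gradient as described above is to rescale it whenever its global norm exceeds a threshold; the sketch below assumes plain NumPy arrays and an arbitrary example threshold of 5.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# example: an exploding gradient gets rescaled down to norm 5
clipped = clip_by_global_norm([np.full(10, 100.0)])
```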
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNN中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: +<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
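As an illustration of how the gates above combine, here is a NumPy sketch of one GRU timestep built from the update gate Γu and the relevance gate Γr; the parameter names and toy shapes are assumptions made for this example rather than part of the cheatsheet.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, a_prev, p):
    """One GRU timestep; p is a dict of per-gate parameters W*, U*, b*."""
    gamma_u = sigmoid(p["Wu"] @ x_t + p["Uu"] @ a_prev + p["bu"])              # update gate
    gamma_r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ a_prev + p["br"])              # relevance gate
    a_tilde = np.tanh(p["Wa"] @ x_t + p["Ua"] @ (gamma_r * a_prev) + p["ba"])  # candidate state
    return gamma_u * a_tilde + (1.0 - gamma_u) * a_prev                        # element-wise (⋆) mix

# toy parameters: input size 3, hidden size 4
rng = np.random.default_rng(1)
p = {k: rng.normal(size=(4, 3)) for k in ("Wu", "Wr", "Wa")}
p.update({k: rng.normal(size=(4, 4)) for k in ("Uu", "Ur", "Ua")})
p.update({k: np.zeros(4) for k in ("bu", "br", "ba")})
a_next = gru_step(rng.normal(size=3), np.zeros(4), p)
```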
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
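The product of the embedding matrix with a 1-hot vector is simply a row lookup; the sketch below assumes a toy vocabulary and treats E as having one embedding per vocabulary word, which is a convention choice for the example.

```python
import numpy as np

vocab = ["teddy", "bear", "book", "soft"]                   # assumed toy vocabulary
E = np.random.default_rng(2).normal(size=(len(vocab), 8))   # one 8-d embedding per word

def one_hot(word):
    o = np.zeros(len(vocab))
    o[vocab.index(word)] = 1.0
    return o

o_w = one_hot("book")
e_w = o_w @ E                                     # multiplying by the 1-hot vector ...
assert np.allclose(e_w, E[vocab.index("book")])   # ... is the same as looking up a row
```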
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 则概率P(t|c)可写作: +<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
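A sketch of the softmax probability P(t|c) above, computed over a toy randomly initialized vocabulary; the loop over all |V| entries in the denominator is exactly the cost that the remark points out. Theta and E are assumed parameter matrices for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 1000, 50                      # toy vocabulary size and embedding dimension
Theta = rng.normal(size=(V, d))      # target-word parameters θ_j
E = rng.normal(size=(V, d))          # context-word embeddings e_c

def p_target_given_context(t, c):
    """P(t|c) = exp(θ_t · e_c) / Σ_j exp(θ_j · e_c); the sum runs over the whole vocabulary."""
    scores = Theta @ E[c]
    scores -= scores.max()           # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[t]

print(p_target_given_context(t=42, c=7))
```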
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
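A direct implementation of the cosine similarity formula above; the two toy vectors stand in for learned word embeddings.

```python
import numpy as np

def cosine_similarity(e1, e2):
    """cos(θ) = e1·e2 / (||e1|| ||e2||), a value in [-1, 1]."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

# toy embeddings pointing in nearly the same direction give a score close to 1
print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.1])))
```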
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
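A sketch of the perplexity metric computed from the probabilities a model assigns to the correct word at each of the T steps; the probabilities below are made up for illustration.

```python
import numpy as np

def perplexity(word_probs):
    """PP = (Π_t 1/p_t)^(1/T) = exp(-mean(log p_t)); lower is better."""
    word_probs = np.asarray(word_probs)
    return float(np.exp(-np.log(word_probs).mean()))

print(perplexity([0.2, 0.5, 0.1, 0.4]))   # made-up per-word probabilities
```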
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
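A sketch of the normalized log-likelihood objective above, scoring a candidate from its per-step log-probabilities with an assumed α of 0.7; without the 1/Ty^α factor, longer candidates would be unfairly penalized.

```python
import numpy as np

def normalized_log_likelihood(log_probs, alpha=0.7):
    """Score of a candidate: (1 / Ty^alpha) * Σ_t log p(y<t> | x, y<1>, ..., y<t-1>)."""
    log_probs = np.asarray(log_probs)
    Ty = len(log_probs)
    return float(log_probs.sum() / (Ty ** alpha))

short = np.log([0.5, 0.5])                 # 2-word candidate
longer = np.log([0.6, 0.6, 0.6, 0.6])      # 4-word candidate
print(normalized_log_likelihood(short), normalized_log_likelihood(longer))
```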
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当获得较差的预测翻译ˆy时, 可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗: +<br>
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据] +<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:可对较短的预测翻译施加简短惩罚(brevity penalty), 以防止bleu分数被人为地夸大。 +<br>
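A simplified single-reference sketch of the clipped n-gram precision pn and the resulting bleu score with a brevity penalty; real implementations handle multiple references and smoothing, which are omitted here, and the example sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision p_n of a candidate against a single reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu(candidate, reference, N=4):
    """Geometric mean of p_1..p_N, scaled by a brevity penalty for short candidates."""
    ps = [ngram_precision(candidate, reference, n) for n in range(1, N + 1)]
    if min(ps) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in ps) / N)

print(bleu("the teddy bear is reading a book".split(),
           "the teddy bear is reading persian poetry".split()))
```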
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分, 从而在实践中提高所得模型的性能。记α为输出y对激活值a应给予的注意力大小, c为时间t处的上下文, 则有: +<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
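A sketch of turning raw attention scores into the weights α and the context vector for one output step, assuming the scores e<t,t'> have already been produced by some small scoring network; the toy sizes are arbitrary.

```python
import numpy as np

def attention_context(scores, activations):
    """Softmax the scores e<t,t'> over t' to get the weights α<t,t'>, then form c<t> = Σ α a<t'>."""
    e = scores - scores.max()                 # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()       # attention weights, sum to 1
    context = alpha @ activations             # weighted sum of encoder activations
    return alpha, context

# toy example: Tx = 4 encoder steps with 3-d activations
rng = np.random.default_rng(4)
alpha, c = attention_context(rng.normal(size=4), rng.normal(size=(4, 3)))
```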
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007133411.md b/.history/zh/cs-230-recurrent-neural-networks_20191007133411.md new file mode 100644 index 000000000..f5f2c70c7 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007133411.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, 循环神经网络的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, 门控循环单元(GRU)/长短时记忆(LSTM), 门类型, 双向循环神经网络, 深度循环神经网络] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相关的激活函数。 +<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +RNNs的应用 - RNN模型常用于自然语言处理和语音识别, 下表总结了RNN模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[RNN的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在RNN模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, Tanh, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNN中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: +<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 则概率P(t|c)可写作: +<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当获得较差的预测翻译ˆy时, 可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗: +<br>
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据] +<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:可对较短的预测翻译施加简短惩罚(brevity penalty), 以防止bleu分数被人为地夸大。 +<br>
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分, 从而在实践中提高所得模型的性能。记α为输出y对激活值a应给予的注意力大小, c为时间t处的上下文, 则有: +<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007133536.md b/.history/zh/cs-230-recurrent-neural-networks_20191007133536.md new file mode 100644 index 000000000..66c4eb6e3 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007133536.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, 循环神经网络的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, 门控循环单元(GRU)/长短时记忆(LSTM), 门类型, 双向循环神经网络, 深度循环神经网络] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相关的激活函数。 +<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +循环神经网络的应用 - 循环神经网络(RNN)模型常用于自然语言处理和语音识别, 下表总结了循环神经网络(RNN)模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[循环神经网络的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在循环神经网络(RNN)模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, 双曲正切函数, RELU] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNN中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: +<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 则概率P(t|c)可写作: +<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ + +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学,艺术,书籍,文化,诗歌,阅读,知识,娱乐,惹人爱的、童年、善良、泰迪熊、柔软、拥抱、可爱、讨人喜欢的。] +
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ + +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]** + +⟶ +[第1步:找出最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 遇到停止词时结束该过程] +<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+ + +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当获得较差的预测翻译ˆy时, 可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗: +<br>
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+ + +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据] +<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:可对较短的预测翻译施加简短惩罚(brevity penalty), 以防止bleu分数被人为地夸大。 +<br>
+ + +**84. Attention** + +⟶ +注意力机制 +
+ + +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分, 从而在实践中提高所得模型的性能。记α为输出y对激活值a应给予的注意力大小, c为时间t处的上下文, 则有: +<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007133827.md b/.history/zh/cs-230-recurrent-neural-networks_20191007133827.md new file mode 100644 index 000000000..d382632f8 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007133827.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, 循环神经网络的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, 门控循环单元(GRU)/长短时记忆(LSTM), 门类型, 双向循环神经网络, 深度循环神经网络] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ + +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中Wax,Waa,Wya,ba,by是在时间尺度上被整个网络共享的系数;g1,g2是相关的激活函数。 +<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +循环神经网络的应用 - 循环神经网络(RNN)模型常用于自然语言处理和语音识别, 下表总结了循环神经网络(RNN)模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[循环神经网络的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在循环神经网络(RNN)模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, 双曲正切函数(Tanh), 整流线性单元(RELU)] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在RNN模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
+ + +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了解决消失梯度问题, 在某些类型的RNN中使用了特定的门, 并且通常有明确的目的。它们通常被写为Γ: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ + +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去多久的信息对现在来说是重要的?, 是否丢失以前的信息?,是否擦除该单元?, 展示单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆网络(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +GRU/LSTM ― 门控循环单元(GRU)和长短时记忆单元(LSTM)可解决传统RNN中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程: +<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆网络(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +RNN模型的变种 - 下表列出了其他常用的RNN结构: +
+ + +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ + +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词t与上下文词c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 则概率P(t|c)可写作: +<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+ + +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是基于逻辑回归的二分类器集合,旨在于评估给定上下文和给定目标词是如何同时出现的,其中模型被训练在k个反例和1个正例的集合上。对于一个给定的上下文单词c和一个目标单词t,其预测可由以下表达式进行表示: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
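+
+A minimal NumPy sketch of the cosine-similarity formula above; the two embeddings are dummy vectors used only to exercise the function.
+
+```python
+import numpy as np
+
+def cosine_similarity(e1, e2):
+    """cos(theta) = (e1 . e2) / (||e1|| ||e2||), a value in [-1, 1]."""
+    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
+
+e_w1, e_w2 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.5])
+print(cosine_similarity(e_w1, e_w2))
+```
+<br>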
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
+[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
+<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**

⟶
+困惑度 - 语言模型通常使用困惑度(也被称为PP)来评估, 它可以被解释为用词数T归一化后的数据集的逆概率。困惑度越低越好, 其定义如下:
+<br>
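+
+An illustrative sketch of the perplexity computation, written as the equivalent exponential of the average negative log-probability; the per-word probabilities are made-up numbers.
+
+```python
+import numpy as np
+
+def perplexity(word_probs):
+    """PP = (prod_t 1/p_t)^(1/T); lower is better."""
+    word_probs = np.asarray(word_probs, dtype=float)
+    T = len(word_probs)
+    return float(np.exp(-np.log(word_probs).sum() / T))
+
+# Probabilities the model assigned to each word of a 4-word sentence:
+print(perplexity([0.2, 0.5, 0.1, 0.4]))
+```
+<br>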
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]**

⟶
+[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 遇到停止词时结束进程]
+<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
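+
+As a sketch (assuming the per-step conditional probabilities of a candidate are available), the normalized log-likelihood used to rank beam-search candidates could be computed as below; α=0.7 is just an example value in the usual range between 0.5 and 1.
+
+```python
+import numpy as np
+
+def normalized_log_likelihood(step_probs, alpha=0.7):
+    """(1 / T_y**alpha) * sum_t log P(y<t> | x, y<1>, ..., y<t-1>)."""
+    step_probs = np.asarray(step_probs, dtype=float)
+    T_y = len(step_probs)
+    return float(np.log(step_probs).sum() / (T_y ** alpha))
+
+# Two candidate translations with their per-word conditional probabilities (dummy values):
+print(normalized_log_likelihood([0.4, 0.5]))             # short candidate
+print(normalized_log_likelihood([0.5, 0.5, 0.45, 0.5]))  # longer candidate
+```
+<br>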
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+
+
+**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**

⟶
+误差分析 - 当得到的预测翻译ˆy较差时, 可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗:
+<br>
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
+[束搜索出错, RNN出错, 增加束宽, 尝试不同架构, 正则化, 获取更多数据]
+<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**

⟶
+注:简短惩罚(brevity penalty)可应用于较短的预测翻译, 以防止bleu分数被人为夸大。
+<br>
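+
+Purely as an illustration of how the n-gram precisions and the brevity penalty combine, here is a small sketch; the precisions, lengths and the geometric mean over n=1..4 are example conventions, not values from the cheatsheet.
+
+```python
+import numpy as np
+
+def bleu(p_ns, candidate_len, reference_len):
+    """Geometric mean of the n-gram precisions, scaled by a brevity penalty for short candidates."""
+    bp = 1.0 if candidate_len > reference_len else float(np.exp(1.0 - reference_len / candidate_len))
+    return float(bp * np.exp(np.mean(np.log(p_ns))))
+
+# n-gram precisions p_1..p_4 of a candidate translation (made-up values):
+print(bleu([0.8, 0.6, 0.5, 0.4], candidate_len=18, reference_len=20))
+```
+<br>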
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:简洁的惩罚项可以应用于短预测翻译,以防止人为夸大bleu分数。 +
+ + +**84. Attention** + +⟶ +注意力机制 +
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
+注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分, 从而在实践中提高所得模型的性能。记α为输出y应给予激活值a的注意力大小, c为时间t的上下文, 则有:
+<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
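+
+A minimal sketch of the softmax that turns attention scores into weights, and of the resulting context vector; the encoder activations and the scores are random placeholders rather than the output of a trained network.
+
+```python
+import numpy as np
+
+Tx, n_a = 5, 8                          # toy input length and hidden size
+a = np.random.randn(Tx, n_a)            # activations a<t'> over the input positions
+scores = np.random.randn(Tx)            # attention scores e<t,t'> for one output step t
+
+# alpha<t,t'> = exp(e<t,t'>) / sum_{t''} exp(e<t,t''>)
+alpha = np.exp(scores - scores.max())
+alpha /= alpha.sum()
+
+# Context c<t> is the attention-weighted sum of the activations:
+c_t = alpha @ a
+print(alpha.sum(), c_t.shape)           # weights sum to 1; context has the hidden size
+```
+<br>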
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007134146.md b/.history/zh/cs-230-recurrent-neural-networks_20191007134146.md new file mode 100644 index 000000000..6fc982c09 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007134146.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, 循环神经网络的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度裁剪, 门控循环单元(GRU)/长短时记忆(LSTM), 门类型, 双向循环神经网络, 深度循环神经网络] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
+其中Wax,Waa,Wya,ba,by是在时间上被整个网络共享的系数, g1,g2是相应的激活函数。
+<br>
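+
+To make the recurrence concrete, a toy NumPy sketch of a single forward step, assuming g1=tanh and g2=softmax; the dimensions and random weights are placeholders.
+
+```python
+import numpy as np
+
+n_x, n_a, n_y = 3, 5, 2                                  # toy dimensions
+Wax, Waa, Wya = (np.random.randn(n_a, n_x),
+                 np.random.randn(n_a, n_a),
+                 np.random.randn(n_y, n_a))
+ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))
+
+def rnn_step(x_t, a_prev):
+    """a<t> = g1(Waa a<t-1> + Wax x<t> + ba) and y<t> = g2(Wya a<t> + by)."""
+    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
+    z = Wya @ a_t + by
+    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # softmax output
+    return a_t, y_t
+
+a0, x1 = np.zeros((n_a, 1)), np.random.randn(n_x, 1)
+a1, y1 = rnn_step(x1, a0)
+print(a1.shape, y1.shape)
+```
+<br>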
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +循环神经网络的应用 - 循环神经网络(RNN)模型常用于自然语言处理和语音识别, 下表总结了循环神经网络(RNN)模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[循环神经网络的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在循环神经网络(RNN)模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, 双曲正切函数(Tanh), 整流线性单元(RELU)] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在循环神经网络(RNN)模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度裁剪 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +裁剪 +
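+
+A small sketch of norm-based gradient clipping; the threshold value is an arbitrary example.
+
+```python
+import numpy as np
+
+def clip_gradient(grad, max_norm=5.0):
+    """Rescale the gradient so that its norm never exceeds max_norm."""
+    norm = np.linalg.norm(grad)
+    if norm > max_norm:
+        grad = grad * (max_norm / norm)
+    return grad
+
+g = 100.0 * np.random.randn(10)          # an "exploding" gradient, for illustration
+print(np.linalg.norm(clip_gradient(g)))  # capped at max_norm
+```
+<br>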
+
+
+**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**

⟶
+门类型 - 为了解决梯度消失问题, 某些类型的RNN中使用了特定的门, 这些门通常有明确的目的。它们通常记作Γ, 且等于:
+<br>
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
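+
+For illustration, the generic gate Γ=σ(Wx+Ua+b) can be sketched as follows; the dimensions and coefficients are toy assumptions.
+
+```python
+import numpy as np
+
+def sigmoid(x):
+    return 1.0 / (1.0 + np.exp(-x))
+
+n_x, n_a = 3, 5                                           # toy dimensions
+W, U, b = np.random.randn(n_a, n_x), np.random.randn(n_a, n_a), np.zeros(n_a)
+x_t, a_prev = np.random.randn(n_x), np.random.randn(n_a)
+
+gamma = sigmoid(W @ x_t + U @ a_prev + b)                 # every entry lies in (0, 1)
+print(gamma)
+```
+<br>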
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+
+
+**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**

⟶
+[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]
+<br>
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆(LSTM), 门控循环单元(GRU)] +
+
+
+**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**

⟶
+门控循环单元(GRU)/长短时记忆(LSTM) ― 门控循环单元(GRU)和长短时记忆(LSTM)可解决传统循环神经网络(RNNs)中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
+<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +循环神经网络(RNN)模型的变种 - 下表列出了其他常用的RNN结构: +
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**

⟶
+[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)]
+<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**

⟶
+Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作:
+<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**

⟶
+负采样 - 它是一组基于逻辑回归的二分类器, 旨在评估给定上下文词与给定目标词同时出现的可能性, 模型在由k个负样本和1个正样本组成的集合上进行训练。对于给定的上下文单词c和目标单词t, 其预测可由以下表达式表示:
+<br>
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
+[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
+<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
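+
+A toy sketch of the counting idea, estimating P(w2|w1) from bigram counts in a made-up corpus.
+
+```python
+from collections import Counter
+
+corpus = "a cute teddy bear is reading a book about a cute teddy bear".split()
+
+bigrams = Counter(zip(corpus, corpus[1:]))   # counts of each (w1, w2) pair in the training data
+unigrams = Counter(corpus)
+
+def p_bigram(w1, w2):
+    """P(w2 | w1) ~= count(w1, w2) / count(w1)."""
+    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
+
+print(p_bigram("teddy", "bear"))   # 'bear' always follows 'teddy' in this toy corpus
+```
+<br>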
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**

⟶
+困惑度 - 语言模型通常使用困惑度(也被称为PP)来评估, 它可以被解释为用词数T归一化后的数据集的逆概率。困惑度越低越好, 其定义如下:
+<br>
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]**

⟶
+[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 遇到停止词时结束进程]
+<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+
+
+**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**

⟶
+误差分析 - 当得到的预测翻译ˆy较差时, 可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗:
+<br>
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
+[束搜索出错, RNN出错, 增加束宽, 尝试不同架构, 正则化, 获取更多数据]
+<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**

⟶
+注:简短惩罚(brevity penalty)可应用于较短的预测翻译, 以防止bleu分数被人为夸大。
+<br>
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:简洁的惩罚项可以应用于短预测翻译,以防止人为夸大bleu分数。 +
+ + +**84. Attention** + +⟶ +注意力机制 +
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
+注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分, 从而在实践中提高所得模型的性能。记α为输出y应给予激活值a的注意力大小, c为时间t的上下文, 则有:
+<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/.history/zh/cs-230-recurrent-neural-networks_20191007134422.md b/.history/zh/cs-230-recurrent-neural-networks_20191007134422.md new file mode 100644 index 000000000..0655cd6b2 --- /dev/null +++ b/.history/zh/cs-230-recurrent-neural-networks_20191007134422.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, 循环神经网络的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度截断, 门控循环单元(GRU)/长短时记忆(LSTM), 门类型, 双向循环神经网络, 深度循环神经网络] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ + +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks,RNNs), 是一类可以将之前的输出作为后续隐藏状态的输入的神经网络。通常可表示为以下形式: +
+ + +**12. For each timestep t, the activation a and the output y are expressed as follows:** + +⟶ +对于每一个时间步t,激活值a和输出y可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**

⟶
+其中Wax,Waa,Wya,ba,by是在时间上被整个网络共享的系数, g1,g2是相应的激活函数。
+<br>
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +循环神经网络的应用 - 循环神经网络(RNN)模型常用于自然语言处理和语音识别, 下表总结了循环神经网络(RNN)模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[循环神经网络的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在循环神经网络(RNN)模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, 双曲正切函数(Tanh), 整流线性单元(RELU)] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在循环神经网络(RNN)模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度截断 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +截断 +
+
+
+**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**

⟶
+门类型 - 为了解决梯度消失问题, 某些类型的RNN中使用了特定的门, 这些门通常有明确的目的。它们通常记作Γ, 且等于:
+<br>
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+
+
+**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**

⟶
+[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?]
+<br>
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆(LSTM), 门控循环单元(GRU)] +
+
+
+**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**

⟶
+门控循环单元(GRU)/长短时记忆(LSTM) ― 门控循环单元(GRU)和长短时记忆(LSTM)可解决传统循环神经网络(RNNs)中遇到的梯度消失问题, 其中LSTM是GRU的一种推广。下表总结了每种结构的特性方程:
+<br>
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +循环神经网络(RNN)模型的变种 - 下表列出了其他常用的RNN结构: +
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**

⟶
+[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)]
+<br>
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ + +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中,我们用V来表示词汇,用|V|来表示词汇大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**

⟶
+Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词汇t相关联的参数为θt, 概率P(t|c)可写作:
+<br>
+ + +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:在softmax部分的分母中总计所有词汇使得模型的计算代价十分高昂。CBOW是另一个word2vec模型,其使用周围的单词来预测给定的单词。 +
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**

⟶
+负采样 - 它是一组基于逻辑回归的二分类器, 旨在评估给定上下文词与给定目标词同时出现的可能性, 模型在由k个负样本和1个正样本组成的集合上进行训练。对于给定的上下文单词c和目标单词t, 其预测可由以下表达式表示:
+<br>
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**

⟶
+[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的]
+<br>
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**

⟶
+困惑度 - 语言模型通常使用困惑度(也被称为PP)来评估, 它可以被解释为用词数T归一化后的数据集的逆概率。困惑度越低越好, 其定义如下:
+<br>
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ + +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似,只是其前面有一个编码器网络。因此,机器翻译模型有时被称为条件语言模型。该模型目标是找到一个句子y,以便: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y|x,y<1>,...,y, Step 3: Keep top B combinations x,y<1>,...,y, End process at a stop word]**

⟶
+[第1步:寻找最可能的B个单词y<1>, 第2步:计算条件概率y|x,y<1>,...,y, 第3步:保留最可能的B个组合x,y<1>,...,y, 遇到停止词时结束进程]
+<br>
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ + +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大,搜索结果越好,但是其性能会变慢并且内存占用增加,B的值越小,搜索结果越差,但是计算代价小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
+
+
+**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**

⟶
+误差分析 - 当得到的预测翻译ˆy较差时, 可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗:
+<br>
+ + +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况、根本原因、补救措施] +
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**

⟶
+[束搜索出错, RNN出错, 增加束宽, 尝试不同架构, 正则化, 获取更多数据]
+<br>
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**

⟶
+注:简短惩罚(brevity penalty)可应用于较短的预测翻译, 以防止bleu分数被人为夸大。
+<br>
+ + +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:简洁的惩罚项可以应用于短预测翻译,以防止人为夸大bleu分数。 +
+ + +**84. Attention** + +⟶ +注意力机制 +
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α the amount of attention that the output y should pay to the activation a and c the context at time t, we have:**

⟶
+注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分, 从而在实践中提高所得模型的性能。记α为输出y应给予激活值a的注意力大小, c为时间t的上下文, 则有:
+<br>
+ + +**86. with** + +⟶ +和 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ + +**89. Attention weight ― The amount of attention that the output y should pay to the activation a is given by α computed as follows:** + +⟶ +注意力权重 - 输出y对激活量a的注意力程度(即注意力权重)由α给出,其计算如下: +
+ + +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度是Tx的平方。 +
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +
diff --git a/README.md b/README.md index 8d14e6a12..746c7ff83 100644 --- a/README.md +++ b/README.md @@ -98,7 +98,7 @@ Please make sure to propose the translation of **only one** cheatsheet per pull |**Türkçe**|done|done|done| |**Українська**|not started|not started|not started| |**Tiếng Việt**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/180)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/178)| -|**中文**|not started|not started|not started| +|**中文**|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/181)|not started| ## Acknowledgements Thank you everyone for your help! Please do not forget to add your name to the `CONTRIBUTORS` file so that we can give you proper credit in the cheatsheets' [official website](https://stanford.edu/~shervine/teaching). diff --git a/zh-tw/cs-229-deep-learning.md b/zh-tw/cs-229-deep-learning.md index 9ab9bbad2..ee64d7556 100644 --- a/zh-tw/cs-229-deep-learning.md +++ b/zh-tw/cs-229-deep-learning.md @@ -31,13 +31,13 @@ 6. **By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:** ⟶ -我們使用 i 來代表網路的第 i 層、j 來代表某一層中第 j 個隱藏神經元的話,我們可以得到下面得等式: +我們使用 i 來代表網路的第 i 層、j 來代表某一層中第 j 個隱藏神經元的話, 我們可以得到下面得等式:
7. **where we note w, b, z the weight, bias and output respectively.** ⟶ -其中,我們分別使用 w 來代表權重、b 代表偏差項、z 代表輸出的結果。 +其中, 我們分別使用 w 來代表權重、b 代表偏差項、z 代表輸出的結果。
8. **Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:** @@ -61,25 +61,25 @@ Activation function - Activation function 是為了在每一層尾端的神經 11. **Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.** ⟶ -學習速率 - 學習速率通常用 α 或 η 來表示,目的是用來控制權重更新的速度。學習速度可以是一個固定值,或是隨著訓練的過程改變。現在最熱門的最佳化方法叫作 Adam,是一種隨著訓練過程改變學習速率的最佳化方法。 +學習速率 - 學習速率通常用 α 或 η 來表示, 目的是用來控制權重更新的速度。學習速度可以是一個固定值, 或是隨著訓練的過程改變。現在最熱門的最佳化方法叫作 Adam, 是一種隨著訓練過程改變學習速率的最佳化方法。
12. **Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:** ⟶ -反向傳播演算法 - 反向傳播演算法是一種在神經網路中用來更新權重的方法,更新的基準是根據神經網路的實際輸出值和期望輸出值之間的關係。權重的導數是根據連鎖律 (chain rule) 來計算,通常會表示成下面的形式: +反向傳播演算法 - 反向傳播演算法是一種在神經網路中用來更新權重的方法, 更新的基準是根據神經網路的實際輸出值和期望輸出值之間的關係。權重的導數是根據連鎖律 (chain rule) 來計算, 通常會表示成下面的形式:
13. **As a result, the weight is updated as follows:** ⟶ -因此,權重會透過以下的方式來更新: +因此, 權重會透過以下的方式來更新:
14. **Updating weights ― In a neural network, weights are updated as follows:** ⟶ -更新權重 - 在神經網路中,權重的更新會透過以下步驟進行: +更新權重 - 在神經網路中, 權重的更新會透過以下步驟進行:
15. **Step 1: Take a batch of training data.** @@ -109,7 +109,7 @@ Activation function - Activation function 是為了在每一層尾端的神經 19. **Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p** ⟶ -Dropout - Dropout 是一種透過丟棄一些神經元,來避免過擬和的技巧。在實務上,神經元會透過機率值的設定來決定要丟棄或保留 +Dropout - Dropout 是一種透過丟棄一些神經元, 來避免過擬和的技巧。在實務上, 神經元會透過機率值的設定來決定要丟棄或保留
20. **Convolutional Neural Networks** @@ -121,19 +121,19 @@ Dropout - Dropout 是一種透過丟棄一些神經元,來避免過擬和的 21. **Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:** ⟶ -卷積層的需求 - 我們使用 W 來表示輸入資料的維度大小、F 代表卷積層的 filter 尺寸、P 代表對資料墊零 (zero padding) 使資料長度齊一後的長度,S 代表卷積後取出的特徵 stride 數量,則輸出的維度大小可以透過以下的公式表示: +卷積層的需求 - 我們使用 W 來表示輸入資料的維度大小、F 代表卷積層的 filter 尺寸、P 代表對資料墊零 (zero padding) 使資料長度齊一後的長度, S 代表卷積後取出的特徵 stride 數量, 則輸出的維度大小可以透過以下的公式表示:
22. **Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:** ⟶ -批次正規化 (Batch normalization) - 它是一個藉由 γ,β 兩個超參數來正規化每個批次 {xi} 的過程。每一次正規化的過程,我們使用 μB,σ2B 分別代表平均數和變異數。請參考以下公式: +批次正規化 (Batch normalization) - 它是一個藉由 γ,β 兩個超參數來正規化每個批次 {xi} 的過程。每一次正規化的過程, 我們使用 μB,σ2B 分別代表平均數和變異數。請參考以下公式:
23. **It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.** ⟶ -批次正規化的動作通常在全連接層/卷積層之後、在非線性層之前進行。目的在於接納更高的學習速率,並且減少該批次學習初期對取樣資料特徵的依賴性。 +批次正規化的動作通常在全連接層/卷積層之後、在非線性層之前進行。目的在於接納更高的學習速率, 並且減少該批次學習初期對取樣資料特徵的依賴性。
24. **Recurrent Neural Networks** @@ -145,7 +145,7 @@ Dropout - Dropout 是一種透過丟棄一些神經元,來避免過擬和的 25. **Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:** ⟶ -閘的種類 - 在傳統的遞歸神經網路中,你會遇到幾種閘: +閘的種類 - 在傳統的遞歸神經網路中, 你會遇到幾種閘:
26. **[Input gate, forget gate, gate, output gate]** @@ -163,7 +163,7 @@ Dropout - Dropout 是一種透過丟棄一些神經元,來避免過擬和的 28. **LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.** ⟶ -長短期記憶模型 - 長短期記憶模型是一種遞歸神經網路,藉由導入遺忘閘的設計來避免梯度消失的問題 +長短期記憶模型 - 長短期記憶模型是一種遞歸神經網路, 藉由導入遺忘閘的設計來避免梯度消失的問題
29. **Reinforcement Learning and Control** @@ -205,7 +205,7 @@ A 是一組行為的集合 35. **{Psa} are the state transition probabilities for s∈S and a∈A** ⟶ -{Psa} 指的是,當 s∈S、a∈A 時,狀態轉移的機率 +{Psa} 指的是, 當 s∈S、a∈A 時, 狀態轉移的機率
36. **γ∈[0,1[ is the discount factor** @@ -217,25 +217,25 @@ A 是一組行為的集合 37. **R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize** ⟶ -R:S×A⟶R 或 R:S⟶R 指的是獎勵函數,也就是演算法想要去最大化的目標函數 +R:S×A⟶R 或 R:S⟶R 指的是獎勵函數, 也就是演算法想要去最大化的目標函數
38. **Policy ― A policy π is a function π:S⟶A that maps states to actions.** ⟶ -策略 - 一個策略 π 指的是一個函數 π:S⟶A,這個函數會將狀態映射到行為 +策略 - 一個策略 π 指的是一個函數 π:S⟶A, 這個函數會將狀態映射到行為
39. **Remark: we say that we execute a given policy π if given a state a we take the action a=π(s).** ⟶ -注意:我們會說,我們給定一個策略 π,當我們給定一個狀態 s 我們會採取一個行動 a=π(s) +注意:我們會說, 我們給定一個策略 π, 當我們給定一個狀態 s 我們會採取一個行動 a=π(s)
40. **Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:** ⟶ -價值函數 - 給定一個策略 π 和狀態 s,我們定義價值函數 Vπ 為: +價值函數 - 給定一個策略 π 和狀態 s, 我們定義價值函數 Vπ 為:
41. **Bellman equation ― The optimal Bellman equations characterizes the value function Vπ∗ of the optimal policy π∗:** @@ -247,7 +247,7 @@ R:S×A⟶R 或 R:S⟶R 指的是獎勵函數,也就是演算法想要去最大 42. **Remark: we note that the optimal policy π∗ for a given state s is such that:** ⟶ -注意:對於給定一個狀態 s,最佳的策略 π∗ 是: +注意:對於給定一個狀態 s, 最佳的策略 π∗ 是:
43. **Value iteration algorithm ― The value iteration algorithm is in two steps:** @@ -265,7 +265,7 @@ R:S×A⟶R 或 R:S⟶R 指的是獎勵函數,也就是演算法想要去最大 45. **2) We iterate the value based on the values before:** ⟶ -根據之前的值,迭代此價值的值: +根據之前的值, 迭代此價值的值:
46. **Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:** @@ -289,7 +289,7 @@ R:S×A⟶R 或 R:S⟶R 指的是獎勵函數,也就是演算法想要去最大 49. **Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:** ⟶ -Q-learning 演算法 - Q-learning 演算法是針對 Q 的一個 model-free 的估計,如下: +Q-learning 演算法 - Q-learning 演算法是針對 Q 的一個 model-free 的估計, 如下: 50. **View PDF version on GitHub** diff --git a/zh-tw/cs-229-linear-algebra.md b/zh-tw/cs-229-linear-algebra.md index 36d4cef5d..8466a6644 100644 --- a/zh-tw/cs-229-linear-algebra.md +++ b/zh-tw/cs-229-linear-algebra.md @@ -19,19 +19,19 @@ 4. **Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:** ⟶ -向量 - 我們定義 x∈Rn 是一個向量,包含 n 維元素,xi∈R 是第 i 維元素: +向量 - 我們定義 x∈Rn 是一個向量, 包含 n 維元素, xi∈R 是第 i 維元素:
5. **Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:** ⟶ -矩陣 - 我們定義 A∈Rm×n 是一個 m 列 n 行的矩陣,Ai,j∈R 代表位在第 i 列第 j 行的元素: +矩陣 - 我們定義 A∈Rm×n 是一個 m 列 n 行的矩陣, Ai,j∈R 代表位在第 i 列第 j 行的元素:
6. **Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.** ⟶ -注意:上述定義的向量 x 可以視為 nx1 的矩陣,或是更常被稱為行向量 +注意:上述定義的向量 x 可以視為 nx1 的矩陣, 或是更常被稱為行向量
7. **Main matrices** @@ -43,19 +43,19 @@ 8. **Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:** ⟶ -單位矩陣 - 單位矩陣 I∈Rn×n 是一個方陣,其主對角線皆為 1,其餘皆為 0 +單位矩陣 - 單位矩陣 I∈Rn×n 是一個方陣, 其主對角線皆為 1, 其餘皆為 0
9. **Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.** ⟶ -注意:對於所有矩陣 A∈Rn×n,我們有 A×I=I×A=A +注意:對於所有矩陣 A∈Rn×n, 我們有 A×I=I×A=A
10. **Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:** ⟶ -對角矩陣 - 對角矩陣 D∈Rn×n 是一個方陣,其主對角線為非 0,其餘皆為 0 +對角矩陣 - 對角矩陣 D∈Rn×n 是一個方陣, 其主對角線為非 0, 其餘皆為 0
11. **Remark: we also note D as diag(d1,...,dn).** @@ -85,19 +85,19 @@ 15. **inner product: for x,y∈Rn, we have:** ⟶ -內積:對於 x,y∈Rn,我們可以得到: +內積:對於 x,y∈Rn, 我們可以得到:
16. **outer product: for x∈Rm,y∈Rn, we have:** ⟶ -外積:對於 x∈Rm,y∈Rn,我們可以得到: +外積:對於 x∈Rm,y∈Rn, 我們可以得到:
17. **Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rn, such that:** ⟶ -矩陣-向量 - 矩陣 A∈Rm×n 和向量 x∈Rn 的乘積是一個大小為 Rm 的向量,使得: +矩陣-向量 - 矩陣 A∈Rm×n 和向量 x∈Rn 的乘積是一個大小為 Rm 的向量, 使得:
18. **where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.** @@ -109,13 +109,13 @@ 19. **Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rn×p, such that:** ⟶ -矩陣-矩陣:矩陣 A∈Rm×n 和 B∈Rn×p 的乘積為一個大小 Rm×p 的矩陣,使得: +矩陣-矩陣:矩陣 A∈Rm×n 和 B∈Rn×p 的乘積為一個大小 Rm×p 的矩陣, 使得:
20. **where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively** ⟶ -其中,aTr,i,bTr,i 和 ac,j,bc,j 分別是 A 和 B 的列向量與行向量 +其中, aTr,i,bTr,i 和 ac,j,bc,j 分別是 A 和 B 的列向量與行向量
21. **Other operations** @@ -127,49 +127,49 @@ 22. **Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:** ⟶ -轉置 - 一個矩陣的轉置矩陣 A∈Rm×n,記作 AT,指的是其中元素的翻轉: +轉置 - 一個矩陣的轉置矩陣 A∈Rm×n, 記作 AT, 指的是其中元素的翻轉:
23. **Remark: for matrices A,B, we have (AB)T=BTAT** ⟶ -注意:對於矩陣 A、B,我們有 (AB)T=BTAT +注意:對於矩陣 A、B, 我們有 (AB)T=BTAT
24. **Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:** ⟶ -可逆 - 一個可逆矩陣 A 記作 A−1,存在唯一的矩陣,使得: +可逆 - 一個可逆矩陣 A 記作 A−1, 存在唯一的矩陣, 使得:
25. **Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1** ⟶ -注意:並非所有的方陣都是可逆的。同樣的,對於矩陣 A、B 來說,我們有 (AB)−1=B−1A−1 +注意:並非所有的方陣都是可逆的。同樣的, 對於矩陣 A、B 來說, 我們有 (AB)−1=B−1A−1
26. **Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:** ⟶ -跡 - 一個方陣 A 的跡,記作 tr(A),指的是主對角線元素之合: +跡 - 一個方陣 A 的跡, 記作 tr(A), 指的是主對角線元素之合:
27. **Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)** ⟶ -注意:對於矩陣 A、B 來說,我們有 tr(AT)=tr(A) 及 tr(AB)=tr(BA) +注意:對於矩陣 A、B 來說, 我們有 tr(AT)=tr(A) 及 tr(AB)=tr(BA)
28. **Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:** ⟶ -行列式 - 一個方陣 A∈Rn×n 的行列式,記作|A| 或 det(A),可以透過 A∖i,∖j 來遞迴表示,它是一個沒有第 i 列和第 j 行的矩陣 A: +行列式 - 一個方陣 A∈Rn×n 的行列式, 記作|A| 或 det(A), 可以透過 A∖i,∖j 來遞迴表示, 它是一個沒有第 i 列和第 j 行的矩陣 A:
29. **Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.** ⟶ -注意:A 是一個可逆矩陣,若且唯若 |A|≠0。同樣的,|AB|=|A||B| 且 |AT|=|A| +注意:A 是一個可逆矩陣, 若且唯若 |A|≠0。同樣的, |AB|=|A||B| 且 |AT|=|A|
30. **Matrix properties** @@ -187,7 +187,7 @@ 32. **Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:** ⟶ -對稱分解 - 給定一個矩陣 A,它可以透過其對稱和反對稱的部分表示如下: +對稱分解 - 給定一個矩陣 A, 它可以透過其對稱和反對稱的部分表示如下:
33. **[Symmetric, Antisymmetric]** @@ -199,25 +199,25 @@ 34. **Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:** ⟶ -範數 - 範數指的是一個函式 N:V⟶[0,+∞[,其中 V 是一個向量空間,且對於所有 x,y∈V,我們有: +範數 - 範數指的是一個函式 N:V⟶[0,+∞[, 其中 V 是一個向量空間, 且對於所有 x,y∈V, 我們有:
35. **N(ax)=|a|N(x) for a scalar** ⟶ -對一個純量來說,我們有 N(ax)=|a|N(x) +對一個純量來說, 我們有 N(ax)=|a|N(x)
36. **if N(x)=0, then x=0** ⟶ -若 N(x)=0 時,則 x=0 +若 N(x)=0 時, 則 x=0
37. **For x∈V, the most commonly used norms are summed up in the table below:** ⟶ -對於 x∈V,最常用的範數總結如下表: +對於 x∈V, 最常用的範數總結如下表:
38. **[Norm, Notation, Definition, Use case]** @@ -229,43 +229,43 @@ 39. **Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.** ⟶ -線性相關 - 當集合中的一個向量可以用被定義為集合中其他向量的線性組合時,則則稱此集合的向量為線性相關 +線性相關 - 當集合中的一個向量可以用被定義為集合中其他向量的線性組合時, 則則稱此集合的向量為線性相關
40. **Remark: if no vector can be written this way, then the vectors are said to be linearly independent** ⟶ -注意:如果沒有向量可以如上表示時,則稱此集合的向量彼此為線性獨立 +注意:如果沒有向量可以如上表示時, 則稱此集合的向量彼此為線性獨立
41. **Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.** ⟶ -矩陣的秩 - 一個矩陣 A 的秩記作 rank(A),指的是其列向量空間所產生的維度,等價於 A 的線性獨立的最大最大行向量 +矩陣的秩 - 一個矩陣 A 的秩記作 rank(A), 指的是其列向量空間所產生的維度, 等價於 A 的線性獨立的最大最大行向量
42. **Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:** ⟶ -半正定矩陣 - 當以下成立時,一個矩陣 A∈Rn×n 是半正定矩陣 (PSD),且記作A⪰0: +半正定矩陣 - 當以下成立時, 一個矩陣 A∈Rn×n 是半正定矩陣 (PSD), 且記作A⪰0:
43. **Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.** ⟶ -注意:同樣的,一個矩陣 A 是一個半正定矩陣 (PSD),且滿足所有非零向量 x,xTAx>0 時,稱之為正定矩陣,記作 A≻0 +注意:同樣的, 一個矩陣 A 是一個半正定矩陣 (PSD), 且滿足所有非零向量 x, xTAx>0 時, 稱之為正定矩陣, 記作 A≻0
44. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** ⟶ -特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,當存在一個向量 z∈Rn∖{0} 時,此向量被稱為特徵向量,λ 稱之為 A 的特徵值,且滿足: +特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n, 當存在一個向量 z∈Rn∖{0} 時, 此向量被稱為特徵向量, λ 稱之為 A 的特徵值, 且滿足:
45. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** ⟶ -譜分解 - 令 A∈Rn×n,如果 A 是對稱的,則 A 可以被一個實數正交矩陣 U∈Rn×n 給對角化。令 Λ=diag(λ1,...,λn),我們得到: +譜分解 - 令 A∈Rn×n, 如果 A 是對稱的, 則 A 可以被一個實數正交矩陣 U∈Rn×n 給對角化。令 Λ=diag(λ1,...,λn), 我們得到:
46. **diagonal** @@ -277,7 +277,7 @@ 47. **Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:** ⟶ -奇異值分解 - 對於給定維度為 mxn 的矩陣 A,其奇異值分解指的是一種因子分解技巧,保證存在 mxm 的單式矩陣 U、對角線矩陣 Σ m×n 和 nxn 的單式矩陣 V,滿足: +奇異值分解 - 對於給定維度為 mxn 的矩陣 A, 其奇異值分解指的是一種因子分解技巧, 保證存在 mxm 的單式矩陣 U、對角線矩陣 Σ m×n 和 nxn 的單式矩陣 V, 滿足:
48. **Matrix calculus** @@ -289,7 +289,7 @@ 49. **Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:** ⟶ -梯度 - 令 f:Rm×n→R 是一個函式,且 A∈Rm×n 是一個矩陣。f 相對於 A 的梯度是一個 mxn 的矩陣,記作 ∇Af(A),滿足: +梯度 - 令 f:Rm×n→R 是一個函式, 且 A∈Rm×n 是一個矩陣。f 相對於 A 的梯度是一個 mxn 的矩陣, 記作 ∇Af(A), 滿足:
50. **Remark: the gradient of f is only defined when f is a function that returns a scalar.** @@ -301,7 +301,7 @@ 51. **Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:** ⟶ -海森 - 令 f:Rn→R 是一個函式,且 x∈Rn 是一個向量,則一個 f 的海森對於向量 x 是一個 nxn 的對稱矩陣,記作 ∇2xf(x),滿足: +海森 - 令 f:Rn→R 是一個函式, 且 x∈Rn 是一個向量, 則一個 f 的海森對於向量 x 是一個 nxn 的對稱矩陣, 記作 ∇2xf(x), 滿足:
52. **Remark: the hessian of f is only defined when f is a function that returns a scalar** @@ -311,7 +311,7 @@
53. **Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:** -梯度運算 - 對於矩陣 A、B、C,下列的梯度性質值得牢牢記住: +梯度運算 - 對於矩陣 A、B、C, 下列的梯度性質值得牢牢記住: ⟶ 54. **[General notations, Definitions, Main matrices]** diff --git a/zh-tw/cs-229-probability.md b/zh-tw/cs-229-probability.md index 0db481cf5..bd4353351 100644 --- a/zh-tw/cs-229-probability.md +++ b/zh-tw/cs-229-probability.md @@ -13,25 +13,25 @@ 3. **Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.** ⟶ -樣本空間 - 一個實驗的所有可能結果的集合稱之為這個實驗的樣本空間,記做 S +樣本空間 - 一個實驗的所有可能結果的集合稱之為這個實驗的樣本空間, 記做 S
4. **Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.** ⟶ -事件 - 樣本空間的任何子集合 E 被稱之為一個事件。也就是說,一個事件是實驗的可能結果的集合。如果該實驗的結果包含 E,我們稱我們稱 E 發生 +事件 - 樣本空間的任何子集合 E 被稱之為一個事件。也就是說, 一個事件是實驗的可能結果的集合。如果該實驗的結果包含 E, 我們稱我們稱 E 發生
5. **Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.** ⟶ -機率公理。對於每個事件 E,我們用 P(E) 表示事件 E 發生的機率 +機率公理。對於每個事件 E, 我們用 P(E) 表示事件 E 發生的機率
6. **Axiom 1 ― Every probability is between 0 and 1 included, i.e:** ⟶ -公理 1 - 每一個機率值介於 0 到 1 之間,包含兩端點。即: +公理 1 - 每一個機率值介於 0 到 1 之間, 包含兩端點。即:
7. **Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:** @@ -43,25 +43,25 @@ 8. **Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:** ⟶ -公理 3 - 對於任何互斥的事件 E1,...,En,我們定義如下: +公理 3 - 對於任何互斥的事件 E1,...,En, 我們定義如下:
9. **Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:** ⟶ -排列 - 排列指的是從 n 個相異的物件中,取出 r 個物件按照固定順序重新安排,這樣安排的數量用 P(n,r) 來表示,定義為: +排列 - 排列指的是從 n 個相異的物件中, 取出 r 個物件按照固定順序重新安排, 這樣安排的數量用 P(n,r) 來表示, 定義為:
10. **Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:** ⟶ -組合 - 組合指的是從 n 個物件中,取出 r 個物件,但不考慮他的順序。這樣組合要考慮的數量用 C(n,r) 來表示,定義為: +組合 - 組合指的是從 n 個物件中, 取出 r 個物件, 但不考慮他的順序。這樣組合要考慮的數量用 C(n,r) 來表示, 定義為:
11. **Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)** ⟶ -注意:對於 0⩽r⩽n,我們會有 P(n,r)⩾C(n,r) +注意:對於 0⩽r⩽n, 我們會有 P(n,r)⩾C(n,r)
12. **Conditional Probability** @@ -73,7 +73,7 @@ 13. **Bayes' rule ― For events A and B such that P(B)>0, we have:** ⟶ -貝氏定理 - 對於事件 A 和 B 滿足 P(B)>0 時,我們定義如下: +貝氏定理 - 對於事件 A 和 B 滿足 P(B)>0 時, 我們定義如下:
14. **Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)** @@ -85,25 +85,25 @@ 15. **Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:** ⟶ -分割 - 令 {Ai,i∈[[1,n]]} 對所有的 i,Ai≠∅,我們說 {Ai} 是一個分割,當底下成立時: +分割 - 令 {Ai,i∈[[1,n]]} 對所有的 i, Ai≠∅, 我們說 {Ai} 是一個分割, 當底下成立時:
16. **Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).** ⟶ -注意:對於任何在樣本空間的事件 B 來說,P(B)=n∑i=1P(B|Ai)P(Ai) +注意:對於任何在樣本空間的事件 B 來說, P(B)=n∑i=1P(B|Ai)P(Ai)
17. **Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:** ⟶ -貝氏定理的擴展 - 令 {Ai,i∈[[1,n]]} 為樣本空間的一個分割,我們定義: +貝氏定理的擴展 - 令 {Ai,i∈[[1,n]]} 為樣本空間的一個分割, 我們定義:
18. **Independence ― Two events A and B are independent if and only if we have:** ⟶ -獨立 - 當以下條件滿足時,兩個事件 A 和 B 為獨立事件: +獨立 - 當以下條件滿足時, 兩個事件 A 和 B 為獨立事件:
19. **Random Variables** @@ -121,13 +121,13 @@ 21. **Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.** ⟶ -隨機變數 - 一個隨機變數 X,它是一個將樣本空間中的每個元素映射到實數域的函數 +隨機變數 - 一個隨機變數 X, 它是一個將樣本空間中的每個元素映射到實數域的函數
22. **Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:** ⟶ -累積分佈函數 (CDF) - 累積分佈函數 F 是單調遞增的函數,其 limx→−∞F(x)=0 且 limx→+∞F(x)=1,定義如下: +累積分佈函數 (CDF) - 累積分佈函數 F 是單調遞增的函數, 其 limx→−∞F(x)=0 且 limx→+∞F(x)=1, 定義如下:
23. **Remark: we have P(a 29. **Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:** ⟶ -標準差 - 一個隨機變數的標準差通常表示為 σ,用來衡量一個分佈離散程度的指標,其單位和實際的隨機變數相容,表示如下: +標準差 - 一個隨機變數的標準差通常表示為 σ, 用來衡量一個分佈離散程度的指標, 其單位和實際的隨機變數相容, 表示如下:
30. **Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:** ⟶ -隨機變數的轉換 - 令變數 X 和 Y 由某個函式連結在一起。我們定義 fX 和 fY 是 X 和 Y 的分佈函式,可以得到: +隨機變數的轉換 - 令變數 X 和 Y 由某個函式連結在一起。我們定義 fX 和 fY 是 X 和 Y 的分佈函式, 可以得到:
31. **Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:** ⟶ -萊布尼茲積分法則 - 令 g 為 x 和 c 的函數,a 和 b 是依賴於 c 的的邊界,我們得到: +萊布尼茲積分法則 - 令 g 為 x 和 c 的函數, a 和 b 是依賴於 c 的的邊界, 我們得到:
32. **Probability Distributions** @@ -193,7 +193,7 @@ 33. **Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:** ⟶ -柴比雪夫不等式 - 令 X 是一隨機變數,期望值為 μ。對於 k, σ>0,我們有以下不等式: +柴比雪夫不等式 - 令 X 是一隨機變數, 期望值為 μ。對於 k, σ>0, 我們有以下不等式:
34. **Main distributions ― Here are the main distributions to have in mind:** @@ -229,13 +229,13 @@ 39. **Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:** ⟶ -條件密度 - X 對於 Y 的條件密度,通常用 fX|Y 表示如下: +條件密度 - X 對於 Y 的條件密度, 通常用 fX|Y 表示如下:
40. **Independence ― Two random variables X and Y are said to be independent if we have:** ⟶ -獨立 - 當滿足以下條件時,我們稱隨機變數 X 和 Y 互相獨立: +獨立 - 當滿足以下條件時, 我們稱隨機變數 X 和 Y 互相獨立:
41. **Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:** @@ -247,19 +247,19 @@ 42. **Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:** ⟶ -相關性 - 我們定義 σX、σY 為 X 和 Y 的標準差,而 X 和 Y 的相關係數 ρXY 定義如下: +相關性 - 我們定義 σX、σY 為 X 和 Y 的標準差, 而 X 和 Y 的相關係數 ρXY 定義如下:
43. **Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].** ⟶ -注意一:對於任何隨機變數 X 和 Y 來說,ρXY∈[−1,1] 成立 +注意一:對於任何隨機變數 X 和 Y 來說, ρXY∈[−1,1] 成立
44. **Remark 2: If X and Y are independent, then ρXY=0.** ⟶ -注意二:當 X 和 Y 獨立時,ρXY=0 +注意二:當 X 和 Y 獨立時, ρXY=0
45. **Parameter estimation** @@ -283,7 +283,7 @@ 48. **Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.** ⟶ -估計量 - 估計量是一個資料的函數,用來推斷在統計模型中未知參數的值 +估計量 - 估計量是一個資料的函數, 用來推斷在統計模型中未知參數的值
49. **Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:** @@ -295,7 +295,7 @@ 50. **Remark: an estimator is said to be unbiased when we have E[^θ]=θ.** ⟶ -注意:當 E[^θ]=θ 時,我們稱為不偏估計量 +注意:當 E[^θ]=θ 時, 我們稱為不偏估計量
51. **Estimating the mean** @@ -307,19 +307,19 @@ 52. **Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯X and is defined as follows:** ⟶ -樣本平均 - 一個隨機樣本的樣本平均是用來預估一個分佈的真實平均 μ,通常我們用 ¯X 來表示,定義如下: +樣本平均 - 一個隨機樣本的樣本平均是用來預估一個分佈的真實平均 μ, 通常我們用 ¯X 來表示, 定義如下:
53. **Remark: the sample mean is unbiased, i.e E[¯X]=μ.** ⟶ -注意:當 E[¯X]=μ 時,則為不偏樣本平均 +注意:當 E[¯X]=μ 時, 則為不偏樣本平均
54. **Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:** ⟶ -中央極限定理 - 當我們有一個隨機樣本 X1,...,Xn 滿足一個給定的分佈,其平均數為 μ,變異數為 σ2,我們有: +中央極限定理 - 當我們有一個隨機樣本 X1,...,Xn 滿足一個給定的分佈, 其平均數為 μ, 變異數為 σ2, 我們有:
55. **Estimating the variance** @@ -331,19 +331,19 @@ 56. **Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:** ⟶ -樣本變異數 - 一個隨機樣本的樣本變異數是用來估計一個分佈的真實變異數 σ2,通常使用 s2 或 ^σ2 來表示,定義如下: +樣本變異數 - 一個隨機樣本的樣本變異數是用來估計一個分佈的真實變異數 σ2, 通常使用 s2 或 ^σ2 來表示, 定義如下:
57. **Remark: the sample variance is unbiased, i.e E[s2]=σ2.** ⟶ -注意:當 E[s2]=σ2 時,稱之為不偏樣本變異數 +注意:當 E[s2]=σ2 時, 稱之為不偏樣本變異數
58. **Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:** ⟶ -與樣本變異數的卡方關聯 - 令 s2 是一個隨機樣本的樣本變異數,我們可以得到: +與樣本變異數的卡方關聯 - 令 s2 是一個隨機樣本的樣本變異數, 我們可以得到:
**59. [Introduction, Sample space, Event, Permutation]** diff --git a/zh-tw/cs-229-supervised-learning.md b/zh-tw/cs-229-supervised-learning.md index 0b329e8db..28c064279 100644 --- a/zh-tw/cs-229-supervised-learning.md +++ b/zh-tw/cs-229-supervised-learning.md @@ -8,11 +8,11 @@ 3. **Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.** -⟶ 給定一組資料點 {x(1),...,x(m)},以及對應的一組輸出 {y(1),...,y(m)},我們希望建立一個分類器,用來學習如何從 x 來預測 y +⟶ 給定一組資料點 {x(1),...,x(m)}, 以及對應的一組輸出 {y(1),...,y(m)}, 我們希望建立一個分類器, 用來學習如何從 x 來預測 y 4. **Type of prediction ― The different types of predictive models are summed up in the table below:** -⟶ 預測的種類 - 根據預測的種類不同,我們將預測模型分為底下幾種: +⟶ 預測的種類 - 根據預測的種類不同, 我們將預測模型分為底下幾種: 5. **[Regression, Classifier, Outcome, Examples]** @@ -32,7 +32,7 @@ 9. **[Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]** -⟶ [直接估計 P(y|x), 先估計 P(x|y),然後推論出 P(y|x), 決策分界線, 資料的機率分佈, 迴歸, 支援向量機 (SVM), 高斯判別分析 (GDA), 單純貝氏 (Naive Bayes)] +⟶ [直接估計 P(y|x), 先估計 P(x|y), 然後推論出 P(y|x), 決策分界線, 資料的機率分佈, 迴歸, 支援向量機 (SVM), 高斯判別分析 (GDA), 單純貝氏 (Naive Bayes)] 10. **Notations and general concepts** @@ -40,11 +40,11 @@ 11. **Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).** -⟶ 假設 - 我們使用 hθ 來代表所選擇的模型,對於給定的輸入資料 x(i),模型預測的輸出是 hθ(x(i)) +⟶ 假設 - 我們使用 hθ 來代表所選擇的模型, 對於給定的輸入資料 x(i), 模型預測的輸出是 hθ(x(i)) 12. **Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:** -⟶ 損失函數 - 損失函數是一個函數 L:(z,y)∈R×Y⟼L(z,y)∈R, +⟶ 損失函數 - 損失函數是一個函數 L:(z,y)∈R×Y⟼L(z,y)∈R, 目的在於計算預測值 z 和實際值 y 之間的差距。底下是一些常見的損失函數: 13. **[Least squared error, Logistic loss, Hinge loss, Cross-entropy]** @@ -57,11 +57,11 @@ 15. **Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:** -⟶ 代價函數 - 代價函數 J 通常用來評估一個模型的表現,它可以透過損失函數 L 來定義: +⟶ 代價函數 - 代價函數 J 通常用來評估一個模型的表現, 它可以透過損失函數 L 來定義: 16. **Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:** -⟶ 梯度下降 - 使用 α∈R 表示學習速率,我們透過學習速率和代價函數來使用梯度下降的方法找出網路參數更新的方法可以表示為: +⟶ 梯度下降 - 使用 α∈R 表示學習速率, 我們透過學習速率和代價函數來使用梯度下降的方法找出網路參數更新的方法可以表示為: 17. **Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.** @@ -69,15 +69,15 @@ 18. **Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:** -⟶ 概似估計 - 在給定參數 θ 的條件下,一個模型 L(θ) 的概似估計的目的是透過最大概似估計法來找到最佳的參數。實務上,我們會使用對數概似估計函數 (log-likelihood) ℓ(θ)=log(L(θ)),會比較容易最佳化。如下: +⟶ 概似估計 - 在給定參數 θ 的條件下, 一個模型 L(θ) 的概似估計的目的是透過最大概似估計法來找到最佳的參數。實務上, 我們會使用對數概似估計函數 (log-likelihood) ℓ(θ)=log(L(θ)), 會比較容易最佳化。如下: 19. **Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:** -⟶ 牛頓演算法 - 牛頓演算法是一個數值方法,目的在於找到一個 θ,讓 ℓ′(θ)=0。其更新的規則為: +⟶ 牛頓演算法 - 牛頓演算法是一個數值方法, 目的在於找到一個 θ, 讓 ℓ′(θ)=0。其更新的規則為: 20. 
**Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:** -⟶ 注意:多維度正規化的方法,或又被稱之為牛頓-拉弗森 (Newton-Raphson) 演算法,是透過以下的規則更新: +⟶ 注意:多維度正規化的方法, 或又被稱之為牛頓-拉弗森 (Newton-Raphson) 演算法, 是透過以下的規則更新: 21. **Linear models** @@ -93,11 +93,11 @@ 24. **Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:** -⟶ 正規方程法 - 我們使用 X 代表矩陣,讓代價函數最小的 θ 值有一個封閉解,如下: +⟶ 正規方程法 - 我們使用 X 代表矩陣, 讓代價函數最小的 θ 值有一個封閉解, 如下: 25. **LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:** -⟶ 最小均方演算法 (LMS) - 我們使用 α 表示學習速率,針對 m 個訓練資料,透過最小均方演算法的更新規則,或是叫做 Widrow-Hoff 學習法如下: +⟶ 最小均方演算法 (LMS) - 我們使用 α 表示學習速率, 針對 m 個訓練資料, 透過最小均方演算法的更新規則, 或是叫做 Widrow-Hoff 學習法如下: 26. **Remark: the update rule is a particular case of the gradient ascent.** @@ -105,7 +105,7 @@ 27. **LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:** -⟶ 局部加權迴歸 ,又稱為 LWR,是線性洄歸的變形,通過w(i)(x) 對其成本函數中的每個訓練樣本進行加權,其中參數 τ∈R 定義為: +⟶ 局部加權迴歸 , 又稱為 LWR, 是線性洄歸的變形, 通過w(i)(x) 對其成本函數中的每個訓練樣本進行加權, 其中參數 τ∈R 定義為: 28. **Classification and logistic regression** @@ -113,19 +113,19 @@ 29. **Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:** -⟶ Sigmoid 函數 - Sigmoid 函數 g,也可以稱為邏輯函數定義如下: +⟶ Sigmoid 函數 - Sigmoid 函數 g, 也可以稱為邏輯函數定義如下: 30. **Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:** -⟶ 邏輯迴歸 - 我們假設 y|x;θ∼Bernoulli(ϕ),請參考以下: +⟶ 邏輯迴歸 - 我們假設 y|x;θ∼Bernoulli(ϕ), 請參考以下: 31. **Remark: there is no closed form solution for the case of logistic regressions.** -⟶ 注意:對於這種情況的邏輯迴歸,並沒有一個封閉解 +⟶ 注意:對於這種情況的邏輯迴歸, 並沒有一個封閉解 32. **Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:** -⟶ Softmax 迴歸 - Softmax 迴歸又稱做多分類邏輯迴歸,目的是用在超過兩個以上的分類時的迴歸使用。按照慣例,我們設定 θK=0,讓每一個類別的 Bernoulli 參數 ϕi 等同於: +⟶ Softmax 迴歸 - Softmax 迴歸又稱做多分類邏輯迴歸, 目的是用在超過兩個以上的分類時的迴歸使用。按照慣例, 我們設定 θK=0, 讓每一個類別的 Bernoulli 參數 ϕi 等同於: 33. **Generalized Linear Models** @@ -133,11 +133,11 @@ 34. **Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:** -⟶ 指數族分佈 - 一個分佈如果可以透過自然參數 (或稱之為正準參數或連結函數) η、充分統計量 T(y) 和對數區分函數 (log-partition function) a(η) 來表示時,我們就稱這個分佈是屬於指數族分佈。該分佈可以表示如下: +⟶ 指數族分佈 - 一個分佈如果可以透過自然參數 (或稱之為正準參數或連結函數) η、充分統計量 T(y) 和對數區分函數 (log-partition function) a(η) 來表示時, 我們就稱這個分佈是屬於指數族分佈。該分佈可以表示如下: 35. **Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.** -⟶ 注意:我們經常讓 T(y)=y,同時,exp(−a(η)) 可以看成是一個正規化的參數,目的在於讓機率總和為一。 +⟶ 注意:我們經常讓 T(y)=y, 同時, exp(−a(η)) 可以看成是一個正規化的參數, 目的在於讓機率總和為一。 36. **Here are the most common exponential distributions summed up in the following table:** @@ -149,7 +149,7 @@ 38. 
**Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:** -⟶ 廣義線性模型的假設 - 廣義線性模型 (GLM) 的目的在於,給定 x∈Rn+1,要預測隨機變數 y,同時它依賴底下三個假設: +⟶ 廣義線性模型的假設 - 廣義線性模型 (GLM) 的目的在於, 給定 x∈Rn+1, 要預測隨機變數 y, 同時它依賴底下三個假設: 39. **Remark: ordinary least squares and logistic regression are special cases of generalized linear models.** @@ -169,7 +169,7 @@ 43. **where (w,b)∈Rn×R is the solution of the following optimization problem:** -⟶ 其中,(w,b)∈Rn×R 是底下最佳化問題的答案: +⟶ 其中, (w,b)∈Rn×R 是底下最佳化問題的答案: 44. **such that** @@ -185,15 +185,15 @@ 47. **Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:** -⟶ Hinge 損失函數 - Hinge 損失函數用在支援向量機上,定義如下: +⟶ Hinge 損失函數 - Hinge 損失函數用在支援向量機上, 定義如下: 48. **Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:** -⟶ 核(函數) - 給定特徵轉換 ϕ,我們定義核(函數) K 為: +⟶ 核(函數) - 給定特徵轉換 ϕ, 我們定義核(函數) K 為: 49. **In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.** -⟶ 實務上,K(x,z)=exp(−||x−z||22σ2) 定義的核(函數) K,一般稱作高斯核(函數)。這種核(函數)經常被使用 +⟶ 實務上, K(x,z)=exp(−||x−z||22σ2) 定義的核(函數) K, 一般稱作高斯核(函數)。這種核(函數)經常被使用 50. **[Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]** @@ -201,7 +201,7 @@ 51. **Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.** -⟶ 注意:我們使用 "核(函數)技巧" 來計算代價函數時,不需要真正的知道映射函數 ϕ,這個函數非常複雜。相反的,我們只需要知道 K(x,z) 的值即可。 +⟶ 注意:我們使用 "核(函數)技巧" 來計算代價函數時, 不需要真正的知道映射函數 ϕ, 這個函數非常複雜。相反的, 我們只需要知道 K(x,z) 的值即可。 52. **Lagrangian ― We define the Lagrangian L(w,b) as follows:** @@ -217,7 +217,7 @@ 55. **A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.** -⟶ 生成模型嘗試透過預估 P(x|y) 來學習資料如何生成,而我們可以透過貝氏定理來預估 P(y|x) +⟶ 生成模型嘗試透過預估 P(x|y) 來學習資料如何生成, 而我們可以透過貝氏定理來預估 P(y|x) 56. **Gaussian Discriminant Analysis** @@ -241,7 +241,7 @@ 61. **Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]** -⟶ 解決方法 - 最大化對數概似估計來給出以下解答,k∈{0,1},l∈[[1,L]] +⟶ 解決方法 - 最大化對數概似估計來給出以下解答, k∈{0,1},l∈[[1,L]] 62. **Remark: Naive Bayes is widely used for text classification and spam detection.** @@ -257,11 +257,11 @@ 65. **CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.** -⟶ CART - 分類與迴歸樹 (CART),通常稱之為決策數,可以被表示為二元樹。它的優點是具有可解釋性。 +⟶ CART - 分類與迴歸樹 (CART), 通常稱之為決策數, 可以被表示為二元樹。它的優點是具有可解釋性。 66. **Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.** -⟶ 隨機森林 - 這是一個基於樹狀結構的方法,它使用大量經由隨機挑選的特徵所建構的決策樹。與單純的決策樹不同,它通常具有高度不可解釋性,但它的效能通常很好,所以是一個相當流行的演算法。 +⟶ 隨機森林 - 這是一個基於樹狀結構的方法, 它使用大量經由隨機挑選的特徵所建構的決策樹。與單純的決策樹不同, 它通常具有高度不可解釋性, 但它的效能通常很好, 所以是一個相當流行的演算法。 67. **Remark: random forests are a type of ensemble methods.** @@ -277,7 +277,7 @@ 70. **High weights are put on errors to improve at the next boosting step** -⟶ 在下一輪的提升步驟中,錯誤的部分會被賦予較高的權重 +⟶ 在下一輪的提升步驟中, 錯誤的部分會被賦予較高的權重 71. **Weak learners trained on remaining errors** @@ -289,11 +289,11 @@ 73. 
**k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.** -⟶ k-最近鄰 - k-最近鄰演算法,又稱之為 k-NN,是一個非參數的方法,其中資料點的決定是透過訓練集中最近的 k 個鄰居而決定。它可以用在分類和迴歸問題上。 +⟶ k-最近鄰 - k-最近鄰演算法, 又稱之為 k-NN, 是一個非參數的方法, 其中資料點的決定是透過訓練集中最近的 k 個鄰居而決定。它可以用在分類和迴歸問題上。 74. **Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.** -⟶ 注意:參數 k 的值越大,偏差越大。k 的值越小,變異越大。 +⟶ 注意:參數 k 的值越大, 偏差越大。k 的值越小, 變異越大。 75. **Learning Theory** @@ -301,11 +301,11 @@ 76. **Union bound ― Let A1,...,Ak be k events. We have:** -⟶ 聯集上界 - 令 A1,...,Ak 為 k 個事件,我們有: +⟶ 聯集上界 - 令 A1,...,Ak 為 k 個事件, 我們有: 77. **Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:** -⟶ 霍夫丁不等式 - 令 Z1,..,Zm 為 m 個從參數 ϕ 的白努利分佈中抽出的獨立同分佈 (iid) 的變數。令 ˆϕ 為其樣本平均、固定 γ>0,我們可以得到: +⟶ 霍夫丁不等式 - 令 Z1,..,Zm 為 m 個從參數 ϕ 的白努利分佈中抽出的獨立同分佈 (iid) 的變數。令 ˆϕ 為其樣本平均、固定 γ>0, 我們可以得到: 78. **Remark: this inequality is also known as the Chernoff bound.** @@ -313,11 +313,11 @@ 79. **Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:** -⟶ 訓練誤差 - 對於一個分類器 h,我們定義訓練誤差為 ˆϵ(h),也可以稱為經驗風險或經驗誤差。定義如下: +⟶ 訓練誤差 - 對於一個分類器 h, 我們定義訓練誤差為 ˆϵ(h), 也可以稱為經驗風險或經驗誤差。定義如下: 80. **Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions: ** -⟶ 可能近似正確 (PAC) - PAC 是一個框架,有許多學習理論都證明其有效性。它包含以下假設: +⟶ 可能近似正確 (PAC) - PAC 是一個框架, 有許多學習理論都證明其有效性。它包含以下假設: 81: **the training and testing sets follow the same distribution** @@ -329,11 +329,11 @@ 83. **Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:** -⟶ 打散 (Shattering) - 給定一個集合 S={x(1),...,x(d)} 以及一組分類器的集合 H,如果對於任何一組標籤 {y(1),...,y(d)},H 都能打散 S,定義如下: +⟶ 打散 (Shattering) - 給定一個集合 S={x(1),...,x(d)} 以及一組分類器的集合 H, 如果對於任何一組標籤 {y(1),...,y(d)}, H 都能打散 S, 定義如下: 84. **Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:** -⟶ 上限定理 - 令 H 是一個有限假設類別,使 |H|=k 且令 δ 和樣本大小 m 固定,結著,在機率至少為 1−δ 的情況下,我們得到: +⟶ 上限定理 - 令 H 是一個有限假設類別, 使 |H|=k 且令 δ 和樣本大小 m 固定, 結著, 在機率至少為 1−δ 的情況下, 我們得到: 85. **VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.** @@ -345,7 +345,7 @@ 87. **Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:** -⟶ 理論 (Vapnik) - 令 H 已給定,VC(H)=d 且 m 是訓練資料級的數量,在機率至少為 1−δ 的情況下,我們得到: +⟶ 理論 (Vapnik) - 令 H 已給定, VC(H)=d 且 m 是訓練資料級的數量, 在機率至少為 1−δ 的情況下, 我們得到: 88. **Known as Adaboost** diff --git a/zh-tw/cs-229-unsupervised-learning.md b/zh-tw/cs-229-unsupervised-learning.md index 0f6d5ee34..54044637a 100644 --- a/zh-tw/cs-229-unsupervised-learning.md +++ b/zh-tw/cs-229-unsupervised-learning.md @@ -19,7 +19,7 @@ 4. **Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:** ⟶ -Jensen's 不等式 - 令 f 為一個凸函數、X 為一個隨機變數,我們可以得到底下這個不等式: +Jensen's 不等式 - 令 f 為一個凸函數、X 為一個隨機變數, 我們可以得到底下這個不等式:
5. **Clustering** @@ -37,7 +37,7 @@ Jensen's 不等式 - 令 f 為一個凸函數、X 為一個隨機變數,我們 7. **Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:** ⟶ -潛在變數 (Latent variables) - 潛在變數指的是隱藏/沒有觀察到的變數,這會讓問題的估計變得困難,我們通常使用 z 來代表它。底下是潛在變數的常見設定: +潛在變數 (Latent variables) - 潛在變數指的是隱藏/沒有觀察到的變數, 這會讓問題的估計變得困難, 我們通常使用 z 來代表它。底下是潛在變數的常見設定:
8. **[Setting, Latent variable z, Comments]** @@ -61,13 +61,13 @@ Jensen's 不等式 - 令 f 為一個凸函數、X 為一個隨機變數,我們 11. **E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:** ⟶ -E-step: 評估後驗機率 Qi(z(i)),其中每個資料點 x(i) 來自於一個特定的群集 z(i),如下: +E-step: 評估後驗機率 Qi(z(i)), 其中每個資料點 x(i) 來自於一個特定的群集 z(i), 如下:
12. **M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:** ⟶ -M-step: 使用後驗機率 Qi(z(i)) 作為資料點 x(i) 在群集中特定的權重,用來分別重新估計每個群集,如下: +M-step: 使用後驗機率 Qi(z(i)) 作為資料點 x(i) 在群集中特定的權重, 用來分別重新估計每個群集, 如下:
13. **[Gaussians initialization, Expectation step, Maximization step, Convergence]** @@ -85,13 +85,13 @@ k-means 分群法 15. **We note c(i) the cluster of data point i and μj the center of cluster j.** ⟶ -我們使用 c(i) 表示資料 i 屬於某群,而 μj 則是群 j 的中心 +我們使用 c(i) 表示資料 i 屬於某群, 而 μj 則是群 j 的中心
16. **Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:** ⟶ -演算法 - 在隨機初始化群集中心點 μ1,μ2,...,μk∈Rn 後,k-means 演算法重複以下步驟直到收斂: +演算法 - 在隨機初始化群集中心點 μ1,μ2,...,μk∈Rn 後, k-means 演算法重複以下步驟直到收斂:
17. **[Means initialization, Cluster assignment, Means update, Convergence]** @@ -103,7 +103,7 @@ k-means 分群法 18. **Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:** ⟶ -畸變函數 - 為了確認演算法是否收斂,我們定義以下的畸變函數: +畸變函數 - 為了確認演算法是否收斂, 我們定義以下的畸變函數:
19. **Hierarchical clustering** @@ -115,13 +115,13 @@ k-means 分群法 20. **Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.** ⟶ -演算法 - 階層式分群法是透過一種階層架構的方式,將資料建立為一種連續層狀結構的形式。 +演算法 - 階層式分群法是透過一種階層架構的方式, 將資料建立為一種連續層狀結構的形式。
21. **Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:** ⟶ -類型 - 底下是幾種不同類型的階層式分群法,差別在於要最佳化的目標函式的不同,請參考底下: +類型 - 底下是幾種不同類型的階層式分群法, 差別在於要最佳化的目標函式的不同, 請參考底下:
22. **[Ward linkage, Average linkage, Complete linkage]** @@ -145,25 +145,25 @@ k-means 分群法 25. **In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.** ⟶ -在非監督式學習中,通常很難去評估一個模型的好壞,因為我們沒有擁有像在監督式學習任務中正確答案的標籤 +在非監督式學習中, 通常很難去評估一個模型的好壞, 因為我們沒有擁有像在監督式學習任務中正確答案的標籤
26. **Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:** ⟶ -輪廓係數 (Silhouette coefficient) - 我們指定 a 為一個樣本點和相同群集中其他資料點的平均距離、b 為一個樣本點和下一個最接近群集其他資料點的平均距離,輪廓係數 s 對於此一樣本點的定義為: +輪廓係數 (Silhouette coefficient) - 我們指定 a 為一個樣本點和相同群集中其他資料點的平均距離、b 為一個樣本點和下一個最接近群集其他資料點的平均距離, 輪廓係數 s 對於此一樣本點的定義為:
27. **Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as** ⟶ -Calinski-Harabaz 指標 - 定義 k 是群集的數量,Bk 和 Wk 分別是群內和群集之間的離差矩陣 (dispersion matrices): +Calinski-Harabaz 指標 - 定義 k 是群集的數量, Bk 和 Wk 分別是群內和群集之間的離差矩陣 (dispersion matrices):
28. **the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:** ⟶ -Calinski-Harabaz 指標 s(k) 指出分群模型的好壞,此指標的值越高,代表分群模型的表現越好。定義如下: +Calinski-Harabaz 指標 s(k) 指出分群模型的好壞, 此指標的值越高, 代表分群模型的表現越好。定義如下:
29. **Dimension reduction** @@ -181,19 +181,19 @@ Calinski-Harabaz 指標 s(k) 指出分群模型的好壞,此指標的值越高 31. **It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.** ⟶ -這是一個維度縮減的技巧,在於找到投影資料的最大方差 +這是一個維度縮減的技巧, 在於找到投影資料的最大方差
32. **Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:** ⟶ -特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,我們說 λ 是 A 的特徵值,當存在一個特徵向量 z∈Rn∖{0},使得: +特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n, 我們說 λ 是 A 的特徵值, 當存在一個特徵向量 z∈Rn∖{0}, 使得:
33. **Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:** ⟶ -譜定理 - 令 A∈Rn×n,如果 A 是對稱的,則 A 可以可以透過正交矩陣 U∈Rn×n 對角化。當 Λ=diag(λ1,...,λn),我們得到: +譜定理 - 令 A∈Rn×n, 如果 A 是對稱的, 則 A 可以可以透過正交矩陣 U∈Rn×n 對角化。當 Λ=diag(λ1,...,λn), 我們得到:
34. **diagonal** @@ -211,25 +211,25 @@ Calinski-Harabaz 指標 s(k) 指出分群模型的好壞,此指標的值越高 36. **Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:** ⟶ -演算法 - 主成份分析 (PCA) 是一種維度縮減的技巧,它會透過尋找資料最大變異的方式,將資料投影在 k 維空間上: +演算法 - 主成份分析 (PCA) 是一種維度縮減的技巧, 它會透過尋找資料最大變異的方式, 將資料投影在 k 維空間上:
37. **Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.** ⟶ -第一步:正規化資料,讓資料平均為 0,變異數為 1 +第一步:正規化資料, 讓資料平均為 0, 變異數為 1
38. **Step 2: Compute Σ=1mm∑i=1x(i)x(i)T∈Rn×n, which is symmetric with real eigenvalues.** ⟶ -第二步:計算 Σ=1mm∑i=1x(i)x(i)T∈Rn×n,即對稱實際特徵值 +第二步:計算 Σ=1mm∑i=1x(i)x(i)T∈Rn×n, 即對稱實際特徵值
39. **Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.** ⟶ -第三步:計算 u1,...,uk∈Rn,k 個正交主特徵向量的總和 Σ,即是 k 個最大特徵值的正交特徵向量 +第三步:計算 u1,...,uk∈Rn, k 個正交主特徵向量的總和 Σ, 即是 k 個最大特徵值的正交特徵向量
40. **Step 4: Project the data on spanR(u1,...,uk).** @@ -265,7 +265,7 @@ Calinski-Harabaz 指標 s(k) 指出分群模型的好壞,此指標的值越高 45. **Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:** ⟶ -假設 - 我們假設資料 x 是從 n 維的來源向量 s=(s1,...,sn) 產生,si 為獨立變數,透過一個混合與非奇異矩陣 A 產生如下: +假設 - 我們假設資料 x 是從 n 維的來源向量 s=(s1,...,sn) 產生, si 為獨立變數, 透過一個混合與非奇異矩陣 A 產生如下:
46. **The goal is to find the unmixing matrix W=A−1.** @@ -289,10 +289,10 @@ Bell 和 Sejnowski 獨立成份分析演算法 - 此演算法透過以下步驟 49. **Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:** ⟶ -在給定訓練資料 {x(i),i∈[[1,m]]} 的情況下,其對數概似估計函數與定義 g 為 sigmoid 函數如下: +在給定訓練資料 {x(i),i∈[[1,m]]} 的情況下, 其對數概似估計函數與定義 g 為 sigmoid 函數如下:
50. **Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:** ⟶ -因此,梯度隨機下降學習規則對每個訓練樣本 x(i) 來說,我們透過以下方法來更新 W: +因此, 梯度隨機下降學習規則對每個訓練樣本 x(i) 來說, 我們透過以下方法來更新 W: diff --git a/zh/cs-229-supervised-learning.md b/zh/cs-229-supervised-learning.md index 4a7f4bbb9..8f5015c49 100644 --- a/zh/cs-229-supervised-learning.md +++ b/zh/cs-229-supervised-learning.md @@ -12,7 +12,7 @@ 3. **Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.** -⟶ 给定一组数据点 {x(1),...,x(m)} 和与其对应的输出 {y(1),...,y(m)} , 我们想要建立一个分类器,学习如何从 x 预测 y。 +⟶ 给定一组数据点 {x(1),...,x(m)} 和与其对应的输出 {y(1),...,y(m)} , 我们想要建立一个分类器, 学习如何从 x 预测 y。
@@ -24,13 +24,13 @@ 5. **[Regression, Classifier, Outcome, Examples]** -⟶ [回归,分类,输出,例子] +⟶ [回归, 分类, 输出, 例子]
6. **[Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]** -⟶ [连续,类,线性回归,Logistic回归,SVM,朴素贝叶斯] +⟶ [连续, 类, 线性回归, Logistic回归, SVM, 朴素贝叶斯]
@@ -42,13 +42,13 @@ 8. **[Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]** -⟶ [判别模型,生成模型,目标,所学内容,例图,示例] +⟶ [判别模型, 生成模型, 目标, 所学内容, 例图, 示例]
9. **[Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]** -⟶ [直接估计P(y|x),估计P(x|y) 然后推导 P(y|x),决策边界,数据的概率分布,回归,SVMs,GDA,朴素贝叶斯] +⟶ [直接估计P(y|x), 估计P(x|y) 然后推导 P(y|x), 决策边界, 数据的概率分布, 回归, SVMs, GDA, 朴素贝叶斯]
@@ -60,61 +60,61 @@ 11. **Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).** -⟶ 假设 - 假设我们选择的模型是hθ 。 对于给定的输入数据 x(i),模型预测输出是 hθ(x(i))。 +⟶ 假设 - 假设我们选择的模型是hθ 。 对于给定的输入数据 x(i), 模型预测输出是 hθ(x(i))。
12. **Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:** -⟶ 损失函数 - 损失函数是一个 L:(z,y)∈R×Y⟼L(z,y)∈R 的函数,其将真实数据值 y 和其预测值 z 作为输入,输出它们的不同程度。 常见的损失函数总结如下表: +⟶ 损失函数 - 损失函数是一个 L:(z,y)∈R×Y⟼L(z,y)∈R 的函数, 其将真实数据值 y 和其预测值 z 作为输入, 输出它们的不同程度。 常见的损失函数总结如下表:
13. **[Least squared error, Logistic loss, Hinge loss, Cross-entropy]** -⟶ [最小二乘误差,Logistic损失,铰链损失,交叉熵] +⟶ [最小二乘误差, Logistic损失, 铰链损失, 交叉熵]
14. **[Linear regression, Logistic regression, SVM, Neural Network]** -⟶ [线性回归,Logistic回归,SVM,神经网络] +⟶ [线性回归, Logistic回归, SVM, 神经网络]
15. **Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:** -⟶ 成本函数 - 成本函数 J 通常用于评估模型的性能,使用损失函数 L 定义如下: +⟶ 成本函数 - 成本函数 J 通常用于评估模型的性能, 使用损失函数 L 定义如下:
16. **Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:** -⟶ 梯度下降 - 记学习率为 α∈R,梯度下降的更新规则使用学习率和成本函数 J 表示如下: +⟶ 梯度下降 - 记学习率为 α∈R, 梯度下降的更新规则使用学习率和成本函数 J 表示如下:
17. **Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.** -⟶ 备注:随机梯度下降(SGD)是根据每个训练样本进行参数更新,而批量梯度下降是在一批训练样本上进行更新。 +⟶ 备注:随机梯度下降(SGD)是根据每个训练样本进行参数更新, 而批量梯度下降是在一批训练样本上进行更新。
18. **Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:** -⟶ 似然 - 给定参数 θ 的模型 L(θ)的似然性用于通过最大化似然性来找到最佳参数θ。 在实践中,我们使用更容易优化的对数似然 ℓ(θ)=log(L(θ)) 。我们有 +⟶ 似然 - 给定参数 θ 的模型 L(θ)的似然性用于通过最大化似然性来找到最佳参数θ。 在实践中, 我们使用更容易优化的对数似然 ℓ(θ)=log(L(θ)) 。我们有
19. **Newton's algorithm ― The Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:** -⟶ 牛顿算法 - 牛顿算法是一种数值方法,目的是找到一个 θ 使得 ℓ′(θ)=0. 其更新规则如下: +⟶ 牛顿算法 - 牛顿算法是一种数值方法, 目的是找到一个 θ 使得 ℓ′(θ)=0. 其更新规则如下:
20. **Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:** -⟶ 备注:多维泛化,也称为 Newton-Raphson 方法,具有以下更新规则: +⟶ 备注:多维泛化, 也称为 Newton-Raphson 方法, 具有以下更新规则:
@@ -138,13 +138,13 @@ 24. **Normal equations ― By noting X the matrix design, the value of θ that minimizes the cost function is a closed-form solution such that:** -⟶ 正规方程 - 通过设计 X 矩阵,使得最小化成本函数时 θ 有闭式解: +⟶ 正规方程 - 通过设计 X 矩阵, 使得最小化成本函数时 θ 有闭式解:
25. **LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:** -⟶ LMS算法 - 通过 α 学习率,训练集中 m 个数据的最小均方(LMS)算法的更新规则也称为Widrow-Hoff学习规则,如下 +⟶ LMS算法 - 通过 α 学习率, 训练集中 m 个数据的最小均方(LMS)算法的更新规则也称为Widrow-Hoff学习规则, 如下
@@ -156,7 +156,7 @@ 27. **LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:** -⟶ LWR - 局部加权回归,也称为LWR,是线性回归的变体,通过 w(i)(x) 对其成本函数中的每个训练样本进行加权,其中参数 τ∈R 定义为 +⟶ LWR - 局部加权回归, 也称为LWR, 是线性回归的变体, 通过 w(i)(x) 对其成本函数中的每个训练样本进行加权, 其中参数 τ∈R 定义为
@@ -168,7 +168,7 @@ 29. **Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:** -⟶ Sigmoid函数 - sigmoid 函数 g,也称为逻辑函数,定义如下: +⟶ Sigmoid函数 - sigmoid 函数 g, 也称为逻辑函数, 定义如下:
@@ -180,13 +180,13 @@ 31. **Remark: there is no closed form solution for the case of logistic regressions.** -⟶ 备注:对于逻辑回归的情况,没有闭式解。 +⟶ 备注:对于逻辑回归的情况, 没有闭式解。
32. **Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:** -⟶ Softmax回归 - 当存在超过2个结果类时,使用softmax回归(也称为多类逻辑回归)来推广逻辑回归。 按照惯例,我们设置 θK=0,使得每个类 i 的伯努利参数 ϕi 等于: +⟶ Softmax回归 - 当存在超过2个结果类时, 使用softmax回归(也称为多类逻辑回归)来推广逻辑回归。 按照惯例, 我们设置 θK=0, 使得每个类 i 的伯努利参数 ϕi 等于:
@@ -198,13 +198,13 @@ 34. **Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:** -⟶ 指数分布族 - 如果可以用自然参数 η,也称为规范参数或链接函数,充分统计量 T(y) 和对数分割函数a(η)来表示,则称一类分布在指数分布族中, 函数如下: +⟶ 指数分布族 - 如果可以用自然参数 η, 也称为规范参数或链接函数, 充分统计量 T(y) 和对数分割函数a(η)来表示, 则称一类分布在指数分布族中, 函数如下:
35. **Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.** -⟶ 备注:我们经常会有 T(y)=y。 此外,exp(−a(η)) 可以看作是归一化参数,确保概率总和为1 +⟶ 备注:我们经常会有 T(y)=y。 此外, exp(−a(η)) 可以看作是归一化参数, 确保概率总和为1
@@ -216,13 +216,13 @@ 37. **[Distribution, Bernoulli, Gaussian, Poisson, Geometric]** -⟶ [分布,伯努利,高斯,泊松,几何] +⟶ [分布, 伯努利, 高斯, 泊松, 几何]
38. **Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function fo x∈Rn+1 and rely on the following 3 assumptions:** -⟶ GLM的假设 - 广义线性模型(GLM)是旨在将随机变量 y 预测为 x∈Rn+1 的函数,并依赖于以下3个假设: +⟶ GLM的假设 - 广义线性模型(GLM)是旨在将随机变量 y 预测为 x∈Rn+1 的函数, 并依赖于以下3个假设:
@@ -276,31 +276,31 @@ 47. **Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:** -⟶ 合页损失 - 合页损失用于SVM,定义如下: +⟶ 合页损失 - 合页损失用于SVM, 定义如下:
48. **Kernel ― Given a feature mapping ϕ, we define the kernel K to be defined as:** -⟶ 核 - 给定特征映射 ϕ,我们定义核 K 为: +⟶ 核 - 给定特征映射 ϕ, 我们定义核 K 为:
49. **In practice, the kernel K defined by K(x,z)=exp(−||x−z||22σ2) is called the Gaussian kernel and is commonly used.** -⟶ 在实践中,由 K(x,z)=exp(−||x−z||22σ2) 定义的核 K 被称为高斯核,并且经常使用这种核。 +⟶ 在实践中, 由 K(x,z)=exp(−||x−z||22σ2) 定义的核 K 被称为高斯核, 并且经常使用这种核。
50. **[Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]** -⟶ [非线性可分性,核映射的使用,原始空间中的决策边界] +⟶ [非线性可分性, 核映射的使用, 原始空间中的决策边界]
51. **Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.** -⟶ 备注:我们说我们使用“核技巧”来计算使用核的成本函数,因为我们实际上不需要知道显式映射φ,通常,这非常复杂。 相反,只需要 K(x,z) 的值。 +⟶ 备注:我们说我们使用“核技巧”来计算使用核的成本函数, 因为我们实际上不需要知道显式映射φ, 通常, 这非常复杂。 相反, 只需要 K(x,z) 的值。
@@ -324,7 +324,7 @@ 55. **A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.** -⟶ 生成模型首先尝试通过估计 P(x|y) 来模仿如何生成数据,然后我们可以使用贝叶斯法则来估计 P(y|x) +⟶ 生成模型首先尝试通过估计 P(x|y) 来模仿如何生成数据, 然后我们可以使用贝叶斯法则来估计 P(y|x)
@@ -360,7 +360,7 @@ 61. **Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]** -⟶ 解决方案 - 最大化对数似然给出以下解,k∈{0,1},l∈[[1,L]] +⟶ 解决方案 - 最大化对数似然给出以下解, k∈{0,1}, l∈[[1,L]]
@@ -384,13 +384,13 @@ 65. **CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.** -⟶ CART - 分类和回归树(CART),通常称为决策树,可以表示为二叉树。它们具有可解释性的优点。 +⟶ CART - 分类和回归树(CART), 通常称为决策树, 可以表示为二叉树。它们具有可解释性的优点。
66. **Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.** -⟶ 随机森林 - 这是一种基于树模型的技术,它使用大量的由随机选择的特征集构建的决策树。 与简单的决策树相反,它是高度无法解释的,但其普遍良好的表现使其成为一种流行的算法。 +⟶ 随机森林 - 这是一种基于树模型的技术, 它使用大量的由随机选择的特征集构建的决策树。 与简单的决策树相反, 它是高度无法解释的, 但其普遍良好的表现使其成为一种流行的算法。
@@ -408,13 +408,13 @@ 69. **[Adaptive boosting, Gradient boosting]** -⟶ [自适应增强, 梯度提升] +⟶ [自适应增强, 梯度提升]
70. **High weights are put on errors to improve at the next boosting step** -⟶ 在下一轮提升步骤中,错误的会被置于高权重 +⟶ 在下一轮提升步骤中, 错误的会被置于高权重
@@ -432,13 +432,13 @@ 73. **k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.** -⟶ k-最近邻 - k-最近邻算法,通常称为k-NN,是一种非参数方法,其中数据点的判决由来自训练集中与其相邻的k个数据的性质确定。 它可以用于分类和回归。 +⟶ k-最近邻 - k-最近邻算法, 通常称为k-NN, 是一种非参数方法, 其中数据点的判决由来自训练集中与其相邻的k个数据的性质确定。 它可以用于分类和回归。
74. **Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.** -⟶ 备注:参数 k 越高,偏差越大,参数 k 越低,方差越大。 +⟶ 备注:参数 k 越高, 偏差越大, 参数 k 越低, 方差越大。
@@ -450,13 +450,13 @@ 76. **Union bound ― Let A1,...,Ak be k events. We have:** -⟶ 联盟 - 让A1,…,Ak 成为 k 个事件。 我们有: +⟶ 联盟 - 让A1, …, Ak 成为 k 个事件。 我们有:
77. **Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:** -⟶ Hoeffding不等式 - 设Z1,...,Zm是从参数 φ 的伯努利分布中提取的 m iid 变量。 设 φ 为其样本均值,固定 γ> 0。 我们有: +⟶ Hoeffding不等式 - 设Z1, ..., Zm是从参数 φ 的伯努利分布中提取的 m iid 变量。 设 φ 为其样本均值, 固定 γ> 0。 我们有:
@@ -468,13 +468,13 @@ 79. **Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:** -⟶ 训练误差 - 对于给定的分类器 h,我们定义训练误差 ϵ(h),也称为经验风险或经验误差,如下: +⟶ 训练误差 - 对于给定的分类器 h, 我们定义训练误差 ϵ(h), 也称为经验风险或经验误差, 如下:
80. **Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:** -⟶ 可能近似正确 (PAC) - PAC是一个框架,在该框架下证明了许多学习理论的结果,并具有以下假设: +⟶ 可能近似正确 (PAC) - PAC是一个框架, 在该框架下证明了许多学习理论的结果, 并具有以下假设:
@@ -492,19 +492,19 @@ 83. **Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:** -⟶ 打散 - 给定一个集合 S={x(1),...,x(d)} 和一组分类器 H,如果对于任意一组标签 {y(1),...,y(d)} 都能对分,我们称 H 打散 S ,我们有: +⟶ 打散 - 给定一个集合 S={x(1),...,x(d)} 和一组分类器 H, 如果对于任意一组标签 {y(1),...,y(d)} 都能对分, 我们称 H 打散 S , 我们有:
84. **Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:** -⟶ 上限定理 - 设 H 是有限假设类,使得 |H|=k 并且使 δ 和样本大小 m 固定。 然后,在概率至少为 1-δ 的情况下,我们得到: +⟶ 上限定理 - 设 H 是有限假设类, 使得 |H|=k 并且使 δ 和样本大小 m 固定。 然后, 在概率至少为 1-δ 的情况下, 我们得到:
85. **VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.** -⟶ VC维 - 给定无限假设类 H 的 Vapnik-Chervonenkis(VC) 维,注意 VC(H) 是由 H 打散的最大集合的大小。 +⟶ VC维 - 给定无限假设类 H 的 Vapnik-Chervonenkis(VC) 维, 注意 VC(H) 是由 H 打散的最大集合的大小。
@@ -516,52 +516,52 @@ 87. **Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:** -⟶ 定理 (Vapnik) - 设H,VC(H)=d ,m 为训练样本数。 概率至少为 1-δ,我们有: +⟶ 定理 (Vapnik) - 设H, VC(H)=d , m 为训练样本数。 概率至少为 1-δ, 我们有:
88. **[Introduction, Type of prediction, Type of model]** -⟶ [简介,预测类型,模型类型] +⟶ [简介, 预测类型, 模型类型]
89. **[Notations and general concepts, loss function, gradient descent, likelihood]** -⟶ [符号和一般概念,损失函数,梯度下降,似然] +⟶ [符号和一般概念, 损失函数, 梯度下降, 似然]
90. **[Linear models, linear regression, logistic regression, generalized linear models]** -⟶ [线性模型,线性回归,逻辑回归,广义线性模型] +⟶ [线性模型, 线性回归, 逻辑回归, 广义线性模型]
91. **[Support vector machines, Optimal margin classifier, Hinge loss, Kernel]** -⟶ [支持向量机,最优间隔分类器,合页损失,核] +⟶ [支持向量机, 最优间隔分类器, 合页损失, 核]
92. **[Generative learning, Gaussian Discriminant Analysis, Naive Bayes]** -⟶ [生成学习,高斯判别分析,朴素贝叶斯] +⟶ [生成学习, 高斯判别分析, 朴素贝叶斯]
93. **[Trees and ensemble methods, CART, Random forest, Boosting]** -⟶ 树和集成方法,CART,随机森林,提升] +⟶ [树和集成方法, CART, 随机森林, 提升]
94. **[Other methods, k-NN]** -⟶ [其他方法,k-NN] +⟶ [其他方法, k-NN]
95. **[Learning theory, Hoeffding inequality, PAC, VC dimension]** -⟶ [学习理论,Hoeffding不等式,PAC,VC维] +⟶ [学习理论, Hoeffding不等式, PAC, VC维] diff --git a/zh/cs-230-recurrent-neural-networks.md b/zh/cs-230-recurrent-neural-networks.md new file mode 100644 index 000000000..0655cd6b2 --- /dev/null +++ b/zh/cs-230-recurrent-neural-networks.md @@ -0,0 +1,676 @@ +**Recurrent Neural Networks translation** [[webpage]](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) + +
+ +**1. Recurrent Neural Networks cheatsheet** + +⟶ +循环神经网络简明指南 +
+ + +**2. CS 230 - Deep Learning** + +⟶ +CS 230 - 深度学习 +
+ + +**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]** + +⟶ +[概述, 网络结构, 循环神经网络的应用, 损失函数, 反向传播] +
+ + +**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]** + +⟶ +[处理长时间依赖性, 常见激活函数, 梯度消失/梯度爆炸, 梯度截断, 门控循环单元(GRU)/长短时记忆(LSTM), 门类型, 双向循环神经网络, 深度循环神经网络] +
+ + +**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]** + +⟶ +[词表示学习, 注解, 嵌入矩阵, Word2vec, Skip-gram, 负采样, GloVe] +
+ + +**6. [Comparing words, Cosine similarity, t-SNE]** + +⟶ +[词比较, 余弦相似度, t-SNE] +
+ + +**7. [Language model, n-gram, Perplexity]** + +⟶ +[语言模型, n-gram, 困惑度] +
+ + +**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]** + +⟶ +[机器翻译, 集束搜索/束搜索, 长度归一化, 误差分析, Bleu分数] +
+ + +**9. [Attention, Attention model, Attention weights]** + +⟶ +[注意力机制, 注意力模型, 注意力权重] +
+ + +**10. Overview** + +⟶ +概述 +
+ +**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:** + +⟶ +传统RNN的结构 - 循环神经网络(Recurrent Neural Networks, RNNs)是一类带有隐藏状态、并允许将先前的输出用作后续输入的神经网络, 其结构通常可表示为以下形式: +
+ +**12. For each timestep t, the activation a<t> and the output y<t> are expressed as follows:** + +⟶ +对于每一个时间步t, 激活值a<t>和输出y<t>可表示如下: +
+ + +**13. and** + +⟶ +并且 +
+ +**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.** + +⟶ +其中 Wax, Waa, Wya, ba, by 是在时间尺度上被整个网络共享的系数, g1, g2 是相应的激活函数。 +
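As an illustration of the equations above, here is a minimal NumPy sketch (not part of the original cheatsheet) of a single forward step of a vanilla RNN, assuming g1 = tanh and g2 = softmax and hypothetical dimensions; Wax, Waa, Wya, ba, by are the temporally shared coefficients:

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One forward time step of a vanilla RNN (illustrative sketch).

    a_t = tanh(Waa @ a_prev + Wax @ x_t + ba)   # g1 = tanh (a common choice)
    y_t = softmax(Wya @ a_t + by)               # g2 = softmax (e.g. for classification)
    """
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    z = Wya @ a_t + by
    y_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()   # numerically stable softmax
    return a_t, y_t

# Hypothetical dimensions: input size 3, hidden size 5, output size 2
rng = np.random.default_rng(0)
Wax, Waa = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
Wya, ba, by = rng.normal(size=(2, 5)), np.zeros(5), np.zeros(2)
a = np.zeros(5)
for x_t in rng.normal(size=(4, 3)):   # unroll over 4 timesteps, reusing the same weights
    a, y = rnn_cell_forward(x_t, a, Wax, Waa, Wya, ba, by)
```

In a trained model these weights are learned; the loop only illustrates that the same parameters are applied at every timestep.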
+ + +**15. The pros and cons of a typical RNN architecture are summed up in the table below:** + +⟶ +一个典型的RNN体系结构的优点和缺点可概括如下表: +
+ + +**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]** + +⟶ +[优点, 可处理任何长度的输入, 模型大小不会随输入大小的增加而增加, 计算时会考虑历史信息, 权重在整个时间尺度上被网络共享] +
+ + +**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]** + +⟶ +[缺点, 计算缓慢, 难以访问长时间的历史信息, 无法考虑未来时间步的输入对当前状态的影响] +
+ + +**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:** + +⟶ +循环神经网络的应用 - 循环神经网络(RNN)模型常用于自然语言处理和语音识别, 下表总结了循环神经网络(RNN)模型的不同应用场景: +
+ + +**19. [Type of RNN, Illustration, Example]** + +⟶ +[循环神经网络的类型, 图形表示, 示例] +
+ + +**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]** + +⟶ +[一对一, 一对多, 多对一, 多对多] +
+ + +**21. [Traditional neural network, Music generation, Sentiment classification, Name entity recognition, Machine translation]** + +⟶ +[传统神经网络, 音乐生成, 情感分类, 命名实体识别, 机器翻译] +
+ + +**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:** + +⟶ +损失函数 - 在循环神经网络的情况下, 所有时间步长的损失函数L是基于每个时间步长的损失来定义的, 其表示如下: +
+ + +**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:** + +⟶ +随时间反向传播算法(BPTT) - 反向传播在每个时间点完成。在时间步T, 损失函数L相对于权重矩阵W的导数表示如下: +
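A small sketch of how the total loss above is accumulated over timesteps, assuming per-step cross-entropy; in practice the gradient of L with respect to W is then obtained by backpropagation through time (automatic differentiation in most frameworks), which this sketch does not implement:

```python
import numpy as np

def sequence_loss(y_hat, y):
    """Total loss L = sum over timesteps of the per-step loss L(y_hat<t>, y<t>).

    y_hat: (T, C) predicted probabilities per timestep, y: (T,) true class indices.
    Cross-entropy is used as the per-step loss (an illustrative choice).
    """
    T = y.shape[0]
    return -float(np.sum(np.log(y_hat[np.arange(T), y] + 1e-12)))
```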
+ + +**24. Handling long term dependencies** + +⟶ +解决长时间依赖问题 +
+ + +**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:** + +⟶ +常用的激活函数 - 在循环神经网络(RNN)模型中常用的激活函数如下所示: +
+ + +**26. [Sigmoid, Tanh, RELU]** + +⟶ +[Sigmoid, 双曲正切函数(Tanh), 整流线性单元(RELU)] +
+ + +**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.** + +⟶ +梯度消失/梯度爆炸 - 梯度消失和梯度爆炸现象常出现在循环神经网络(RNN)模型中。其原因是该模型结构难以捕获长期依赖性, 因为乘法梯度会随着层数增加而呈指数递减/递增。 +
+ + +**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.** + +⟶ +梯度截断 - 一种用于解决反向传播时时而出现梯度爆炸问题的方法。通过限制梯度的最大值, 这种现象在实际中得到了相应的控制。 +
+ +**29. clipped** + +⟶ +截断 +
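One possible implementation of gradient clipping is sketched below, using the common "clip by global norm" variant (the figure above caps the maximum value; rescaling by the norm is another widespread choice); max_norm is a hypothetical hyperparameter:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm never exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]   # all gradients are scaled by the same factor
    return grads
```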
+ +**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:** + +⟶ +门类型 - 为了缓解梯度消失问题, 某些类型的RNN中会使用特定的门, 这些门通常有明确的用途。它们通常记为Γ, 且等于: +
+ + +**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:** + +⟶ +其中W,U,b是针对特定门的系数, σ是sigmoid激活函数。其主要的门类型可概括如下: +
+ + +**32. [Type of gate, Role, Used in]** + +⟶ +[门类型, 角色, 被用于] +
+ + +**33. [Update gate, Relevance gate, Forget gate, Output gate]** + +⟶ +[更新门, 关联门, 遗忘门, 输出门] +
+ +**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]** + +⟶ +[过去的信息对现在有多重要?, 是否丢弃以前的信息?, 是否擦除该单元?, 展示该单元的多少信息?] +
+ + +**35. [LSTM, GRU]** + +⟶ +[长短时记忆(LSTM), 门控循环单元(GRU)] +
+ + +**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:** + +⟶ +门控循环单元(GRU)/长短时记忆(LSTM) ― 门控循环单元(GRU)和长短时记忆(LSTM)可解决传统循环神经网络(RNNs)中遇到的梯度消失问题, 其中GRU是LSTM的一种推广。下表总结了每种结构的特性方程: +
+ + +**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]** + +⟶ +[特性, 门控循环单元(GRU), 长短时记忆(LSTM), 依赖项] +
+ + +**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.** + +⟶ +注:符号⋆表示两个向量之间的元素相乘。 +
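To make the gate equations concrete, here is a sketch of a single GRU step using the update gate Γu and relevance gate Γr from the table above, written with separate input/recurrent matrices rather than the concatenated form; the parameter layout (a dict of W, U, b) is hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, a_prev, params):
    """One GRU step (sketch following the update/relevance-gate equations above)."""
    W, U, b = params["W"], params["U"], params["b"]
    gamma_u = sigmoid(W["u"] @ x_t + U["u"] @ a_prev + b["u"])      # update gate
    gamma_r = sigmoid(W["r"] @ x_t + U["r"] @ a_prev + b["r"])      # relevance gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ (gamma_r * a_prev) + b["c"])
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * a_prev              # ⋆ = elementwise product
    return c_t                                                       # for a GRU, a<t> = c<t>
```

An LSTM additionally keeps a separate cell state and uses forget/output gates, but follows the same pattern.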
+ + +**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:** + +⟶ +循环神经网络(RNN)模型的变种 - 下表列出了其他常用的RNN结构: +
+ +**40. [Bidirectional (BRNN), Deep (DRNN)]** + +⟶ +[双向循环神经网络(Bidirectional RNN, BRNN), 深度循环神经网络(Deep RNN, DRNN)] +
+ + +**41. Learning word representation** + +⟶ +词表示学习 +
+ +**42. In this section, we note V the vocabulary and |V| its size.** + +⟶ +在本节中, 我们用V来表示词汇表, 用|V|来表示词汇表的大小。 +
+ + +**43. Motivation and notations** + +⟶ +动机和注解 +
+ + +**44. Representation techniques ― The two main ways of representing words are summed up in the table below:** + +⟶ +表示技术 - 两种主要的词表示方法的总结如下表所示: +
+ + +**45. [1-hot representation, Word embedding]** + +⟶ +[独热表示(one-hot), 词嵌入(word embedding)] +
+ + +**46. [teddy bear, book, soft]** + +⟶ +[泰迪熊, 书, 柔软的] +
+ + +**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]** + +⟶ +[以ow表示, 朴素方法, 没有相似信息, 以ew表示, 考虑词汇之间的相似性] +
+ + +**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:** + +⟶ +嵌入矩阵 - 对于给定的词汇w, 通过嵌入矩阵E可将该词汇的one-hot表示向量ow映射为词嵌入表示向量ew, E满足下式: +
+ + +**49. Remark: learning the embedding matrix can be done using target/context likelihood models.** + +⟶ +注:使用目标/上下文似然模型可以学习嵌入矩阵。 +
+ + +**50. Word embeddings** + +⟶ +词嵌入 +
+ + +**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.** + +⟶ +Word2vec ― Word2vec是一个旨在于通过估计给定词汇被其他词汇包围的可能性来学习词嵌入的框架。流行的模型包括skip-gram, 负采样和连续词袋(Continuous Bag-of-Words Model,CBOW)。 +
+ + +**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]** + +⟶ +[一只可爱的泰迪熊正在阅读, 泰迪熊, 柔软的, 波斯诗歌, 艺术] +
+ + +**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]** + +⟶ +[通过代理任务训练网络, 提取高级表示, 计算词嵌入] +
+ +**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:** + +⟶ +Skip-gram ― skip-gram word2vec模型是一个通过评估任意给定目标词汇t与上下文词汇c一起出现的可能性来学习词嵌入的监督式学习任务。记与目标词t相关联的参数为θt, 则概率P(t|c)可写作: +
+ +**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.** + +⟶ +注:softmax部分的分母需要对整个词汇表求和, 这使得该模型的计算代价十分高昂。CBOW是另一个word2vec模型, 其使用周围的单词来预测给定的单词。 +
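A sketch of the skip-gram softmax P(t|c) under the notation above, assuming θ is stored as a (|V|, d) matrix of target-word parameters; the sum over the whole vocabulary in the denominator is exactly the expensive part mentioned in the remark:

```python
import numpy as np

def skipgram_prob(theta, e_c, t):
    """P(t | c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)  (softmax over the vocabulary).

    theta: (|V|, d) target-word parameters, e_c: (d,) context-word embedding, t: target index.
    """
    scores = theta @ e_c
    scores -= scores.max()                        # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[t])
```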
+ +**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target words are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:** + +⟶ +负采样 - 它是一组基于逻辑回归的二分类器, 旨在评估给定上下文词与给定目标词同时出现的可能性, 模型在由k个负样本和1个正样本组成的集合上训练。对于给定的上下文单词c和目标单词t, 其预测可由以下表达式表示: +
+ + +**57. Remark: this method is less computationally expensive than the skip-gram model.** + +⟶ +注:该模型相比skip-gram模型而言,其计算代价更小。 +
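A minimal sketch of the negative-sampling objective for one positive pair and k sampled negatives, assuming a hypothetical (|V|, d) layout for both the target parameters θ and the context embeddings E:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(theta, E, c, t, negatives):
    """Logistic loss on 1 positive pair (c, t) and k negative pairs (c, j) for j in negatives."""
    e_c = E[c]
    loss = -np.log(sigmoid(theta[t] @ e_c) + 1e-12)        # the single positive example
    for j in negatives:                                    # the k sampled negative examples
        loss += -np.log(sigmoid(-theta[j] @ e_c) + 1e-12)
    return float(loss)
```

Only k + 1 dot products are needed per training pair, which is why this is cheaper than the full skip-gram softmax.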
+ + +**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:** + +⟶ +GloVe ― GloVe模型,是词表示的全局向量(global vectors for word representation)的简称, 是一种使用共现矩阵X的词嵌入技术,其中Xi,j表示的是目标词汇i与上下文j共同出现的次数。其代价函数J可写为: +
+ + +**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0. +Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:** + +⟶ +其中f是加权函数使得Xi,j=0⟹f(Xi,j)=0。考虑到e和θ在该模型中的对称性,最终嵌入的单词e(final)w由下式给出: +
+ + +**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.** + +⟶ +注:所学单词的嵌入表示的各个部分不一定是可解释的。 +
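A sketch of the GloVe cost J for a dense co-occurrence matrix X, using the usual weighting function with the hypothetical defaults x_max=100 and α=0.75 (the text above only requires f(0)=0):

```python
import numpy as np

def glove_cost(theta, e, b, b_prime, X, x_max=100, alpha=0.75):
    """J = 1/2 * sum_{i,j} f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2."""
    def f(x):
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    J = 0.0
    for i, j in zip(*np.nonzero(X)):       # pairs with X_ij = 0 contribute nothing since f(0) = 0
        diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
        J += 0.5 * float(f(X[i, j])) * diff ** 2
    return J
```

After training, the final embedding e(final)w can be taken as (ew + θw)/2, exploiting the symmetry mentioned above.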
+ + +**60. Comparing words** + +⟶ +词比较 +
+ + +**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:** + +⟶ +余弦相似度 - 单词w1和w2之间的余弦相似度可表示如下: +
+ + +**62. Remark: θ is the angle between words w1 and w2.** + +⟶ +注:θ是词w1和w2之间的夹角。 +
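The cosine similarity is essentially a one-liner; a small sketch for two embedding vectors:

```python
import numpy as np

def cosine_similarity(e_w1, e_w2):
    """similarity = (w1 . w2) / (||w1|| * ||w2||) = cos(theta)."""
    return float(e_w1 @ e_w2 / (np.linalg.norm(e_w1) * np.linalg.norm(e_w2)))
```

A value close to 1 means the two words point in almost the same direction in embedding space.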
+ + +**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.** + +⟶ +t-SNE ― 全称为t-distributed Stochastic Neighbor Embedding。t-SNE是一种将高维嵌入表示降维至低维空间的技术。实际上,其常用于将词向量在2D空间中的可视化。 +
+ +**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]** + +⟶ +[文学, 艺术, 书籍, 文化, 诗歌, 阅读, 知识, 娱乐, 惹人爱的, 童年, 善良, 泰迪熊, 柔软, 拥抱, 可爱, 讨人喜欢的] +
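A sketch of the typical 2D visualization step, assuming scikit-learn is available; the word vectors below are random placeholders:

```python
import numpy as np
from sklearn.manifold import TSNE   # assumes scikit-learn is installed

embeddings = np.random.default_rng(0).normal(size=(16, 300))   # placeholder word vectors
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
# coords has shape (16, 2) and can be scatter-plotted with the words as labels
```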
+ + +**65. Language model** + +⟶ +语言模型 +
+ + +**66. Overview ― A language model aims at estimating the probability of a sentence P(y).** + +⟶ +概述 - 语言模型的目标在于估计句子的概率P(y) +
+ + +**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearance in the training data.** + +⟶ +n-gram模型 - 该模型的思想很朴素,旨在通过计算一个词汇表达式(词汇组合)在训练数据中出现的次数来量化该表达式出现在语料库中的概率。 +
+ + +**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:** + +⟶ +困惑度-语言模型通常使用困惑度来进行度量,其也被称为PP,它可以被解释为利用词的数量进行归一化的数据集的逆概率。困惑度越低越好,其定义如下: +
+ + +**69. Remark: PP is commonly used in t-SNE.** + +⟶ +注:PP常用于t-SNE模型中。 +
+ + +**70. Machine translation** + +⟶ +机器翻译 +
+ +**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred as a conditional language model. The goal is to find a sentence y such that:** + +⟶ +概述 - 机器翻译模型与语言模型类似, 只是其前面多了一个编码器网络。因此, 机器翻译模型有时被称为条件语言模型。其目标是找到一个句子y, 使得: +
+ + +**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.** + +⟶ +束搜索 - 它是一种启发式搜索算法,用于机器翻译和语音识别,以找到给定输入x的最有可能的句子y。 +
+ +**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]** + +⟶ +[第1步: 寻找最可能的B个单词y<1>, 第2步: 计算条件概率y<k>|x,y<1>,...,y<k−1>, 第3步: 保留最可能的B个组合x,y<1>,...,y<k>, 在遇到停止词时结束该过程] +
+ + +**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.** + +⟶ +注:如果束宽设置为1,则其与朴素贪婪搜索等价。 +
+ +**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield to better result but with slower performance and increased memory. Small values of B lead to worse results but is less computationally intensive. A standard value for B is around 10.** + +⟶ +束宽 - 束宽B是束搜索的参数。B的值越大, 搜索结果越好, 但性能更慢且内存占用增加; B的值越小, 搜索结果越差, 但计算代价更小。B的标准值大约为10。 +
+ + +**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:** + +⟶ +长度归一化 - 为提高数值稳定性,束搜索常被应用于以下归一化目标,常称为归一化对数似然目标,定义如下: +
+ + +**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.** + +⟶ +注:参数α可看做软化器,其值在0.5 ~ 1之间。 +
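Below is a toy beam-search sketch combining the ideas above (top-B expansion, stopping at an end token, and length-normalized scoring); step_log_probs, eos and alpha are hypothetical stand-ins for the model's next-token distribution, its stop token and the softener parameter:

```python
import numpy as np

def beam_search(step_log_probs, B=3, max_len=20, eos=0, alpha=0.7):
    """Toy beam search with length normalization (sketch, not a production decoder).

    step_log_probs(prefix) is assumed to return a 1-D array of log P(next token | x, prefix).
    Hypotheses are ranked with 1/T^alpha * sum_t log P(y<t> | x, y<1>, ..., y<t-1>).
    """
    beams = [([], 0.0)]                                    # (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix)
            for tok in np.argsort(log_p)[-B:]:             # expand only the top-B tokens
                candidates.append((prefix + [int(tok)], score + float(log_p[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:B]:               # keep the top-B combinations
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                                      # every kept hypothesis has ended
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[1] / (len(c[0]) ** alpha))   # length-normalized score
```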
+ +**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:** + +⟶ +误差分析 - 当得到的预测翻译ˆy较差时, 可以通过执行以下误差分析来思考为什么我们没有得到好的翻译y∗: +
+ +**79. [Case, Root cause, Remedies]** + +⟶ +[具体情况, 根本原因, 补救措施] +
+ +**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]** + +⟶ +[束搜索存在问题, RNN存在问题, 增大束宽, 尝试不同的网络结构, 正则化, 获取更多数据] +
+ + +**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:** + +⟶ +bleu分数 ― 双语评估替换(bilingual evaluation understudy, bleu)分数通过基于n-gram精度计算相似度分数来量化机器翻译的质量。其定义如下: +
+ + +**82. where pn is the bleu score on n-gram only defined as follows:** + +⟶ +其中pn是n-gram上的bleu分数,定义如下: +
+ +**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.** + +⟶ +注:可以对过短的预测翻译施加简短惩罚(brevity penalty), 以防止bleu分数被人为夸大。 +
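A toy sentence-level bleu sketch for a single reference, with clipped n-gram precisions p1..pN, their geometric mean and a brevity penalty; candidate and reference are assumed to be token lists:

```python
import numpy as np
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision p_n between a candidate and a single reference."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu(candidate, reference, N=4):
    """Toy BLEU: brevity penalty times the geometric mean of p_1..p_N."""
    ps = [ngram_precision(candidate, reference, n) for n in range(1, N + 1)]
    if min(ps) == 0:
        return 0.0
    bp = min(1.0, np.exp(1 - len(reference) / len(candidate)))   # brevity penalty
    return float(bp * np.exp(np.mean(np.log(ps))))
```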
+ + +**84. Attention** + +⟶ +注意力机制 +
+ +**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that is considered as being important, which improves the performance of the resulting model in practice. By noting α<t,t′> the amount of attention that the output y<t> should pay to the activation a<t′> and c<t> the context at time t, we have:** + +⟶ +注意力模型 - 该模型允许RNN关注输入中被认为重要的特定部分, 从而在实践中提高所得模型的性能。记α<t,t′>为输出y<t>应给予激活值a<t′>的注意力大小, c<t>为时间t的上下文, 我们有: +
+ +**86. with** + +⟶ +其中 +
+ + +**87. Remark: the attention scores are commonly used in image captioning and machine translation.** + +⟶ +注:注意力分数常用于图像字幕和机器翻译。 +
+ + +**88. A cute teddy bear is reading Persian literature.** + +⟶ +一只可爱的泰迪熊正在阅读波斯文学书。 +
+ +**89. Attention weight ― The amount of attention that the output y<t> should pay to the activation a<t′> is given by α<t,t′> computed as follows:** + +⟶ +注意力权重 - 输出y<t>应给予激活值a<t′>的注意力大小(即注意力权重)由α<t,t′>给出, 其计算如下: +
+ +**90. Remark: computation complexity is quadratic with respect to Tx.** + +⟶ +注:计算复杂度关于Tx是二次的。 +
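A sketch of how the attention weights α<t,t′> and the context c<t> are computed for one output step, given precomputed alignment scores e<t,t′> (how those scores are produced, e.g. by a small feed-forward network over the decoder and encoder states, is left out):

```python
import numpy as np

def attention_context(scores, a):
    """Attention weights and context for one output step t (sketch).

    scores: (Tx,) alignment scores e<t,t'>, a: (Tx, n_a) encoder activations a<t'>.
    Returns (alpha, c_t) where alpha sums to 1 over the Tx input positions.
    """
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over the Tx input positions
    c_t = alpha @ a                      # c<t> = sum_{t'} alpha<t,t'> * a<t'>
    return alpha, c_t
```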
+ + +**91. The Deep Learning cheatsheets are now available in [target language].** + +⟶ +现已提供[中文语言]版本的深度学习简明指南。 +
+ +**92. Original authors** + +⟶ +原作者 +
+ +**93. Translated by X, Y and Z** + +⟶ +由X,Y和Z翻译 +
+ +**94. Reviewed by X, Y and Z** + +⟶ +由X,Y和Z审阅 +
+ +**95. View PDF version on GitHub** + +⟶ +在Github上查看PDF版本 +
+ +**96. By X and Y** + +⟶ +由X和Y +