I might be late to the party, but it seems there are some things that need to be cleared up here.
First of all: the activation function $g(z)$ at the output layer often depends on your cost function. This is done to make the derivative $\frac{\partial C}{\partial z}$ of the cost function $C$ with respect to the inputs $z$ of the last layer easy to compute.
As an example, we could use the mean squared error loss $C(y, g(z)) = \frac{1}{2} (y - g(z))^2$ in a regression setting. By setting $g(x) = x$ (a linear activation function), we find for the derivative $$\begin{align*} \frac{\partial C(y,g(z))}{\partial z} & = \frac{\partial C(y, g(z))}{\partial g(z)} \cdot \frac{\partial g(z)}{\partial z} \\ & = \frac{\partial}{\partial g(z)}\left(\frac{1}{2} (y - g(z))^2\right) \cdot \frac{\partial}{\partial z}\left(z\right) \\ & = - (y-g(z)) \cdot 1 \\ & = g(z) - y \end{align*}$$ You get the same easy expression for $\frac{\partial C}{\partial z}$ if you combine the cross-entropy loss with the logistic sigmoid or softmax activation functions.
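To see why, here is the analogous derivation for the binary cross-entropy loss $C(y, g(z)) = -\big[y \log g(z) + (1-y)\log(1-g(z))\big]$ combined with the logistic sigmoid $g(z) = \frac{1}{1+e^{-z}}$, using the identity $\frac{\partial g(z)}{\partial z} = g(z)\,(1-g(z))$: $$\begin{align*} \frac{\partial C(y,g(z))}{\partial z} & = \frac{\partial C(y, g(z))}{\partial g(z)} \cdot \frac{\partial g(z)}{\partial z} \\ & = \left(-\frac{y}{g(z)} + \frac{1-y}{1-g(z)}\right) \cdot g(z)\,(1-g(z)) \\ & = -y\,(1-g(z)) + (1-y)\,g(z) \\ & = g(z) - y \end{align*}$$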
This is why linear activations are often used for regression and logistic/softmax activations for binary/multi-class classification. However, nothing keeps you from trying out different combinations. The expression for $\frac{\partial C}{\partial z}$ will probably not be as nice, but that does not mean your activation function will perform worse.
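If you want to convince yourself of these identities (or inspect the gradient of a less standard pairing), a quick finite-difference check is enough. The following is a minimal NumPy sketch, not code from the answer above, that confirms numerically that both pairings reduce to $g(z) - y$ for a single sample:

```python
import numpy as np

def mse_linear(z, y):
    # mean squared error with a linear output activation, g(z) = z
    return 0.5 * (y - z) ** 2

def bce_sigmoid(z, y):
    # binary cross-entropy with a logistic sigmoid output activation
    g = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(g) + (1.0 - y) * np.log(1.0 - g))

def numerical_grad(C, z, y, eps=1e-6):
    # central finite-difference approximation of dC/dz
    return (C(z + eps, y) - C(z - eps, y)) / (2.0 * eps)

z, y = 0.7, 1.0
sigmoid = 1.0 / (1.0 + np.exp(-z))

print(numerical_grad(mse_linear, z, y), z - y)         # both ~ g(z) - y with g(z) = z
print(numerical_grad(bce_sigmoid, z, y), sigmoid - y)  # both ~ g(z) - y with g(z) = sigmoid(z)
```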
Second, I would like to add that there are plenty of activation functions that can be used for the hidden layers. Sigmoids (like the logistic function and hyperbolic tangent) have indeed proven to work well, but, as indicated by Jatin, they suffer from vanishing gradients when your networks become too deep. In that case, ReLUs have become popular. What I would like to emphasise, though, is that many more activation functions are available, and researchers keep looking for new ones with different/better properties, e.g. Exponential Linear Units (ELUs) and Gaussian Error Linear Units (GELUs).
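For reference, here is a small NumPy sketch of the activations mentioned above (deep-learning frameworks ship their own optimised versions; the GELU below uses the common tanh approximation):

```python
import numpy as np

def logistic(x):
    # logistic sigmoid, squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # hyperbolic tangent, squashes to (-1, 1)
    return np.tanh(x)

def relu(x):
    # Rectified Linear Unit
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Exponential Linear Unit
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # Gaussian Error Linear Unit, x * Phi(x), via the tanh approximation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```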
To conclude: When looking for the best activation functions, just be creative. Try out different things and see what combinations lead to the best performance.
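As a minimal sketch of what "trying out different things" can look like in practice (the data here is synthetic and purely for illustration, and it assumes a reasonably recent TensorFlow/Keras where `'elu'` and `'gelu'` are registered activation names):

```python
import numpy as np
from tensorflow import keras

# Synthetic regression data, just to have something to fit.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

for act in ["tanh", "relu", "elu", "gelu"]:
    model = keras.Sequential([
        keras.Input(shape=(10,)),
        keras.layers.Dense(32, activation=act),      # hidden layer: activation under test
        keras.layers.Dense(1, activation="linear"),  # linear output for regression
    ])
    model.compile(optimizer="adam", loss="mse")      # MSE pairs naturally with a linear output
    hist = model.fit(X, y, epochs=20, validation_split=0.2, verbose=0)
    print(act, hist.history["val_loss"][-1])
```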
Addendum: For more pairings of loss functions and activations, you probably want to look into (canonical) link functions; for example, the identity link corresponds to the Gaussian/squared-error case and the logit link to the Bernoulli/cross-entropy case.