In Andrew Ng's Machine Learning tutorial, he takes the first derivative of the error function and then takes small steps in the direction opposite the derivative to find the minimum (basically gradient descent).
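
For concreteness, here is a minimal sketch of that idea (not code from the tutorial): gradient descent on a one-parameter squared-error cost `J(theta) = mean((theta*x - y)^2)` with made-up data and an assumed learning rate `alpha`.

```python
import numpy as np

# toy data, roughly y = 2x (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

theta = 0.0
alpha = 0.01                                   # step size (assumed; needs tuning)
for _ in range(1000):
    grad = 2 * np.mean((theta * x - y) * x)    # dJ/dtheta
    theta -= alpha * grad                      # small step against the derivative
print(theta)                                   # converges near the least-squares slope
```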
In The Elements of Statistical Learning, the first derivative of the error function is set to zero and the root of that equation is found with a numerical method (in that case Newton-Raphson).
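
Again as a sketch (not the book's code), the same toy cost solved the ESL way: apply Newton-Raphson to the root-finding problem `J'(theta) = 0`, i.e. update `theta <- theta - J'(theta) / J''(theta)`.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

theta = 0.0
for _ in range(10):
    grad = 2 * np.mean((theta * x - y) * x)    # J'(theta)
    hess = 2 * np.mean(x * x)                  # J''(theta), constant for this quadratic cost
    theta -= grad / hess                       # Newton-Raphson step on J'
print(theta)
```

For a quadratic cost like this, a single Newton step lands exactly on the minimum, whereas gradient descent approaches it gradually at a rate set by the learning rate.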
On paper, both methods should yield similar results. But are they numerically different in practice, or is one method better than the other?