Gradient descent over entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global, error minimum
In practice, often works well (can run multiple times from different initial weights)
Often include weight momentum
Minimizes error over training examples
Will it generalize well to subsequent examples?
Training can take thousands of iterations (slow!)
Using network after training is very fast
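
The points about local minima, repeated runs, and weight momentum can be illustrated with a minimal sketch. It is not a full backpropagation implementation. It assumes a hypothetical one-weight error surface E(w) = w^4 - 3w^2 + w, chosen only because it has one local and one global minimum, and a common form of the momentum update, delta(n) = -eta * gradient + alpha * delta(n-1). The names eta, alpha, error, grad_error, and best_of_restarts are illustrative assumptions, and the starting points are fixed rather than random so the run is reproducible.

```python
def gradient_descent_momentum(grad, w0, eta=0.01, alpha=0.9, steps=1000):
    """Gradient descent on a weight vector with a momentum term.

    Assumed update rule: delta(n) = -eta * grad(w) + alpha * delta(n-1)
    """
    w = list(w0)
    delta = [0.0] * len(w)  # previous weight change, zero before the first step
    for _ in range(steps):
        g = grad(w)
        # New step = plain gradient step plus a fraction of the previous step.
        delta = [-eta * gi + alpha * di for gi, di in zip(g, delta)]
        w = [wi + di for wi, di in zip(w, delta)]
    return w


def error(w):
    """Hypothetical error surface with a local and a global minimum."""
    x = w[0]
    return x**4 - 3 * x**2 + x


def grad_error(w):
    """Gradient of the hypothetical error surface above."""
    x = w[0]
    return [4 * x**3 - 6 * x + 1]


def best_of_restarts(starts):
    """Run descent from each starting point and keep the lowest-error result."""
    runs = [gradient_descent_momentum(grad_error, s) for s in starts]
    return min(runs, key=error)


if __name__ == "__main__":
    starts = [[-0.5], [0.5]]  # fixed here; normally drawn at random
    for s in starts:
        w = gradient_descent_momentum(grad_error, s)
        print("start", s, "-> w", w, "error", error(w))
    print("best of restarts:", best_of_restarts(starts))
```

On this surface a single run can settle in either minimum depending on where it starts, which is why keeping the best of several runs helps. Note that the comparison here is on training error only, so it says nothing about how well the result generalizes.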