Gradient descent converges to some local minimum
Perhaps not the global minimum...
Add momentum
Stochastic gradient descent
Train multiple nets with different initial weights
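A minimal sketch of two of these remedies, momentum and restarts from different initial weights, on a toy problem. The one-dimensional loss w^4 - 3w^2 + w, the learning rate, the momentum coefficient, and the number of restarts are all assumptions chosen for illustration; they are not from the slides. The loss has one global and one shallower local minimum, so different starting points can end in different places. Stochastic gradient descent is not shown here.

```python
import random

def loss(w):
    # Toy non-convex loss (assumed for illustration): two minima of different depth.
    return w**4 - 3 * w**2 + w

def grad(w):
    # Analytic derivative of the toy loss.
    return 4 * w**3 - 6 * w + 1

def descend(w, lr=0.01, momentum=0.9, steps=500):
    # Gradient descent with a momentum (velocity) term.
    v = 0.0
    for _ in range(steps):
        v = momentum * v - lr * grad(w)  # keep a fraction of the previous step
        w += v
    return w

# Restart from several random initial weights and keep the best result.
rng = random.Random(0)
starts = [rng.uniform(-2.0, 2.0) for _ in range(5)]
finals = [descend(w0) for w0 in starts]
best = min(finals, key=loss)

for w0, w in zip(starts, finals):
    print(f"start {w0:+.3f} -> final {w:+.4f}, loss {loss(w):+.4f}")
print(f"best final weight: {best:+.4f}")
```

Each run stops at a point where the gradient vanishes, but only comparing the losses across restarts shows which of those points is the better minimum.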
Nature of convergence
Initialize weights near zero
Therefore, initial networks near-linear
Increasingly non-linear functions possible as training progresses
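The near-linear claim can be checked numerically. The sketch below assumes a one-hidden-layer network with tanh units (the slides do not name the activation) and compares its output with the same network whose activation is replaced by the identity. The helper names, hidden-layer size, and weight scales are hypothetical. Since tanh(z) is approximately z for small z, the relative gap should be tiny when weights are near zero and substantial when they are large.

```python
import math
import random

def make_net(scale, hidden=8, seed=0):
    # Each hidden unit is (input weight, bias, output weight), drawn in [-scale, scale].
    rng = random.Random(seed)
    return [tuple(scale * rng.uniform(-1.0, 1.0) for _ in range(3))
            for _ in range(hidden)]

def forward(net, x, act):
    # Scalar input, scalar output, one hidden layer with activation `act`.
    return sum(w2 * act(w1 * x + b1) for w1, b1, w2 in net)

def identity(z):
    return z

def relative_gap(scale):
    # Largest deviation from the linearized network, relative to its output size.
    net = make_net(scale)
    xs = [i / 10.0 for i in range(-10, 11)]
    gap = max(abs(forward(net, x, math.tanh) - forward(net, x, identity)) for x in xs)
    size = max(abs(forward(net, x, identity)) for x in xs)
    return gap / size

rel_small = relative_gap(0.01)  # weights near zero
rel_large = relative_gap(2.0)   # weights as they might look after much training

print(f"relative gap, small weights: {rel_small:.2e}")
print(f"relative gap, large weights: {rel_large:.2e}")
```

As weights grow during training, the hidden units move out of the linear region of tanh, which is what makes increasingly non-linear functions representable.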