Stochastic approximation methods (SAMs) have been extensively applied in deep learning.
Adaptively setting the learning rate can significantly accelerate the learning process. On the other hand, numerous applications have demonstrated that asynchronous (async) parallel computing can achieve much better parallelization speed-up than synchronous (sync) parallel computing. Therefore, to train a neural network with parallel computing resources, it is paramount to design algorithms that enjoy the advantages of both adaptive learning and async-parallel computing. This proposal pursues that direction. We aim to design stable and efficient SAMs that are both adaptive and async-parallel and can thus achieve a double acceleration over a non-adaptive, non-parallel SAM. We propose an adaptive SAM that utilizes partial second-order derivative information of the problem. Preliminary tests on training a neural network show that it can outperform Adam, a widely used adaptive SAM. We also propose a hierarchical tree structure that allows partial coordination between computing nodes, for implementing an adaptive SAM in an async-parallel manner. With the proposed implementation, the parallel method has the same theoretical guarantees as its non-parallel counterpart, and, by optimizing the tree structure, it can achieve near-linear parallelization speed-up. An exemplary application of the proposed algorithm is deep learning on graph-structured data, where the scale of the graphs and their highly imbalanced sizes necessitate asynchronous parallel training. We will use the developed method to train graph neural networks.
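To make the idea of an adaptive SAM with partial second-order information concrete, the following is a minimal sketch, not the proposal's actual method: a stochastic-gradient step preconditioned by a running (exponential-moving-average) estimate of the diagonal of the Hessian. The function name `adaptive_sam_step`, the toy quadratic objective, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def adaptive_sam_step(x, grad, hess_diag, state, lr=0.1, beta=0.9, eps=1e-8):
    """One preconditioned update: x <- x - lr * grad / (EMA(|H_ii|) + eps).

    `state` holds a smoothed estimate of the absolute Hessian diagonal,
    playing a role analogous to Adam's second-moment accumulator but
    built from curvature information rather than squared gradients.
    """
    state = beta * state + (1 - beta) * np.abs(hess_diag)  # smooth curvature
    x_new = x - lr * grad / (state + eps)                  # preconditioned step
    return x_new, state

# Toy problem: minimize f(x) = 0.5 * x^T D x with an ill-conditioned diagonal D.
# Per-coordinate preconditioning equalizes progress across the two curvatures.
D = np.array([100.0, 1.0])   # curvatures differ by a factor of 100
x = np.array([1.0, 1.0])
state = np.zeros_like(x)
for _ in range(50):
    grad = D * x             # exact gradient of the quadratic
    hess_diag = D            # Hessian diagonal (exact for this toy problem)
    x, state = adaptive_sam_step(x, grad, hess_diag, state)

print(np.linalg.norm(x))     # close to 0: the iterates converge
```

In practice the Hessian diagonal would be estimated stochastically (e.g., via Hutchinson-style probes) rather than computed exactly as in this toy example.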
So far, our numerical results on a machine learning problem show that async-parallel computing successfully accelerates the adaptive stochastic gradient method.