Heavy-tailed nature of stochastic gradient descent in deep learning: theoretical and empirical analysis

Thanh Huy Nguyen

Abstract

In this thesis, we are concerned with the Stochastic Gradient Descent (SGD) algorithm. Specifically, we perform a theoretical and empirical analysis of the behavior of the stochastic gradient noise (GN), defined as the difference between the true gradient and the stochastic gradient, in deep neural networks. Based on these results, we bring an alternative perspective to the existing approaches for investigating SGD. The GN in SGD is often assumed to be Gaussian for mathematical convenience. This assumption enables SGD to be studied as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context, which suggests that the GN is better approximated by a "heavy-tailed" alpha-stable random vector. Accordingly, we propose to analyze SGD as a discretization of an SDE driven by a Lévy motion. Firstly, to justify the alpha-stable assumption, we conduct experiments on common deep learning scenarios and show that in all settings the GN is highly non-Gaussian and exhibits heavy tails. Secondly, under the heavy-tailed GN assumption, we provide a non-asymptotic analysis of the convergence of the discrete-time SGD dynamics to the global minimum, measured in terms of suboptimality. Finally, we investigate the metastability of the SDE driven by a Lévy motion, which can then be exploited to clarify the behavior of SGD, especially in terms of "preferring wide minima". More precisely, we provide a formal theoretical analysis in which we derive explicit conditions on the step-size such that the metastability behavior of SGD, viewed as a discrete-time SDE, is similar to that of its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes, and we describe how the error depends on the algorithm and problem parameters. We illustrate our metastability results with simulations on a synthetic model and on neural networks. Our results open up a different perspective and shed more light on the view that SGD prefers wide minima.
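The idea of viewing SGD as a discretization of a Lévy-driven SDE can be illustrated with a minimal one-dimensional sketch. It is not taken from the thesis: the function names `sample_sas` and `levy_driven_sgd`, the quadratic objective, the noise scale `sigma`, and the noise scaling `eta ** (1 / alpha)` (a plain Euler scheme) are all assumptions chosen for illustration. The symmetric alpha-stable samples are drawn with the Chambers-Mallows-Stuck method; with `alpha = 2` the noise reduces to a Gaussian with variance 2, which recovers the Brownian case.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Draw standard symmetric alpha-stable samples (Chambers-Mallows-Stuck).

    Valid for 0 < alpha <= 2; alpha = 2 gives a Gaussian with variance 2,
    alpha = 1 gives a standard Cauchy.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-mean exponential
    return (
        np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
        * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    )


def levy_driven_sgd(grad, w0, eta, sigma, alpha, n_steps, rng):
    """Euler scheme for dW = -grad(W) dt + sigma dL^alpha (illustrative only).

    Over a step of length eta, an alpha-stable Levy increment scales as
    eta ** (1 / alpha), which replaces the sqrt(eta) of the Brownian case.
    """
    noise = sample_sas(alpha, n_steps, rng)
    path = np.empty(n_steps + 1)
    path[0] = w0
    for k in range(n_steps):
        path[k + 1] = (
            path[k]
            - eta * grad(path[k])
            + sigma * eta ** (1.0 / alpha) * noise[k]
        )
    return path


# Hypothetical usage: quadratic objective f(w) = w^2 / 2, so grad(w) = w.
rng = np.random.default_rng(0)
heavy = levy_driven_sgd(lambda w: w, 1.0, 0.01, 1.0, 1.5, 1000, rng)
gauss = levy_driven_sgd(lambda w: w, 1.0, 0.01, 1.0, 2.0, 1000, rng)
```

Comparing the two trajectories for a given seed gives a rough feel for the occasional large jumps that heavy-tailed noise produces and that Gaussian noise does not; such jumps are what the metastability analysis is concerned with.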

