{"title": "Wormholes Improve Contrastive Divergence", "book": "Advances in Neural Information Processing Systems", "page_first": 417, "page_last": 424, "abstract": "", "full_text": "Wormholes Improve Contrastive Divergence\n\nGeoffrey Hinton, Max Welling and Andriy Mnih\n\nDepartment of Computer Science, University of Toronto\n\n10 King\u2019s College Road, Toronto, M5S 3G5 Canada\n\n{hinton,welling,amnih}@cs.toronto.edu\n\nAbstract\n\nIn models that define probabilities via energies, maximum likelihood learning typically involves using Markov Chain Monte Carlo to sample from the model\u2019s distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.\n\n1 Introduction\n\nOne way to model the density of high-dimensional data is to use a set of parameters, \u0398, to deterministically assign an energy, E(x|\u0398), to each possible datavector, x [2]:\n\np(x|\u0398) = exp(\u2212E(x|\u0398)) / \u222b exp(\u2212E(y|\u0398)) dy    (1)\n\nThe obvious way to fit such an energy-based model to a set of training data is to follow the gradient of the likelihood. 
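A minimal numerical illustration of equation (1), approximating the normalizing integral on a grid; the quadratic energy below is an illustrative stand-in, not the network energy used later in the paper.

```python
# A numerical illustration of equation (1): turning an energy into a
# normalized density by approximating the partition function on a grid.
# The quadratic energy is a toy assumption, not the paper's model.
import numpy as np

def energy(x):
    # Toy quadratic energy with a single minimum at the origin.
    return 0.5 * np.sum(x ** 2, axis=-1)

# Z = integral of exp(-E(y)) dy, approximated by a Riemann sum on a 2-D grid.
grid = np.linspace(-5.0, 5.0, 201)
xx, yy = np.meshgrid(grid, grid)
points = np.stack([xx.ravel(), yy.ravel()], axis=-1)
cell = (grid[1] - grid[0]) ** 2
Z = np.sum(np.exp(-energy(points))) * cell

def density(x):
    # p(x | Theta) = exp(-E(x | Theta)) / Z, as in equation (1).
    return np.exp(-energy(x)) / Z
```

For this Gaussian-shaped energy the integral is known in closed form (Z = 2\u03c0), which makes the grid approximation easy to check; in high dimensions no such grid is affordable, which is why sampling is needed.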
The contribution of a training case, x, to the gradient is:\n\n\u2202 log p(x|\u0398)/\u2202\u03b8_j = \u2212\u2202E(x|\u0398)/\u2202\u03b8_j + \u222b p(y|\u0398) \u2202E(y|\u0398)/\u2202\u03b8_j dy    (2)\n\nThe last term in equation 2 is an integral over all possible datavectors and is usually intractable, but it can be approximated by running a Markov chain to get samples from the Boltzmann distribution defined by the model\u2019s current parameters. The main problem with this approach is the time that it takes for the Markov chain to approach its stationary distribution. Fortunately, in [3] it was shown that if the chain is started at the data distribution, running the chain for just a few steps is often sufficient to provide a signal for learning. The way in which the data distribution gets distorted by the model in the first few steps of the Markov chain provides enough information about how the model differs from reality to allow the parameters of the model to be improved by lowering the energy of the data and raising the energy of the \u201cconfabulations\u201d produced by a few steps of the Markov chain. So the steepest ascent learning algorithm implied by equation 2 becomes\n\n\u2206\u03b8_j \u221d \u2212\u27e8\u2202E(.|\u0398)/\u2202\u03b8_j\u27e9_data + \u27e8\u2202E(.|\u0398)/\u2202\u03b8_j\u27e9_confabulations    (3)\n\nwhere the angle brackets denote expected values under the distribution specified by the subscript.\n\nFigure 1: a) shows a two-dimensional data distribution that has four well-separated modes. b) shows a feedforward neural network that is used to assign an energy to a two-dimensional input vector. Each hidden unit takes a weighted sum of its inputs, adds a learned bias, and puts this sum through a logistic non-linearity to produce an output that is sent to the next layer. Each hidden unit makes a contribution to the global energy that is equal to its output times a learned scale factor. There are 20 units in the first hidden layer and 3 in the top layer.\n\nIf we use a Markov chain that obeys detailed balance, it is clear that when the training data is dense and the model is perfect, the learning procedure in equation 3 will leave the parameters unchanged because the Markov chain will already be at its stationary distribution, so the confabulations will have the same distribution as the training data.\n\nUnfortunately, real training sets may have modes that are separated by regions of very low density, and running the Markov chain for only a few steps may not allow it to move between these modes even when there is a lot of data. As a result, the relative energies of data points in different modes can be completely wrong without affecting the learning signal given by equation 3. The point of this paper is to show that, in the context of model-fitting, there are ways to use the known training data to introduce extra mode-hopping moves into the Markov chain. We rely on the observation that after some initial training, the training data itself provides useful suggestions about where the modes of the model are and how much probability mass there is in each mode.\n\n2 A simple example of wormholes\n\nFigure 1a shows some two-dimensional training data and a model that was used to model the density of the training data. The model is an unsupervised deterministic feedforward neural network with two hidden layers of logistic units. The parameters of the model are the weights and biases of the hidden units and one additional scale parameter per hidden unit which is used to convert the output of the hidden unit into an additive contribution to the global energy. By using backpropagation through the model, it is easy to compute the derivatives of the global energy assigned to an input vector w.r.t. 
the parameters (needed in equation 3), and it is also easy to compute the gradient of the energy w.r.t. each component of the input vector (i.e. the slope of the energy surface at that point in dataspace). The latter gradient is needed for the \u2018Hybrid Monte Carlo\u2019 sampler that we discuss next.\n\nThe model is trained on 1024 datapoints for 1000 parameter updates using equation 3. To produce the confabulations we start at the datapoints and use a Markov chain that is a simplified version of Hybrid Monte Carlo.\n\nFigure 2: (a) shows the probabilities learned by the network without using wormholes, displayed on a 32 \u00d7 32 grid in the dataspace. Some modes have much too little probability mass. (b) shows that the probability mass in the different minima matches the data distribution after 10 parameter updates using point-to-point wormholes defined by the vector differences between pairs of training points. The mode-hopping allowed by the wormholes increases the number of confabulations that end up in the deeper minima which causes the learning algorithm to raise the energy of these minima.\n\nEach datapoint is treated as a particle on the energy surface. The particle is given a random initial momentum chosen from a unit-variance isotropic Gaussian and its deterministic trajectory along the energy surface is then simulated for 10 time steps. If this simulation has no numerical errors the increase, \u2206E, in the combined potential and kinetic energy will be zero. If \u2206E is positive, the particle is returned to its initial position with a probability of 1 \u2212 exp(\u2212\u2206E). The step size is adapted after each batch of trajectories so that only about 10% of the trajectories get rejected. 
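A minimal sketch of this simplified Hybrid Monte Carlo step, assuming an illustrative quadratic energy and a fixed step size (the paper adapts the step size per batch of trajectories):

```python
# A sketch of the simplified Hybrid Monte Carlo step described above:
# random unit-variance momentum, 10 leapfrog steps, and a return to the
# starting point with probability 1 - exp(-dE) when the total energy rises.
# The quadratic energy and fixed step size are illustrative assumptions.
import numpy as np

def energy(x):
    return 0.5 * np.sum(x ** 2)

def energy_grad(x):
    return x

def hmc_step(x0, step_size=0.1, n_steps=10, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    p = rng.standard_normal(x0.shape)            # random initial momentum
    x = x0.copy()
    h0 = energy(x0) + 0.5 * np.sum(p ** 2)       # initial total energy
    grad = energy_grad(x)
    for _ in range(n_steps):                     # leapfrog integration
        p -= 0.5 * step_size * grad              # half-step momentum update
        x += step_size * p                       # full-step position update
        grad = energy_grad(x)
        p -= 0.5 * step_size * grad              # half-step momentum update
    dE = energy(x) + 0.5 * np.sum(p ** 2) - h0   # energy increase of trajectory
    if dE > 0 and rng.uniform() > np.exp(-dE):   # reject: return to start
        return x0
    return x
```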
Numerical errors up to second order are eliminated by using a \u201cleapfrog\u201d method [5] which uses the potential energy gradient at time t to compute the velocity increment between time t \u2212 1/2 and t + 1/2, and uses the velocity at time t + 1/2 to compute the position increment between time t and t + 1.\n\nFigure 2a shows the probability density over the two-dimensional space. Notice that the model assigns much more probability mass to some minima than to others. It is clear that the learning procedure in equation 3 would correct this imbalance if the confabulations were generated by a time-consuming Markov chain that was able to concentrate the confabulations in the deepest minima^1, but we want to make use of the data distribution to achieve the same goal much faster.\n\nFigure 2b shows how the probability density is corrected by 10 parameter updates using a Markov chain that has been modified by adding an optional long-range jump at the end of each accepted trajectory. The candidate jump is simply the vector difference between two randomly selected training points. The jump is always accepted if it lowers the energy. If it raises the energy it is accepted with a probability of exp(\u2212\u2206E). Since the probability that point A in the space will be offered a jump to point B is the same as the probability that B will be offered a jump to A, the jumps do not affect detailed balance. 
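The point-to-point wormhole move just described can be sketched as follows; the 1-D double-well energy and the synthetic data are illustrative assumptions:

```python
# A sketch of the point-to-point wormhole move: the candidate jump is the
# vector difference between two randomly selected training points, and it
# is accepted with the Metropolis rule. The 1-D double-well energy and the
# synthetic data below are stand-ins for illustration.
import numpy as np

def energy(x):
    # Two minima near -2 and +2 separated by a barrier (illustration only).
    return 0.5 * np.minimum((x + 2.0) ** 2, (x - 2.0) ** 2)

def wormhole_move(x, data, rng):
    i, j = rng.choice(len(data), size=2, replace=False)
    y = x + (data[i] - data[j])          # candidate long-range jump
    dE = energy(y) - energy(x)
    # Always accept downhill jumps; accept uphill jumps with prob exp(-dE).
    if dE <= 0 or rng.uniform() < np.exp(-dE):
        return y
    return x
```

Because the offered jump from A to B is exactly as likely as the offered jump from B to A, the move preserves detailed balance, as noted above.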
One way to think about the jumps is to imagine that every point in the dataspace is connected by wormholes to n(n \u2212 1) other points so that it can move to any of these points in a single step.\n\n^1 Note that depending on the height of the energy barrier between the modes this may take too long for practical purposes.\n\nTo understand how the long-range moves deal with the trade-off between energy and entropy, consider a proposed move that is based on the vector offset between a training point that lies in a deep narrow energy minimum and a training point that lies in a broad shallow minimum. If the move is applied to a random point in the deep minimum, it stands a good chance of moving to a point within the broad shallow minimum, but it will probably be rejected because the energy has increased. If the opposite move is applied to a random point in the broad minimum, the resulting point is unlikely to fall within the narrow minimum, though if it does it is very likely to be accepted. If the two minima have the same free energy, these two effects exactly balance.\n\nJumps generated by random pairs of datapoints work well if the minima are all the same shape, but in a high-dimensional space it is very unlikely that such a jump will be accepted if different energy minima are strongly elongated in different directions.\n\n3 A local optimization-based method\n\nIn high dimensions the simple wormhole method will have a low acceptance rate because most jumps will land in high-energy regions. One way to avoid that is to use local optimization: after a jump has been made, descend into a nearby low-energy region. The obvious difficulty with this approach is that care must be taken to preserve detailed balance. We use a variation on the method proposed in [7]. It fits Gaussians to the detected low-energy regions in order to account for their volume.\n\nA Gaussian is fitted using the following procedure. 
Given a point x, let m_x be the point found by running a minimization algorithm on E(x) for a few steps (or until convergence) starting at x. Let H_x be the Hessian of E(x) at m_x, adjusted to ensure that it is positive definite by adding a multiple of the identity matrix to it. Let \u03a3_x be the inverse of H_x. A Gaussian density g_x(y) is then defined by the mean m_x and the covariance matrix \u03a3_x.\n\nTo generate a jump proposal, we make a forward jump by adding the vector difference d between two randomly selected data points to the initial point x_0, obtaining x. Then we compute m_x and \u03a3_x, and sample a proposed jump destination y from g_x(y). Then we make a backward jump by adding \u2212d to y to obtain z, and compute m_z and \u03a3_z, specifying g_z(x). Finally, we accept the proposal y with probability\n\np = min(1, [exp(\u2212E(y)) g_z(x_0)] / [exp(\u2212E(x_0)) g_x(y)]).\n\nOur implementation of the algorithm executes 20 steps of steepest descent to find m_x and m_z. To save time, instead of computing the full Hessian, we compute a diagonal approximation to the Hessian using the method proposed in [1].\n\n4 Gaping wormholes\n\nIn this section we describe a third method based on \u201cdarting MCMC\u201d [8] to jump between the modes of a distribution. The idea of this technique is to define spherical regions on the modes of the distribution and to jump only between corresponding points in those regions. When we consider a long-range move we check whether or not we are inside a wormhole. When inside a wormhole we initiate a jump to some other wormhole (e.g. chosen uniformly); when outside we stay put in order to maintain detailed balance. If we make a jump we must also use the usual Metropolis rejection rule to decide whether to accept the jump.\n\nIn high dimensional spaces this procedure may still lead to unacceptably high rejection rates because the modes will likely decay sharply in at least a few directions. 
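The optimization-based proposal of the previous section can be sketched as follows, under simplifying assumptions: a toy quadratic energy whose Hessian H is known exactly (so the diagonal approximation is unnecessary) and plain steepest descent in place of a tuned optimizer. All names here are illustrative.

```python
# A sketch of the optimization-based jump proposal: forward jump by a data
# difference d, descend to fit a local Gaussian, sample the destination
# from it, and accept with the ratio given in the text. The toy quadratic
# energy and the descent settings are illustrative assumptions.
import numpy as np

H = np.diag([1.0, 25.0])                          # toy energy: E(x) = x'Hx/2

def energy(x):
    return 0.5 * x @ H @ x

def grad(x):
    return H @ x

def fit_gaussian(x, n_descent=20, lr=0.05):
    m = x.copy()
    for _ in range(n_descent):                    # steepest descent toward m_x
        m -= lr * grad(m)
    return m, np.linalg.inv(H)                    # Sigma_x = inverse Hessian

def log_gauss(y, m, cov):
    d = y - m
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(d) * np.log(2 * np.pi))

def propose_and_accept(x0, d, rng):
    x = x0 + d                                    # forward jump by data difference d
    mx, covx = fit_gaussian(x)
    y = rng.multivariate_normal(mx, covx)         # proposed destination from g_x
    z = y - d                                     # backward jump
    mz, covz = fit_gaussian(z)
    log_ratio = (-energy(y) + energy(x0)
                 + log_gauss(x0, mz, covz) - log_gauss(y, mx, covx))
    return y if np.log(rng.uniform() + 1e-300) < min(0.0, log_ratio) else x0
```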
Since these ridges of probability are likely to be uncorrelated across the modes, the proposed target location of the jump will most of the time have very low probability, resulting in almost certain rejection. To deal with this problem, we propose a generalization of the described method, where the wormholes can have arbitrary shapes and volumes. As before, when we are considering a long-range move we first check our position, and if we are located inside a wormhole we initiate a jump (which may be rejected) while if we are located outside a wormhole we stay put. To maintain detailed balance between wormholes we need to compensate for their potentially different volume factors. To that end, we impose the constraint\n\nV_i P_{i\u2192j} = V_j P_{j\u2192i}    (4)\n\non all pairs of wormholes, where P_{i\u2192j} is a transition probability and V_i and V_j are the volumes of the wormholes i and j respectively. This in fact defines a separate Markov chain between the wormholes with equilibrium distribution\n\nP^EQ_i = V_i / \u03a3_j V_j    (5)\n\nThe simplest method^2 to compensate for the different volume factors is therefore to sample a target wormhole from this distribution P^EQ. When the target wormhole has been determined we can either sample a point uniformly within its volume or design some deterministic mapping (see also [4]). Finally, once the arrival point has been determined we need to compensate for the fact that the probability of the point of departure is likely to be different than the probability of the point of arrival. The usual Metropolis rule applies in this case,\n\nP_accept = min(1, P_arrive / P_depart)    (6)\n\nThis combined set of rules ensures that detailed balance holds and that the samples will eventually come from the correct probability distribution. One way of employing this sampler in conjunction with contrastive divergence learning is to fit a \u201cmixture of Gaussians\u201d model to the data distribution in a preprocessing step. The region inside an iso-probability contour of each Gaussian mixture component defines an elliptical wormhole with volume\n\nV_ellipse = \u03c0^{d/2} \u03b1^d (\u03a0^d_{i=1} \u03c3_i) / \u0393(1 + d/2)    (7)\n\nwhere \u0393(x) is the gamma function, \u03c3_i is the standard deviation of the i\u2019th eigen-direction of the covariance matrix and \u03b1 is a free parameter controlling the size of the wormhole. These regions provide good jump points during CD-learning because it is expected that the valleys in the energy landscape correspond to the regions where the data cluster. To minimize the rejection rate we map points in one ellipse to \u201ccorresponding\u201d points in another ellipse as follows. Let \u03a3_depart and \u03a3_arrive be the covariance matrices of the wormholes in question, and let \u03a3 = U S U^T be an eigenvalue decomposition. The following transformation maps iso-probability contours in one wormhole to iso-probability contours in another,\n\nx_arrive \u2212 \u00b5_arrive = \u2212U_arrive S^{1/2}_arrive S^{\u22121/2}_depart U^T_depart (x_depart \u2212 \u00b5_depart)    (8)\n\nwith \u00b5 the center location of the ellipse. The negative sign in front of the transformation is to promote better exploration when the target wormhole turns out to be the same as the wormhole from which the jump is initiated. It is important to realize that although the mapping is one-to-one, we still need to satisfy the constraint in equation 4 because a volume element dx will change under the mapping. Thus, wormholes are sampled from P^EQ and proposed moves are accepted according to equation 6.\n\nFor both the deterministic and the stochastic moves we may also want to consider regions that overlap. 
For instance, if we generate wormholes by fitting a mixture of Gaussians it is very hard to check whether these regions overlap somewhere in space. Fortunately, we can adapt the sampling procedure to deal with this case as well. First define n_arrive as the total number of regions that contain point x_arrive and similarly for n_depart. Detailed balance can still be maintained for both deterministic and stochastic moves if we adapt the Metropolis acceptance rule as follows,\n\nP_accept = min(1, (n_depart P_arrive) / (n_arrive P_depart))    (9)\n\nFurther details can be found in [6].\n\n^2 Other methods that respect the constraint 4 are possible but are suboptimal in the sense that they mix more slowly toward the equilibrium distribution.\n\nFigure 3: (a) Dataset of 1024 cases uniformly distributed on 2 orthogonal narrow rectangles. (b) Probability density of the model learned with contrastive divergence. The size of each square indicates the probability mass at the corresponding location.\n\n5 An experimental comparison of the three methods\n\nTo highlight the difference between the point and the region wormhole sampler, we sampled 1024 data points along two very narrow orthogonal ridges (see figure 3a), with half of the cases in each mode. A model with the same architecture as depicted in figure 1 was learned using contrastive divergence, but with \u201cCauchy\u201d nonlinearities of the form f(x) = log(1 + x^2) instead of the logistic function. The probability density of the model that resulted is shown in figure 3b. Clearly, the lack of mixing between the modes has resulted in one mode being much stronger than the other one. Subsequently, learning was resumed using a Markov chain that proposed a long-range jump for all confabulations after each brief HMC run. 
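The region-to-region move used in these experiments (equations 4 through 8) can be sketched as follows; the two Gaussian regions, the value of \u03b1, and the toy energy are illustrative assumptions:

```python
# A sketch of the region-to-region ("darting") wormhole move: elliptical
# wormholes from Gaussian components, a target region sampled in proportion
# to volume (equation 5), the iso-probability mapping of equation (8), and
# the Metropolis rule of equation (6). Regions and energy are toy examples.
import numpy as np

class Region:
    def __init__(self, mu, cov, alpha=2.0):
        self.mu, self.cov, self.alpha = mu, cov, alpha
        w, U = np.linalg.eigh(cov)                # cov = U S U^T
        self.U, self.sqrt_S = U, np.diag(np.sqrt(w))
        # Ellipse volume up to a constant shared by all regions (equation 7).
        self.volume = np.prod(alpha * np.sqrt(w))

    def contains(self, x):
        d = x - self.mu
        return d @ np.linalg.solve(self.cov, d) <= self.alpha ** 2

def map_between(x, dep, arr):
    # Equation (8): map iso-probability contours of the departure ellipse
    # onto iso-probability contours of the arrival ellipse (note the sign).
    u = np.linalg.inv(dep.sqrt_S) @ dep.U.T @ (x - dep.mu)
    return arr.mu - arr.U @ arr.sqrt_S @ u

def region_move(x, regions, energy, rng):
    containing = [r for r in regions if r.contains(x)]
    if not containing:
        return x                                  # outside every wormhole: stay put
    vols = np.array([r.volume for r in regions])
    k = int(rng.choice(len(regions), p=vols / vols.sum()))   # sample from P^EQ
    y = map_between(x, containing[0], regions[k])
    dE = energy(y) - energy(x)
    return y if rng.uniform() < np.exp(-dE) else x           # equation (6)
```

Mapping a point from one ellipse to another and back returns the original point, so the deterministic mapping is one-to-one, as the text requires.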
The regions in the region wormhole sampler were generated by fitting a mixture of two Gaussians to the data using EM, and setting \u03b1 = 10. Both the point wormhole method and the region wormhole method were able to correct the asymmetry in the solution, but the region method does so much faster, as shown in figure 4b. The reason is that with the point wormhole method a much smaller fraction of the confabulations succeed in making a long-range jump, as shown in figure 4a.\n\nWe then compared all three wormhole algorithms on a family of datasets of varying dimensionality. Each dataset contained 1024 n-dimensional points, where n was one of 2, 4, 8, 16, or 32. The first two components of each point were sampled uniformly from two axis-aligned narrow orthogonal ridges and then rotated by 45\u00b0 around the origin to ensure that the diagonal approximation to the Hessian, used by the local optimization-based algorithm, was not unfairly accurate. The remaining n \u2212 2 components of each data point were sampled independently from a sharp univariate Gaussian with mean 0 and std. 0.02.\n\nFigure 4: (a) Number of successful jumps between the modes for point wormhole MCMC (dashed line) and region wormhole MCMC (solid line). (b) Log-odds of the probability masses contained in small volumes surrounding the two modes for the point wormhole method (dashed line) and the region wormhole method (solid line). The log-odds is zero when the probability mass is equal in both modes.\n\nThe networks used for comparison had architectures identical to the one depicted in Figure 1 in all respects except for the number and the type of units used. The second hidden layer consisted of Cauchy units, while the first hidden layer consisted of some Cauchy and some sigmoid units. The networks were trained for 2000 parameter updates using HMC without wormholes. 
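The family of benchmark datasets described above can be sketched as follows; the ridge length and the exact uniform widths are assumptions, since the paper only specifies that the ridges are "narrow":

```python
# A sketch of the benchmark dataset construction: points on two narrow
# orthogonal ridges in the first two dimensions, rotated by 45 degrees,
# with the remaining coordinates drawn from a sharp Gaussian (std 0.02).
# Ridge length and width values are illustrative assumptions.
import numpy as np

def make_dataset(n_points=1024, n_dims=8, ridge_len=4.0, ridge_width=0.02, seed=0):
    rng = np.random.default_rng(seed)
    half = n_points // 2
    # Ridge 1 along the x-axis, ridge 2 along the y-axis (axis-aligned).
    r1 = np.stack([rng.uniform(-ridge_len / 2, ridge_len / 2, half),
                   rng.uniform(-ridge_width / 2, ridge_width / 2, half)], axis=1)
    r2 = np.stack([rng.uniform(-ridge_width / 2, ridge_width / 2, n_points - half),
                   rng.uniform(-ridge_len / 2, ridge_len / 2, n_points - half)], axis=1)
    xy = np.vstack([r1, r2])
    c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
    xy = xy @ np.array([[c, -s], [s, c]]).T        # rotate both ridges by 45 degrees
    # Remaining n - 2 coordinates: sharp univariate Gaussian, mean 0, std 0.02.
    extra = rng.normal(0.0, 0.02, size=(n_points, n_dims - 2))
    return np.hstack([xy, extra])
```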
To speed up the training, an adaptive learning rate and a momentum of 0.95 were used. We also used a weight decay rate of 0.0001 for weights and 0.000001 for scales. Gaussian noise was added to the last n \u2212 2 components of each data point. The std. of the noise started at 0.2 and was gradually decreased to zero as training progressed. This prevented HMC from being slowed down by the narrow energy ravines resulting from the tight constraints on the last n \u2212 2 components.\n\nAfter the model was trained (without wormholes), we compared the performance of the three jump samplers by allowing each sampler to make a proposal for each training case and then comparing the acceptance rates. This was repeated 25 times to improve the estimate of the acceptance rate. In each sampler, HMC was run for 10 steps before offering points an opportunity to jump.\n\nThe average number of successful jumps between modes per iteration is shown in the table below.\n\nDimensionality | Network architecture | Simple wormholes | Optimization-based | Region wormholes\n2 | 10+10, 2 | 10 | 15 | 372\n4 | 20+10, 4 | 6 | 17 | 407\n8 | 20+10, 6 | 3 | 19 | 397\n16 | 40+10, 8 | 1 | 13 | 338\n32 | 50+10, 10 | 1 | 9 | 295\nRelative run time | | 1 | 2.6 | 1\n\nThe network architecture column shows the number of units in the hidden layers, with each entry giving the number of Cauchy units plus the number of sigmoid units in the first hidden layer and the number of Cauchy units in the second hidden layer.\n\n6 Summary\n\nMaximum likelihood learning of energy-based models is hard because the gradient of the log probability of the data with respect to the parameters depends on the distribution defined by the model and it is computationally expensive to even get samples from this distribution. Minimizing 
contrastive divergence is much easier than maximizing likelihood but the brief Markov chain does not have time to mix between separated modes in the distribution^3. The result is that the local structure around each data cluster is modelled well, but the relative masses of different clusters are not. In this paper we proposed three algorithms to deal with this phenomenon. Their success relies on the fact that the data distribution provides valuable suggestions about the location of the modes of a good model. Since the probability of the model distribution is expected to be substantial in these regions, they can be successfully used as target locations for long-range moves in an MCMC sampler.\n\nThe MCMC sampler with point-to-point wormholes is simple but has a high rejection rate when the modes are not aligned. Performing local gradient descent after a jump significantly increases the acceptance rate, but only leads to a modest improvement in efficiency because of the extra computations required to maintain detailed balance. The MCMC sampler with region-to-region wormholes targets its moves to regions that are likely to have high probability under the model and therefore has a much better acceptance rate, provided the distribution can be modelled well by a mixture. None of the methods we have proposed will work well for high-dimensional, approximately factorial distributions that have exponentially many modes formed by the cross-product of multiple lower-dimensional distributions.\n\nAcknowledgements This research was funded by NSERC, CFI, and OIT. We thank Radford Neal and Yee-Whye Teh for helpful advice and Sam Roweis for providing software.\n\nReferences\n\n[1] S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second-order methods. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proc. of the 1988 Connectionist Models Summer School, pages 29\u201337, San Mateo, 1989. Morgan Kaufmann.\n\n[2] Y. 
Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In Advances in Neural Information Processing Systems, 2001.\n\n[3] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771\u20131800, 2002.\n\n[4] C. Jarzynski. Targeted free energy perturbation. Technical Report LAUR-01-2157, Los Alamos National Laboratory, 2001.\n\n[5] R.M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, Computer Science, 1993.\n\n[6] C. Sminchisescu, M. Welling, and G. Hinton. Generalized darting Monte Carlo. Technical Report CSRG-478, University of Toronto, 2003.\n\n[7] H. Tjelmeland and B.K. Hegstad. Mode jumping proposals in MCMC. Technical Report Statistics no. 1/1999, Norwegian University of Science and Technology, Trondheim, Norway, 1999.\n\n[8] A. Voter. A Monte Carlo method for determining free-energy differences and transition state theory rate constants. 82(4), 1985.\n\n^3 However, note that in cases where the modes are well separated, even Markov chains that run for an extraordinarily long time will not mix properly between those modes, and the results of this paper become relevant.\n", "award": [], "sourceid": 2400, "authors": [{"given_name": "Max", "family_name": "Welling", "institution": null}, {"given_name": "Andriy", "family_name": "Mnih", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}