<title>Quantum molecular dynamics on massively parallel computers</title>  <A HREF="http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page1.gif><IMG SRC=http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page1-ic.gif></a>  <cite> The first four planes of carbon (from left to right) at the end of a Quantum Molecular Dynamics simulation on the Cray T3D for the graphitization of a diamond surface</cite> <p>  <center> <A name="f"><h1>Dynamique molculaire quantique sur ordinateurs massivement parallles</h1></a> <address>par  Alessandro De Vita, Andrew Canning*, Giulia Galli, Franois Gygi, Francesco Mauri  and Roberto Car, IRRMA - EPFL   [* Cray Research Switzerland]</address> </center> <p> <font size=7> En </font> Dynamique Molculaire Quantique (QMD), les atomes bougent sous l'action de forces drives directement de l'tat fondamental lectronique. Les lectrons participant aux liaisons chimiques, qui se font et se dfont pendant le mouvement des atomes, sont traits explicitement selon un modle de mcanique quantique. Ceci implique que les simulations standard ncessitent une trs grandes quantits de ressources informatiques, mme pour traiter des petits systmes de l'ordre d'une centaine d'atomes. Les simulations sont bases sur la physique des atomes qui bougent sous l'action d'interaction mutuelle de courte porte: cette classe de problme est intrinsquement parallle (parce que locale). Cette caractristique rend la QMD particulirement bien adapte aux ordinateurs parallles pouvant fournir la puissance de traitement ncessaire  traiter des problmes encore plus grands, et sur des chelles de temps encore plus longues. Nous prsentons dans cet article le portage de deux codes sur l'ordinateur Cray T3D. Le premier, d'ordre O(<img align=middle src=http://sawww.epfl.ch/SIC/SA/publications/SCR94/n3.gif>), est un code ab initio de Dynamique Molculaire, alors que le second d'ordre O(N) est bas sur le modle dit <i>Tight Binding.</i> Ces codes ont t optimiss et tests sur des systmes de production. Quelques nouveaux rsultats physiques applicables au comportement structurel de systmes bass sur l'lement carbone ont pu ainsi tre obtenus.  <p> <B> EPFL Supercomputing Review - n. 6 - nov. 94</B><p> <p> <A HREF="http://sawww.epfl.ch/SIC/SA/publications/SCR94/scr-94.html#f><IMG SRC="http://sawww/SIC/SA/publications/retour.gif"></A><p> <p> <hr>  <A HREF="http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page1.gif><IMG SRC=http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page1-ic.gif></a>  <cite> The first four planes of carbon (from left to right) at the end of a Quantum Molecular Dynamics simulation on the Cray T3D for the graphitization of a diamond surface</cite> <p> <center> <A name="a"><h1>Quantum molecular dynamics on massively parallel computers</h1></a>  <address>by Alessandro De Vita, Andrew Canning*, Giulia Galli, Franois Gygi, Francesco Mauri  and Roberto Car, IRRMA - EPFL   [* Cray Research Switzerland]</address></center> <p> <font size=7> In </font> Quantum Molecular Dynamics algorithms atoms move under the influence of forces which are directly derived from the instantaneous electronic ground state. Electrons participating in chemical bonds, which are forming and breaking during the atomic motion, are treated explicitly in a full quantum mechanical way. This means that standard simulations require extremely large computer resources, even to treat small systems of the order of a hundred atoms. The physics underlying the simulations is one of atoms which move according to an effectively short-ranged mutual interaction, so that this class of problems is parallel (since local) in nature. This opens the way for the use of parallel computers, which can provide the computing power necessary to treat larger systems for longer time scales.  In this article we discuss the porting of two codes to the Cray T3D parallel computer. The first code is a first-principles molecular dynamics code and the second is an O(N) scaling quantum molecular dynamics code based on the tight binding model. These codes have been optimised and tested on real production systems. Some novel physical results have been already obtained for the structural behaviour of carbon-based systems.  <p> <hr> <p>  <h2>contents</h2> <ul> <li><a href=#1>Introduction</a> <li><a href=#2>Parallel codes</a> <li><a href=#3>First-Principles Quantum MD code</a> <li><a href=#4>Tight Binding Quantum MD</a> <li><a href=#5>Conclusion</a> <li><a href=#6>Acknoledgments</a> <li><a href=#7>Bibliography</a></ul><p>    <a name="1"><h2>Introduction</h2></a>   Molecular Dynamics (MD) computer simulations numerically solve Newton's equations of motion for an ensemble of atoms. The atomic trajectories are obtained by numerical integration for appropriately long time intervals, and equilibrium properties can be calculated as temporal averages over the observation time. Provided that the interaction between atoms is represented with realistic potentials, MD techniques allow accurate dynamical modelling of complex systems such as bulk solids, liquids and molecules. Numerical experiments based on MD are very useful, since they provide details of the atomic dynamics that would not otherwise be available in real experiments. MD techniques can also be used in a predictive manner to construct and study the stability and behaviour of new materials, and as such are a promising tool for chemical design by computer.<p> In quantum MD simulations the validity of the Born-Oppenheimer (BO) adiabatic approximation is assumed, according to which the electrons adjust instantaneously to the motion of the atomic nuclei, and remain close to their ground state at each MD step. The electronic ground-state energy as a function of the nuclear coordinates constitutes the so-called BO potential energy surface. The nuclear motion on this energy surface is assumed to obey classical mechanics. The above two assumptions are satisfied for a very large class of systems of interest in material science. Therefore, a crucial issue in MD simulations is the accurate modelling of the BO surface. <p> In the Car-Parrinello (CP) first-principles MD scheme <a href=#11>[1]</a> the energy surfaces are calculated in the framework of Density Functional Theory (DFT). In DFT the quantum many-body problem of the interacting electrons is recast into a set of single particle Schrdinger equations for non-interacting electrons immersed in a self-consistent potential. The CP scheme was the first approach to efficiently combine DFT and MD methods leading the way for large scale dynamical simulations of quantum systems. The method is first-principles in the sense that there are no free parameters in the system that are adjusted to reproduce experimental data. The atomic numbers and atomic positions (plus the initial velocities) are sufficient to define the system. In the CP scheme one considers a fictitious dynamical system whose degrees of freedom are the nuclear coordinates (RI) and the expansion coefficients ci,<b>G </b> of the occupied one-electron orbitals. The orbitals are normally expanded on a set of plane-waves (PW) up to some cut-off energy in the <i>G </i>vectors:<p> <center><IMG SRC=http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page23.gif>  (1)<p></center> In the global electron-ion parameter space the potential energy surface is given by the energy functional E (ci,<B>G</B>, RI). The BO potential energy surface corresponds to the minimum of E (ci,<B>G</B>, RI) with respect to the ci,<B>G</B> 's. In the CP scheme the parameters ci,<B>G</B>  and RI  evolve with a fictitious dynamics which keeps them very close to the BO surface. Therefore, the CP algorithm does not require a self-consistent electronic structure minimization at each MD time step. <p> In the case of Tight Binding (TB) models we write down a Hamiltonian for an electron-ion system with free parameters, chosen to fit experimental data and first-principles simulations. In this sense it is not a first-principles scheme and different sets of parameters and Hamiltonians are required for different types of systems. The wavefunctions are expanded on a basis set of atomic orbitals localised on the atomic positions. This much simpler basis set allows larger systems to be studied than the first-principle schemes. In our simulations using the (TB) model at each (MD) step we minimise the energy functional with respect to the electronic degrees of freedom to move down to the BO surface.<p> In recent years quantum MD simulations have been used to study a variety of systems. First-principles MD simulations are now performed in many laboratories and there is a growing list of achievements in solving problems for real materials which were inaccessible until a few years ago. These include physical properties of amorphous states, the investigation of processes relevant to semiconductor technology, the study of atomic clusters, surface reconstructions and chemisorption phenomena. <p> Quantum MD simulations are extremely demanding in computer power and memory, and it has only been through the advancement of computer technology (most notably large vector machines) that it has become possible to study systems of the order of a hundred atoms over reasonable time scales. This has made accessible the study of certain materials and phenomena of technological interest but to go further we now require more powerful computers with more memory.  Parallel machines take advantage of low price RISC and memory chips to build computers with very cost effective high peak performance and memory. However, they have to be programmed with new parallel languages which can often involve restructuring of existing serial codes and a rethinking of our problems in parallel. Another way forward is to develop new algorithms which allow larger systems to be studied than before with the same computer resources but having a similar accuracy to existing techniques. At IRRMA we are exploring both of these approaches with the porting of codes to the T3D parallel machine and the development of new algorithms whose computational cost scales linearly with the system size. <p> The computational cost of standard Quantum MD schemes, which use extended wavefunctions, scales as <img align=middle src=http://sawww.epfl.ch/SIC/SA/publications/SCR94/n3.gif> (N is the number of atoms in the system). This is due to the fact that at each step we have to orthogonalise the wavefunctions to satisfy the Pauli principle.  This can be done through a direct orthogonalisation procedure on the wavefunctions once they have evolved under the fictitious dynamics or an implicit orthogonalisation through the minimisation of an energy functional containing terms that force the orthogonality at the minimum. In both cases the computational cost scales with <img align=middle src=http://sawww.epfl.ch/SIC/SA/publications/SCR94/n3.gif>. The main idea of O(N) schemes is that we work with wavefunctions that are localised in a given region of space <a href=#12>[2]</a>. These regions must be chosen to be large enough so that the errors introduced by localisation do not affect the computed properties. In most materials, effective interactions are short-range so that these materials can be modelled well with localised wavefunctions. The local representation of the wavefunctions means that we only require a small number of components to represent them. The number of components remains constant as we increase the system size, in contrast to the standard methods. The second important idea in the O(N) scheme used in our code <a href=#13>[3]</a> is to work with an energy functional, with implicit orthogonality constraints which only contains matrix products and no matrix inversions, which are computationally very costly. With this formulation the number of calculations for each MD step scales linearly with N.   <a name="2"><h2>Parallel codes</h2></a>  We now turn to some general considerations about the porting of the O(<img align=middle src=http://sawww.epfl.ch/SIC/SA/publications/SCR94/n3.gif>) and O(N) Quantum Molecular Dynamic codes onto a massively parallel supercomputer, the Cray T3D at the EPFL in Lausanne. This machine belongs to a more general class of parallel computers (called MIMD: Multiple Instruction Multiple Data) whose main characteristic is the course-grained architecture, with a relatively limited number of processors (in general a few hundred), each of which is capable of a desktop workstation level of performance. The CPUs are linked by a network of communication channels, which allows synchronisation and high bandwidth data exchange.<p> Each processing element contains a CPU and its own substantial amount of local (private) memory (a few Mwords). When the data stored in such memory is required on a different processor for the execution of the program, communication routines are needed. A first strategy of parallel implementation consists of using a <i>global data addressing </i>programming model. In this programming model the programmer specifies his variables with a syntax practically equivalent to the one used in serial codes, and the compiler transparently looks after all communications (the T3D comes equipped with one of these models, called CRAFT). A second strategy consists of  handling the communication between processors with appropriate instructions explicitly included in the source code. This involves a greater programming effort, but also usually achieves, at the current state of the art, better execution efficiency. <p> We chose for the porting this second parallel programming strategy, which is called Message Passing and makes use of libraries of communication routines. These routines are invoked by the source code with the usual subroutine call syntaxis, the passed arguments being the data vectors which need to be communicated and the logical addresses of the sending and of the receiving processors. The goal of any effective porting scheme is to minimise the efficiency loss due to "parallelisation overheads", so that the execution time scales (almost) linearly with the inverse of the number of processors used. These overheads can derive from <i>residual sequential computations </i>(not every portion of the source code can run in parallel), or from<i> uneven distribution of computations </i>(the data structures might not be exactly divisible between processors, or the single processor's workload might not be known in advance). A further, and very important, source of efficiency loss is represented by <i>overheads of data communication, </i>since normally the processors involved in the communication are inactive while data is transferred.   <a name="3"><h2>First-Principles Quantum MD code</h2></a>  <h3>Features of the data layout</h3>  This code is based on an O(<img src=http://sawww.epfl.ch/SIC/SA/publications/SCR94/n3.gif>) scaling algorithm. The parallelisation strategy adopted for the porting of this code is naturally determined by the layout of the data structures, and by the analysis of the operations which have to be performed on such data by the code to accomplish a dynamical timestep of the CP algorithm. The basic data ingredients of a quantum molecular dynamics code are (1) the electron wavefunctions, which are represented by vectors of data, and (2) the matrices which represent linear operations on the electronic wavefunctions. The expansion coefficients of the wavefunctions are the memory intensive data of the problem, and any "data driven" parallelisation scheme of the algorithm is mainly about how to distribute such data among the processors. Another important quantity in the problem is the electron density r(<b>r</b>) given by  <p><center <IMG SRC=http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page25.gif>  (1) (2)<p>></center> This density is used in the algorithm to evaluate a few contribution terms to the total potential felt by each electron in the physical system, such as the electrostatic repulsion with other electrons. Since all these operations are performed in real space, Fast Fourier Transformations (FFT) between <B>G</B>-space and real space have to be used.   Parallelisation strategy  The parallelisation strategy adopted in the porting of the O(<img align=middle src=http://sawww.epfl.ch/SIC/SA/publications/SCR94/n3.gif>) code is based on the distribution among processors of columns of <B>G</B>-vectors in the reciprocal space FFT mesh (and in the real space mesh connected by the transformation)<a href=#15>[5]</a> . The scalar products within orbitals and all the real space quantities are efficiently computed, since they correspond to integrals whose domain is distributed among processors, so that only the integration subtotals have to be communicated between computing nodes. At the same time, the three-dimensional FFT are communication intensive operations within the present scheme, and optimal communication routines "dedicated" to the specific problem had to be coded to maintain efficiency of computations. The data matrices required by the algorithm were also distributed (above all, for memory reasons), and specific communication routines were coded to perform the necessary distributed matrix algebra. Finally, the electronic orbitals can be chosen to be <i>real </i> functions, so that two of them can be packed into a complex-to-complex FFT, while only half of the memory allocation which would be necessary for complex orbitals is actually needed. This technique does not introduce any extra communications between computing nodes, if the data distribution is properly handled.  Code Performance  The performance of the parallel code compares well with vector versions of the same code. Software optimisation was performed to maximise the use of library routines, while all the important data transfers between processors were handled with machine dedicated high performance shared memory communications (the SHMEM library). As a result, 46 % of the computation time of a typical production run is spent in highly optimised linear algebra library routines on private data, and 20 % in performing on-node one-dimensional FFTs and non-library calculations. In the first few months of availability of the T3D in Lausanne (128 processors), this meant 66 % of the code was running on the full machine at about 4.7 GFLOPS, with a global sustained performance well in excess of 3 GFLOPS.  Calculations performed on a 256 processor Cray T3D machine in Eagan, Minnesota, showed that computation times for production applications on bigger and smaller partitions deviated only a few percent from ideal linear scaling.  The first completed application  The ultimate goal of the algorithm analysis and of the parallel implementation described above is, of course, to make feasible long molecular dynamics simulations on large physical systems. The system we selected for a first application is a  (111) oriented diamond slab. Diamond is a metastable form of carbon with a wide electronic band gap and an unusually favourable combination of characteristics such as high mechanical strength, excellent thermal conductivity, and high resistance to radiation and temperature. The study of diamond growth processes is therefore of both scientific and technological interest. The (111) surface of diamond is a growth plane for chemical vapor deposited carbon films. Its structural behaviour as a function of temperature plays an important role in determining the morphology of these films <a href=#14>[4]</a>  which are widely studied to develop anti-reflective hard coatings. With the purpose of investigating thermal properties and structural transitions of the system, we have performed extended ab initio dynamical simulations on a supercell containing a slab of 192 carbon atoms composed of 12 (111) planes, and 9.5  of vacuum. The system was structurally relaxed and gradually heated up: at the temperature of 2900 degreee K we observed a structural transition of the slab, from the original diamond phase to a layered graphitic structure (see figure 1), and obtained detailed insight into the phase transition phenomenon at the atomic level, on a time resolution inaccessible by experiment. <p> <cite> <A HREF="http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page24a.gif><IMG SRC=http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page24a-ic.gif></a> Figure 1a: Graphitization of a diamond slab.   The 192-atom system in the diamond starting state. Even with the thermal disorder at 2900degrees K the diamond structure is clearly visible.  <p> <A HREF="http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page24b.gif><IMG SRC=http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page24b-ic.gif></a> Figure 1b: Graphitization of a diamond slab.   The system after graphitization takes place: note the planar structure corresponding to weakly bonded carbon planes</cite><p> <p> <a name="4"><h2>Tight Binding Quantum MD</h2></a>  <h3>Features of the data layout</h3>  This code is based on an O(N) algorithm. The TB code which was ported to the T3D uses a TB Hamiltonian for carbon systems [<a href=#16>[6]</a> . We work with two electronic states localised on each site which are expanded in 4 atomic orbitals. The localised region is centered on the atom and is typically taken up to the 2nd or 3rd nearest neighbour. This means that the wavefunctions of the electrons in the localised region are typically described by 20-30 small (2 x 4) matrices. At each molecular dynamics step the wavefunctions are quenched down to their ground state using a Conjugate Gradient (CG) minimisation. This means that for each molecular dynamics step we typically need around 15-20 CG gradient steps (although the exact number is very system dependent). A typical system that was used for test runs is shown in figure 2.  <p> <cite><A HREF="http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page26.gif><IMG SRC=http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page26-ic.gif></a>  Figure 2: Fullerenes (C60) arriving on the reconstructed diamond(111) surface (The system used has a supercell containing a 3072 atom slab of diamond). The deposition of fullerences on a semiconducting substrate is of great technological interest to investigate the possibility of building up a hard diamond like surface on the substrate </cite><p>  The localisation of the wavefunctions introduces a sparse nature to all the wavefunctions and their products which are calculated throughout the code. This means that all the variables are sparse coded in that we store only the non-zero values plus an index matrix indicating the row position this corresponds to in the full matrix. All calculations are then done through indirect addressing. A typical calculation in the code (such as the overlap of the wavefunctions) would then involve pulling in two (2x4) matrices, from essentially random memory locations, multiplying them together and writing the result to another random memory location. The sparse nature of the data in this code is significantly different from other sparse code problems  since when we are performing the CG steps, the positions of the non-zero elements are not changing even if their values are changing. We can exploit this fact to speed up the code by precalculating index lists at each MD step which we then use in the CG steps to pick out the wavefunction values needed for all the calculations.    <h3>Parallelisation strategy</h3>  We implemented the parallel version of the code on the T3D using message passing rather than CRAFT to have flexibility of data layout for the load balancing and control over the communication structures.   There is a large amount of data communication required in the parallel version of this code so fast data communications are extremely important.  The parallel distribution of data on the T3D was chosen to reflect the natural parallel structure of the problem and to minimise data communications. Since the wavefunctions are localised on sites we distributed the data such that each processor has complete wavefunctions and performs all the calculations associated with the sites they are centered on. This should be contrasted with the spatial decomposition method we used to parallelize the O(<img align=middle src=http://sawww.epfl.ch/SIC/SA/publications/SCR94/n3.gif>) code where each processor has only a part of each of the wavefunctions.  <p> With the site decomposition data structure for the <i>O(N) </i>code, we only have to communicate wavefunctions (and their derivatives) at each of the CG steps. All the other data required to update the wavefunctions will be local. The data communication structure is organised as follows. At each MD step we make an index list of all the off-PE (Processing Element) wavefunctions required for calculations on that PE and also a list of which PE's they are on. At each CG step, before any calculations are performed, each processor copies into a dummy data array all the off-PE wavefunctions it needs for its calculations using the index lists calculated at the MD step.  Load Balancing is a very important issue in this program since the number of calculations can vary greatly from site to site. If we imagine, for example, that we are performing quantum molecular dynamics on a system with a surface (such as in figure 2), then the wavefunctions associated with the sites at the surface overlap with other wavefunctions to a much lesser extent than those in the bulk. Consequently the amount of calculation associated with the surface sites (which is roughly proportional to the total area of overlap they have with other sites) is much lower.  It should also be noted that different parts of the code use different neighbourhoods for their calculations (this is due to the hopping terms in the TB Hamiltonian), so if we load balance for one part of the code using one of the neighbourhoods, it is not necessarily load balanced for the others. The degrees of freedom we have in the problem for load balancing are that we can vary the number of sites on each processor and also choose exactly which sites go to which processor. Due to the atomic motion sites can move in and out of localisation regions and consequently the amount of calculations associated with each site varies during the simulation. Therefore, we require a dynamic load balancing scheme that can, if the atomic positions have significantly changed, reorder which sites lie on which processor during a run. This should be contrasted with the plane-wave code in which the wavefunctions are extended throughout the whole system:  even if the atoms are moving, the amount of calculation associated with each wavefunction remains constant.  <p> A simple load balancing algorithm, which minimises the variance of the calculation time on the processors, was used. We have many subroutines which must be individually load balanced because they are separated by barriers. In each subroutine we time the calculations associated with each site on each processor. A cost function is then constructed which is the sum of the variances of the times of calculation on each processor and in each subroutine. This cost function is minimised at the MD time step with a frequency which depends on the speed of the dynamics in the system under study. The data which stores the state of each wavefunction and the atomic positions is then reorganised in the memory of the machine to match the new loadbalanced mapping. In this way we carry out a dynamic load balancing during a given dynamical run. We have found that our algorithm will easily loadbalance down to about a few percent: the percentage being the ratio of barrier wait time to calculation time.<p>  Up to this point we have made no mention of the relationship between the spatial structure of the real problem and that on the machine. It is advantageous to have the sites on a given processor corresponding to a local spatial area of the real system as opposed to them being scattered throughout the system. This reduces the communications required for each processor since many wavefunctions, on a given PE, will overlap with the same wavefunctions on other PE's. The way we introduce a locality mapping of the real problem onto our processors is by distributing the sites in a local way at the start of any run and also by adding a locality term into the cost function. In this way we favour spatial clustering of the sites on each PE. The behaviour of this cost function is still under investigation to see how it performs with a number of systems having different geometries. It should be noted that loadbalancing is a much bigger issue in this code than locality since, depending on the system under study, bad load balancing can easily reduce the efficiency of the code by 50 % or more; on the other hand, a bad spatial mapping of the sites onto the PE's can double or treble the communications which are typically only about 10 % of the run time.   <h3>Performance</h3>  Due to the indirect addressing of all the calculations involved, this code performs well below peak performance on different machines. The vector version of this code runs at an average speed of about 250 MFLOPS on a single processor of the Cray C90 and the speed varies greatly depending on the system under study. At each time step the vector code performs exactly the same number of floating point calculations as the parallel code. On the T3D the routines doing the calculations (no communications) run at an average speed of about 20MFLOPS per processor. The communications, like the number of calculations, scale linearly with the number of atoms in the problem due to their local nature. This makes the problem very scalable on parallel machines. The table below shows the speed of the program for different machine sizes for the 3072 atom carbon slab shown in figure 2.<p> <IMG SRC=http://sawww.epfl.ch/SIC/SA/publications/SCR94/6-94-page27.gif>  (1)<p>  The deviation from linear scaling is caused by the fact that, as we double the number of processors, we half the amount of calculations done on each processor, but the amount of communications per processor stays approximately the same.  However, scaling up the system size with the number of processors would give an almost linear speedup curve. On a 64 processor machine the code spent about 77 % of its time in calculations, 15 % in communication and 8 % in barrier wait time. <p> On a 128 processor T3D we can perform an MD step in about 15s for the 3072 atom system when choosing a localisation region extending to the 3rd nearest neighbour.  The time step is 7.7 x 10-16 seconds so we can simulate 1ps of real time in 1300 MD steps.    <a name="5"><h2>Conclusions</h2></a>  We have sucessfully ported two large Quantum MD codes to the T3D which both perform well on this machine. We have shown that both codes can be efficiently programmed in parallel while requiring a lot of data communication which is global in the case of the first-principles code. Therefore, the fast communication network on the  T3D was very important in gaining good performances with these codes.  Both these codes are now being used in production runs to give new physical results some of which have been reported in this paper.   <a name="6"><h2>Acknowledgements</h2></a>  This work was done as part of the PATP (Parallel Application Technology Program) joint project between the EPFL and Cray Research. Support is also acknowledged from the Swiss National Science Foundation. In the course of this work we have benefited from useful discussions with other PATP collaborators and staff at Cray Research, Eagan, U.S.A.   <a name="7"><h2>Bibliography</h2></a>  <ol><li><a name="11">R. Car and M. Parrinello, Phys. Rev. <b>55</b>, 2471 (1985).</a> <li><a name="12">G. Galli and M. Parrinello, Phys. Rev. Lett. <b>69</b>, 3547 (1992).</a> <li><a name="13">F. Mauri, G. Galli and R. Car, Phys. Rev. B <b>47</b>, 9973 (1993).</a> <li><a name="14">W. R. L. Lambrecht <i>et al.</i>, Nature <b>364</b>, 607 (1993).</a> <li><a name="15">L. J. Clarke, I. Stich and M. C. Payne, Comput. Phys. Commun. <b>72</b>, 14 (1992).</a> <li><a name="16">C. Xu, C. Wang, C. Chan and K. Ho, J. Phys.: Condens Matter <b>4</b>, 6047 (1992).</a></ol>   <p> <hr><p>  <B>article paru  EPFL Supercomputing Review - n. 6 - nov. 94</B><p> <p> <A HREF="http://sawww.epfl.ch/SIC/SA/SCR94/scr-94.html "><IMG SRC="http://sawww.epfl.ch/SIC/SA/publications/som-1.gif"></A> <a href=http://www.epfl.ch/cgi-bin/sendmail.pl?henriette.raposo@sic.adm.epfl.ch">   <IMG SRC="http://sawww.epfl.ch/SIC/SA/publications/com-1.gif"></a><br> <p>  
