Hybrid Parallelization in LIGGGHTS
The discrete element method (DEM) has become a viable tool for simulating a variety of industrial processes. As simulations become larger and more complex, DEM codes such as LIGGGHTS must find new ways to deliver the performance needed to complete these simulations in a reasonable time.
To better tackle the load imbalance often found in granular simulations, a hybrid parallelization using both MPI and OpenMP has been developed. MPI domain decomposition is still used to generate multiple subdomains, which allows us to distribute the workload among multiple compute nodes. OpenMP is then used to fully utilize all resources on each node. Threaded versions of all major computational steps, including particle-particle and particle-wall interactions, were created and achieve better load balancing. In the second year of this project we completed and validated our approach using several real-world benchmarks. One test case studied was the discharge of 1.5 million particles from a silo (Fig. 1).
Fig. 1: Silo discharge
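The effect of the thread-level layer on load balance can be illustrated with a toy model. The sketch below is not LIGGGHTS code: the per-cell particle counts, the function names, and the assumption that threads share a subdomain's work perfectly evenly are all hypothetical, chosen only to show why merging small subdomains into larger threaded ones can reduce idle time.

```python
def makespan_mpi_only(counts):
    """One core per spatial cell: a step lasts as long as the busiest core needs."""
    return max(counts)


def makespan_hybrid(counts, threads_per_domain):
    """Merge adjacent cells into coarser subdomains.

    Assumption: the threads of a subdomain split its particles evenly,
    so each subdomain needs (its total work) / (its thread count).
    """
    spans = []
    for start in range(0, len(counts), threads_per_domain):
        domain_work = sum(counts[start:start + threads_per_domain])
        spans.append(domain_work / threads_per_domain)
    return max(spans)


def idle_fraction(counts, makespan):
    """Share of the available core time that is spent waiting."""
    capacity = makespan * len(counts)  # one core per original cell
    return 1.0 - sum(counts) / capacity


# Hypothetical particle counts in 8 equally sized cells of a draining silo:
# most particles sit in the middle, the outer cells are nearly empty.
counts = [50, 400, 900, 1200, 1100, 700, 150, 0]

t_mpi = makespan_mpi_only(counts)                         # 8 ranks x 1 thread
t_hyb = makespan_hybrid(counts, threads_per_domain=4)     # 2 ranks x 4 threads

print(f"MPI-only: step time {t_mpi}, idle {idle_fraction(counts, t_mpi):.0%}")
print(f"hybrid:   step time {t_hyb}, idle {idle_fraction(counts, t_hyb):.0%}")
```

In this made-up example the busiest core dictates the step time in the MPI-only layout, leaving about 53% of the core time idle. Spreading each coarse subdomain's particles over its threads lowers the idle share to about 12%.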
Particle-particle interactions dominate this computation. Serial runs of 100,000 time steps took 2 days and 16 hours on one 32-core AMD cluster blade. On four blades (128 cores), the MPI-only version completed in 1 hour and 49 minutes. Our hybrid version uses a simpler domain decomposition and minimizes processor idle time through load balancing (Fig. 2). It finished within 59 minutes, a 44% improvement over MPI-only.
Fig. 2: Runtime improvement seen in the Silo discharge example with 128 cores.
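The silo timings can be put into the usual speedup and efficiency terms with a few lines of arithmetic. The sketch below uses only the rounded wall-clock times quoted above; the helper names are hypothetical, and treating the serial run as a single-core baseline is an assumption.

```python
def to_minutes(days=0, hours=0, minutes=0):
    """Convert a wall-clock duration to minutes."""
    return (days * 24 + hours) * 60 + minutes


def speedup(t_ref, t_new):
    """How many times faster t_new is than t_ref."""
    return t_ref / t_new


def runtime_reduction(t_ref, t_new):
    """Fraction of the reference runtime that was saved."""
    return 1.0 - t_new / t_ref


# Rounded times from the silo discharge benchmark
serial = to_minutes(days=2, hours=16)       # assumed single-core baseline
mpi_only = to_minutes(hours=1, minutes=49)  # 128 cores
hybrid = to_minutes(minutes=59)             # 128 cores
cores = 128

for label, t in (("MPI-only", mpi_only), ("hybrid", hybrid)):
    s = speedup(serial, t)
    print(f"{label}: speedup {s:.1f}x, parallel efficiency {s / cores:.0%}")

print(f"hybrid vs MPI-only: {runtime_reduction(mpi_only, hybrid):.0%} less runtime")
```

With these rounded inputs the runtime reduction evaluates to roughly 46%, in the same range as the reported 44%. The reported figure was presumably computed from unrounded timings.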
We also simulated a mixing process in which a complex rotating mesh geometry interacts with about 770,000 large particles (Fig. 3). This 50,000-time-step benchmark took 15 hours and 38 minutes to complete on a single processor. An MPI-only simulation with 128 cores reduced the simulation time to 35 minutes. Thanks to its simpler domain decomposition, our hybrid version reduced the MPI communication overhead introduced by the rotating mesh geometry. It finished within 20 minutes on 128 cores, which is 42% faster than MPI-only.
Fig. 3: Mixing process
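One way to see why a coarser decomposition lowers communication cost is to count subdomain boundaries. The sketch below assumes a unit-cube domain cut into a regular grid of equal boxes. The two layouts compared (128 subdomains as 8 x 4 x 4 for MPI-only, 4 subdomains as 2 x 2 x 1 with 32 threads each for the hybrid) are illustrative assumptions, since the decompositions actually used in the benchmark are not stated here. The function names are hypothetical as well.

```python
def internal_interface_area(nx, ny, nz):
    """Total area of the internal cut planes of a unit cube split into
    an nx x ny x nz grid. Each cut plane spans the full cube, so it has area 1."""
    return (nx - 1) + (ny - 1) + (nz - 1)


def neighbor_pairs(nx, ny, nz):
    """Number of face-adjacent subdomain pairs that exchange boundary data."""
    return (nx - 1) * ny * nz + nx * (ny - 1) * nz + nx * ny * (nz - 1)


layouts = {
    "MPI-only, 128 ranks (assumed 8x4x4)": (8, 4, 4),
    "hybrid, 4 ranks x 32 threads (assumed 2x2x1)": (2, 2, 1),
}

for label, grid in layouts.items():
    print(f"{label}: interface area {internal_interface_area(*grid)}, "
          f"neighbor pairs {neighbor_pairs(*grid)}")
```

Under these assumptions, a rotating mesh sweeps across far fewer subdomain boundaries in the coarse layout, so fewer mesh elements and particles have to be handed from one MPI process to another per step.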
Both benchmarks illustrate the benefit of multiple decomposition layers: one layer that maps to the physical layout of the computing hardware, and another that focuses on balancing the workload among compute resources.
Overall, this development also improved the baseline performance of LIGGGHTS, because various internals had to be adapted.
All of these changes will be made available to a broader audience in 2015. This will help sort out remaining bugs and improve quality through open-source collaboration.
Fig. 4: Runtime improvement seen in the Mixing benchmark with 128 cores