SST Validation Activity

Mission

The SST-VA mission is to improve the robustness and efficiency of SST through industry-standard and innovative testing procedures, combined with feedback to SST verification efforts, so that users can have a high degree of confidence in SST.

Goals

Our goal is to use and develop software, in the form of miniapps and microbenchmarks, to stress-test SST and build confidence in its ability to accurately simulate real-world hardware. To date, the STREAM, LMBench and GUPS microbenchmarks have been prominent in our efforts, which have focused on the Ariel, memHierarchy, VaultSim and DramSim components of SST (see below for details).

Team Management Structure and Members

Oversight: SST management team - Si Hammond, Arun Rodrigues, Scott Hemmert

Validation Team members: Bob Benner, Jagan Jayaraj

Validation Software

Our primary validation software is the collection of Mantevo miniapps (e.g., miniFE, miniAMR, and miniGhost), supplemented by various microbenchmarks.

Miniapps

Micro-benchmarks
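To make the goals above concrete, the sketch below shows the core of a GUPS-style random-access kernel, in the spirit of the HPCC RandomAccess benchmark. It is an illustrative sketch, not necessarily the gups.c listed under Support Files; the table size and update count are assumed values.

    /* Minimal GUPS-style update loop (illustrative sketch only).
     * Each iteration XORs a pseudo-random value into a random table
     * slot, generating irregular, cache-unfriendly memory traffic. */
    #include <stdint.h>

    #define TABLE_SIZE (1UL << 20)          /* assumed: 1M entries (8 MiB), power of two */
    #define N_UPDATES  (4 * TABLE_SIZE)

    static uint64_t table[TABLE_SIZE];

    int main(void)
    {
        uint64_t ran = 1;
        for (uint64_t i = 0; i < N_UPDATES; i++) {
            /* LFSR-style pseudo-random stream, as in HPCC RandomAccess. */
            ran = (ran << 1) ^ (((int64_t)ran < 0) ? 0x7ULL : 0);
            table[ran & (TABLE_SIZE - 1)] ^= ran;   /* random read-modify-write */
        }
        return (int)table[0];   /* defeat dead-code elimination */
    }

Because the updates land at effectively random addresses, GUPS exercises the memory system with cache-unfriendly traffic, which is why it features in the Ariel and memHierarchy studies below.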

Results

The SST validation studies for memory focus on both bandwidth and latency. Working with the lat_mem_rd memory-read-latency benchmark from the LMBench suite, we discovered a bug: the Intel PIN tool used by the Ariel processor model was unable to follow a forked child, even with the appropriate PIN parameters set.
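The failure mode can be reduced to a minimal sketch (hypothetical code, not the lat_mem_rd source): the timed loads run in a forked child, and if the instrumentation tool does not follow that child, Ariel never sees the memory traffic it is supposed to measure.

    /* Sketch of the fork pattern that exposed the bug: if PIN fails to
     * follow the forked child, the child's loads are never instrumented
     * and the simulator observes no traffic from the timed region. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: the benchmark's timed memory accesses happen here. */
            size_t len = 1 << 20;
            volatile char *buf = calloc(len, 1);
            long sum = 0;
            for (size_t i = 0; i < len; i += 64)
                sum += buf[i];              /* these loads must be traced */
            printf("child sum = %ld\n", sum);
            exit(0);
        }
        waitpid(pid, NULL, 0);              /* parent only waits */
        return 0;
    }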

We moved to BlackjackBench for benchmarking cache latencies because it does not fork any children. The objective is to validate the cache latencies one level of the cache hierarchy at a time. The number of memory accesses that Ariel reports is very close to the count estimated from the application source. Since the simulations are more or less deterministic (ignoring slight non-determinism introduced by PIN), we do not have to repeat the benchmark multiple times. The measured load latencies change exactly at the L1 and L2 cache-size boundaries. The average load latency reported by the benchmark is very close to the configured L1 latency when the data fits in L1. However, the reported latencies drop for L2 accesses, which is not correct. memHierarchy itself appears to compute the latencies correctly; the discrepancy stems from the complex interaction between Ariel and the application, especially around the gettimeofday system call.
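For context, the measurement idea behind such cache-latency benchmarks is a dependent pointer chase over a working set sized to a single cache level. The sketch below uses assumed sizes and is not the BlackjackBench implementation, but it shows both the chase and where gettimeofday enters the timing path.

    /* Sketch of a dependent pointer chase for per-level load latency
     * (assumed sizes; not the BlackjackBench implementation).  Each load
     * depends on the previous one, so the average time per iteration
     * approximates the load-to-use latency of whichever cache level the
     * working set fits in. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int main(void)
    {
        size_t n = (16 * 1024) / sizeof(void *);   /* 16 KiB: fits in a typical L1 */
        void **ring = malloc(n * sizeof(void *));
        for (size_t i = 0; i < n; i++)             /* simple ring; a real benchmark
                                                      randomizes the chain to defeat
                                                      hardware prefetching */
            ring[i] = &ring[(i + 1) % n];

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);                   /* the system call whose interaction
                                                      with Ariel/PIN skewed the reported
                                                      L2 latencies */
        void **p = ring;
        long iters = 100 * 1000 * 1000;
        for (long i = 0; i < iters; i++)
            p = (void **)*p;                       /* serialized, dependent loads */
        gettimeofday(&t1, NULL);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
        printf("%.2f ns/load (p=%p)\n", sec / iters * 1e9, (void *)p);
        return 0;
    }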

Another part of the memory study focuses on the bandwidth and latency of the main-memory backends. In particular, VaultSim, an HMC-like high-bandwidth memory backend, was compared against DDR3 (DramSim). Care must be taken in configuring the systems, as it is easy to get the bandwidth wrong, especially with a network model and multiple links in the mix. The clocks of the memory and directory controllers must also be set appropriately to achieve the specified bandwidth for a given flit size. If the bandwidth is set improperly on even one of the components or links in the critical path, packets wait far longer in the flow-control buffers and accesses take hundreds of cycles longer to complete.
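A quick sanity check helps here: a link that moves one flit per cycle delivers bandwidth equal to flit size times clock rate, so the required clock is the target bandwidth divided by the flit size. The numbers in the sketch below are illustrative only, not taken from any particular SST configuration.

    /* Back-of-the-envelope check (illustrative numbers only): a link that
     * moves one flit per cycle delivers flit_bytes * clock of bandwidth,
     * so the clock needed for a target bandwidth is target / flit size. */
    #include <stdio.h>

    int main(void)
    {
        double flit_bytes = 32.0;                     /* assumed flit size */
        double target_gbs = 64.0;                     /* assumed per-link target, GB/s */
        double clock_ghz  = target_gbs / flit_bytes;  /* required clock, GHz */
        printf("%.0f B flits need a %.1f GHz clock for %.0f GB/s per link\n",
               flit_bytes, clock_ghz, target_gbs);
        return 0;
    }

With the assumed 32-byte flits, a 64 GB/s per-link target requires a 2 GHz clock; a slower clock on any component in the critical path caps the achievable bandwidth accordingly.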

Nevertheless, this study revealed the need to modify memNIC to support larger (and configurable) buffer sizes for the higher bandwidths, and to augment the cache statistics with better latency breakdowns. At a coarse level, we observe that the VaultSim configuration is faster than DDR3 for the miniapps. At a finer grain, we are working on establishing that the latencies between the cache hierarchy and the memory backends are exactly what we expect.



Support Files

gups.c - GUPS Random Memory Access Benchmark
quads.py - Input file for sst to run the miniSMAC2D miniapp
batch_script_for_sst_and_minismac2d - Batch file for running miniSMAC2D on sst-devel compute nodes