balar

The balar library provides an interface between SST and GPGPU-Sim, a cycle-level simulator modeling contemporary graphics processing units (GPUs) running GPU computing workloads written in CUDA. It supports two execution modes: trace-driven and direct-execution.

  • Trace-driven: balar is driven by a test CPU that consumes CUDA API traces (not to be confused with the SASS traces used by Accel-Sim) and issues the corresponding CUDA calls.
  • Direct-execution: a CPU model executes a CUDA binary linked against a custom libcudart and dispatches CUDA API calls to balar and GPGPU-Sim.

The BalarTestCPU component is a trace-based test CPU, included in the balar folder (./testcpu/), that runs simulations with CUDA API call traces and data collected from a real GPU. It works by consuming a trace file and its associated CUDA memory copy data files. The cudaMemcpyH2D payloads are collected for program correctness; the cudaMemcpyD2H payloads are collected to validate the computation.

BalarMMIO is responsible for relaying CUDA API requests from SST to GPGPU-Sim. Currently it supports running with CUDA traces without a real CPU model (via BalarTestCPU), or with a Vanadis core executing a RISC-V + CUDA binary against a custom CUDA runtime (libcudart_vanadis, inside ./tests/vanadisLLVMRISCV/). The latter mode has been tested with a subset of the Rodinia 2.0 benchmark kernels in the unit tests.

The dmaEngine component performs data transfers between the SST cache memory space and the simulator's memory space. It is required because balar reads and writes CPU data (e.g., for cudaMemcpy() with Vanadis) and places it into GPGPU-Sim's memory space for functional simulation. In addition, dmaEngine is used to read the CUDA dispatch packet and write back the return value for the custom CUDA runtime.

At a Glance

Source code: sst-elements/.../balar  
SST name: balar  
Maturity Level: Prototype (2)  
Development Path: Active  
Last Released: SST 14.0

warning

Support for trace-driven mode is currently limited, as it was used for early-stage validation of the balar implementation. It has only been tested with a simple integer vector add example.

We are working on a more robust version of this with the new NVBit release, including a better trace format and better computation validation.

Required dependencies

Optional dependencies

  • LLVM: for compiling the RISC-V + CUDA binary. Versions 18.x.x and later should work (specifically, builds after llvm/llvm-project#57544 was resolved).
  • RISCV GNU Toolchain: for compiling the RISC-V + CUDA binary. Generally, any recent release should work; we have only tested the 2024.08.06-nightly tag.
  • gpu-app-collection: for running the unit tests with the Rodinia 2.0 kernels.
  • Test Docker image: you can also use the prebuilt Docker image, which has all dependencies installed except GPGPU-Sim:
    • OS: Ubuntu 22.04
    • CUDA: 11.7
    • LLVM: 18.1.8
    • RISCV: 2024.08.06-nightly
      # Pull prebuilt image
      docker pull ghcr.io/accel-sim/accel-sim-framework:SST-Integration-Ubuntu-22.04-cuda-11.7-llvm-18.1.8-riscv-gnu-2024.08.06-nightly