Skip to main content

Checkpoint and restart

In SST 14, SST began bringing back checkpoint/restart support. Elements are not automatically checkpoint-able. To enable checkpoint/restart, developers must create a function in each element that describes how to serialize the element. A simulation can be checkpointed if all of the elements it uses support checkpoint (i.e., are serializable).

The checkpoint/restart infrastructure is intended to enable users to recover from node failure or to comply with job timeouts on machines with job time limits. Thus the assumption is that restart will occur on the same or a similar machine with the same version of SST. SST's serialization uses C++ typeids which are not guaranteed to be portable across architectures or compilers.

This guide describes

  1. Support status
  2. How to checkpoint and restart a simulation
  3. How to add checkpoint support to an element library
  4. Planned improvements

Support status

Select the version of SST you are using to see version-specific information. As support evolves, the repository information will be updated.

As of SST 15.0, the serialization API for checkpoint is stable.

  • Parallel support: All simulations, parallel and serial, can be checkpointed. Checkpoint and restart require the same parallelization (e.g., if checkpoint was generated with SST running two threads, SST must be restarted using two threads).
  • Serialization
    • The SST_SER macro replaces now deprecated ser & and ser | operators
    • A similar SST_SER_NAME macro can take an alias name for mapping as well as options from the SerOption enum
    • SerOption includes: SerOption::as_ptr (replaced ser | and SST_SER_AS_PTR), SerOption::as_ptr_elem, SerOption::map_read_only and SerOption::no_map
  • Serializable types
    • API available: Component, SubComponent, ComponentExtension, Module, PortModule
    • POD types (including std::string) and pointers to POD types
    • All C++ standard containers, std::tuple, std::pair
    • SST types: Link, TimeConverter, Output, RNG:Random, RNG:RandomDistribution, SharedArray, SharedMap, SharedSet, UnitAlgebra, Statistic, StatisticOutput
    • SST Handlers: Clock::Handler2, Event::Handler2
    • SST Interfaces: StdMem, SimpleNetwork
  • Checkpoint triggers
    • Simulated time, using the --checkpoint-period command line option.
    • Real (wall) time, using the --checkpoint-wall-period command line option.
    • On a signal, by using the --sigusr1=sst.rt.checkpoint or --sigusr2=sst.rt.checkpoint command line options.
  • Output format
    • Checkpoints are placed in a directory named <prefix> which is "checkpoint" by default. Use --checkpoint-prefix to change the prefix. If the directory already exists, SST will create a new one named <prefix>_<num> using a unique integer for <num>.
      • The checkpoint directory will be placed in SST's output directory. By default, this is the current working directory. It can be modified using --output-directory.
    • Within the directory, there is a subdirectory for each checkpoint which is named <checkpoint_name>. The default <checkpoint_name> is <prefix>_<id>_<simulated_time> where <id> is an incrementing integer and <simulated_time> is the simulated time at checkpoint. <checkpoint_name> name can be modified using --checkpoint-name-format.
    • Within the checkpoint subdirectory are a number of files.
      • Registry: A plaintext file named <checkpoint_name>.sstcpt>. This is the file to specify when restarting.
      • Globals: A binary file containing global checkpoint data named <checkpoint_name>_globals.bin
      • Data files: One per thread/rank and named <checkpoint_name>_<rank>_<thread>.bin

Creating a checkpoint

The command line option to create a checkpoint changed between SST 14.0 and SST 14.1 and the repositories. Select your version.

There are three options for triggering checkpoints: (1) on a simulation time interval, (2) on a wall (real) time interval, and (3) by sending SST a signal. A prefix to use for checkpoint directories and file names can also be given (default is "checkpoint").

Enabling checkpoints at a simulation time interval on the command line
$ sst --checkpoint-sim-period=250us configuration.py
$ sst --checkpoint-sim-period=250us --checkpoint-prefix="simulation" configuration.py
Enabling checkpoints on a wall time interval on the command line
$ sst --checkpoint-wall-period=10m configuration.py
$ sst --checkpoint-wall-period=10m --checkpoint-prefix="simulation" configuration.py
Enabling checkpoint when a SIGUSR1 or SIGUSR2 is sent to SST
$ sst --sigusr1=sst.rt.checkpoint configuration.py
$ sst --sigusr2=sst.rt.checkpoint --checkpoint-prefix="simulation" configuration.py

Any command line parameter can be set in the configuration file instead
import sst

sst.setProgramOption("checkpoint-wall-period", "10m") # Enable on a frequency in terms of wall time
sst.setProgramOption("checkpoint-prefix", "simulation") # Optional
# sst.setProgramOption("checkpoint-sim-period", "250us") # Uncomment to enable on a frequency in terms of simulated time
# sst.setProgramOption("sigusr1", "sst.rt.checkpoint") # Uncomment to trigger a checkpoint when SST receives SIGUSR1
# sst.setProgramOption("sigusr2", "sst.rt.checkpoint") # Uncomment to trigger a checkpoint when SST receives SIGUSR2

Simulated times must include units and SI-prefixes are valid. Wall times must be formatted as HH::MM::SS, MM::SS, or SS. ALternately, an integer with a time unit (h, m, or s) can be given.

If checkpoints are enabled, SST creates a directory named <prefix> in the SST output directory (unless set, the current working directory) where checkpoints will be placed. If a directory called <prefix> already exists, SST will name the directory <prefix>_<num> and increment <num> to the smallest integer that is a new directory name. (e.g., if "checkpoint_0" exists, SST will use "checkpoint_1"). An example directory structure is shown below.

current working directory
|--checkpoint
|--checkpoint_1_1000
|--checkpoint_1_1000.sstcpt
|--checkpoint_1_1000_globals.bin
|--checkpoint_1_1000_0_0.bin
|--checkpoint_1_1000_1_0.bin
|--checkpoint_2_2000
|--checkpoint_2_2000.sstcpt
|--checkpoint_2_2000_globals.bin
|--checkpoint_2_2000_0_0.bin
|--checkpoint_2_2000_1_0.bin

Each generated checkpoint has a <checkpoint_name> which is <prefix>_<id>_<simulated_time> by default. <id> is a unique (incrementing) integer and <simulated_time> is the simulated time when the checkpoint is created. Each checkpoint creates a subdirectory named <checkpoint_name>. In that subdirectory is a plaintext checkpoint registry file with the suffix ".sstcpt" and a number of global files (one per thread/rank). The registry file should be given to SST to restart the simulation.

Restarting from a checkpoint

To restart from a checkpoint, use the --load-checkpoint option and pass the checkpoint file (with suffix ".sstcpt") instead of a configuration file.

$ sst --load-checkpoint checkpoint/checkpoint_0_1000/checkpoint_0_1000.sstcpt

Adding checkpoint support to element libraries

Follow the steps below for each class in your element library. This will tell SST how to serialize your data structures. Depending on your object type, some steps may be skipped.

Note: The same serialization engine is used both for the existing Event serialization in parallel simulations and for checkpointing. Thus, the steps are similar to the steps needed to make Events serializable. Like Event serialization, checkpointing requires data to be explicitly added to the serialization stream. Unlike Event serialization, checkpointing tracks and tags pointers to objects so that upon restart, a pointer will correctly point to the recreated object.

Step 1: Inherit from the Serializable class - non-Element classes ONLY

note

Skip this step if either of the following apply:

  1. Your class is an SST Element or inherits from a class that is already serializable (examples: SST::Component, SST::SubComponent, SST::Module, SST::Event, etc).
  2. Your object is non-polymorphic

For all other classes, inherit from SST's serializable class as shown.

Inheriting from SST's serializable class
#include <sst/core/serialization/serializable.h>

class MyClass : public SST::Core::Serialization::serializable {
...

Step 2: Change Clock and Event handler functions

Change the handler type from Handler to Handler2 and move the function pointer from the constructor parameters into the template parameters as shown. The old handler types are not serializable and will cause errors if you attempt to serialize them.

// Clock handler without metadata
SST::Clock::HandlerBase* old_handler =
new SST::Clock::Handler<ComponentName>(this, &ComponentName::HandlerFunction); // Not serializable
SST::Clock::HandlerBase* new_handler =
new SST::Clock::Handler2<ComponentName, &ComponentName::HandlerFunction>(this); // Serializable

// Clock handler with metadata
SST::Clock::HandlerBase* old_handler =
new SST::Clock::Handler<ComponentName, uint32_t>(this, &ComponentName::HandlerFunction, 0); // Not serializable
SST::Clock::HandlerBase* new_handler =
new SST::Clock::Handler2<ComponentName, &ComponentName::HandlerFunction, uint32_t>(this, 0); // Serializable

// Event handler without metadata
SST::Link::HandlerBase* old_handler =
new SST::Event::Handler<ComponentName>(this, &ComponentName::HandlerFunction); // Not serializable
SST::Link::HandlerBase* new_handler =
new SST::Event::Handler2<ComponentName, &ComponentName::HandlerFunction>(this); // Serializable

// Event handler with metadata
SST::Event::HandlerBase* old_handler =
new SST::Event::Handler<ComponentName, uint32_t>(this, &ComponentName::HandlerFunction, 0); // Not serializable
SST::Event::HandlerBase* new_handler =
new SST::Event::Handler2<ComponentName, &ComponentName::HandlerFunction, uint32_t>(this, 0); // Serializable

Step 3: Add a default constructor

During deserialization, SST will call this constructor to create an object before populating its state from the checkpoint.

MyClass();

Step 4: Add a serialization function

SST's serializer calls serialize_order on elements during both serialization and deserialization. Add this function to a public section of your class definition:

public:
void serialize_order(SST::Core::Serialization::serializer& ser) override;

Next, if your class inherits from another, call that class's serialize_order function.

void MyClass:serialize_order(SST::Core::Serialization::serializer& ser)
{
BaseClass::serialize_oder(ser);
}

Last, add all class data members that need to be saved to the checkpoint using the macros SST_SER. This was changed in the SST 15.0 Release and the API is now stable. The macro looks like this:

SST_SER(var)        // No options needed
SST_SER(var, opt) // Options needed
SST_SER_NAME(var, name, opt) // Provide an alias name to be used for mapping/interactive introspection

The options determine how SST serializes the data member.

  • If you are adding a data member that is not itself a pointer but other pointers to it exist and will be serialized, you will need to specify the option SerOption::as_ptr. See note below. This replaces the SST_SER_AS_PTR macro from SST 14.x.
    • Specifically for containers, if you have pointers to non-pointer elements in a container, use the option SerOption::as_ptr_elem when serializing the container.
  • In all other cases you may use SST_SER without options.
  • Combine options using "|" if more than one option applies
  • Additional options are available to control mapping (i.e., interactive introspection)
    • SerOption::map_read_only: Do not allow a user to modify this object in an interactive console
    • SerOption::no_map: Do not serialize this object when mapping
  • The SST_SER macro also replaces the ser & syntax used previously for event serialization.

Note: The SerOption::as_ptr and SerOption::as_ptr_elem options work only for non-pointer data and add both the data to the checkpoint and register pointers to the data. These operators are only required if the object being serialized is not a pointer and you have a pointer elsewhere to the object that you intend to serialize as well. Using these macros allow a serialized pointer to correctly link itself to the serialized data. These options cause the serializer to consume both extra space in the checkpoint and add a (small) overhead during serialization. The non-pointer data must serialize before any pointers to the data are serialized, otherwise the deserialized data will not be correct.

By way of example:

/* Class variables */
//
SST::Link* link;
SST::Output* output;
SST::TimeConverter tc;
int count;
std::string name;
std::vector<uint32_t> nums;
std::string pointedToObj;
std::string* pointer = &pointedToObj;

/* Serialization function */
void serialize_order(SST::Core::Serialization::serializer& ser) override
{
BaseClass::serialize_order(ser); // Assuming this class has a base class called 'BaseClass'. ALWAYS call the base class's serialize_order function prior to adding data from this class.
SST_SER(link);
SST_SER(output);
SST_SER(tc);
SST_SER(count);
SST_SER(name);
SST_SER(nums);
SST_SER(pointedToObj, SerOption::as_ptr); // Track a pointer to this object
SST_SER(pointer);
}

Many data types are supported natively in SST's serialization libraries. These include:

  • POD types including std::string
  • All standard containers (std::array, std::map, std::unordered_map, std::vector, etc.)
  • Standard types: std::atomic, std::pair, std::tuple
  • SST types: Link, TimeConverter, Output, RNG:Random, RNG:RandomDistribution, SharedArray, SharedMap, SharedSet, UnitAlgebra, Statistic, StatisticOutput
  • Handlers: Clock::Handler2, Event::Handler2
  • Pointers to serializable types

Note that non-polymorphic classes are special cases in the serialization library. Because there is no way to reference the class with a different type, there is no need to inherit from Serializable or call the ImplementSerializable macro (step 5 below). You may simply add a serialize_order() function to the class as listed above and the class will be compatible with the serializer.

Step 5: Add the appropriate serialization macro

note

Skip this step if your class is non-polymorphic.

Classes that inherit from SST::Core::Serialization::serializable must call the ImplementSerializable macro with their class name. Note that the macro changes the class access specifier. Be careful where you place it, place an access specifier directly after the macro, or put the macro at the end of the class definition.

  • ImplementSerializable leaves the class in a private section
  • ImplementVirtualSerializable leaves the class in a public section
MyClass : public SST::Component
{
public:
// Class functions

ImplementSerializable(MyClass)
public: // Make sure to restore the specifier if needed
};

Pure virtual classes must use ImplementVirtualSerializable instead of ImplementSerializable.

Put it together: An example

Here is an example of what this looks like. Note we have removed all but the relevant-to-checkpointing pieces of the class (ELI macros, most class functions, etc.).

example.h
#ifndef SST_EXAMPLE_CHECKPOINT_HEADER
#define SST_EXAMPLE_CHECKPOINT_HEADER

#include <sst/core/component.h>
#include <sst/core/link.h>
#include <sst/core/rng/marsaglia.h>

namespace SST {
namespace ExampleSSTLibrary {

class example : public SST::Component { // Step 1 - SST::Component already inherits from Serializable
public:
// ELI macros would be here

// Public class members would be here

// Step 3 - Add default constructor, serialize_order, and serializable macro
example();
void serialize_order(SST::Core::Serialization::serializer& ser) override;
ImplementSerializable(SST::ExampleSSTLibrary::example)
private:
// Private member functions would be here

// Private data members - to be serialized in serialize_order
int64_t param0;
uint32_t param1;
std::string param2;
Statistic<uint64_t> stat;
SST::Output* out;
std::vector<std::string> stringVec;
SST::Link* link;
RNG::Random* rng;
};

} } // End namespaces
#endif
example.cc
// Rest of class defined here

example::example() : Component() {}

void example::serialize_order(SST::Core::Serialization::serializer& ser) {
// MUST call parent's serialize_order FIRST
Component::serialize_order(ser);

// Now, serialize everything we need to save
SST_SER(param0)
SST_SER(param1)
SST_SER(param2)
SST_SER(stat)
SST_SER(out)
SST_SER(stringVec)
SST_SER(link)
SST_SER(rng)
}

Advanced notes on serialization

While many types are serializable already within SST's serialization engine, complex types may require additional work.

Background

See the serialization documentation in the Core API for more detail on serialization. Checkpoint uses three stages of serialization (a fourth, MAP, exists for interactive debugging). They are: (1) SIZER, (2) PACK, and (3) UNPACK. When writing a checkpoint, SIZER and PACK are used to serialize the state of simulation and write it to file. During restart, UNPACK is used to deserialize the checkpointed data. In each stage, the serializer will call serialize_order on each object. Thus, if the default serialization does not work for a particular type, a custom serialization can be implemented by querying the serializer's mode in the serialize_order function and taking the appropriate action to serialize the type. During SIZER, the serialize_order function should add the size of the data to be serialized using the SST_SER macro. During PACK, serialize_order should add the data to be added to the serialization stream using SST_SER with the same options Finally, during UNPACK, serialize_order should read the serialized data from the stream, again using SST_SER. The same macro options should be used in all three stages. The fourth stage, MAP, should be defined in the case switch statements and can be left empty.

Example: Serializing UnitAlgebra

As an example, the code snippet below shows how SST::UnitAlgebra is serialized. UnitAlgebra has two components, a "unit" struct that contains a numerator (string) and denominator (string) and a "value" object that is a custom type called decimal_fixedpoint.

The "unit" struct consists of strings so UnitAlgebra serializes the struct's members using the & operator in all three serializer stages (SIZER, PACK, and UNPACK). To serialize decimal_fixedpoint, SST converts it to a string, stores the string, and then, on deserialization, re-initializes a new decimal_fixedpoint type using the stored string.

Note that the SST_SER macro is only useful for class members - unit.numerator and unit.denominator in the example below. For temporaries, use the raw (& or |) serialization operators.

    void serialize_order(SST::Core::Serialization::serializer& ser) override
{
// Do the unit - these are both strings, no special handling required
SST_SER(unit.numerator);
SST_SER(unit.denominator);

// For value, first convert to a string
// Then, reinitialize from the string
switch ( ser.mode() ) {
case SST::Core::Serialization::serializer::SIZER:
case SST::Core::Serialization::serializer::PACK:
{
std::string s = value.toString(0);
// The next line will get the size of the string during SIZER
// and will pack the string into the serialization stream during PACK
SST_SER(s);
break;
}
case SST::Core::Serialization::serializer::UNPACK:
{
// Extract the serialized string from the stream
std::string s;
SST_SER(s);
// Reinitialize 'value' using the string
value = sst_big_num(s);
break;
}
case SST::Core::Serialization::serializer::MAP:
{
// Case required but no implementation needed
break;
}
}
}

Planned improvements

  • Add ability to query if an element supports checkpoint
  • Options to limit number of checkpoints kept
  • Additional flexibility when restarting SST including ability to change parallelization and re-partition components.
    • The current implementation allows overriding command line options that govern output such as --verbose and --output-directory when restarting.