NAME

       cr_checkpoint - checkpoints a process, process group, or session.

SYNOPSIS

       cr_checkpoint [options] ID

DESCRIPTION

       Invoking  cr_checkpoint  causes  a  process (with or without all of its
       descendants), all processes within a process group,  or  all  processes
       within  a session, to be checkpointed.  The result is a checkpoint file
       (or a directory with one checkpoint file per process) that contains all
       the   state  needed  to  restart  the  process(es)  at  a  later  time.
       Checkpointed processes can be restarted via cr_restart(1).

       To be checkpointed by cr_checkpoint, a process must have  the  libcr.so
       library  (or  one  of  its  relatives) loaded.  This can be achieved by
       starting the program with cr_run(1), or  by  linking  your  application
       with  -lcr.   Or, the library may be loaded by other libraries you have
       linked with (such as a checkpoint-ready MPI library), or your  system’s
       parallel  job startup script, etc.  Check your system documentation for
       details.

   File creation/replacement
       By default (or if --atomic is passed)  cr_checkpoint  creates  the  new
       context  file/directory  atomically:   either the checkpoint fails (and
       any existing context file/directory is unchanged), or it appears in the
       directory  ready  to  be used by cr_restart.  If an existing checkpoint
       with the same file name exists, it will either be left unmodified (if
       the new checkpoint fails for any reason), or replaced atomically (via
       rename(2)).  If --backup[=NAME] is passed, any existing checkpoint will
       be  backed  up  instead,  either  to  NAME or with a numbered extension
       (.~1~, .~2~, etc., with more recent checkpoints having higher numbers).
       If  --clobber  is  passed,  the  checkpoint will immediately remove any
       existing checkpoint files, and will write the checkpoint  directly  out
       into  the target file/directory: this option uses less disk space if an
       existing checkpoint is present, since the old checkpoint is immediately
       discarded,  but if the checkpoint fails, the pre-existing checkpoint is
       lost.  Finally, if --noclobber is passed, then the checkpoint will fail
       if the target file/directory exists.
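
       The three non-default policies can be sketched as small shell
       wrappers.  This is an illustration only: the function names and the
       CR_CHECKPOINT override variable are hypothetical and are not part of
       BLCR; only the cr_checkpoint flags come from this page.

```shell
# CR_CHECKPOINT is a hypothetical override (handy for dry runs);
# by default the real cr_checkpoint command is used.
: "${CR_CHECKPOINT:=cr_checkpoint}"

# checkpoint_keep_history FILE PID
# Create the checkpoint atomically and keep any older checkpoint as a
# numbered backup (.~1~, .~2~, ...).
checkpoint_keep_history() {
    "$CR_CHECKPOINT" --backup -f "$1" "$2"
}

# checkpoint_low_disk FILE PID
# Write directly into FILE.  Uses less disk space, but the old
# checkpoint is lost if the new one fails.
checkpoint_low_disk() {
    "$CR_CHECKPOINT" --clobber -f "$1" "$2"
}

# checkpoint_once FILE PID
# Fail rather than touch an existing checkpoint.
checkpoint_once() {
    "$CR_CHECKPOINT" --noclobber -f "$1" "$2"
}
```

       For example, checkpoint_keep_history app.ckpt 23452 would run
       cr_checkpoint --backup -f app.ckpt 23452.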

   File sync
       By default (or when --sync is passed), cr_checkpoint waits until the
       checkpoint is complete in memory, and additionally calls fsync(2) on
       all files and directories involved in the checkpoint (including backup
       files) to flush them to disk before exiting.  Passing --nosync causes
       these fsync calls to be skipped.

   Timeout
       A maximum timeout in seconds can be set for a checkpoint via the --time
       flag: if the checkpoint takes  longer  than  this,  cr_checkpoint  will
       print an error message and exit with an error.  If a timeout occurs, the
       state of the process or  processes  that  were  being  checkpointed  is
       undefined.
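
       A caller will usually want to act on the exit status.  The following
       is a minimal sketch under stated assumptions: the function name and
       the CR_CHECKPOINT override variable are hypothetical, and a non-zero
       status is treated as "failed or timed out" because this page does not
       document distinct exit codes.

```shell
# CR_CHECKPOINT is a hypothetical override (handy for dry runs).
: "${CR_CHECKPOINT:=cr_checkpoint}"

# checkpoint_with_deadline SECONDS FILE PID
# Allow at most SECONDS for the checkpoint and report the outcome.
checkpoint_with_deadline() {
    if "$CR_CHECKPOINT" --time "$1" -f "$2" "$3"; then
        echo "checkpoint written to $2"
    else
        status=$?
        # After a timeout the state of the target is undefined, so the
        # caller should not assume the processes are still healthy.
        echo "checkpoint of $3 failed or timed out (exit status $status)" >&2
        return "$status"
    fi
}
```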

   Signals
       By default checkpointed processes continue to run after a checkpoint is
       complete.  Alternatively, you may specify that  they  be  stopped  (via
       --stop), or terminated/aborted/killed (via --term, --abort, or --kill).
       This is done by sending the appropriate signal to every process that is
       part  of the checkpoint.  If the processes were stopped at the time the
       checkpoint was requested, then --cont may be used to  send  SIGCONT  to
       all processes after the checkpoint is completed.
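
       Two common combinations are sketched below: checkpointing a job and
       then terminating it (for example, before a node is shut down), and
       checkpointing a job that was already stopped and resuming it
       afterwards.  The function names and the CR_CHECKPOINT override
       variable are hypothetical.

```shell
# CR_CHECKPOINT is a hypothetical override (handy for dry runs).
: "${CR_CHECKPOINT:=cr_checkpoint}"

# checkpoint_and_terminate FILE PID
# Send SIGTERM to every checkpointed process once the checkpoint completes.
checkpoint_and_terminate() {
    "$CR_CHECKPOINT" --term -f "$1" "$2"
}

# checkpoint_stopped_job FILE PID
# For a job that was stopped when the checkpoint was requested:
# send SIGCONT to every process once the checkpoint completes.
checkpoint_stopped_job() {
    "$CR_CHECKPOINT" --cont -f "$1" "$2"
}
```

       Only one signal option takes effect per invocation, so --term and
       --cont cannot usefully be combined.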

   Memory mapped files
       By default, checkpoints do not include any files that are mmap()ed into
       the process address space unless they are already unlinked at the  time
       the  checkpoint  is  taken.   This  is a space/time saving optimization
       under the assumption that the files required will still be present (and
       uncorrupted) at restart time.  Typically the largest savings come from
       not saving the executable file or dynamic (a.k.a. shared) libraries.
       However,  options  exist to cause the checkpoint to save these files as
       well.  The flag  --save-exe  will  cause  the  executable  file  to  be
       included  in the context file.  The flag --save-private will include in
       the context file any files that are mapped with the  MAP_PRIVATE  flag,
       which under Linux includes the executable and dynamic/shared libraries.
       The flag --save-shared is for saving files that  are  mapped  with  the
       MAP_SHARED  flag.   Note  that this is not the flag you want for shared
       libraries.  At restart any file saved by these  flags  will  be  mapped
       into  the process regardless of whether any file exists at the original
       location.  If there is a file at the original location, it remains
       untouched by the restart.  Finally, --save-all and --save-none will
       cause all (or none) of these optional mmap()ed files to be saved.  The
       default is --save-none.  When several of these options are passed, they
       are processed from left to right with all options being additive,
       except for --save-none, which cancels the effects of any of these
       options appearing earlier.
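
       The left-to-right rule can be modelled with a short shell function.
       This is only a model of the behavior described above, not code from
       cr_checkpoint; the function name is hypothetical.  It tracks which
       categories were requested and does not model the fact that, under
       Linux, --save-private also covers the executable.

```shell
# effective_save_set OPTION...
# Options are additive from left to right, except --save-none, which
# cancels every --save-* option that appeared before it.
effective_save_set() {
    exe=no private=no shared=no              # the default is --save-none
    for opt in "$@"; do
        case $opt in
            --save-exe)     exe=yes ;;
            --save-private) private=yes ;;
            --save-shared)  shared=yes ;;
            --save-all)     exe=yes private=yes shared=yes ;;
            --save-none)    exe=no private=no shared=no ;;
        esac
    done
    echo "exe=$exe private=$private shared=$shared"
}

effective_save_set --save-all --save-none --save-shared
# prints: exe=no private=no shared=yes
effective_save_set --save-exe --save-shared
# prints: exe=yes private=no shared=yes
```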

   Checkpointing ptrace()ed processes
       There  is  (currently)  no  way  to  fully  transparently   deal   with
       checkpoints   of  processes  that  are  being  traced  with  ptrace(2).
       Therefore, the default behavior (also available via --ptraced-error) is
       to  return  an  error  if  any  of the processes to be checkpointed are
       currently  being  ptraced.   However,  there  are  two  other  possible
       behaviors to choose among:

       --ptraced-skip
              Ptraced processes will be silently excluded from the
              checkpoint.  No error is generated unless this results  in  zero
              processes checkpointed.

       --ptraced-allow
              Ptraced  processes  will  be  checkpointed  just  like any other
              processes.  WARNING: Because the checkpointed  process  and  the
              BLCR kernel module must interact using signals and system calls,
              the debugger (or other tracer) may need to ‘continue’ the target
              process(es), possibly more than once, to allow the checkpoint to
              complete.

   Checkpointing ptrace()ing processes
       There  is  (currently)  no  way  to  fully  transparently   deal   with
       checkpoints  of  processes  that  are  tracing  other  processes  using
       ptrace(2).   Therefore,  the  default  behavior  (also  available   via
       --ptracer-error)  is  to  return an error if any of the processes to be
       checkpointed  are  currently   ptracing   other   processes.    However
       --ptracer-skip  is available to cause cr_checkpoint to silently exclude
       such processes from the checkpoint.  No error is generated in that case
       unless this would result in zero processes checkpointed.
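
       When a process tree may contain a debugger and its target, both skip
       options can be combined so that the rest of the tree is still
       checkpointed.  A minimal sketch, in which the function name and the
       CR_CHECKPOINT override variable are hypothetical:

```shell
# CR_CHECKPOINT is a hypothetical override (handy for dry runs).
: "${CR_CHECKPOINT:=cr_checkpoint}"

# checkpoint_skip_traced FILE PID
# Exclude both traced processes and the processes tracing them.
# An error is still returned if nothing is left to checkpoint.
checkpoint_skip_traced() {
    "$CR_CHECKPOINT" --ptraced-skip --ptracer-skip -f "$1" "$2"
}
```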

OPTIONS

   General options:
       -v, --verbose
              print progress messages to stderr.

       -q, --quiet
              suppress error/warning messages to stderr.

       -?, --help
              print a help message and exit.

       --version
              print version information and exit.

   Options for scope of the checkpoint:
       -T, --tree
              ID  identifies  a process id.  It and all of its descendants are
              to be checkpointed.  This is the default.

       -p, --pid, --process
              ID identifies a single process id.

       -g, --pgid, --group
              ID identifies a process group id.

       -s, --sid, --session
              ID identifies a session id.

   Options for destination location of the checkpoint:
       -c, --cwd
              checkpoint   saved   as   a   single   ’context.ID’   file    in
              cr_checkpoint’s working directory (default).

       -d, --dir DIR
              checkpoint  saved  in  new  directory DIR, with one ’context.ID’
              file per process (unimplemented).

       -f, --file FILE
              checkpoint saved as FILE.

       -F, --fd FD
              checkpoint written to an open file descriptor.

   Options for creation/replacement policy for checkpoint files:
       --atomic
              checkpoint created/replaced atomically (default).

       --backup[=NAME]
              checkpoint  created  atomically,  and  any  existing  checkpoint
              backed up to NAME or *.~1~, *.~2~, etc.

       --clobber
              checkpoint written incrementally to target, overwriting any pre-
              existing checkpoint.

       --noclobber
              checkpoint will fail if the target file exists.

              These  options  are  ignored  if  the  destination  is  a   file
              descriptor.
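
       A sketch of writing to a descriptor opened by the shell follows.  The
       function name and the CR_CHECKPOINT override variable are
       hypothetical.  Note that here the shell, not cr_checkpoint, opens and
       truncates the file, so none of the replacement policies above apply.

```shell
# CR_CHECKPOINT is a hypothetical override (handy for dry runs).
: "${CR_CHECKPOINT:=cr_checkpoint}"

# checkpoint_to_fd FILE PID
# Open FILE on descriptor 3 for the duration of the command and ask
# cr_checkpoint to write the checkpoint to that descriptor.
checkpoint_to_fd() {
    "$CR_CHECKPOINT" --fd 3 "$2" 3>"$1"
}
```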

   Options for signal sent to process(es) after checkpoint:
       --run  no signal sent: continue execution (default).

       -S, --signal NUM
              signal NUM sent to all processes.

       --stop SIGSTOP sent to all processes.

       --term SIGTERM sent to all processes.

       --abort
              SIGABRT sent to all processes.

       --kill SIGKILL sent to all processes.

       --cont SIGCONT sent to all processes.

              Options  in this group are mutually exclusive.  If more than one
              is given then only the last will be honored.

   Options for file system synchronization (default is --sync):
       --sync fsync checkpoint file(s) to disk (default).

       --nosync
              do not fsync checkpoint file(s) to disk.

   Options to save optional portions of memory:
       --save-exe
              save the executable file.

       --save-private
              save private  mapped  files.   (executables  and  libraries  are
              mapped this way)

       --save-shared
              save shared mapped files.  (System V IPC is mapped this way).

       --save-all
              save all of the above.

       --save-none
              save none of the above (the default).

   Options for ptraced processes (default is --ptraced-error):
       --ptraced-error
              return  an error if a checkpoint is requested of a process being
              ptraced.

       --ptraced-skip
              ptraced processes are  silently  excluded  from  the  checkpoint
              request.  If the checkpoint scope is --tree, then this will also
              exclude any children of such processes.  No  error  is  produced
              unless this results in zero processes checkpointed.

       --ptraced-allow
              checkpoint   ptraced  processes  normally.   WARNING:  This  may
              require  the  tracer  to  "continue"  the  target   process(es),
              possibly more than once.

   Options for processes ptracing others (default is --ptracer-error):
       --ptracer-error
              return  an error if a checkpoint is requested of a process which
              is ptracing others.

       --ptracer-skip
              processes  ptracing  others  are  silently  excluded  from   the
              checkpoint  request.   If  the  checkpoint scope is --tree, then
              this will also exclude any children of such processes.  No error
              is  produced unless this results in zero processes checkpointed.

   Options for kernel log messages (default is --kmsg-error):
       --kmsg-none
              don’t report any kernel messages.

       --kmsg-error
              on checkpoint failure, report  on  stderr  any  kernel  messages
              associated with the checkpoint request.

       --kmsg-warning
              report  on  stderr  any  kernel  messages  associated  with  the
              checkpoint request, regardless of success or failure.   Messages
              generated  in  the  absence  of  failure  are  considered  to be
              warnings.

              Options in this group are mutually exclusive.  If more than  one
              is  given then only the last will be honored.  Note that --quiet
              suppresses all stderr output, including these messages.
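
       Because kernel messages are reported on stderr, they can be collected
       in a log with ordinary shell redirection.  A minimal sketch, in which
       the function name and the CR_CHECKPOINT override variable are
       hypothetical:

```shell
# CR_CHECKPOINT is a hypothetical override (handy for dry runs).
: "${CR_CHECKPOINT:=cr_checkpoint}"

# checkpoint_logged LOG FILE PID
# Report kernel messages even on success and append them to LOG.
# Do not add --quiet here: it would suppress these messages as well.
checkpoint_logged() {
    "$CR_CHECKPOINT" --kmsg-warning -f "$2" "$3" 2>>"$1"
}
```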

   Misc Options:
       -t, --time SEC
              allow  only  SEC  seconds  for  target  to  complete  checkpoint
              (default: wait indefinitely).

EXAMPLES

       To  checkpoint  the  process with process ID 23452, saving its state to
       file context.23452:

              cr_checkpoint -p 23452

       To checkpoint all the processes in process group 68473, and  save  them
       to file groupie:

              cr_checkpoint -g -f groupie 68473

       To checkpoint all the processes in session 8362, and save separate
       ’context.PID’ files for each process in directory ’my_checkpoints’:

              cr_checkpoint -s -d my_checkpoints 8362

BUGS

       Some features described in this manpage may be unimplemented; for
       example, --dir is marked as unimplemented under OPTIONS.

AUTHORS

       Jason Duell, Paul Hargrove, and Eric Roman, Lawrence Berkeley  National
       Laboratory.

REPORTING BUGS

       Bug  reports may be filed on the web at http://mantis.lbl.gov/bugzilla.

SEE ALSO

       cr_restart(1), cr_run(1)