Batch processing (cluster)

For computationally intensive models, it is advisable to offload model generation and solving to the Phoenix cluster. As with the local workflow, this process can be automated. The basic idea is to transfer only the (very small) model scripts from the local PC to the HPC cluster and to download only the extracted result files back to the local PC, minimizing data traffic and the workload on the local machine.

[Figure: remote batch processing workflow (batch_remote_workflow_html.svg)]

A simple batch processing script on the local computer is used to initiate and control the pipeline on the cluster.

Using the batch processing script (cluster)

General remarks

The script ...GUW\python\batch_remote.py provides a customizable template for batch processing of model files on the Phoenix cluster. It is similar to the local batch processing script, but more parameters need to be set and there are a few additional points to consider.

  • The main difference when working on the Phoenix cluster is that resources like CPU cores, RAM, and computing time have to be allocated for each job. The Phoenix cluster uses the SLURM workload manager, which lets users allocate resources by submitting job scripts that specify the commands to run and the resources required. These resource parameters have to be set for all stages of the FE analysis (preprocessing, solving, and extracting results); a sketch of such a parameter set is shown after this list. The batch wrapper functions provided by GUWlib automatically create and dispatch the necessary job files during the batch process.

  • The subsequent stages of the provided script (preprocessing & solving, postprocessing, download of results) have to be run independently, as the local script has no information about whether a previous stage has completed on the Phoenix cluster.

  • The content of the ...\GUW\python\ directory has to be copied to the Phoenix cluster, as described in the setup section. Specify the path to this directory in the remote_guwlib_path variable at the top of the batch_remote.py script.
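
The resource parameters are passed to the batch wrapper functions as plain Python dictionaries. The following is a minimal sketch of such a dictionary (the field names follow the example below; the values are placeholders, and the max_time string uses SLURM's D-H:M:S walltime format):

# sketch of a SLURM resource dictionary as expected by the batch wrapper functions
slurm_settings = {"n_nodes": 1,             # number of compute nodes to allocate
                  "n_tasks_per_node": 20,   # CPU cores to use per node
                  "partition": "standard",  # SLURM partition to submit the job to
                  "max_time": "0-12:0:0"}   # walltime limit (D-H:M:S); the job is cancelled on time-out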

Example usage

In the following example, the ...GUW\python\batch_remote.py script is set up to upload, preprocess and solve the model files models\examples\example_01.py, models\examples\example_02.py and models\examples\tutorial.py on the Phoenix cluster.

  • For preprocessing (.PY files >> ABAQUS/CAE >> .INP files), 1 computing node with 10 cores is allocated on the standard partition and the estimated time to process all files is set to 1 h. If the process exceeds this time, the job is cancelled due to time-out.

  • For solving (.INP files >> ABAQUS/Explicit >> .ODB files), 1 computing node with all of its 20 cores and a maximum time of 12 h is allocated on the standard partition for each of the previously generated .INP files. The jobs in this stage are executed strictly in series (one after the other) so as to avoid blocking too many ABAQUS licenses.

from guwlib.functions_batch.remote import *

# path to the remote location of GUWlib
remote_guwlib_path = '/beegfs/work/<username>/GUW/python'
preprocessing = True
postprocessing = False
download = False

# preprocessing and solving -----------------------------------------------------------------+
if preprocessing:
    # model files (.PY) to process
    model_file_paths = ['models/examples/example_01.py',
                        'models/examples/example_02.py',
                        'models/examples/tutorial.py', ]

    # SLURM parameters for preprocessing, solving, and postprocessing
    # parameters for preprocessing apply to all models, make sure that the total time
    # (max_time) is sufficient
    slurm_preprocessing = {"n_nodes": 1,
                           "n_tasks_per_node": 10,
                           "partition": "standard",
                           "max_time": "0-1:0:0"}

    # parameters apply to the solving process of one simulation each
    slurm_solving = {"n_nodes": 1,
                     "n_tasks_per_node": 20,
                     "partition": "standard",
                     "max_time": "0-12:0:0"}

    # call the batch function to upload the model files, initiate automated preprocessing
    # and solving
    build_and_solve(model_files_local=model_file_paths,
                    remote_guwlib_path=remote_guwlib_path,
                    cae_slurm_settings=slurm_preprocessing,
                    solver_slurm_settings=slurm_solving,
                    hostname='phoenix.hlr.rz.tu-bs.de', port=22)
  • For the postprocessing stage (.ODB files >> ABAQUS/CAE >> .NPZ files), 10 min of computing time on the standard partition with 10 CPUs is allocated for each of the .ODB files. A maximum of 5 parallel executions of ABAQUS/CAE is specified (ABAQUS/CAE only blocks 1 license token per instance).

  • The results/ directory (relative to the GUWlib path) is specified as the location to scan recursively for unprocessed results. The executed script will simply try to process all directories that contain an .ODB file but no .NPZ file. One could, however, also explicitly specify the directories to search (e.g. results/tutorial/lc_0_burst_load_case/). Note that forward slashes are used on Linux to specify file paths!

  • After the results extraction is done, download_results() can be run to automatically download the .NPZ files that were written during the last call of extract_results() (see the sketch after the listing below).

# postprocessing ----------------------------------------------------------------------------+
if postprocessing:
    # parameters apply to the extraction process of one .ODB file each
    slurm_postprocessing = {"n_nodes": 1,
                            "n_tasks_per_node": 10,
                            "partition": "standard",
                            "max_time": "0:10:0"}

    # remote location where to look for .ODB files that are ready for results extraction
    directories_to_scan = ['results/', ]

    # data to extract (field or history)
    data_to_extract = 'history'

    # call the batch function for automated result export
    extract_results(directories_to_scan=directories_to_scan, data_to_extract=data_to_extract,
                    remote_guwlib_path=remote_guwlib_path,
                    cae_slurm_settings=slurm_postprocessing, max_parallel_cae_instances=5,
                    hostname='phoenix.hlr.rz.tu-bs.de', port=22)
    print("Make sure to check the status of the current post-processing job and download the "
          "results after the job is completed.")


Tricks for working with ABAQUS on the Phoenix Cluster

Some commands and tools are particularly useful for monitoring and managing ABAQUS jobs and are described briefly in this section. For a more complete beginner's guide on working on Linux via the CLI, see e.g. the Phoenix documentation.

Submit, monitor and cancel jobs

  • A job file can be submitted via sbatch:

    $ sbatch jobfile.job
    
  • To show your jobs that are currently in the queue, use squeue:

    $ squeue -u <username>
    

    When you have ABAQUS jobs running that were created with the guwlib batch processing scripts, the output might look something like this:

    [username@login01 ~]$ squeue -u <username>
               JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             2676018  standard example_ username PD        0:0      1 (Dependency)
             2676017  standard example_ username R     0:12:21      1 node001
    

    In this case, two jobs are queued up; the first job (2676017) has been running for about 12 minutes and the second one is pending, waiting for the first job to be completed.

  • To see the live output of a job, you can tail the log file that is associated with this job. To find the file path of the log file for a given job ID, run scontrol:

    $ scontrol show job <JOBID>
    

    and then, with the file location of the log file, run:

    $ tail -f <log_file_path>
    

    to continuously monitor the output that is written to the log file.

  • Another helpful tool to analyze the workload caused by a job is top, which shows a dynamic overview of running processes and system resources. For a specific job, check which node it runs on, connect to that node via SSH, and run top, e.g.:

    $ ssh node001
    $ top
    
  • To cancel a job, run scancel:

    $ scancel <JOBID>
    
  • To see an overview of completed jobs (including failed jobs), use sacct:

    $ sacct -u <username>
    

Current ABAQUS licenses in use

  • To see how many ABAQUS licenses are currently in use, run

    $ abaqus licensing lmstat -a
    

    on any computer within the TU BS network with ABAQUS installed.

File transfer

  • FileZilla is a good GUI tool for SSH file transfer.


Illustration of the cluster batch processing pipeline

It can be helpful to understand how the different scripts and functions interact with each other when working with the batch scripts or when debugging or extending them.

Building and solving

The batch process of building and solving is visualized below.

  • The build_and_solve() function uploads the specified model files (.PY) as well as an initial .JOB (preproc.job) file to the cluster via SSH.

  • The preproc.job file is submitted on the cluster and calls the cluster_pre.py script with respective command line arguments, specifying which .PY files to preprocess.

  • The cluster_pre.py script iterates over the list of .PY files, submitting them to ABAQUS/CAE as subprocesses. For each model script (.PY), one or more .INP files are created, depending on the number of load cases specified.

  • For each .INP file created, the cluster_pre.py script also writes a .JOB file. These job files call the ABAQUS solver on the respective .INP files, and they are generated in such a way that the \((i+1)\)-th job is only started after the \(i\)-th job has finished (a sketch of this chaining mechanism is given at the end of this subsection).

  • Each ABAQUS solver job will read the respective .INP file and write out an .ODB file.

Typically, the jobs calling the ABAQUS solver are resource-intensive and will run much longer than the preprocessing job. Enough resources (CPUs, computing time) have to be allocated for these jobs.
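
Sequential execution of the solver jobs can be realized with SLURM job dependencies (the squeue output above shows such a job waiting with reason "Dependency"). The following is a minimal sketch of how such a chain could be submitted; it only illustrates the mechanism and is not necessarily how cluster_pre.py implements it, and the job file names are hypothetical:

import re
import subprocess

# hypothetical solver job files, one per generated .INP file
job_files = ['lc_0_solve.job', 'lc_1_solve.job', 'lc_2_solve.job']

previous_job_id = None
for job_file in job_files:
    # start the (i+1)-th job only after the i-th job has terminated
    command = ['sbatch']
    if previous_job_id is not None:
        command.append('--dependency=afterany:{}'.format(previous_job_id))
    command.append(job_file)

    # sbatch prints 'Submitted batch job <id>'; extract the id to chain the next job
    output = subprocess.check_output(command).decode()
    previous_job_id = re.search(r'\d+', output).group(0)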

Extracting results

The process of extracting results is also visualized below.

  • The extract_results() function uploads a job script (postproc.job) to the cluster via SSH, calling the cluster_post.py script on the cluster and specifying a list of directories to check for unprocessed results.

  • The cluster_post.py script scans the provided directories and their subdirectories for folders that contain .ODB but no .NPZ files, and for each file found, it writes out a .JOB file (see the sketch after this list).

  • The job files are generated to call ABAQUS/CAE along with a history / field export helper script to open the .ODB file and export the relevant data to an .NPZ file. Job files are generated in a way such that a specified maximum of CAE instances can run simultaneously (in this example, a maximum of 2 CAE processes will run at the same time).

  • As a result, for each .ODB file, an .NPZ file is generated, containing the extracted field or history data in a compact NumPy binary format. Also, a .TXT file with a list of the file paths to all .NPZ files extracted in the previous step is written out, allowing for a convenient batch download.
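
The scanning step described above can be summarized with a short sketch; this is a simplified re-implementation of the scanning logic for illustration, not the actual cluster_post.py code:

import os

def find_unprocessed_odb_files(directories_to_scan):
    """Return the paths of .ODB files located in directories that do not yet contain an .NPZ file."""
    unprocessed = []
    for directory in directories_to_scan:
        for root, dirs, files in os.walk(directory):
            has_npz = any(f.lower().endswith('.npz') for f in files)
            odb_files = [f for f in files if f.lower().endswith('.odb')]
            if odb_files and not has_npz:
                unprocessed.extend(os.path.join(root, f) for f in odb_files)
    return unprocessed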


Batch wrapper functions (cluster)

guwlib.functions_batch.remote.build_and_solve(model_files_local, remote_guwlib_path, cae_slurm_settings, solver_slurm_settings, hostname='phoenix.hlr.rz.tu-bs.de', port=22)

Wrapper for the process of uploading model files (.PY) to a remote host (e.g. Phoenix Cluster) and initiating the preprocessing and solving pipeline by calling the cluster_pre.py script on the remote machine.

Data exchange between client and host is established via Secure Shell (SSH) and respective command line arguments to the cluster_pre.py script. The host (e.g. Phoenix Cluster) is expected to have the SLURM workload manager installed, has to be reachable via SSH, and the guwlib Python modules have to be available at remote_guwlib_path.

Parameters:
  • model_files_local (list[str]) – List of the model (.PY) files to upload and process on the cluster.

  • remote_guwlib_path (str) – Path to the directory that contains the guwlib module on the remote machine.

  • cae_slurm_settings (dict) – SLURM settings for the generation of ABAQUS .INP files.

  • solver_slurm_settings (dict) – SLURM settings for the ABAQUS solver.

  • hostname (str) – Name of the SSH host.

  • port (int) – Port of the SSH host.

Returns:

None

guwlib.functions_batch.remote.extract_results(directories_to_scan, data_to_extract, remote_guwlib_path, cae_slurm_settings, max_parallel_cae_instances, hostname='phoenix.hlr.rz.tu-bs.de', port=22)

Wrapper for the process of converting or extracting field / history data from .ODB files to .NPZ files at the specified locations on a remote host by calling the cluster_post.py script on the remote machine.

Data exchange between client and host is established via Secure Shell (SSH) and respective command line arguments to the cluster_post.py script. The host (e.g. Phoenix Cluster) is expected to have the SLURM workload manager installed, has to be reachable via SSH, and the guwlib Python modules have to be available at remote_guwlib_path.

Parameters:
  • directories_to_scan (list[str]) – Directories to scan for unprocessed .ODB files (relative to remote_guwlib_path).

  • data_to_extract (str) – Type of data to extract ('field' or 'history').

  • max_parallel_cae_instances (int) – Maximum number of CAE instances to run simultaneously during extraction.

  • remote_guwlib_path (str) – Path to the directory that contains the guwlib module on the remote machine.

  • cae_slurm_settings (dict) – SLURM settings for ABAQUS/CAE which is used to read in .ODB files.

  • hostname (str) – Name of the SSH host.

  • port (int) – Port of the SSH host.

Returns:

None

guwlib.functions_batch.remote.download_results(remote_guwlib_path, hostname='phoenix.hlr.rz.tu-bs.de', port=22)

Wrapper for the process of downloading previously extracted results to the local machine. During the extraction process, a .TXT file with the file paths of all generated .NPZ files is placed at the location of remote_guwlib_path.

Parameters:
  • remote_guwlib_path (str) – Path to the directory that contains the guwlib module on the remote machine.

  • hostname (str) – Name of the SSH host.

  • port (int) – Port of the SSH host.

Returns:

None