Monitor CPU and Memory

General Note

Monitoring CPU (central processing unit) and memory usage helps keep a system performant and stable.

CPU usage: Monitoring CPU usage helps to identify performance bottlenecks, such as CPU-intensive processes, that may slow down the system. It also helps to detect potential security issues such as malware or rogue processes that consume a high amount of CPU resources.
Memory usage: Monitoring memory usage helps to identify memory leaks, which are a common cause of system crashes and instability. It also helps to ensure that the system has enough memory resources available to handle peak loads and avoid swapping, which can slow down the system.

By monitoring CPU and Memory usage, administrators can proactively identify and resolve performance issues, avoid system crashes, and ensure a stable and efficient system.

Making sure your jobs use the right amount of RAM and the right number of CPUs also helps you and others using the clusters use these resources more efficiently, and in turn get work done more quickly.

Running Jobs

If your job is already running, you can check its current usage, but you will have to wait until it has finished to find the maximum memory and CPU used. The easiest way to check the instantaneous memory and CPU usage of a job is to ssh to a compute node your job is running on. To find the node you should ssh to, run squeue --me and read the NODELIST column (see the squeue section below).

Once you are on the compute node, run either ps or top.

`top`

In Jupyter Notebook, you can monitor CPU and memory usage by using the %system magic command. For example, the following command will show you the current CPU usage in Jupyter Notebook:

  `%system top -b -n 1 | grep Cpu`

The output will look like this:

  Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
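If you want the CPU figures as numbers rather than text, the summary line can be parsed in a notebook cell. The following is a minimal sketch, not part of any tool: `parse_cpu_line` is a hypothetical helper name, and it assumes the line has the shape shown above (newer versions of top print `99.9 id` instead of `99.9%id`, which the pattern also accepts).

```python
import re


def parse_cpu_line(line):
    """Turn a top 'Cpu(s):' summary line into {field: percent}.

    Hypothetical helper: assumes value/label pairs such as
    '99.9%id' (older top) or '99.9 id' (newer top).
    """
    pairs = re.findall(r"(\d+(?:\.\d+)?)\s*%?\s*([a-z]{2})", line)
    return {name: float(value) for value, name in pairs}


# Sample line copied from the output above.
sample = ("Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  "
          "0.0%wa,  0.0%hi,  0.0%si,  0.0%st")
usage = parse_cpu_line(sample)

# Everything that is not idle time counts as busy.
busy = 100.0 - usage["id"]
print(f"user={usage['us']}%  system={usage['sy']}%  busy={busy:.1f}%")
```

In a notebook, the list returned by `%system top -b -n 1 | grep Cpu` holds the line as its first element, so it can be passed to the helper directly.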

The following command will show you the current memory usage in Jupyter Notebook:

  `%system free -m`

The output will look like this:

              total        used        free      shared  buff/cache   available
  Mem:          125          10         114           0           0         114
  Swap:           0           0           0
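The same approach works for the memory table. This sketch assumes the column layout shown above; `parse_free` is a hypothetical helper name, and the values are in whatever unit the `free` flag selected.

```python
def parse_free(output):
    """Parse the text printed by `free` into {row: {column: value}}.

    Hypothetical helper: assumes the first non-empty line is the header
    and every later line starts with a label such as 'Mem:'.
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    columns = lines[0].split()
    table = {}
    for line in lines[1:]:
        label, *values = line.split()
        # The Swap row has fewer values; zip stops at the shorter sequence.
        table[label.rstrip(":")] = dict(zip(columns, (int(v) for v in values)))
    return table


# Sample text copied from the output above.
SAMPLE_FREE = """\
              total        used        free      shared  buff/cache   available
Mem:          125          10         114           0           0         114
Swap:           0           0           0
"""

mem = parse_free(SAMPLE_FREE)["Mem"]
used_percent = 100 * mem["used"] / mem["total"]
print(f"memory used: {used_percent:.0f}% of total")
```

The `available` column is usually the better indicator of how much memory new work can still claim, since `free` excludes memory that the kernel can reclaim from caches.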

In the terminal, you can monitor CPU and memory usage with the top command on Linux and with the Task Manager on Windows.

On Linux, the top command displays the system's resource usage and the processes that are using the most resources. By default, top displays information about CPU and memory usage.

In Windows, the Task Manager provides information about the performance of the system and the applications running on it. To open the Task Manager, press Ctrl + Shift + Esc or right-click the taskbar and select Task Manager. The Task Manager provides information about CPU and memory usage, as well as other system performance metrics.

Note: The specifics of monitoring CPU and memory usage may vary depending on the operating system and the tools available.

`squeue`

The squeue command is used to view the status of jobs in the Slurm scheduling queue. When you run squeue --me, it shows only the jobs that belong to the current user.

Here is an example of the output you might see when you run the squeue --me or squeue -u netid command:

JOBID  PARTITION  NAME  USER   ST  TIME  NODES  NODELIST(REASON)
56789  debug      job1  user1  R   1:23  2      node001,node002
67890  long       job2  user1  PD  0:00  4      (Resources)

The output shows the following information:

JOBID: a unique identifier for each job.
PARTITION: the partition or queue to which the job is assigned.
NAME: the name of the job.
USER: the username of the user who submitted the job.
ST: the status of the job. The possible values are:
PD (pending): the job is waiting to run.
R (running): the job is currently running.
CA (cancelled): the job was cancelled by the user or a system administrator.
CF (configuring): the job has been allocated resources and is waiting for them to become ready for use (for example, while nodes boot).
CG (completing): the job is in the process of completing; some processes on some nodes may still be active.
CD (completed): the job has terminated all processes on all nodes with an exit code of zero.
F (failed): the job terminated with non-zero exit code or other failure condition.
TIME: the time the job has used so far; pending jobs show 0:00.
NODES: the number of nodes allocated to the job, or the number requested if the job is still pending.
NODELIST: a list of the nodes the job is running on or the reason the job is pending.
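To pick out the nodes you can ssh to, the squeue table can be filtered for jobs in the R state. This is a rough sketch under stated assumptions: `parse_squeue` and `running_nodes` are hypothetical helper names, job names are assumed to contain no spaces, and the node list is assumed to be a plain comma-separated list as in the example above. Slurm often compresses node lists into a form such as `node[001-002]`, which this sketch does not expand.

```python
def parse_squeue(output):
    """Split default-format squeue text into one dict per job."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Limit the split so the last column keeps any embedded spaces.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


def running_nodes(jobs):
    """Map the job ID of each running job to its list of nodes."""
    return {
        job["JOBID"]: job["NODELIST(REASON)"].split(",")
        for job in jobs
        if job["ST"] == "R"
    }


# Sample text copied from the squeue example above.
SAMPLE_SQUEUE = """\
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
56789  debug       job1     user1  R       1:23      2 node001,node002
67890  long        job2     user1  PD      0:00      4 (Resources)
"""

for job_id, nodes in running_nodes(parse_squeue(SAMPLE_SQUEUE)).items():
    print(f"job {job_id} is running on: {', '.join(nodes)}")
```

The pending job is skipped because its last column holds a reason, not a node.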

`sstat`

The sstat command is used to view detailed statistics of a running Slurm job or step. When you run sstat JOBID -a --format=AveCPU,AveVMSize, it shows the average CPU time and average virtual memory size of the tasks in each step of the specified job.

Here is a guide to using sstat to monitor resource usage:

1. Obtain the job ID. Before using the sstat command, determine the job ID of the job you want to monitor. You can use the squeue command to list the jobs in the Slurm cluster, along with their job IDs.
2. Run sstat JOBID -a --format=AveCPU,AveVMSize to report the average CPU time and average virtual memory size for that job.
3. View the results. Here is an example of the output displayed in the terminal:

    AveCPU  AveVMSize
---------- ----------
  00:00:00      7324K
  00:00:09   1006052K

4. Interpret the results:
AveCPU: average (system + user) CPU time of all tasks in the job.
AveVMSize: average virtual memory size of all tasks in the job.

The AveCPU value is the average CPU time that the job's tasks have consumed so far. A higher value indicates that the job is using more CPU resources. The AveVMSize value is the average amount of virtual memory that the job's tasks are using. A higher value indicates that the job is using more memory.

With these values, you can monitor the resource usage of Slurm jobs and make decisions about how to allocate resources more effectively. If a job is using an excessive amount of CPU or memory resources, you may need to adjust its resource allocation or move it to a different node in the cluster.
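For comparing steps, it can help to convert the two columns into plain numbers. The sketch below is illustrative only: `cpu_seconds`, `size_mib`, and `parse_sstat` are hypothetical helper names, the time format is assumed to be `[DD-]HH:MM:SS`, and the size suffixes are assumed to be powers of 1024.

```python
# Assumption: K/M/G/T suffixes are binary multiples, expressed here in KiB.
UNITS_IN_KIB = {"K": 1, "M": 1024, "G": 1024 ** 2, "T": 1024 ** 3}


def cpu_seconds(text):
    """Convert a '[DD-]HH:MM:SS' CPU time string to seconds."""
    days, _, clock = text.rpartition("-")
    parts = [float(p) for p in clock.split(":")]
    while len(parts) < 3:  # tolerate 'MM:SS' by padding missing hours
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return (int(days) if days else 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def size_mib(text):
    """Convert a size such as '1006052K' to MiB."""
    suffix = text[-1].upper()
    if suffix in UNITS_IN_KIB:
        kib = float(text[:-1]) * UNITS_IN_KIB[suffix]
    else:
        kib = float(text) / 1024  # assumption: no suffix means bytes
    return kib / 1024


def parse_sstat(output):
    """Return (cpu_seconds, vmsize_mib) for each data row of the table."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the header, and the dashed separator.
        if len(fields) != 2 or fields[0] == "AveCPU" or set(fields[0]) == {"-"}:
            continue
        rows.append((cpu_seconds(fields[0]), size_mib(fields[1])))
    return rows


# Sample text copied from the sstat example above.
SAMPLE_SSTAT = """\
    AveCPU  AveVMSize
---------- ----------
00:00:00      7324K
00:00:09   1006052K
"""

for seconds, mib in parse_sstat(SAMPLE_SSTAT):
    print(f"AveCPU={seconds:.0f}s  AveVMSize={mib:.1f} MiB")
```

Each row corresponds to one job step, so a step whose AveVMSize is far larger than the others is a natural place to start looking when a job runs out of memory.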

For the other job accounting fields and their explanations, see the sstat and sacct manual pages. You can also use the more flexible sacct to get that information, along with other more advanced job queries.