Parallel CycleCloud Slurm process stuck while reading input file

Eran Arad 0 Reputation points
2024-08-23T13:09:15.8333333+00:00

I am running a parallel job with Open MPI on CycleCloud, using Slurm (batch) on 5 nodes with 120 cores each.

The job starts with each core reading the computational mesh. While a core is reading, a mesh lock file appears for it and disappears when it finishes reading. The problem: a new mesh lock file then appears and the process gets stuck, although squeue still shows the job as running.
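To help narrow down which cores are hanging, here is a minimal sketch for listing lock files that have been around longer than expected. The `*.lock` glob pattern, the 300-second threshold, and the function name `find_stale_locks` are all assumptions for illustration; the solver's actual lock file naming may differ.

```python
import sys
import time
from pathlib import Path


def find_stale_locks(mesh_dir, pattern="*.lock", max_age_s=300.0, now=None):
    """Return (path, age_seconds) for lock files older than max_age_s, oldest first.

    The pattern and threshold are hypothetical defaults; adjust them to the
    lock file names the solver actually writes.
    """
    now = time.time() if now is None else now
    stale = []
    for path in Path(mesh_dir).glob(pattern):
        try:
            age = now - path.stat().st_mtime
        except FileNotFoundError:
            # The lock was released between listing and stat; not stale.
            continue
        if age > max_age_s:
            stale.append((path, age))
    # Oldest lock first, since it is the most likely to belong to a hung rank.
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    mesh_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, age in find_stale_locks(mesh_dir):
        print(f"{path}  age={age:.0f}s")
```

Running it against the mesh directory while the job appears hung should show whether the lingering lock files are being refreshed or are simply never released.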

Running on a single node with 120 cores using mpirun directly (without sbatch) works fine, so it is a Slurm issue. This setup used to work for many runs and actually worked yesterday. But now the process gets stuck, even though I restarted the scheduler several times. Does anyone have an idea how to resolve the issue?

Azure CycleCloud
A Microsoft tool for creating, managing, operating, and optimizing high-performance computing (HPC) and big compute clusters in Azure.