CLUSTER UPGRADE to SLURM and Rocky 9.2

We are pleased to announce that an upgrade to the cluster is underway.  We have upgraded two compute nodes and invite you to try them out: compute-094 and compute-111. You can ssh to them directly from the current login nodes until we announce the availability of a new login node, which will be temporarily named login31 (and later renamed to jhpce01).

The most significant change will be the switch in schedulers from SGE (Sun Grid Engine) to SLURM (Simple Linux Utility for Resource Management). The SGE codebase is no longer actively maintained, and its newest release is now about ten years old. SLURM, on the other hand, is much more widely used, with regular patches and updates being made available.

SLURM and SGE are conceptually similar, with the notions of “jobs”, “nodes”, “partitions” (known as “queues” in SGE), and resource allocation for RAM and cores. However, the commands and options of the two schedulers differ. An orientation to using SLURM on the JHPCE cluster is available, and we will be providing training sessions for end users as we get closer to the cutover date. There are also documents and example code files in /jhpce/shared/jhpce/slurm on the test nodes.
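
As a quick illustration of the difference (a rough sketch only; the script name and resource values are placeholders, and the exact options used on JHPCE may differ), a multi-core job that was submitted under SGE as

  $ qsub -cwd -l mem_free=10G,h_vmem=10G -pe local 4 myscript.sh

would be submitted under SLURM along the lines of

  $ sbatch --mem-per-cpu=10G --cpus-per-task=4 myscript.sh

Likewise, qstat and qdel are replaced by squeue and scancel for checking on and cancelling jobs.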

We will also be upgrading the operating system from CentOS Linux 7.9 to Rocky Linux 9.2. Both CentOS and Rocky Linux are built from the same Red Hat source code and are binary compatible with Red Hat Enterprise Linux.

We are standing up parts of the new cluster alongside the old, with the intention of moving more compute nodes over as we flesh out the new cluster’s capabilities.

We hope to finish in time for the resumption of school, but will press on if that deadline passes.

We will continue to use modules to manage the user environment for different software packages; a short example follows the list below. Because of the OS upgrade, current modules should be recompiled. If you have helped build modules in the past, we would greatly appreciate your help doing so again. These are the new module directories and who maintains them:

  /jhpce/shared/jhpce – the systems admin staff

  /jhpce/shared/community – you good folks

  /jhpce/shared/libd – Lieber Institute
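
As a brief example of how modules will continue to be used on the new nodes (the module name and version below are placeholders, not a list of what has already been built):

  $ module avail
  $ module load conda_R/4.3
  $ module list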

Please use the bithelp mailing list for discussions about the new cluster – problems, solutions, requests.

Thank you for your interest and participation!

Jeffrey


JHPCE unavailable April 21st – May 1st for ARCH Cooling maintenance

Dear JHPCE community,

There will be a scheduled downtime for the JHPCE cluster from Friday, April 21st at 5:00 PM until Monday, May 1st at 5:00 PM. The ARCH/Bayview colocation facility will be down for preventative maintenance on its HVAC system, and the JHPCE cluster will need to be shut down to accommodate this work.


Globus Server update on JHPCE, Saturday Dec. 10th at 9:00 PM

Dear JHPCE community,

We will be upgrading the Globus Server software on the JHPCE cluster this Saturday evening, December 10th, starting at 9:00 PM.  We expect that the upgrade will take 1 hour to complete.  During the upgrade, the Globus endpoint will be unavailable.  This upgrade is being done primarily to install a new Globus certificate on the server.

If you use the Globus Connect Personal application on your local laptop or desktop, you should have received notice from Globus that you will need to update the application. Further information on updating Globus Connect Personal can be found at https://docs.globus.org/ca-update-2022/#globus_connect_personal

Please email us at bitsupport if you have any questions.


Please be judicious in your use of the email option in sbatch

One commonly used feature on the JHPCE cluster is the “send me an email when my job completes” option in SLURM. This option can be enabled by adding the “--mail-type=FAIL,END --mail-user=john@jhu.edu” options to your sbatch command:

$ sbatch --mail-type=FAIL,END --mail-user=john@jhu.edu script2
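
The same options can also be placed inside the job script itself as #SBATCH directives, which keeps the sbatch command line short (a minimal sketch; the email address and the final command are placeholders):

  #!/bin/bash
  #SBATCH --mail-type=FAIL,END
  #SBATCH --mail-user=john@jhu.edu
  ./my_analysis.sh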

This option is very convenient for longer-running jobs on the cluster. It allows you to submit your jobs and let them run without having to continually log in to the cluster to check on their status.

This option can cause problems, though, when running thousands of jobs or tasks on the cluster. Most email servers employ heuristics to detect spam, and the sudden arrival of thousands of messages over a short period of time can trigger them. This may result in your email account, or even your domain, getting locked, or in a nasty note from the email support team. When this happens, it may take time to unravel the situation and restore access to your email account.

We do go over this during the JHPCE orientation, so this is a gentle reminder to use caution when using the email option in sbatch. The slides for the JHPCE orientation can be downloaded from the Orientation page: https://jhpce.jhu.edu/register/orientation/. Please email us at bitsupport if you have any questions.


Heat issue at MARCC colocation – JHPCE cluster unavailable.

Dear JHPCE community,

Update: 2022-06-22 13:00 – The cooling issue has been resolved, and the cluster is once again available.

There is currently an issue with the cooling system at the MARCC colocation facility. A number of compute nodes on the JHPCE cluster overheated and crashed, as did a couple of storage arrays. As a precautionary measure, we are shutting down as much of the cluster as we can until the cooling issue is resolved. Please consider the JHPCE cluster unavailable at this point. We will update you as the issue progresses.


JHPCE Cluster to be unavailable from April 11th – April 15th for scheduled preventative maintenance on the HVAC equipment

The JHPCE cluster will be unavailable from April 11th – April 15th in order to accommodate scheduled preventative maintenance to be done on the HVAC system at the MARCC datacenter. We are planning to take the JHPCE cluster down beginning at 6:00 AM on Monday April 11th.  We are expecting that the cluster will be available by Friday, April 15th at 5:00 PM.

If temperatures allow, we may be able to bring some storage resources and the transfer node online, but at this point, please plan for all cluster resources to be unavailable for the duration of the maintenance.  Please let us know if you have any questions about this upcoming downtime.

Thank you for your understanding, and we apologize for any inconvenience.


2021-08-14 JHPCE cluster unavailable due to cooling issues at datacenter

The JHPCE cluster is currently down due to cooling issues at the Bayview/MARCC datacenter. We will keep you advised as the status changes.


Setting per-user job limit on JHPCE cluster

As of June 17th, 2021, we are imposing a limit of 10,000 submitted jobs per user. Previously there was no limit, which has caused issues in the past when the cluster scheduler was overloaded by hundreds of thousands of jobs in the queue. Going forward, if you try to submit more than 10,000 jobs, you will receive the following error:

Unable to run job: job rejected: only 10000 jobs are allowed per user (current job count: 10000)

You will need to either submit your jobs in smaller batches or, preferably, use an array job. Array jobs can be used to submit multiple instances of the same script, with different arguments or data used for each instance. Please see https://jhpce.jhu.edu/question/how-do-i-run-array-jobs-on-the-jhpce-cluster for more details and examples.
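
For example, rather than submitting 10,000 individual jobs, the same work can be submitted as a single array job with 10,000 tasks (a rough sketch using SGE array syntax; the script name is a placeholder):

  $ qsub -t 1-10000 myscript.sh

Inside myscript.sh, the $SGE_TASK_ID environment variable identifies the current task and can be used to select the corresponding input file or parameter set.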


2021-05-24 – JHPCE cluster unavailable due to cooling issue

The JHPCE cluster is currently unavailable due to cooling issues at the Bayview/MARCC datacenter where the JHPCE cluster is located.  We apologize for any inconvenience, and we will keep you up to date as we are made aware of any changes in the situation.


Rebooting one DCL storage system Friday, April 2, 8:00AM – 9:00AM

Dear JHPCE community,

We will be rebooting one of the DCL01 storage servers this Friday morning in order to resolve an issue with one of the filesystems on that server. The following directories will be unavailable this Friday, April 2, between 8:00 AM and 9:00 AM.

/dcl01/ajaffe
/dcl01/arking
/dcl01/ccolantu
/dcl01/chaklab
/dcl01/dawson
/dcl01/guallar
/dcl01/hpm
/dcl01/klein
/dcl01/ladd
/dcl01/mathias1
/dcl01/moghekar
/dcl01/pienta
/dcl01/shukti
/dcl01/song1
/dcl01/tin
/dcl01/wang

Please try to limit your access to these directories during this maintenance window. Typically, jobs that are accessing these directories will simply pause while the server is being rebooted and then continue once the server comes back online; however, it is best to minimize activity against the affected directories.

Thank you for your understanding. Please let us know if you have any questions.
