Quickstart to using the CRC
Logging in
When you first connect, you are logged onto a shared login node.
- DO NOT DO ANY HEAVY COMPUTATION HERE. This is a shared resource and should only be used for light tasks like editing files, submitting jobs, etc.
- From the CRC docs: “Small processing which is not disruptive/resource intensive can be done on the front ends. This is normally pre-processing or post-processing after completion of UGE jobs.”
You can only access the CRC servers (and many other resources/docs) if you are connected to the ND network.
- If you are off-campus, you need to connect to the campus VPN to access these resources.
- For more information on connecting to the campus VPN, see here: https://inside.nd.edu/task/all/campus-vpn---students
Transferring files
There are several ways to transfer files to and from your local machine and the CRC. For our purposes, we will primarily use scp and GitHub.
- It’s easiest to clone your GitHub repository on the CRC and then pull/push changes as needed.
- For all other files that aren’t tracked by Git (e.g., big data/results files), you can use scp to transfer files. scp (“secure copy”) is a command-line tool for copying files securely between computers.
Running jobs on the CRC
The CRC uses a job scheduler to manage resources and job queues. Think of a job as a script that you want to run on the CRC. What are the main steps to run a job on the CRC?
- Write a script (.R/.py file) that you want to run.
- Make sure the necessary dependencies (packages, conda, quarto, etc.) are installed.
- Write a job submission script.
- Submit the job to the CRC.
- Monitor the job’s progress.
We walk through each of these steps in turn below.
Write a script
You can write your script in any text editor on the CRC. Alternatively (and preferably), you can do all of your code development locally, push to your GitHub repository, and pull the changes on the CRC.
For this demonstration, we will be running a simple script that runs leave-one-out cross-validation using a random forest, applied to the TCGA breast cancer dataset. See either the scripts/parallel_example.R or scripts/parallel_example.py files in the course GitHub repository for the full script.
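To make the shape of such a script concrete, here is a minimal sketch of parallel leave-one-out cross-validation. It is not the course script: it swaps the random forest for a 1-nearest-neighbor rule on hypothetical toy data, and it uses the standard library's concurrent.futures instead of joblib so that it runs without extra packages. All names (X, y, predict_1nn, loo_fold, loo_accuracy) are assumptions made for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical toy data: one feature per sample, binary labels.
X = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
y = [0, 0, 0, 1, 1, 1]


def predict_1nn(train_X, train_y, x):
    """Predict the label of x from its single nearest training point."""
    nearest = min(range(len(train_X)), key=lambda j: abs(train_X[j] - x))
    return train_y[nearest]


def loo_fold(i):
    """Hold out sample i, fit on the rest, return 1 if the prediction is correct."""
    train_X = X[:i] + X[i + 1:]
    train_y = y[:i] + y[i + 1:]
    return int(predict_1nn(train_X, train_y, X[i]) == y[i])


def loo_accuracy(n_workers=1):
    """Run every fold (in parallel when n_workers > 1) and return the accuracy."""
    folds = range(len(X))
    if n_workers > 1:
        # Each fold is independent, so folds can be farmed out to worker processes.
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            hits = list(pool.map(loo_fold, folds))
    else:
        hits = [loo_fold(i) for i in folds]
    return sum(hits) / len(hits)


if __name__ == "__main__":
    print(loo_accuracy(n_workers=2))
```

The key point is that each held-out fold is an independent task, which is what makes the loop easy to spread across the cores you request for the job.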
Install dependencies
Installing dependencies on the CRC can be done in a similar way to how you would install them on your local machine (e.g., using install.packages(...) in R or pip/conda install ... in Python). However, if we have set up a reproducible environment tool like renv in R or conda in Python, installation becomes much easier.
Since installing these dependencies can be somewhat time-consuming, let’s run this inside an interactive node (instead of the login node) to be considerate of others. To launch an interactive job, you can use the qlogin command:
qlogin
Write a job submission script
Writing the job submission script is probably the “newest” part of this process. This script essentially tells the CRC what resources you need and how to run the job.
For our purposes, we will use the generic job submission scripts provided in the course GitHub repository. More specifically, we will start by using the submit_r_job.sh and submit_python_job.sh scripts in the parallelization/job_scripts directory to run R and Python scripts, respectively.
Typically, these job submission scripts will have the following structure:
- Specify the resources you need (e.g., number of cores).
- Load the necessary modules.
- Run the script you want to run.
Below is a generic job submission script (saved as job_scripts/submit_r_job.sh) for running an R script on the CRC.
#!/bin/bash
#$ -M netid@nd.edu # Email address for job notification
#$ -m abe # Send mail when job begins, ends and aborts
#$ -pe smp 24 # Specify parallel environment and legal core size
#$ -q long # Specify queue
#$ -N job_name # Specify job name
module load R
cd ../ # run the R script from the project root directory to activate renv
Rscript ${1}.R
Below is a generic job submission script (saved as job_scripts/submit_python_job.sh) for running a Python script on the CRC.
#!/bin/bash
#$ -M netid@nd.edu # Email address for job notification
#$ -m abe # Send mail when job begins, ends and aborts
#$ -pe smp 24 # Specify parallel environment and legal core size
#$ -q long # Specify queue
#$ -N job_name # Specify job name
module load python
conda activate dsip_parallel
cd ../
python ${1}.py
Notes:
- You should change the -M argument to your email address.
- You should change the -N argument to an informative name for your job.
- ${1} is the first argument passed to the job submission script and serves as a placeholder for the name of the script you want to run.
  - This is so that you don’t have to write a new .sh file for every script you want to run.
- You should change dsip_parallel to the name of your desired conda environment.
- -pe smp XX specifies the number of cores you want to use for your job. You should change XX to the number of cores you want to use.
Helpful CRC documentation:
Submit the job to the CRC
To submit the job, pass the job submission script to the qsub command, followed by the name of the script you want to run (without its file extension, since the job submission script appends .R or .py):
qsub <job_submission_script.sh> <path/to/script>
Monitor the job’s progress
To monitor the job submission status, you can use the qstat command:
qstat -u $USER
This will show all of the jobs in the queue or running that are submitted by you.
After the job has finished running, you can check the output file to see the results. By default, the output file will be named <job_name>.o<job_id>, where <job_name> is the name of the job and <job_id> is the job’s ID.
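If you submit the same job several times, the working directory fills up with output files that differ only in their job ID. The sketch below shows one way to pick out the most recent one by modification time. The helper name latest_output_file is hypothetical, and the pattern assumes the default <job_name>.o<job_id> naming described above; it would also match any other file that starts with <job_name>.o.

```python
import glob
import os


def latest_output_file(job_name, directory="."):
    """Return the most recently modified <job_name>.o<job_id> file, or None."""
    # glob.escape guards against job names containing wildcard characters.
    pattern = os.path.join(directory, f"{glob.escape(job_name)}.o*")
    matches = glob.glob(pattern)
    return max(matches, key=os.path.getmtime) if matches else None


if __name__ == "__main__":
    path = latest_output_file("job_name")
    if path is None:
        print("No output file found yet.")
    else:
        with open(path) as f:
            print(f.read())
```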
If you want to see more detailed information about the behavior of your job processes while they are running, you can use the Xymon GUI Tool. All you need to do is click on the link to the CRC machine that your job is running on. To figure out which machine your job is running on, you can use the qstat command. Look for the column that looks something like long@d32cepyc204.crc.nd.edu.
Summary
To summarize, the main steps to run a job on the CRC are:
Preliminary set up
- Write a script that you want to run.
  - See scripts/parallel_example.R (or .py) for examples
- Install the necessary dependencies.
- Write a job submission script.
  - See job_scripts/submit_r_job.sh (or submit_python_job.sh) for examples
Running the job
Submit the job to the queue:
qsub <job_submission_script.sh>
Monitor the job’s progress:
qstat -u $USER
or check on the Xymon GUI Tool.
Most common mistakes
- Be sure that you are in the desired directory when you submit your job.
- If you are not in the correct directory, the job will not be able to find the script you want to run, or the script may not be able to access the necessary data files.
- Don’t forget to save the job’s outputs/results to a file. If you don’t, these results will disappear once the job is finished running.
- Make sure you are requesting the correct number of cores.
- If you request more cores than what your script uses, you are wasting resources.
- If you request fewer cores than what your script uses, your job might run into weird resource errors.
- Double check that all of your dependencies are installed.
- Make sure you are submitting the job from a login node.
Before submitting a large job, it is a good idea to test your job submission script with a small test job to make sure everything is working as expected (e.g., use 2 cores instead of 24 cores).
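Two of the mistakes above (mismatched core counts and lost results) can be guarded against inside the script itself. The following is a minimal sketch, assuming the scheduler exposes the granted core count through the NSLOTS environment variable, which Grid Engine sets for jobs submitted with a parallel environment (-pe). The helper names and the results.json file name are hypothetical.

```python
import json
import os


def requested_cores(default=1):
    """Return the core count granted by the scheduler, or `default` off-cluster."""
    # NSLOTS is expected to be set by Grid Engine for -pe jobs (assumption of this sketch).
    return int(os.environ.get("NSLOTS", default))


def save_results(results, path):
    """Write results to disk so that they outlive the job."""
    with open(path, "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    n_cores = requested_cores()
    # Pass n_cores to your parallel backend (e.g., as n_jobs in joblib)
    # instead of hard-coding a number that may not match -pe smp XX.
    results = {"n_cores": n_cores}  # placeholder for real results
    save_results(results, "results.json")
```

Reading the core count from the environment means that changing -pe smp XX in the job submission script is enough; the R/Python script does not need to be edited to match.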
Other helpful commands and resources
Managing modules:
- To load a module, you can use the module load command followed by the module name.
- To unload a module, you can use the module unload command followed by the module name.
- To see a list of all available modules, you can use the module avail command.
- To see which modules you have loaded, you can use the module list command.
- More information about modules on the CRC can be found here.
If you need to delete a job from the queue, you can use the qdel command followed by the job ID:
qdel <job_id>
The job ID is the first column of the qstat output.
To check your current disk usage, you can use the quota command:
quota
You are initially allotted 100GB of storage space on the CRC, although it is possible to ask for more (with appropriate justification) and/or request scratch space.
Helpful bash commands:
- ls: list files in the current directory
- ls path/to/directory: list files in a specific directory
- ls -al: list all files in the current directory (including hidden files)
- ls -al path/to/directory: list all files in a specific directory (including hidden files)
- cd path/to/directory: change directory
- pwd: print current working directory
- rm path/to/file: remove a file
- rm -r path/to/directory: remove a directory and all of its contents
- rm -rf path/to/directory: force remove a directory and all of its contents without prompting
- mv path/to/file path/to/new_location: move a file
- cp path/to/file path/to/new_location: copy a file
- man command: get help on a specific command
vim is a popular text editor that can be used directly in the command line.
- To open a file in vim, type vim path/to/file.
- To edit a file in vim, type i to enter insert mode.
- To exit insert mode, press esc.
- To save and exit, type :wq (w = write, q = quit)
- To exit without saving, type :q! (q = quit, ! = force)
Interactive jobs
We have already seen that you can spawn an interactive job using the qlogin command, which gives you access to an interactive compute node. You can also request a specific number of cores using the -pe smp <num> flag (e.g., to request 2 cores, you would type qlogin -pe smp 2). Another way to request an interactive job is to use the qrsh command, e.g.,
qrsh -q long -pe smp 1
However, rather than working in the terminal, it can often be more convenient to work in an interactive R or Python session using VS Code, Positron, RStudio, or JupyterLab. We will walk through how to use each of these tools on the CRC next.
Open OnDemand is a web-based portal that provides access to the CRC’s high performance computing resources and a user-friendly interface for managing jobs, files, and interactive applications.
To access Open OnDemand, go to https://ondemand.crc.nd.edu/ and log in with your ND NetID and password. Using Open OnDemand, you can launch interactive applications using VS Code, RStudio, JupyterLab, and more. You can also manage your files and submit batch jobs through the Open OnDemand interface.
More information about Open OnDemand can be found in the CRC documentation: https://docs.crc.nd.edu/resources/ood.html.
To connect to a remote server (e.g., CRC) using VS Code or Positron:
- Open VS Code or Positron.
- On the left, go to “Extensions” (the tab with the square boxes), and install the Remote - SSH extension. This might be installed by default in Positron.
- If you are connecting to the CRC for the first time using VS Code or Positron, you will need to add a new SSH host. This step only needs to be done once. To add a new SSH host:
  - Click on the >< icon on the bottom left of the screen.
  - Click on the Connect to Host... option.
  - Click on the Add New SSH Host... (or Add host to SSH config file...) option.
  - If you are using VS Code, type in ssh <NetID>@crcfe01.crc.nd.edu or ssh <NetID>@crcfe02.crc.nd.edu in the Host field and hit enter. If you are using Positron, Positron will open up a file where you can add the following lines to add the CRC (1 and/or 2) as a new SSH host:
Host crcfe01.crc.nd.edu
  HostName crcfe01.crc.nd.edu
  User <NetID>
Host crcfe02.crc.nd.edu
  HostName crcfe02.crc.nd.edu
  User <NetID>
    Make sure to replace <NetID> with your actual ND NetID and save the file (Cmd+S on Mac or Ctrl+S on Windows/Linux).
  - In VS Code, you may now be prompted to select the configuration file to use. If so, select the first option (e.g., /Users/<NetID>/.ssh/config) that appears in the dropdown menu. This file will be used to store the SSH configuration.
We have just added the CRC as a new SSH host.
To connect to the CRC, we can now:
- Click on the >< icon on the bottom left of the screen.
- Click on the Connect to Host... option.
- Click on the crcfe01.crc.nd.edu or crcfe02.crc.nd.edu option.
- Enter your ND NetID password when prompted.
Everything you do in VS Code will now be done on the CRC!
CRC documentation: https://docs.crc.nd.edu/general_pages/r/rstudio.html#rstudio
- Download fastX.
- Open fastX. Click on the + button (top left) to add a new connection using the following settings:
  - Host: crcfe01.crc.nd.edu or crcfe02.crc.nd.edu
  - Username: your ND NetID
  - Port: Leave blank
  - Name: crc1 or crc2
- Double click on the connection you just created to connect to the CRC.
- Enter your ND NetID password when prompted.
- Click on the + button (top left) and then “Ok” to start a new terminal session. You should now be in a terminal session on the CRC.
- In the terminal, load in the RStudio module by typing the following command:
module load Rstudio
  Do NOT load the module for R. The module for R is loaded when Rstudio is loaded.
- Open Rstudio by typing the following command in your terminal:
rstudio
RStudio should now be open and ready to use!
- Log in to the CRC with the -X flag so that the RStudio GUI can load:
ssh -X <NetID>@crcfe01.crc.nd.edu
  The -X flag enables X11 forwarding, allowing you to run graphical applications on a remote server and display them on your local machine.
- In the terminal, load in the RStudio module by typing the following command:
module load Rstudio
  Do NOT load the module for R. The module for R is loaded when Rstudio is loaded.
- Open Rstudio by typing the following command in your terminal:
rstudio
RStudio should now be open and ready to use!
This tutorial is adapted from the CRC documentation:
- https://docs.crc.nd.edu/popular_modules/conda.html#conda
- https://docs.crc.nd.edu/general_pages/j/jupyter.html
To set up JupyterLab on the CRC, you will need to complete the following steps:
1. If you have used conda on the CRC before, skip to step 2. If you have never used conda on the CRC, you first need to perform some initial setup (this only needs to be done once). In your CRC terminal, run the following commands:
module load conda
conda init
source ~/.bashrc
module unload conda
   To verify that conda was successfully installed, type conda info in your terminal. It should print out information about the conda installation.
2. If you already have an existing conda environment with jupyterlab and ipykernel installed, you can skip to step 3. Otherwise, follow these steps to create a new conda environment that is compatible with JupyterLab:
module load conda
conda create --name YOUR_ENV_NAME
conda activate YOUR_ENV_NAME
conda install jupyterlab
conda install ipykernel
3. Make your conda environment available in JupyterLab by running the following command. You only need to do this one time for each conda environment you want to use in JupyterLab:
python -m ipykernel install --user --name=YOUR_ENV_NAME --display-name="YOUR_ENV_NAME"
4. (Optional) If you would like to run R in JupyterLab, you can set up an R kernel by following these steps. You only need to do this once:
  - Load the R module by typing the following in your terminal: module load R
  - Open R by typing the following in your terminal: R
  - Install the IRkernel package by typing the following in R: install.packages("IRkernel")
  - Install the R kernel by typing the following in R: IRkernel::installspec()
  - Exit R by typing q()
5. Open a new terminal and ssh into the CRC with the -Y flag:
ssh -Y <NetID>@crcfe01.crc.nd.edu
   The -Y flag in ssh enables trusted X11 forwarding. It works similarly to -X but bypasses some security restrictions.
6. Access a compute node by running the following command:
qrsh -q long -pe smp 1
7. Inside the compute node, launch a Jupyter notebook:
jupyter lab --no-browser --ip='0.0.0.0'
   You will see a lot of text, with part of it looking like:
To access the server, open this file in a browser:
    file:///afs/crc.nd.edu/user/t/ttang4/.local/share/jupyter/runtime/jpserver-2636692-open.html
Or copy and paste one of these URLs:
    http://d32cepyc193.crc.nd.edu:8888/lab?token=XXXXX
    http://127.0.0.1:8888/lab?token=XXXXX
   Note the server name and its port number (e.g., d32cepyc193.crc.nd.edu:8888). Also note the token number XXXXX (i.e., everything that comes after token=).
8. Access the Jupyter notebook using SSH tunneling. On your local machine and in a separate terminal window, run the following command:
ssh <NetID>@crcfe01.crc.nd.edu -L 8888:d32cepyc193.crc.nd.edu:8888 -N
   where 8888:d32cepyc193.crc.nd.edu:8888 is replaced by the <port>:<server>:<port> identifiers from step 7. Then enter your password. If there are no errors, the command line will hang. This is normal.
9. Open a web browser on your local machine and navigate to http://localhost:8888. You should see the JupyterLab interface. Enter the token number from the Jupyter output in step 7 when prompted.
JupyterLab should now be open and ready to use!
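Copying the server name, port, and token out of the Jupyter output by hand is easy to get wrong. As a convenience, the sketch below derives the SSH tunneling command, the local address, and the token from one of the printed URLs. The function name tunnel_command and its arguments are hypothetical; it only builds strings and does not open any connection.

```python
from urllib.parse import parse_qs, urlparse


def tunnel_command(jupyter_url, net_id, login_host="crcfe01.crc.nd.edu"):
    """Return (ssh command, local URL, token) for a Jupyter URL printed on a compute node."""
    parsed = urlparse(jupyter_url)
    host, port = parsed.hostname, parsed.port
    # The token is everything after "token=" in the query string.
    token = parse_qs(parsed.query).get("token", [""])[0]
    ssh = f"ssh {net_id}@{login_host} -L {port}:{host}:{port} -N"
    return ssh, f"http://localhost:{port}", token


if __name__ == "__main__":
    ssh, local_url, token = tunnel_command(
        "http://d32cepyc193.crc.nd.edu:8888/lab?token=XXXXX", "<NetID>"
    )
    print(ssh)        # run this on your local machine
    print(local_url)  # then open this in your browser
    print(token)      # and enter this token when prompted
```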
Job Arrays
So far, we have learned how to parallelize tasks (e.g., for loops) within your R/Python script using the future package in R or joblib in Python. In addition to parallelizing tasks within a script, you can also parallelize tasks by submitting multiple jobs to the CRC using job arrays. At a high level, a job array is a collection of jobs that are submitted to the CRC as a single batch.
As an example, we previously submitted one job to run leave-one-out cross-validation for the random forest model. Now suppose that we also wanted to run a second job to run leave-one-out cross-validation for a different model (e.g., k nearest neighbors). While we could submit two separate jobs to the CRC, it is also possible to submit a single job array with two jobs: one job running leave-one-out CV for the random forest model and the other job running leave-one-out CV for the k nearest neighbors model.
To demonstrate how to submit a simple job array, let’s implement the aforementioned example, where we want to submit a job array with two jobs:
- Job 1: Leave-one-out cross-validation for the random forest model
- Job 2: Leave-one-out cross-validation for the k nearest neighbors model
and each job will use C > 1 cores to parallelize the for loop computation.
- The main R/Python scripts that we will be using are scripts/parallel_example_with_args.R (or .py).
  - Note that these scripts are slightly modified versions of the original parallel_example.R (or .py) scripts. The main difference is that we now allow these scripts to accept a command line argument --array_id (or --model), which takes in an integer (or character string), indicating whether to use an "rf" (if array_id=1) or "knn" (if array_id=2) model.
- The main job submission scripts that we will be using are job_scripts/submit_r_job_array.sh (or submit_python_job_array.sh).
  - The only addition to these scripts is the -t flag, which specifies the range of the job array. Setting -t 1-2 will submit two jobs to the CRC: one job with ${SGE_TASK_ID}=1 and a second job with ${SGE_TASK_ID}=2. (If we wanted to submit 10 jobs, indexed from 1 to 10, we would set -t 1-10.)
  - Note: the job array ID is stored in the environment variable SGE_TASK_ID.
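The argument handling described above can be sketched as follows. This is an illustration of the idea, not the contents of parallel_example_with_args.py, which may differ: the MODELS dictionary, the parse_model helper, and the fallback to the SGE_TASK_ID environment variable when no flag is given are all assumptions of this sketch.

```python
import argparse
import os

# Array ID -> model name, following the rf/knn convention described above.
MODELS = {1: "rf", 2: "knn"}


def parse_model(argv=None):
    """Return the model name selected via --model, --array_id, or SGE_TASK_ID."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--array_id", type=int, default=None)
    parser.add_argument("--model", type=str, default=None)
    args = parser.parse_args(argv)

    if args.model is not None:
        return args.model

    array_id = args.array_id
    if array_id is None:
        # Outside a job array, SGE_TASK_ID may be unset or non-numeric,
        # so fall back to the first model in that case.
        task = os.environ.get("SGE_TASK_ID", "")
        array_id = int(task) if task.isdigit() else 1
    return MODELS[array_id]


if __name__ == "__main__":
    print(f"Running leave-one-out CV with model: {parse_model()}")
```

With this pattern, the job submission script can pass the task ID explicitly (e.g., --array_id ${SGE_TASK_ID}), and the same script still runs outside a job array for quick testing.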
To finally submit the job array, we can run the following command in the terminal:
# for R users:
qsub -N parallel_array_example_r -pe smp 2 submit_r_job_array.sh scripts/parallel_example_with_args
# for Python users:
qsub -N parallel_array_example_py -pe smp 2 submit_python_job_array.sh scripts/parallel_example_with_args
or
# for R users:
qsub -N parallel_arrayname_example_r -pe smp 2 submit_r_job_arrayname.sh scripts/parallel_example_with_args
# for Python users:
qsub -N parallel_arrayname_example_py -pe smp 2 submit_python_job_arrayname.sh scripts/parallel_example_with_args
Note: the -pe smp 2 flag is not necessary and is only added to override the original job submission script’s request for 24 cores since this is purely for demonstration.
If your tasks do not map directly to a simple integer range, you can also use the -t flag to specify a list of tasks to run. For example, -t 1,3,5,7,9 would run the job with SGE_TASK_ID=1, SGE_TASK_ID=3, SGE_TASK_ID=5, SGE_TASK_ID=7, and SGE_TASK_ID=9.
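Another option, when the settings you want to vary are not integers at all, is to keep a plain integer range for the job array and look up the actual settings inside the script. The sketch below maps each task ID to one (model, seed) combination. The SEEDS list and the task_params helper are hypothetical and only serve to illustrate the lookup.

```python
import itertools
import os

MODELS = ["rf", "knn"]
SEEDS = [0, 1, 2]  # hypothetical second setting to vary

# All combinations, in a fixed order: 6 entries, so submit with -t 1-6.
GRID = list(itertools.product(MODELS, SEEDS))


def task_params(task_id):
    """Map a 1-based job array task ID to a (model, seed) pair."""
    if not 1 <= task_id <= len(GRID):
        raise ValueError(f"task_id must be between 1 and {len(GRID)}, got {task_id}")
    return GRID[task_id - 1]


if __name__ == "__main__":
    # Default to task 1 when run outside of a job array.
    task = os.environ.get("SGE_TASK_ID", "")
    model, seed = task_params(int(task) if task.isdigit() else 1)
    print(f"model={model}, seed={seed}")
```

Because the grid is built in a fixed order, a given task ID always refers to the same combination, which makes it straightforward to rerun a single failed task.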
Why use job arrays?
- The main advantage of using job arrays is that it allows you to submit multiple jobs at once, essentially parallelizing across possibly many different machines.
- Due to how the job scheduler works, your jobs will receive higher priority if you submit them as a single job array rather than as multiple separate jobs.
- From the CRC docs: “If you find that you need to frequently submit 50 or more different jobs, we request that you implement those tasks within a job array. Grid engine is able to handle arrays much more efficiently than tens or hundreds of individual scripts from a single user. Fewer individual tasks reduces load on the job scheduler and improves overall performance.”
Combining job arrays with parallelization within a script can be a very powerful way to speed up computations. This essentially gives you the ability to parallelize your code in two different “axes” or to do some type of nested parallelization strategy without having to modify your code too much.
CRC documentation on job arrays: https://docs.crc.nd.edu/new_user/quick_start.html#job-arrays
Job Dependencies
Sometimes, you may have a set of jobs that depend on each other. For example, you may have a job array that fits various models, and you want to run a final job that aggregates the results from all of the models. In this case, you would want to make sure that the final job only runs after all of the model-fitting jobs have completed.
To specify these job dependencies, you can use the -hold_jid flag, followed by the job ID to wait for, when submitting a job to the CRC. The driver_r.sh and driver_python.sh scripts show how to do this. To run the driver script, type sh driver_r.sh or sh driver_python.sh in the terminal.
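The logic of such a driver can be sketched in Python as well. This is not what driver_python.sh does internally; it is a hedged illustration of the two pieces involved: extracting the job ID from the message that qsub prints on submission, and building a second qsub command that waits for that job. The message format matched by the regular expression ("Your job 12345 ..." or "Your job-array 12345.1-2:1 ...") is an assumption, as are the helper names and the scripts/aggregate script name.

```python
import re


def parse_job_id(qsub_output):
    """Extract the numeric job ID from qsub's confirmation message."""
    # Assumed formats: 'Your job 12345 ("name") has been submitted' or
    # 'Your job-array 12345.1-2:1 ("name") has been submitted'.
    match = re.search(r"Your job(?:-array)? (\d+)", qsub_output)
    if match is None:
        raise ValueError(f"Could not find a job ID in: {qsub_output!r}")
    return match.group(1)


def dependent_qsub_command(job_id, job_script, script_name):
    """Build a qsub command that starts only after `job_id` has finished."""
    return ["qsub", "-hold_jid", job_id, job_script, script_name]


if __name__ == "__main__":
    message = 'Your job-array 123456.1-2:1 ("fit_models") has been submitted'
    job_id = parse_job_id(message)
    # On the CRC, this list could be passed to subprocess.run(...) to submit the job.
    print(" ".join(dependent_qsub_command(job_id, "submit_python_job.sh", "scripts/aggregate")))
```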