

This project is still at an early stage of development. Breaking changes might occur.

Local workflow#

This workflow assumes that you have experience running commands on the command line. If you are not sure what a command line is, you may prefer the JupyterHub workflow for now.

Open terminal#

Open a terminal on Della in one of two ways:

  1. Open a terminal in the browser using MyDella

  2. Open a terminal locally and ssh to Della (i.e. ssh

NB: Research Computing provides the following help article for connecting to the clusters via SSH.

Install library locally#

The following workflow uses a conda environment. You are welcome to use virtualenv instead.

  1. Load the anaconda module:

module purge
module load anaconda3/2023.3

  2. Create a new conda environment:

conda create -n lip python

The name lip is short for llm-inference-platform; the choice is arbitrary.

  3. Activate the new conda environment:

conda activate lip

  4. Install the llm-inference-platform package:

pip install git+

(Optional) Download a model#

During development you can use a model pre-downloaded from my (@muhark) directory. In the future we aim to have a shared directory.

If you wish to use your own model, you can use the model download functionality. The following command downloads bigscience/bloom-560m.

llm-inference-platform model-dl \
    --repo-id bigscience/bloom-560m \
    --revision main

Run deploy command#

The following command launches a SLURM job that runs a Singularity container serving inference for the requested model. The example below has pre-populated values for our initial test.

(If you downloaded a model in the previous step, you can change the parameters accordingly.)

llm-inference-platform deploy --name bigscience/bloom-560m \
                              --revision main \
                              --cache-dir /scratch/gpfs/mj2976/shared/models

If all goes well, you will see the following message in the output:

[21:28:17 llmip] INFO: Model deployed successfully. Here are your options to connect to the model:
[21:28:17 llmip] INFO: 1. If you are working on the server running this scripts, no steps are necessary.
   Simply connect to localhost:38761.
[21:28:17 llmip] INFO: 2. If you are working somewhere else run the following command:
  ssh -N -f -L localhost:8000:della-l08g6:8000
   Afterwards, connect to localhost:8000

Copy the command under INFO: 2. and run it in a terminal on your local machine:

  ssh -N -f -L localhost:8000:della-l08g6:8000

This command will forward the API serving the LLM from the compute node to your local machine.
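
The argument after -L follows the standard ssh layout bind_address:local_port:remote_host:remote_port. As a minimal sketch of how to read it, the Python snippet below splits the value from the example log into its four parts and rebuilds it with a different local port. The names ForwardSpec, parse_forward_spec, and with_local_port are hypothetical helpers written for this illustration; they are not part of llm-inference-platform.

```python
from typing import NamedTuple


class ForwardSpec(NamedTuple):
    bind_address: str  # interface on your local machine
    local_port: int    # port you connect to locally
    remote_host: str   # compute node running the model
    remote_port: int   # port the API listens on there


def parse_forward_spec(spec: str) -> ForwardSpec:
    """Split an `ssh -L` argument of the form bind:port:host:hostport."""
    bind_address, local_port, remote_host, remote_port = spec.split(":")
    return ForwardSpec(bind_address, int(local_port), remote_host, int(remote_port))


def with_local_port(spec: str, local_port: int) -> str:
    """Return the same forwarding argument with a different local port."""
    parts = parse_forward_spec(spec)._replace(local_port=local_port)
    return ":".join(str(part) for part in parts)


example = "localhost:8000:della-l08g6:8000"  # value taken from the log above
print(parse_forward_spec(example))
print(with_local_port(example, 9000))
```

Only the first port number is local. Changing it is useful when port 8000 is already in use on your machine; the node name and the second port must stay as printed in the log.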


You may need to complete some authentication steps here, depending on how you connect to the HPC.
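
Because the ssh command goes to the background, it is not always obvious whether the tunnel came up. One way to check, sketched here in Python with only the standard library, is to test whether the local port accepts TCP connections. The function name port_is_open is a hypothetical helper for this illustration, and an open port only shows that something is listening, not that the model is ready.

```python
import socket


def port_is_open(port: int, host: str = "localhost", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

For example, port_is_open(8000) should return True once the forwarding command from the previous step is running.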

Use endpoint!#

You can now use the endpoint via a Web API! You can test it with the following command:

curl localhost:<PORT>/generate \
    -X POST \
    -d '{"inputs":"Hello World!","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

where <PORT> is 8000, or whichever local port you used in place of the first 8000 in localhost:8000 in the step above.
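
The same request can be sent from Python. The sketch below mirrors the curl command above using only the standard library: it posts the same JSON body to /generate and returns whatever JSON the server sends back. The function names build_generate_request and generate are hypothetical helpers for this illustration, not part of llm-inference-platform, and the shape of the response is not assumed.

```python
import json
import urllib.request


def build_generate_request(port: int, prompt: str, max_new_tokens: int = 20) -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"http://localhost:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(port: int, prompt: str, max_new_tokens: int = 20, timeout: float = 60.0):
    """Send the request and return the decoded JSON response as-is."""
    request = build_generate_request(port, prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the tunnel running, calling generate(8000, "Hello World!") sends the same request as the curl command with <PORT> set to 8000.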

Disconnecting/Cleaning Up#

You can close the application by pressing <Ctrl-C> in the terminal where you ran the deploy command. Press it only once; please do not press <Ctrl-C> repeatedly.