README#

Repo for developing a hosted LLM inference solution on Princeton’s Della cluster.

Developers:

  • @muhark

  • @klieret

Solution#

In broad terms:

  • LMs are hosted using the HuggingFace Text Generation Inference (TGI) container

  • Users request a configuration via a web GUI (Gradio?), which generates a SLURM request (see the sketch after this list)

  • The TGI service is forwarded back to the GUI/API on the user-facing web server
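
As a rough illustration of the GUI-to-SLURM step, here is a minimal sketch that renders a batch script from a user-submitted configuration. The field names, SLURM directives, container path, and launcher flags are assumptions about the eventual interface, not a settled design.

    # Hypothetical sketch: render a SLURM batch script from a GUI-submitted config.
    # All field names, the container path, and the model path are placeholders.

    SLURM_TEMPLATE = """#!/bin/bash
    #SBATCH --job-name=tgi-{model_name}
    #SBATCH --gres=gpu:{num_gpus}
    #SBATCH --mem={mem_gb}G
    #SBATCH --time={time_limit}

    singularity run --nv {container_image} \\
        --model-id {model_path} \\
        --port {port}
    """

    def render_slurm_script(config: dict) -> str:
        """Fill the template with the options collected by the web GUI."""
        return SLURM_TEMPLATE.format(**config)

    if __name__ == "__main__":
        example_config = {
            "model_name": "llama-2-7b",
            "num_gpus": 1,
            "mem_gb": 64,
            "time_limit": "02:00:00",
            "container_image": "/path/to/tgi.sif",       # placeholder
            "model_path": "/shared/models/llama-2-7b",   # placeholder
            "port": 8080,
        }
        print(render_slurm_script(example_config))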

Development Practices#

  • Where possible, task=issue=branch.

  • No pushing directly to main.

Installation#

pip3 install --editable '.[dev,test,docs]'

Please also install the pre-commit hook:

pipx run pre-commit install

Alternatively, you can run pre-commit manually with nox -s lint.

In addition, nox provides the following:

  • To run additional Python lint checks, run nox -s pylint

  • To build the documentation, run nox -s docs (the resulting documentation will be rendered at docs/_build/html/index.html)

Components#

Inference Container#

We are using the text-generation-inference (TGI) container from HuggingFace (HF).

This is HF’s optimized, production-grade solution for serving LLM inference in its own products.

It primarily consists of two components:

  • Web server (written in Rust): serves endpoints and manages request batching

  • LM engine (Python): runs models compatible with the HF ecosystem, with various optimizations baked in.

HF provides this as a Docker container, which we are using via Singularity on compute nodes.
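
For orientation, once a TGI instance is running it can be queried over HTTP via its /generate endpoint. The sketch below assumes a placeholder host and port; in practice these would come from the SLURM job and the port forwarding described above.

    # Minimal sketch of querying a running TGI instance over HTTP.
    # The host/port below are placeholders, not real Della addresses.
    import requests

    TGI_URL = "http://della-gpu-node:8080"  # placeholder address

    payload = {
        "inputs": "What is the capital of New Jersey?",
        "parameters": {"max_new_tokens": 50, "temperature": 0.7},
    }

    response = requests.post(f"{TGI_URL}/generate", json=payload, timeout=60)
    response.raise_for_status()
    print(response.json()["generated_text"])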

Web GUI#

WIP: solution still under discussion.

A web GUI that exposes the TGI options via dropdown menus, with explanations of what each option does and the trade-offs it involves.
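
Since the design is still under discussion, the following is only a rough Gradio sketch of what such a form might look like; the option names, ranges, model choices, and submit handler are all placeholders.

    # Rough Gradio sketch of a configuration form; option names, ranges,
    # and the submit handler are placeholders while the design is discussed.
    import gradio as gr

    def submit_request(model, num_gpus, max_input_length):
        # Placeholder: a real handler would render and submit a SLURM script.
        return f"Would request {model} on {num_gpus} GPU(s), max input {max_input_length} tokens."

    with gr.Blocks() as demo:
        model = gr.Dropdown(
            choices=["llama-2-7b", "mistral-7b"],  # placeholder model list
            label="Model",
            info="Pre-designated checkpoints stored in a shared read-only path.",
        )
        num_gpus = gr.Slider(1, 4, value=1, step=1, label="GPUs",
                             info="More GPUs fit larger models but may queue longer.")
        max_input_length = gr.Slider(512, 4096, value=2048, step=256,
                                     label="Max input length (tokens)")
        out = gr.Textbox(label="Generated request")
        gr.Button("Submit").click(submit_request,
                                  inputs=[model, num_gpus, max_input_length],
                                  outputs=out)

    demo.launch()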

Model Selection:

Users will be able to either select a model from a pre-designated list (stored in a shared read-only path) or point the container to their own custom checkpoints.
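
For example, the pre-designated list could simply be the set of subdirectories under a shared path; the path below is a placeholder, not the actual Della location.

    # Sketch: discover pre-designated models as subdirectories of a shared,
    # read-only path. The path itself is a placeholder.
    from pathlib import Path

    SHARED_MODEL_DIR = Path("/shared/models")  # placeholder

    def list_available_models() -> list[str]:
        """Return the names of checkpoints available in the shared directory."""
        return sorted(p.name for p in SHARED_MODEL_DIR.iterdir() if p.is_dir())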

Resource Selection:

For known architectures, optimal resource requests can be determined through testing and then used to pre-populate the form.
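
As a rough illustration of how pre-computed requests could work, the sketch below looks up a hypothetical table of tested configurations and otherwise falls back to a crude fp16 weight-memory estimate (about 2 bytes per parameter, plus headroom for the KV cache). The numbers and the 80 GB GPU assumption are illustrative, not measured on Della.

    # Sketch: map a model to a tested resource request, falling back to a
    # crude fp16 estimate (~2 bytes/parameter plus headroom for the KV cache).
    # All numbers here are illustrative, not measured on Della.

    TESTED_CONFIGS = {
        # model name -> (num_gpus, gpu_mem_gb)  -- hypothetical values
        "llama-2-7b": (1, 40),
        "llama-2-70b": (4, 80),
    }

    def estimate_resources(model_name: str, num_params_billion: float):
        """Return (num_gpus, gpu_mem_gb) for a model, preferring tested values."""
        if model_name in TESTED_CONFIGS:
            return TESTED_CONFIGS[model_name]
        # fp16 weights: ~2 GB per billion parameters, plus ~30% headroom.
        mem_gb = num_params_billion * 2 * 1.3
        num_gpus = max(1, round(mem_gb / 80))  # assuming 80 GB GPUs
        return num_gpus, mem_gb

    print(estimate_resources("mistral-7b", 7.0))  # -> roughly (1, 18.2)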