# Local development notes
Testing on local hardware (Alienware x15 R2 laptop).
## Software Information
```
> nvidia-container-cli info
NVRM version:   525.125.06
CUDA version:   12.0

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 3080 Ti Laptop GPU
Brand:          GeForce
GPU UUID:       GPU-bd3fa8f3-45af-3192-f030-7b9c4825eb29
Bus Location:   00000000:01:00.0
Architecture:   8.6

> nvidia-container-cli --version
cli-version: 1.13.5
lib-version: 1.13.5
build date: 2023-07-18T11:38+00:00
build revision: 66607bd046341f7aad7de80a9f022f122d1f2fce
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
```
## Testing 6 Sept 2023
Pulled the Docker image using the quoted example:

```
docker pull ghcr.io/huggingface/text-generation-inference:1.0.3
```
Initial deployment used the script `run_llama-7b.sh`.
A Hugging Face Hub token must be provided for authentication, and the initial model download takes time, so a shared model directory is essential; both show up in the sketch below.
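The contents of `run_llama-7b.sh` aren't reproduced here; as a rough sketch, assuming it follows the stock TGI `docker run` recipe, it would look something like this (the model ID and host-side cache path are assumptions, not copied from the script):

```
# Hypothetical equivalent of run_llama-7b.sh; model ID and volume
# path are assumptions, not copied from the script.
model=meta-llama/Llama-2-7b-hf
volume=$PWD/data  # shared model directory, mounted over TGI's /data cache

docker run --gpus all --shm-size 1g \
    -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
    -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.0.3 \
    --model-id $model
```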
The first attempt, launched with no extra launcher parameters, crashed due to insufficient VRAM.
A second attempt with the following launcher parameters deployed successfully, with a startup time under 30 seconds:

```
[...]
--quantize bitsandbytes \
--max-batch-prefill-tokens=1024 \
--max-total-tokens=2048 \
[...]
```
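For reference, the launcher's full flag list (including the three quoted above) should be printable from the image itself, since its entrypoint is `text-generation-launcher`:

```
# Print the flag reference for text-generation-launcher
docker run ghcr.io/huggingface/text-generation-inference:1.0.3 --help
```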
Test query:

```
$ curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```
Response:

```
{"generated_text":"\nWhat is Deep Learning? Deep learning is a subset of machine learning that is based on artificial neural"}
```
Logs:

```
2023-09-06T13:47:01.731233Z INFO HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=127.0.0.1:8080 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=curl/7.88.1 otel.kind=server trace_id=d6e2778955d17bf523e3dc568754ef3e}:generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: 20, return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None } total_time="986.532642ms" validation_time="681.374µs" queue_time="188.125µs" inference_time="985.66325ms" time_per_token="49.283162ms" seed="None"}: text_generation_router::server: router/src/server.rs:289: Success
```
Other notes:

- VRAM usage is very high: 15814 MiB / 16384 MiB (96.5%).
- Inference time is ~49 ms per token, matching the logged `time_per_token` (985.66 ms inference / 20 new tokens).
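The VRAM figure above matches nvidia-smi's memory readout; to re-check headroom while the container is serving (assuming `nvidia-smi` is available on the host):

```
# Report used vs. total VRAM for the GPU
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```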