Serving LLMs with Wisp

Serving LLMs in the cloud with Wisp is simple. If you haven't done so yet, check out the Quickstart to understand how Wisp works. Otherwise, keep reading here!

In practice, you can host your models with Wisp using any technology that exposes a port. We're using vLLM because it greatly simplifies the process and supports Docker containers.

The LLM

We'll use vLLM to host a Mistral-7B model with Docker. To use other models, see the vLLM documentation.

Configuration

If you haven't done so yet, run wisp init to create the configuration file. Open wisp-config.yml and enter the following:


setup:
  project: local

run: |
  docker run --runtime nvidia --gpus all \
    --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-v0.1

resources:
  accelerators:
    compute_capability: 7.0+
  vram: 6+
  memory: 4+

io:
  # Expose port 8000 from the Docker container on the server
  ports: 8000
  # Require login with a Wisp account for the endpoint
  secure_endpoint: true
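
If you have a machine with an NVIDIA GPU available, you can optionally sanity-check the container locally before handing it to Wisp. The snippet below is a minimal sketch that assumes Docker and the NVIDIA container runtime are installed locally; it reuses the run command from the configuration and then queries vLLM's model list endpoint.

# Optional: run the same container locally to verify it starts
docker run --runtime nvidia --gpus all \
  --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-v0.1

# Once the model has loaded, the server should list it
curl http://localhost:8000/v1/models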

Launch the Server

We're ready to launch the server! In your terminal, run:

wisp run

Wisp will pull the image and run it with the command supplied in the configuration. Once the job is running, wisp run outputs an external IP you can access through your browser.
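
Because the vllm/vllm-openai image serves an OpenAI-compatible API, you can also send a completion request to the external IP once the model has finished loading. The example below is a sketch: <EXTERNAL_IP> stands for the address printed by wisp run, and with secure_endpoint: true you may additionally need to be logged in to your Wisp account for the request to be accepted.

# Replace <EXTERNAL_IP> with the address printed by `wisp run`
curl http://<EXTERNAL_IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-v0.1",
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.7
      }'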

You can see your job, along with its stats and a cost overview, in the dashboard under Jobs.