An OpenAI-compatible edge endpoint that fronts your self-hosted inference backend (Ollama, vLLM, llama.cpp, TGI). Deployed close to your users, it talks to your GPU box over a single authenticated upstream.
Self-hosting an LLM means your GPU is in one place and your users are in many. Cold-region latency and TLS handshake cost dominate small-token requests. Edge LLM Gateway terminates TLS at the edge, multiplexes connections to your backend, and exposes a familiar OpenAI-style API so existing SDKs just work.
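To make the flow concrete, here is a minimal sketch of what such a handler could look like as a Vercel Edge Function in TypeScript. The route, the ?auth_token= check, and the environment variable names follow the examples below; the handler itself is illustrative, not the project's actual source:

```ts
// api/v1/chat/completions.ts — illustrative proxy handler, not the real implementation.
export const config = { runtime: 'edge' };

export default async function handler(req: Request): Promise<Response> {
  // Shared-secret check: the caller passes ?auth_token= (compared against GATEWAY_API_KEY).
  const token = new URL(req.url).searchParams.get('auth_token');
  if (token !== process.env.GATEWAY_API_KEY) {
    return new Response('Unauthorized', { status: 401 });
  }

  // Resolve the single authenticated upstream (your GPU box).
  const base = process.env.INFERENCE_BACKEND_URL;
  if (!base) return new Response('Gateway misconfigured', { status: 500 });
  const upstream = new URL(process.env.BACKEND_PATH ?? '/v1/completions', base);

  // Forward the JSON request body to the backend.
  const backendRes = await fetch(upstream, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: await req.text(),
  });

  // Pass the (possibly streaming) backend response straight through,
  // so `"stream": true` works end to end.
  return new Response(backendRes.body, {
    status: backendRes.status,
    headers: {
      'Content-Type': backendRes.headers.get('Content-Type') ?? 'application/json',
    },
  });
}
```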
curl "https://your-deployment.vercel.app/api/v1/chat/completions?auth_token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
- INFERENCE_BACKEND_URL — URL of your inference backend (e.g. https://ollama.example.com)
- BACKEND_PATH — path on the backend that receives requests (default /v1/completions)
- GATEWAY_API_KEY — shared secret passed as ?auth_token= on each request

Together these define the upstream: the gateway accepts requests on its OpenAI-compatible /v1/chat/completions surface and forwards them to BACKEND_PATH on INFERENCE_BACKEND_URL.
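If you centralize configuration in code, it could look like the sketch below. Only the variable names and the BACKEND_PATH default come from the list above; requireEnv is a hypothetical helper, not part of the project:

```ts
// config.ts — illustrative configuration loader for the gateway.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

export const gatewayConfig = {
  // e.g. https://ollama.example.com
  backendUrl: requireEnv('INFERENCE_BACKEND_URL'),
  // Path on the backend that receives requests; default per the docs.
  backendPath: process.env.BACKEND_PATH ?? '/v1/completions',
  // Shared secret checked against the ?auth_token= query parameter.
  gatewayApiKey: requireEnv('GATEWAY_API_KEY'),
};
```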