An OpenAI-compatible edge endpoint that fronts your self-hosted inference backend (Ollama, vLLM, llama.cpp, TGI). Deployed close to your users, it talks to your GPU box over a single authenticated upstream.
Self-hosting an LLM means your GPU is in one place and your users are in many. Cold-region latency and TLS handshake cost dominate small-token requests. Edge LLM Gateway terminates TLS at the edge, multiplexes connections to your backend, and exposes a familiar OpenAI-style API so existing SDKs just work.
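To make the flow concrete, here is a minimal sketch of what such a handler could look like as a Vercel Edge Function in TypeScript. The route, the ?auth_token= check, and the environment variable names follow the examples below; the handler itself is illustrative, not the project's actual source:

```ts
// api/v1/chat/completions.ts — illustrative proxy handler, not the real implementation.
export const config = { runtime: 'edge' };

export default async function handler(req: Request): Promise<Response> {
  // Shared-secret check: the caller passes ?auth_token= (compared against GATEWAY_API_KEY).
  const token = new URL(req.url).searchParams.get('auth_token');
  if (token !== process.env.GATEWAY_API_KEY) {
    return new Response('Unauthorized', { status: 401 });
  }

  // Resolve the single authenticated upstream (your GPU box).
  const base = process.env.INFERENCE_BACKEND_URL;
  if (!base) return new Response('Gateway misconfigured', { status: 500 });
  const upstream = new URL(process.env.BACKEND_PATH ?? '/v1/completions', base);

  // Forward the JSON request body to the backend.
  const backendRes = await fetch(upstream, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: await req.text(),
  });

  // Pass the (possibly streaming) backend response straight through,
  // so `"stream": true` works end to end.
  return new Response(backendRes.body, {
    status: backendRes.status,
    headers: {
      'Content-Type': backendRes.headers.get('Content-Type') ?? 'application/json',
    },
  });
}
```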
curl "https://your-deployment.vercel.app/api/v1/chat/completions?auth_token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
- INFERENCE_BACKEND_URL — URL of your inference backend (e.g. https://ollama.example.com)
- BACKEND_PATH — path on the backend that receives requests (default /v1/completions)
- GATEWAY_API_KEY — shared secret passed as ?auth_token= on each request

Together these define the upstream: the gateway accepts requests on its OpenAI-compatible /v1/chat/completions surface and forwards them to BACKEND_PATH on INFERENCE_BACKEND_URL.
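If you centralize configuration in code, it could look like the sketch below. Only the variable names and the BACKEND_PATH default come from the list above; requireEnv is a hypothetical helper, not part of the project:

```ts
// config.ts — illustrative configuration loader for the gateway.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

export const gatewayConfig = {
  // e.g. https://ollama.example.com
  backendUrl: requireEnv('INFERENCE_BACKEND_URL'),
  // Path on the backend that receives requests; default per the docs.
  backendPath: process.env.BACKEND_PATH ?? '/v1/completions',
  // Shared secret checked against the ?auth_token= query parameter.
  gatewayApiKey: requireEnv('GATEWAY_API_KEY'),
};
```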