Models
Real names. Real GPUs. No surprises.
GPUBox tells you exactly which model serves your request. No opaque endpoints, no silent swaps, no "mystery model" pricing. You name the model in your code; we serve that model.
Chat / LLM
qwen2.5-32b-instruct (live · Apache 2.0 · AWQ-int4)
Qwen2.5-32B-Instruct from Alibaba, a strong general-purpose LLM in the 32B parameter class. Reliable function calling, decent reasoning, and fast on consumer-grade hardware via 4-bit AWQ quantisation.
Context
8,192 tokens
Hardware
RTX 5090
Endpoint
/v1/chat/completions
Capabilities
- Chat completions (OpenAI-compatible)
- Streaming SSE
- Tool / function calling
- JSON mode (response_format)
- Multilingual: English, Chinese, Spanish, French, German, etc.
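A minimal sketch of calling the chat endpoint, using only the Python standard library. The base URL and API key below are placeholders, not real GPUBox values; the request body follows the OpenAI-compatible schema listed above (messages, `stream` for SSE, `response_format` for JSON mode).

```python
# Sketch of an OpenAI-compatible chat completions call.
# BASE_URL and API_KEY are placeholders -- substitute your own.
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"              # placeholder credential

def build_chat_request(prompt, stream=False):
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": "qwen2.5-32b-instruct",  # exact model name, as served
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,                 # True enables SSE streaming
    }

def send(body):
    """POST the body and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("Summarise this in one sentence: ...")
print(json.dumps(body, indent=2))
```

Because the endpoint is OpenAI-compatible, existing OpenAI client libraries should also work by pointing their base URL at the service.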
Speech-to-text
whisper-large-v3-turbo (live · MIT · fp16)
OpenAI's Whisper large-v3-turbo, served via faster-whisper. Real-time factor ~0.3 on the RTX 5090: a 60-second clip transcribes in roughly 18 seconds.
Context
30-second windows
Hardware
RTX 5090
Endpoint
/v1/audio/transcriptions
Capabilities
- OpenAI-compatible /v1/audio/transcriptions
- Multipart upload (file + model + optional language/prompt/temperature)
- 100+ languages with auto-detection
- verbose_json response with segment-level timestamps and confidence
- Voice-activity detection (VAD) filter
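The transcription endpoint takes a multipart upload rather than JSON. The sketch below assembles a `multipart/form-data` body by hand with the stdlib (the field names — `file`, `model`, `language`, plus `response_format` for verbose output — follow the OpenAI-compatible spec above; the filename is illustrative).

```python
# Sketch: build a multipart/form-data body for POST /v1/audio/transcriptions.
import uuid

def build_transcription_multipart(audio, filename, language=None):
    """Return (body_bytes, content_type) for the multipart upload."""
    boundary = uuid.uuid4().hex
    parts = []

    def field(name, value):
        # A simple text form field.
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )

    field("model", "whisper-large-v3-turbo")
    field("response_format", "verbose_json")  # segment timestamps + confidence
    if language:
        field("language", language)  # omit to use auto-detection

    # The audio file itself, as a binary part.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode()
        + audio + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

body, content_type = build_transcription_multipart(
    b"RIFF...", "clip.wav", language="en"  # placeholder audio bytes
)
```

POST the body with a `Content-Type` header set to the returned value; libraries like `requests` can build the same multipart body for you via their `files=` parameter.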
Embeddings
bge-m3 (coming soon · MIT · fp16)
BAAI BGE-M3 — strong multilingual embeddings, longer context than most retrievers. Coming soon.
Context
8,192 tokens
Hardware
RTX 5090
Endpoint
/v1/embeddings
Capabilities
- OpenAI-compatible /v1/embeddings
- Multilingual (100+ languages)
- Dense + sparse + multi-vector retrieval
- 1,024-dimensional output
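Once live, the endpoint is expected to follow the same OpenAI-compatible shape as the others. A minimal sketch of the request body, plus a cosine-similarity helper for comparing the returned 1,024-dimensional dense vectors (the example texts are placeholders):

```python
# Sketch of an OpenAI-compatible embeddings request and a similarity check.
import math

def build_embeddings_request(texts):
    """JSON body for POST /v1/embeddings; 'input' takes a string or a batch."""
    return {"model": "bge-m3", "input": texts}

def cosine(a, b):
    """Cosine similarity between two dense vectors (e.g. 1,024-dim bge-m3 output)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

body = build_embeddings_request(["query text", "document text"])
```

Dense vectors come back in the standard `data[i].embedding` field; the sparse and multi-vector retrieval modes BGE-M3 supports would need fields beyond the base OpenAI schema.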
Want a model we don't serve yet?
We add open-weight models on customer demand. Tell us what you need.
[email protected]