gpubox.ai

Models

Real names. Real GPUs. No surprises.

GPUBox tells you exactly which model serves your request. No opaque endpoints, no silent swaps, no "mystery model" pricing. You name the model in your code; we serve that model.

Chat / LLM

liveApache 2.0AWQ-int4

qwen2.5-32b-instruct

Qwen2.5-32B-Instruct from Alibaba — strong general-purpose LLM at the 32B parameter class. Reliable function calling, decent reasoning, fast on consumer-grade hardware via 4-bit quantisation.

Context

8,192 tokens

Hardware

RTX 5090

Endpoint

/v1/chat/completions

Capabilities

  • Chat completions (OpenAI-compatible)
  • Streaming SSE
  • Tool / function calling
  • JSON mode (response_format)
  • Multilingual: English, Chinese, Spanish, French, German, etc.

Reasoning LLM

liveApache 2.0fp16

qwq-32b

Qwen QwQ-32B-Preview — a reasoning model that thinks out loud before answering. Replies INCLUDE its working-out as part of the content (no separate reasoning channel in this Preview release), so expect verbose, transparent answers. Pick this when you want to audit how the model reached its conclusion; pick Qwen2.5-32B-Instruct when you want a tight final answer only.

Context

32,768 tokens

Hardware

RTX PRO 6000 (Blackwell, 96GB)

Endpoint

/v1/chat/completions

Capabilities

  • Chat completions (OpenAI-compatible)
  • Streaming SSE
  • Step-by-step reasoning visible inline in the response
  • Strong on maths, code review, multi-step analysis
  • Multilingual: English, Chinese

Speech-to-text

liveMITfp16

whisper-large-v3-turbo

OpenAI's Whisper large-v3-turbo via faster-whisper. Real-time-factor ~0.3 on the 5090: a 60-second clip transcribes in roughly 18 seconds.

Context

30 second windows

Hardware

RTX 5090

Endpoint

/v1/audio/transcriptions

Capabilities

  • OpenAI-compatible /v1/audio/transcriptions
  • Multipart upload (file + model + optional language/prompt/temperature)
  • 100+ languages with auto-detection
  • verbose_json response with segment-level timestamps and confidence
  • Voice-activity detection (VAD) filter

Embeddings

liveMITfp16

bge-m3

BAAI BGE-M3 — strong multilingual embeddings, 8k context. Returns L2-normalised 1024-d dense vectors via the OpenAI embeddings shape.

Context

8,192 tokens

Hardware

Ryzen 9 9950X (CPU)

Endpoint

/v1/embeddings

Capabilities

  • OpenAI-compatible /v1/embeddings
  • Multilingual (100+ languages)
  • 1,024-dimensional dense vectors
  • L2-normalised by default

Want a model we don't serve yet?

We add open-weight models on customer demand. Tell us what you need.

[email protected]