Benchmark report · June 2025
We measured throughput, TTFT, and per-token cost for Qwen3-30B-A3B under concurrent load on our stack and compared the results against published numbers from the two commercial providers that currently serve this model. Our hardware: a single RTX A6000 48 GB on Vast.ai spot at $0.376/hr. Their hardware: H-class datacenter GPUs.
Our results
Model: Qwen3-30B-A3B-AWQ (INT4 weights, INT4 KV cache) · Hardware: 1× RTX A6000 48 GB · Stack: vLLM (our fork) + AWQ-Marlin kernel · Cost: $0.113 / 1M output tokens · In production, the fleet scales horizontally — users never queue behind a saturated GPU.
| Concurrency | Throughput tok/s aggregate | tok/s/user aggregate ÷ N | TTFT p50 ms | TPOT p50 ms / token |
|---|---|---|---|---|
| 1 | 105 | 105 | 70 | 9.1 |
| 8 | 440 | 55 | 142 | 16.9 |
| 16 | 643 | 40 | 155 | 22.6 |
| 32 | 923 | 29 | 273 | 29.6 |
Prompt: 512 tokens, completion: 256 tokens. Measured with OpenAI-compatible streaming client. TPOT = inter-token gap, measured at the client over the full stream.
Competitor baseline
As of June 2025, DeepInfra and Alibaba Cloud DashScope are the only two commercial providers serving Qwen3-30B-A3B, per Artificial Analysis. Numbers below are from their published benchmarks; we have not independently verified them.
| Provider | Throughput tok/s | Output cost / 1M tokens | Hardware | Source |
|---|---|---|---|---|
DeepInfra Qwen/Qwen3-30B-A3B | 83.8 | $0.29 | H-class, FP8 | Artificial Analysis ↗ |
Alibaba Cloud qwen3-30b-a3b (DashScope) | 86.1 | $0.80 | Proprietary | Artificial Analysis ↗ |
pinstripes ps/qwen3-30b-a3b (AWQ INT4) | 105 (c=1) | $0.113 | 1× RTX A6000 48 GB (spot) | This report |
Competitor throughput figures are single-stream (concurrency=1). Our 105 tok/s is also concurrency=1, making the comparison direct. The c=8–32 rows characterise a single GPU under load; in production additional GPUs are added automatically so each user is never throttled.
Methodology
What it means for agentic workloads
Agentic systems — Claude Code, AutoGen, Hermes, custom tool-call loops — issue inference requests in tight sequences where each step's output feeds the next. The wall-clock time for a task is dominated by how fast the model generates tokens, not network latency.
At 105 tok/s per user, a 1 000-token reasoning step completes in under 10 seconds. At DeepInfra's published 83.8 tok/s, the same step takes 12 seconds — a 20% tax on every agent loop iteration, compounded across hundreds of tool calls per task.
And because the fleet scales horizontally, there are no rate limits and no queuing. When you fire 50 parallel agent threads, new GPU instances spin up to serve them. Every user always sees ~105 tok/s — not the degraded throughput of a shared, saturated GPU. You never need to throttle your agents to fit within a provider's concurrency cap.
Try it yourself
OpenAI-compatible. Change one line, keep everything else.
All measurements made June 2025. Competitor data sourced from Artificial Analysis (artificialanalysis.ai), which publishes independent benchmarks for LLM API providers. Our benchmark code is available on request. We will rerun and publish updated figures whenever we make material changes to the inference stack.