Benchmark report · June 2025

Qwen3-30B-A3B: pinstripes vs.DeepInfra & Alibaba Cloud

We measured throughput, TTFT, and per-token cost for Qwen3-30B-A3B under concurrent load on our stack and compared the results against published numbers from the two commercial providers that currently serve this model. Our hardware: a single RTX A6000 48 GB on Vast.ai spot at $0.376/hr. Their hardware: H-class datacenter GPUs.

+25%
faster single-user throughput
vs DeepInfra published
2.6×
cheaper than DeepInfra
($0.113 vs $0.29 / 1M output)
7.1×
cheaper than Alibaba Cloud
($0.113 vs $0.80 / 1M output)

Our results

Single-GPU throughput under load

Model: Qwen3-30B-A3B-AWQ (INT4 weights, INT4 KV cache) · Hardware: 1× RTX A6000 48 GB · Stack: vLLM (our fork) + AWQ-Marlin kernel · Cost: $0.113 / 1M output tokens · In production, the fleet scales horizontally — users never queue behind a saturated GPU.

ConcurrencyThroughput
tok/s aggregate
tok/s/user
aggregate ÷ N
TTFT p50
ms
TPOT p50
ms / token
1105105709.1
84405514216.9
166434015522.6
329232927329.6

Prompt: 512 tokens, completion: 256 tokens. Measured with OpenAI-compatible streaming client. TPOT = inter-token gap, measured at the client over the full stream.

Competitor baseline

Published provider numbers

As of June 2025, DeepInfra and Alibaba Cloud DashScope are the only two commercial providers serving Qwen3-30B-A3B, per Artificial Analysis. Numbers below are from their published benchmarks; we have not independently verified them.

ProviderThroughput
tok/s
Output cost
/ 1M tokens
HardwareSource
DeepInfra
Qwen/Qwen3-30B-A3B
83.8$0.29H-class, FP8Artificial Analysis
Alibaba Cloud
qwen3-30b-a3b (DashScope)
86.1$0.80ProprietaryArtificial Analysis
pinstripes
ps/qwen3-30b-a3b (AWQ INT4)
105 (c=1)$0.1131× RTX A6000 48 GB (spot)This report

Competitor throughput figures are single-stream (concurrency=1). Our 105 tok/s is also concurrency=1, making the comparison direct. The c=8–32 rows characterise a single GPU under load; in production additional GPUs are added automatically so each user is never throttled.

Methodology

How we measured

Hardware

  • GPU: RTX A6000 48 GB (× 1)
  • Host: Vast.ai spot instance
  • Cost: $0.376/hr compute
  • TP size: 1 (single-GPU MoE)
  • GPU util: 100% SM, 243 W, 38.6 GB VRAM

Stack

  • Engine: vLLM (pinstripes fork)
  • Weights: AWQ 4-bit (AWQ-Marlin kernel)
  • KV cache: INT4 quantisation
  • Context: 32 768 tokens max
  • Routing: EPLB (expert-parallel load balancing)

Benchmark protocol

  • Client: OpenAI SDK, streaming mode
  • Prompt: 512 tokens (fixed)
  • Completion: 256 tokens (max)
  • Concurrency: 1, 8, 16, 32 simultaneous requests
  • TTFT: time to first streaming token
  • TPOT: mean inter-token gap over full stream
  • Throughput: aggregate tok/s across all concurrent streams

Caveats

  • Competitor numbers are single-stream from Artificial Analysis; not independently verified by us.
  • AWQ INT4 vs FP8: AWQ uses less VRAM and is cheaper to run but slightly lower precision. Quality impact on MoE models is negligible on standard benchmarks.
  • Spot instance pricing varies; $0.376/hr was the Vast.ai market rate at time of test.
  • GPU model and time of day affect results; reported figures are from a warm-cache run after the model was fully loaded.

What it means for agentic workloads

Fast tokens, no rate limits, no queuing

Agentic systems — Claude Code, AutoGen, Hermes, custom tool-call loops — issue inference requests in tight sequences where each step's output feeds the next. The wall-clock time for a task is dominated by how fast the model generates tokens, not network latency.

At 105 tok/s per user, a 1 000-token reasoning step completes in under 10 seconds. At DeepInfra's published 83.8 tok/s, the same step takes 12 seconds — a 20% tax on every agent loop iteration, compounded across hundreds of tool calls per task.

And because the fleet scales horizontally, there are no rate limits and no queuing. When you fire 50 parallel agent threads, new GPU instances spin up to serve them. Every user always sees ~105 tok/s — not the degraded throughput of a shared, saturated GPU. You never need to throttle your agents to fit within a provider's concurrency cap.

Try it yourself

OpenAI-compatible. Change one line, keep everything else.

All measurements made June 2025. Competitor data sourced from Artificial Analysis (artificialanalysis.ai), which publishes independent benchmarks for LLM API providers. Our benchmark code is available on request. We will rerun and publish updated figures whenever we make material changes to the inference stack.