Benchmark report · June 2025

Qwen3-30B-A3B: pinstripes vs.
DeepInfra & Alibaba Cloud

We measured throughput, TTFT, and per-token cost for Qwen3-30B-A3B under concurrent load on our stack and compared the results against published numbers from the two commercial providers that currently serve this model. Our hardware: a single RTX A6000 48 GB on Vast.ai spot at $0.376/hr. Their hardware: H-class datacenter GPUs.

+25%

faster single-user throughput
vs DeepInfra published

2.6×

cheaper than DeepInfra
($0.113 vs $0.29 / 1M output)

7.1×

cheaper than Alibaba Cloud
($0.113 vs $0.80 / 1M output)

Our results

Single-GPU throughput under load

Model: Qwen3-30B-A3B-AWQ (INT4 weights, INT4 KV cache) · Hardware: 1× RTX A6000 48 GB · Stack: vLLM (our fork) + AWQ-Marlin kernel · Cost: $0.113 / 1M output tokens · In production, the fleet scales horizontally — users never queue behind a saturated GPU.

Concurrency	Throughput tok/s aggregate	tok/s/user aggregate ÷ N	TTFT p50 ms	TPOT p50 ms / token
1	105	105	70	9.1
8	440	55	142	16.9
16	643	40	155	22.6
32	923	29	273	29.6

Prompt: 512 tokens, completion: 256 tokens. Measured with OpenAI-compatible streaming client. TPOT = inter-token gap, measured at the client over the full stream.

Competitor baseline

Published provider numbers

As of June 2025, DeepInfra and Alibaba Cloud DashScope are the only two commercial providers serving Qwen3-30B-A3B, per Artificial Analysis. Numbers below are from their published benchmarks; we have not independently verified them.

Provider	Throughput tok/s	Output cost / 1M tokens	Hardware	Source
DeepInfra Qwen/Qwen3-30B-A3B	83.8	$0.29	H-class, FP8	Artificial Analysis ↗
Alibaba Cloud qwen3-30b-a3b (DashScope)	86.1	$0.80	Proprietary	Artificial Analysis ↗
pinstripes ps/qwen3-30b-a3b (AWQ INT4)	105 (c=1)	$0.113	1× RTX A6000 48 GB (spot)	This report

Competitor throughput figures are single-stream (concurrency=1). Our 105 tok/s is also concurrency=1, making the comparison direct. The c=8–32 rows characterise a single GPU under load; in production additional GPUs are added automatically so each user is never throttled.

Methodology

How we measured

Hardware

GPU: RTX A6000 48 GB (× 1)
Host: Vast.ai spot instance
Cost: $0.376/hr compute
TP size: 1 (single-GPU MoE)
GPU util: 100% SM, 243 W, 38.6 GB VRAM

Stack

Engine: vLLM (pinstripes fork)
Weights: AWQ 4-bit (AWQ-Marlin kernel)
KV cache: INT4 quantisation
Context: 32 768 tokens max
Routing: EPLB (expert-parallel load balancing)

Benchmark protocol

Client: OpenAI SDK, streaming mode
Prompt: 512 tokens (fixed)
Completion: 256 tokens (max)
Concurrency: 1, 8, 16, 32 simultaneous requests
TTFT: time to first streaming token
TPOT: mean inter-token gap over full stream
Throughput: aggregate tok/s across all concurrent streams

Caveats

Competitor numbers are single-stream from Artificial Analysis; not independently verified by us.
AWQ INT4 vs FP8: AWQ uses less VRAM and is cheaper to run but slightly lower precision. Quality impact on MoE models is negligible on standard benchmarks.
Spot instance pricing varies; $0.376/hr was the Vast.ai market rate at time of test.
GPU model and time of day affect results; reported figures are from a warm-cache run after the model was fully loaded.

What it means for agentic workloads

Fast tokens, no rate limits, no queuing

Agentic systems — Claude Code, AutoGen, Hermes, custom tool-call loops — issue inference requests in tight sequences where each step's output feeds the next. The wall-clock time for a task is dominated by how fast the model generates tokens, not network latency.

At 105 tok/s per user, a 1 000-token reasoning step completes in under 10 seconds. At DeepInfra's published 83.8 tok/s, the same step takes 12 seconds — a 20% tax on every agent loop iteration, compounded across hundreds of tool calls per task.

And because the fleet scales horizontally, there are no rate limits and no queuing. When you fire 50 parallel agent threads, new GPU instances spin up to serve them. Every user always sees ~105 tok/s — not the degraded throughput of a shared, saturated GPU. You never need to throttle your agents to fit within a provider's concurrency cap.

Try it yourself

OpenAI-compatible. Change one line, keep everything else.

Get API key Docs

All measurements made June 2025. Competitor data sourced from Artificial Analysis (artificialanalysis.ai), which publishes independent benchmarks for LLM API providers. Our benchmark code is available on request. We will rerun and publish updated figures whenever we make material changes to the inference stack.

Qwen3-30B-A3B: pinstripes vs.DeepInfra & Alibaba Cloud