LLM Inference Speed Estimator
Select an LLM model, quantization format, and GPU to estimate real-time token generation speed. The estimate is based on memory bandwidth, the true bottleneck of autoregressive inference.
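The underlying math is simple enough to show inline. Below is a minimal TypeScript sketch of a bandwidth-bound estimate (not the page's actual code; the bytes-per-parameter table, model size, and bandwidth figure are illustrative assumptions). It assumes each generated token streams every weight from VRAM exactly once.

```typescript
type Quant = "fp16" | "q8" | "q4" | "q2";

// Approximate bytes per parameter for each quantization format
// (an assumption; real quantized files carry some metadata overhead).
const BYTES_PER_PARAM: Record<Quant, number> = {
  fp16: 2,
  q8: 1,
  q4: 0.5,
  q2: 0.25,
};

// Decode speed is memory-bandwidth-bound: each new token requires
// reading all model weights once, so tokens/s ~ bandwidth / model size.
function estimateTokensPerSecond(
  paramsBillions: number,
  quant: Quant,
  bandwidthGBps: number,
): number {
  const modelSizeGB = paramsBillions * BYTES_PER_PARAM[quant];
  return bandwidthGBps / modelSizeGB;
}

// Example: a 7B model at q4 on a GPU with ~1000 GB/s of bandwidth
// gives a theoretical ceiling of roughly 286 tokens/s.
console.log(estimateTokensPerSecond(7, "q4", 1000).toFixed(0)); // "286"
```

Real throughput lands below this ceiling because of KV cache reads, kernel launch overhead, and imperfect bandwidth utilization, but under this model halving the bytes per parameter roughly doubles decode speed on the same card.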
1. Select LLM Model
Select a model family
2. Quantization
- fp32/fp16 = full precision, max quality
- q8 = 8-bit, near lossless
- q4 = 4-bit, best size/quality tradeoff
- q2 = 2-bit, very small, lower quality
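To make the size tradeoff concrete, here is a small sketch (illustrative only; it ignores KV cache and runtime overhead, and the 13B parameter count is a hypothetical example) of how bit width maps to weight footprint:

```typescript
// Weight footprint in GB for a given parameter count and bit width
// (ignores KV cache and activations; an illustrative simplification).
function weightSizeGB(paramsBillions: number, bitsPerParam: number): number {
  return (paramsBillions * bitsPerParam) / 8;
}

// Hypothetical 13B model across formats:
console.log(weightSizeGB(13, 16)); // fp16 -> 26 GB
console.log(weightSizeGB(13, 8));  // q8   -> 13 GB
console.log(weightSizeGB(13, 4));  // q4   -> 6.5 GB
console.log(weightSizeGB(13, 2));  // q2   -> 3.25 GB
```

Since decode speed scales inversely with weight size under the bandwidth model above, a format's size reduction translates almost directly into a speed increase, which is why q4 is often the sweet spot.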
3. Select GPU
Select a GPU brand
Configure your estimate
Select a model, quantization, and GPU to see the estimated token generation speed.
1. Choose an LLM model
2. Select quantization format
3. Pick your inference GPU