LLM Inference Speed Estimator
Select an LLM model, quantization format, and GPU to estimate real-time token generation speed. The estimate is based on memory bandwidth, the true bottleneck of autoregressive inference.
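The underlying math is simple enough to show inline. Below is a minimal TypeScript sketch of a bandwidth-bound estimate (not the page's actual code; the bytes-per-parameter table, model size, and bandwidth figure are illustrative assumptions). It assumes each generated token streams every weight from VRAM exactly once.

```typescript
type Quant = "fp16" | "q8" | "q4" | "q2";

// Approximate bytes per parameter for each quantization format
// (an assumption; real quantized files carry some metadata overhead).
const BYTES_PER_PARAM: Record<Quant, number> = {
  fp16: 2,
  q8: 1,
  q4: 0.5,
  q2: 0.25,
};

// Decode speed is memory-bandwidth-bound: each new token requires
// reading all model weights once, so tokens/s ~ bandwidth / model size.
function estimateTokensPerSecond(
  paramsBillions: number,
  quant: Quant,
  bandwidthGBps: number,
): number {
  const modelSizeGB = paramsBillions * BYTES_PER_PARAM[quant];
  return bandwidthGBps / modelSizeGB;
}

// Example: a 7B model at q4 on a GPU with ~1000 GB/s of bandwidth
// gives a theoretical ceiling of roughly 286 tokens/s.
console.log(estimateTokensPerSecond(7, "q4", 1000).toFixed(0)); // "286"
```

Real throughput lands below this ceiling because of KV cache reads, kernel launch overhead, and imperfect bandwidth utilization, but under this model halving the bytes per parameter roughly doubles decode speed on the same card.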
1. Select LLM Model
Select a model family
2. Quantization
- fp32/fp16 = full precision, max quality
- q8 = 8-bit, near lossless
- q4 = 4-bit, best size/quality tradeoff
- q2 = 2-bit, very small, lower quality
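To make the size tradeoff concrete, here is a small sketch (illustrative only; it ignores KV cache and runtime overhead, and the 13B parameter count is a hypothetical example) of how bit width maps to weight footprint:

```typescript
// Weight footprint in GB for a given parameter count and bit width
// (ignores KV cache and activations; an illustrative simplification).
function weightSizeGB(paramsBillions: number, bitsPerParam: number): number {
  return (paramsBillions * bitsPerParam) / 8;
}

// Hypothetical 13B model across formats:
console.log(weightSizeGB(13, 16)); // fp16 -> 26 GB
console.log(weightSizeGB(13, 8));  // q8   -> 13 GB
console.log(weightSizeGB(13, 4));  // q4   -> 6.5 GB
console.log(weightSizeGB(13, 2));  // q2   -> 3.25 GB
```

Since decode speed scales inversely with weight size under the bandwidth model above, a format's size reduction translates almost directly into a speed increase, which is why q4 is often the sweet spot.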
3. Select GPU
Select a GPU brand
Configure your estimate
Select a model, quantization, and GPU to see the estimated token generation speed.
1. Choose an LLM model
2. Select quantization format
3. Pick your inference GPU