Tag: llm-inference

All the articles with the tag "llm-inference".

vLLM, Quantization, and Serving LLMs on a Budget

16 Apr, 2024

Self-hosting an open model when GPUs are scarce and finance is reading the bill. Continuous batching, KV-cache, what quantization actually costs you, and when to just call a hosted API instead.

Llama 2 Is Here. Should You Self-Host?

15 Aug, 2023

The week Llama 2 dropped, half my inbox asked whether to pull inference in-house. The break-even math, the GPU scarcity, and the on-call tax nobody puts in the spreadsheet.
