Tag: llm-inference
All articles tagged "llm-inference".
-
vLLM, Quantization, and Serving LLMs on a Budget
Self-hosting an open model when GPUs are scarce and finance is reading the bill. Continuous batching, the KV cache, what quantization actually costs you, and when to just call a hosted API instead.
-
Llama 2 Is Here. Should You Self-Host?
The week Llama 2 dropped, half my inbox asked whether to pull inference in-house. The break-even math, the GPU scarcity, and the on-call tax nobody puts in the spreadsheet.