19  LLM Inference Frameworks (2024-2025)

The landscape of production LLM inference has matured dramatically in 2024-2025, with vLLM, Text Generation Inference, and Ollama emerging as the dominant frameworks. For your hospital deployment serving 27B-120B parameter models on A100/V100 GPUs, vLLM stands out as the optimal choice for production workloads, with TGI as a strong alternative for Hugging Face-centric deployments. Ollama excels for development and edge scenarios but lacks the throughput for high-concurrency medical applications.

19.1 Executive recommendations

Primary recommendation: vLLM for production inference paired with LangChain for orchestration, Qdrant for RAG vector storage, and LiteLLM as an optional API gateway layer. This stack provides exceptional GPU efficiency (up to 6,000 tokens/second on A100), mature function calling support, comprehensive model format compatibility, and proven production reliability. Deploy via Docker Compose initially, scaling to Kubernetes as user load increases beyond 500 concurrent users.

Alternative recommendation: Text Generation Inference (TGI) if your team is deeply embedded in the Hugging Face ecosystem or requires extreme long-context performance (TGI shows 13x speedup over vLLM for 200K+ token prompts). TGI offers enterprise-grade stability, comprehensive quantization support, and official Hugging Face backing.

Development/prototyping recommendation: Ollama for rapid iteration during initial development phases. Ollama’s exceptional ease of use and native GGUF support make it ideal for proof-of-concept work, but plan a migration to vLLM or TGI before production deployment: Ollama’s throughput plateaus at roughly 22 requests/second, far below what vLLM’s continuous batching delivers under load.

19.2 Core framework comparison matrix

| Framework | GPU Efficiency | Model Format Support | REST API | Production Readiness | Tool Calling | Best Use Case |
|---|---|---|---|---|---|---|
| vLLM | ⭐⭐⭐⭐⭐ Up to 6K TPS on A100 | HuggingFace (native), GGUF (experimental) | OpenAI-compatible, excellent docs | ⭐⭐⭐⭐⭐ Battle-tested | ⭐⭐⭐⭐ Parser-dependent | High-throughput production |
| TGI | ⭐⭐⭐⭐⭐ Excellent, 13x faster on long context | HuggingFace (native), no GGUF | OpenAI-compatible, enterprise-grade | ⭐⭐⭐⭐⭐ Powers HF production | ⭐⭐⭐⭐⭐ Guidance system | HuggingFace ecosystem |
| Ollama | ⭐⭐⭐ Good (22 req/s plateau) | GGUF (native), HuggingFace (conversion) | OpenAI-compatible, simple | ⭐⭐⭐⭐ Local/edge production | ⭐⭐⭐⭐ Native support | Development, edge deployment |
| llama.cpp | ⭐⭐⭐⭐ Excellent for GGUF | GGUF (native) | OpenAI-compatible server | ⭐⭐⭐ Requires setup | ⭐⭐⭐ Supported | Resource-constrained, CPU fallback |
| LiteLLM | ⭐⭐⭐ Proxy layer (10-50ms overhead) | Universal (100+ providers) | OpenAI-compatible, unified | ⭐⭐⭐⭐⭐ Production gateway | ⭐⭐⭐⭐⭐ Universal | Multi-provider abstraction |
| OpenLLM | ⭐⭐⭐⭐ vLLM-backed | HuggingFace primary | OpenAI-compatible | ⭐⭐⭐ BentoML ecosystem | ⚠️ Limited | BentoML users |
| FastChat | ⭐⭐⭐ Moderate | Wide model support | OpenAI-compatible | ⭐⭐ Academic focus | ❌ No support | Research, evaluation |

19.3 vLLM: Deep analysis for hospital deployment

Performance characteristics on A100/V100 GPUs:

vLLM achieves exceptional throughput through PagedAttention and continuous batching. On dual A100 40GB GPUs, benchmarks show 3,700-6,000 tokens/second for Llama-3-8B at 50-100 concurrent requests, 2,400 tokens/second for Gemma-3-27B, and approximately 1,000 tokens/second for 32B models. Recent v0.6.0 updates delivered 2.7x throughput improvement and 5x latency reduction on Llama 8B compared to v0.5.3. The PagedAttention mechanism reduces memory fragmentation, enabling 3x higher throughput by optimizing KV cache management.

For your target 27B models such as Gemma 3 27B and MedGemma 27B, expect around 2,000-2,500 tokens/second on dual A100 40GB with tensor parallelism. The 120B gpt-oss model requires multi-GPU tensor parallelism (4x A100 80GB recommended) and should achieve approximately 200-400 tokens/second depending on batch size and context length.

Model format and configuration:

vLLM natively supports HuggingFace Transformers models with automatic loading from the Hub. GGUF support arrived in v0.6.0 but remains experimental: only single-file GGUF models work, and the tokenizer must be specified manually. For production hospital deployment, stick with the HuggingFace format to avoid these GGUF limitations. System prompts are configured via the chat template using {"role": "system", "content": "..."} messages. Runtime parameters include temperature, top_p, top_k, repetition_penalty, max_tokens, and advanced options such as frequency_penalty and presence_penalty.
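
A minimal sketch of exercising these options against the vLLM OpenAI-compatible endpoint with the openai Python client; the base URL matches the Docker example later in this section, and the model name is a placeholder that must match whatever --model the server was started with:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="served-model-name",  # placeholder: must match the server's --model value
    messages=[
        {"role": "system", "content": "You are a clinical documentation assistant."},
        {"role": "user", "content": "Summarize the key findings in this discharge note: ..."},
    ],
    temperature=0.2,
    top_p=0.9,
    max_tokens=512,
    # Engine-specific sampling options (top_k, repetition_penalty) pass through extra_body.
    extra_body={"top_k": 40, "repetition_penalty": 1.1},
)
print(response.choices[0].message.content)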

Function calling implementation:

vLLM’s function calling matured significantly in 2024. Enable it with the --enable-auto-tool-choice and --tool-call-parser llama3_json flags. Supported parsers include llama3_json, mistral, hermes, granite, internlm, and xlam. The framework uses guided decoding via the Outlines library to ensure valid JSON output. tool_choice options include auto, required (v0.8.3+), none, and named function calling. Critical consideration: function calling requires models natively trained for tool use (Llama 3.1, Mistral, and Granite work well; smaller 7B models struggle with parallel calls).
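
A hedged example of the resulting OpenAI-compatible tools API, assuming a vLLM server started with the flags above; the lookup_drug_interactions schema is purely illustrative:

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_drug_interactions",
        "description": "Check for known interactions between two drugs.",
        "parameters": {
            "type": "object",
            "properties": {
                "drug_a": {"type": "string"},
                "drug_b": {"type": "string"},
            },
            "required": ["drug_a", "drug_b"],
        },
    },
}]

response = client.chat.completions.create(
    model="served-model-name",  # must match the server's --model value
    messages=[{"role": "user", "content": "Does warfarin interact with amiodarone?"}],
    tools=tools,
    tool_choice="auto",
)

# The parser converts the model's JSON output into structured tool_calls.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))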

Docker deployment:

Official Docker images are available as vllm/vllm-openai:latest with NVIDIA GPU support. Deployment requires the nvidia-container-toolkit and the --gpus all flag. For tensor parallelism across multiple GPUs, use --tensor-parallel-size 2 for dual-GPU setups. Example deployment:

docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-27B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json

Production readiness: Highly active maintenance with 60,000+ GitHub stars and 1,000+ contributors. Used in production by Anyscale, Databricks, and Neural Magic. Biweekly office hours and fast bug response. Comprehensive documentation at docs.vllm.ai.

19.4 Text Generation Inference: Alternative for HuggingFace workflows

Performance and optimization:

TGI excels in long-context scenarios, demonstrating 13x faster performance than vLLM on prompts exceeding 200,000 tokens (2 seconds vs 27.5 seconds) and 3x token capacity per GPU. Recent v3.0 (December 2024) introduced zero-config mode that automatically optimizes settings based on hardware. Continuous batching, Flash Attention, and Paged Attention provide excellent throughput. Production deployments at Grammarly, Uber, and Deutsche Telekom validate stability.

Comprehensive quantization support:

TGI offers the industry’s most comprehensive quantization options: bitsandbytes (4-bit NF4/FP4), GPTQ, AWQ, EETQ, fp8, and Marlin. This flexibility enables memory-constrained deployments and performance tuning. For A100 GPUs with limited memory, 4-bit quantization can enable larger models while maintaining acceptable accuracy.

Function calling via Guidance:

TGI implements function calling through its Guidance feature using constrained decoding. The grammar parameter enforces JSON schema validation, ensuring structured outputs. OpenAI-compatible tools API works directly with OpenAI client libraries. Less flexible than vLLM’s parser system but production-ready and well-documented.
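
A sketch of the grammar parameter on TGI's native /generate endpoint; the schema and field names are illustrative, and the exact grammar payload shape should be verified against the Guidance documentation for your TGI version:

import requests

schema = {
    "type": "object",
    "properties": {
        "diagnosis": {"type": "string"},
        "icd10_code": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["diagnosis", "icd10_code"],
}

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Extract the primary diagnosis from: 'Patient presents with acute exacerbation of COPD.'",
        "parameters": {
            "max_new_tokens": 128,
            # Constrained decoding: output is forced to conform to the JSON schema.
            "grammar": {"type": "json", "value": schema},
        },
    },
    timeout=60,
)
print(resp.json()["generated_text"])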

Docker deployment:

Official images are published at ghcr.io/huggingface/text-generation-inference:3.3.4 with excellent documentation. Deployment is straightforward:

docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:3.3.4 \
  --model-id meta-llama/Llama-3.1-27B-Instruct \
  --quantize bitsandbytes-nf4

When to choose TGI over vLLM: Deep Hugging Face integration needs, enterprise support requirements, extreme long-context optimization (100K+ tokens), or comprehensive quantization flexibility.

19.5 Ollama: Development and edge deployment

Strengths and limitations:

Ollama prioritizes simplicity and developer experience. Native GGUF support, curated model registry, and one-command deployment (ollama run llama3.1) make it exceptional for prototyping. However, throughput plateaus at approximately 22 requests/second at 32 concurrent users - a critical limitation for hospital production environments expecting hundreds of concurrent users.

Appropriate use cases for hospital deployment:

Use Ollama for initial proof-of-concept development, testing prompt engineering strategies, evaluating model capabilities, and edge deployment scenarios (local workstations without server connectivity). Plan migration to vLLM or TGI before production rollout to avoid performance bottlenecks.

Function calling: Native support was added in mid-2024 with excellent Python integration: pass Python functions directly as tools and the schema is generated automatically from docstrings and type hints. Streaming tool calls are also supported by the newer parser.
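
A sketch of that Python integration, assuming the ollama Python package with tool support; the check_allergy function and model tag are illustrative:

import ollama

def check_allergy(patient_id: str, drug: str) -> str:
    """Return whether the patient has a recorded allergy to the given drug."""
    return "no recorded allergy"  # stub for illustration

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Is patient 123 allergic to penicillin?"}],
    tools=[check_allergy],  # schema derived from the signature and docstring
)

# Execute whichever tools the model decided to call (arguments arrive as a dict).
for call in response.message.tool_calls or []:
    if call.function.name == "check_allergy":
        print(check_allergy(**call.function.arguments))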

19.6 llama.cpp and LiteLLM: Specialized use cases

llama.cpp ecosystem:

The foundation for GGUF model serving with exceptional CPU performance and multi-platform support. llama-server provides OpenAI-compatible HTTP endpoints. Best for resource-constrained environments, CPU fallback scenarios, or quantized GGUF deployment. Performance: ~277 tokens/second for 7B models on A100, ~30 tokens/second for 70B models. For your 27B-120B target models, llama.cpp works but vLLM will deliver 5-10x better throughput on A100 GPUs.

LiteLLM as API gateway:

LiteLLM functions as a unified proxy layer supporting 100+ LLM providers. Deploy as middleware between your application and inference backends for multi-provider abstraction, automatic fallbacks, load balancing, cost tracking, and unified OpenAI-compatible API. Adds 10-50ms latency but provides exceptional operational flexibility. Consider for production as your API gateway layer sitting in front of vLLM/TGI inference servers.

Architecture pattern:

Client → LiteLLM Gateway → vLLM (primary) / TGI (backup)
         ↓
    Monitoring, cost tracking, rate limiting
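
Because LiteLLM exposes an OpenAI-compatible surface, application code only ever targets the gateway; a short sketch, where the model alias and API key are placeholders defined in the gateway's own configuration:

from openai import OpenAI

client = OpenAI(base_url="http://litellm:4000/v1", api_key="sk-gateway-key")

response = client.chat.completions.create(
    model="hospital-27b",  # alias from the LiteLLM model list, not a HuggingFace model id
    messages=[{"role": "user", "content": "Draft a discharge summary template."}],
)
print(response.choices[0].message.content)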

19.7 Model Context Protocol (MCP) integration

Current ecosystem status:

MCP represents an emerging standard for connecting AI models with external data sources and tools. As of late 2024/early 2025, the ecosystem remains nascent with limited production implementations. No major LLM inference framework currently offers native MCP support, so integration requires custom development.

Integration approaches for hospital deployment:

Middleware proxy pattern (recommended): Implement an MCP client in a separate orchestration layer that sits between your application and vLLM/TGI. This layer handles MCP server connections, tool routing, and response formatting. Python MCP SDKs are available (check Anthropic’s MCP specification repository for the latest).
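
A heavily hedged sketch of that middleware layer using the MCP Python SDK (API names as of current SDK releases; verify against the version you install). The knowledge-base server command and the search_guidelines tool are hypothetical:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def fetch_context(query: str) -> str:
    params = StdioServerParameters(command="python", args=["medical_kb_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the tools the MCP server exposes (useful for routing/logging).
            tools = await session.list_tools()
            # Call one tool to retrieve context for the downstream LLM request.
            result = await session.call_tool("search_guidelines", {"query": query})
            return str(result.content)

context = asyncio.run(fetch_context("sepsis management bundle"))
# The orchestration layer would now prepend `context` to the prompt sent to vLLM/TGI.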

Embedded MCP client: Build MCP client directly into your business logic service. This service queries MCP servers for context (medical knowledge bases, patient records, drug databases) before calling the LLM inference endpoint.

Practical recommendation: Given MCP’s early stage, prioritize OpenAI-compatible function calling for your initial deployment. Function calling provides mature, well-documented tool integration. Monitor MCP ecosystem development for future adoption as standards solidify and implementations mature.

19.8 RAG implementation for medical document retrieval

Vector database selection:

Qdrant emerges as the optimal choice for medical RAG deployments. Its published benchmarks show up to 4x performance gains over competitors, and it offers the strong metadata filtering that medical data demands (patient demographics, diagnosis codes, document types), transactional guarantees for data integrity, and production-grade Docker deployment. The Rust implementation delivers low latency and high requests per second. For massive scale (millions of documents), Milvus provides proven scalability for billion-vector deployments.

Avoid: Pinecone due to data governance concerns with PHI in managed cloud services. Chroma works well for prototyping but lacks multi-node scalability for production hospital loads.

RAG orchestration framework:

LangChain provides the most comprehensive tool integration and workflow orchestration capabilities. Use for complex medical workflows requiring multiple data sources, external API integration (drug databases, clinical trial lookups), and agent-based reasoning. Learning curve steeper than alternatives but offers maximum flexibility for evolving requirements.

LlamaIndex excels for RAG-focused applications with simpler architecture needs. Superior indexing structures and query optimization for pure retrieval scenarios. Choose if your primary use case centers on document search and question-answering without complex multi-step workflows.

Medical-specific considerations:

Embedding models: BGE (BAAI/bge-large-en-v1.5) consistently outperforms medical-specific models in benchmarks. Use general domain models unless specialized medical terminology requires domain adaptation.
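
A minimal sketch of generating BGE embeddings with sentence-transformers (the example texts are illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
chunks = [
    "Chest X-ray demonstrates bilateral lower-lobe infiltrates.",
    "Patient started on IV ceftriaxone and azithromycin for community-acquired pneumonia.",
]
# Normalize so cosine similarity reduces to a dot product; for search queries, BGE v1.5
# recommends prefixing "Represent this sentence for searching relevant passages: ".
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024) for bge-large-en-v1.5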

Chunking strategy: 512-1024 tokens per chunk with 10-20% overlap preserves clinical context. Split at semantic boundaries (section headers in clinical notes) rather than arbitrary character counts.
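
A simple sketch of that strategy: split on clinical section headers first, then window each section with overlap. Word counts stand in for token counts here, and the header regex and sizes are illustrative:

import re

def chunk_note(note: str, max_tokens: int = 768, overlap: int = 115) -> list[str]:
    # Split at lines that look like clinical section headers, e.g. "ASSESSMENT:" or "PLAN:".
    sections = re.split(r"\n(?=[A-Z][A-Z /]+:)", note)
    chunks = []
    step = max_tokens - overlap  # ~15% overlap between consecutive windows
    for section in sections:
        words = section.split()  # crude token proxy; swap in a real tokenizer if needed
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + max_tokens]))
            if start + max_tokens >= len(words):
                break
    return chunks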

Hybrid search: Essential for medical terminology. Combine semantic search with keyword matching to handle acronyms, abbreviations, and exact term matching (drug names, diagnosis codes). Both Qdrant and Weaviate support hybrid search natively.
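
A hedged sketch of hybrid retrieval with qdrant-client's Query API (v1.10+), assuming a clinical_docs collection created with named "dense" and "sparse" vectors and payload fields doc_id and chunk_index written at ingest time; the sparse term ids and weights are hard-coded stand-ins for a real sparse encoder such as SPLADE or BM25-style weighting:

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
client = QdrantClient(host="qdrant", port=6333)

query = "contraindications for metformin"
dense = encoder.encode(query, normalize_embeddings=True).tolist()
sparse = models.SparseVector(indices=[10452, 20871], values=[0.9, 0.7])  # placeholder terms

results = client.query_points(
    collection_name="clinical_docs",
    prefetch=[
        models.Prefetch(query=dense, using="dense", limit=50),
        models.Prefetch(query=sparse, using="sparse", limit=50),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # reciprocal rank fusion of both lists
    limit=10,
    with_payload=True,  # payload carries doc_id, version, and chunk location for citations
)
for point in results.points:
    print(point.score, point.payload["doc_id"], point.payload["chunk_index"])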

Citation tracking: Implement chunk-level citations linking generated responses to source documents. Critical for medical decision support and regulatory compliance. Track document version, chunk location, retrieval score, and metadata.

Streaming RAG patterns:

For real-time streaming responses, implement pre-fetching retrieval strategies: retrieve relevant documents before the full query completes, based on the initial tokens. Use Server-Sent Events (SSE) to stream tokens from the LLM to the client. Published research on fixed-interval streaming RAG reports up to 200% accuracy improvement with a 20% latency reduction compared to traditional RAG.
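
A sketch of the SSE relay in a FastAPI application layer; the /ask route, the served model name, and the retrieve_context stub are illustrative:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
llm = OpenAI(base_url="http://vllm:8000/v1", api_key="not-needed")

def retrieve_context(question: str) -> str:
    return ""  # placeholder for the Qdrant retrieval shown above

@app.get("/ask")
def ask(question: str):
    context = retrieve_context(question)

    def event_stream():
        stream = llm.chat.completions.create(
            model="served-model-name",  # must match the vLLM --model value
            messages=[
                {"role": "system", "content": f"Answer using only this context:\n{context}"},
                {"role": "user", "content": question},
            ],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield f"data: {chunk.choices[0].delta.content}\n\n"  # one SSE frame per delta
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")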

19.10 Docker Compose reference implementation

Production-ready deployment:

version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    volumes:
      - ./models:/root/.cache/huggingface:ro
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model meta-llama/Llama-3.1-27B-Instruct
      --tensor-parallel-size 2
      --max-model-len 4096
      --gpu-memory-utilization 0.9
      --enable-auto-tool-choice
      --tool-call-parser llama3_json
    ports:
      - "8000:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    networks:
      - hospital-ai

  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
    networks:
      - hospital-ai
    deploy:
      resources:
        limits:
          memory: 16G

  litellm:
    image: ghcr.io/berriai/litellm:main-stable
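    # NOTE: LiteLLM routes via a config file (a model_list mapping a model alias such as
    # "hospital-27b" to the vLLM service at http://vllm:8000/v1); mount one and pass it
    # with --config. Exact mount path and flag per the LiteLLM proxy docs for your version.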
    ports:
      - "4000:4000"
    environment:
      - DATABASE_URL=postgresql://user:pass@postgres:5432/litellm
    depends_on:
      - vllm
      - postgres
    networks:
      - hospital-ai

  app:
    build: ./app
    ports:
      - "8080:8080"
    environment:
      - VLLM_ENDPOINT=http://vllm:8000
      - QDRANT_HOST=qdrant
      - QDRANT_PORT=6333
      - REDIS_URL=redis://redis:6379
    depends_on:
      - vllm
      - qdrant
      - redis
    networks:
      - hospital-ai

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    networks:
      - hospital-ai

  postgres:
    image: postgres:14
    environment:
      - POSTGRES_DB=hospital_ai
      - POSTGRES_USER=admin
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - hospital-ai

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks:
      - hospital-ai

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - hospital-ai

networks:
  hospital-ai:
    driver: bridge

volumes:
  qdrant_data:
  redis_data:
  postgres_data:
  prometheus_data:
  grafana_data:

19.11 Implementation roadmap

Phase 1: MVP (2-4 weeks)

Deploy single vLLM instance with 27B model on dual A100 GPUs. Implement basic FastAPI application layer with simple prompt templates. Add Qdrant for RAG with medical literature corpus. Use Docker Compose on single server. Implement basic authentication and audit logging. Expected capacity: 50-100 concurrent users with 2-3 second response latency.

Phase 2: Production hardening (4-6 weeks)

Add LiteLLM gateway for load balancing and monitoring. Deploy 2-3 vLLM replicas for high availability. Implement comprehensive monitoring with Prometheus, Grafana, and ELK stack. Add Redis caching layer for embeddings and common queries. Implement advanced RAG with hybrid search and citation tracking. Full HIPAA compliance review and security hardening. Expected capacity: 200-500 concurrent users with 1-2 second response latency.

Phase 3: Scale and optimize (2-3 months)

Migrate to Kubernetes for auto-scaling. Add A/B testing framework for model comparison. Implement advanced function calling for external tool integration (drug databases, clinical decision support tools). Deploy multiple specialized models for different medical specialties. Advanced caching strategies and performance optimization. Expected capacity: 1,000+ concurrent users with sub-second p50 latency.

19.12 Performance benchmarks summary

vLLM on dual A100 40GB:
- 7B models: 3,700-6,000 tokens/second
- 27B models: 2,000-2,500 tokens/second
- 32B models: ~1,000 tokens/second
- 70B models: 200-400 tokens/second (requires 4x A100)

TGI on A100 80GB:
- Competitive with vLLM on standard workloads
- 13x faster on 200K+ token contexts
- 3x token capacity per GPU for long contexts

Ollama:
- 22 requests/second plateau at 32+ concurrent users
- Suitable for <50 concurrent users

Cost estimates:
- Small deployment (2x A100 40GB): $8,000-12,000/month cloud or $50,000-80,000 on-premises
- Medium deployment (2-4x A100 80GB): $15,000-25,000/month cloud or $100,000-150,000 on-premises

19.13 Critical decision factors

Choose vLLM if:
- Maximum throughput and lowest latency are critical
- Serving 100+ concurrent users in production
- Need proven production reliability and active maintenance
- Want flexibility with multiple model architectures
- HuggingFace models are your primary format

Choose TGI if:
- Deep Hugging Face ecosystem integration
- Enterprise support and stability paramount
- Long-context optimization crucial (medical documents >100K tokens)
- Need comprehensive quantization options
- Want official backing from major AI company

Choose Ollama for:
- Development and prototyping only
- Edge deployment on workstations
- Proof-of-concept demonstrations
- GGUF format requirement
- Teams prioritizing simplicity over scale

Avoid for production:
- FastChat (no function calling, academic focus)
- OpenLLM (limited maturity, requires BentoML ecosystem)
- llama.cpp server (unless CPU deployment required)

19.14 Monitoring and operational excellence

Key metrics to track:

Performance: Request volume (requests/minute), TTFT (time to first token), TPOT (time per output token), total request duration, GPU utilization percentage, GPU memory usage, queue depth.

Quality: Output relevance scores, hallucination detection rates, user feedback (thumbs up/down), citation accuracy, error rates by type.

Cost: Token usage (input + output), cost per request, GPU hours consumed, cost by department and use case.

Medical-specific: Clinical accuracy validation, guideline adherence, false positive/negative rates, source document coverage.

Recommended tooling:

OpenTelemetry for standardized traces, metrics, and logs. Prometheus for time-series metrics storage. Grafana for visualization dashboards. LangSmith or Langfuse for LLM-specific observability (prompt versioning, tracing, evaluation). ELK stack for centralized log aggregation and search.
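
A minimal sketch of exporting the performance metrics listed above from the application layer with prometheus_client, scraped by the Prometheus service in the Compose stack (metric names, labels, and port are illustrative):

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "LLM requests served", ["route", "model"])
TTFT = Histogram("llm_time_to_first_token_seconds", "Time to first token", ["model"])
TOKENS = Counter("llm_tokens_total", "Tokens processed", ["model", "direction"])

start_http_server(9400)  # exposes /metrics for Prometheus to scrape

# Inside a request handler (usage outline):
# REQUESTS.labels(route="/ask", model="hospital-27b").inc()
# with TTFT.labels(model="hospital-27b").time():
#     first_token = next(token_stream)            # stop the timer at the first token
# TOKENS.labels(model="hospital-27b", direction="output").inc(output_token_count)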

Health checks and reliability:

Implement /health (liveness), /ready (readiness with model loaded check), and /metrics (Prometheus format) endpoints. Use circuit breakers to stop requests to failing services. Implement exponential backoff for retries. Set request timeouts (30-60 seconds for medical applications). Define fallback chains: primary model → quantized backup → cached response → error message.
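
A sketch of the liveness and readiness probes in the FastAPI application layer, assuming the vllm service name from the Compose file; the fallback chain itself would live in the request path:

import httpx
from fastapi import FastAPI, Response

app = FastAPI()
VLLM_URL = "http://vllm:8000"

@app.get("/health")
def health():
    # Liveness: the application process itself is up.
    return {"status": "ok"}

@app.get("/ready")
async def ready(response: Response):
    # Readiness: the inference backend answers its own /health, i.e. the model is loaded.
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            if (await client.get(f"{VLLM_URL}/health")).status_code == 200:
                return {"status": "ready"}
    except httpx.HTTPError:
        pass
    response.status_code = 503
    return {"status": "not ready"}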

SLO targets for medical deployment:
- Availability: 99.9% (8.76 hours downtime per year)
- Latency p50: < 2 seconds
- Latency p95: < 5 seconds
- Latency p99: < 10 seconds
- Error rate: < 0.1%

19.15 Conclusion: Actionable recommendations

For your hospital deployment serving 27B-120B parameter models on A100/V100 GPUs with MCP integration requirements and production medical AI needs, implement this architecture:

Infrastructure: vLLM as primary inference engine on 2-4x NVIDIA A100 GPUs with tensor parallelism. LiteLLM as API gateway for load balancing and monitoring. Qdrant for RAG vector storage. LangChain for orchestration and tool integration. Docker Compose initially, migrate to Kubernetes at 500+ concurrent users.

Model strategy: Deploy Gemma 3 27B or MedGemma 27B (or a comparable Mistral-family model) for the primary workload. Use 4-bit quantization if memory-constrained. Avoid the GGUF format in production; stick with the HuggingFace format for maximum compatibility and performance.

Function calling: Leverage vLLM’s native function calling with the tool-call parser that matches your model family (llama3_json for Llama 3.x, mistral for Mistral, and so on). Implement OpenAI-compatible tool schemas. Start with function calling rather than waiting for MCP ecosystem maturity.

RAG implementation: Use Qdrant for vector database, BGE embeddings for medical text, 512-1024 token chunks with 10-20% overlap, hybrid search for terminology handling, and chunk-level citation tracking for medical compliance.

HIPAA compliance: Self-hosted on-premises deployment, encryption at rest and in transit, comprehensive audit logging with 6-year retention, role-based access control, and PHI de-identification where feasible.

Timeline: Achieve MVP in 2-4 weeks with basic vLLM deployment and simple RAG. Reach production-ready status in 6-10 weeks with monitoring, high availability, and compliance measures. Scale to 1,000+ users within 3-6 months with Kubernetes and optimization.

This architecture provides a solid foundation for production medical AI deployment while maintaining flexibility for future requirements and emerging technologies like MCP as the ecosystem matures.