1. Ollama

Features:

- Runs open models such as Llama 3 locally, entirely on your own machine
- Simple CLI for pulling, running, and listing models
- Built-in REST API server on port 11434 that any backend can call

How it works:

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3

# Run it manually (opens an interactive chat)
ollama run llama3

# List available models
ollama list

# Start the local server on port 11434 (default)
ollama serve
```
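To check that the server is responding, you can query the tags endpoint, which returns the same model list as `ollama list`:

```bash
curl http://localhost:11434/api/tags
```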

Your backend can then call `http://localhost:11434/api/generate`.
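For example, a minimal generation request (the JSON fields follow Ollama's documented API; `"stream": false` asks for one complete JSON response instead of a token-by-token stream):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

The reply is a JSON object whose `response` field holds the generated text.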

Extra tip:

If you want to access it remotely (e.g. from a backend hosted elsewhere), bind the server to all interfaces:

```bash
export OLLAMA_HOST=0.0.0.0
ollama serve
```

You can then call it from another machine at `http://<server-ip>:11434/api/generate`.
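The same request then works over the network (replace `<server-ip>` with your server's address; note that binding to 0.0.0.0 exposes the unauthenticated API to anyone who can reach that port, so restrict access with a firewall or reverse proxy if the machine is not on a private network):

```bash
curl http://<server-ip>:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Hello from the backend",
  "stream": false
}'
```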