Deploy your first LLM on GCP: Gemma with Cloud Run (Serverless & GPU-powered)

3 min read · Mar 28, 2025

What’s this all about?

Large Language Models (LLMs) like Gemma, Mistral, or LLaMA are getting easier to run — but they’re still big and need powerful machines. What if you could deploy them in the cloud, only pay when you use them, and skip all the complicated infrastructure?

That’s what we’ll learn today: how to run a powerful LLM in the cloud using:

  • Ollama — a tool to run LLMs easily
  • Google Cloud Run — a service that runs containers on-demand, even with GPUs

Why would you want to do this?

Running LLMs locally is fun, but:

  • You may not have a good enough GPU
  • It’s hard to share your model with others
  • You might want to integrate it with a real app

By using Google Cloud Run with GPU, you can:

  • Use a powerful NVIDIA GPU without owning one
  • Only pay when someone uses it (no idle cost)
  • Deploy in just a few steps
  • Stay fully serverless (no servers to manage!)

What are we going to deploy?

We’ll deploy the Gemma 3 4B model using Ollama in a Docker container, hosted on Cloud Run with a GPU. It’ll be private, scalable, and cost-efficient.

What do you need?

  • A Google Cloud account
  • Billing enabled
  • Basic knowledge of the terminal
  • Docker installed
  • Optional: Terraform (for automation)
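
If you prefer to set the project up from the terminal, enabling the services used in the steps below is a good start (a minimal sketch; YOUR_PROJECT is a placeholder):

gcloud services enable run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com \
  --project YOUR_PROJECT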

Step 1: Create the Dockerfile

We use the official Ollama image and preload the model at build time:

# Gemma 3 needs a recent Ollama release (0.6 or newer)
FROM ollama/ollama:0.6.2

ENV OLLAMA_HOST=0.0.0.0:8080 \
    OLLAMA_MODELS=/models \
    OLLAMA_DEBUG=false \
    OLLAMA_KEEP_ALIVE=-1 \
    MODEL=gemma3:4b

# ollama pull needs a running server, so start one temporarily during the build
RUN ollama serve & sleep 5 && ollama pull ${MODEL}

ENTRYPOINT ["ollama", "serve"]

This makes Ollama serve the model with the weights already baked into the image, so the container doesn’t have to download them at startup.
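
If you want to sanity-check the image locally before pushing it (optional, and assuming Docker is installed; the build downloads the model weights, and inference without a GPU will be slow), a quick smoke test could look like this:

docker build -t ollama-gemma .
docker run --rm -p 8080:8080 ollama-gemma
# in another terminal:
curl http://localhost:8080/api/generate -d '{"model": "gemma3:4b", "prompt": "Hello"}'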

Step 2: Build and push your image

gcloud builds submit \
  --tag us-central1-docker.pkg.dev/YOUR_PROJECT/YOUR_REPO/ollama-gemma \
  --project YOUR_PROJECT

This builds the image with Cloud Build and pushes it to Artifact Registry.
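
If the build fails because the Docker repository doesn’t exist yet, create it once and re-run the build (YOUR_REPO is a placeholder matching the tag above):

gcloud artifacts repositories create YOUR_REPO \
  --repository-format=docker \
  --location=us-central1 \
  --project YOUR_PROJECT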

🚀 Step 3: Deploy to Cloud Run (with GPU)

gcloud beta run deploy ollama-gemma \
  --image us-central1-docker.pkg.dev/YOUR_PROJECT/YOUR_REPO/ollama-gemma \
  --region us-central1 \
  --cpu=4 \
  --memory=24Gi \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --max-instances=1 \
  --min-instances=0 \
  --concurrency=1 \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --timeout=600

📌 This means:

  • min-instances=0: the service scales to zero, so there is no cost when nothing is running
  • gpu=1 / gpu-type=nvidia-l4: an NVIDIA L4 GPU is attached only while an instance is up
  • concurrency=1: one request at a time, which keeps the single GPU instance from being overloaded
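
Once the deployment finishes, you can retrieve the service URL (used in the next step) with something like:

gcloud run services describe ollama-gemma \
  --region us-central1 \
  --format 'value(status.url)'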

🌐 Step 4: Call your model

Once deployed, you’ll get a private URL (something like https://ollama-gemma-xxxx.a.run.app).
From another app, you can send a request like:

curl -X POST https://YOUR_URL/api/generate \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3:4b", "prompt": "Hello, who are you?"}'

(You’ll need proper IAM auth: the caller must hold the Cloud Run invoker role on the service. For quick tests you can instead make it public.)
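
One way to grant that access is to give a specific account the invoker role on the service (the member value below is a placeholder):

gcloud run services add-iam-policy-binding ollama-gemma \
  --region us-central1 \
  --member "user:you@example.com" \
  --role "roles/run.invoker"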

How much does it cost?

  • NVIDIA L4 GPU: ~$0.73 / hour
  • 4 vCPU: ~$0.126 / hour
  • 24 GiB RAM: ~$0.10 / hour
  • TOTAL: ~$1 / hour (only when running!)

With min-instances=0, your cost at idle is $0.00 💡

Benefits

  • No GPU? No problem. Run LLMs from anywhere
  • Scalable: can grow as needed
  • Pay per use: perfect for occasional or on-demand workloads
  • No server management: pure serverless

Limitations

  • Cold start time (~20–40s) since the GPU has to spin up
  • Model size matters: use smaller models (like Gemma 2B–4B) for faster response
  • Not ideal for real-time chat unless you keep the instance warm (which costs more)
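
If you do need to keep one instance warm for lower latency, one option (at roughly the hourly price above) is to raise the minimum instance count:

gcloud beta run services update ollama-gemma \
  --region us-central1 \
  --min-instances 1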

What’s next?

Once this works, you can:

  • Add API Gateway or Authentication
  • Use Terraform to automate and manage the deployment
  • Add CI/CD to deploy new models or versions
  • Expose this LLM via a chatbot or app

📌 TL;DR (Too Long; Didn’t Read)

You can now:

  • Deploy a local LLM to the cloud
  • Use a GPU only when needed
  • Avoid paying when it’s not used
  • Stay 100% serverless

Conclusion

Running your own language model in the cloud is no longer just for big companies. Tools like Ollama + Cloud Run GPU make it simple, fast, and accessible. Whether you’re experimenting with LLMs, building a startup, or integrating AI into your app, you can do it with minimal cost and effort.

Try it out, and you’ll be surprised how easy it is to deploy your own model.
The power of AI is now just a few commands away. 💡

Written by Lionel Owono

Jesus’s disciple, beloved husband, beloved father and passionate scientist.
