Introducing Docker Model Runner: A Better Way to Build and Run GenAI Models Locally

Generative AI is transforming software development, but building and running AI models locally is still harder than it should be. Today’s developers face fragmented tooling, hardware compatibility headaches, and disconnected application development workflows, all of which hinder iteration and slow down progress.  

That’s why we’re launching Docker Model Runner — a faster, simpler way to run and test AI models locally, right from your existing workflow. Whether you’re experimenting with the latest LLMs or deploying to production, Model Runner brings the performance and control you need, without the friction.

We’re also teaming up with some of the most influential names in AI and software development, including Google, Continue, Dagger, Qualcomm, HuggingFace, Spring AI, and VMware Tanzu AI Solutions, to give developers direct access to the latest models, frameworks, and tools. These partnerships aren’t just integrations; they’re a shared commitment to making AI innovation more accessible, powerful, and developer-friendly. With Docker Model Runner, you can tap into the best of the AI ecosystem from right inside your Docker workflow.

LLM development is evolving: We’re making it local-first 

Local development for applications powered by LLMs is gaining momentum, and for good reason. It offers several advantages on key dimensions such as performance, cost, and data privacy. But today, local setup is complex.  

Developers are often forced to manually integrate multiple tools, configure environments, and manage models separately from container workflows. Running a model varies by platform and depends on available hardware. Model storage is fragmented because there is no standard way to store, share, or serve models. 

The result? Rising cloud inference costs and a disjointed developer experience. With our first release, we’re focused on reducing that friction, making local model execution simpler, faster, and easier to fit into the way developers already build.

Docker Model Runner: The simple, secure way to run AI models locally

Docker Model Runner is designed to make AI model execution as simple as running a container. With this Beta release, we’re giving developers a fast, low-friction way to run models, test them, and iterate on application code that uses models locally, without all the usual setup headaches. Here’s how:

Running models locally 

With Docker Model Runner, running AI models locally is now as simple as running any other service in your inner loop. Docker Model Runner delivers this by including an inference engine as part of Docker Desktop, built on top of llama.cpp and accessible through the familiar OpenAI API. No extra tools, no extra setup, and no disconnected workflows. Everything stays in one place, so you can test and iterate quickly, right on your machine.
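For example, once the Beta is enabled, prompting a local model is a single CLI call. A minimal sketch, assuming the ai/gemma3 model from Docker Hub is already pulled (the prompt is our own):

# Prompt a locally running model straight from the Docker CLI
docker model run ai/gemma3 "Explain what an inference engine does in one sentence."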

Enabling GPU acceleration (Apple silicon)

GPU acceleration on Apple silicon helps developers get fast inference and the most out of their local hardware. By using host-based execution, we avoid the performance limitations of running models inside virtual machines. This translates to faster inference, smoother testing, and better feedback loops.

Standardizing model packaging with OCI Artifacts

Model distribution today is messy. Models are often shared as loose files or behind proprietary download tools with custom authentication. With Docker Model Runner, we package models as OCI Artifacts, an open standard that allows you to distribute and version them through the same registries and workflows you already use for containers. Today, you can easily pull ready-to-use models from Docker Hub. Soon, you’ll also be able to push your own models, integrate with any container registry, connect them to your CI/CD pipelines, and use familiar tools for access control and automation.
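In practice, that means pulling and listing models mirrors everyday image workflows. A quick sketch with the current Beta CLI:

# Models are OCI Artifacts, addressed like image references on Docker Hub
docker model pull ai/gemma3
docker model list   # show the models stored locally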

Building momentum with a thriving GenAI ecosystem

Seamless local development needs an ecosystem. That starts with meeting developers where they are, whether they’re testing model performance on their local machines or building applications that run these models. 

That’s why we’re launching Docker Model Runner with a powerful ecosystem of partners on both sides of the AI application development process. On the model side, we’re collaborating with industry leaders like Google and community platforms like HuggingFace to bring you high-quality, optimized models ready for local use. These models are published as OCI artifacts, so you can pull and run them using standard Docker commands, just like any container image.

But we aren’t stopping at models. We’re also working with application, language, and tooling partners like Dagger, Continue, Spring AI, and VMware Tanzu to ensure applications built with Model Runner integrate seamlessly into real-world developer workflows. Additionally, we’re working with hardware partners like Qualcomm to ensure high-performance inference on all platforms.

As Docker Model Runner evolves, we’ll work to expand its ecosystem of partners, allowing for ample distribution and added functionality.

Where We’re Going

This is just the beginning. With Docker Model Runner, we’re making it easier for developers to bring AI model execution into everyday workflows, securely, locally, and with a low barrier to entry. Soon, you’ll be able to run models on more platforms, including Windows with GPU acceleration, customize and publish your own models, and integrate AI into your dev loop with even greater flexibility (including Compose and Testcontainers). With each Docker Desktop release, we’ll continue to unlock new capabilities that make GenAI development easier, faster, and way more fun to build with.

Try it out now! 

Docker Model Runner is now available as a Beta feature in Docker Desktop 4.40. To get started:

On a Mac with Apple silicon

Update to Docker Desktop 4.40

Pull models developed by our partners at Docker’s GenAI Hub and start experimenting

For more information, check out the Docker Model Runner documentation.

Try it out and let us know what you think!

How can I learn more about Docker Model Runner?

Check out our available assets today! 

Turn your Mac into an AI playground (YouTube tutorial)

A Quickstart Guide to Docker Model Runner

Docker Model Runner on Docker Docs

Come meet us at Google Cloud Next! 

Swing by booth 1530 in the Mandalay Bay Convention Center for hands-on demos and exclusive content.

Run Gemma 3 with Docker Model Runner: Fully Local GenAI Developer Experience

The landscape of generative AI development is evolving rapidly but comes with significant challenges. API usage costs can quickly add up, especially during development. Privacy concerns arise when sensitive data must be sent to external services. And relying on external APIs can introduce connectivity issues and latency.

Enter Gemma 3 and Docker Model Runner, a powerful combination that brings state-of-the-art language models to your local environment, addressing these challenges head-on.

In this blog post, we’ll explore how to run Gemma 3 locally using Docker Model Runner. We’ll also walk through a practical case study: a Comment Processing System that analyzes user feedback about a fictional AI assistant named Jarvis.

The power of local GenAI development

Before diving into the implementation, let’s look at why local GenAI development is becoming increasingly important:

Cost efficiency: With no per-token or per-request charges, you can experiment freely without worrying about usage fees.

Data privacy: Sensitive data stays within your environment, with no third-party exposure.

Reduced network latency: Eliminates reliance on external APIs and enables offline use.

Full control: Run the model on your terms, with no intermediaries and full transparency.

Setting up Docker Model Runner with Gemma 3

Docker Model Runner provides an OpenAI-compatible API for running models locally. It is included in Docker Desktop for macOS starting with version 4.40.0. Here’s how to set it up with Gemma 3:

docker desktop enable model-runner --tcp 12434
docker model pull ai/gemma3

Once setup is complete, the OpenAI-compatible API provided by the Model Runner is available at: http://localhost:12434/engines/v1
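As a quick smoke test, you can call the endpoint directly with curl (a sketch; the model name matches the pull above, and the prompt is ours):

curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ai/gemma3",
        "messages": [{"role": "user", "content": "Say hello from a local model."}]
      }'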

Case study: Comment processing system

To demonstrate the power of local GenAI development, we’ve built a Comment Processing System that leverages Gemma 3 for multiple NLP tasks. This system:

Generates synthetic user comments about a fictional AI assistant

Categorizes comments as positive, negative, or neutral

Clusters similar comments together using embeddings (see the sketch after this list)

Identifies potential product features from the comments

Generates contextually appropriate responses

All tasks are performed locally with no external API calls.
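The clustering step gets its embeddings from the same OpenAI-compatible API. Here is a minimal sketch, assuming the client and the ai/mxbai-embed-large model configured in the next section (the embedComments helper name is ours, not the project’s):

// Hypothetical helper: embed an array of comment strings locally
async function embedComments(client, comments) {
  const response = await client.embeddings.create({
    model: "ai/mxbai-embed-large", // embedding model served by Model Runner
    input: comments,
  });
  // One vector per input comment, in the same order
  return response.data.map((item) => item.embedding);
}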

Implementation details

Configuring the OpenAI SDK to use local models

To make this work, we configure the OpenAI SDK to point to the Docker Model Runner:

// config.js

export default {
  // Model configuration
  openai: {
    baseURL: "http://localhost:12434/engines/v1", // Base URL for Docker Model Runner
    apiKey: 'ignored', // Required by the SDK, but not checked by Model Runner
    model: "ai/gemma3",
    commentGeneration: {
      // Each task has its own configuration; temperature and max_tokens
      // are tuned per task (see "Task-specific configuration" below)
      temperature: 0.3,
      max_tokens: 250,
      n: 1,
    },
    embedding: {
      model: "ai/mxbai-embed-large", // Model for generating embeddings
    },
  },
  // … other configuration options
};

import OpenAI from 'openai';
import config from './config.js';

// Initialize OpenAI client with local endpoint
const client = new OpenAI({
  baseURL: config.openai.baseURL,
  apiKey: config.openai.apiKey,
});

Task-specific configuration

One key benefit of running models locally is the ability to experiment freely with different configurations for each task without worrying about API costs or rate limits.

In our case:

Synthetic comment generation uses a higher temperature for creativity.

Categorization uses a lower temperature and a 10-token limit for consistency.

Clustering allows up to 20 tokens to improve semantic richness in embeddings.

This flexibility lets us iterate quickly, tune for performance, and tailor the model’s behavior to each use case.
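As an illustration, a categorization call under those settings might look like this (a sketch; the prompt wording and helper name are ours, not the project’s):

// Hypothetical helper: classify one comment with a tight token budget
async function categorizeComment(client, commentText) {
  const response = await client.chat.completions.create({
    model: config.openai.model,
    messages: [
      {
        role: "system",
        content: "Classify the user comment as positive, negative, or neutral. Answer with one word.",
      },
      { role: "user", content: commentText },
    ],
    temperature: 0.1, // low temperature for consistent labels
    max_tokens: 10,   // small cap; the label needs only one word
  });
  return response.choices[0].message.content.trim().toLowerCase();
}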

Generating synthetic comments

To simulate user feedback, we use Gemma 3’s ability to follow detailed, context-aware prompts.

/**
 * Create a prompt for comment generation
 * @param {string} type - Type of comment (positive, negative, neutral)
 * @param {string} topic - Topic of the comment
 * @returns {string} - Prompt for OpenAI
 */
function createPromptForCommentGeneration(type, topic) {
  let sentiment = '';

  switch (type) {
    case 'positive':
      sentiment = 'positive and appreciative';
      break;
    case 'negative':
      sentiment = 'negative and critical';
      break;
    case 'neutral':
      sentiment = 'neutral and balanced';
      break;
    default:
      sentiment = 'general';
  }

  return `Generate a realistic ${sentiment} user comment about an AI assistant called Jarvis, focusing on its ${topic}.

The comment should sound natural, as if written by a real user who has been using Jarvis.
Keep the comment concise (1-3 sentences) and focused on the specific topic.
Do not include ratings (like "5/5 stars") or formatting.
Just return the comment text without any additional context or explanation.`;
}
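Wiring the prompt into a completion call is then straightforward (a sketch reusing the config and client from above; the topic string is ours):

const prompt = createPromptForCommentGeneration('positive', 'response speed');

const completion = await client.chat.completions.create({
  model: config.openai.model,
  messages: [{ role: 'user', content: prompt }],
  // temperature, max_tokens, and n come from config.js
  ...config.openai.commentGeneration,
});

console.log(completion.choices[0].message.content);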

Examples:

"Honestly, Jarvis is just a lot of empty promises. It keeps suggesting irrelevant articles and failing to actually understand my requests for help with my work – it’s not helpful at all."

"Jarvis is seriously impressive – the speed at which it responds is incredible! I’ve never used an AI assistant that’s so quick and efficient, it’s a game changer."

The ability to produce realistic feedback on demand is incredibly useful for simulating user data with zero API cost.

Generating contextual responses

We also use Gemma 3 to simulate polite, on-brand support responses to user comments. Here’s the prompt logic:

const response = await client.chat.completions.create({
  model: config.openai.model,
  messages: [
    {
      role: "system",
      content: `You are a customer support representative for an AI assistant called Jarvis. Your task is to generate polite, helpful responses to user comments.

Guidelines for responses:
1. Be empathetic and acknowledge the user's feedback
2. Thank the user for their input
3. If the comment is positive, express appreciation
4. If the comment is negative, apologize for the inconvenience and assure them you're working on improvements
5. If the comment is neutral, acknowledge their observation
6. If relevant, mention that their feedback will be considered for future updates
7. Keep responses concise (2-4 sentences) and professional
8. Do not make specific promises about feature implementation or timelines
9. Sign the response as "The Jarvis Team"`
    },
    {
      role: "user",
      content: `User comment: "${comment.text}"
Comment category: ${comment.category || 'unknown'}

${featuresContext}

Generate a polite, helpful response to this user comment.`
    }
  ],
  temperature: 0.7,
  max_tokens: 200
});

Examples:

For a positive comment:

Thank you so much for your positive feedback regarding Jarvis’s interface! We’re thrilled to hear you find it clean and intuitive – that’s exactly what we’re aiming for. We appreciate you pointing out your desire for more visual customization options, and your feedback will definitely be considered as we continue to develop Jarvis.

The Jarvis Team

For a negative comment:

Thank you for your feedback – we appreciate you taking the time to share your observations about Jarvis. We sincerely apologize for the glitches and freezes you’ve experienced; we understand how frustrating that can be. Your input is valuable, and we’re actively working on improvements to enhance Jarvis’s reliability and accuracy.

The Jarvis Team

This approach ensures a consistent, human-like support experience generated entirely locally.

Extracting product features from user feedback

Beyond generating and responding to comments, we also use Gemma 3 to analyze user feedback and identify actionable insights. This helps simulate the role of a product analyst, surfacing recurring themes, user pain points, and opportunities for improvement.

Here, we provide a prompt instructing the model to identify up to three potential features or improvements based on a set of user comments. 

/**
 * Extract features from comments
 * @param {string} commentsText - Text of comments
 * @returns {Promise<Array>} - Array of identified features
 */
async function extractFeaturesFromComments(commentsText) {
  const response = await client.chat.completions.create({
    model: config.openai.model,
    messages: [
      {
        role: "system",
        content: `You are a product analyst for an AI assistant called Jarvis. Your task is to identify potential product features or improvements based on user comments.

For each set of comments, identify up to 3 potential features or improvements that could address the user feedback.

For each feature, provide:
1. A short name (2-5 words)
2. A brief description (1-2 sentences)
3. The type of feature (New Feature, Improvement, Bug Fix)
4. Priority (High, Medium, Low)

Format your response as a JSON array of features, with each feature having the fields: name, description, type, and priority.`
      },
      {
        role: "user",
        content: `Here are some user comments about Jarvis. Identify potential features or improvements based on these comments:

${commentsText}`
      }
    ],
    response_format: { type: "json_object" },
    temperature: 0.5
  });

  try {
    const result = JSON.parse(response.choices[0].message.content);
    return result.features || [];
  } catch (error) {
    console.error('Error parsing feature identification response:', error);
    return [];
  }
}
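Calling it is just a matter of flattening the comments into text first (a sketch; the shape of the comments array is assumed from the earlier steps):

// Hypothetical usage: join comment texts, then ask the model for features
const commentsText = comments.map((c) => `- ${c.text}`).join('\n');
const features = await extractFeaturesFromComments(commentsText);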

Here’s an example of what the model might return:

"features": [
{
"name": "Enhanced Visual Customization",
"description": "Allows users to personalize the Jarvis interface with more themes, icon styles, and display options to improve visual appeal and user preference.",
"type": "Improvement",
"priority": "Medium",
"clusters": [
"1"
]
},

And just like everything else in this project, it’s generated locally with no external services.

Conclusion

By combining Gemma 3 with Docker Model Runner, we’ve unlocked a local GenAI workflow that’s fast, private, cost-effective, and fully under our control. In building our Comment Processing System, we experienced firsthand the benefits of this approach:

Rapid iteration without worrying about API costs or rate limits

Flexibility to test different configurations for each task

Offline development with no dependency on external services

Significant cost savings during development

And this is just one example of what’s possible. Whether you’re prototyping a new AI product, building internal tools, or exploring advanced NLP use cases, running models locally puts you in the driver’s seat.

As open-source models and local tooling continue to evolve, the barrier to entry for building powerful AI systems keeps getting lower.

Don’t just consume AI; develop, shape, and own the process.

Try it yourself: clone the repository and start experimenting today.