Unlock the Power
of Local AI

Local LLMs in Practice

FLX AI Hub Community Convening · 60-Minute Session

OpenWebUI / ComfyUI
LM Studio
AI Edge Gallery
Spark DXG
Mac
Mobile

Podcast / Media Work

Paul Engin

Professor of New Media

Finger Lakes Community College

✉ [email protected]

Focus Areas

Local AI & LLMs
New Media
AI in Education
Emerging Tech
Video and Animation
Art and Design

Today's Session: Local LLMs in Practice

What We'll Cover Today

Four Key Reasons

🔒 Privacy & Data Control

Your data never leaves your machine. No cloud logging, no third-party data retention. Critical for sensitive documents, legal, medical, or proprietary business data.

💰 No API Costs

Eliminate per-token billing. Run inference thousands of times for free. Especially valuable for token-heavy workflows.

📡 Works Offline

Fully functional without internet. Field work, travel, air-gapped environments — your AI runs anywhere.

⚙️ Full Customization

Choose your model, adjust parameters, run fine-tuned versions. You're in control of every variable.

Drawbacks of Running Local

Hardware Requirements

Models need significant RAM/VRAM. A 7B model needs ~8GB RAM; 13B needs ~16GB. GPU acceleration is ideal but not required.

Slower Inference

Without a GPU, generation is slower than cloud APIs. CPU-only runs can be 5–10× slower than hosted solutions.

Smaller Model Capability

Local models are improving fast, but frontier models (GPT-4, Claude) still outperform most models you can run locally.

Setup & Maintenance

You manage downloads, updates, and configuration. No plug-and-play experience like consumer cloud AI apps.

Heat & Power Draw

Long inference sessions push CPU/GPU hard. Expect heat and battery impact on laptops and phones.

Model Size = Disk Space

A single 7B model is 4–8 GB. Running several models eats storage fast.

What Is a Token?

"Running AI locally saves money" → 7 tokens

Running AI local ly saves money .

~¾

of a word per token, on average

128K+

context window in modern models

$$$

API cost scales directly with tokens

Token-heavy workloads (large docs, long chats, code review, image creation) cost real money on cloud APIs, but run FREE locally. Input tokens, Thinking Tokens, and Output Tokens all count.

What Users & Businesses Actually Pay Per Month

Providers charge per 1M tokens. 1M tokens ≈ 750,000 words — about 3 novels. Every prompt and every reply burns tokens.

Provider / Model	Tier	Input /1M	Output /1M	~100K emails/mo	vs. Local
OpenAI GPT-4o	Flagship	$5.00	$15.00	~$300–600	FREE
Anthropic Claude Sonnet	Flagship	$3.00	$15.00	~$250–500	FREE
Google Gemini 1.5 Pro	Flagship	$3.50	$10.50	~$200–400	FREE
OpenAI GPT-4o mini	Fast/Cheap	$0.15	$0.60	~$10–20	FREE
Claude Haiku	Fast/Cheap	$0.25	$1.25	~$15–30	FREE
Gemini 1.5 Flash	Fast/Cheap	$0.075	$0.30	~$5–10	FREE
Local LLM ✓	Any size	$0	$0	$0	∞

Running locally = $0 in API costs, forever — whether you send 100 prompts or 100,000.

Parameters & Quantization: Size Isn't Everything

A parameter is a tiny dial learned from training data — more dials = more nuance. But quantization compresses models so a 70B model fits in far less RAM than you'd think.

8B Parameters

Runs on: Phone, Laptop, Mac

Knowledgeable generalist — fast, always available, fits in your pocket. Instant responses, no special hardware needed.

600B Parameters

Runs on: Data center / server farm

Near-human reasoning — handles complex tasks at frontier quality. Requires massive hardware to operate.

Quantization (Q4/Q8) compresses models — same 70B brain, fraction of the RAM:

Q4 (4-bit) ✓ ~38 GB

Sweet spot — runs locally! Q8 (8-bit) ~70 GB

64–128GB Mac FP16 ~140 GB

Workstation FP32 ~280 GB

Research server only

💡 An 8B model today outperforms GPT-3 (2020). Always try Q4/Q8 quantized first — half the RAM, minimal quality loss.

Size vs. Speed vs. Quality

Model Size	RAM Needed	Best For	Example Models
1B – 3B	2–4 GB	Phone / quick tasks	Phi-4 Mini, Gemma 3
7B	6–8 GB	Everyday use, summaries	Llama 3.2, Mistral 7B
13B	10–16 GB	Code, reasoning, drafting	Qwen 2.5, Gemma 3 12B
30B+	24+ GB	Near-cloud quality	Llama 3.3 70B (quantized)

Get models at: huggingface.co ollama.com/library

💡 Q4/Q8 quantized models run in half the RAM with minimal quality loss — always try quantized first.

Not All Models Are the Same

Think of it like hiring a contractor: you want the right specialist, not just the biggest team.

🗣 Language

Writing, summarizing, Q&A, chat. The most common type, what most people mean when they say "AI".

e.g. Llama 3, Mistral, Gemma

🧮 Reasoning

Thinks step-by-step before answering. Better at math, logic, and multi-step problems. Slower but smarter.

e.g. DeepSeek R1, QwQ

💻 Coding

Trained heavily on source code. Dramatically better at writing, debugging, and explaining code.

e.g. Qwen Coder, DeepSeek Coder

🖼 Vision

Understands images as well as text. Can describe photos, read documents, analyze diagrams.

e.g. LLaVA, Gemma 3, Phi-4 Vision

🌐 Multimodal

Combines multiple types: text, image, sometimes audio. The direction most new models are heading.

e.g. Gemma 3, Phi-4 Mini

A small coding-specialized 7B model will often beat a general 70B model at writing code — right model beats big model.

Retrieval-Augmented Generation

"How do I get the AI to know about MY documents?"

Retrieval

Find relevant information from your own documents, PDFs, databases, websites, or internal wikis.

Augmented

Add that retrieved information directly into the prompt, giving the model fresh, specific context it didn't have before.

Generation

The LLM generates an answer using both its training knowledge and the retrieved information together.

RAG lets a small local model answer questions about your private documents, your PDFs, your data — without ever sending anything to the cloud.

⚠️ RAG reads text — not numbers. For database queries you need a SQL tool. Think of RAG as a filing cabinet reader; SQL as a calculator that searches rows.

Three Ways to Run Local AI

🌐 OpenWebUI

Spark DXG / PC / Server

Self-hosted. Full-featured UI.

Runs on Ollama backend
Chat history, model switching
Multi-user support
Best for: power users & teams

🧪 LM Studio

Mac / Windows

Desktop app. Download & run.

Built-in model browser
OpenAI-compatible local API
GPU acceleration support
Best for: developers & tinkerers

📱 AI Edge Gallery

Android & iOS

Google's on-device LLM app.

Runs Gemma 4 fully offline
Integrated Hugging Face browser
Agent skills + multimodal vision
Best for: mobile, on-the-go AI

Scan now — be ready before the demos start

Demo 1 — Follow Along

AI Edge Gallery

Android

Google Play ↗

Demo 1 — Follow Along

AI Edge Gallery

iOS

App Store ↗

Demo 2 — Follow Along

LM Studio

Mac & Windows

lmstudio.ai ↗

Demo 3 — Watch Along

OpenWebUI

Self-hosted

openwebui.com ↗

💡 AI Edge Gallery & LM Studio: follow along during the demo. OpenWebUI: watch the Spark DXG — no setup needed from you.

Scan Now — Prompts & Links

Android

Google Play ↗

iOS / Apple

App Store ↗

Prompts

Scan for prompts ↗

DEMO 1 — FOLLOW ALONG ON YOUR PHONE LIVE DEMO

Google AI Edge Gallery

Gemma 4 running on your phone — fully offline, no account needed

Android iOS

Open AI Edge Gallery — tap the model library
Download Gemma 4 E2B (~2GB). While it downloads, we'll cover the next slide — it'll be ready.
Paste this prompt:
"Explain a terms of service agreement like I'm 10 years old"
Watch the tokens stream in real time, count the tokens/sec shown at the bottom.
Snap a photo of something in the room → tap Ask Image
Vision model describes it locally. No photo ever leaves your device.
Try Audio Scribe, Mobile Actions and Prompt Labs
Audio, device control and more.

DEMO 2 — FOLLOW ALONG ON YOUR LAPTOP LIVE DEMO

LM Studio — Mac & Windows

Desktop app — download a model, chat, and expose a local API in minutes · lmstudio.ai

Download LM Studio → open it
Free download, Mac & Windows. No account. Open the app — takes 30 seconds.
Search "llama-3.2-3b" or "gemma-4-12b" in the model browser → download
3B model is ~2GB — downloads fast. Watch the Hugging Face integration work right in the app.
Prompt:
"Write a one-page executive summary explaining the benefits of running AI locally instead of in the cloud."
Compare the response to Edge Gallery. Watch tokens/sec — different hardware, different speed.
Click the ↔ API tab → Start Server
Local server on localhost:1234 — OpenAI-compatible. Any app that talks to ChatGPT can talk to this.

DEMO 3 — WATCH ALONG (no setup needed) LIVE DEMO

OpenWebUI and ComfyUI on Spark DXG

Self-hosted AI — running on a local device, accessible to anyone on the network · openwebui.com

Open browser on any device on the network → Spark's IP
No install on your device. The Spark DXG hosts everything; you just browse to it.
Shared team server
This is what local AI looks like for an organization. One device serves the whole team.
RAG and data in action
The model answers from the document only — your data never left this room.
ComfyUI — video & image generation
Generative image pipelines running locally on the Spark. (Comfy.org)

💡 Think of OpenWebUI as "ChatGPT for your organization" — private, free, and running on hardware you control.

When to Use Which

Use LOCAL when…

✓ Data is sensitive / confidential

✓ You have token-heavy workloads

✓ You need offline capability

✓ API costs are a concern

✓ You want full control over the model

✓ Running batch jobs or automation

Use CLOUD when…

✓ You need frontier-level intelligence

✓ Hardware is limited

✓ Speed is critical

✓ Multi-modal tasks (vision, audio)

✓ Quick prototyping / experimentation

✓ Team collaboration & sharing

Local AI Is Moving Into the OS Itself

Just Announced

WWDC 2026 — This Week

Apple

Foundation Models framework — on-device LLMs in any app with a few lines of Swift

Runs entirely on Apple Silicon: privacy-first, no cloud required

iOS 27 & macOS 27: in beta now

Build 2026 — Last Week

Microsoft

Windows AI Engine built into the OS kernel: any app can call local SLMs via NPU

Windows AI Runtime runs full LLMs locally: offline Copilot, no cloud needed

RTX Spark chip brings dedicated on-device AI acceleration to PCs

Google I/O 2026

Google / Android

AI Edge Gallery: Gemma 4 running natively on Android & iOS (you're demoing this today)

On-device AI built into Android Studio and the platform SDK

Gemma models optimized for phones: getting smaller and faster each release

The tools you're learning today are the foundation of how every OS will handle AI tomorrow.

Resources & Next Steps

OpenWebUIopenwebui.com

LM Studiolmstudio.ai

Ollamaollama.com

Hugging Facehuggingface.co

AI Edge Gallerygithub.com/google-ai-edge/gallery

ComfyUIComfy.org

Key Takeaways

Local LLMs are ready for real-world use today!
Privacy + cost savings are the biggest wins
Start with a 7B model — you'll be surprised at the quality
The ecosystem is growing fast — check back frequently

Questions?

[email protected]

AI-generated bunny image created locally with ComfyUI on the Spark DXG

Generated locally with ComfyUI on the Nvidia Spark DXG:

ComfyUI

comfy.org — run it yourself

comfy.org ↗

Unlock the Powerof Local AI