Unlock the Power
of Local AI

Local LLMs in Practice

FLX AI Hub Community Convening  ·  60-Minute Session

  • OpenWebUI / ComfyUI
  • LM Studio
  • AI Edge Gallery
  • Spark DXG
  • Mac
  • Mobile
Paul Engin, Professor of New Media at Finger Lakes Community College
Paul Engin hosting a podcast episode on AI and media

Podcast / Media Work

Paul Engin

Professor of New Media

Finger Lakes Community College

[email protected]

Focus Areas

  • Local AI & LLMs
  • New Media
  • AI in Education
  • Emerging Tech
  • Video and Animation
  • Art and Design

Today's Session: Local LLMs in Practice

What We'll Cover Today

Four Key Reasons

🔒 Privacy & Data Control

Your data never leaves your machine. No cloud logging, no third-party data retention. Critical for sensitive documents, legal, medical, or proprietary business data.

💰 No API Costs

Eliminate per-token billing. Run inference thousands of times for free. Especially valuable for token-heavy workflows.

📡 Works Offline

Fully functional without internet. Field work, travel, air-gapped environments — your AI runs anywhere.

⚙️ Full Customization

Choose your model, adjust parameters, run fine-tuned versions. You're in control of every variable.

Drawbacks of Running Local

Hardware Requirements

Models need significant RAM/VRAM. A 7B model needs ~8GB RAM; 13B needs ~16GB. GPU acceleration is ideal but not required.

Slower Inference

Without a GPU, generation is slower than cloud APIs. CPU-only runs can be 5–10× slower than hosted solutions.

Smaller Model Capability

Local models are improving fast, but frontier models (GPT-4, Claude) still outperform most models you can run locally.

Setup & Maintenance

You manage downloads, updates, and configuration. No plug-and-play experience like consumer cloud AI apps.

Heat & Power Draw

Long inference sessions push CPU/GPU hard. Expect heat and battery impact on laptops and phones.

Model Size = Disk Space

A single 7B model is 4–8 GB. Running several models eats storage fast.

What Is a Token?

"Running AI locally saves money" → 7 tokens

Running AI local ly saves money .
of a word per token, on average
128K+
context window in modern models
$$$
API cost scales directly with tokens
Token-heavy workloads (large docs, long chats, code review, image creation) cost real money on cloud APIs, but run FREE locally. Input tokens, Thinking Tokens, and Output Tokens all count.

What Users & Businesses Actually Pay Per Month

Providers charge per 1M tokens. 1M tokens ≈ 750,000 words — about 3 novels. Every prompt and every reply burns tokens.

Provider / Model Tier Input /1M Output /1M ~100K emails/mo vs. Local
OpenAI GPT-4oFlagship$5.00$15.00~$300–600FREE
Anthropic Claude SonnetFlagship$3.00$15.00~$250–500FREE
Google Gemini 1.5 ProFlagship$3.50$10.50~$200–400FREE
OpenAI GPT-4o miniFast/Cheap$0.15$0.60~$10–20FREE
Claude HaikuFast/Cheap$0.25$1.25~$15–30FREE
Gemini 1.5 FlashFast/Cheap$0.075$0.30~$5–10FREE
Local LLM ✓Any size$0$0$0
Running locally = $0 in API costs, forever — whether you send 100 prompts or 100,000.

Parameters & Quantization: Size Isn't Everything

A parameter is a tiny dial learned from training data — more dials = more nuance. But quantization compresses models so a 70B model fits in far less RAM than you'd think.

8B Parameters

Runs on: Phone, Laptop, Mac

Knowledgeable generalist — fast, always available, fits in your pocket. Instant responses, no special hardware needed.

600B Parameters

Runs on: Data center / server farm

Near-human reasoning — handles complex tasks at frontier quality. Requires massive hardware to operate.

Quantization (Q4/Q8) compresses models — same 70B brain, fraction of the RAM:

Q4 (4-bit) ✓ ~38 GB Sweet spot — runs locally! Q8 (8-bit) ~70 GB 64–128GB Mac FP16 ~140 GB Workstation FP32 ~280 GB Research server only
💡 An 8B model today outperforms GPT-3 (2020). Always try Q4/Q8 quantized first — half the RAM, minimal quality loss.

Size vs. Speed vs. Quality

Model SizeRAM Needed Best ForExample Models
1B – 3B2–4 GBPhone / quick tasksPhi-4 Mini, Gemma 3
7B6–8 GBEveryday use, summariesLlama 3.2, Mistral 7B
13B10–16 GBCode, reasoning, draftingQwen 2.5, Gemma 3 12B
30B+24+ GBNear-cloud qualityLlama 3.3 70B (quantized)

Get models at: huggingface.co ollama.com/library

💡 Q4/Q8 quantized models run in half the RAM with minimal quality loss — always try quantized first.

Not All Models Are the Same

Think of it like hiring a contractor: you want the right specialist, not just the biggest team.

🗣 Language

Writing, summarizing, Q&A, chat. The most common type, what most people mean when they say "AI".

e.g. Llama 3, Mistral, Gemma

🧮 Reasoning

Thinks step-by-step before answering. Better at math, logic, and multi-step problems. Slower but smarter.

e.g. DeepSeek R1, QwQ

💻 Coding

Trained heavily on source code. Dramatically better at writing, debugging, and explaining code.

e.g. Qwen Coder, DeepSeek Coder

🖼 Vision

Understands images as well as text. Can describe photos, read documents, analyze diagrams.

e.g. LLaVA, Gemma 3, Phi-4 Vision

🌐 Multimodal

Combines multiple types: text, image, sometimes audio. The direction most new models are heading.

e.g. Gemma 3, Phi-4 Mini

A small coding-specialized 7B model will often beat a general 70B model at writing code — right model beats big model.

Retrieval-Augmented Generation

"How do I get the AI to know about MY documents?"

Retrieval

Find relevant information from your own documents, PDFs, databases, websites, or internal wikis.

Augmented

Add that retrieved information directly into the prompt, giving the model fresh, specific context it didn't have before.

Generation

The LLM generates an answer using both its training knowledge and the retrieved information together.

RAG lets a small local model answer questions about your private documents, your PDFs, your data — without ever sending anything to the cloud.
⚠️ RAG reads text — not numbers. For database queries you need a SQL tool. Think of RAG as a filing cabinet reader; SQL as a calculator that searches rows.

Three Ways to Run Local AI

🌐 OpenWebUI

Spark DXG / PC / Server

Self-hosted. Full-featured UI.

  • Runs on Ollama backend
  • Chat history, model switching
  • Multi-user support
  • Best for: power users & teams

🧪 LM Studio

Mac / Windows

Desktop app. Download & run.

  • Built-in model browser
  • OpenAI-compatible local API
  • GPU acceleration support
  • Best for: developers & tinkerers

📱 AI Edge Gallery

Android & iOS

Google's on-device LLM app.

  • Runs Gemma 4 fully offline
  • Integrated Hugging Face browser
  • Agent skills + multimodal vision
  • Best for: mobile, on-the-go AI
DEMO 1 — FOLLOW ALONG ON YOUR PHONE LIVE DEMO

Google AI Edge Gallery

Gemma 4 running on your phone — fully offline, no account needed

Android iOS

  1. Open AI Edge Gallery — tap the model library

    Download Gemma 4 E2B (~2GB). While it downloads, we'll cover the next slide — it'll be ready.

  2. Paste this prompt:

    "Explain a terms of service agreement like I'm 10 years old"

    Watch the tokens stream in real time, count the tokens/sec shown at the bottom.

  3. Snap a photo of something in the room → tap Ask Image

    Vision model describes it locally. No photo ever leaves your device.

  4. Try Audio Scribe, Mobile Actions and Prompt Labs

    Audio, device control and more.

DEMO 2 — FOLLOW ALONG ON YOUR LAPTOP LIVE DEMO

LM Studio — Mac & Windows

Desktop app — download a model, chat, and expose a local API in minutes  ·  lmstudio.ai

  1. Download LM Studio → open it

    Free download, Mac & Windows. No account. Open the app — takes 30 seconds.

  2. Search "llama-3.2-3b" or "gemma-4-12b" in the model browser → download

    3B model is ~2GB — downloads fast. Watch the Hugging Face integration work right in the app.

  3. Prompt:

    "Write a one-page executive summary explaining the benefits of running AI locally instead of in the cloud."

    Compare the response to Edge Gallery. Watch tokens/sec — different hardware, different speed.

  4. Click the ↔ API tab → Start Server

    Local server on localhost:1234 — OpenAI-compatible. Any app that talks to ChatGPT can talk to this.

DEMO 3 — WATCH ALONG (no setup needed) LIVE DEMO

OpenWebUI and ComfyUI on Spark DXG

Self-hosted AI — running on a local device, accessible to anyone on the network  ·  openwebui.com

  1. Open browser on any device on the network → Spark's IP

    No install on your device. The Spark DXG hosts everything; you just browse to it.

  2. Shared team server

    This is what local AI looks like for an organization. One device serves the whole team.

  3. RAG and data in action

    The model answers from the document only — your data never left this room.

  4. ComfyUI — video & image generation

    Generative image pipelines running locally on the Spark. (Comfy.org)

💡 Think of OpenWebUI as "ChatGPT for your organization" — private, free, and running on hardware you control.

When to Use Which

Use LOCAL when…

✓ Data is sensitive / confidential

✓ You have token-heavy workloads

✓ You need offline capability

✓ API costs are a concern

✓ You want full control over the model

✓ Running batch jobs or automation

Use CLOUD when…

✓ You need frontier-level intelligence

✓ Hardware is limited

✓ Speed is critical

✓ Multi-modal tasks (vision, audio)

✓ Quick prototyping / experimentation

✓ Team collaboration & sharing

Local AI Is Moving Into the OS Itself

Just Announced

WWDC 2026 — This Week

Apple

Foundation Models framework — on-device LLMs in any app with a few lines of Swift

Runs entirely on Apple Silicon: privacy-first, no cloud required

iOS 27 & macOS 27: in beta now

Build 2026 — Last Week

Microsoft

Windows AI Engine built into the OS kernel: any app can call local SLMs via NPU

Windows AI Runtime runs full LLMs locally: offline Copilot, no cloud needed

RTX Spark chip brings dedicated on-device AI acceleration to PCs

Google I/O 2026

Google / Android

AI Edge Gallery: Gemma 4 running natively on Android & iOS (you're demoing this today)

On-device AI built into Android Studio and the platform SDK

Gemma models optimized for phones: getting smaller and faster each release

The tools you're learning today are the foundation of how every OS will handle AI tomorrow.

Resources & Next Steps

OpenWebUIopenwebui.com
LM Studiolmstudio.ai
Hugging Facehuggingface.co
ComfyUIComfy.org

Key Takeaways

  • Local LLMs are ready for real-world use today!
  • Privacy + cost savings are the biggest wins
  • Start with a 7B model — you'll be surprised at the quality
  • The ecosystem is growing fast — check back frequently

Generated locally with ComfyUI on the Nvidia Spark DXG:

QR code for comfy.org

ComfyUI

comfy.org — run it yourself

comfy.org ↗