Vision Statement

From 1946 to 2009, computing efficiency—performance per watt—doubled every 1.5 years. This trend, documented by Koomey and colleagues, transformed where computing could happen. Workloads migrated from mainframe rooms to desktops, then laptops, then pockets. The transition from centralized time-sharing to personal computing didn't occur because PCs surpassed mainframes in raw performance. It occurred when efficiency gains made computing capable enough within the power constraints of personal devices.

We're at the same inflection point for artificial intelligence.

Today, most AI queries flow through centralized datacenters while demand grows steeply: token-processing volumes have increased roughly 1300× year over year, straining power grids. Yet telemetry shows that 77% of requests are practical tasks—writing emails, summarizing documents, seeking information—that don't require frontier-scale models.

We propose INTELLIGENCE PER WATT (IPW)—task accuracy per unit of power—as a unified metric for understanding this transition. Just as performance-per-watt guided the mainframe-to-PC shift, intelligence-per-watt clarifies the path from centralized AI to distributed intelligence. IPW provides a common framework for studying three questions shaping AI's future:

Workload Redistribution: From Cloud to Edge

Local language models (≤20B parameters) now accurately answer 88.7% of single-turn queries, and consumer accelerators run them at interactive latencies. IPW improved 5.3× from 2023–2025—3.1× from model advances, 1.7× from hardware gains. By measuring intelligence efficiency across the model-hardware landscape, we can identify which queries belong on which devices. Hybrid systems that route queries appropriately cut energy, compute, and cost by 60–80% while preserving quality. IPW tracks this redistribution as it unfolds.
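As a rough sketch (the paper's exact formulation may differ), IPW can be read as task accuracy per watt of average power draw, with gains composing multiplicatively across model and hardware improvements. The function and wattages below are illustrative assumptions, not measured values:

```python
def intelligence_per_watt(task_accuracy: float, avg_power_watts: float) -> float:
    """Illustrative IPW: accuracy delivered per watt of average power
    draw during inference (not necessarily the paper's exact definition)."""
    if avg_power_watts <= 0:
        raise ValueError("power must be positive")
    return task_accuracy / avg_power_watts

# A hypothetical local model: 0.80 accuracy at a 40 W average draw,
# versus a hypothetical cloud deployment: 0.85 accuracy at 700 W.
local_ipw = intelligence_per_watt(0.80, 40.0)
cloud_ipw = intelligence_per_watt(0.85, 700.0)

# Gains compose multiplicatively: 3.1x from models times 1.7x from
# hardware recovers the ~5.3x overall 2023-2025 improvement cited above.
assert abs(3.1 * 1.7 - 5.3) < 0.05
```

Under these made-up wattages, the local model delivers more intelligence per watt even at lower absolute accuracy, which is the intuition behind routing practical queries to local devices.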

Economic Value: Measuring AI's Real-World Impact

Not all intelligence is equal. A model that handles graduate-level physics but fails at email drafting delivers different economic value than one with the opposite profile. By weighting IPW against GDP-relevant task distributions, we can quantify how much economic value AI systems generate per watt consumed. This lens reveals where current systems create value, where gaps remain, and how efficiency gains translate into productivity across economic sectors.

National Competitiveness: The Global AI Race

The nation that most efficiently converts energy into deployed intelligence gains advantage. We introduce Gross Domestic Intelligence (GDI)—the product of intelligence-per-watt and accessible power—as a framework for AI competition. China and the United States face inverse constraints: China is compute-bound by export controls on advanced chips; America is energy-bound by grid limitations and datacenter bottlenecks. IPW reveals an asymmetric American asset: hundreds of millions of local accelerators already deployed in homes and offices. This installed base could boost effective AI capacity 2–4× without new datacenter construction.
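The GDI framing is a simple product, which makes the two national constraints symmetric: a compute-bound nation can grow GDI by raising IPW, an energy-bound nation by raising accessible power. A minimal sketch with illustrative (not measured) numbers:

```python
def gross_domestic_intelligence(ipw: float, accessible_power_watts: float) -> float:
    """GDI = intelligence-per-watt x accessible power. A nation grows
    deployed intelligence by improving either factor."""
    return ipw * accessible_power_watts

# Illustrative only: doubling efficiency and doubling accessible power
# (e.g., activating already-deployed local accelerators) are equivalent levers.
baseline       = gross_domestic_intelligence(0.002, 5e9)
more_efficient = gross_domestic_intelligence(0.004, 5e9)   # 2x IPW
more_power     = gross_domestic_intelligence(0.002, 1e10)  # 2x power
assert more_efficient == more_power == 2 * baseline
```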

The path forward: Intelligence per watt should be a north star metric for model architecture, hardware design, and national strategy. We're building the measurement infrastructure, benchmarks, and systems to make this concrete—and releasing our tools for others to use.

The IPW Research Agenda

We're pursuing a coordinated research program to understand and maximize intelligence efficiency across the full stack.

| Category | Initiative | Objective |
|---|---|---|
| Measurement & Benchmarking | GDP-Weighted Evaluation | Quantifying economic value generated per watt on real-world, GDP-relevant tasks. |
| Measurement & Benchmarking | IPW Attribution | Decomposing efficiency gains into algorithmic versus hardware contributions through continuous benchmarking. |
| National Competitiveness | Gross Domestic Intelligence | Identifying high-impact interventions across inference systems, power grids, and model architectures. |
| Models & Systems | Post-training for IPW | Training local models to use frontier models as tools for verification and sophisticated assistance. |
| Models & Systems | Hybrid Inference Engine | Building systems that automatically route work between local and cloud compute to maximize IPW subject to latency, privacy, and cost constraints. |
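The hybrid routing idea can be sketched as a simple policy: keep easy or privacy-sensitive work local, escalate hard queries to the cloud. Everything below (the `Query` fields, the difficulty classifier, the threshold) is a hypothetical illustration, not the actual engine, which would solve a constrained optimization over latency, privacy, and cost:

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    predicted_difficulty: float  # 0-1, from an assumed lightweight classifier
    privacy_sensitive: bool

def route(q: Query, difficulty_threshold: float = 0.7) -> str:
    """Toy routing policy: privacy-sensitive work always stays local;
    otherwise escalate only when predicted difficulty is high."""
    if q.privacy_sensitive:
        return "local"
    return "cloud" if q.predicted_difficulty > difficulty_threshold else "local"
```

A single threshold already captures the headline economics: if most queries are practical tasks, most traffic never leaves the device.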

Papers + Code

Publications
📄 Publication
Intelligence Per Watt: Measuring Intelligence Efficiency of Local AI
Jon Saad-Falcon*, Avanika Narayan*, et al.
Introduces "intelligence per watt" (IPW) as a metric for measuring AI efficiency, finding that local LMs can answer 88.7% of single-turn reasoning & chat queries and that hybrid local-cloud routing cuts energy use by 64% and costs by 59% compared to cloud-only inference.
📄 Publication
Maximizing American Gross Domestic Intelligence with Hybrid Inference
Jared Dunnmon*, Avanika Narayan*, Jon Saad-Falcon*, Chris Ré
Proposes "Gross Domestic Intelligence" (GDI) as a framework for national AI competitiveness, arguing that the U.S. can boost effective inference capacity 2–4× by activating the 70–80M AI-capable devices already deployed in American homes and offices alongside cloud infrastructure.
📄 Publication
OpenJarvis: Personal AI, On Personal Devices
Jon Saad-Falcon*, Avanika Narayan*, John Hennessy, Christopher Ré, Azalia Mirhoseini
An open-source framework for building personal AI agents that run entirely on-device, providing composable primitives for local AI systems that prioritize efficiency and privacy by keeping user data on personal hardware rather than routing through cloud services.
📄 Publication
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models
Avanika Narayan*, Dan Biderman*, Sabri Eyuboglu*, et al.
Introduces protocols for local-cloud LM collaboration on long-document reasoning tasks, where MinionS reduces cloud costs by 5.7× while maintaining 97.9% of frontier model accuracy by decomposing tasks into parallelizable subtasks executed locally.
📄 Publication
Archon: An Architecture Search Framework for Inference-Time Techniques
Jon Saad-Falcon, Adrian Gamarra Lafuente, Shlok Natarajan, et al.
An automated framework for optimizing inference-time techniques in LLMs, exploring a large design space to discover optimized configurations. Archon-designed systems outperform frontier models such as OpenAI's o1, GPT-4o, and Claude 3.5 Sonnet by an average of 15.1% across instruction-following, reasoning, and coding tasks.
📄 Publication
Weaver: Shrinking the Generation-Verification Gap with Weak Verifiers
Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, et al.
A framework combining multiple imperfect verifiers to evaluate language model responses. Uses weighted ensembles of weaker verification systems with weak supervision to estimate accuracy, achieving competitive results with smaller models that approach the performance of advanced systems like o3-mini.
Code + Tools
🔧 Code & Tools
IPW Profiling Harness
Open-source benchmarking suite that profiles LLM inference across NVIDIA, AMD, and Apple Silicon, measuring energy consumption, power draw, latency, and throughput to compute intelligence-per-watt metrics for any model-accelerator configuration.
🔧 Code & Tools
OpenJarvis
Open-source toolkit for building and deploying personal AI agents on local hardware. Provides composable primitives, device-optimized model serving, and privacy-preserving pipelines for on-device intelligence.
🔧 Code & Tools
Minions
Reference implementation for local-cloud LM collaboration protocols. Includes MinionS and Minion strategies for decomposing tasks across on-device and cloud models to reduce costs while preserving accuracy.
🔧 Code & Tools
Archon
Architecture search framework for automatically discovering optimized inference-time technique configurations across LLMs, including generation ensembling, fusion, ranking, and verification strategies.
🔧 Code & Tools
Weaver
Toolkit for building weighted ensembles of weak verifiers to evaluate language model outputs. Enables scalable verification using smaller, cost-efficient models with weak supervision techniques.

Blog

March 17, 2026
How Close Are Local Models to the Cloud? An OpenJarvis Benchmark Study
Avanika Narayan, Jon Saad-Falcon
We used OpenJarvis to run a head-to-head evaluation of 8 local open-source models against 6 frontier cloud models across 5 representative use-case benchmarks. The headline: local models rank within the top 3 overall.
TL;DR — We used OpenJarvis to run a head-to-head evaluation of 8 local open-source models against 6 frontier cloud models across 5 representative use-case benchmarks. The headline: local models rank within the top 3 overall, with the best local model (Qwen3.5:122B-A10B, 0.840 avg accuracy) matching or exceeding frontier cloud models like Claude Opus 4.6 and GPT-5.4. When you factor in that local inference costs $0 in API fees (you already own the hardware), the picture starts to get very interesting.

The Eval Setup

Tasks — We designed 5 use-case benchmarks that mirror how people actually use AI assistants:

| Benchmark | What It Tests |
|---|---|
| Coding Assistant | Generate, debug, and explain code |
| Security Scanner | Identify vulnerabilities in code snippets |
| Daily Digest | Summarize news/information into concise briefings |
| Document Q&A | Answer questions grounded in provided documents |
| Browser Assistant | Parse and reason over web content |

Each benchmark has 30 samples, scored via LLM-as-a-judge methodology over task-specific rubrics. All runs use temperature 0.0 and seed 42 for reproducibility.
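A hedged sketch of what rubric-based LLM-as-a-judge scoring can look like — the actual OpenJarvis judge prompt and rubric format are not shown here, and `judge_model` is a stand-in for any callable that returns the judge's text:

```python
def judge_score(judge_model, response: str, rubric: list[str]) -> float:
    """Ask an LLM judge whether the response satisfies each rubric item,
    and return the fraction of items satisfied (a 0-1 score). The judge
    itself would be run at temperature 0.0 for reproducibility."""
    passed = sum(
        judge_model(
            f"Does this response satisfy: {item}\n\n{response}\n"
            "Answer YES or NO."
        ).strip().upper().startswith("YES")
        for item in rubric
    )
    return passed / len(rubric)
```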

Models — All local models were served via Ollama on a local server with a single NVIDIA H100 GPU in Q4 quantization. We evaluated 14 models — 8 local and 6 cloud (via native APIs):

| Model | Type | Details |
|---|---|---|
| Qwen3.5:122B-A10B | Local | MoE, 122B total / 10B active |
| Qwen3.5:35B-A3B | Local | MoE, 35B total / 3B active |
| GPT-OSS:120B | Local | MoE, 120B |
| GLM-4.7-Flash | Local | MoE, 4.7B |
| GLM4:Latest | Local | Dense, 9B |
| Granite4-h-small | Local | 32B hybrid (SSM+attention), 19.5 GB |
| Granite3.3:8b | Local | 8B, 4.9 GB |
| Granite4-micro | Local | 3B, 2.1 GB |
| Claude Opus 4.6 | Cloud | Anthropic |
| Claude Haiku 4.5 | Cloud | Anthropic |
| GPT-5.4 | Cloud | OpenAI |
| GPT-5-mini | Cloud | OpenAI |
| Gemini 3.1 Pro | Cloud | Google |
| Gemini 3.1 Flash Lite | Cloud | Google |
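The local side of a setup like this can be queried through Ollama's REST API (`POST /api/generate`). A minimal sketch, pinning decoding to temperature 0.0 and seed 42 as above — the model tag is illustrative and error handling is omitted:

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    # Pin decoding to temperature 0.0 / seed 42, matching the eval setup.
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.0, "seed": 42},
    }

def ask_local(model: str, prompt: str,
              host: str = "http://localhost:11434") -> str:
    """Send one generation request to a model served by Ollama."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama server with the model pulled):
# print(ask_local("granite4-micro", "Summarize this paragraph: ..."))
```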

Metrics

  • Accuracy: Per-benchmark rubric scoring (0–1)
  • Latency: End-to-end response time (ms)
  • Time to First Token (TTFT): Time elapsed before first token is generated (ms)
  • Cost: Dollar cost per request (cloud); local models cost $0 in API fees
  • Token usage: Input/output tokens per sample

The Results

The Leaderboard

| # | Model | Type | Coding | Security | Digest | DocQA | Browser | Avg |
|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3.5:122B-A10B | Local | 1.000 | 0.667 | 0.833 | 0.800 | 0.900 | 0.840 |
| 2 | Qwen3.5:35B-A3B | Local | 1.000 | 0.400 | 0.941 | 0.824 | 0.923 | 0.818 |
| 3 | Gemini 3.1 Flash Lite | Cloud | 1.000 | 0.400 | 0.933 | 0.867 | 0.867 | 0.813 |
| 4 | Claude Opus 4.6 | Cloud | 0.967 | 0.400 | 0.933 | 0.967 | 0.767 | 0.807 |
| 5 | GPT-5.4 | Cloud | 1.000 | 0.233 | 0.967 | 0.967 | 0.833 | 0.800 |
| 6 | Granite4-h-small | Local | 1.000 | 0.300 | 0.930 | 0.800 | 0.900 | 0.790 |
| 7 | Gemini 3.1 Pro | Cloud | 0.933 | 0.367 | 0.933 | 0.933 | 0.767 | 0.787 |
| 8 | Granite3.3:8b | Local | 1.000 | 0.300 | 0.870 | 0.730 | 0.830 | 0.750 |
| 9 | Claude Haiku 4.5 | Cloud | 0.767 | 0.433 | 0.933 | 0.800 | 0.767 | 0.740 |
| 10 | GPT-5-mini | Cloud | 1.000 | 0.367 | 0.933 | 0.833 | 0.467 | 0.720 |
| 11 | GPT-OSS:120B | Local | 0.967 | 0.367 | 0.900 | 0.767 | 0.500 | 0.700 |
| 12 | GLM-4.7-Flash | Local | 0.800 | 0.367 | 0.833 | 0.767 | 0.700 | 0.693 |
| 13 | Granite4-micro | Local | 1.000 | 0.200 | 0.830 | 0.700 | 0.730 | 0.690 |
| 14 | GLM4:Latest | Local | 0.800 | 0.200 | 0.733 | 0.667 | 0.467 | 0.573 |

The top of the table is striking: local models take the #1 and #2 spots. Excitingly, Qwen3.5:35B-A3B does it with only 3 billion active parameters per token.

Also notable: Granite4-h-small (a 32B hybrid SSM+attention model at 19.5 GB) slots in at #6 overall — ahead of Gemini Pro — while Granite4-micro (a 3B model that fits in 2.1 GB) scores 0.69 at sub-3-second latency.

The Cost Story

| Model | Type | Accuracy | Total Cost | Cost/Request |
|---|---|---|---|---|
| Gemini Flash Lite | Cloud | 0.813 | $0.19 | $0.001 |
| Claude Haiku 4.5 | Cloud | 0.740 | $0.48 | $0.003 |
| GPT-5-mini | Cloud | 0.720 | $0.54 | $0.004 |
| Gemini Pro | Cloud | 0.787 | $1.34 | $0.009 |
| Claude Opus 4.6 | Cloud | 0.807 | $3.47 | $0.023 |
| GPT-5.4 | Cloud | 0.800 | $6.74 | $0.045 |
| All 8 local models | Local | 0.732 (avg) | $0 | $0 |

GPT-5.4 costs 36x more than Gemini Flash Lite for comparable accuracy (0.800 vs 0.813). And the local models? $0 API cost. Once you have the hardware, inference is free (for simplicity we don't factor in electricity costs). Qwen3.5:35B-A3B delivers 0.818 accuracy at no marginal cost vs Claude Opus at $3.47 for 0.807.

Total cloud API spend for this eval: $12.77. Total for all local models: $0.

The Latency Story

| Model | Type | Mean Latency | Mean TTFT |
|---|---|---|---|
| Granite4-micro | Local | 2.69s | — |
| Gemini Flash Lite | Cloud | 2.8s | 2.8s |
| Granite3.3:8b | Local | 3.30s | — |
| Claude Haiku 4.5 | Cloud | 5.2s | 5.2s |
| GLM4:Latest | Local | 5.7s | 0.06s |
| Granite4-h-small | Local | 6.98s | — |
| GPT-5.4 | Cloud | 12.3s | 12.3s |
| Claude Opus 4.6 | Cloud | 18.5s | 18.5s |
| GPT-OSS:120B | Local | 58.2s | 0.27s |
| Qwen3.5:35B-A3B | Local | 89.5s | 0.25s |
| GLM-4.7-Flash | Local | 104.0s | 0.13s |
| Qwen3.5:122B-A10B | Local | 418.2s | 3.25s |

The average cloud latency is 13.5s per sample. The average local latency is 135s — about 10x slower.

TTFT tells a different story. Local models start streaming tokens almost instantly — GLM-4.7-Flash has a 0.13s TTFT vs cloud models at 2–24s. The total latency comes from generation, not queuing. For applications that show streaming output (chat UIs, code editors), the perceived responsiveness of local models can actually be better than cloud.

The Granite models tell a particularly compelling latency story: Granite4-micro at 2.69s is the fastest model in the entire eval (local or cloud), and Granite4-h-small at 6.98s delivers 0.79 accuracy — faster than GPT-5.4, Claude Opus, and Gemini Pro.


Interesting Findings

Finding 1: $0 vs. $12.77 — The Cost Gap Is Stark

Total cloud API spend for this eval: $12.77. Total for all 8 local models: $0. And the variance within cloud is just as wild. GPT-5.4 costs 36x more than Gemini Flash Lite ($6.74 vs $0.19) for comparable accuracy (0.800 vs 0.813). Meanwhile, Qwen3.5:35B-A3B delivers 0.818 accuracy — higher than Opus — at zero marginal cost. Once you own the hardware, every additional request is free.

Finding 2: Local Models Stream First, Think Later

The latency story looks bad for local models at first glance — 135s average vs 13.5s for cloud, roughly 10x slower. But TTFT (time-to-first-token) flips the narrative. Local models start streaming tokens almost instantly: GLM-4.7-Flash has a 0.13s TTFT, Qwen3.5:35B-A3B clocks 0.25s, and even the massive Qwen3.5:122B-A10B starts at 3.25s. Cloud models? They range from 2.8s (Gemini Flash Lite) to 24.2s (GPT-5-mini) before you see a single token.
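The TTFT/total-latency distinction is easy to measure for any streaming client. A minimal, library-agnostic sketch (`measure_stream` is our illustrative helper; it works on any iterator that yields tokens as they are generated):

```python
import time

def measure_stream(token_stream):
    """Measure time-to-first-token and total latency for a token iterator.
    Returns (ttft_seconds, total_seconds, full_text)."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            # First token arrived: this gap is what streaming UIs hide behind.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    total = time.perf_counter() - start
    return ttft, total, "".join(tokens)
```

A profile like local inference (tiny TTFT, long total) feels responsive in a chat UI, while a cloud profile (TTFT roughly equal to total) leaves the user staring at a spinner.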

Finding 3: MoE Is the Secret Sauce

Both top-performing local models use Mixture-of-Experts architecture. Qwen3.5:122B-A10B has 122 billion total parameters but only activates 10 billion per token. Qwen3.5:35B-A3B activates just 3 billion. This means they fit on a single H100 while delivering frontier-class accuracy. The MoE architecture is what makes "local models that compete with the cloud" possible — you get the knowledge capacity of a massive model with the inference cost of a small one.

Finding 4: Tiny Models Punch Way Above Their Weight

Look at the Granite models in the leaderboard. All three perform well for coding — including Granite4-micro, a 3 billion parameter model that fits in 2.1 GB of memory. It scores 0.69 average across all benchmarks at sub-3-second latency. That's a model that can run on your phone and still land within 12 points of Claude Opus's 0.807 average.

Granite4-h-small is particularly interesting — its hybrid SSM+attention architecture delivers 79% accuracy at 7s latency, slotting in at #6 overall (ahead of Gemini Pro!) while being far faster than Qwen3.5:35B-A3B (6.98s vs 89.5s mean latency) at under 3 points less accuracy.


Limitations

  • Sample sizes are small (30 per benchmark). We'd love to scale this up.
  • 5 benchmarks don't cover everything. These are realistic use-case tasks, but they're not exhaustive.
  • Quantization matters. All local models ran in Q4 quantization. Full-precision scores might differ.

This is exploratory work. We spun up a few representative benchmark tasks and wanted to openly share what we learned. We're still learning — so if you see something off, let us know.


Key Takeaways

  1. The accuracy gap between local and cloud models is narrow and closing fast. Local models rank within the top 3 overall, and the best local MoE models are competitive with every cloud model we tested.
  2. Cost is the killer argument for local. $0 marginal cost for equivalent or better accuracy. For batch workloads, data processing, evals, or any use case where you're making hundreds+ of requests, local inference is a no-brainer once you have the hardware.
  3. Latency is the real tradeoff. Local models are slower for end-to-end generation, but their instant TTFT means streaming UIs feel responsive. For interactive use, smaller models like Granite4-h-small (7s) or GLM4 (5.7s) are cloud-competitive on latency.
  4. MoE architecture is what makes this possible. Sparse activation lets you pack frontier-level knowledge into a GPU-friendly package. The models winning on local inference aren't brute-forcing it with dense parameters — they're being smart about which parameters to use.
  5. Small models are more capable than you think. A 3B model at 2.1 GB can execute coding tasks and land within 12 points of Opus's average accuracy. A 32B hybrid model matches cloud latency. Don't sleep on small models.
OpenJarvis is open source. Check it out at github.com/HazyResearch/OpenJarvis.