Efficient Multi-Device AI Inference: Leveraging Every Bit of Compute You Own
Over the last few months, I’ve been experimenting with a problem I hadn’t seen clearly addressed: how to use all of your available local devices (phones, laptops, desktops, and embedded boards) for AI model inference.
Maybe you have a high-end desktop, a MacBook, an old iPad, a Raspberry Pi, and a few phones lying around. They're not powerful alone, but together, they represent a decent pool of compute.
I wanted to know: can I shard a transformer model, or any heavy inference workload, across these machines and make it run faster, or at least more efficiently?
The short answer is: yes, but it takes work. The long answer is what this post is about.
Context: Why Do This?
Most developers optimize for inference on a single device, compressing, quantizing, or distilling models to fit within the memory and speed limits of that machine.
However, many of us own multiple devices that sit idle. What if we could turn them into a local, ad-hoc inference cluster? Not for training, but for inference only. That means:
Partitioning models across devices
Running parts of inference pipelines independently
Streaming tensor data between steps
Maintaining real-time throughput
We're not talking about Kubernetes or Slurm. Think zero-install or minimal setup. Just distributed inference on commodity hardware, managed by Python scripts and low-level messaging.
Key Challenges
Before diving into code, here are the core problems this approach must solve:
Model Partitioning: Breaking a neural net into computable blocks with clear input/output tensor boundaries.
Device Discovery and Communication: Dynamic registration of devices with their compute profiles, and efficient tensor transfer between them.
Execution Graph Management: A system to track, schedule, and re-route inference steps based on hardware availability and network latency.
Precision and Serialization: Quantization and tensor serialization/deserialization between heterogeneous devices (ARM, x86, different accelerators).
Fault Tolerance: Handling node dropout, reconnection, and warm-state restoration.
Design: System Architecture
High-Level Flow
A central controller initializes the computation graph.
Devices register with the controller and report their capabilities.
The model is partitioned into blocks.
Each block is assigned to a device.
The controller sends tensors to the first device.
Each device:
Runs its assigned block(s)
Sends the output tensor to the next device
The final device sends the result back to the controller.
        [Controller]
             |
   ---------------------------
   |          |              |
[Device 1] [Device 2]  [Device 3]
   |          |              |
 Block A    Block B       Block C
   ↓          ↓              ↓
   --> intermediate tensors -->
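As a concrete sketch of steps 1–4, here is roughly what the controller’s bookkeeping can look like. The field names and the round-robin assignment are placeholders, not a fixed design:

```python
# Minimal sketch of the controller's device registry and block assignment.
# Field names and the round-robin policy are illustrative, not prescriptive.
from dataclasses import dataclass, field

@dataclass
class DeviceInfo:
    device_id: str
    endpoint: str                  # e.g. "tcp://192.168.1.10:5555"
    has_gpu: bool = False
    free_ram_gb: float = 0.0
    assigned_blocks: list = field(default_factory=list)

class Controller:
    def __init__(self):
        self.devices = {}          # device_id -> DeviceInfo

    def register(self, info: DeviceInfo):
        # Devices call this (over HTTP, ZeroMQ, etc.) when they come online.
        self.devices[info.device_id] = info

    def assign_blocks(self, num_blocks: int):
        # Naive round-robin assignment; a weighted scheme is discussed later.
        ids = list(self.devices)
        for block_idx in range(num_blocks):
            self.devices[ids[block_idx % len(ids)]].assigned_blocks.append(block_idx)
```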
Model Partitioning in Practice (Transformers)
Transformer models are inherently sequential: each layer processes the hidden state and passes it forward. That makes them a natural fit for vertical (pipeline-style) model parallelism.
Using Hugging Face’s transformers library, for instance:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
layers = model.transformer.h  # list of transformer blocks

# Partition the blocks into contiguous, roughly equal chunks
num_devices = 3
chunk_size = len(layers) // num_devices
chunks = []
for i in range(num_devices):
    start = i * chunk_size
    end = len(layers) if i == num_devices - 1 else (i + 1) * chunk_size
    chunks.append(layers[start:end])
Each chunk is serializable (within PyTorch) and can be transferred to another device with the appropriate architecture.
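To run a chunk on its assigned device, it helps to wrap it in a small module that just threads the hidden state through its blocks. A minimal sketch (ChunkRunner is my own wrapper, not a transformers API; depending on your transformers version, GPT-J blocks may also expect attention masks and position ids, which are omitted here):

```python
import torch
from torch import nn

class ChunkRunner(nn.Module):
    """Wraps a contiguous slice of transformer blocks so a device only has to
    accept a hidden-state tensor and return the transformed one."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    @torch.no_grad()
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Each block returns a tuple; the new hidden state is element 0.
            # Pass attention_mask / position_ids here if your version needs them.
            hidden_states = block(hidden_states)[0]
        return hidden_states

# On a device node, e.g.:
# chunk_runner = ChunkRunner(chunks[device_index]).eval()
```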
Messaging and Data Transfer
Option 1: ZeroMQ
ZeroMQ gives you lightweight, brokerless messaging sockets with very little overhead. Here's a simplified example of a device node:
# device_node.py
import io
import zmq
import torch

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")

while True:
    data = socket.recv()
    tensor = torch.load(io.BytesIO(data))   # deserialize the incoming tensor
    output = run_model_chunk(tensor)        # run this device's assigned block(s)
    out_bytes = io.BytesIO()
    torch.save(output, out_bytes)           # serialize the result
    socket.send(out_bytes.getvalue())
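The controller side is the mirror image: a REQ socket that serializes the tensor, ships it to the node, and deserializes the reply. A minimal sketch, with a placeholder endpoint address:

```python
# controller_client.py: send a tensor to one device node and get its output back.
import io
import zmq
import torch

def run_remote_chunk(tensor: torch.Tensor, endpoint: str) -> torch.Tensor:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.connect(endpoint)                        # e.g. "tcp://192.168.1.10:5555"

    buf = io.BytesIO()
    torch.save(tensor, buf)                       # same wire format the node expects
    sock.send(buf.getvalue())

    result = torch.load(io.BytesIO(sock.recv()))  # the node's output tensor
    sock.close()
    return result
```

Chaining the pipeline is then just calling run_remote_chunk once per device, feeding each output into the next endpoint.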
Option 2: gRPC
If you need platform-independent message definitions and better error handling, gRPC is a good option—but comes with more boilerplate.
Tensor Serialization
Avoid JSON or protobuf for raw tensors. Stick with PyTorch’s torch.save, NumPy’s savez_compressed, or Cap’n Proto for speed.
To reduce size:
Use float16 or int8-quantized models.
Compress with lz4 or zstd if latency allows (see the sketch below).
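A rough sketch of that combination, assuming the lz4 Python package is installed (pip install lz4):

```python
import io
import lz4.frame
import torch

def pack_tensor(t: torch.Tensor) -> bytes:
    buf = io.BytesIO()
    torch.save(t.to(torch.float16), buf)         # drop to fp16 before shipping
    return lz4.frame.compress(buf.getvalue())    # fast, cheap compression

def unpack_tensor(data: bytes) -> torch.Tensor:
    return torch.load(io.BytesIO(lz4.frame.decompress(data)))
```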
Device Orchestration
Controller and Device Daemons
Each device on your LAN runs a daemon with:
A health-check endpoint (ping, memory usage)
A hardware profile (CPU/GPU, RAM, supported ops)
Current load (active jobs)
The controller periodically checks device status and reassigns blocks as needed.
You can implement this via:
A heartbeat over UDP broadcast (sketched below)
HTTP-based keepalive pings
SSH for remote code execution (e.g., deploying model chunks)
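For example, the UDP heartbeat option can be as small as the sketch below. The port number and JSON payload are arbitrary choices, not a fixed protocol:

```python
# heartbeat.py: each device broadcasts a tiny status packet; the controller listens.
import json
import socket
import time

HEARTBEAT_PORT = 50007  # arbitrary

def broadcast_heartbeat(device_id: str, interval: float = 5.0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    while True:
        payload = json.dumps({"id": device_id, "ts": time.time()}).encode()
        sock.sendto(payload, ("<broadcast>", HEARTBEAT_PORT))
        time.sleep(interval)

def listen_for_heartbeats():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    while True:
        data, addr = sock.recvfrom(1024)
        print(addr[0], json.loads(data))  # here the controller would update its registry
```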
Deployment on Heterogeneous Hardware
x86 (Linux/macOS): PyTorch native
ARMv8 (Raspberry Pi 4/5): PyTorch aarch64 CPU builds
iOS/macOS (M1/M2): CoreML / Metal
Android: TensorFlow Lite / ONNX Runtime
For non-PyTorch clients (like CoreML or TFLite), you need to:
Export submodels from the full model (ONNX or TorchScript)
Wrap inference in an HTTP/REST or WebSocket server on the mobile device
Use mDNS or Bonjour to auto-discover devices on the LAN (see the sketch below)
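For the mDNS route, the zeroconf package is one option (an assumption on my part, along with the made-up service type; recent versions of the package expose the API used below):

```python
# discover.py: find device nodes that advertise themselves via mDNS/Bonjour.
# Assumes `pip install zeroconf` and that nodes register under the made-up
# service type "_aicluster._tcp.local.".
from zeroconf import ServiceBrowser, ServiceListener, Zeroconf

class NodeListener(ServiceListener):
    def add_service(self, zc: Zeroconf, type_: str, name: str) -> None:
        info = zc.get_service_info(type_, name)
        if info:
            print(f"found node {name} at {info.parsed_addresses()}:{info.port}")

    def remove_service(self, zc: Zeroconf, type_: str, name: str) -> None:
        print(f"node {name} went away")

    def update_service(self, zc: Zeroconf, type_: str, name: str) -> None:
        pass

zc = Zeroconf()
browser = ServiceBrowser(zc, "_aicluster._tcp.local.", NodeListener())
```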
Performance: What You Can Expect
Let’s take LLaMA 7B (quantized to 4-bit) as a benchmark.
| Setup | Devices | Tokens/sec |
| --- | --- | --- |
| Single Device | MacBook Pro M2 (base) | ~0.7 |
| Dual Device (MacBook + PC) | M2 + GTX 1070 | ~1.3 |
| Tri-Device | M2 + GTX 1070 + Raspberry Pi 5 | ~1.5 |
| Quad (Add iPad M1) | M2 + GTX 1070 + RPi 5 + iPad M1 (CoreML chunk) | ~1.8 |

The gains aren't linear. You’re bound by the slowest device and by tensor transfer speed. However, CPU-only inference benefits more, because you're parallelizing across what would otherwise be serial.
Challenges and Workarounds
1. Layer Normalization and Residuals
Splitting layers improperly can break residual connections. Always split at transformer block boundaries, not mid-layer.
2. Memory Bloat
Each device loads a submodel. Memory usage is proportional to how many layers it holds. Quantization is critical.
3. Network Latency
Use a high-throughput LAN (ideally wired), and avoid sending large tensors over Wi-Fi if possible. Consider compression pipelines like fp16 → int8 → lz4, as sketched below.
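A rough version of that fp16 → int8 → lz4 pipeline, using naive per-tensor scaling (not a calibrated quantization scheme, so expect some quality loss on sensitive activations):

```python
import lz4.frame
import numpy as np
import torch

def compress_activation(t: torch.Tensor):
    t = t.to(torch.float32)
    scale = float(t.abs().max()) / 127.0 or 1.0   # guard against all-zero tensors
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return lz4.frame.compress(q.numpy().tobytes()), scale, tuple(t.shape)

def decompress_activation(data: bytes, scale: float, shape) -> torch.Tensor:
    q = np.frombuffer(lz4.frame.decompress(data), dtype=np.int8).reshape(shape)
    return torch.from_numpy(q.copy()).to(torch.float16) * scale
```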
4. Fault Recovery
If a node drops mid-run:
Retry the request (idempotent)
Reassign that model chunk to a backup node (see the sketch after this list)
Store tensor checkpoints (slow, but useful in long inference chains)
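A minimal retry-with-failover sketch, reusing the run_remote_chunk helper from the ZeroMQ section; it assumes that helper sets a receive timeout (zmq.RCVTIMEO) so a dead node raises instead of hanging:

```python
import torch
import zmq

def run_chunk_with_failover(tensor: torch.Tensor, endpoints: list, retries: int = 2) -> torch.Tensor:
    """Try the primary endpoint for this block first, then any backups."""
    last_error = None
    for endpoint in endpoints:                 # primary first, then backup nodes
        for _ in range(retries):
            try:
                return run_remote_chunk(tensor, endpoint)
            except zmq.ZMQError as e:          # connection refused, recv timeout, ...
                last_error = e
    raise RuntimeError(f"all endpoints failed for this block: {last_error}")
```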
Extensions
Dynamic scheduling: Reassign layers on the fly based on device health/load.
Weighted layer distribution: Assign more layers to faster nodes (sketched after this list).
Multi-stream inference: Run multiple forward passes in parallel across replicas.
Distributed token streaming: For autoregressive models, stream next-token computation to a free device.
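The weighted distribution idea is mostly arithmetic. A sketch with made-up benchmark scores:

```python
def weighted_split(num_layers: int, speed_scores: dict) -> dict:
    """Assign layer counts proportionally to each node's benchmark score."""
    total = sum(speed_scores.values())
    counts = {d: int(num_layers * s / total) for d, s in speed_scores.items()}
    # Hand leftover layers (from rounding down) to the fastest nodes.
    leftover = num_layers - sum(counts.values())
    for d in sorted(speed_scores, key=speed_scores.get, reverse=True)[:leftover]:
        counts[d] += 1
    return counts

# e.g. for GPT-J's 28 blocks:
# weighted_split(28, {"macbook_m2": 5.0, "gtx1070_pc": 3.0, "rpi5": 0.5})
# -> {"macbook_m2": 17, "gtx1070_pc": 10, "rpi5": 1}
```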
Final Thoughts
This is not plug-and-play. You have to orchestrate every part, from model slicing to tensor transfer and error handling.
But once you do, it’s powerful. You get full control over your inference stack and can make use of hardware most people forget.
If you're building local-first AI tools, edge deployment systems, or just like solving system-level puzzles, it's a worthwhile direction to explore.
I’ll probably open-source a version of this in the next month, once I clean up the orchestration logic, config system, and add device auto-discovery. If you're interested, reach out.


