Efficient Multi-Device AI Inference: Leveraging Every Bit of Compute You Own
Over the last few months, I’ve been experimenting with a problem I hadn’t seen clearly addressed: how to use all of your available local devices (phones, laptops, desktops, and embedded boards) for AI model inference.
Maybe you have a high-end desktop, a MacBook, an old iPad, a Raspberry Pi, and a few phones lying around. They're not powerful alone, but together, they represent a decent pool of compute.
I wanted to know: can I shard a transformer model, or any heavy inference workload, across these machines and make it run faster, or at least more efficiently?
The short answer is: yes, but it takes work. The long answer is what this post is about.
Context: Why Do This?
Most developers optimize for inference on a single device, compressing, quantizing, or distilling models to fit within the memory and speed limits of that machine.
However, many of us own multiple devices that sit idle. What if we could turn them into a local, ad-hoc inference cluster? Not for training, but for inference only. That means:
Partitioning models across devices
Running parts of inference pipelines independently
Streaming tensor data between steps
Maintaining real-time throughput
We're not talking about Kubernetes or Slurm. Think zero-install or minimal setup. Just distributed inference on commodity hardware, managed by Python scripts and low-level messaging.
Key Challenges
Before diving into code, here are the core problems this approach must solve:
Model Partitioning: Breaking a neural net into computable blocks with clear input/output tensor boundaries.
Device Discovery and Communication: Dynamic registration of devices with their compute profiles, and efficient tensor transfer between them.
Execution Graph Management: A system to track, schedule, and re-route inference steps based on hardware availability and network latency.
Precision and Serialization: Quantization and tensor serialization/deserialization between heterogeneous devices (ARM, x86, different accelerators).
Fault Tolerance: Handling node dropout, reconnection, and warm-state restoration.
Design: System Architecture
High-Level Flow
A central controller initializes the computation graph.
Devices register with the controller and report their capabilities.
The model is partitioned into blocks.
Each block is assigned to a device.
The controller sends tensors to the first device.
Each device:
Runs its assigned block(s)
Sends the output tensor to the next device
The final device sends the result back to the controller.
        [Controller]
             |
   ---------------------------
   |          |              |
[Device 1] [Device 2]  [Device 3]
   |          |              |
 Block A    Block B       Block C
   ↓          ↓              ↓
   --> intermediate tensors -->
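As a concrete sketch of steps 1–4, here is roughly what the controller’s bookkeeping can look like. The field names and the round-robin assignment are placeholders, not a fixed design:

```python
# Minimal sketch of the controller's device registry and block assignment.
# Field names and the round-robin policy are illustrative, not prescriptive.
from dataclasses import dataclass, field

@dataclass
class DeviceInfo:
    device_id: str
    endpoint: str                  # e.g. "tcp://192.168.1.10:5555"
    has_gpu: bool = False
    free_ram_gb: float = 0.0
    assigned_blocks: list = field(default_factory=list)

class Controller:
    def __init__(self):
        self.devices = {}          # device_id -> DeviceInfo

    def register(self, info: DeviceInfo):
        # Devices call this (over HTTP, ZeroMQ, etc.) when they come online.
        self.devices[info.device_id] = info

    def assign_blocks(self, num_blocks: int):
        # Naive round-robin assignment; a weighted scheme is discussed later.
        ids = list(self.devices)
        for block_idx in range(num_blocks):
            self.devices[ids[block_idx % len(ids)]].assigned_blocks.append(block_idx)
```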
Model Partitioning in Practice (Transformers)
Transformer models are inherently sequential: each layer processes the hidden state and passes it forward. That makes them a natural fit for vertical (pipeline-style) model parallelism.
Using Hugging Face’s transformers library, for instance:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)
layers = model.transformer.h  # list of transformer blocks

# Partition the blocks into contiguous, roughly equal chunks
num_devices = 3
chunk_size = len(layers) // num_devices
chunks = []
for i in range(num_devices):
    start = i * chunk_size
    end = len(layers) if i == num_devices - 1 else (i + 1) * chunk_size
    chunks.append(layers[start:end])
Each chunk is serializable (within PyTorch) and can be transferred to another device with the appropriate architecture.
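To run a chunk on its assigned device, it helps to wrap it in a small module that just threads the hidden state through its blocks. A minimal sketch (ChunkRunner is my own wrapper, not a transformers API; depending on your transformers version, GPT-J blocks may also expect attention masks and position ids, which are omitted here):

```python
import torch
from torch import nn

class ChunkRunner(nn.Module):
    """Wraps a contiguous slice of transformer blocks so a device only has to
    accept a hidden-state tensor and return the transformed one."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    @torch.no_grad()
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Each block returns a tuple; the new hidden state is element 0.
            # Pass attention_mask / position_ids here if your version needs them.
            hidden_states = block(hidden_states)[0]
        return hidden_states

# On a device node, e.g.:
# chunk_runner = ChunkRunner(chunks[device_index]).eval()
```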
Messaging and Data Transfer
Option 1: ZeroMQ
ZeroMQ gives you lightweight, brokerless messaging sockets with very little overhead. Here's a simplified example of a device node:
# device_node.py
import io
import zmq
import torch

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5555")

while True:
    data = socket.recv()
    tensor = torch.load(io.BytesIO(data))   # deserialize the incoming tensor
    output = run_model_chunk(tensor)        # run this device's assigned block(s)
    out_bytes = io.BytesIO()
    torch.save(output, out_bytes)           # serialize the result
    socket.send(out_bytes.getvalue())
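The controller side is the mirror image: a REQ socket that serializes the tensor, ships it to the node, and deserializes the reply. A minimal sketch, with a placeholder endpoint address:

```python
# controller_client.py: send a tensor to one device node and get its output back.
import io
import zmq
import torch

def run_remote_chunk(tensor: torch.Tensor, endpoint: str) -> torch.Tensor:
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.connect(endpoint)                        # e.g. "tcp://192.168.1.10:5555"

    buf = io.BytesIO()
    torch.save(tensor, buf)                       # same wire format the node expects
    sock.send(buf.getvalue())

    result = torch.load(io.BytesIO(sock.recv()))  # the node's output tensor
    sock.close()
    return result
```

Chaining the pipeline is then just calling run_remote_chunk once per device, feeding each output into the next endpoint.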
Option 2: gRPC
If you need platform-independent message definitions and better error handling, gRPC is a good option—but comes with more boilerplate.
Tensor Serialization
Avoid JSON or protobuf for raw tensors. Stick with PyTorch’s torch.save, NumPy’s savez_compressed, or Cap’n Proto for speed.
To reduce size:
Use float16 or int8-quantized models.
Compress with lz4 or zstd if latency allows (see the sketch below).
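A rough sketch of that combination, assuming the lz4 Python package is installed (pip install lz4):

```python
import io
import lz4.frame
import torch

def pack_tensor(t: torch.Tensor) -> bytes:
    buf = io.BytesIO()
    torch.save(t.to(torch.float16), buf)         # drop to fp16 before shipping
    return lz4.frame.compress(buf.getvalue())    # fast, cheap compression

def unpack_tensor(data: bytes) -> torch.Tensor:
    return torch.load(io.BytesIO(lz4.frame.decompress(data)))
```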
Device Orchestration
Controller and Device Daemons
Each device on your LAN runs a daemon with:
A health-check endpoint (ping, memory usage)
A hardware profile (CPU/GPU, RAM, supported ops)
Current load (active jobs)
The controller periodically checks device status and reassigns blocks as needed.
You can implement this via:
A heartbeat over UDP broadcast (sketched below)
HTTP-based keepalive pings
SSH for remote code execution (e.g., deploying model chunks)
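For example, the UDP heartbeat option can be as small as the sketch below. The port number and JSON payload are arbitrary choices, not a fixed protocol:

```python
# heartbeat.py: each device broadcasts a tiny status packet; the controller listens.
import json
import socket
import time

HEARTBEAT_PORT = 50007  # arbitrary

def broadcast_heartbeat(device_id: str, interval: float = 5.0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    while True:
        payload = json.dumps({"id": device_id, "ts": time.time()}).encode()
        sock.sendto(payload, ("<broadcast>", HEARTBEAT_PORT))
        time.sleep(interval)

def listen_for_heartbeats():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    while True:
        data, addr = sock.recvfrom(1024)
        print(addr[0], json.loads(data))  # here the controller would update its registry
```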
Deployment on Heterogeneous Hardware
x86 (Linux/macOS): PyTorch native
ARMv8 (Raspberry Pi 4/5): PyTorch aarch64 CPU builds
iOS/macOS (M1/M2): CoreML / Metal
Android: TensorFlow Lite / ONNX Runtime
For non-PyTorch clients (like CoreML or TFLite), you need to:
Export submodels from the full model (ONNX or TorchScript)
Wrap inference in an HTTP/REST or WebSocket server on the mobile device
Use mDNS or Bonjour to auto-discover devices on the LAN (see the sketch below)
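For the mDNS route, the zeroconf package is one option (an assumption on my part, along with the made-up service type; recent versions of the package expose the API used below):

```python
# discover.py: find device nodes that advertise themselves via mDNS/Bonjour.
# Assumes `pip install zeroconf` and that nodes register under the made-up
# service type "_aicluster._tcp.local.".
from zeroconf import ServiceBrowser, ServiceListener, Zeroconf

class NodeListener(ServiceListener):
    def add_service(self, zc: Zeroconf, type_: str, name: str) -> None:
        info = zc.get_service_info(type_, name)
        if info:
            print(f"found node {name} at {info.parsed_addresses()}:{info.port}")

    def remove_service(self, zc: Zeroconf, type_: str, name: str) -> None:
        print(f"node {name} went away")

    def update_service(self, zc: Zeroconf, type_: str, name: str) -> None:
        pass

zc = Zeroconf()
browser = ServiceBrowser(zc, "_aicluster._tcp.local.", NodeListener())
```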
Performance: What You Can Expect
Let’s take LLaMA 7B (quantized to 4-bit) as a benchmark.
| Setup | Devices | Tokens/sec |
| --- | --- | --- |
| Single Device | MacBook Pro M2 (base) | ~0.7 |
| Dual Device (MacBook + PC) | M2 + GTX 1070 | ~1.3 |
| Tri-Device | M2 + GTX 1070 + Raspberry Pi 5 | ~1.5 |
| Quad (Add iPad M1) | M2 + GTX 1070 + RPi 5 + iPad M1 (CoreML chunk) | ~1.8 |

The gains aren't linear. You’re bound by the slowest device and by tensor transfer speed. However, CPU-only inference benefits more, because you're parallelizing across what would otherwise be serial.
Challenges and Workarounds
1. Layer Normalization and Residuals
Splitting layers improperly can break residual connections. Always split at transformer block boundaries, not mid-layer.
2. Memory Bloat
Each device loads a submodel. Memory usage is proportional to how many layers it holds. Quantization is critical.
3. Network Latency
Use a high-throughput LAN (ideally wired), and avoid sending large tensors over Wi-Fi if possible. Consider compression pipelines like fp16 → int8 → lz4, as sketched below.
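A rough version of that fp16 → int8 → lz4 pipeline, using naive per-tensor scaling (not a calibrated quantization scheme, so expect some quality loss on sensitive activations):

```python
import lz4.frame
import numpy as np
import torch

def compress_activation(t: torch.Tensor):
    t = t.to(torch.float32)
    scale = float(t.abs().max()) / 127.0 or 1.0   # guard against all-zero tensors
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return lz4.frame.compress(q.numpy().tobytes()), scale, tuple(t.shape)

def decompress_activation(data: bytes, scale: float, shape) -> torch.Tensor:
    q = np.frombuffer(lz4.frame.decompress(data), dtype=np.int8).reshape(shape)
    return torch.from_numpy(q.copy()).to(torch.float16) * scale
```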
4. Fault Recovery
If a node drops mid-run:
Retry the request (idempotent)
Reassign that model chunk to a backup node (see the sketch after this list)
Store tensor checkpoints (slow, but useful in long inference chains)
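A minimal retry-with-failover sketch, reusing the run_remote_chunk helper from the ZeroMQ section; it assumes that helper sets a receive timeout (zmq.RCVTIMEO) so a dead node raises instead of hanging:

```python
import torch
import zmq

def run_chunk_with_failover(tensor: torch.Tensor, endpoints: list, retries: int = 2) -> torch.Tensor:
    """Try the primary endpoint for this block first, then any backups."""
    last_error = None
    for endpoint in endpoints:                 # primary first, then backup nodes
        for _ in range(retries):
            try:
                return run_remote_chunk(tensor, endpoint)
            except zmq.ZMQError as e:          # connection refused, recv timeout, ...
                last_error = e
    raise RuntimeError(f"all endpoints failed for this block: {last_error}")
```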
Extensions
Dynamic scheduling: Reassign layers on the fly based on device health/load.
Weighted layer distribution: Assign more layers to faster nodes (sketched after this list).
Multi-stream inference: Run multiple forward passes in parallel across replicas.
Distributed token streaming: For autoregressive models, stream next-token computation to a free device.
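The weighted distribution idea is mostly arithmetic. A sketch with made-up benchmark scores:

```python
def weighted_split(num_layers: int, speed_scores: dict) -> dict:
    """Assign layer counts proportionally to each node's benchmark score."""
    total = sum(speed_scores.values())
    counts = {d: int(num_layers * s / total) for d, s in speed_scores.items()}
    # Hand leftover layers (from rounding down) to the fastest nodes.
    leftover = num_layers - sum(counts.values())
    for d in sorted(speed_scores, key=speed_scores.get, reverse=True)[:leftover]:
        counts[d] += 1
    return counts

# e.g. for GPT-J's 28 blocks:
# weighted_split(28, {"macbook_m2": 5.0, "gtx1070_pc": 3.0, "rpi5": 0.5})
# -> {"macbook_m2": 17, "gtx1070_pc": 10, "rpi5": 1}
```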
Final Thoughts
This is not plug-and-play. You have to orchestrate every part, from model slicing to tensor transfer and error handling.
But once you do, it’s powerful. You get full control over your inference stack and can make use of hardware most people forget.
If you're building local-first AI tools, edge deployment systems, or just like solving system-level puzzles, it's a worthwhile direction to explore.
I’ll probably open-source a version of this in the next month, once I clean up the orchestration logic, config system, and add device auto-discovery. If you're interested, reach out.


