Building Synapse: A Borg-like Cluster Scheduler for AI
Standard web servers are happy to start one by one. If you ask for 50 servers and get 40, your website still works.
AI is different. If you are training a massive model (like Llama 3) across 64 GPUs, and you can only get 63, the job cannot start. Standard schedulers will reserve those 63 GPUs and let them sit idle while waiting for the last one. This wastes millions of dollars in compute time.
To solve this, I built Synapse, a distributed cluster orchestrator that implements Gang Scheduling (All-or-Nothing allocation) to maximize resource utilization for AI workloads.
Architecture: Mimicking Borg
Synapse follows the classic Google Borg architecture (the predecessor to Kubernetes), consisting of three distinct components:
- The Brain (Scheduler): Holds the “state of the world.” It knows which nodes are free, which are busy, and where they are physically located.
- The Muscle (Worker): A lightweight agent on every machine. It listens for orders (“Start Job 50”) and executes them using my custom Carapace runtime.
- The Interface (CLI): A kubectl-like tool for submitting jobs.
Topology Awareness
In a massive data center, the speed of light matters. If Node A and Node B are on the same rack, they can talk instantly. If they are on opposite sides of the building, latency destroys performance.
Synapse doesn’t just look for any free GPUs: it tries to find GPUs on the same rack (or a simulated grouping) to maximize training speed.
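To make that concrete, here is a minimal sketch of the node state the Brain could track. The Rack field and the nodesByRack helper are my own placeholder names rather than Synapse’s exact API; AvailableCPU mirrors the method used by the scheduler loop later in this post. The key idea is that every node carries a rack label, so the scheduler can try to satisfy a whole job from one rack before spreading it out:
// Sketch only: each node carries a rack label so placement can prefer
// co-located capacity. Rack and nodesByRack are illustrative names.
type Node struct {
	ID       string
	Rack     string // physical (or simulated) rack grouping
	TotalCPU int
	UsedCPU  int
}

func (n *Node) AvailableCPU() int { return n.TotalCPU - n.UsedCPU }

// nodesByRack buckets nodes with spare capacity by rack, so the scheduler
// can try to place an entire gang inside a single rack before falling
// back to spreading it across the building.
func nodesByRack(nodes []*Node) map[string][]*Node {
	byRack := make(map[string][]*Node)
	for _, n := range nodes {
		if n.AvailableCPU() > 0 {
			byRack[n.Rack] = append(byRack[n.Rack], n)
		}
	}
	return byRack
}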
The Core Problem: Gang Scheduling
The biggest challenge in AI infrastructure is ensuring that a job gets all its requested resources at the exact same time.
I implemented a Gang Scheduler loop that runs every second. It iterates through the pending job queue and performs an atomic check:
// Gang Scheduling: either the job gets ALL its resources, or it waits
func (c *InMemoryCluster) Schedule() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, job := range c.jobQueue {
		if job.Status != JobPending {
			continue
		}
		// 1. Scan: Can we satisfy this job's requirements?
		neededCPU := job.MinCPU
		candidateNodes := []*Node{}
		for _, node := range c.nodes {
			if node.AvailableCPU() > 0 {
				candidateNodes = append(candidateNodes, node)
				neededCPU -= node.AvailableCPU()
				if neededCPU <= 0 {
					break // enough capacity found; stop collecting candidates
				}
			}
		}
		// 2. Decision: If we can't find enough, WAIT.
		// Do not partially allocate resources.
		if neededCPU > 0 {
			continue
		}
		// 3. Commit: We have enough. Claim resources on the candidates
		// while still holding the lock, so scan + commit is one atomic step.
		for _, node := range candidateNodes {
			// ... update node state ...
		}
		job.Status = JobScheduled
	}
}
This prevents deadlocks where multiple jobs each hold partial resources and wait forever for the rest. For example, on a 12-GPU cluster, two jobs that each need 8 GPUs could otherwise grab 6 apiece, and neither would ever start.
Engineering Challenges
1. The “Split Brain” & Bi-Directional gRPC
In a typical web app, the client talks to the server. But in a cluster, who is the client?
- Worker -> Master: The worker needs to send Heartbeats (“I’m alive”).
- Master -> Worker: The master needs to send Commands (“Start Job”).
I solved this by making the Worker act as both a Client and a Server.
- It runs a background goroutine to push Heartbeats to the Master.
- It blocks the main goroutine on a grpcServer.Serve(lis) call to listen for incoming StartJob commands (both roles are sketched below).
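Put together, the worker’s main function looks roughly like the sketch below. This is a sketch under assumptions: the pb package, the Heartbeat and StartJob RPC names, the message fields, and the ports all stand in for Synapse’s actual protobuf definitions.
// Sketch: the worker is a gRPC client (heartbeats out) and a gRPC server
// (StartJob in). All pb.* names are placeholders for the generated code.
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/synapse/proto" // placeholder import path
)

type workerServer struct {
	pb.UnimplementedWorkerServer
}

// StartJob is called by the Master; this is where the job actually runs.
func (w *workerServer) StartJob(ctx context.Context, req *pb.StartJobRequest) (*pb.StartJobReply, error) {
	log.Printf("starting job %s", req.JobId)
	// ... fork/exec the Carapace runtime here ...
	return &pb.StartJobReply{Accepted: true}, nil
}

func main() {
	// Client role: push heartbeats to the Master from a background goroutine.
	conn, err := grpc.Dial("master:50051", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	master := pb.NewSchedulerClient(conn)
	go func() {
		for range time.Tick(3 * time.Second) {
			_, _ = master.Heartbeat(context.Background(), &pb.HeartbeatRequest{NodeId: "worker-1"})
		}
	}()

	// Server role: block the main goroutine serving incoming commands.
	lis, err := net.Listen("tcp", ":50052")
	if err != nil {
		log.Fatal(err)
	}
	grpcServer := grpc.NewServer()
	pb.RegisterWorkerServer(grpcServer, &workerServer{})
	log.Fatal(grpcServer.Serve(lis))
}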
2. Fault Tolerance: The Reaper
Distributed systems must assume failure. I implemented a Reaper process in the Scheduler that scans for “silent” nodes.
If a node hasn’t sent a heartbeat in 10 seconds, the Reaper marks it as DEAD. This is critical for the Gang Scheduler—if a node dies, we need to know immediately so we don’t try to schedule new jobs on it.
// The Reaper Loop
go func() {
ticker := time.NewTicker(5 * time.Second)
for range ticker.C {
// Mark nodes as DEAD if they missed heartbeats
deadIDs := clusterManager.MarkDeadNodes(10 * time.Second)
for _, id := range deadIDs {
log.Printf("REAPER: Node %s is DEAD", id)
}
}
}()
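The MarkDeadNodes call above is where the timeout comparison happens. A minimal sketch, assuming each node record keeps a LastHeartbeat timestamp and a Status field (the names are mine, not necessarily Synapse’s):
// Sketch: flip any node whose last heartbeat is older than the timeout
// to DEAD and report its ID. LastHeartbeat, Status, and NodeDead are assumed names.
func (c *InMemoryCluster) MarkDeadNodes(timeout time.Duration) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	deadIDs := []string{}
	now := time.Now()
	for _, node := range c.nodes {
		if node.Status != NodeDead && now.Sub(node.LastHeartbeat) > timeout {
			node.Status = NodeDead
			deadIDs = append(deadIDs, node.ID)
		}
	}
	return deadIDs
}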
3. Bridging Go and Rust
Synapse is written in Go (for its excellent concurrency primitives), but the container runtime (Carapace) is written in Rust (for low-level Linux control).
To bridge them, I used Go’s os/exec package to perform a Fork/Exec operation. The Go worker acts as the supervisor, forking a new process to run the Rust binary, which then sets up the Namespaces and Cgroups for the container.
// Go Worker Code: supervise the Rust runtime as a child process
cmd := exec.Command("./carapace", "run", req.JobId, req.Image)
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Run(); err != nil { // Blocks until the container finishes
	log.Printf("job %s exited with error: %v", req.JobId, err)
}
4. The RootFS Problem
Since Carapace is a custom runtime, it doesn’t pull images from Docker Hub automatically. If you pass it ubuntu:latest, it fails, because it expects a root filesystem folder on disk rather than an image reference.
To make this work, I had to manually export a valid Root Filesystem (RootFS) from Docker:
# Create a dummy container
docker create --name temp-export ubuntu:latest
# Export the filesystem into a local folder (which must exist first)
mkdir -p my-rootfs
docker export temp-export | tar -x -C my-rootfs
This gave Synapse a real environment to chroot into, allowing it to run actual Linux commands.
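On the Go side, the worker then needs to hand Carapace a path to that folder rather than the raw image tag. A hypothetical helper (illustration only, not necessarily how Synapse wires it up) might map image names to pre-exported rootfs directories and fail fast if the export step hasn’t been run:
// Hypothetical helper (illustration only): map an image name like
// "ubuntu:latest" to a pre-exported rootfs directory such as
// rootfs/ubuntu_latest, and error out if it is missing.
// Needs "fmt", "os", "path/filepath", and "strings".
func resolveRootFS(image string) (string, error) {
	dir := filepath.Join("rootfs", strings.NewReplacer(":", "_", "/", "_").Replace(image))
	if _, err := os.Stat(dir); err != nil {
		return "", fmt.Errorf("no exported rootfs for %q: run docker export first", image)
	}
	return dir, nil
}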
Conclusion
Building Synapse gave me a deep appreciation for the complexity of schedulers like Kubernetes and Borg. It’s not just about finding a free server; it’s about managing state, handling partial failures, and ensuring that expensive hardware isn’t sitting idle.
Check out the full source code on GitHub.