On Customizable software

Bill Njoroge — Wed, 29 Apr 2026 00:00:00 GMT

Much of how we build software has changed over the past year or so. Agents are much more capable than they were even just 6 months ago. It's kind of surreal how trivial it's gotten to customize software to your own liking. For instance, I use codex++, a tweak system that lets you customize parts of the Codex macOS desktop app. I've built plugins to do things like use Pierre's new diffs tool instead of Codex's native diff viewer, and I also use existing tweaks such as support for the /goal feature. These are relatively trivial though, so I wanted to document my experience using agents for slightly more complex features.

I use Zed as my primary editor these days. It's gotten much better since I tried it out maybe 2 years ago. Basic primitives generally work well. I also use their Agent Client Protocol (ACP) integration, which allows me to use my existing Codex/Claude subs in Zed's agent panel. The Codex ACP adapter, when used via the subscription, doesn't quite work in remote projects. It would work fine for local Zed projects, but you'd have to use an API key for remote projects. That was frustrating because the whole point of the setup was to use Codex naturally from inside Zed, not to manage a separate auth mode only because the project happened to be remote.

Tracing the bug

I had an initial hypothesis: either Zed was explicitly disabling the browser-based login path for remote projects, or the Codex ACP adapter only knew how to authenticate through a local browser callback flow and had no headless fallback. Armed with the hypothesis, I spun up Codex. Code archaeology, root-cause analysis, patching, validation, issue triage, fork setup, release wrangling, and PR prep were all driven by GPT 5.4 and 5.4-mini interchangeably.

"Codex in Zed" is really several systems glued together. Zed knows how to install and launch registry agents, codex-acp is the ACP adapter process Zed actually talks to, and upstream Codex owns the real auth flows and token storage behavior. Codex added support for device-auth, which is what made the fix possible. The PR is here, but the gist of it was to keep local projects on the existing browser callback login flow and, in NO_BROWSER mode, keep ChatGPT auth available and switch the headless path to device-code auth instead of hiding it. I definitely did not want to wait until Zed releases this fix, if at all, so I just patched the binary to include the fix, and it's been my primary way of using it.

Rust compile times are insane

This is my first time using Rust for a relatively large project, and man those compile times are brutal. This was not even a giant from-scratch product build. It was a targeted patch in a repo that pulls in a serious dependency graph through upstream Codex. Even when the code changes were modest, the build and link cycles were long enough that they shaped the entire experience. That changed how the agent workflow felt.

There was one point where we tried to produce polished release assets and got dragged through:

missing pkg-config
missing libssl-dev
missing libcap-dev
a stale apt repository on the Ubuntu box
fork release workflow assumptions that only made sense in upstream CI

That whole stretch made the session feel less like "AI is doing software for me" and more like "AI is helping me stay in the loop while I fight the usual systems bullshit."

What building with agents felt like

A few things stood out to me.

First, the best use of the agents was not raw generation. It was disciplined persistence. We kept narrowing from symptom to mechanism: from "where does Zed decide this is headless?" to "where is the ChatGPT method filtered?" to "does upstream Codex already support device code, or is this a Zed limitation, a codex-acp limitation, or both?"

Second, the models were useful as research and synthesis partners, especially when I already had a hunch but needed to ground it in code. They were good at turning a vague suspicion into a concrete explanation with file paths and behavior.

Third, the systems boundary still matters a lot. The agent was most effective when the task was scoped, source-backed and verifiable. It was less magical when the task turned into orchestration across GitHub, local builds, remote VMs, and release asset handling. Still helpful, just less elegant.

And fourth, I found myself wanting a better "long-running task" feel. When compilers are going and release assets are uploading, you do not need more intelligence so much as better ergonomics around waiting, status, and recovery. There's still so much to innovate here.

Where I landed

The whole thing took maybe 5 hours, and most of it was actually I/O from either the LLM or the compilers, not my "processing speed". It was also a good reminder that the hard part of agent-assisted engineering is often not the coding. It is the long tail: build systems, environment mismatches, release mechanics, and staying clearheaded while waiting for Rust to compile half the planet.

dfs and bfs in the wild

Bill Njoroge — Fri, 27 Mar 2026 00:00:00 GMT

I've been building a static analysis tool that detects concurrency bugs in Go programs. One of the rules it enforces is that if a goroutine loops over a channel using for range, that channel needs to be closed at some point; otherwise the goroutine will block forever and leak.

For example:


go func() {
	for range ch {}
}()

To check whether close(ch) actually happens, the tool looks at the code's control flow graph (CFG). This post explains them in slightly more detail. A CFG is just a map of how the program executes: it breaks a function into basic blocks (straight-line sequences of instructions) and draws arrows between them wherever the code can branch.

My first attempt was a simple BFS over this graph. Starting from the goroutine launch, I checked whether there was any path through the graph that eventually reached a close(ch). If so, I assumed the channel was properly closed.

This approach worked until I tested it against something like this:


func f(cond bool) {
	ch := make(chan int)

	go func() {
		for range ch {}
	}()

	if cond {
		close(ch)
	}
}

BFS found a path to close(ch) and concluded the channel was safe. It ignored the branch where cond is false and the function returns without ever closing the channel. That goroutine leaks, and the tool said nothing.

The problem is that BFS stops as soon as it finds what it's looking for. It found one safe path and called it done. But I didn't need to know whether a close was reachable on some path. I needed to know whether it was guaranteed on every path.

That's a different question, and it calls for DFS. Using DFS, I traced every path from the goroutine launch to the end of the function. If any path reached the end without passing through close(ch) first, the channel was not reliably closed.


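// successors maps each basic block to the blocks control can flow to next;
// closeBlock is the block that contains close(ch).
visited := make(map[int]bool)

var dfs func(n int) bool
dfs = func(n int) bool {
	if n == closeBlock {
		return false // this path is safe, stop here
	}
	if visited[n] {
		return false // already explored (e.g. a loop back edge), nothing new here
	}
	visited[n] = true
	if len(successors[n]) == 0 {
		return true // reached the end without closing, this is the leak
	}
	for _, next := range successors[n] {
		if dfs(next) {
			return true
		}
	}
	return false
}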

DFS is well suited here because it commits to exploring a single path all the way to its endpoint before backtracking. That commitment is exactly what makes it possible to say "no path escapes" rather than merely "some path is safe."

For the conditional example, DFS traces the if cond branch and finds close(ch) — fine, that path is safe. Then it backs up and traces the else branch, reaches the function exit, and notices it never closed the channel. That's the leak. The tool flags it.
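To make the contrast concrete, here is a minimal, self-contained sketch that runs both questions against the conditional example. The block numbering, the exampleCFG helper, and the function names are all hypothetical and chosen for illustration; the real tool works on the CFG produced by its analysis framework. The DFS also tracks visited blocks so that a loop in the CFG does not recurse forever.

```go
package main

import "fmt"

// Block IDs for the conditional example (hypothetical numbering).
const (
	launch  = 0 // go func() { for range ch {} }()
	ifCond  = 1 // if cond
	closeCh = 2 // close(ch)
	exit    = 3 // function return
)

// exampleCFG returns the successor lists for the conditional example.
func exampleCFG() map[int][]int {
	return map[int][]int{
		launch:  {ifCond},
		ifCond:  {closeCh, exit}, // true branch closes, false branch falls through
		closeCh: {exit},
		exit:    {},
	}
}

// closeReachable answers the "some path" question with a BFS:
// is closeBlock reachable from start at all?
func closeReachable(succ map[int][]int, start, closeBlock int) bool {
	seen := map[int]bool{start: true}
	queue := []int{start}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		if n == closeBlock {
			return true
		}
		for _, next := range succ[n] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, next)
			}
		}
	}
	return false
}

// canLeak answers the "every path" question: it reports whether some path
// from start reaches a block with no successors without passing closeBlock.
func canLeak(succ map[int][]int, start, closeBlock int) bool {
	seen := map[int]bool{}
	var dfs func(n int) bool
	dfs = func(n int) bool {
		if n == closeBlock || seen[n] {
			return false
		}
		seen[n] = true
		if len(succ[n]) == 0 {
			return true
		}
		for _, next := range succ[n] {
			if dfs(next) {
				return true
			}
		}
		return false
	}
	return dfs(start)
}

func main() {
	cfg := exampleCFG()
	fmt.Println("close reachable:", closeReachable(cfg, launch, closeCh))
	fmt.Println("can leak:", canLeak(cfg, launch, closeCh))
}
```

Both calls report true for this graph: a close is reachable, and yet a leaking path exists. That is exactly the disagreement between the two questions.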

Oftentimes, data structures and algorithms feel a bit too academic or theoretical, but it's always fun when you need to reach for them in practice. Most of the time the graph is implicit in the structure of the problem, not handed to you as an adjacency list. Recognizing that the CFG was a graph, and that "guaranteed execution" was a path coverage question, was the step that made the right algorithm obvious.

CFG, Data Flow Analysis and SSA

Bill Njoroge — Mon, 23 Mar 2026 00:00:00 GMT

Compilers use static analysis to determine where transformations can be safely applied. Control flow and data flow analysis are two techniques often used in compiler optimization. Control-flow analysis seeks to understand the flow of control between operations, and data-flow analysis (DFA) tracks the flow of actual values through the code and its operations. SSA is an intermediate representation that embeds a control flow graph (CFG) and makes DFA relatively trivial.

I'll be looking at it through a concurrency-focused lens, not so much code optimization. Understanding how both control and data flow becomes especially important when you want to understand how different concurrency primitives are applied. The question I'll be exploring is whether a goroutine will leak.

To motivate this, let's look at a simple example:

func f(cond bool) {
    ch := make(chan int)
    go func() { for range ch {} }()  // line 4
    if cond {
        close(ch)  // line 6 — after goroutine, but only reachable when cond==true
    }
    // when cond==false: goroutine leaks
}

For the above example, we want to be able to detect a goroutine leak: a case where the goroutine blocks forever because the channel it ranges over is never closed. One trivial way to check for this is to literally compare line numbers. If close(ch) comes after the goroutine launch, we can probably assume the goroutine is not going to leak (assuming we can correctly match the channel identifier). This approach, however, wouldn't work in the above case because the channel is closed conditionally.

With a CFG, you can ask questions like: does every path from the goroutine launch to function exit pass through a close? This is called a post-dominance check.

This is an illustration of how the control flows in the example above.

graph TD
    entry["entry: ch = make(chan int)"]
    goStmt["go func() { range ch }"]
    ifCond{"cond?"}
    closeCh["close(ch)"]
    noClose["(no close)"]
    ret["return"]

    entry --> goStmt
    goStmt --> ifCond
    ifCond -->|true| closeCh
    ifCond -->|false| noClose
    closeCh --> ret
    noClose --> ret

To answer questions like this with a CFG, we need to understand what post-dominators are. A node X post-dominates node Y if every path from Y to the function exit must pass through X. The post-dominator tree encodes this relationship: X post-dominates Y if and only if X is an ancestor of Y in the tree. We can look at the post-dominator tree for this code:

graph TD
    ret["return (EXIT)"]
    ifCond{"cond?"}
    goStmt["go func() { range ch }"]
    entry["entry: ch = make(chan int)"]
    closeCh["close(ch)"]
    noClose["(no close)"]

    ret --> ifCond
    ret --> closeCh
    ret --> noClose
    ifCond --> goStmt
    goStmt --> entry

Look at where close(ch) sits in the post-dominator tree. It hangs directly off return. It's not an ancestor of go func() { range ch }. That tells us close(ch) does not post-dominate the goroutine launch. So basically, there exists a path from the goroutine launch to function exit that never passes through close(ch).

Compare that with ifCond, which is an ancestor of goStmt in the post-dominator tree. Every path from the goroutine launch to the exit must pass through the cond branch.

So the question "does the channel get closed on every path after the goroutine starts?" reduces to "is close(ch) an ancestor of go func() in the post-dominator tree?" Here the answer is no, which tells us the goroutine can leak.
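As a rough sketch of how that check could be computed, the following uses the classic iterative set formulation of post-dominance rather than building the tree explicitly. The node names mirror the diagram, and postDominators and conditionalCFG are hypothetical helpers written for this post, not part of any library. It assumes every node, including the exit, appears as a key in the successor map.

```go
package main

import "fmt"

// postDominators computes, for every node, the set of nodes that lie on
// every path from that node to exit, using the iterative equation
// pdom(n) = {n} plus the intersection of pdom(s) over all successors s.
func postDominators(succ map[string][]string, exit string) map[string]map[string]bool {
	pdom := make(map[string]map[string]bool)
	for n := range succ {
		pdom[n] = make(map[string]bool)
		if n == exit {
			pdom[n][exit] = true
			continue
		}
		for m := range succ { // start from "everything" and shrink
			pdom[n][m] = true
		}
	}
	changed := true
	for changed {
		changed = false
		for n, ss := range succ {
			if n == exit {
				continue
			}
			next := map[string]bool{n: true}
			for cand := range pdom[n] {
				inAll := true
				for _, s := range ss {
					if !pdom[s][cand] {
						inAll = false
						break
					}
				}
				if inAll {
					next[cand] = true
				}
			}
			// Sets only ever shrink, so a size change means a real change.
			if len(next) != len(pdom[n]) {
				pdom[n] = next
				changed = true
			}
		}
	}
	return pdom
}

// conditionalCFG is the CFG from the diagram above.
func conditionalCFG() map[string][]string {
	return map[string][]string{
		"entry":   {"goStmt"},
		"goStmt":  {"ifCond"},
		"ifCond":  {"closeCh", "noClose"},
		"closeCh": {"ret"},
		"noClose": {"ret"},
		"ret":     {},
	}
}

func main() {
	pdom := postDominators(conditionalCFG(), "ret")
	fmt.Println("close(ch) post-dominates launch:", pdom["goStmt"]["closeCh"])
	fmt.Println("cond? post-dominates launch:", pdom["goStmt"]["ifCond"])
}
```

For this graph the first line reports false and the second true, matching what the tree shows. Production analyzers typically use a faster dominator-tree algorithm; the set version is just easier to read.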

Data Flow Analysis

Now, consider another extension to the example above:

func f() {
    ch := make(chan int)
    x := ch
    go func() { for range x {} }()
    close(ch)
}

In this case, we are assigning ch to another variable. The flow is trivially linear, so you would think our initial line-number comparison would work. It actually wouldn't. A post-dominance check would be fine in the sense that close(ch) is indeed on every path, but the goroutine ranges over x while we close ch. Matching on identifiers alone, we would wrongly conclude that the channel the goroutine ranges over is never closed. We clearly need a deeper understanding of how values flow, and that's where DFA comes into play: it tracks values rather than names. The specific form of DFA that helps here is called reaching definitions.

ch := make(chan int)   // def₁: ch = <new channel>
x := ch                // def₂: x = ch
go func() {
    for range x {}     // use of x: which definition reaches here?
}()
close(ch)              // use of ch: which definition reaches here?

This is what the DFA looks like:

graph LR
    def1["def1: ch = make(chan int)"]
    def2["def2: x = ch"]
    useX["use: range x"]
    useCh["use: close(ch)"]

    def1 -->|"ch flows to"| def2
    def1 -->|"ch flows to"| useCh
    def2 -->|"x flows to"| useX
    def1 -.->|"same underlying value"| useX

def₁ creates the channel. def₂ copies it into x. The reaching definition for range x is def₂, which in turn got its value from def₁. The reaching definition for close(ch) is def₁ directly. Both trace back to the same make(chan int) so they operate on the same channel.
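The tracing described above can be sketched for straight-line code in a few lines. This is a toy model, not a full reaching-definitions solver: the def type and the origins function are hypothetical, and it relies on the fact that without branches the latest definition of a name is the one that reaches every later use.

```go
package main

import "fmt"

// def is one definition in straight-line code: name = make(...) when
// copyOf is empty, or name = copyOf otherwise.
type def struct {
	name   string
	copyOf string
}

// origins walks the definitions in program order and records, for each
// variable, which allocation its current value traces back to.
func origins(defs []def) map[string]string {
	origin := make(map[string]string)
	for i, d := range defs {
		if d.copyOf == "" {
			origin[d.name] = fmt.Sprintf("make#%d", i) // fresh channel
		} else {
			origin[d.name] = origin[d.copyOf] // inherit the source's origin
		}
	}
	return origin
}

func main() {
	o := origins([]def{
		{name: "ch"},              // def1: ch := make(chan int)
		{name: "x", copyOf: "ch"}, // def2: x := ch
	})
	// range x and close(ch) operate on the same channel when the origins match.
	fmt.Println(o["x"] == o["ch"])
}
```

Once branches or loops enter the picture, more than one definition can reach a use, and the analysis has to work over sets per basic block instead of a single map.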

Static Single Assignment

If you imagine a real-world program with branches, loops, or reassignments, doing this kind of analysis on raw code is a lot of work. SSA makes this relatively easy, and Go makes it even easier with the golang.org/x/tools/go/ssa package. In SSA, every variable is assigned exactly once. If the original code assigns a variable twice, SSA creates two different names. Here's what the example looks like in simplified SSA form:

t0 = make chan int     ; the one and only channel value
t1 = t0                ; x := ch so SSA shows t1 IS t0
go func() {
    range t1           ; uses t1, which IS t0
}
close(t0)              ; same value

If you process the SSA, and not the raw code, you can see that range and close operate on the same object. This example was fairly trivial and, to be honest, might be handled by regular data flow analysis. What makes SSA more interesting is when you have different control flows that need to be merged. So let's look at another example.

func f(cond bool) {
    ch1 := make(chan int)
    ch2 := make(chan int)
    var x chan int
    if cond {
        x = ch1
    } else {
        x = ch2
    }
    close(x)  // which channel does this close?
}

The reasonable answer is that either channel could be closed, depending on whether cond is true or false. In SSA, this would be:

t0 = make chan int       ; ch1
t1 = make chan int       ; ch2
if cond goto block1 else block2

block1:
  jump block3

block2:
  jump block3

block3:
  t2 = phi [block1: t0, block2: t1]   ; "x is t0 if we came from block1, t1 if block2"
  close(t2)

Looks a bit scary, but the most important thing here is the phi node. It encodes that the value of t2 depends on the path we took to reach block3. Phi nodes are what let SSA keep its "every variable is assigned exactly once" property even when control flow merges. For concurrency specifically, phi nodes are important because they can represent runtime choices, such as whether to use a buffered or unbuffered channel or which server to connect to.
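One way an analysis might consume a phi node is to treat it as "any of these". The sketch below models SSA values with a hypothetical value type (it does not use the real ssa package types) and collects every allocation a value could refer to at run time by following phi edges.

```go
package main

import (
	"fmt"
	"sort"
)

// value is a toy SSA value: either a channel allocation (no edges) or a
// phi node that merges one incoming value per predecessor block.
type value struct {
	name  string
	edges []*value // non-empty only for phi nodes
}

// mayAlias returns the names of every allocation v could refer to at run
// time, following phi edges transitively.
func mayAlias(v *value) []string {
	seen := map[*value]bool{}
	var out []string
	var walk func(x *value)
	walk = func(x *value) {
		if seen[x] {
			return // phi nodes in loops can refer back to themselves
		}
		seen[x] = true
		if len(x.edges) == 0 {
			out = append(out, x.name)
			return
		}
		for _, e := range x.edges {
			walk(e)
		}
	}
	walk(v)
	sort.Strings(out)
	return out
}

func main() {
	t0 := &value{name: "t0"}                          // ch1
	t1 := &value{name: "t1"}                          // ch2
	t2 := &value{name: "t2", edges: []*value{t0, t1}} // phi [block1: t0, block2: t1]
	fmt.Println(mayAlias(t2))                         // the channels close(t2) may close
}
```

For a leak check this matters in both directions: a close through a phi only maybe closes each incoming channel, so it cannot be counted as a guaranteed close of either one.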

What we've looked at are just single-function definitions. A more common use-case is when channels are created at one place, passed to consumers, producers or cleanup functions.

Take this example for instance:

func produce(ch chan<- int) {
    for i := 0; i < 5; i++ {
        ch <- i
    }
    close(ch)
}

func consume(ch <-chan int) {
    go func() {
        for range ch {}
    }()
}

func main() {
    ch := make(chan int)
    consume(ch)
    produce(ch)
}

Within main, the CFG is straight-line. SSA tells us that consume and produce both receive the same channel value (the MakeChan from main). But neither tells us whether the goroutine launched inside consume eventually sees a close on its channel. There are three different functions here, and none of the analyses we've done so far can connect them. As you would imagine, we need some kind of call graph. Enter interprocedural analysis. The call graph looks something like this:

graph TD
    main["main()"]
    consume["consume(ch)"]
    produce["produce(ch)"]
    anonFn["go func() { range ch }"]

    main -->|"calls"| consume
    main -->|"calls"| produce
    consume -->|"launches goroutine"| anonFn

This tells us who calls whom, but what we really wanna know is what happens to the channel. You could absolutely walk through each callee body and re-analyze it, but you quickly run into performance issues if say produce is called multiple times, or recursively. A better approach is to track a compact description of what a function does with its parameters without re-analyzing its entire body. This is called a function summary. You compute it once, then reuse it everywhere the function is called. Summaries are also nice because you can do cross-package analysis. The contrived function summaries look like this:

produce(ch chan<- int):
    sends to:  param#0
    closes:    param#0
    launches:  (none)

consume(ch <-chan int):
    sends to:  (none)
    closes:    (none)
    launches:  goroutine that ranges param#0

From the perspective of main:

main():
    ch = make(chan int)

    consume(ch):
        → launches goroutine that ranges ch

    produce(ch):
        → sends to ch
        → closes ch

We can see here that produce does indeed close the same ch we created in main, which answers our question from earlier.
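Here is a minimal sketch of how summaries like these might be represented and applied at call sites. The summary and call types and the leakedChannels function are hypothetical, and the check is deliberately simplistic: it ignores ordering and whether the close happens on every path, which is what the CFG checks from earlier are for.

```go
package main

import "fmt"

// summary describes what a function does with each channel parameter,
// indexed by parameter position.
type summary struct {
	closes []bool // closes[i]: the function closes param i
	ranges []bool // ranges[i]: it launches a goroutine that ranges over param i
}

// call is one call site in the caller: which function is invoked and which
// local channel is passed for each parameter.
type call struct {
	fn   string
	args []string
}

// leakedChannels applies the callee summaries at each call site and returns
// the channels that some goroutine ranges over but no callee closes.
func leakedChannels(calls []call, summaries map[string]summary) []string {
	ranged := map[string]bool{}
	closed := map[string]bool{}
	var order []string // first-seen order, for stable output
	for _, c := range calls {
		s := summaries[c.fn]
		for i, arg := range c.args {
			if i < len(s.ranges) && s.ranges[i] && !ranged[arg] {
				ranged[arg] = true
				order = append(order, arg)
			}
			if i < len(s.closes) && s.closes[i] {
				closed[arg] = true
			}
		}
	}
	var leaked []string
	for _, ch := range order {
		if !closed[ch] {
			leaked = append(leaked, ch)
		}
	}
	return leaked
}

func main() {
	summaries := map[string]summary{
		"produce": {closes: []bool{true}, ranges: []bool{false}},
		"consume": {closes: []bool{false}, ranges: []bool{true}},
	}
	calls := []call{
		{fn: "consume", args: []string{"ch"}},
		{fn: "produce", args: []string{"ch"}},
	}
	// No channel is left ranged-but-unclosed: produce closes what consume ranges over.
	fmt.Println(len(leakedChannels(calls, summaries)))
}
```

Dropping the produce call from the list makes ch show up as leaked, since nothing else closes it.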

We started with a deceptively simple question: does a goroutine leak? Answering it pulled us through four layers of analysis. Control flow analysis gave us the CFG, post-dominance and the ability to reason about which paths exist. Data flow analysis gave us reaching definitions, the ability to track which values flow along those paths. SSA unified both into a single representation where the answers are structural, not computed. Interprocedural analysis extended the picture across function boundaries through call graphs and function summaries. Each technique answers a different dimension of the same question: paths, values, and boundaries. The goroutine leak was the motivating example, but these same tools generalize to most concurrency questions you can think of: double closes, sends to channels no one receives from, mutexes held across goroutine boundaries. You just need to ask the right questions!

Connecting to multiple tailscale networks on a single host

Bill Njoroge — Wed, 11 Feb 2026 00:00:00 GMT

Tailscale is one of those tools that quietly becomes load-bearing in your life. I use it at work, at home, and I've run it on machines I barely remember owning. So naturally the moment I wanted to connect to two tailnets simultaneously, on the same laptop, I was bummed to discover you basically can't.

Logging out and back in works, sure. Doing that twelve times a day adds up.

Tailscale runs a single daemon, tailscaled, which creates one tunnel interface (tailscale0) and plants itself firmly in the host's networking stack. It assumes it owns the networking environment. Connecting to a second tailnet means evicting the first one.

The core idea

The easiest way to get a completely separate networking stack is to, well, run VMs on the host. I use this approach on my Mac with Orbstack. Orbstack machines are almost like full VMs, but much faster and lighter. You can also use Lima or Colima to achieve the same thing. The basic idea is that the host connects to one tailnet, the VM connects to another, and they never interact.

Host machine → tailnet A
VM → tailnet B

On Linux, you can get the exact same isolation without the hypervisor overhead using network namespaces. A network namespace gives a process its own interfaces, routing table, and firewall rules. Software running inside it can't even tell it's sharing a kernel with anything else. Containers work like this under the hood. But running a VPN client inside a container usually means handing it elevated privileges so it can create tunnel interfaces and mess with routing tables.

Another approach worth considering is a userspace networking stack via SOCKS5 proxy. Instead of creating actual tunnel interfaces, Tailscale can run with --tun=userspace-networking, routing all traffic through a SOCKS5 proxy on localhost. This doesn't require elevated privileges and works well if your applications understand SOCKS5.

tailscaled --tun=userspace-networking --socks5-server=localhost:1890 \
  --socket /home/bnjoroge/.local/share/tailscale1/tailscaled.sock \
  --statedir /home/bnjoroge/.local/share/tailscale1/ &
tailscaled --tun=userspace-networking --socks5-server=localhost:1056 \
  --socket /home/bnjoroge/.local/share/tailscale2/tailscaled.sock \
  --statedir /home/bnjoroge/.local/share/tailscale2/ &

However, the SOCKS5 approach has a significant limitation: most applications don't understand SOCKS5 natively. You'd need to configure each tool individually—SSH via ProxyCommand, curl with --socks5, browsers with manual proxy settings. For anything that doesn't support SOCKS5, you'd have to layer an HTTP proxy on top, which adds complexity and breaks protocols like SSH that don't work through HTTP proxies. If you try to use Docker or Orbstack to isolate apps that need the proxy, you're back to managing VMs, which defeats the purpose of avoiding hypervisor overhead.

For my use case where I want CLI tools, databases, and services to transparently connect to different tailnets without per-application configuration, the proxy approach doesn't cut it. Working directly with namespaces gives you a completely separate networking stack where everything just works. Any tool, any protocol, no configuration needed beyond the initial setup.

You might think Docker would be perfect here. Containers use namespaces and cgroups under the hood, which is exactly what we need for isolation and resource limiting. VPN daemons, though, need to create tunnel interfaces and manipulate routing tables, which requires CAP_NET_ADMIN and CAP_SYS_MODULE capabilities. Running a containerized Tailscale instance involves some friction. You could run with the privileged flag and lose the isolation benefit, or pass specific capabilities and end up doing low-level networking setup inside the container anyway. You could also use host networking mode to let the container reach the host's network stack, but then you're back to a single shared network namespace and can't run two Tailscale instances simultaneously. Even with custom bridge networks and veth pairs between containers, you're essentially recreating the namespace setup inside Docker's abstraction layer, which adds complexity without much benefit. Docker shines for application isolation and multi-tenancy where you want resource limits, but for this specific task of running network daemons with their own stacks, the tooling overhead outweighs the containerization benefits.

Let's get into it.

We'll start by creating the namespace.

sudo ip netns add ts-b

Verify it's there:

ip netns list

Right now the namespace has nothing but a loopback interface: no internet access, or anything else really. So we need to hook it up with a virtual ethernet cable. A veth pair is exactly what it sounds like: a virtual cable with two ends. Anything going into one end comes out the other.

sudo ip link add veth-host type veth peer name veth-ns

Both ends currently live in the host namespace. We fix that by leaving one end on the host and moving the other into the ts-b namespace.

sudo ip link set veth-ns netns ts-b

Now the topology looks like this:

Host namespace
    |
    veth-host
    |
  [cable]
    |
    veth-ns
    |
Namespace ts-b

Next, give each end of the cable an address and bring it up. Host side:

sudo ip addr add 10.200.1.1/24 dev veth-host
sudo ip link set veth-host up

Namespace side:

sudo ip netns exec ts-b ip addr add 10.200.1.2/24 dev veth-ns
sudo ip netns exec ts-b ip link set veth-ns up
sudo ip netns exec ts-b ip link set lo up

Without a default route, traffic from inside the namespace has nowhere to go. We tell it to send everything to the host, then enable IP forwarding on the host so the kernel will actually route those packets onward:

sudo ip netns exec ts-b ip route add default via 10.200.1.1

sudo sysctl -w net.ipv4.ip_forward=1

Now we need to handle NAT (Network Address Translation) so traffic leaving the namespace appears to come from the host. The namespace has its own private IP range (10.200.1.0/24), but external systems on the internet only know how to reach the host's IP. The host needs to "masquerade" outgoing traffic from the namespace, rewriting the source IP to the host's address.

First, find your main network interface by running ip link show and looking for your primary Ethernet or Wi-Fi interface (usually eth0, enp3s0, or something like wlan0). Then add the masquerading rule:

sudo iptables -t nat -A POSTROUTING -s 10.200.1.0/24 -o eth0 -j MASQUERADE

This says: "Any packet leaving the host out interface eth0 with source IP in 10.200.1.0/24 (our namespace), rewrite the source IP to the host's IP." Without this, the remote system would see packets coming from 10.200.1.2, have no way to route back to that private IP, and drop the response.
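If the CIDR notation in that rule is unfamiliar, the source match can be illustrated with a few lines of Go using the standard net/netip package. The needsMasquerade function is a hypothetical name for this post; it mirrors only the -s 10.200.1.0/24 part of the rule and ignores the outgoing-interface match.

```go
package main

import (
	"fmt"
	"net/netip"
)

// needsMasquerade mirrors the source match of the rule: the packet's source
// address must fall inside the namespace subnet.
func needsMasquerade(src string) bool {
	subnet := netip.MustParsePrefix("10.200.1.0/24")
	addr, err := netip.ParseAddr(src)
	if err != nil {
		return false // not a valid IP address, so it cannot match
	}
	return subnet.Contains(addr)
}

func main() {
	fmt.Println(needsMasquerade("10.200.1.2"))   // namespace end of the veth pair
	fmt.Println(needsMasquerade("192.168.1.50")) // some other host on the LAN
}
```

The first address is inside the /24 and would be rewritten; the second is outside it and would pass through untouched.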

But masquerading alone isn't enough. The kernel's FORWARD chain controls whether packets can traverse between interfaces. On some systems (hosts running Docker, for example) this chain has a policy of DROP, meaning it refuses to forward anything unless explicitly allowed. We need to allow traffic from the namespace (veth-host) to the outside world (eth0) and allow responses back:

sudo iptables -A FORWARD -i veth-host -o eth0 -j ACCEPT
sudo iptables -A FORWARD -i eth0 -o veth-host -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

The first rule allows outgoing traffic from the namespace. The second allows incoming traffic only if it's related to a connection we initiated (the conntrack module tracks connection state). This way, the namespace can initiate connections, but random inbound traffic from the internet still gets dropped.

Note: If your system uses nftables instead of iptables, add the equivalent rules to your nftables config instead of mixing firewall systems. You can check with iptables --version or nft list ruleset.

Before bringing up Tailscale, it's worth checking that the namespace can actually reach the internet:

sudo ip netns exec ts-b ping -c 1 1.1.1.1

If that works, the namespace has a way out and Tailscale has a fair shot at coming up cleanly.

With the namespace set up, we just need to launch a second tailscaled inside it, pointing it at its own state file and socket so it doesn't know the first one exists. The commands below use /tmp for simplicity; if you want this to survive reboots, use paths under something like /var/lib/ and /var/run/ instead.

Btw tailscaled runs in the foreground. Either open a second terminal for the next step, background it, or wire it up to a service manager later.

sudo ip netns exec ts-b tailscaled \
  --state=/tmp/tailscale-b.state \
  --socket=/tmp/tailscale-b.sock

In another terminal:

sudo ip netns exec ts-b tailscale \
  --socket=/tmp/tailscale-b.sock \
  up

Open the URL it prints to authenticate and you are pretty much all set. The namespace spins up its own tailscale0 interface.

Check the host is still connected to tailnet A:

tailscale status

Check the namespace is connected to tailnet B:

sudo ip netns exec ts-b tailscale \
  --socket=/tmp/tailscale-b.sock \
  status

And just to see it with your own eyes:

sudo ip netns exec ts-b ip addr

Two tailscale0 interfaces, two separate tailnets, one machine.

What we actually built

Linux host
│
├─ root namespace
│   tailscale0 → tailnet A
│
└─ namespace ts-b
    tailscale0 → tailnet B

Each namespace has its own routing table, firewall state, and network interfaces. They share a kernel but otherwise have no idea about each other.

Once you have two separate networking stacks on one machine, you could let them talk to each other. Maybe you wanna easily send something from one tailnet to the other. The host can selectively forward traffic between namespaces, which effectively turns it into a bridge between tailnets. From there it's a short walk to things like controlled access gateways between environments, smooth migration paths when you're moving infrastructure between networks, or exposing specific services from one tailnet into another without fully merging them. And of course, you get the rest of Tailscale's benefits like Funnel, SSH, etc. I spend most of my time split between a Mac and Linux machines, so I use both this approach and Orbstack machines.

Infra and devtool themes I'm excited about in 2024

Bill Njoroge — Thu, 29 Jan 2026 00:00:00 GMT

This post was originally published on Substack in January 2024. Re-published here with minor edits for tense and clarity.

So much happened in 2023, especially in cloud, data, and ML infrastructure, developer tools, and of course the craze of LLMs. Below are some themes that emerged prior to or became more mainstream in 2023, which I am excited about in 2024 and beyond. I should also mention these are merely interesting trends, not predictions. It’s an archive to look back to a few years down the line to see how they panned out.

data (and ml) stacks continue to be more composable

This trend predates 2023. Ever since the early 2000s, we’ve had many options at every layer of the data stack: new hardware (GPUs/specialized chips), different compute engines (dask/spark), databases, table formats, specialized engines (druid/clickhouse), SQL dialects, high-level dataframe APIs (pandas/polars).

These options are great for builders but often come with a pretty expensive integration tax. Having a standardized interface, much like JVM bytecode or LLVM IR, streamlines data exchange and ensures interoperability. Apache Arrow (originally just an in-memory columnar data specification, but now including low-level tools like Flight SQL, ADBC, and DataFusion) was one of the first projects that pioneered this trend.

The other exciting component is Substrait, which represents compute operations across different SQL parsers and execution engines. This is particularly useful for scenarios where users employ different frameworks (pandas/polars) or languages depending on data scale, or compile different SQL dialects/query languages like Malloy. These components are implementation details and should be abstracted away from users. I'm excited to see more high-level Arrow-native and Substrait-native data systems like Gluten.

On the ML infra side, I am excited about projects like Ray, which contains multiple composable tools, Carton (which allows you to write application code in a different language from your Python ML inference), or run.house. Even in LLMs, composability is key, with concepts such as LLM routing and model chaining.

s3 continues to re-define data infra

S3 (and compatible object storage like Cloudflare's R2) has become the default choice for source-of-truth persistent storage, moving away from disk-based volumes that most infrastructure products used previously. Initial use cases included data warehouses (Databend), analytics databases (Chaossearch), search engines (Quickwit), and columnar log storage (Husky).

Recent use cases include serverless Postgres providers (Neon), streaming platforms (Warpstream), vector databases (LanceDB), and even file systems or key-value stores. You get a lot for "free" by leveraging an S3 backend. Most hard distributed systems challenges — durability, availability, and consistency — are better delegated to battle-tested systems.

One trade-off is higher latency. AWS introduced a new storage class, S3 Express One Zone, which offers up to 10x faster access at a higher price. While it's slower than Redis, it is faster than standard S3 and provides IAM and security policies out of the box.

postgres continues to be the universal database platform

Postgres has become the default database of choice, growing significantly faster than alternatives. Developers are using it for data warehousing (Hydra), vector search (pgvector), machine learning (PostgresML), and search and analytics (ParadeDB).

Not having to manage separate infrastructure for each use case is a massive productivity unlock for data teams. Bundlers like Omnigres take this further by including caching, auth, and deployment logic. Putting logic in the database was an anti-pattern back in the day — maybe we are coming full circle?

local-first finally becoming mainstream

Local-first software has been buzzing with the rise of multiplayer applications like Google Docs. The benefits — security, privacy, offline capabilities — are clear, though the tooling remains nascent compared to client-server.

As developers, we still grapple with CRDTs and syncing complexities, but interesting projects are bridging the gap:

SQLite-based: cr-sqlite, SQLSync
Postgres-centric: ElectricSQL
Full-stack: Triplit

ml in database internals

There are many databases for ML, but fewer production use cases for applying classical ML within core database operations. OtterTune is the gold standard for configuration tuning, but I'm excited about applications in query optimization (beyond simple costs), learned indices, join order planning, compression techniques, and workload prediction.

workflow engines

Everything is a workflow. Durable Execution especially has been a hot space. While Temporal remains the de-facto standard, the ecosystem is evolving rapidly. Managing state is hard, and I am excited to see how this space converges.

minimizing the feedback loop

The current dev workflow — working locally, pushing to a branch, waiting for CI, code review — is often painful. I'm excited about ideas that reduce this loop:

Local emulation: LocalStack, Wing
Remote collaboration: Tunnel, Zed collaboration features
Local CI: Dagger
Ephemeral dev environments: Reducing the reliance on a messy localhost setup.

web assembly (wasm)

Wasm has been the rage for a while, but its most interesting use case is extending projects by running code in high-level languages within restricted environments (like UDFs in databases). Projects like Extism, TiDB UDFs, Convex, and SpacetimeDB are proving that Wasm is a powerful layer for interoperability.

I had a ton of fun writing this. A lot of these areas have interesting, unsolved technical challenges. Let me know what areas you are excited about or working on!

This post was originally published on Substack in January 2024. Re-published here with minor edits for tense and clarity.

AI-native software infrastructure: what I was excited about in 2023

Bill Njoroge — Tue, 27 Jan 2026 00:00:00 GMT

This post was originally published on Medium in January 2023. Re-published here with minor edits for tense and clarity.

Over the past decades, we've seen different platform shifts (from the web to the cloud to mobile) create immense value. Everyone has been speculating on what the post-mobile platform shift is (from DeFi to 5G etc). AI (loosely referring to deep learning in this context) is deservedly poised to be the next platform upon which billions of dollars of value will be derived.

Over the past two decades, AI has seen remarkable progress, but most of the models have been task-specific. That changed after Google researchers introduced the Transformer architecture in the 2017 paper Attention Is All You Need. The new architecture was much easier to parallelize (read: better performance), quicker to train, and able to generalize across discrete tasks, and the term Foundation Models (FMs) became more mainstream. From the State of AI report, "The Transformer architecture has expanded far beyond NLP and is emerging as a general purpose architecture for ML". FMs have two crucial attributes: emergence (the ability to exhibit new behaviors implicitly) and generalizability (the ability to be used as a base for multiple use cases).

As with every technology shift, the applications are what's always exciting but as an infrastructure nerd, I tend to lean more on the enablers of the application layer. As they say, in every gold rush, you ideally want to be the one selling shovels. I write about a few areas I was particularly excited about both as a developer and as a (budding) VC with a particular interest in investing in technology startups.

infrastructure

Over the past couple of years, we saw an increasing focus on building the infrastructure to run, deploy and manage models at scale, leading to the emergence of MLOps as a practice (which encapsulates data validation, model testing, evaluation, deployment, versioning, etc). For enterprises to reliably productionize FMs, however, there's a need for an extended version of MLOps. Retraining FMs, and in particular LLMs (a variation of FMs trained on large corpora of text), is prohibitively expensive given how huge they are (GPT-3 has 175 billion parameters). As a result, there has been a huge focus on using a data-driven approach, necessitating improved ways of managing FMs at scale.

integrating FMs with different entities

This is more of a "middle layer", actually. Combining FMs with computation or external memory/knowledge greatly increases their capabilities. The most common implementation of this for LLMs was LangChain. LangChain offered interesting capabilities such as agents that execute different "actions" (using a context manager), memory to enable persistence across different agent calls, interoperability between different LLMs, and the ability to test, template, experiment and emulate various prompts at scale.

GPT-Index was also a very exciting project that offered a simple and extensible interface between external data and your LLMs. It helped resolve prompt-size limitations allowing you to query external data instead of updating the model's weights.

I was excited about more infra innovation around support for multiple modalities — allowing for absolute interoperability and even more exciting applications — an area I was exploring with medical data while at Hopkins.

better tooling for prompt engineering

Prompt Engineering proved to be a very effective way of improving the accuracy of LLMs' outputs. Various techniques emerged such as zero-shot (prompt with no examples) and few-shot (prompt with one or n examples). Getting the right prompt involves a lot of iterations, and being able to do that at the enterprise level requires solid infrastructure. For instance, changing the order of the few shots, or versioning the various inputs can influence the LLM performance.
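To make the zero-shot versus few-shot distinction concrete, here is a minimal sketch of assembling such prompts as plain strings. The `build_prompt` helper, the "Input:/Output:" layout, and the example sentences are all made up for illustration; no particular LLM API is assumed.

```python
def build_prompt(task, examples, query):
    """Assemble a prompt from a task description, worked examples, and a query.

    Hypothetical helper: the layout is an assumption, not a standard format.
    """
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


task = "Classify the sentiment of each input as positive or negative."
shots = [
    ("I loved this film", "positive"),
    ("The service was slow", "negative"),
]
query = "The battery died in a day"

zero_shot = build_prompt(task, [], query)      # no examples
few_shot = build_prompt(task, shots, query)    # two examples
reordered = build_prompt(task, list(reversed(shots)), query)  # same shots, new order

print(few_shot)
```

Versioning the `task`, `shots`, and their order separately is one way to keep track of which variant produced which output.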

I was also excited about ideas such as Language Model Programming, which sought to provide more expressiveness and granularity when querying LLMs, increasing not only the accuracy of the outputs but also yielding cost savings.

Additionally, I thought we'd see search engines or databases specifically for storing prompts and the corresponding outputs. At an organization level, this would ensure better reproducibility, especially across workflows.

compute infra

Training and deploying FMs is extremely expensive. For reference, even with the lowest-cost GPU, the cost of training GPT-3 was estimated at $4.6 million. Additionally, for every copy of a foundation model tweaked to serve a new purpose (such as a model that translates to French and another that translates to Mandarin) you have to host a new version of that model. To achieve massive scale, there is a need to make compute less expensive.

There were various approaches to reduce the training costs, such as using application-specific chips (ASICs) like Google's TPUs. Other companies such as MosaicML demonstrated that software-centric approaches such as data parallelism can significantly improve the cost/performance ratio for FMs. Data-centric ML, as coined by Andrew Ng, was going to be even more relevant. I thought we'd see a more active focus on low-hanging optimizations to significantly reduce the cost of deploying FMs, whether that was medium-sized implementations of LLMs, such as nanoGPT, new parallelism techniques, or reducing the re-computation of different transformer layers.

Once the FMs have been deployed, another infrastructure layer is needed to enable applications to do inference (making predictions using new data — or rather, a lot of matrix multiplications) from the models. This particular infrastructure has to handle low latency and high throughput. I was excited about different ways to accelerate inference, whether that's by using hardware architectures (FPGA, TPU, etc), software (graph compilers, etc), or algorithms (pruning, quantization, etc). One company building in this space that I was excited about was Modular, which was solving hard problems at the intersection of compilers and AI.

deployment infra

A huge focus on open-source and community-led development was critical to ensuring AI becomes more mainstream. Meta's FAIR released MultiRay, a platform for running state-of-the-art AI models at scale that allows multiple models to run on the same input and share the majority of processing costs while incurring only a small per-model cost. While that was Meta-specific, I thought we'd see organizations with AI-intensive workloads adopt firm-wide frameworks to deploy their FMs across different business units.

In a similar vein to MLOps, I thought Infrastructure as Code (IaC) would become even more relevant for productionizing (open source) FMs. The current de-facto tool is HashiCorp's Terraform. At Nvidia, that past summer, one of my projects was to transition my team's ML workflow orchestration pipeline from a YAML-based approach to a more declarative framework using high-level languages. This introduced benefits such as composability, flexibility, less error-prone config files, and much less redundancy.

security and safety infra

Current FMs are huge, multi-billion-parameter black boxes, which makes it incredibly hard not only to explain them but also to assess the risks and vulnerabilities of using them. These security guarantees are absolutely necessary for use cases such as healthcare or finance. It is thus imperative that, even as the attack surface increases with downstream applications, the potential vulnerabilities are mitigated. These include model artefacts, corrupted training data, the potential to expose data after fine-tuning, and package dependency vulnerabilities such as the recent one in PyTorch's nightly build.

Just as important is to think about how the security framework between infrastructure providers (such as OpenAI or Anthropic) and applications will evolve. I thought we'd see a shared responsibility model, similar to that of cloud companies, where both players play a role in ensuring security guarantees.

tooling

embeddings infra

From OpenAI, embeddings are "numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts". One of the most remarkable developments by OpenAI that never got as much public limelight was their new Embeddings Model, which outperformed their previous model at most tasks while being 99.98% cheaper and having a context window that's twice as large. I was excited to see more projects that not only host embeddings from multiple models (performance varies across tasks) but also allow for fine-tuning, ideally with an on-premise option.

vector databases

Unstructured data (which includes images, video, text, and audio) accounts for 80-90% of any organization's data. Being able to index, store and search across that data directly, instead of relying on human-generated labels or tags, is exactly what vector databases were meant to solve. They have direct use cases especially in building better semantic search applications and recommendation systems. Vector databases on my radar included Pinecone, Weaviate, and Milvus (the latter two being open source).
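As a rough illustration of the core operation, here is a brute-force nearest-neighbour search over embeddings using cosine similarity. The three-dimensional vectors and document names are invented for the example, and real vector databases use approximate indexes rather than a linear scan like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(vec, query), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings": made-up numbers, not the output of any real model.
index = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.8, 0.3, 0.1],
    "tax_form": [0.0, 0.1, 0.9],
}

results = search(index, [1.0, 0.2, 0.0], k=2)
print(results)
```

The query lands closest to the two animal photos without any labels or tags being involved.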

AI-native dev tools

I always think of developer tools as a hidden 10x multiplier for not only engineering productivity but also output. Current AI-native dev tools such as GitHub Copilot and Replit's GhostWriter only scratch the surface of what is possible. I thought we'd see AI seep even further down the dev tool stack with eventually the ability to not only generate but also execute code and perform optimizations ad-hoc. Some attempts at this included Cursor (AI-native code editor), Scalene (AI-optimized Python profiler), and even more ambitious ideas such as building a GPT-only backend.

While most consumer-centric generative AI applications captured much of the public limelight, I believed most of the value accrual would — just like in enterprise software — be verticalized. AI would become as table-stakes as cloud-native or mobile-native has been over the past decade. Rather than building multiple models for different use cases and datasets, companies would focus on using proprietary data to enhance foundation models and using them to build more intelligent applications. 2023 was going to be a very defining year for AI, especially at the infrastructure layer.

Python Attribute Lookup

Bill Njoroge — Fri, 01 Aug 2025 00:00:00 GMT

In an effort to understand the very basics, I've been implementing classes in Python from scratch, starting with just functions and dictionaries. I was curious how to implement attribute lookup the way Python actually does it. So I went down a rabbit hole and came across this article that covers it exhaustively. I learned a great deal about how a lot of the higher-level tools (ORMs, frameworks like FastAPI, and dataclasses) are implemented.

Some key takeaways are:

Instance variables take precedence over class variables.
```
class Foo:
    name = 'Foo class attribute'

foo = Foo()
foo.name  # found on the class, since the instance has no 'name' of its own
```
From above, foo.name essentially does a foo.__getattribute__("name"), which checks the instance's attribute dict first, doesn't find it, then checks Foo.__dict__, where it does find it.
Data descriptors take precedence over instance attributes, class attributes and non-data descriptors.

You can also have things called descriptors. These are basically just objects that change how you access an object's attributes. There are two kinds:

a) Data descriptors
b) Non-data descriptors.

Data descriptors implement __set__ or __delete__ (usually alongside __get__), while non-data descriptors only implement __get__. The easiest example of this is @property. Under the hood, it's basically a data descriptor.

The data descriptor is the first thing that Python looks up when trying to get an attribute. Data descriptors like properties therefore override an instance's attributes, non-data descriptors, and plain class attributes.

To illustrate this, here's an example:

class Foo:
    @property
    def x(self):
        return "computed"

obj = Foo()
obj.__dict__["x"] = "shadow?"
obj.x

This will actually return "computed" and not "shadow?" because @property is a data descriptor. Python finds x on the class first, sees that it defines both __get__ and __set__, and calls its __get__ before it ever looks at the instance's __dict__. The shadow value stored on obj is never consulted.
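For contrast, here is a small sketch of the same trick against a non-data descriptor, where the instance value does win. The `NonData` and `Data` classes are hypothetical names written just for this comparison.

```python
class NonData:
    """Non-data descriptor: defines only __get__."""

    def __get__(self, obj, objtype=None):
        return "from non-data descriptor"


class Data:
    """Data descriptor: defines __get__ and __set__."""

    def __get__(self, obj, objtype=None):
        return "from data descriptor"

    def __set__(self, obj, value):
        raise AttributeError("read-only")


class Foo:
    a = NonData()
    b = Data()


obj = Foo()
before = obj.a  # nothing in the instance dict yet, so the descriptor answers

# Write straight into the instance dict, bypassing any __set__.
obj.__dict__["a"] = "instance value"
obj.__dict__["b"] = "instance value"

print(before)  # from non-data descriptor
print(obj.a)   # instance value (the instance dict beats a non-data descriptor)
print(obj.b)   # from data descriptor (the data descriptor beats the instance dict)
```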

So the general approach is:

Data descriptor (property)
Instance __dict__
Non‑data descriptor
Class attribute
MRO
__getattr__

When you have multiple inheritance, Python uses the Method Resolution Order (MRO) to decide which class an attribute comes from, specifically the C3 linearization algorithm.
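The lookup order listed above can be sketched in plain Python. This is a simplified approximation of what the default attribute lookup does for ordinary instances, not CPython's actual implementation; `lookup`, `find_in_mro` and `Demo` are names invented for this sketch.

```python
_MISSING = object()


def find_in_mro(cls, name):
    """Return the first class-level entry for `name`, walking the MRO."""
    for klass in cls.__mro__:
        if name in vars(klass):
            return vars(klass)[name]
    return _MISSING


def lookup(obj, name):
    cls = type(obj)
    class_attr = find_in_mro(cls, name)
    attr_type = type(class_attr)
    found = class_attr is not _MISSING
    has_get = found and hasattr(attr_type, "__get__")
    is_data = found and (hasattr(attr_type, "__set__") or hasattr(attr_type, "__delete__"))

    # 1. A data descriptor on the class wins over everything else.
    if has_get and is_data:
        return attr_type.__get__(class_attr, obj, cls)

    # 2. Then the instance's own __dict__.
    instance_dict = getattr(obj, "__dict__", {})
    if name in instance_dict:
        return instance_dict[name]

    # 3. Then a non-data descriptor, 4. then a plain class attribute.
    if found:
        if has_get:
            return attr_type.__get__(class_attr, obj, cls)
        return class_attr

    # 5. Finally, fall back to __getattr__ if the class defines one.
    fallback = find_in_mro(cls, "__getattr__")
    if fallback is not _MISSING:
        return fallback(obj, name)
    raise AttributeError(name)


class Demo:
    plain = "class attr"

    @property
    def prop(self):
        return "computed"

    def method(self):
        return "called"

    def __getattr__(self, name):
        return f"fallback:{name}"


d = Demo()
d.__dict__["prop"] = "shadow"
d.__dict__["plain"] = "instance attr"

print(lookup(d, "prop"))      # computed
print(lookup(d, "plain"))     # instance attr
print(lookup(d, "method")())  # called
print(lookup(d, "missing"))   # fallback:missing
```

Functions are non-data descriptors, which is why `lookup(d, "method")` comes back as a bound method.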

Borrowing an image and example from the above article:

class A(object): pass

class B(object):
    x = 'x from B'

class C(A, B): pass

class D(B):
    x = 'x from D'

class E(C, D): pass

![[Pasted image 20260301141019.png]]

>>> E.__mro__
(__main__.E, __main__.C, __main__.A, __main__.D, __main
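To check which x actually wins, the same hierarchy can be run end to end. This is a minimal sketch that prints class names instead of the raw `__mro__` tuple.

```python
class A: pass

class B:
    x = "x from B"

class C(A, B): pass

class D(B):
    x = "x from D"

class E(C, D): pass


mro_names = [cls.__name__ for cls in E.__mro__]
print(mro_names)  # ['E', 'C', 'A', 'D', 'B', 'object']
print(E.x)        # x from D
```

D comes before B in the linearization, so E.x resolves to D's value even though C lists B as a direct base.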

C3 linearization produces an MRO that satisfies three constraints:

1. Local precedence order

If a class lists bases as class C(A, B), then A must appear before B in the MRO of C.

2. Monotonicity

A subclass’s MRO must preserve the order of its parents’ MROs. This prevents “jumps” that would break inheritance consistency.

3. No contradictions

If two parents disagree on ordering, Python must find a consistent merge or raise an error.
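A quick way to see the "raise an error" case is two parents that list the same bases in opposite order. The class names below are made up for the sketch.

```python
class A: pass
class B: pass

class X(A, B): pass  # X wants A before B
class Y(B, A): pass  # Y wants B before A

try:
    class Z(X, Y):   # no ordering can satisfy both parents
        pass
except TypeError as exc:
    error = str(exc)
else:
    error = None

print(error)
```

Python refuses to create Z at class-definition time, rather than picking an arbitrary order.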

This has made it so much easier to understand what frameworks like FastAPI and ORMs do behind the scenes.

Binary Heaps in Python

Bill Njoroge — Tue, 11 Feb 2025 00:00:00 GMT

A binary heap is a data structure where the topmost element is either the smallest or the largest. It's useful when you need to keep track of some sort of order, or even to sort elements, as in heapsort. I like to think of it as a binary tree where, in the case of a min heap, the root node is the smallest element in the entire tree, and in the case of a max heap, it's the largest element in the entire tree. This is also going to be in Python, which has public max heap support via heapq.heapify_max() as of Python 3.14.

By default, I'll be referring to a min heap when I mention a heap, and will explicitly mention the max heap. I'm also going to assume the tree is a complete binary tree, which keeps it balanced. That keeps the height at about log(n), so operations that walk between the root and a leaf are O(log n). If the tree were skewed heavily towards either side, that would degenerate to O(n) time complexity.

We rely on the heap property, which is basically that, for every node i with parent p, the key in p is smaller than or equal to the key in i. It's important to note that a heap is not globally ordered per se, but partially ordered. At each level, siblings are not guaranteed to be sorted with respect to each other.

In the image below, the first tree is a minheap, and as you can see 6 is the least element in the tree, while in the second tree, 17 is the largest element in the tree.

You can represent a heap using its level order traversal in an array. For the ith element in an array, you can access the children as:

left_child = 2*i + 1
right_child = 2*i + 2

Inversely, you can access the parent node of a specific child by doing the reverse of the above for some child node i. We want to take the floor because a right child always sits at an even index, so (i-1)/2 would not give you an integer.

parent_node = (i-1)//2
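The index formulas and the heap property can be checked together in a few lines. This is a minimal sketch; the helper names and the array values are made up and not taken from the figure.

```python
def children(i):
    """Indices of the left and right child of node i."""
    return 2 * i + 1, 2 * i + 2


def parent(i):
    """Index of the parent of node i (for i > 0)."""
    return (i - 1) // 2


def is_min_heap(arr):
    """True if every node is >= its parent, i.e. the min-heap property holds."""
    return all(arr[parent(i)] <= arr[i] for i in range(1, len(arr)))


# Level-order layout: 6 is the root, 12 and 7 are its children, and so on.
heap = [6, 12, 7, 15, 13, 17]

print(children(1))        # (3, 4)
print(parent(3), parent(4))  # 1 1
print(is_min_heap(heap))  # True
```

The array is a valid min heap even though it is not sorted (12 comes before 7), which is the partial ordering mentioned earlier.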

A heap supports the typical operations of a data structure, such as inserting, deleting (from a specific point), finding nodes, and building it from a list. To insert into the heap, we append at the end. This is fairly easy, but it might violate the heap property if the inserted value is smaller than its parent (and that might also be the case for the parent's parent). So we need to restore the property by repeatedly comparing the value with its parent and swapping when it is smaller. This is often called percolating up.

In the worst case, we might need to keep doing it until we get to the root of the tree so in the worst case, the time complexity for doing an insert is bounded by the height of the tree. Since we assumed it's a complete balanced binary tree, that's approximately log(n). Technically there's some constant just before it but we can ignore it.

def insert(heap, value):
    heap.append(value)
    i = len(heap) - 1
    
    # Percolate up
    while i > 0:
        parent = (i - 1) // 2
        if heap[parent] <= heap[i]:
            break  # Heap property satisfied
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent

For deletion, we pop the root, which is the only element the heap lets us remove directly. We also need to maintain the heap order afterwards, so whatever value replaces the root has to end up in a valid position. To do so, we percolate down. We move the last element into the root position, then swap it with its smaller child, repeating until the heap property holds or we reach a leaf. This again ends up being about log(n).

def pop_min(heap):
    if not heap:
        return None
    
    min_val = heap[0]
    heap[0] = heap[-1]
    heap.pop()
    
    # Percolate down
    i = 0
    while True:
        left = 2*i + 1
        right = 2*i + 2
        smallest = i
        
        if left < len(heap) and heap[left] < heap[smallest]:
            smallest = left
        if right < len(heap) and heap[right] < heap[smallest]:
            smallest = right
        
        if smallest == i:
            break  # Heap property satisfied
        
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest
    
    return min_val

We can build a heap in two ways: inserting the elements one at a time, or using the entire list at once. If you insert one at a time using proper heap operations, the cost is O(n log n). Compare that with keeping a sorted list instead: we can find the place to insert using binary search, so that's log(n), but inserting in the middle of a Python list is actually O(n) because you have to shift the elements over. If you are inserting n items, the cumulative cost is O(n²). Heapify is O(n), which is even better.
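The three options compared above look like this side by side, using the standard library's heapq and bisect modules. The input list is arbitrary, and the complexity notes in the comments restate the costs from the paragraph rather than measuring anything.

```python
import bisect
import heapq

data = [5, 3, 8, 1, 9, 2]

# Option 1: push one element at a time, O(n log n) overall.
pushed = []
for value in data:
    heapq.heappush(pushed, value)

# Option 2: keep a fully sorted list, O(n^2) overall because each
# insort may shift O(n) elements even though the search is O(log n).
sorted_list = []
for value in data:
    bisect.insort(sorted_list, value)

# Option 3: heapify a copy of the whole list in place, O(n).
heapified = list(data)
heapq.heapify(heapified)

print(pushed[0], sorted_list[0], heapified[0])  # 1 1 1
```

All three give cheap access to the minimum, but only the sorted list pays for a total order it may never need.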

The other approach is to build the heap directly from the list itself. Python exposes a heapify() API that makes this trivial and has a time complexity of O(n). The gist of how it works is that, since a binary heap is a complete tree, most of the nodes are at the bottom or close to it, so the actual cost to sift each of them down is very cheap, maybe 0 or 1 swaps. We start at the last parent node, which the formula above puts at index n // 2 - 1, and work backwards to the root. We can skip the leaf nodes because they are already valid heaps, having no child nodes. In fact, while the math can get a bit complicated, the code is relatively straightforward. It's some variation of this:

def heapify(arr):
    n = len(arr)
    # Start from last parent node
    for i in reversed(range(n // 2)):
        sift_down(arr, i, n)

# For an input like n=6, we would go to nodes 2, 1, 0
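The snippet above leans on a sift_down helper that isn't shown. Here is one self-contained way it could look, reusing the percolate-down logic from pop_min; treat it as a sketch rather than how heapq implements it.

```python
def sift_down(arr, i, n):
    """Move arr[i] down until neither child (within the first n items) is smaller."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < n and arr[left] < arr[smallest]:
            smallest = left
        if right < n and arr[right] < arr[smallest]:
            smallest = right
        if smallest == i:
            return  # heap property holds at this node
        arr[i], arr[smallest] = arr[smallest], arr[i]
        i = smallest


def heapify(arr):
    """Turn arr into a min heap in place."""
    n = len(arr)
    # Leaves are already valid heaps, so start from the last parent node.
    for i in reversed(range(n // 2)):
        sift_down(arr, i, n)


data = [9, 4, 7, 1, 8, 2]
heapify(data)
print(data[0])  # 1
```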

Using Python's heapq module:

import heapq

# Create a min heap from a list
heap = [3, 1, 4, 1, 5, 9, 2, 6]
heapq.heapify(heap)  # O(n) - in-place transformation

# Insert an element
heapq.heappush(heap, 0)  # O(log n)

# Pop the minimum element
min_val = heapq.heappop(heap)  # O(log n)

# Push then pop (more efficient than separate calls)
val = heapq.heappushpop(heap, 7)  # O(log n)

# Pop then push
val = heapq.heapreplace(heap, 8)  # O(log n)
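The difference between heappushpop and heapreplace is easiest to see when the new item is smaller than everything already in the heap. A small sketch with arbitrary values:

```python
import heapq

heap = [1, 3, 5]
heapq.heapify(heap)

# heappushpop pushes first, so the new item itself can come straight back out.
a = heapq.heappushpop(heap, 0)
after_pushpop = sorted(heap)

# heapreplace pops first, so it always returns an item that was already in the heap.
b = heapq.heapreplace(heap, 0)
after_replace = sorted(heap)

print(a, after_pushpop)  # 0 [1, 3, 5]
print(b, after_replace)  # 1 [0, 3, 5]
```

The heap keeps the same size in both cases. Which one you want depends on whether the incoming item is allowed to be the one returned.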

For max heaps in Python versions before 3.14, since heapq only supports min heaps natively, you can negate values:

# Max heap simulation using min heap
max_heap = [-x for x in [3, 1, 4, 1, 5]]
heapq.heapify(max_heap)

# Get max (negate back)
max_val = -heapq.heappop(max_heap)

Or use heapify_max() and heappop_max(), which are public in Python 3.14+:

from heapq import heapify_max, heappop_max

max_heap = [3, 1, 4, 1, 5]
heapify_max(max_heap)
max_val = heappop_max(max_heap)

We've looked at binary heaps, and some of their operations. The next post is going to be focused on patterns that occur when dealing with binary heap types of problems in interviews.