Infra and devtool themes I'm excited about in 2024

This post was originally published on Substack in January 2024. Re-published here with minor edits for tense and clarity.

So much happened in 2023, especially in cloud, data, and ML infrastructure, in developer tools, and of course in the LLM craze. Below are some themes that emerged before 2023 or became more mainstream during it, and that I am excited about in 2024 and beyond. I should also mention these are merely interesting trends, not predictions. It’s an archive to look back on a few years down the line to see how they panned out.

data (and ml) stacks continue to be more composable

This trend predates 2023. Ever since the early 2000s, we’ve had many options at every layer of the data stack: new hardware (GPUs/specialized chips), different compute engines (dask/spark), databases, table formats, specialized engines (druid/clickhouse), SQL dialects, and high-level dataframe APIs (pandas/polars).

These options are great for builders but often come with a pretty expensive integration tax. Having a standardized interface, much like JVM bytecode or LLVM IR, streamlines data exchange and ensures interoperability. Apache Arrow (originally just an in-memory columnar data specification, but now including low-level tools like Flight SQL, ADBC, and DataFusion) was one of the projects that pioneered this trend.

The other exciting component is Substrait, a cross-language specification for representing compute operations (query plans) so that different SQL parsers and execution engines can exchange them. This is particularly useful when users employ different frameworks (pandas/polars) or languages depending on data scale, or when compiling different SQL dialects and query languages like Malloy. These components are implementation details and should be abstracted away from users. I’m excited to see more high-level Arrow-native and Substrait-native data systems like Gluten.
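To make the idea of a shared plan representation concrete, here is a toy sketch in Python. It is not Substrait's actual format or API. The dict-based plan, `DataFrameLike`, `from_mini_sql`, and `execute` are all hypothetical names invented for illustration. The point is only that two different frontends can emit the same plan, and one engine can run it without knowing which frontend produced it.

```python
# Toy illustration only: a dict-based "plan" stands in for a real
# cross-engine plan format. None of these names come from Substrait.

def scan(table):
    return {"op": "scan", "table": table}

def filter_gt(child, column, value):
    return {"op": "filter_gt", "input": child, "column": column, "value": value}

def project(child, columns):
    return {"op": "project", "input": child, "columns": list(columns)}

class DataFrameLike:
    """Hypothetical dataframe-style frontend that builds a plan lazily."""
    def __init__(self, plan):
        self.plan = plan

    def where_gt(self, column, value):
        return DataFrameLike(filter_gt(self.plan, column, value))

    def select(self, *columns):
        return DataFrameLike(project(self.plan, columns))

def from_mini_sql(query):
    """Hypothetical frontend for 'SELECT a, b FROM t WHERE c > 1' only."""
    tokens = query.replace(",", " ").split()
    upper = [t.upper() for t in tokens]
    i_from, i_where = upper.index("FROM"), upper.index("WHERE")
    columns = tokens[1:i_from]
    table = tokens[i_from + 1]
    column, op, value = tokens[i_where + 1 : i_where + 4]
    if op != ">":
        raise ValueError("this toy parser only understands '>'")
    return project(filter_gt(scan(table), column, int(value)), columns)

def execute(plan, tables):
    """A single toy engine that runs any frontend's plan over lists of dicts."""
    if plan["op"] == "scan":
        return tables[plan["table"]]
    rows = execute(plan["input"], tables)
    if plan["op"] == "filter_gt":
        return [r for r in rows if r[plan["column"]] > plan["value"]]
    if plan["op"] == "project":
        return [{c: r[c] for c in plan["columns"]} for r in rows]
    raise ValueError(f"unknown op: {plan['op']}")

tables = {"trips": [{"id": 1, "km": 3}, {"id": 2, "km": 12}]}
df_plan = DataFrameLike(scan("trips")).where_gt("km", 5).select("id").plan
sql_plan = from_mini_sql("SELECT id FROM trips WHERE km > 5")
```

Both frontends produce an identical plan here, so adding a new frontend or a new engine means writing one translation to or from the plan rather than one integration per pair.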

On the ML infra side, I am excited about projects like Ray, which contains multiple composable tools, Carton (which allows you to write application code in a different language from your Python ML inference), or run.house. Even in LLMs, composability is key, with concepts such as LLM routing and model chaining.
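As a rough sketch of what LLM routing and model chaining mean in practice, the Python below uses plain functions as stand-ins for model calls. `small_model`, `large_model`, `route`, `chain`, and the word-count threshold are all assumptions made up for this example; real routers usually decide on cost, latency, or a learned classifier instead.

```python
from typing import Callable, Dict

Model = Callable[[str], str]

# Stub "models": plain functions standing in for real LLM calls.
def small_model(prompt: str) -> str:
    return f"[small] {prompt}"

def large_model(prompt: str) -> str:
    return f"[large] {prompt}"

def route(prompt: str, models: Dict[str, Model], max_small_words: int = 8) -> str:
    """Routing: send short prompts to the cheap model, the rest to the large one."""
    name = "small" if len(prompt.split()) <= max_small_words else "large"
    return models[name](prompt)

def chain(*steps: Model) -> Model:
    """Chaining: feed each step's output into the next step."""
    def run(prompt: str) -> str:
        for step in steps:
            prompt = step(prompt)
        return prompt
    return run

models = {"small": small_model, "large": large_model}
draft_then_refine = chain(small_model, large_model)
```

Because routers and chains are themselves just functions from prompt to text, they compose: a chain can be registered as one of the router's models, and a router can be a step in a chain.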

s3 continues to re-define data infra

S3 (and compatible object storage like Cloudflare’s R2) has become the default choice for source-of-truth persistent storage, moving away from disk-based volumes that most infrastructure products used previously. Initial use cases included data warehouses (Databend), analytics databases (ChaosSearch), search engines (Quickwit), and columnar log storage (Husky).

Recent use cases include serverless Postgres providers (Neon), streaming platforms (WarpStream), vector databases (LanceDB), and even file systems or key-value stores. You get a lot for “free” by leveraging an S3 backend. Most hard distributed systems challenges (durability, availability, and consistency) are better delegated to battle-tested systems.
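Here is a minimal sketch of what delegating durability to object storage can look like, using a key-value store as the example. It does not describe how any product named above works. `InMemoryObjectStore` is a stand-in for an S3-like bucket and `SegmentKV` is a hypothetical design; a real system would add caching, compaction, and concurrency control.

```python
import json

class InMemoryObjectStore:
    """Stand-in for an S3-like bucket: put/get whole objects by key."""
    def __init__(self):
        self._objects = {}

    def put(self, key: str, body: bytes) -> None:
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str):
        return sorted(k for k in self._objects if k.startswith(prefix))

class SegmentKV:
    """Hypothetical key-value store that keeps all durable state in the bucket."""
    def __init__(self, store, prefix="segments/"):
        self.store, self.prefix = store, prefix
        self.buffer = {}                           # unflushed writes, lost on crash
        self.next_id = len(store.list(prefix))     # resume numbering after restart

    def set(self, key, value):
        self.buffer[key] = value

    def flush(self):
        """Write buffered updates as one immutable segment object."""
        if not self.buffer:
            return
        name = f"{self.prefix}{self.next_id:08d}.json"
        self.store.put(name, json.dumps(self.buffer).encode())
        self.next_id += 1
        self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        # Newest segment wins, so scan segments from latest to earliest.
        for name in reversed(self.store.list(self.prefix)):
            segment = json.loads(self.store.get(name))
            if key in segment:
                return segment[key]
        return None

bucket = InMemoryObjectStore()
kv = SegmentKV(bucket)
kv.set("a", 1)
kv.flush()
kv.set("a", 2)
kv.set("b", 3)
kv.flush()
restarted = SegmentKV(bucket)  # a fresh process sees everything that was flushed
```

The service itself holds no durable state: a restarted instance recovers purely from what is in the bucket, which is the part you get for "free".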

One trade-off is higher latency. AWS introduced a new storage class, S3 Express One Zone, whose access speed is up to 10x faster but costs more. While it’s slower than Redis, it is faster than standard S3 while providing IAM and security policies out of the box.

postgres continues to be the universal database platform

Postgres has become the default database of choice, growing significantly faster than alternatives. Developers are using it for data warehousing (Hydra), vector search (pgvector), machine learning (PostgresML), and search and analytics (ParadeDB).

Not having to manage separate infrastructure for each use case is a massive productivity unlock for data teams. Bundlers like Omnigres take this further by including caching, auth, and deployment logic. Putting logic in the database was an anti-pattern back in the day — maybe we are coming full circle?

local-first finally becoming mainstream

Local-first software has been generating buzz alongside the rise of multiplayer applications like Google Docs. The benefits (security, privacy, offline capabilities) are clear, though the tooling remains nascent compared to client-server.

As developers, we still grapple with CRDTs and syncing complexities, but interesting projects are bridging the gap:

SQLite-based: cr-sqlite, SQLSync
Postgres-centric: ElectricSQL
Full-stack: Triplit
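For a feel of what a CRDT is, here is a minimal grow-only counter (G-Counter), one of the simplest CRDTs, sketched in Python. It is not taken from any of the projects listed above, and the replica names are made up. Each replica only increments its own slot, and merging takes the per-replica maximum, so replicas converge no matter the order or number of syncs.

```python
class GCounter:
    """Grow-only counter CRDT: one count per replica, merged with max()."""
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {replica_id: 0}

    def increment(self, amount: int = 1) -> None:
        # A replica only ever advances its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

laptop, phone = GCounter("laptop"), GCounter("phone")
laptop.increment(2)   # edits made offline on the laptop
phone.increment(3)    # edits made offline on the phone
laptop.merge(phone)
phone.merge(laptop)
phone.merge(laptop)   # merging again changes nothing (idempotent)
```

Real applications need richer types (registers, sets, sequences for text), which is where most of the syncing complexity lives.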

ml in database internals

There are many databases for ML, but fewer production use cases for applying classical ML within core database operations. OtterTune is the gold standard for configuration tuning, but I’m excited about applications in query optimization (beyond simple costs), learned indices, join order planning, compression techniques, and workload prediction.
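To illustrate the learned index idea, here is a toy sketch in Python: a single linear model predicts where a key sits in a sorted array, and a bounded binary search corrects the guess. `LinearLearnedIndex` is a hypothetical name, and a single least-squares fit is a simplification I chose for brevity; production designs typically use more elaborate models. The sketch assumes a non-empty list of numeric keys.

```python
import bisect

class LinearLearnedIndex:
    """Toy learned index: a linear model predicts a key's position in a sorted
    list, and a bounded local search corrects the prediction."""
    def __init__(self, keys):
        self.keys = sorted(keys)          # assumes at least one numeric key
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        # Least-squares fit of position as a linear function of key.
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case prediction error over the stored keys bounds the search.
        self.max_error = max(
            abs(p - self._predict(k)) for p, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted list, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(guess - self.max_error, 0), n)
        hi = max(min(guess + self.max_error + 1, n), lo)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None

keys = [3, 8, 15, 16, 23, 42, 108]
index = LinearLearnedIndex(keys)
```

The better the model fits the key distribution, the smaller `max_error` becomes and the fewer comparisons each lookup needs, which is the trade a learned index makes against a general-purpose B-tree.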

workflow engines

Everything is a workflow. Durable execution in particular has been a hot space. While Temporal remains the de facto standard, the ecosystem is evolving rapidly. Managing state is hard, and I am excited to see how this space converges.
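One common way to get durable execution is to journal the result of every step and replay the journal after a failure. The Python sketch below shows that idea under heavy simplification. It is not Temporal's API or anyone else's; `WorkflowContext`, `order_workflow`, and the step functions are hypothetical, and a plain list stands in for durable storage.

```python
class Crash(Exception):
    """Simulated process failure."""

class WorkflowContext:
    def __init__(self, journal):
        self.journal = journal   # survives restarts; a list stands in for durable storage
        self.position = 0

    def step(self, name, fn):
        if self.position < len(self.journal):
            # Step already completed in an earlier attempt: reuse its result.
            recorded_name, result = self.journal[self.position]
            if recorded_name != name:
                raise RuntimeError("workflow code diverged from its journal")
        else:
            result = fn()
            self.journal.append((name, result))
        self.position += 1
        return result

calls = []  # records which side effects actually ran

def charge_card():
    calls.append("charge")
    return "receipt-1"

def send_email(fail):
    if fail:
        raise Crash("worker died before sending the email")
    calls.append("email")
    return "sent"

def order_workflow(ctx, fail_email=False):
    receipt = ctx.step("charge", charge_card)
    status = ctx.step("email", lambda: send_email(fail_email))
    return receipt, status

journal = []
try:
    order_workflow(WorkflowContext(journal), fail_email=True)
except Crash:
    pass
result = order_workflow(WorkflowContext(journal))  # retry replays from the journal
```

On the retry, the card is not charged a second time because that step's result comes from the journal. The hard parts this sketch skips (a crash between running a step and recording it, non-deterministic workflow code, versioning) are exactly where managing state gets difficult.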

minimizing the feedback loop

The current dev workflow — working locally, pushing to a branch, waiting for CI, code review — is often painful. I’m excited about ideas that reduce this loop:

Local emulation: LocalStack, Wing
Remote collaboration: Tunnel, Zed collaboration features
Local CI: Dagger
Ephemeral dev environments: Reducing the reliance on a messy localhost setup.

webassembly (wasm)

Wasm has been all the rage for a while, but its most interesting use case is extending projects by running code written in high-level languages within restricted environments (like UDFs in databases). Projects like Extism, TiDB UDFs, Convex, and SpacetimeDB are proving that Wasm is a powerful layer for interoperability.

I had a ton of fun writing this. A lot of these areas have interesting, unsolved technical challenges. Let me know what areas you are excited about or working on!
