AI-native software infrastructure: what I was excited about in 2023

This post was originally published on Medium in January 2023. Re-published here with minor edits for tense and clarity.

Over the past decades, we’ve seen different platform shifts, from the web to the cloud to mobile, create immense value. Everyone has been speculating about what the post-mobile platform shift will be (from DeFi to 5G). AI (loosely referring to deep learning in this context) is poised to be the next platform, one from which billions of dollars of value will be derived.

Over the past two decades, AI has seen remarkable progress, but most models have been task-specific. That changed after Google researchers introduced the Transformer architecture in the 2017 paper Attention Is All You Need. The architecture was much easier to parallelize (read: better performance), quicker to train, and able to generalize across discrete tasks, and the term Foundation Models (FMs) became more mainstream in the years that followed. From the State of AI report: “The Transformer architecture has expanded far beyond NLP and is emerging as a general purpose architecture for ML”. FMs have two crucial attributes: emergence (the ability to exhibit new behaviors implicitly, without being explicitly trained for them) and generalizability (the ability to serve as a base for multiple use cases).

As with every technology shift, the applications are always the exciting part, but as an infrastructure nerd, I tend to lean toward the enablers of the application layer. As they say, in every gold rush, you ideally want to be the one selling shovels. Here I write about a few areas I was particularly excited about, both as a developer and as a (budding) VC with a particular interest in investing in technology startups.

infrastructure

Over the past couple of years, we saw an increasing focus on building the infrastructure to run, deploy and manage models at scale, leading to the emergence of the practice of MLOps (which encapsulates data validation, model testing, evaluation, deployment, versioning, etc). For enterprises to reliably productionize FMs, however, there’s a need for an extended version of MLOps. Retraining FMs, and in particular LLMs (FMs trained on very large corpora of text), is prohibitively expensive given how huge they are (GPT-3 has 175 billion parameters). As a result, there has been a huge focus on data-driven approaches that improve what goes into the model rather than retraining it, which in turn calls for better ways of managing FMs at scale.

integrating FMs with different entities

This is more of a “middle layer”, actually. Combining FMs with computation or external memory/knowledge greatly increases their capabilities. The most widely used implementation of this idea for LLMs was LangChain. LangChain offered interesting capabilities such as agents that execute different “actions” (using a context manager), memory to enable persistence across different agent calls, interoperability between different LLMs, and the ability to test, template, experiment with and emulate various prompts at scale.

GPT-Index (since renamed LlamaIndex) was also a very exciting project that offered a simple and extensible interface between external data and your LLMs. It helped work around prompt-size limitations by letting you query external data at prompt time instead of updating the model’s weights.
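To make the idea concrete, here is a minimal sketch of pulling external data into a prompt instead of touching the model's weights. This is not GPT-Index's API: the `chunk_score` and `build_prompt` helpers, the word-overlap scoring, and the character budget are all assumptions made up for illustration, and a real system would more likely rank chunks with embeddings.

```python
import re

def tokenize(text: str) -> set:
    """Lowercase word set; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def chunk_score(question: str, chunk: str) -> int:
    """Hypothetical relevance score: number of words shared with the question."""
    return len(tokenize(question) & tokenize(chunk))

def build_prompt(question: str, chunks: list, max_chars: int = 200) -> str:
    """Pack the most relevant chunks into the prompt until the budget is spent."""
    ranked = sorted(chunks, key=lambda c: chunk_score(question, c), reverse=True)
    context, used = [], 0
    for chunk in ranked:
        if used + len(chunk) > max_chars:
            continue  # skip chunks that would overflow the prompt budget
        context.append(chunk)
        used += len(chunk)
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"

docs = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Refund requests are processed within 5 business days.",
]
# Only the two refund-related chunks fit inside the 130-character budget.
prompt = build_prompt("How long do refund requests take?", docs, max_chars=130)
```

The model only ever sees the chunks that made it into `prompt`, so the knowledge base can grow without retraining and without exceeding the prompt limit.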

I was excited about more infra innovation around support for multiple modalities, which would allow for greater interoperability and even more exciting applications. This was an area I was exploring with medical data while at Hopkins.

better tooling for prompt engineering

Prompt engineering proved to be a very effective way of improving the accuracy of LLMs’ outputs. Various techniques emerged, such as zero-shot (a prompt with no examples) and few-shot (a prompt with one or more examples). Getting the right prompt involves a lot of iteration, and being able to do that at the enterprise level requires solid infrastructure. For instance, changing the order of the few-shot examples can influence LLM performance, so the various inputs need to be versioned and tracked.
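As a rough illustration of zero-shot versus few-shot prompts, and of why ordering and versioning matter, here is a small sketch. The `few_shot_prompt` and `prompt_version` helpers, the "Review/Sentiment" template, and the sample reviews are all hypothetical; no LLM is called.

```python
import hashlib

def few_shot_prompt(instruction: str, examples: list, query: str) -> str:
    """Render an instruction, zero or more worked examples, and the new query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

def prompt_version(prompt: str) -> str:
    """Short content hash, usable as a version id for a prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

instruction = "Classify the sentiment of each review as positive or negative."
shots = [
    ("Loved it, would buy again.", "positive"),
    ("Broke after two days.", "negative"),
]
query = "Arrived late but works well."

zero_shot = few_shot_prompt(instruction, [], query)          # no examples
few_shot = few_shot_prompt(instruction, shots, query)        # two examples
reordered = few_shot_prompt(instruction, shots[::-1], query) # same examples, flipped

# The same examples in a different order give a different prompt and version id.
versions = {
    "zero_shot": prompt_version(zero_shot),
    "few_shot": prompt_version(few_shot),
    "reordered": prompt_version(reordered),
}
```

Hashing the rendered prompt is just one possible design choice: it means any change, including a reordering, shows up as a new version that can be compared against the old one.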

I was also excited about ideas such as Language Model Programming, which sought to provide more expressiveness and granularity when querying LLMs, increasing not only the accuracy of the outputs but also yielding cost savings.

Additionally, I thought we’d see search engines or databases specifically for storing prompts and the corresponding outputs. At an organization level, this would ensure better reproducibility, especially across workflows.
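A bare-bones version of such a store might look like the sketch below, which uses an in-memory SQLite database from Python's standard library. The `runs` table, the `log_run` and `search_prompts` helpers, and the model names are all invented for illustration; a production system would likely add semantic search on top of plain keyword matching.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE runs (
           id INTEGER PRIMARY KEY,
           model TEXT NOT NULL,
           prompt TEXT NOT NULL,
           output TEXT NOT NULL
       )"""
)

def log_run(model: str, prompt: str, output: str) -> None:
    """Record a prompt together with the output it produced."""
    conn.execute(
        "INSERT INTO runs (model, prompt, output) VALUES (?, ?, ?)",
        (model, prompt, output),
    )

def search_prompts(keyword: str) -> list:
    """Return (model, prompt, output) rows whose prompt contains the keyword."""
    cursor = conn.execute(
        "SELECT model, prompt, output FROM runs WHERE prompt LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()

log_run("model-a", "Summarize the Q3 sales report.", "Sales rose in Q3.")
log_run("model-b", "Translate 'hello' to French.", "bonjour")
log_run("model-a", "Summarize the onboarding guide.", "New hires start with setup.")

matches = search_prompts("Summarize")  # both summarization runs, oldest first
```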

compute infra

Training and deploying FMs is extremely expensive. For reference, the cost of training GPT-3 was estimated at about $4.6 million, even on the lowest-cost cloud GPUs. Additionally, for every copy of a foundation model tweaked to serve a new purpose (such as one model that translates to French and another that translates to Mandarin), you have to host a new version of that model. To achieve massive scale, compute needs to become less expensive.
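For a sense of where a number like that comes from, here is a back-of-the-envelope sketch. Only the 175 billion parameter count comes from this post; the token count, the 6 FLOPs per parameter per token rule of thumb, the per-GPU throughput, and the hourly price are assumptions I picked for illustration, so treat the result as an order-of-magnitude estimate rather than a quote.

```python
PARAMS = 175e9             # GPT-3 parameter count
TOKENS = 300e9             # assumed number of training tokens
FLOPS_PER_PARAM_TOKEN = 6  # rule of thumb covering forward and backward passes
GPU_FLOPS = 28e12          # assumed throughput of a single GPU, in FLOP/s
PRICE_PER_GPU_HOUR = 1.50  # assumed cloud price in USD

total_flops = FLOPS_PER_PARAM_TOKEN * PARAMS * TOKENS
gpu_hours = total_flops / GPU_FLOPS / 3600
gpu_years = gpu_hours / (24 * 365)
cost_usd = gpu_hours * PRICE_PER_GPU_HOUR

print(f"{total_flops:.2e} FLOPs, {gpu_years:.0f} GPU-years, ${cost_usd / 1e6:.1f}M")
```

Under these assumptions the estimate lands in the same ballpark as the $4.6 million figure, and it makes the levers obvious: fewer FLOPs, faster hardware, or cheaper GPU hours.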

There were various approaches to reducing training costs, such as using application-specific integrated circuits (ASICs) like Google’s TPUs. Companies such as MosaicML demonstrated that software-centric approaches such as data parallelism can significantly improve the cost/performance ratio for FMs. Data-centric AI, as Andrew Ng calls it, was going to be even more relevant. I thought we’d see a more active focus on low-hanging optimizations to significantly reduce the cost of deploying FMs, whether that was medium-sized implementations of LLMs such as nanoGPT, new parallelism techniques, or reducing the re-computation of different transformer layers.

Once the FMs have been deployed, another infrastructure layer is needed to enable applications to run inference against the models (making predictions on new data or, put another way, a lot of matrix multiplications). This infrastructure has to deliver low latency and high throughput. I was excited about different ways to accelerate inference, whether through hardware architectures (FPGAs, TPUs, etc), software (graph compilers, etc), or algorithms (pruning, quantization, etc). One company building in this space that I was excited about was Modular, which was solving hard problems at the intersection of compilers and AI.
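To give a flavor of the algorithmic side, here is a minimal sketch of one of the techniques named above, quantization. It shows symmetric, per-tensor int8 quantization in plain Python; the function names and the sample weights are mine, and real inference stacks use more careful schemes (per-channel scales, calibration data, fused kernels).

```python
def quantize_int8(weights: list) -> tuple:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized: list, scale: float) -> list:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.003, 0.45, -0.61]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now fits in one byte instead of four, at the price of a small, bounded rounding error. That trade is the whole appeal for latency and memory-bound inference.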

deployment infra

A huge focus on open-source and community-led development would be critical to ensuring AI became more mainstream. Meta’s FAIR released MultiRay, a platform for running state-of-the-art AI models at scale that allows multiple models to run on the same input and share the majority of processing costs while incurring only a small per-model cost. While that was Meta-specific, I thought we’d see organizations with AI-intensive workloads adopt firm-wide frameworks to deploy their FMs across different business units.
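The cost-sharing idea can be sketched in a few lines. To be clear, this is not Meta's implementation or API: `SharedEncoder`, its toy hand-written features, and the three task heads are all hypothetical stand-ins. The point is only that the expensive step runs once per input while each additional task adds a cheap function on top.

```python
class SharedEncoder:
    """Stand-in for a large, expensive model whose output is reused."""

    def __init__(self) -> None:
        self.calls = 0    # counts how often the expensive step actually runs
        self._cache = {}

    def encode(self, text: str) -> list:
        if text not in self._cache:
            self.calls += 1
            words = text.lower().split()
            # Toy features: word count, positive-word count, question marks.
            self._cache[text] = [
                float(len(words)),
                float(sum(w in {"great", "love"} for w in words)),
                float(text.count("?")),
            ]
        return self._cache[text]

# Cheap task-specific heads that all consume the same shared features.
def is_long(features: list) -> bool:
    return features[0] > 20

def is_positive(features: list) -> bool:
    return features[1] > 0

def is_question(features: list) -> bool:
    return features[2] > 0

encoder = SharedEncoder()
heads = {"long": is_long, "positive": is_positive, "question": is_question}

text = "I love this, is it available in blue?"
results = {name: head(encoder.encode(text)) for name, head in heads.items()}
```

Three tasks ran, but `encoder.calls` is still 1, which is the property that makes a shared platform attractive across business units.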

In a similar vein to MLOps, I thought Infrastructure as Code (IaC) would become even more relevant for productionizing (open source) FMs. The de facto tool at the time was HashiCorp’s Terraform. At Nvidia, the previous summer, one of my projects was to transition my team’s ML workflow orchestration pipeline from a YAML-based approach to a more declarative framework using high-level languages. This introduced benefits such as composability, flexibility, less error-prone config files, and much less redundancy.
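Here is a small sketch of what those benefits can look like when configuration lives in a high-level language. It is not the framework I worked on, nor Terraform: the `ServingJob` dataclass, its fields, and the `open-fm-7b` model name are hypothetical, chosen only to show composition and validation.

```python
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class ServingJob:
    """Hypothetical deployment spec for serving one model."""
    name: str
    model: str
    gpus: int = 1
    replicas: int = 2
    region: str = "us-east"

    def __post_init__(self) -> None:
        # Mistakes surface when the config is built, not at deploy time.
        if self.gpus < 1 or self.replicas < 1:
            raise ValueError("gpus and replicas must be at least 1")

base = ServingJob(name="summarizer", model="open-fm-7b")

# Variants are derived from the base instead of copy-pasted blocks.
jobs = [
    base,
    replace(base, name="summarizer-eu", region="eu-west"),
    replace(base, name="summarizer-batch", gpus=4, replicas=1),
]

# Plain dictionaries that could be serialized for whatever tool applies them.
manifest = [asdict(job) for job in jobs]
```

Each variant states only what differs from the base, and an invalid value such as `gpus=0` raises immediately rather than hiding in a config file.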

security and safety infra

FMs are huge, multi-billion-parameter black boxes, which makes it incredibly hard not only to explain them but also to assess the risks and vulnerabilities of using them. Security guarantees are absolutely necessary for use cases such as healthcare or finance. The attack surface grows with every downstream application, so it is imperative to mitigate the potential vulnerabilities. These can include model artefacts, corrupted training data, the potential to expose data after fine-tuning, and package dependency vulnerabilities such as the then-recent one in PyTorch’s nightly build.

Just as important is how the security framework between infrastructure providers (such as OpenAI or Anthropic) and applications will evolve. I thought we’d see a shared responsibility model, similar to that of cloud providers, where both parties play a role in ensuring security guarantees.

tooling

embeddings infra

According to OpenAI, embeddings are “numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts”. One of the most remarkable developments by OpenAI that never got as much public limelight was their new embeddings model, which outperformed their previous models at most tasks while being 99.8% cheaper and having a context window that’s four times as long. I was excited to see more projects that not only host embeddings from multiple models (performance varies across tasks) but also allow for fine-tuning, ideally with an on-premise option.

vector databases

Unstructured data (which includes images, video, text, and audio) is commonly estimated to account for 80-90% of an organization’s data. Indexing, storing and searching that data by its content, rather than relying on human-generated labels or tags, is exactly the problem vector databases were meant to solve. They have direct use cases, especially in building better semantic search applications and recommendation systems. Vector databases on my radar included Pinecone (a managed service) and the open source Weaviate and Milvus.
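At its core, the operation a vector database performs is nearest-neighbor search over embeddings. The sketch below does it by brute force with cosine similarity; `TinyVectorIndex` is a toy of my own, not the API of any of the products above, and the three-dimensional vectors are made up rather than produced by a real embedding model. Real systems use approximate indexes to stay fast at scale.

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class TinyVectorIndex:
    """Brute-force index: stores vectors and scans all of them per query."""

    def __init__(self) -> None:
        self._items = []

    def add(self, item_id: str, vector: list) -> None:
        self._items.append((item_id, vector))

    def search(self, query: list, k: int = 2) -> list:
        """Return the k (similarity, item_id) pairs most similar to the query."""
        scored = [(cosine(query, vec), item_id) for item_id, vec in self._items]
        return sorted(scored, reverse=True)[:k]

index = TinyVectorIndex()
index.add("cat photo", [0.9, 0.1, 0.0])
index.add("dog photo", [0.8, 0.3, 0.1])
index.add("invoice pdf", [0.0, 0.2, 0.9])

# A query vector near the "animal photo" region of this toy space.
top = index.search([1.0, 0.2, 0.0], k=2)
top_ids = [item_id for _, item_id in top]
```

No labels or tags are involved: items come back because their vectors point in a similar direction to the query, which is what enables semantic search and recommendations.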

AI-native dev tools

I always think of developer tools as a hidden 10x multiplier for not only engineering productivity but also output. AI-native dev tools at the time, such as GitHub Copilot and Replit’s Ghostwriter, only scratched the surface of what was possible. I thought we’d see AI seep even further down the dev tool stack, eventually with the ability to not only generate but also execute code and perform optimizations ad hoc. Some attempts at this included Cursor (an AI-native code editor), Scalene (a Python profiler with AI-suggested optimizations), and even more ambitious ideas such as building a GPT-only backend.

While consumer-centric generative AI applications captured much of the public limelight, I believed most of the value accrual would, just like in enterprise software, be verticalized. AI would become as much table stakes as cloud-native or mobile-native had been over the past decade. Rather than building multiple models for different use cases and datasets, companies would focus on using proprietary data to enhance foundation models and using them to build more intelligent applications. 2023 was going to be a defining year for AI, especially at the infrastructure layer.