December 5, 2024 | Vincent Hoogsteder

The AI advantage for companies that have not built a data warehouse yet

This is a rare moment where companies that have waited to make their move to a data warehouse have an advantage in their AI development. Why? Let’s dig into what technology works best for AI.

Over the past decade, many companies have moved their data into the cloud and built a data warehouse. It serves as a single source of truth: it collects data from all the software tools the company uses and combines it in one central location for Business Intelligence across the different teams.

At the same time, we speak to many companies that haven’t made this move yet. Their data lives scattered across different software tools, or in central databases that run on-premise instead of in the cloud. Counterintuitively, these companies now have a rare advantage in their AI development. Why? Let’s dig into what technology works best for AI.

The past decade made data warehouse infrastructure complex

Companies that built their cloud data warehouse over the past ten years had an ever-increasing number of options to choose from, ranging from data lakes to real-time streaming platforms and graph databases. All of them are technically impressive and very good at specific (high-demand) use cases. But they also have one other thing in common: they don’t work well with Large Language Models.

LLMs run best on a simple and super-structured setup

It is impossible to store all your internal company data inside the LLM itself, so you need to feed it the relevant company data at run time. This feeding mechanism can run at speed when it operates on a structured database with a table containing just a few columns:

  • The ID of the content item
  • Some metadata to provide segmentation
  • The cleaned, text-only piece of content
  • A vector embedding of that piece of content

And that’s it. Any modern LLM stack can work with this data structure and stay fast, because searching through the embedding vectors in this table is quick. What this does mean is that you need to structure, clean, and split your data very well so it’s suited for this setup.
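To make this concrete, here is a minimal sketch of such a table in Postgres with the pgvector extension, using the psycopg2 driver. The connection string, table and column names, and the embedding dimension (1536) are illustrative assumptions, not a prescription.

```python
# Minimal sketch of the four-column content table described above.
# Assumes a Postgres instance with the pgvector extension available and the
# psycopg2 driver installed; all names and the 1536 dimension are illustrative.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS content_items (
    id        BIGSERIAL PRIMARY KEY,        -- the ID of the content item
    metadata  JSONB NOT NULL DEFAULT '{}',  -- segmentation: source, team, created_at, ...
    body      TEXT NOT NULL,                -- the cleaned, text-only piece of content
    embedding VECTOR(1536) NOT NULL         -- vector embedding of that piece of content
);

-- An approximate-nearest-neighbour index keeps similarity search fast as the table grows.
CREATE INDEX IF NOT EXISTS content_items_embedding_idx
    ON content_items USING hnsw (embedding vector_cosine_ops);
"""

with psycopg2.connect("dbname=warehouse user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```

The metadata column is deliberately loose (JSONB) so you can add segmentation fields without schema changes, while the embedding column and its index do the heavy lifting for retrieval.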

Want to do something similar in a data lake? That is pretty hard and a lot of work, since you initially skipped the crucial step of structuring your data and moved that complexity to software later in the pipeline.

It’s harder to go from complex to simple than to start from zero

Many modern pieces of (unstructured) data warehouse technology need a structured layer added before they can work with LLMs. Our take is that setting this up when you don’t have a data warehouse yet will go faster than adding this layer to an existing, more complex setup. When starting from scratch, you can design for AI rather than making it an afterthought.

Welcoming old-school technologies back

One of today’s most popular databases for LLMs is the 27-year-old Postgres with an extension for embeddings (such as pgvector). This works well for several reasons (a short retrieval sketch follows the list):

  • Proven technology that is easy to maintain and operate.
  • Structured data that works well for both Business Intelligence and AI.
  • SQL-based; SQL is still the database language with the gentlest learning curve, which makes it easier to spread across teams.
  • Scales in the clouds of different vendors to a level that works for many organizations, as most don’t have huge datasets.
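To show what that SQL-based retrieval looks like in practice, here is a small sketch of the run-time feeding step against the content_items table from the earlier sketch. It assumes the same Postgres-with-pgvector setup; embed() is a placeholder for whatever embedding model or API you use, and all names are illustrative.

```python
# Sketch of the run-time feeding step: embed the question, pull the closest
# pieces of content with pgvector's cosine-distance operator (<=>), and hand
# the text to the LLM as context. embed() is a placeholder; names are illustrative.
import psycopg2

def embed(text: str) -> list[float]:
    """Placeholder: return a 1536-dimensional embedding from your model of choice."""
    raise NotImplementedError("plug in your embedding model here")

def fetch_context(question: str, team: str, limit: int = 5) -> list[str]:
    """Return the content snippets closest to the question, filtered on a metadata field."""
    # pgvector's text format for a vector literal: "[0.1,0.2,...]"
    vec_literal = "[" + ",".join(str(x) for x in embed(question)) + "]"
    sql = """
        SELECT body
        FROM content_items
        WHERE metadata ->> 'team' = %s        -- the metadata column does the segmentation
        ORDER BY embedding <=> %s::vector     -- cosine distance, closest first
        LIMIT %s
    """
    with psycopg2.connect("dbname=warehouse user=postgres") as conn:
        with conn.cursor() as cur:
            cur.execute(sql, (team, vec_literal, limit))
            return [row[0] for row in cur.fetchall()]
```

The rows this returns are the context you paste into the LLM prompt alongside the user’s question.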

There are other structured databases out there, and for more scale, the likes of Google BigQuery and Snowflake work well. However, these are designed for the big data 1% and come with a hefty price tag. If you need hundreds of nodes to run a query, that is probably going to cost you an arm and a leg.

The beauty of building a data warehouse with simple technology like Postgres is that it enables you to get AI-ready very quickly, in a way that works for both Business Intelligence and AI. With the CPU and RAM capacities of today’s servers, most use cases are likely to fit on a single machine, forgoing the need for a fancier solution.

Developing the ideal setup for both BI and AI in one go

If you are at the stage where a central source of truth is not in place yet, this might be your best moment ever to start building a data warehouse. Doing it right now enables you to invest time in structuring and cleaning data once, and then have it readily available for many BI and AI use cases at the same time, with the same infrastructure and the same data flows. Every added tool and layer in a data stack complicates things: it makes maintenance, development, troubleshooting, and moving between vendors harder.
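As a small illustration of that double duty, the same content_items table that powers the AI retrieval sketched earlier can also answer ordinary BI questions with plain SQL. This assumes the metadata JSON carries a source and an ISO-formatted created_at field; names remain illustrative.

```python
# The same content_items table answering a plain BI question with ordinary SQL:
# how many cleaned content items per source, per month? Assumes the metadata
# JSON carries 'source' and an ISO-formatted 'created_at'; names are illustrative.
import psycopg2

BI_SQL = """
    SELECT metadata ->> 'source' AS source,
           date_trunc('month', (metadata ->> 'created_at')::timestamptz) AS month,
           count(*) AS items
    FROM content_items
    GROUP BY source, month
    ORDER BY month, source
"""

with psycopg2.connect("dbname=warehouse user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(BI_SQL)
        for source, month, items in cur.fetchall():
            print(source, month, items)
```

One table, one pipeline for structuring and cleaning, two kinds of consumers.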

Companies starting on their data warehouse today can design for AI from the get-go. With the lessons learned from maintaining complex data infrastructures and the known requirements of today’s newest AI technologies, they are in for a rare head start.

We’ll write a follow-up post on the power of a simple and open-source data stack for your central source of truth for decision-making and AI.

Come chat with us

Get in touch to find out what your data can do for you. Spoiler alert: it's a lot.

Contact Us