Contributed by Kshitij Aranke, Data Engineer

Introduction: LLM products as data systems

Spend some time around AI discussions and you will notice that people are always talking about bigger, faster, or smarter models. The conversation is dominated by architecture. That is understandable: the model is the part of the system we can see and interact with. But a closer look at how large language model (LLM) products are actually built reveals a different reality: the model is only one component of a larger machine. Behind it sits a big, sometimes messy, constantly changing data system, and that is where most of the real work happens.

An LLM's behaviour is shaped by the data it has seen. The model does not generate knowledge from nowhere; it learns patterns from enormous amounts of text. This means that the structure and architecture of the data matter deeply in LLM products. It is also why experienced engineers treat LLM products less like pure AI systems and more like data products. The challenges go beyond improving model performance to collecting the right data and maintaining it over the long term. The moment you look at LLMs through this lens, everything clicks into place: the data is the asset, and the model is the interface.

Data acquisition and corpus construction: the hidden foundation

Before engineers can train a model, there is a foundational step that rarely gets noticed: building the dataset. This is an enormous undertaking that involves scraping large portions of the internet, including user-generated content. Collecting datasets at this scale is not easy. Raw web data is wildly inconsistent, and a model trained on it without proper curation will produce faulty results. That is why teams spend so much time deciding what to include and what to remove. They strip out spam and duplicated content that could make the model over-represent a particular viewpoint. The goal of filtering is quality: shaping a sprawling dataset into something genuinely useful.
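To make the filtering step concrete, here is a minimal sketch of a corpus-cleaning pass. The length threshold, spam markers, and exact-hash deduplication are all illustrative assumptions; production pipelines use far more sophisticated heuristics, classifiers, and near-duplicate detection.

```python
import hashlib

def clean_corpus(documents, min_words=20, spam_markers=("click here", "buy now")):
    """Filter a raw web corpus: drop near-empty pages, obvious spam,
    and exact duplicates. All thresholds here are illustrative."""
    seen = set()
    kept = []
    for doc in documents:
        text = doc.strip()
        if len(text.split()) < min_words:
            continue  # too short to carry useful training signal
        lowered = text.lower()
        if any(marker in lowered for marker in spam_markers):
            continue  # crude keyword-based spam heuristic
        digest = hashlib.sha256(lowered.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        kept.append(text)
    return kept
```

Even a crude pass like this illustrates the trade-off at the heart of curation: every rule that removes noise also risks removing legitimate content, which is why teams iterate on these filters so carefully.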

Ethical questions complicate this process further: not all collected data can be used without the owner's consent, and that challenge cannot be ignored. Dataset construction is therefore not just a technical problem but also a policy problem. Broadly, this phase sets the limits of what the model can become. A model whose training data lacks coverage of a domain will produce low-quality output in that domain, because the foundation was never laid.

Data processing pipelines: transforming text into training signals

Collected data is not ready for training until it has been transformed into a form the model can learn from. This is what makes data pipelines more important than they first appear. The data passes through several steps:

(a) Tokenisation: the first step, where text is broken down into smaller units the model can process. It may look like a minor technical detail, but how the text is split determines how the model learns and handles the information it contains.

(b) Versioning: datasets change over time, whether through the addition of new data or the removal of old data. Keeping a record of these changes is essential, especially for reproducibility. When a model behaves differently after retraining, engineers need to know whether the data or the model is the cause.

(c) Cleansing and normalisation: malformed text and irrelevant content must be managed, because left unattended they introduce noise that degrades training.

All of this is challenging because LLM datasets are enormous. To process them, engineers rely on distributed systems that can handle huge volumes of data reliably, since small mistakes compound into bigger problems over the course of training. Pipelines therefore have to be repeatable and scalable: they are the backbone of the whole system, and without them the model would not perform at all.
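The three steps above can be sketched as toy functions. A real pipeline would use a trained subword tokeniser (such as BPE) and a dedicated data-versioning tool; everything here is a simplified illustration of the ideas.

```python
import hashlib
import unicodedata

def normalise(text):
    """Step (c): Unicode normalisation plus whitespace cleanup."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

def tokenise(text):
    """Step (a): a toy whitespace tokeniser standing in for a real
    subword tokeniser. How text is split shapes what the model learns."""
    return normalise(text).lower().split()

def dataset_version(documents):
    """Step (b): an order-independent content hash, so each training run
    can record exactly which dataset snapshot it saw."""
    h = hashlib.sha256()
    for doc in sorted(documents):
        h.update(doc.encode("utf-8") + b"\x00")  # separator between docs
    return h.hexdigest()[:12]
```

The versioning function makes the reproducibility point tangible: if two training runs report different fingerprints, the data changed between them, and that is the first place to look when the model's behaviour changes.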

Training, alignment, and evaluation: where data shapes the model

Training an LLM is one phase in a longer loop, not the main event. During training, the model learns patterns from the available data so that it can better predict and generate text, which is why its output is so tightly coupled to the data it receives. The process continues after pretraining: additional datasets, such as user feedback paired with helpful outputs, are used to fine-tune the model and improve its usefulness. This makes the idea of data as a control mechanism concrete. If you need the model to be more polite or to handle unpredictable user behaviour gracefully, you adjust the training data. In effect, we do not rewrite the model; we reshape it with carefully curated examples.
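A minimal sketch of that curation step might look like the following, turning rated user interactions into a supervised fine-tuning set. The field names and the rating scale are assumptions for illustration; real alignment pipelines involve careful review, preference modelling, and safety filtering.

```python
def build_finetune_set(interactions, min_rating=4):
    """Select highly rated (prompt, response) pairs from logged user
    feedback to form a fine-tuning dataset. Assumes each interaction is
    a dict with "prompt", "response", and a 1-5 "rating" (hypothetical
    schema for illustration)."""
    examples = []
    for it in interactions:
        if it.get("rating", 0) >= min_rating:
            examples.append({"prompt": it["prompt"],
                             "completion": it["response"]})
    return examples
```

The point of the sketch is the control mechanism itself: changing which examples pass the filter changes what the model is reshaped into, without touching the model code at all.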

Evaluation, too, depends on the data the model sees. Evaluation datasets capture only part of the larger picture, and when users interact with the model in unpredictable ways, real-world behaviour can differ from benchmark results. Bias and inequality are persistent challenges as well: a model can reproduce skewed portrayals of ideas present in its training data. To address this, engineers have to go back to the data and make the required changes. All of this is evidence that an LLM product is not an isolated system but one connected to the data that shapes and trains it.
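The gap between benchmark scores and real-world behaviour is easier to see with even a toy evaluation harness. This sketch uses exact-match accuracy, a deliberately crude stand-in for real, task-specific metrics; the point is that the score is only as meaningful as the evaluation set behind it.

```python
def evaluate(model_fn, eval_set):
    """Score a model with exact-match accuracy against a labelled set.
    `model_fn` maps a prompt string to an answer string. Exact match is
    illustrative only; real evaluations use task-specific metrics."""
    if not eval_set:
        return 0.0
    correct = sum(1 for prompt, expected in eval_set
                  if model_fn(prompt).strip() == expected)
    return correct / len(eval_set)
```

A model can score perfectly on one eval set and fail on prompts the set never covered, which is exactly why evaluation data, like training data, has to be curated and revisited.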

Deployment and feedback loops: LLMs as continuously evolving data products

An LLM does not stay the same once it has been deployed, mainly because every interaction generates new data in the form of feedback and questions. This data becomes a valuable resource for gradually improving the product. Over time, the LLM product comes to resemble a classic data product: user interactions enter the system, are analysed, and are sometimes fed back in to shape future updates. Patterns emerge showing where the model struggles and where it performs well.
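Spotting those patterns can be as simple as aggregating logged feedback. Here is a toy version; the "topic" and "rating" fields and the 1-5 scale are hypothetical, standing in for whatever metadata a real logging system attaches to each interaction.

```python
from collections import defaultdict

def failure_patterns(interactions, threshold=3.0):
    """Group logged interactions by topic and flag topics whose average
    rating falls below a threshold — a toy version of finding where the
    model struggles. Schema is assumed for illustration."""
    by_topic = defaultdict(list)
    for it in interactions:
        by_topic[it["topic"]].append(it["rating"])
    return {topic: sum(r) / len(r)
            for topic, r in by_topic.items()
            if sum(r) / len(r) < threshold}
```

The flagged topics then feed directly back into the earlier stages: more curated data for weak areas, then another round of fine-tuning and evaluation.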

Changes in user behaviour introduce a further challenge: data drift. What worked without error in previous cases may work less well as new kinds of cases emerge. Engineers need monitoring systems to capture these changes, and they also need to know when to hold back, because updating a model demands careful planning so that the update does not introduce new issues. The system becomes a pipeline that evolves with every piece of data it collects and processes. Feedback becomes very powerful, and the line between process and product starts to blur.
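One simple way to monitor drift is to compare the distribution of terms in recent traffic against a reference window. The symmetric KL-style score below is a heuristic sketch, not a production monitor; real systems track many signals (embeddings, lengths, languages, error rates) and use properly smoothed statistics.

```python
import math
from collections import Counter

def term_distribution(texts):
    """Relative frequency of each whitespace-separated term in a window
    of logged prompts (a toy feature for drift detection)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def drift_score(reference, recent, floor=1e-9):
    """Symmetric KL-style divergence between two term distributions.
    Higher scores suggest recent traffic differs from the reference
    window. The floor crudely handles unseen terms (illustrative)."""
    vocab = set(reference) | set(recent)
    score = 0.0
    for w in vocab:
        p = reference.get(w, floor)
        q = recent.get(w, floor)
        score += 0.5 * (p * math.log(p / q) + q * math.log(q / p))
    return score
```

A dashboard that alerts when this score crosses a tuned threshold gives engineers the signal they need to decide whether an update is warranted, rather than retraining reflexively.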

Conclusion

From the outside, LLM products look like breakthroughs in modelling. From the inside, they look like sophisticated data systems held together by a great deal of engineering. The model still matters, of course: it is the mechanism that turns data into usable output. But the model's behaviour is constantly shaped by the data that surrounds it. The success of an LLM product is not only about building strong models that can exploit quality data; it is also about building pipelines that can absorb new information over time. This idea changes how we understand AI systems. Every LLM product is a data product, and the sooner we accept that, the better the future of LLMs will be.