What Did I Learn from Building LLM Applications in 2024? — Part 2

January 17, 2025

What Did I Learn from Building LLM Applications in 2024? — Part 2

An engineer’s journey to building LLM-powered applications

Illustration of building AI application (image by author — generated using DALLE-3)

In part 1 of this series, we discussed use case selection, building a team and the importance of creating a prototype early into your LLM-based product development journey. Let’s pick it up from there — if you are fairly satisfied with your prototype and ready to move forward, start with planning a development approach. It’s also crucial to decide on your productionizing strategy from an early phase.

With recent advancements with new models and a handful of SDKs in market, it is easy to feel the urge to build cool features such as agents into your LLM-powered application in the early phase. Let’s take a step back and decide the must-have and nice-to-have features as per your use case. Begin by identifying the core functionalities that are essential for your application to fulfill the primary business objectives. For instance, if your application is designed to provide customer support, the ability to understand and respond to user queries accurately would be a must-have feature. On the other hand, features like personalized recommendations might be considered as a nice-to-have feature for future scope.

Find your ‘fit’

If you want to build your solution from a concept or prototype, a top-down design model can work best. In this approach, you start with a high level conceptual design of the application without going into much details, and then take separate components to develop each further. This design might not yield the best results at very first, but sets you up for an iterative approach, where you can improve and evaluate each component of the app and test the end-to-end solution in subsequent iterations.

For an example of this design approach, we can consider a RAG (Retrieval Augmented Generation) based application. These applications typically have 2 high-level components — a retrieval component (which searches and retrieves relevant documents for user query) and a generative component (which produces a grounded answer from the retrieved documents).

Scenario: Build a helpful assistant bot to diagnose and resolve technical issues by offering relevant solutions from a technical knowledge base containing troubleshooting guidelines.

STEP 1 – build the conceptual prototype: Outline the overall architecture of the bot without going into much details.

Data Collection: Gather a sample dataset from the knowledge base, with questions and answers relevant to your domain.Retrieval Component: Implement a basic retrieval system using a simple keyword-based search, which can evolve into a more advanced vector-based search in future iterations.Generative Component: Integrate an LLM in this component and feed the retrieval results through prompt to generate a grounded and contextual answer.Integration: Combine the retrieval and generative components to create a end-to-end flow.Execution: Identify the resources to run each component. For example, retrieval component can be built using Azure AI search, it offers both keyword-based and advanced vector-based retrieval mechanisms. LLMs from Azure AI foundry can be used for generation component. Lastly, create an app to integrate these components.Step 1 (image by author)

STEP 2 – Improve Retrieval Component: Start exploring how each component can be improved more. For a RAG-based solution, the quality of retrieval has to be exceptionally good to ensure that the most relevant and accurate information is retrieved, which in turn enables the generation component to produce contextually appropriate response for the end user.

Set up data ingestion: Set up a data pipeline to ingest the knowledge base into your retrieval component. This step should also consider preprocessing the data to remove noise, extract key information, image processing etc.Use vector database: Upgrade to a vector database to enhance the system for more contextual retrieval. Pre-process the data further by splitting text into chunks and generating embeddings for each chunk using an embedding model. The vector database should have functionalities for adding and deleting data and querying with vectors for easy integration.Evaluation: Selection and rank of documents in retrieval results is crucial, as it impacts the next step of the solution heavily. While precision and recall gives a fairly good idea of the search results’ accuracy, you can also consider using MRR (mean reciprocal rank) or NDCG (normalized discounted cumulative gain) to assess the ranking of the documents in the retrieval results. Contextual relevancy determines whether the document chunks are relevant to producing the ideal answer for a user input.Step 2 (image by author)

STEP 3 — Enhance Generative Component to produce more relevant and better output:

Intent filter: Filter out questions that don’t fall into the scope of your knowledge base. This step can also be used to block unwanted and offensive prompts.Modify prompt and context: Improve your prompts e.g. including few shot examples, clear instructions, response structure, etc. to tune the LLM output as per your need. Also feed conversation history to the LLM in each turn to maintain context for a user chat session. If you want the model to invoke tools or functions, then put clear instructions and annotations in the prompt. Apply version control on prompts in each iteration of experiment phase for change tracking. This also helps to roll back in case your system’s behavior degrades after a release.Capture the reasoning of the model: Some applications use an additional step to capture the rationale behind output generated by LLMs. This is useful for inspection when the model produces unpredictable output.Evaluation: For the answers produced by a RAG-based system, it is important to measure a) the factual consistency of the answer against the context provided by retrieval component and b) how relevant the answer is to the query. During the MVP phase, we usually test with few inputs. However while developing for production, we should carry out evaluation against an extensive ground truth or golden dataset created from the knowledge base in each step of the experiment. It’s better if the ground truth can contain as much as possible real-world examples (frequent questions from the target consumers of the system). If you’re looking to implement evaluation framework, take a look here.Step 3 (image by author)

On the other hand, let’s consider another scenario where you’re integrating AI in a business process. Consider an online retail company’s call center transcripts, for which summarization and sentiment analysis are needed to be generated and added into a weekly report. To develop this, start with understanding the existing system and the gap AI is trying to fill. Next, start designing low-level components keeping system integration in mind. This can be considered as bottom-up design as each component can be developed separately and then be integrated into the final system.

Data collection and pre-processing: Considering confidentiality and the presence of personal data in the transcripts, redact or anonymize the data as needed. Use a speech-to-text model to convert audio into text.Summarization: Experiment and choose between extractive summarization (selecting key sentences) and abstractive summarization (new sentences that convey the same meaning) as per the final report’s need. Start with a simple prompt, and take user feedback to improve the accuracy and relevance of generated summary further.Sentiment analysis: Use domain-specific few shots examples and prompt tuning to increase accuracy in detecting sentiment from transcripts. Instructing LLM to provide reasoning can help to enhance the output quality.Report generation: Use report tool like Power BI to integrate the output from previous components together.Evaluation: Use the same concepts of iterative evaluation process with metrics for LLM-dependent components.

This design also helps to catch issues early on in each component-level which can be addressed without changing the overall design. Also enables AI-driven innovation in existing legacy systems.

LLM application development doesn’t follow a one-size-fits-all approach. Most of the time it is necessary to gain a quick win to validate whether the current approach is bringing value or shows potential to meet expectations. While building a new AI-native system from scratch sounds more promising for the future, on the other hand integrating AI in existing business processes even in a small capacity can bring a lot of efficiency. Choosing either of these depends upon your organization’s resources, readiness to adopt AI and long-term vision. It is imperative to consider the trade-offs and create a realistic strategy to generate long-term value in this area.

Ensuring quality through an automated evaluation process

Improving the success factor of LLM-based application lies with iterative process of evaluating the outcome from the application. This process usually starts from choosing relevant metrics for your use case and gathering real-world examples for a ground truth or golden dataset. As your application will grow from MVP to product, it is recommended to come up with a CI/CE/CD (Continuous Integration/Continuous Evaluation/Continuous Deployment) process to standardize and automate the evaluation process and calculating metrics scores. This automation has also been called LLMOps in recent times, derived from MLOps. Tools like PromptFlow, Vertex AI Studio, Langsmith, etc. provide the platform and SDKs for automating evaluation process.

Evaluating LLMs and LLM-based applications is not the same

Usually an LLM is put through a standard benchmarks evaluation before it is released. However that does not guarantee your LLM-powered application will always perform as expected. Especially a RAG-based system which uses document retrievals and prompt engineering steps to generate output, should be evaluated against a domain-specific, real-world dataset to gauge the performance.

For in-depth exploration on evaluation metrics for various type of use cases, I recommend this article.

How to choose the right LLM?

Image by author — generated using DALLE-3

Several factors drive this decision for a product team.

Model capability: Determine your model need by the type of problem you’re solving in your LLM product. For example, consider these 2 use cases —

#1 A chatbot for an online retail shop handles product enquiries through text and images. A model with multi-modal capabilities and lower latency should be able to handle the workload.

#2 On the other hand, consider a developer productivity solution, which will need a model to generate and debug code snippets, you require a model with advanced reasoning that can produce highly accurate output.

2. Cost and licensing: Prices vary based on several factors such as model complexity, input size, input type, and latency requirements. Popular LLMs like OpenAI’s models charge a fixed cost per 1M or 1K tokens, which can scale significantly with usage. Models with advanced logical reasoning capability usually cost more, such as OpenAI’s o1 model $15.00 / 1M input tokens compared to GPT-4o which costs $2.50 / 1M input tokens. Additionally, if you want to sell your LLM product, make sure to check the commercial licensing terms of the LLM. Some models may have restrictions or require specific licenses for commercial use.

3. Context Window Length: This becomes crucial for use cases where the model needs to process a large amount of data in prompt at once. The data can be document extracts, conversation history, function call results etc.

4. Speed: Use cases like chatbot for online retail shop needs to generate output very fast, hence a model with lower latency is crucial in this scenario. Also, UX improvement e.g. streaming responses renders the output chunk by chunk, thus providing a better experience for the user.

6. Integration with existing system: Ensure that the LLM provider can be seamlessly integrated with your existing systems. This includes compatibility with APIs, SDKs, and other tools you are using.

Choosing a model for production often involves balancing trade-offs. It’s important to experiment with different models early in the development cycle and set not only use-case specific evaluation metrics, also performance and cost as benchmarks for comparison.

Responsible AI

The ethical use of LLMs is crucial to ensure that these technologies benefit society while minimizing potential harm. A product team must prioritize transparency, fairness, and accountability in their LLM application.

For example, consider a LLM-based system being used in healthcare facilities to help doctor diagnose and treat patients more efficiently. The system must not misuse patient’s personal data e.g. medical history, symptoms etc. Also the results from the applications should have transparency and reasoning behind any suggestion it generates. It should not be biased or discriminatory towards any group of people.

While evaluating the LLM-driven component output quality in each iteration, make sure to look out for any potential risk such as harmful content, biases, hate speech etc.t. Red teaming, a concept from cybersecurity, has recently emerged as a best practice to uncover any risk and vulnerabilities. During this exercise, red teamers attempt to ‘trick’ the models generate harmful or unwanted content by means of using various strategies of prompting. This is followed by both automated and manual review of flagged outputs to decide upon a mitigation strategy. As your product evolves, in each stage you can instruct red teamers to test different LLM-driven components of your app and also the entire application as a whole to make sure every aspect is covered.

Make it all ready for Production

At the end, LLM application is a product and we can use common principles for optimizing it further before deploying to production environment.

Logging and monitoring will help you capture token usage, latency, any issue from LLM provider side, application performance etc. You can check for usage trends of your product, which provides insights into the LLM product’s effectiveness, usage spikes and cost management. Additionally, setting up alerts for unusual spikes in usage can prevent budget overruns. By analyzing usage patterns and recurring cost, you can scale your infrastructure and change or update model quota accordingly.Caching can store LLM outputs, reducing the token usage and eventually the cost. Caching also helps with consistency in generative output, and reduces latency for user-facing applications. However, since LLM applications do not have a specific set of inputs, cache storage can increase exponentially in some scenarios, such as chatbots, where each user input might need to be cached, even if the expected answer is same. This can lead to significant storage overhead. To address this, the concept of semantic caching has been introduced. In Semantic caching, similar prompt inputs are grouped together based on their meaning using an embedding model. This approach helps in managing the cache storage more efficiently.Collecting user feedback ensures that the AI-enabled application can serve its purpose in a better way. If possible, try to gather feedback from a set of pilot users in each iteration so that you can gauge if the product is meeting expectations and which areas require further improvements. For example, an LLM-powered chatbot can be updated to support additional languages and as a result attract more diverse users. With new capabilities of LLMs being released very frequently, there’s a lot of potential to improve the capabilities and add new features quickly.

Conclusion

Good luck with your journey in building LLM-powered apps! There are numerous advancements and endless potentials in this field. Organizations are adopting generative AI with a wide array of use cases. Similar to any other product, develop your AI-enabled application keeping the business objectives in mind. For products like chatbots, end user satisfaction is everything. Embrace the challenges, today if a particular scenario doesn’t work out, don’t give up, tomorrow it can work out with a different approach or a new model. Learning and staying up-to-date with AI advancements are the key to building effective AI-powered products.

Follow me if you want to read more such content about new and exciting technology. If you have any feedback, please leave a comment. Thanks 🙂

What Did I Learn from Building LLM Applications in 2024? — Part 2 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.