The landscape of artificial intelligence is rapidly evolving, and Large Language Models (LLMs) are at the forefront of this transformation. Projects that began as experiments are now delivering tangible returns on investment (ROI) for enterprises. This article examines the strategies and lessons learned from companies that have successfully implemented LLMs in production, turning AI hype into real-world value.
According to Raza Habib, CEO and co-founder of Humanloop, companies are now generating real revenue and cost savings from LLMs, a significant shift from treating AI as a “promised land” of future potential. A prime example is Filevine, a legal tech company that doubled its revenue after launching six new AI-powered products in a single year.
This article is designed for business leaders, AI engineers, and product managers who are looking to understand how to effectively implement LLMs in their organizations. We will explore the fundamental building blocks of LLM applications, the essential team composition for success, robust evaluation frameworks, and the tooling and infrastructure required to achieve real ROI.
The Fundamental Building Blocks of LLM Applications
Most LLM applications, despite their apparent complexity, are composed of four key components chained together in various ways. Understanding these components is crucial for building robust and effective AI solutions; a minimal code sketch of how they chain together follows the list below.
The Four Core Components:
- Base Models: The foundation of any LLM application is the base model, which can range from large, general-purpose models to smaller, fine-tuned models. The choice of model depends on the specific requirements of the application, including factors like latency, accuracy, and cost.
- Prompt Templates: These are natural language instructions that guide the model’s behavior. Effective prompt engineering is essential for eliciting the desired responses from the LLM.
- Data Selection Strategies: This involves selecting the right data to feed into the model. Common strategies include Retrieval-Augmented Generation (RAG), which retrieves relevant information from a knowledge base, and API integration, which pulls data from external sources.
- Function Calling: This allows the LLM to interact with external tools and APIs, enabling it to perform actions beyond simple text generation.
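To make the chaining concrete, here is a minimal, self-contained Python sketch. The model call is a canned stub so the example runs offline, and every name in it (`call_model`, `get_account`, the `CALL:` convention) is illustrative rather than any vendor's actual API:

```python
# A toy chain of the four components. Illustrative only: swap call_model
# for your provider's client in a real application.

def call_model(prompt: str) -> str:
    """Component 1 stand-in: a base model behind a single function."""
    if "Tool result:" in prompt:
        return "Your account is active; the record is shown above."
    if "account" in prompt.lower():
        return "CALL:get_account(42)"
    return "Here is a general answer."

# Component 2: a prompt template with named slots.
PROMPT_TEMPLATE = (
    "You are a support assistant.\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "If you need account data, reply with CALL:get_account(<id>)."
)

def retrieve(question: str, knowledge_base: list[str], k: int = 2) -> str:
    """Component 3: a toy word-overlap retriever standing in for RAG."""
    query = set(question.lower().split())
    ranked = sorted(knowledge_base,
                    key=lambda doc: len(query & set(doc.lower().split())),
                    reverse=True)
    return "\n".join(ranked[:k])

def get_account(account_id: str) -> str:
    """Component 4: a tool the model can ask the application to call."""
    return f"<account record {account_id}>"

def answer(question: str, knowledge_base: list[str]) -> str:
    prompt = PROMPT_TEMPLATE.format(
        context=retrieve(question, knowledge_base), question=question)
    response = call_model(prompt)
    if response.startswith("CALL:get_account("):
        account_id = response[len("CALL:get_account("):].rstrip(")")
        # Feed the tool result back to the model for a final answer.
        response = call_model(prompt + f"\n\nTool result: {get_account(account_id)}")
    return response

print(answer("What is the status of my account?",
             ["Accounts can be checked via the portal.", "Refunds take 5 days."]))
```

In a production system the retriever would typically use embeddings and the tool call would go through the provider's structured function-calling API, but the shape of the chain stays the same.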
Case Study: GitHub Copilot’s Architecture
GitHub Copilot, one of the first truly successful LLM applications, provides a clear example of how these components come together in practice (a simplified sketch of its data selection step follows this list).
- Base Model: A fine-tuned model optimized for code generation.
- Prompt Template: Instructions that guide the model to suggest code based on the user’s current context.
- Data Selection Strategy: The model considers the code immediately preceding the cursor and the last 10 or so files the user has touched, grabbing the most similar code from those sources.
- Evaluation: GitHub employs a rigorous evaluation process to measure the quality and usefulness of its suggestions.
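GitHub has not published Copilot's exact retrieval algorithm, so the following is only a crude approximation of the idea of grabbing the most similar code from recently touched files, here using Jaccard similarity over sliding windows of lines:

```python
# Illustrative only: a rough approximation of similarity-based context
# selection over recently edited files, not Copilot's actual implementation.

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def select_context(code_before_cursor: str,
                   recent_files: dict[str, str],
                   window: int = 10,
                   top_k: int = 3) -> list[str]:
    """Score sliding windows of lines from recent files against the cursor context."""
    query_tokens = set(code_before_cursor.split())
    candidates: list[tuple[float, str]] = []
    for _path, text in recent_files.items():
        lines = text.splitlines()
        for i in range(max(1, len(lines) - window + 1)):
            snippet = "\n".join(lines[i:i + window])
            candidates.append((jaccard(query_tokens, set(snippet.split())), snippet))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _score, snippet in candidates[:top_k]]
```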
Essential Team Composition for LLM Success
Building successful LLM applications requires a specific blend of expertise, and it often calls for less machine learning expertise than you might expect.
The Shifting Role of Machine Learning Expertise
- Teams that succeed tend to be staffed more by generalist full-stack product engineers.
- The focus is shifting toward “AI Engineers” who understand prompting and the models themselves, rather than toward hardcore machine learning model training.
Domain Experts: The Hidden Key to Success
Domain experts play a crucial role in AI model development by bridging real-world expertise with machine learning capabilities. They contribute through several key methods:
- Problem Formulation: Domain experts translate intricate field-specific challenges into structured machine learning tasks, ensuring alignment with practical objectives and constraints.
- Data Collection and Preparation: They identify relevant data sources, oversee data collection, and help preprocess data to reflect its context-specific nuances while minimizing biases or inaccuracies.
- Feature Engineering and Model Design: Experts help generate meaningful features based on domain insights and guide the selection of algorithms suited to the domain's needs, from choosing hyperparameters to structuring models effectively.
- Integration of Expert Knowledge: Techniques like knowledge elicitation or attention mechanisms incorporate expert insights directly into models by refining loss functions or imposing constraints informed by validated knowledge. For example, frameworks such as Advice-Based Learning (ABLe) allow iterative collaboration that systematically integrates expert corrections to improve model decision-making.
- Evaluation and Feedback Iteration: Domain specialists evaluate model outputs on an ongoing basis and refine performance through active feedback loops, which is especially important for nuanced applications such as healthcare diagnostics or financial modeling, where misclassifications can have significant repercussions.
Several production teams show what this looks like in practice:
- Duolingo: Linguists handle the majority of prompt engineering, demonstrating the value of language expertise in AI applications.
- Filevine and Ironclad: Legal expertise is directly involved in the process, ensuring that the AI solutions meet the specific needs of the legal industry.
- Fathom: Product managers drive the summarization process, tailoring the output to different user roles and contexts.
Building Robust Evaluation Frameworks
Evaluation is critical at every stage of development, from prototyping to production. It’s about defining what “good” looks like and ensuring that your LLM application meets those standards.
The Three Stages of Evaluation:
- Prototyping: This is a highly iterative phase where you validate new ideas and evolve your evaluation criteria alongside the application itself.
- Production Monitoring: Once in production, you need to monitor the application’s performance and identify any issues that arise.
- Regression Testing: Before making changes to the application, you need to ensure that those changes don’t introduce regressions or unintended side effects (a minimal sketch of this stage follows the list).
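As a concrete illustration of the regression-testing stage, here is a minimal sketch. The `generate` and `grade` hooks are assumed: `generate(version, case)` produces a model output for a given prompt version, and `grade(case, output)` returns a score between 0 and 1:

```python
# A minimal regression check: run a frozen eval set against the current and
# candidate prompt versions and fail if the candidate scores worse.

from statistics import mean
from typing import Callable

def regression_check(eval_set: list[dict],
                     generate: Callable[[str, dict], str],
                     grade: Callable[[dict, str], float],
                     current: str = "v1",
                     candidate: str = "v2",
                     tolerance: float = 0.02) -> bool:
    current_score = mean(grade(case, generate(current, case)) for case in eval_set)
    candidate_score = mean(grade(case, generate(candidate, case)) for case in eval_set)
    print(f"{current}: {current_score:.3f}  {candidate}: {candidate_score:.3f}")
    # Allow a small tolerance so noise in grading does not block every change.
    return candidate_score >= current_score - tolerance
```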
User Feedback Integration
End-user feedback is invaluable for evaluating LLM applications, especially for subjective tasks like summarization and question answering.
- User Actions: What did the user do after receiving a generation?
- Issue Reporting: Did the user flag a specific issue with the output?
- Direct Voting: Thumbs up/thumbs down ratings.
- Corrections and Edits: Logging any corrections or edits that users make to the generated content.
GitHub Copilot uses a sophisticated end-user feedback mechanism that considers whether a suggestion was accepted and how much of the suggested code remained in the user’s code base over time.
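One lightweight way to capture these signals is to log each one as a structured event tied to the generation that produced it. The schema below is illustrative, not any specific product's API:

```python
# Feedback signals recorded as structured, append-only events so they can be
# aggregated later. Field names and the JSONL sink are illustrative choices.

from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class FeedbackEvent:
    generation_id: str  # ties feedback back to the logged generation
    kind: str           # "action" | "issue" | "vote" | "correction"
    payload: str        # e.g. "accepted", "thumbs_down", or the edited text
    timestamp: str = ""

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()

def log_feedback(event: FeedbackEvent, path: str = "feedback.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(asdict(event)) + "\n")

# Example: a user edited the generated summary.
log_feedback(FeedbackEvent("gen-123", "correction", "Edited summary text..."))
```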
Essential Tooling and Infrastructure
Having the right tooling and infrastructure is essential for building, deploying, and maintaining successful LLM applications.
Collaboration Optimization
- Design your systems to optimize team collaboration, especially between technical and non-technical experts.
- Make prompts accessible to domain experts who may not be familiar with traditional coding practices; one lightweight pattern is sketched below.
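One lightweight pattern, assuming prompt templates live as plain text files in the repository, is to load and render them at runtime so domain experts can edit the files directly without touching application code:

```python
# Prompts as version-controlled text files. The directory layout and
# {placeholder} syntax are illustrative assumptions.

from pathlib import Path

def load_prompt(name: str, prompt_dir: str = "prompts") -> str:
    """Load a prompt template such as prompts/summarize.txt."""
    return Path(prompt_dir, f"{name}.txt").read_text()

def render(template: str, **variables: str) -> str:
    """Fill {placeholder} slots; raises KeyError if a slot is missing."""
    return template.format(**variables)

# Usage: a linguist edits prompts/summarize.txt; the code stays unchanged.
# prompt = render(load_prompt("summarize"), document="...", audience="lawyer")
```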
Comprehensive Logging Systems
- Capture inputs and outputs at every stage of the process.
- Make it possible to replay these events and incorporate them into test sets (see the sketch below).
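A minimal sketch of this idea, assuming a JSONL trace file and a `run` callable representing one pipeline stage, might look like this:

```python
# Stage-level logging with replay: every stage's input/output is appended to
# a JSONL log, and logged inputs can be re-run against a new pipeline version
# or promoted into a test set. Names are illustrative.

import json
from typing import Callable

def log_stage(stage: str, inputs: dict, output: str,
              path: str = "traces.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps({"stage": stage, "inputs": inputs,
                            "output": output}) + "\n")

def replay(stage: str, run: Callable[[dict], str],
           path: str = "traces.jsonl") -> list[dict]:
    """Re-run logged inputs for one stage and pair old and new outputs."""
    results = []
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["stage"] == stage:
                results.append({"inputs": record["inputs"],
                                "old": record["output"],
                                "new": run(record["inputs"])})
    return results
```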
Case Study: Ironclad’s Rivet Framework
Rivet, developed by Ironclad, is an open-source visual programming environment for building AI agents with large language models (LLMs). It simplifies the development of complex logic chains and enables collaboration between teams: users can visualize, remotely debug, and deploy sophisticated AI-powered workflows. The tool was created partly to expedite the development of Ironclad’s contract AI solutions, such as Contract AI (CAI).
- Ironclad nearly abandoned agents before implementing proper tooling.
- After building logging and replay infrastructure, they were able to achieve the performance needed for production deployment.
Evaluation Integration
- Use lightweight tools for prototyping.
- Implement systems for production monitoring and regression testing (a minimal monitoring sketch follows).
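For production monitoring, even something as simple as a rolling acceptance-rate check can catch regressions early. The window size and threshold below are illustrative; real systems usually track several metrics:

```python
# A minimal production monitor: track a rolling acceptance rate over recent
# generations and alert when it drops below a threshold.

from collections import deque

class AcceptanceMonitor:
    def __init__(self, window: int = 500, alert_below: float = 0.25):
        self.events: deque[bool] = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, accepted: bool) -> None:
        self.events.append(accepted)
        self.check()

    def check(self) -> None:
        # Only alert once the window is full, to avoid noisy early readings.
        if len(self.events) == self.events.maxlen:
            rate = sum(self.events) / len(self.events)
            if rate < self.alert_below:
                print(f"ALERT: acceptance rate {rate:.2%} below threshold")
```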
Real-World Success Stories and ROI
The real-world success stories of companies like Filevine and Ironclad demonstrate the potential ROI of LLM applications.
Filevine’s AI Transformation
- Launched six successful AI-powered products.
- Doubled revenue as a result.
- Leveraged Humanloop as their system of record for all prompts in production.
Ironclad’s Contract Automation
- Achieved a 50% auto-negotiation rate for top customers.
- Attributed their success to proper tooling and infrastructure.
Conclusion
Building successful LLM applications in production requires a strategic approach that considers team composition, evaluation frameworks, and tooling. By focusing on these key areas, organizations can move beyond experimentation and achieve real ROI from their AI investments.
Key Takeaways:
- Focus on generalist engineers and domain experts, not just machine learning specialists.
- Implement robust evaluation frameworks that incorporate user feedback.
- Invest in tooling that optimizes collaboration and enables comprehensive logging.
The future of enterprise AI is here, and by following these lessons, your organization can harness the power of LLMs to drive real business value.