ML products and services only pay off once they add actual value for real users, which is why the ML community has recently been focusing on applying DevOps principles to ML systems: MLOps. In this dev blog post, I’ll try to build on the best practices of Architecture & CI/CD at ITP, and extend them with new concepts from the world of Machine Learning. Ready?
DevOps & MLOps
DevOps aims to shorten the systems development life cycle and provide continuous delivery of high-quality software. From an abstract point of view, engineers produce source code that's checked into a continuous integration (CI) and continuous delivery (CD) pipeline, where it flows through review, test, build, and deployment processes into a digital product or service.
ML systems add more complexity to this process via two additional assets that need to be managed. The first is the data used to train and test the ML models. The second is the set of trained models created during the training process. These two assets are tightly coupled with the source code written to train and use the ML models and to pre- and post-process the data, and together they underpin the need for specific MLOps tooling.
It's essential to be aware that the goals and merits we already know from DevOps also apply to MLOps. They help us build, test, and release software rapidly and reliably. They facilitate automation and reduce lead time. One new goal in MLOps tooling is to support the cycle of rapid experimentation and feedback when designing the algorithm and model architecture.
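To make the coupling between code, data, and trained models concrete, here is a minimal sketch of a run manifest that ties the three together. It is an illustration under assumptions, not a description of any particular MLOps tool: the names `fingerprint` and `build_run_manifest`, the commit SHA, and the tiny byte strings standing in for a dataset and model weights are all hypothetical.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return a short, stable content hash for any asset."""
    return hashlib.sha256(payload).hexdigest()[:12]


def build_run_manifest(code_version: str, dataset: bytes, model: bytes, params: dict) -> dict:
    """Tie one trained model to the exact code, data, and parameters behind it."""
    manifest = {
        "code_version": code_version,       # e.g. a git commit SHA
        "data_hash": fingerprint(dataset),  # identifies the training data
        "model_hash": fingerprint(model),   # identifies the trained artifact
        "params": params,
    }
    # Hash the manifest itself so a single ID identifies the whole run.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["run_id"] = fingerprint(canonical)
    return manifest


# Hypothetical inputs standing in for real files and a real commit SHA.
run_a = build_run_manifest("3f2a91c", b"age,label\n34,1\n51,0\n", b"model-weights-v1", {"lr": 0.01})
run_b = build_run_manifest("3f2a91c", b"age,label\n34,1\n51,1\n", b"model-weights-v1", {"lr": 0.01})

# One label changed in the dataset, so the two runs get different IDs.
print(run_a["run_id"] != run_b["run_id"])
```

The point of the sketch is that a change in any one of the three assets yields a different run, which is the bookkeeping that dedicated MLOps tooling takes over at scale.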
Software Architecture & Quality Attributes
Architecture is the process of designing and building durable structures that remain open for change without sacrificing quality. When we're building software that leverages data & ML to build better products and services, we apply the same tried and tested methods and processes that we're applying for all the digital products we develop at In The Pocket.
We begin with defining and prioritizing the Quality Attributes (QAs, aka non-functional requirements) of the system we're about to build. These QAs reflect the intrinsic values of the system, are unlikely to change soon, and provide a shared understanding of the user needs, the client's challenge, or the system's goal. These QAs effectively help us make rapid and consistent architectural decisions.
Designers need to analyze trade-offs between multiple conflicting attributes to satisfy user requirements. The ultimate goal is the ability to quantitatively evaluate and trade off multiple quality attributes to arrive at a better overall system. We should not look for a single, universal metric, but rather for quantification of individual attributes and for trade-off between these different metrics. Source
You're probably familiar with external QAs related to the User Experience (e.g. Reusability, Correctness, Efficiency, etc.) and internal QAs related to the Developer Experience (e.g. Maintainability, Portability, Testability, etc.).
Let me take you through how we can apply the QA process to ML products. ML connects to our shared belief that great digital products have a healthy architectural foundation: ML systems extend this list with their own set of ML QAs.
- 👩🏫 Explainability. The extent to which the internal mechanics of an ML model can be explained in natural language. Ensuring that a human can understand and explain an AI decision.
- 🪟 Transparency. A human can follow transparent models and see why and how an AI decides something.
- 🦜 Interpretability. The degree to which a human can understand the cause of a decision and to which a human can consistently predict the model's result.
- 🦹 Fairness. The ability to yield impartial and unbiased predictions to avoid favoritism or discrimination.
- 🪄 Generalizability. A model is said to generalize if it performs equally well on both training and testing data in the first order. In the second order, a model should perform equally well when deployed in real-world environments and figure out how to deal with situations not present in the training data (out-of-domain predictions).
- 🎓 Ethicality. Models should act according to laws, rules, regulations, and unwritten moral values.
- 🏭 Reproducibility. The ability of the model or its training process to behave consistently, not changing behaviour or outcomes without apparent changes.
- 🙌 Collaboration. The ability to work well with humans and augment processes and activities.
- 🧑🦯 Inclusivity. The ability of AI systems to empower everyone and engage people.
Each ML QA might be more or less significant and needs to be prioritized along with the other QAs when creating a solution architecture for ML products and services. With alignment on QAs and a derived set of requirements, we can start the design and make early decisions on the architecture of the system we’re building.
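Some of the ML QAs above can be quantified, which helps when trading them off against each other. The sketch below is one possible way to put numbers on Generalizability (the accuracy gap between training and testing data) and Fairness (the difference in positive-prediction rate between groups, often called demographic parity difference). The function names and the toy labels, predictions, and groups are assumptions made for illustration, and these two metrics are only one choice among many.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


def generalization_gap(train_true, train_pred, test_true, test_pred):
    """Generalizability: how much accuracy drops from training to testing data."""
    return accuracy(train_true, train_pred) - accuracy(test_true, test_pred)


def demographic_parity_difference(y_pred, groups):
    """Fairness: largest gap in positive-prediction rate between any two groups."""
    totals, positives = {}, {}
    for pred, group in zip(y_pred, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == 1 else 0)
    rates = [positives[group] / totals[group] for group in totals]
    return max(rates) - min(rates)


# Toy data, invented for this example.
train_true, train_pred = [1, 0, 1, 0], [1, 0, 1, 0]
test_true, test_pred = [1, 0, 1, 0], [1, 0, 0, 0]
test_groups = ["a", "a", "b", "b"]

gap = generalization_gap(train_true, train_pred, test_true, test_pred)
parity = demographic_parity_difference(test_pred, test_groups)
print(f"generalization gap: {gap:.2f}, demographic parity difference: {parity:.2f}")
```

Once attributes like these are measured, the prioritization exercise becomes a discussion about acceptable thresholds rather than about abstract values.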
Novel components for an ML Solution Architecture
In architecture, we’re connecting and isolating multiple components that work together as a system. You’ll see many familiar components from traditional digital systems come back in ML systems, like API Gateways, Microservices, Load Balancers, Logging Services, NoSQL/SQL Databases and Cloud Storage. Below is a list of 11 ML components that should all be considered when building ML systems. Most ML components are related to managing the lifecycle of either datasets or trained models.
- 🔬 Experimentation. Do exploratory data analysis, create prototype model architectures, and implement training routines
- 🏗️ Data Processing. Prepare and transform large amounts of data for training & evaluation
- ⚙️ Model Training. Run powerful algorithms for training ML models
- 🧪 Model Evaluation. Assess the effectiveness of your model, interactively during experimentation and automatically in production
- 🍽️ Model Serving. Deploy and serve your models in different environments
- 📦 Model Registry. Govern the lifecycle of the ML models in a central repository
- 👤 Model Validation. Understand how newly trained models perform in front of real users
- 📊 Model Monitoring. Track the efficiency and effectiveness of the deployed models in production environments
- 👷♂️ ML Pipelines. Instrument, orchestrate, and automate complex ML training and prediction pipelines in test/staging/production environments
- 🤖 CI/CD Infrastructure. Build, test, release, and operate software systems rapidly and reliably
- 🏦 Data Repository. Create, maintain, and reuse high-quality data for training and evaluating ML models
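To show how a few of these components could fit together, here is a deliberately tiny sketch that chains Data Processing, Model Training, Model Evaluation, and a Model Registry into one pipeline. Everything in it is a stand-in chosen for illustration: the in-memory `ModelRegistry`, the one-feature threshold "model", the `min_accuracy` gate, and the toy rows are assumptions, not how any specific platform implements these components.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """In-memory stand-in for a central model registry."""
    versions: list = field(default_factory=list)

    def register(self, model, metrics):
        version = len(self.versions) + 1
        self.versions.append({"version": version, "model": model, "metrics": metrics})
        return version


def process_data(raw_rows):
    """Data Processing: drop incomplete rows, split features from labels."""
    clean = [row for row in raw_rows if None not in row]
    return [row[0] for row in clean], [row[1] for row in clean]


def train_model(features, labels):
    """Model Training: place a threshold halfway between the two class means."""
    positives = [x for x, y in zip(features, labels) if y == 1]
    negatives = [x for x, y in zip(features, labels) if y == 0]
    threshold = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return {"threshold": threshold}


def evaluate_model(model, features, labels):
    """Model Evaluation: accuracy on data the model was not trained on."""
    preds = [1 if x > model["threshold"] else 0 for x in features]
    return {"accuracy": sum(p == y for p, y in zip(preds, labels)) / len(labels)}


def run_pipeline(train_rows, eval_rows, registry, min_accuracy=0.8):
    """ML Pipeline: process, train, evaluate, and register only if good enough."""
    model = train_model(*process_data(train_rows))
    metrics = evaluate_model(model, *process_data(eval_rows))
    if metrics["accuracy"] < min_accuracy:
        return None  # the gate keeps weak models out of the registry
    return registry.register(model, metrics)


registry = ModelRegistry()
train_rows = [(1.0, 0), (2.0, 0), (None, 1), (8.0, 1), (9.0, 1)]
eval_rows = [(0.5, 0), (3.0, 0), (7.0, 1), (6.0, 1)]
version = run_pipeline(train_rows, eval_rows, registry)
print(version, registry.versions[0]["metrics"])
```

In a real system each function would be its own component with its own infrastructure, but the hand-offs between them would look much the same.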
Now you can start building your solution with these ML components and other building blocks you’re already familiar with. There are different flavours: some are better than others, and some are anti-patterns to steer away from. Let me explain.
🗒️ The Jupyter Notebook way
Experimentation, preparing data, training and evaluating models: it all happens within a Jupyter Notebook. The trained model is served via a basic Python Flask API and packaged into a Docker container. It’s a convenient way to build an early prototype, but consider it an anti-pattern to bring ML models to production environments with a stack of Jupyter Notebooks. Besides being a hard format to collaborate on or reproduce, this approach gives no thought to the other ML components that are essential to building long-lasting products.
☁️ The Cloud provider way
Google Vertex AI, Amazon SageMaker, Azure ML: they’ve all got you covered with a full-blown solution. It comes with a price tag, an overwhelming number of features that you might not need initially and that add complexity, and the always unwanted vendor lock-in. As an example, at the time of writing there is no way to use cheap spot instances to run ML Pipelines on Google Vertex AI.
🕺 The Hipster way
There are start-ups and scale-ups for every piece of the puzzle. NannyML for Model Monitoring & Validation, Tecton and Feast for feature stores, Apache Beam for Data Processing, MLflow as a model registry and for tracking your experiments, KServe for Model Serving on Kubernetes, Kubeflow for building ML Pipelines, DVC to version your datasets.
Mix and match the latest and greatest, with everyone shouting that their part of the solution is the most critical piece of technology you really, really must adopt. With many novel interconnected parts though, there’s a high risk of overengineering and spending time on technology discussions. Keep it simple! Avoid being distracted from building the right product and fulfilling real user needs.
📈 The Agile way
Guided by a North Star 💫, Quality Attributes and a Solution Architecture — build working end-to-end software and add value incrementally. This key principle of Agile Software Development also holds when building ML Products and Services.
While keeping an eye on all ML Components that are essential to building durable ML Solutions, it’s often better to go fast and build a working skeleton solution. Start simple, and make timely decisions to add more complexity to the solution.
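As a rough illustration of a working skeleton, the sketch below puts the simplest possible model behind a stable prediction interface, so the end-to-end path works before any serious modelling happens. The `Predictor` protocol, the `MajorityClassBaseline` class, and the `handle_request` function are hypothetical names invented for this example. A better model can later replace the baseline without touching the serving code.

```python
from collections import Counter
from typing import Protocol, Sequence


class Predictor(Protocol):
    """The only contract the rest of the system depends on."""

    def predict(self, features: Sequence[float]) -> int: ...


class MajorityClassBaseline:
    """Simplest possible model: always predict the most common training label."""

    def __init__(self, labels: Sequence[int]) -> None:
        self.label = Counter(labels).most_common(1)[0][0]

    def predict(self, features: Sequence[float]) -> int:
        return self.label


def handle_request(model: Predictor, payload: dict) -> dict:
    """Serving layer of the skeleton: validate input, predict, shape the response."""
    features = payload.get("features")
    if not isinstance(features, list) or not features:
        return {"status": 400, "error": "features must be a non-empty list"}
    return {"status": 200, "prediction": model.predict(features)}


# Toy labels, invented for this example.
baseline = MajorityClassBaseline(labels=[0, 1, 1, 1, 0])
ok_response = handle_request(baseline, {"features": [0.2, 0.7]})
bad_response = handle_request(baseline, {})
print(ok_response)
print(bad_response)
```

Starting this way means that data flow, serving, and error handling are exercised from day one, and complexity is added only when the baseline stops being good enough.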
Thanks for reading this far! I hope you learned a few things: how to connect the dots between DevOps and MLOps, what kind of quality attributes come with ML Products and Services, and how the plethora of ML Components fits into a solution architecture.
Next up, I’ll try to dive deeper into the CD4ML process and how we use it at ITP, and how the different ML components are relevant throughout the lifecycle of an ML system. 🙌
Want to know more?
Join #ml-engineering or DM me on Slack, or dive into one of these resources:
- Whitepaper on MLOps by Google [https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf]
- MLOps processes by INNOQ [https://ml-ops.org]
- Open ML learning community [https://madewithml.com/#mlops]