Over the last thirteen chapters, we have explored the field of Machine Learning (ML) interpretability. As stated in the preface, it's a broad area of research, most of which hasn't even left the lab and become widely used yet, and this book has no intention of covering absolutely all of it. Instead, the objective is to present various interpretability tools in sufficient depth to be useful as a starting point for beginners and even complement the knowledge of more advanced readers. This chapter will summarize what we've learned in the context of the ecosystem of ML interpretability methods, and then speculate on what's to come next!
These are the main topics we are going to cover in this chapter:
First, we will provide some context on how the book relates to the main goals of ML interpretability and how practitioners can start applying the methods to achieve those broad goals. Then, we'll discuss what the current areas of growth in research are.
As discussed in Chapter 1, Interpretation, Interpretability, and Explainability; and Why Does It All Matter?, there are three main themes when talking about ML interpretability: Fairness, Accountability, and Transparency (FAT), and each of these presents a series of concerns (see Figure 14.1). I think we can all agree these are all desirable properties for a model! Indeed, these concerns all present opportunities for the improvement of Artificial Intelligence (AI) systems. These improvements start by leveraging model interpretation methods to evaluate models, confirm or dispute assumptions, and find problems.
Your aim will depend on the stage you are at in the ML workflow. If the model is already in production, the objective might be to evaluate it with a whole suite of metrics, but if the model is still in early development, the aim may be to find deeper problems that a metric won't discover. Perhaps you are simply using black-box models for knowledge discovery, as we did in Chapters 4 and 5; in other words, leveraging the models to learn from the data with no plan to take them into production. If this is the case, you might confirm or dispute the assumptions you had about the data and, by extension, the model.
In any case, none of these aims are mutually exclusive, and you should probably always be looking for problems and disputing assumptions, even when the model appears to be performing well!
And regardless of the aim and primary concern, it is recommended that you use many interpretation methods, not only because no technique is perfect, but also because all problems and aims are interrelated. In other words, there's no justice without consistency and no reliability without transparency. In fact, you can read Figure 14.1 from bottom to top as if it were a pyramid, because transparency is foundational, followed by accountability in the second tier, and, ultimately, fairness as the cherry on top. Therefore, even when the goal is to assess model fairness, the model should be stress-tested for robustness, and all feature importances and interactions should be understood; otherwise, a fairness assessment won't matter if the predictions aren't robust and transparent.
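To make this concrete, here is a minimal, from-scratch sketch of one such cross-check, permutation importance. Everything here is invented for illustration: the `model_predict` function stands in for any trained model, and the data is synthetic. Shuffling a feature the model relies on degrades accuracy, while shuffling an irrelevant one does not, which is exactly the kind of sanity check that complements other interpretation methods:

```python
import random

# Hypothetical toy model: predicts 1 when the first feature exceeds 0.5.
# A stand-in for any trained black-box model.
def model_predict(row):
    return 1 if row[0] > 0.5 else 0

random.seed(42)
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]

def accuracy(X, y):
    return sum(model_predict(r) == t for r, t in zip(X, y)) / len(y)

baseline = accuracy(X, y)

# Permutation importance: shuffle one feature at a time and measure how
# much accuracy drops; a bigger drop means a more important feature.
importances = []
for j in range(2):
    shuffled_col = [row[j] for row in X]
    random.shuffle(shuffled_col)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled_col)]
    importances.append(baseline - accuracy(X_perm, y))

print(importances)  # feature 0 matters; feature 1 does not
```

In practice you would use a library implementation (such as scikit-learn's `permutation_importance`) and average over several shuffles, but the principle is the same.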
There are many interpretation methods covered in Figure 14.1, and they by no means represent every interpretation method available. Rather, they are the most popular methods with well-maintained open source libraries behind them. In this book, we have touched on most of them, albeit some only briefly. Those that weren't discussed are in italics, and those that were have the relevant chapter numbers provided next to them. The focus has been on model-agnostic methods for black-box supervised learning models. Still, outside of this realm, there are many other interpretation methods, such as those found in reinforcement learning, generative models, or the many statistical methods used strictly for linear regression. And even within the supervised learning black-box realm, there are hundreds of application-specific model interpretation methods, used for everything from chemistry graph CNNs to customer churn classifiers.
That being said, many of the methods discussed in this book can be tailored to a wide variety of applications. Integrated gradients can be used to interpret audio classifiers and hydrological forecasting models. Sensitivity analysis can be employed in financial modeling and infectious disease risk models. Causal inference methods can be leveraged to improve user experience and drug trials.
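To illustrate just how portable one of these methods is, here is a minimal one-at-a-time sensitivity analysis sketch. The `price_model` function and its inputs are hypothetical stand-ins for any black-box model in any domain; the idea is simply to perturb one input at a time and observe the relative change in the output:

```python
# Hypothetical model: an insurance-style pricing function.
# The coefficients and inputs are invented for illustration.
def price_model(income, age, region_risk):
    return 100 + 0.002 * income + 0.5 * age + 40 * region_risk

baseline_inputs = {"income": 50_000, "age": 40, "region_risk": 0.3}
baseline_output = price_model(**baseline_inputs)

# One-at-a-time (OAT) analysis: perturb each input by +10% and record
# the relative change in the model's output.
sensitivities = {}
for name, value in baseline_inputs.items():
    perturbed = dict(baseline_inputs, **{name: value * 1.1})
    delta = price_model(**perturbed) - baseline_output
    sensitivities[name] = delta / baseline_output

print(sensitivities)  # income dominates in this toy setup
```

More rigorous global methods (such as Morris or Sobol sensitivity analysis, available in libraries like SALib) vary all inputs across their full ranges, but OAT is often the first diagnostic applied.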
Improve is the operative word here, because interpretation methods have a flip side!
In this book, that flip side has been referred to as tuning for interpretability, which means creating solutions to problems with FAT. Those solutions can be appreciated in Figure 14.2:
I have observed five approaches to interpretability solutions:
There are also three areas in which these approaches can be applied:
There's a fourth area that can impact the other three; namely, data and algorithmic governance. This includes regulations and standards that dictate a certain methodology or framework. It's a missing column because very few industries and jurisdictions have laws dictating what methods and approaches should be applied to comply with FAT. For instance, governance could impose a standard for explaining algorithmic decisions, data provenance, or a robustness certification threshold. We will discuss this further in the next section.
You can tell from Figure 14.2 that many of the methods recur across FAT. Feature Selection and Engineering, Monotonic Constraints, and Regularization benefit all three, but are not always leveraged by the same approach. Data Augmentation can also enhance reliability for fairness and accountability. As with Figure 14.1, the items in italics were not covered in this book, and three of them stand out: Uncertainty Estimation, Adversarial Robustness, and Privacy Preservation are fascinating topics that deserve books of their own.
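To ground one of the recurring methods above, monotonic constraints can be verified with a simple diagnostic sweep: hold all features fixed except one, and check that predictions never move in the wrong direction. The `model_predict` function below is a hypothetical stand-in for a trained model (gradient boosting libraries such as XGBoost and LightGBM can enforce such constraints at training time via a `monotone_constraints` parameter):

```python
# Hypothetical toy model: approval likelihood rises with credit score
# and falls with debt ratio. A stand-in for any trained model.
def model_predict(credit_score, debt_ratio):
    return min(1.0, max(0.0, 0.002 * credit_score - 0.5 * debt_ratio))

def is_monotonic_increasing(predict, grid, fixed):
    """Sweep one feature over a grid, holding the rest fixed, and verify
    predictions never decrease."""
    preds = [predict(x, **fixed) for x in grid]
    return all(a <= b for a, b in zip(preds, preds[1:]))

grid = range(300, 851, 50)  # credit-score sweep
ok = is_monotonic_increasing(model_predict, grid, {"debt_ratio": 0.2})
print(ok)  # predictions rise (or stay flat) with credit score
```

A sweep like this only probes one slice of the feature space, so it complements, rather than replaces, training-time constraints.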
One of the most significant deterrents to AI adoption is a lack of interpretability, which is partly why 50-90% of AI projects never take off; another is the ethical transgressions that result from not complying with FAT. In this respect, Interpretable Machine Learning (iML) has the power to lead ML as a whole, because it can help with both problems via the corresponding methods in Figure 14.1 and Figure 14.2.
Thankfully, we are witnessing an increase in interest and production in iML, mostly under Explainable Artificial Intelligence (XAI) — see Figure 14.3. In the scientific community, iML is still the most popular term, but XAI dominates in public settings:
XAI VERSUS IML – WHICH ONE TO USE? My take: Although they are understood as synonyms in industry, and iML is regarded as more of an academic term, ML practitioners, even those in industry, should be wary of using the term XAI. Words can have outsized suggestive power. Explainable presumes full understanding, whereas interpretable leaves room for error, as there always should be when talking about models, especially extraordinarily complex black-box ones. Furthermore, AI has either captured the public imagination as a panacea or been vilified as dangerous. Paired with the word explainable, the term can amplify the hubris of those who see AI as a panacea, and perhaps calm the concerns of those who see it as dangerous. XAI as a marketing term might be serving a purpose. However, for those of us who build models, the suggestive power of the word explainable can make us overconfident in our interpretations. That being said, this is just an opinion.

This means that just as ML is starting to get standardized, regulated, consolidated, and integrated into a whole host of other disciplines, interpretation will soon get a seat at the table.
ML is replacing software in all industries. As more is automated, more models are deployed to the cloud, and this will only intensify with the Artificial Intelligence of Things (AIoT). Deployment is not traditionally in the ML practitioner's wheelhouse, which is why ML increasingly depends on Machine Learning Operations (MLOps). And the pace of automation means more tools are needed to build, test, deploy, and monitor these models. At the same time, there's a need for the standardization of tools, methods, and metrics. Slowly but surely, this is happening. Since 2017, we have had the Open Neural Network Exchange (ONNX), an open standard for interoperability. And at the time of writing, the International Organization for Standardization (ISO) has over two dozen AI standards being written (and one published), several of which involve interpretability. Naturally, some things will get standardized through common use, due to the consolidation of ML model classes, methods, libraries, service providers, and practices; over time, one or a few in each area will emerge as the victors. Lastly, given ML's outsized role in algorithmic decision-making, it's only a matter of time before it gets regulated. So far, only some financial markets regulate trading algorithms, such as the Securities and Exchange Commission (SEC) in the United States and the Financial Conduct Authority (FCA) in the UK. Beyond that, only data privacy and provenance regulations are widely enforced, such as HIPAA in the US and LGPD in Brazil. The GDPR in the European Union takes this a bit further with the "right to an explanation" for algorithmic decisions, but the intended scope and methodology are still unclear.
ML interpretability is growing quickly but is lagging behind ML. Some interpretation tools have been integrated into the cloud ecosystem, from SageMaker to DataRobot. They are yet to be fully automated, standardized, consolidated, and regulated, but there's no doubt that this will happen.
I'm used to hearing the metaphor of this period being the "Wild West of AI", or worse, an "AI Gold Rush"! It conjures images of unexplored and untamed territory being eagerly conquered, or worse, civilized. Yet, in the 19th century, the United States' western areas were not too different from other regions on the planet and had already been inhabited by Native Americans for millennia, so the metaphor doesn't quite work. Predicting with the accuracy and confidence that we can achieve with ML would spook our ancestors and is not a "natural" position for us humans. It's more akin to flying than exploring unknown land.
The article Toward the Jet Age of machine learning (linked in the Further reading section at the end of this chapter) presents a much more fitting metaphor of AI being like the dawn of aviation. It's new and exciting, and people still marvel at what we can do from down below (see Figure 14.4)!
However, it had yet to fulfill its potential. Decades after the barnstorming era, aviation matured into the safe, reliable, and efficient Jet Age of commercial aviation. In aviation's case, the promise was that it could reliably take goods and people halfway around the world in less than a day. In AI's case, the promise is that it can make fair, accountable, and transparent decisions; maybe not every decision, but at least those it was designed to make, unless it's an example of Artificial General Intelligence (AGI).
So how do we get there? The following are a few ideas I anticipate will occur in the pursuit of reaching the Jet Age of ML.
As we intend to go farther with AI than we have ever gone before, the ML practitioners of tomorrow have to be more aware of the dangers of the sky, and by the sky, I mean the new frontiers of predictive and prescriptive analytics. The risks are numerous, and involve all kinds of biases and assumptions, known and potential problems with the data, and our models' mathematical properties and limitations. It's easy to be deceived into thinking ML models are software. But software is completely deterministic in nature: it's solidly anchored to the ground, not hovering in the sky!
For civil aviation to become safe, it required a new mindset: a new culture. The fighter pilots of WWII, as capable as they were, had to be retrained to work in civil aviation. It's not the same mission, because when you know that you are carrying passengers on board and the stakes are high, everything changes. Ethical AI, and by extension iML, ultimately requires the awareness that models directly or indirectly carry passengers "on board", and that models aren't as robust as they seem. A robust model must reliably withstand almost any condition, over and over again, in the same way the planes of today do. To that end, we need to be using more instruments, and those instruments come in the form of interpretation methods.
Tighter integration with many disciplines is needed for models that comply with the principles of FAT. This means more significant involvement of AI ethicists, lawyers, sociologists, psychologists, human-centered designers, and countless other professions. Along with AI technologists and software engineers, they will help code best practices into standards and regulations.
New standards will be needed not only for code, metrics, and methodologies, but also for language. The language behind data has mostly been derived from statistics, math, computer science, and econometrics, which leads to a lot of confusion.
It will likely be required that all production models fulfill the following specifications:
New regulations will likely create new professions such as AI auditors and model diagnostics engineers. But they will also prop up MLOps engineers and ML automation tools.
In the future, we won't program an ML pipeline; it will mostly be a drag-and-drop affair with a dashboard offering all kinds of metrics, and it will evolve to be mostly automated. Automation shouldn't come as a surprise, because some existing libraries already perform automated feature selection and model training. Some interpretability-enhancing procedures may be done automatically, but most of them should require human discretion. Interpretation, however, ought to be injected throughout the process, much like planes that mostly fly themselves still have instruments that alert pilots of issues; the value is in informing the ML practitioner of potential problems and improvements at every step. Did it find a feature to recommend for monotonic constraints? Did it find some imbalances that might need adjusting? Did it find anomalies in the data that might need some correction? Show the practitioner what needs to be seen to make an informed decision, and let them make it.
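The kinds of automated checks described above can be sketched in a few lines. The thresholds, function names, and data here are purely illustrative assumptions, not prescriptions; a real pipeline would surface these flags in a dashboard rather than print them:

```python
from collections import Counter
import statistics

def check_imbalance(labels, max_ratio=3.0):
    """Flag if the majority class outnumbers the minority beyond max_ratio."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return ratio > max_ratio, ratio

def check_anomalies(values, z_threshold=2.5):
    """Flag values far from the mean. Note: an extreme value inflates the
    standard deviation, so production checks often use robust statistics
    (e.g., median-based) instead of a plain z-score."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > z_threshold * stdev]

labels = [0] * 90 + [1] * 10                  # hypothetical 9:1 imbalance
ages = [34, 29, 41, 38, 35, 31, 36, 33, 400]  # one likely data-entry error

imbalanced, ratio = check_imbalance(labels)
outliers = check_anomalies(ages)
print(imbalanced, ratio, outliers)
```

Each flag is a prompt, not a verdict: the pipeline shows the practitioner what needs to be seen, and the practitioner decides what, if anything, to correct.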
Certifiably robust models trained, validated, and deployed at the click of a button require more than just cloud infrastructure: they require the orchestration of tools, configurations, and people trained in MLOps to monitor the models and perform maintenance at regular intervals.
Much like aviation took a few decades to become the safest mode of transportation, it will take AI a few decades to become the safest mode of decision-making. It will take a global village to get us there, but it will be an exciting journey! And remember, the best way to predict the future is to create it.