CHAPTER 2
Data Management and Preparation

“Deep learning craves big data because big data is necessary to isolate hidden patterns and to find answers without over-fitting the data. With deep learning, the better-quality data you have, the better the results.”

—Wayne Thompson, SAS Chief Data Scientist

Data has become a vital resource for organizations, entities, and governments alike.

For the risk management function, data has always been a pivotal enabler. Since its inception as a scientific discipline, sound risk management has been underpinned by the efficient capture and retrieval of information at the time it is needed, and by robust data management. To name a few examples, data is used to inform risk assessments, monitor risks, and detect new types of risk. For risk modeling, real-time and granular data are increasingly being used to develop, monitor, and maintain better and more innovative risk models.

A critical lesson learned from the Global Financial Crisis was that banks' information technology (IT) and data architectures were inadequate to support the broad management of financial risks. Some banks were unable to manage their risks properly because of weak risk data aggregation capabilities and risk-reporting practices. This had severe consequences for the banks themselves and for the stability of the financial system. In response, the Basel Committee on Banking Supervision (BCBS) developed standard number 239, which sets out principles for effective risk data aggregation and risk reporting.

In recent years, digitalization and the digital footprints left by consumers and businesses have caused a rapid growth in data sources available for analysis, broadening the possibilities to generate insights beyond those from traditional data sources. Today, the risk function is operating in a world with increasing demand for better digital services and data-driven decision-making. Other initiatives like open banking (or consent-based data exchange), the availability of a plethora of proprietary and open-source tools, and cloud computing have a significant impact on value creation in financial services. With the growth of these initiatives and the mainstream use of advanced analytics and machine learning, these innovations require strong data foundations.

Notwithstanding its benefits, having more data also presents more challenges, including access to good-quality data, cybersecurity, and the responsible use of personal data. Robust data management is critical for the adoption of artificial intelligence (AI) and machine learning because these modeling approaches are much more data driven. AI and machine learning can identify and explore subtleties in big data and can better handle “alternative data.”

The term big data refers to data that is so large, fast, or complex that it is difficult or impossible to process using traditional methods.

The concept of big data gained momentum when industry analyst Doug Laney articulated its definition using three Vs:

  1. Volume. Organizations collect data from a variety of sources, including business transactions, third-party data providers, news articles, and more. In the past, storing it would have been a problem, but more economical storage on cloud-based platforms and data lakes has eased the burden.
  2. Velocity. With the growth in real-time data, data streams into businesses at an ever-increasing speed and must be handled in a timely manner.
  3. Variety. Using larger datasets means leveraging diverse types of data that are not necessarily structured as dimensional tables. Unstructured data sources such as sensor data, images, and social media updates add to the variety of data.

Looking ahead, the expectations from regulators are clear: risk calculations will need to be performed at much higher levels of granularity, with greater emphasis on the use of high-quality data, especially when advanced algorithms are being used for risk model development.

The risk of incorporating emerging data sources such as “alternative data” into risk management is that they may violate consumer protection and fair lending laws, which can lead to unintended consequences in the use of AI and machine learning. These unintended consequences are often amplified because the lack of explainability of AI and machine learning masks the underlying data issues. The abstraction hidden in the layers of a deep learning algorithm is where existing approaches to data quality may fall short: organizations may have to turn to AI itself to manage the risks in the data.

The use of AI and machine learning cannot function outside existing business processes. For organizations to benefit from its use, a comprehensive approach to enterprise data management is required, spanning across the silos of data collection, modeling, downstream processes, reporting, and disclosures. In essence, for AI and machine learning, the governance of data becomes as important as the governance of the models.

IMPORTANCE OF DATA GOVERNANCE TO THE RISK FUNCTION

Historically, many organizations have lacked a strategic approach to data management and preparation. For financial institutions, compliance with the BCBS239 principles meant that organizations had to fortify their data aggregation and reporting capability by tracking the accuracy, integrity, and completeness of the data used for risk measurement and management.

The BCBS239 guidance represented a major transformational challenge for financial institutions around the globe since its development in the wake of the Global Financial Crisis. The BCBS239 guidance has four very closely related topics that help with better and more uniform risk exposure management:

  1. Risk governance and infrastructure
  2. Risk data aggregation
  3. Risk reporting practices
  4. Risk supervisory review

These four topics are supported by 11 principles, as shown in Figure 2.1. Several of these principles focus on improving data quality and data management to help with better and more uniform risk exposure management.


Figure 2.1 The framework of the Basel Committee on Banking Supervision (BCBS) standard number 239.


Other regulations, such as the GDPR (General Data Protection Regulation) and open banking, also require financial institutions to strengthen data governance.

Regardless of regulatory expectations, strong data management practices are a competitive advantage for data-driven decision-making. For good-quality models and decisions, good-quality data is a prerequisite.

As a first step, organizations need to establish an enterprise data governance framework. This will then be supplemented by sound data integration and data preparation practices.

However, without the fundamentals of data preparation and its management in place, which include the use of data dictionaries, standardized datasets, and data-quality and validation routines, adhering to BCBS239 (Figure 2.1) will continue to be a challenge; many financial institutions are still in the process of full adoption.

FUNDAMENTALS OF DATA MANAGEMENT

Many tools can be used to explore data and prepare it for risk modeling, but the sample design and approaches adopted for traditional models are often assumed to be sufficient for AI and machine learning to be applied with rigor. This assumes that the fundamentals of data preparation and its management are present within the financial institution. Unfortunately, even 14 years after the Global Financial Crisis, these fundamentals are not always in place to a meaningful degree, especially to support data-driven algorithms.

For the fundamentals of data preparation and management to be more easily achieved, and to ensure that the data is fit for purpose for advanced algorithms, an enterprise data strategy that supports best practices for AI and machine learning is needed. In addition, AI and machine learning can themselves be used to achieve these objectives more easily. Let us look at each of the fundamental topics.

Master Data Management

Organizations have access to copious amounts of internal and external data. However, one of the largest gaps faced by financial institutions large and small is the key concept of data catalogs. Many institutions attempt to fill this gap with spreadsheet-based descriptions of datasets that quickly become outdated as new columns or rows are added (e.g., for new financial products). Automation enabled by AI and machine learning can help build data catalogs that include the ability to easily search for data and shared columns or observations across datasets. The search queries utilize natural language processing to improve ease of use. Organizations can build custom solutions using a programmatic approach or utilize vendor-provided applications with built-in AI technologies.

For example, a standard data catalog provides the following basic metadata for each dataset:

  • Column details
  • Row details
  • Size of the data
  • Completeness of the data (i.e., number of “missing” fields)
  • How the data is accessed and where it is located
  • Data security
  • Purpose of the data (this can be manually added or automated)
  • Data lineage and history of usage (i.e., how many reports, models, and jobs have been completed using the data, and the main users of the table)
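
A catalog entry of this kind can be assembled programmatically. The following is a minimal sketch using pandas; the dataset name, file path, and the catalog fields chosen are hypothetical, and a production catalog would add security classifications and lineage tracking on top of these basics.

  # Minimal sketch (not a production implementation) of assembling a basic
  # data catalog entry with pandas; file name and location are placeholders.
  import pandas as pd

  def catalog_entry(df: pd.DataFrame, name: str, location: str) -> dict:
      """Collect basic catalog metadata for a single dataset."""
      return {
          "dataset": name,
          "location": location,                          # how and where the data is accessed
          "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
          "n_rows": len(df),
          "size_bytes": int(df.memory_usage(deep=True).sum()),
          "missing_fields": int(df.isna().sum().sum()),  # completeness indicator
          "completeness": float(1 - df.isna().mean().mean()),
      }

  loans = pd.read_csv("loan_applications.csv")           # hypothetical dataset
  print(catalog_entry(loans, "loan_applications", "s3://risk-data/loan_applications.csv"))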

Standardizing Datasets and Ensuring Data Quality

For AI and machine learning, risk management teams may decide that all the data profiling metrics generated today are enough.

We recall a very profound statement made by a prolific professor at a conference when challenged about the output results of a model: “Well, the data is the data!” This can be true. Data that is sourced and used in a risk model is “the data,” but importantly, it can be improved and cleansed using sophisticated cognitive data-quality controls. These controls assess the data's uniqueness and descriptive statistics such as the mean, median, mode, and standard deviation. When more sophisticated and repeatable techniques are used, that “little bit of guesswork” is removed and data cleansing becomes less manual.

In practical terms, the data catalog information can be extended further by descriptive data-quality metrics and by whether the data matches the expected distributions (e.g., if the catalog expects integer data but a string is found, it is recorded as a mismatch). Data-quality metrics are also extremely useful for providing details on the completeness of the dataset. These include, but are not limited to:

  • Frequency of uniqueness, missing values, number of mismatches
  • Metadata measures that provide information on columns, including their type, format, length, and whether they are candidates for primary keys
  • Security scores that will automatically detect columns that contain personally identifiable or sensitive information
  • Semantic types by combining data-quality algorithms and natural language processing to automatically generate metadata (e.g., the columns representing numeric values, identification numbers, and postcodes)
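
To make these metrics concrete, the following is a minimal sketch of a column-level data-quality report using pandas. The expected_types mapping stands in for what a data catalog would record; the dataset and column names are hypothetical.

  # Minimal sketch of column-level data-quality metrics; expected_types is a
  # hypothetical stand-in for the catalog's recorded metadata.
  import pandas as pd

  def quality_metrics(df: pd.DataFrame, expected_types: dict) -> pd.DataFrame:
      """Report missingness, uniqueness, and type mismatches per column."""
      rows = []
      for col in df.columns:
          series = df[col]
          expected = expected_types.get(col)
          rows.append({
              "column": col,
              "pct_missing": series.isna().mean(),
              "pct_unique": series.nunique(dropna=True) / max(len(series), 1),
              "actual_type": str(series.dtype),
              "type_mismatch": expected is not None and str(series.dtype) != expected,
          })
      return pd.DataFrame(rows)

  applications = pd.read_csv("loan_applications.csv")     # hypothetical dataset
  print(quality_metrics(applications, {"customer_id": "int64", "postcode": "object"}))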

The challenges of developing, deploying, and managing AI are better addressed by automating manual data-quality tasks to assure good-quality data.

Leveraging cognitive data-quality methods is one way to achieve a level of automation in data validation processes. Cognitive data-quality methods autogenerate data-quality and cleansing routines, thus reducing the need for manual effort. The generated code can be executed as part of robust and repeatable data preparation control plans. These methods can scan the data, whether structured or unstructured, and suggest possible data-quality transformations.
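
As a toy illustration of this idea (not a substitute for a real cognitive data-quality engine), the sketch below scans a pandas DataFrame and proposes simple cleansing rules; the thresholds and rule names are hypothetical.

  # Toy illustration of auto-suggested cleansing rules: scan each column and
  # propose a transformation.  Real cognitive data-quality tools are far more
  # sophisticated; the thresholds and rule names here are hypothetical.
  import pandas as pd

  def suggest_cleansing(df: pd.DataFrame) -> list:
      """Return a list of suggested (column, rule) transformations."""
      suggestions = []
      for col in df.columns:
          series = df[col]
          if series.isna().mean() > 0.05:                          # many missing values
              suggestions.append({"column": col, "rule": "impute_missing"})
          if series.dtype == "object":
              text = series.dropna().astype(str)
              if pd.to_numeric(text, errors="coerce").notna().mean() > 0.9:
                  suggestions.append({"column": col, "rule": "cast_to_numeric"})
              elif (text.str.strip() != text).any():               # stray whitespace
                  suggestions.append({"column": col, "rule": "trim_whitespace"})
      return suggestions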

OTHER DATA CONSIDERATIONS FOR AI, MACHINE LEARNING, AND DEEP LEARNING

Utilizing “Alternative Data”

Alternative data is said to be any data gathered from nontraditional data sources—for example, online or geolocation data, sensor data, social media, satellite images, or network data. Increasingly, data used to develop risk models is becoming a blend of traditional and alternative types, especially in applications utilizing AI and machine learning.

Typically, traditional data sources are structured and stored in relational databases. In some instances, AI systems employ alternative data that includes unstructured data in the form of real-time transactional data, news feeds, web browsing, geospatial data, and so on.

A common data challenge with AI approaches when utilizing larger data volumes is that the time history is not sufficiently long for back-testing purposes (e.g., data capture may have started recently).

The two data types used to build risk models, traditional and alternative, are summarized in Table 2.1. Before we progress to the next parts of this chapter, we wanted to take time to expand on the common steps for “alternative data” before it becomes useful for analytically driven insights and models. Commonly, unstructured “alternative data” must be structured into a more useful tabulated form so that it can be combined with existing data used for risk management purposes, like risk modeling. The process also standardizes the data and reduces its dimensionality. For the inclusion of personal data, the firm will need to implement an information management standard to help govern its use. Such a framework is based on customer or consumer consent, ethical considerations, and regulations, as well as the internal guiding principles of the institution. Furthermore, organizations should also apply the principles of BCBS239 to “alternative data,” as is done for more traditional data.

Table 2.1 Data Types Used to Generate Risk Models

For each type of data, the table gives a definition and examples of traditional and alternative sources.

Transactional
  • Definition: Structured and detailed information that captures key characteristics of customer transactions (i.e., installment payments and cash transfers); stored in massive online transaction processing (OLTP) relational databases
  • Traditional: Aggregated, structured, and typically summarized over longer-term horizons (e.g., by aggregating into averages)
  • Alternative: Granular, allowing for open banking aggregation

Internal data
  • Definition: Customer, product, and transactional history; stored in enterprise databases
  • Alternative: Call center data; digital journeys; app usage

External data
  • Definition: Data available from third-party data providers and public information
  • Traditional: Positive and negative credit data from a small group of bureaus; macroeconomic information from bureaus of statistics
  • Alternative: Multiple sources of third-party data (e.g., bureau data, open-banking aggregation, payment data from utilities, purchasing data from online platforms, subscription information, browsing history, weather, and location data); public information; social media data (care is needed, especially concerning customer consent)

Unstructured data
  • Definition: Data not stored in a predefined structured format; traditionally required human effort to analyze
  • Examples: Text documents (e.g., emails, web pages, claim forms); multimedia content; network information (i.e., ownership structures, suppliers, liquidity, legal dependencies between counterparties)
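
As noted before Table 2.1, unstructured “alternative data” is commonly converted into a tabular, lower-dimensional form before it can be joined to traditional risk data. A minimal sketch of one such conversion, using TF-IDF features compressed with truncated SVD in scikit-learn, is shown below; the example documents are hypothetical and real pipelines involve far more cleansing.

  # Minimal sketch of structuring unstructured text: TF-IDF features compressed
  # with truncated SVD (scikit-learn).  The example documents are hypothetical.
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.decomposition import TruncatedSVD

  documents = [
      "Customer disputes late fee on installment payment",
      "Borrower reports loss of income after contract ended",
      "Routine account maintenance request",
  ]

  tfidf = TfidfVectorizer(stop_words="english")
  sparse_features = tfidf.fit_transform(documents)       # one row per document

  svd = TruncatedSVD(n_components=2, random_state=0)
  tabular_features = svd.fit_transform(sparse_features)  # dense, low-dimensional columns
  print(tabular_features.shape)                          # (3, 2): ready to join to other risk data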

Extending Risk Data to “Alternative Data” for AI and Machine Learning

Data preparation and its management for risk model development starts with the identification of the relevant data sources. In some cases, these sources need to be justified, which means that the reason for using the data must be evidenced as appropriate for the model to be built without causing bias or unfair outcomes (Figure 2.2).

Importantly, when risk modelers select data sources, there are key organizational processes and policies to be considered. Any such considerations continue to apply whether AI or machine learning replaces traditional risk models or merely augments them, and they depend on the type of model. Regardless of whether it is a traditional or an innovative risk model, it is always good to ensure that the data sources selected are appropriate and justified for the model's intended use. In this case, the intended use refers to the model objective, purpose, and the setting in which the model will be applied. In risk departments, a range of model types are commonly used. A selected set of risk model types is listed and explained below, together with examples of how risk modelers are utilizing alternative data sources. For example, alternative data from mobile phones and utilities helps lend to the unbanked and underbanked in many countries. This has a significant impact on access to credit for consumers and micro-entrepreneurs, and further helps to ensure fairer access to credit in countries with historically disadvantaged groups:1


Figure 2.2 A typical relational risk management data model.

The data categories support business as usual (BAU) risk models but are extended for AI/ML models using alternative and third-party data sources.

  • Decision models, such as application and behavioral scorecards. For decision models, risk modelers may use transactional data as a source of alternative data to predict short-term and immediate events like loss of income. By enriching existing data with alternative sources, a more nuanced view of credit risk can be achieved, not only using banking data, but also telecommunications data like subscription data. For more information on the examples, refer to Chapter 3.
  • Credit risk parameter estimates for probability of default, loss given default, and exposure at default. Risk parameters are widely used for regulatory capital, provisions, stress testing, and other internal calculations. Risk modelers are utilizing location data from property intelligence platforms and map applications to develop location scores. Risk modelers then use these scores to improve prediction of probability of default (PD) and loss given default (LGD) models for consumer loan portfolios and loans to small and medium enterprises (SMEs).
  • Stress testing. These are forward-looking analyses to assess how adverse scenarios affect the resilience of a financial institution's balance sheet. This is normally done by stressing macroeconomic variables like GDP (gross domestic product), unemployment rates, and house price indices. A financial institution may take its individual portfolios and granular risk factors and stress its risk parameters using internally defined scenarios (see the sketch after this list). Analysis can be applied to extract insights from news articles and social media to assess sentiment on emerging risks such as geopolitical uncertainty. For climate risk assessments, scenario-based analytics utilize climate scenarios such as those provided by the Network for Greening the Financial System (NGFS).
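
A minimal sketch of the scenario-based stressing described in the last bullet follows; the logistic coefficients, macroeconomic drivers, and scenario values are purely hypothetical.

  # Minimal sketch of scenario-based stressing of a probability of default (PD):
  # a toy logistic model of macro drivers.  Coefficients and scenarios are hypothetical.
  import numpy as np

  def pd_model(gdp_growth, unemployment, intercept=-3.5, b_gdp=-0.10, b_unemp=0.15):
      """Toy logistic PD as a function of macroeconomic drivers."""
      z = intercept + b_gdp * gdp_growth + b_unemp * unemployment
      return 1.0 / (1.0 + np.exp(-z))

  baseline = pd_model(gdp_growth=2.0, unemployment=5.0)
  stressed = pd_model(gdp_growth=-3.0, unemployment=9.0)   # internally defined adverse scenario
  print(f"baseline PD {baseline:.1%}, stressed PD {stressed:.1%}")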

Synthetic Data Generation

With the use of AI and machine learning and the availability of additional compute power, the use of synthetic data is rising. Often, the data available for training AI and machine learning has limitations, such as gaps in the time series, insufficient data history, or insufficient granularity. Simulated data can help overcome these gaps. The methods typically applied range from bias adjustments (e.g., the fuzzy method for reject inferencing), repeat sampling (bootstrapping), and the synthetic minority oversampling technique (SMOTE) to methods that generate market states, such as restricted Boltzmann machines.2
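
As a minimal sketch of the two simpler approaches named above, bootstrapping and SMOTE, the following uses NumPy for resampling and the third-party imbalanced-learn package for minority oversampling; the simulated features and default rate are hypothetical.

  # Minimal sketch: bootstrap resampling with NumPy and minority oversampling
  # with SMOTE from the third-party imbalanced-learn package.
  import numpy as np
  from imblearn.over_sampling import SMOTE

  rng = np.random.default_rng(42)
  X = rng.normal(size=(1000, 5))                    # hypothetical risk drivers
  y = (rng.random(1000) < 0.05).astype(int)         # rare default events

  # Bootstrapping: resample rows with replacement to create an alternative sample.
  idx = rng.integers(0, len(X), size=len(X))
  X_boot, y_boot = X[idx], y[idx]

  # SMOTE: interpolate new minority-class (default) observations to rebalance the data.
  X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
  print(f"default rate before {y.mean():.1%}, after {y_res.mean():.1%}")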

Some AI and machine learning methods intrinsically generate their own data (generative adversarial networks, variational autoencoders). These methods can help stress test models and therefore improve their robustness as part of model validation. While they apply to high-dimensional data, it bears noting that simulating data, whether through an algorithmic approach or an approximation, can introduce additional model risk.

Typical Data Preprocessing, Including Feature Engineering

As described previously, AI and machine learning are increasingly leveraged in risk management to augment traditional risk models, because they realize the long-needed risk transformation by automating mundane tasks and processing large volumes of data.

Getting the data correct for AI and machine learning typically involves standard tasks. Several of these have been covered, such as data quality, cognitive data quality, and extending risk data to “alternative data”. The risk model lifecycle also routinely involves feature engineering and dimensionality reduction.

Feature engineering helps to automatically build a range of features from input data. By using simpler features, AI and machine learning become easier to interpret and, possibly, to maintain over time. Well-engineered features also reduce the effort needed to optimize parameters for machine learning. Feature engineering is also important because high-dimensional data tends to become sparse: as more features are added, the feature space grows, and the available observations cover it more thinly.
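
As a minimal sketch of feature engineering on transactional data, the following builds a few per-customer behavioral aggregates with pandas; the file, column names, and the 90-day window are hypothetical.

  # Minimal sketch of engineering simple behavioral features from transactional
  # data; the file, column names, and the 90-day window are hypothetical.
  import pandas as pd

  tx = pd.read_csv("transactions.csv", parse_dates=["tx_date"])
  cutoff = tx["tx_date"].max()
  recent = tx[tx["tx_date"] >= cutoff - pd.Timedelta(days=90)]

  features = recent.groupby("customer_id").agg(
      tx_count_90d=("amount", "size"),                              # transaction frequency
      avg_amount_90d=("amount", "mean"),                            # average spend
      max_amount_90d=("amount", "max"),                             # largest single transaction
      days_since_last_tx=("tx_date", lambda d: (cutoff - d.max()).days),
  )
  print(features.head())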

Certain machine learning algorithms are more sensitive to outliers, meaning that the output model is skewed toward the outlier population. Outliers are not easy to identify in a dataset and take time to interpret. Even well-trained analytics experts require time to explore the data for outliers and then determine what to do with them. Are they genuine observations or the result of an error? Outlier populations can be present in the data used to develop AI and machine learning because of system errors, for example, when updates occur or data files are transferred between different systems, or because of data entry errors caused by misinterpretation of internal policy, operational processes, and procedures.

It is also important to explore the relationships that exist between outlier populations, as these linked relationships can be helpful to interpret the outliers within, and between, datasets.

There are robust analytical techniques that are helpful for identifying outliers. One approach is principal component analysis (PCA). Another unsupervised technique is t-SNE (t-distributed stochastic neighbor embedding) data visualization. The resulting clusters or “outliers” can be studied further to determine whether they are real subpopulations that could be modeled separately or outliers caused by errors in the dataset.
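
A minimal sketch of both techniques with scikit-learn follows; the data is simulated, and the 99th-percentile threshold is illustrative only.

  # Minimal sketch of two unsupervised outlier checks: flagging candidates by
  # their PCA reconstruction error and producing a t-SNE embedding for visual
  # inspection.  The data is simulated for illustration.
  import numpy as np
  from sklearn.decomposition import PCA
  from sklearn.manifold import TSNE

  rng = np.random.default_rng(0)
  latent = rng.normal(size=(500, 3))
  loadings = rng.normal(size=(3, 10))
  X = latent @ loadings + 0.1 * rng.normal(size=(500, 10))  # data close to a 3-D subspace
  X[:5] += rng.normal(scale=5.0, size=(5, 10))              # inject a few gross outliers

  # PCA: observations poorly represented by the leading components have a large
  # reconstruction error and are candidate outliers.
  pca = PCA(n_components=3).fit(X)
  reconstruction = pca.inverse_transform(pca.transform(X))
  error = ((X - reconstruction) ** 2).sum(axis=1)
  candidates = np.where(error > np.quantile(error, 0.99))[0]
  print("candidate outlier rows:", candidates)              # should include rows 0-4

  # t-SNE: a 2-D embedding that can be plotted to inspect clusters and isolated points.
  embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)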

CONCLUDING REMARKS

For effective data management, organizations rely on effective dataset design and their prevailing risk and control self-assessment frameworks. These include tracking the accuracy, integrity, and completeness of data used for risk measurement and management. AI approaches typically, although not always, utilize larger quantities of data, and therefore data quality and data processing become more prominent. Technology enablers in the form of catalogs, lineage tracking, and transparent data and model pipelines enhance the controls and the auditability of data processes.

To fully take advantage of AI and machine learning, current impediments to effective data management can be more easily addressed using automated and repeatable processes.

ENDNOTES

  1. Naeem Siddiqi, When Lending Inequities Fuel Housing Disparities (Chicago: BAI, 2021).
  2. Samuel Cohen, Derek Snow, and Lukasz Szpruch, Black-Box Model Risk in Finance (Mathematical Institute, University of Oxford; The Alan Turing Institute; School of Mathematics, University of Edinburgh, February 9, 2021).