Chapter 13: Managing Your Data Journey

“You possess all the attributes of a demagogue; a screeching, horrible voice, a perverse, cross-grained nature, and the language of the marketplace. In you, all is united which is needful for governing.”

– Aristophanes, The Knights

In the previous chapters, we looked at the roles and responsibilities of the primary data personas, namely data engineers, data scientists, ML practitioners, business analysts, and DevOps/MLOps personas. One persona that we have not talked about much is that of an administrator. Administrators are the gatekeepers who hold the keys to deploying infrastructure, enabling users and principals on a platform, setting ground rules on who can do what, handling version upgrades, patches, security, and new features, and providing direction for business continuity and disaster recovery. What does all of this look like in a multi-tenant ecosystem where some lines of business have shared access to data and others don’t?

In particular, we will look at the following topics:

  • Provisioning a multi-tenant infrastructure
  • Democratizing data via policies and processes
  • Capacity planning
  • Managing and monitoring data
  • Data sharing
  • Data migration
  • Center of Excellence (COE) best practices

Let’s look at how an administrator goes about planning these various responsibilities. Delta may or may not play a direct role in each sub-area.

Provisioning a multi-tenant infrastructure

The administrator is tasked with setting up the infrastructure for the tenants of an environment. One question that often arises is what the optimum balance between collaboration and isolation should be. Creating a single deployment and putting everyone in it could lead to hitting rate limits and is not a sustainable strategy. Since we have the luxury of cloud elasticity, we can spin up as many environments as we wish to isolate data and users and provide better blast radius control in case of a security breach. Conversely, creating too many environments leads to harder governance and maintenance challenges, collaboration suffers, and the enablement cycle can be much longer.

Let’s examine the various scenarios:

  • Separate development, staging, and production environments.
  • Disaster recovery requires setting up a parallel production environment in a different region.
  • Different lines of business want separate, isolated environments.
  • Some lines of business wish to share some resources.

Within an environment, there will be several data personas who need access to compute and storage. Sometimes, these resources can be shared, and at other times, they need to be isolated. Some workloads need regulatory compliance, such as Payment Card Industry (PCI), Health Insurance Portability and Accountability Act (HIPAA), System and Organization Control (SOC), and General Data Protection Regulation (GDPR), which require more attention. Other times, there is a risk of Intellectual Property (IP) exposure and data exfiltration that needs additional sensitivity around both the compute and storage handling.

No matter what multi-tenancy strategy you choose, here are some key considerations:

  • Data should remain in an open format on cheap, reliable cloud storage so it can be accessed from multiple environments by multiple tools and frameworks. Delta with underlying Parquet is an ideal file format for this.
  • The users of an organization are already provisioned in a central Identity Provider (IDP), so it is a good idea to sync them into your environments using something such as System for Cross-Domain Identity Management (SCIM), as sketched after this list. The advantage is that groups can be synced, so if users get added or removed, that change is automatically picked up.
  • These principals (users, groups, and service principals) have entitlements that define what they can access and whether they are authorized to view, update, or delete that data. Again, these entitlements should be defined once and synced into the various environments so that a user is not inadvertently granted access in one environment, which could lead to data and IP exfiltration.
  • Admins themselves can be of several types. There may be a super admin who can override everything; there may be other account-level admins supporting the super admin and helping to create other environments. Then, each of these environments will be assigned its own admin who has jurisdiction over only a particular environment.
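
To make the SCIM point concrete, the following is a minimal, hypothetical sketch of provisioning a group into an environment through a SCIM 2.0 endpoint. The base URL, token, and member IDs are placeholders, and in practice most platforms let the IDP drive this sync automatically rather than an admin calling the API by hand:

    # A minimal sketch of pushing a group into an environment via a SCIM 2.0 API.
    # The base URL, token, and member IDs are placeholders; consult your
    # platform's SCIM documentation for the exact endpoint and auth scheme.
    import requests

    SCIM_BASE = "https://<your-environment>/api/scim/v2"   # hypothetical endpoint
    TOKEN = "<admin-api-token>"                             # placeholder

    payload = {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:Group"],
        "displayName": "data-engineering",
        "members": [{"value": "<idp-user-id-1>"}, {"value": "<idp-user-id-2>"}],
    }

    resp = requests.post(
        f"{SCIM_BASE}/Groups",
        json=payload,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/scim+json",
        },
    )
    resp.raise_for_status()
    print(resp.json()["id"])  # the provisioned group's ID in the target environment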

Delta, being an open format, can be accessed from any environment if the access privileges allow for it. In the next section, we will look at the roles and responsibilities of an admin for a given environment.

Data democratization via policies and processes

If everything is locked down, then there is no threat of exposure. However, that is not the intended agenda of data organizations. Getting the relevant data into the hands of the right, privileged audience helps a company innovate by allowing people to explore and discover new, meaningful ways to derive business value from the data. IT should not be the bottleneck in the process of data democratization. If new datasets are brought in, IT should not be overwhelmed with tickets from every part of the organization requesting access to them. So, enabling self-service with appropriate security guardrails is an important responsibility of an administrator. This is where policies play an important role in policing an environment, either by preventing an unintended situation from taking place or by running scans to detect patterns and report against it, so that bad actors or novices can be corrected in time.

Policies can be of several types; some typical examples include the following:

  • Restricting the type and size of compute used in an environment:
    • A 200-node cluster when a 20-node cluster would suffice is wasteful and can rack up a hefty bill. Similarly, using expensive GPU nodes when a workload requires only CPUs can be prevented by providing an allowable list of node types and a maximum cluster size.
  • Enforcing the use of tags:
    • In a shared environment, usage and billing attribution can be challenging. If every cluster/job is tagged with a team name, then the chargeback model and reporting dashboard are simpler to interpret. So, enforcing a consistent tagging and naming convention will help in the long run (see the policy sketch after this list).
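
As an illustration, here is a sketch of what such guardrails could look like when expressed as a Databricks-style cluster policy. The attribute names follow the cluster policy convention, but the node types, limits, and tag values are assumptions for this example; adapt them to whatever your platform supports:

    import json

    # A sketch of a cluster policy that caps cluster size, restricts node types
    # to CPU-only instances, and makes a team tag mandatory.
    policy = {
        "autoscale.max_workers": {"type": "range", "maxValue": 20},
        "node_type_id": {
            "type": "allowlist",
            "values": ["i3.xlarge", "i3.2xlarge"],   # no GPU node types allowed
        },
        "custom_tags.team": {"type": "unlimited", "isOptional": False},  # tag must be supplied
        "autotermination_minutes": {"type": "fixed", "value": 60},
    }

    print(json.dumps(policy, indent=2))  # the JSON payload an admin would register as a policy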

There might be one-off cases, such as a team genuinely needing 200 nodes, where the provided policy templates do not work for a particular team. It is an exception rather than the norm, unaccounted for because it is very specific and rare. Affected teams may get blocked and reach a stalemate, demanding that policies be loosened to take care of these special situations. You should not buckle and give in to these demands, as loosening the global policies would hurt the majority of scenarios and undermine the control and governance they were put in place to provide. A better way to handle it is through a process where the affected team gets an exemption approved by a higher-level business executive who justifies the usage, and a new policy is then created only for that specific group.

A good way to see how policies can be defined is to look at the features offered by a managed platform such as Databricks. So, now we have established the ground rules that others in an environment are going to play by, although rules are made to be broken. Later in this chapter, we will see how to audit adherence to policies and report against non-compliance.

Capacity planning

Data volumes are constantly growing. Capacity planning is the art and science of arriving at the right infrastructure to cater to the current and future needs of a business. It has several inputs, including the incoming data volume, the volume of historical data that needs to be retained, the SLAs for end-to-end latency, and the kind of processing and transformations that are done on the data. It is directly linked to your ability to sustain scalable growth at a manageable cost point. We may be tempted to think that leveraging the elasticity of cloud infrastructure absolves us from capacity planning, which is incorrect!

So, how do you go about forecasting demand? The simplest way is to use a sliver of data, establish a pilot workstream, take the memory, compute, and storage metrics, and project them out for the full workload, adding in some buffer for growth, and then repeat this for every known use case while keeping a buffer for unplanned activity. This exercise needs to be done over a 12-month period; in some cases, it may be longer.
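
The arithmetic behind such a projection is simple; the following sketch extrapolates pilot measurements to a full workload with a growth buffer. All the numbers are illustrative assumptions, not recommendations:

    # Back-of-the-envelope capacity projection from a pilot run.
    # Every input below is an illustrative placeholder; substitute your own pilot metrics.
    pilot_input_gb = 50            # data processed per day by the pilot workstream
    pilot_core_hours = 40          # daily compute consumed by the pilot
    pilot_storage_gb = 120         # storage footprint of the pilot (raw + curated)

    full_input_gb = 5_000          # expected daily volume for the full use case
    growth_buffer = 1.25           # 25% headroom for growth and unplanned activity
    months = 12

    scale = full_input_gb / pilot_input_gb

    monthly_core_hours = pilot_core_hours * scale * 30 * growth_buffer
    storage_after_year_gb = pilot_storage_gb * scale * months * growth_buffer  # assuming history is retained

    print(f"Projected compute: {monthly_core_hours:,.0f} core-hours/month")
    print(f"Projected storage after {months} months: {storage_after_year_gb:,.0f} GB")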

The next thing is to determine what percentage of this capacity is needed for lower environments, such as development and staging, or for a production environment in a different region set up for business continuity or disaster recovery purposes. ML planning is a little harder because it is a true scientific experiment, and data scientists typically run hundreds of architectures before converging on one. In that case, using time and compute budgets to come up with baselines is a more pragmatic approach. The worst thing that can happen is to be surprised by high costs and have a use case scaled back or completely shut off because the initial projections were too low.

Managing and monitoring

Every organization has policies around data access and data use that need to be honored. In addition, some regulated industries have compliance guidelines that require proving adherence, using an audit trail of user access and data manipulation. Hence, there is a need to put controls in place, detect whether something has been changed, and provide a transparent audit trail. This covers access to the raw data as well as access via tables, which are an artifact on top of the data.

The metrics collected from these logs need to be compared over a period of time to understand trend lines. Delta’s versioning capability comes in handy to monitor not only the operations done on a table but also the metrics logged against it. It is fair to say that these metrics need more permanence, so they should be logged with a date/time stamp.
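
For example, a Delta table’s history can itself be queried as part of a monitoring job. Here is a minimal sketch using the delta-spark Python API, with the table name as a placeholder and a SparkSession assumed to be available:

    # Inspect the recent operations on a Delta table as part of a monitoring job.
    # Assumes a SparkSession named `spark` configured with the delta-spark package;
    # the table name is a placeholder.
    from delta.tables import DeltaTable

    tbl = DeltaTable.forName(spark, "finance.transactions")

    # The last 20 commits: who did what, when, and with which operation parameters.
    history_df = tbl.history(20).select(
        "version", "timestamp", "userName", "operation", "operationParameters"
    )
    history_df.show(truncate=False)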

There are several types of logs in a system. The main ones include the following:

  1. Audit logs:
    • Who is doing what in the various environments – that is, an audit trail of user actions. For example, someone deleted a cluster or edited a pipeline workflow definition. Parsing the events/actions captured in the audit log helps to detect anomalous patterns that may need timely rectification.
    • Administrators, security, and compliance auditors value this information.
  2. Cluster logs:
    • The Spark UI provides a wealth of information around driver/worker logs, stdout/stderr and log4j outputs, and reflects the usage of jobs and the associated data crunching.
    • Administrators, data engineers, and ML personas often analyze these logs.
  3. Spark metrics:
    • Metrics around Spark jobs/stages/tasks, such as the volume of data read, crunched, and written. Shuffle, spill, and garbage collection details help in understanding performance.
    • Typically useful for support and performance engineers looking to debug bottlenecks.
  4. Virtual Machine (VM) and system metrics:
    • This is the CPU/memory/storage utilization of the underlying VMs.
    • Performance can be monitored via tools such as Ganglia, Datadog, and other monitoring agents.
    • Typically useful for support and performance engineers looking to debug bottlenecks.
  5. Custom logging:
    • These are purely purpose-built to provide additional data points, such as detecting data or model drift, to alert data engineers and ML practitioners to reconsider a newer iteration of their baseline logic or model.

The following diagram provides a framework to monitor and manage the various logs in a multi-tenant environment:

Figure 13.1 – Logging and monitoring your data environments

Certain activities, such as SSO provisioning, SCIM integration, and audit logging, are set up once for all environments. Others, such as cluster usage and billing logs, are typically collected per environment but benefit from being rolled up into a centralized view. This is done by directing them to a centralized cloud storage location, segregated by environment. Marrying the audit logs and cluster logs can be very powerful for understanding your most expensive workloads and users, and it helps with demand forecasting as well as tuning activities. This may look like a lot of work, but the price of failure and non-compliance is very high. So, all mature organizations should plan to have a solid strategy around logging and monitoring, as these provide a lot of telltale signals about the health of pipelines and how well they are being governed and managed.
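
For instance, once audit and cluster usage logs land in centralized storage as Delta tables, a simple aggregation can surface the heaviest consumers. The table and column names below are assumptions for illustration; align them with however your logs are actually landed:

    # Combine audit events with cluster usage to rank the heaviest consumers.
    # Table and column names are hypothetical; a SparkSession named `spark` is assumed.
    from pyspark.sql import functions as F

    audit = spark.table("ops.audit_logs")       # one row per user action, keyed by cluster_id
    usage = spark.table("ops.cluster_usage")    # hourly cost rows per cluster

    cost_per_cluster = usage.groupBy("cluster_id").agg(F.sum("dbu_cost").alias("total_cost"))
    actions_per_user = audit.groupBy("cluster_id", "user_name").agg(F.count(F.lit(1)).alias("actions"))

    # Attribute each cluster's cost to the users acting on it and rank them.
    top_consumers = (
        actions_per_user.join(cost_per_cluster, "cluster_id")
                        .orderBy(F.desc("total_cost"))
                        .limit(20)
    )
    top_consumers.show(truncate=False)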

Data sharing

Data sharing may seem to contradict much of the collaboration and isolation thinking we reviewed in earlier sections. When groups or lines of business have a lot of data dependencies, they are usually housed together to facilitate better collaboration, and if they do not have any operational dependencies, they can be segregated in their own environments – for example, HR and marketing may sit in their own domain meshes. However, what happens if they need to share some insights? There should be a way to promote this, as it leads to better stakeholder engagement and improves enterprise value. Yet all the careful architecting done to prevent accidental exposure would now have to be reconsidered, which means a lot of unnecessary complexity and re-architecting. Also, replicating data to a shared location will lead to the copies getting out of sync. Thankfully, Delta Sharing comes to the rescue.

A simple, open, and secure way to share data can be achieved through Delta Sharing, without requiring multiple copies of data or any vendor lock-in propositions. We covered this in a previous chapter, so we will not go into the mechanics of it again. Suffice it to say that the Delta Sharing server brokers the exchange between the data provider and the data recipient, and it helps facilitate any BI/AI use case, using any tool, on any cloud, including on-premises.
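
As a quick refresher, the recipient side can be as simple as the following sketch, which uses the open source delta-sharing Python client (pip install delta-sharing). The profile file and the share, schema, and table names are placeholders supplied by the data provider:

    # Read a shared table on the recipient side without copying data or
    # consuming the provider's compute. Names below are placeholders.
    import delta_sharing

    profile = "config.share"                       # credentials file from the provider
    client = delta_sharing.SharingClient(profile)
    print(client.list_all_tables())                # discover what has been shared with you

    # Load one shared table directly into pandas.
    url = f"{profile}#retail_share.gold.daily_sales"
    df = delta_sharing.load_as_pandas(url)
    print(df.head())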

Figure 13.2 – Delta Sharing

But first, let’s see how it is different from other similar offerings:

  • Secure and cost-effective, with zero compute cost, barring some egress charges when sharing cross-region.
  • Vendor-agnostic, multi-cloud, and open source.
  • Table/partition/DataFrame-level abstraction.
  • Scalable with predicate pushdown and object store bandwidth.
  • Assets are not limited to data; Delta Sharing allows other assets, such as dashboards and models, to be shared as well.

Now, let’s explore the main use cases benefiting from it:

  • Data sharing between Lines of Business (LOBs) that do not typically have operational dependencies:
    • Expensive complex rearchitecting can be avoided.
    • Moreover, there may be a case where one LOB receives data from two or more siloed LOBs and is responsible for joining and consolidating all the pieces, with or without maintaining anonymity. This enables new use cases that would otherwise have been very difficult to pull off.
  • Additional opportunities for data monetization:
    • The time and cost it has taken to curate and validate datasets can be useful to other organizations, allowing for an additional revenue stream.
  • Data sharing between disparate architectures:
    • All that is needed for Delta Sharing is a laptop and the ability to read Parquet.
    • In a multi-cloud deployment, it serves as the glue between the various disparate environments and regions.

In the next section, we will look at another opportunity for large-scale data movement. However, unlike the other scenarios, this is usually a one-time operation.

Data migration

Technologies are constantly evolving. It is important to choose a platform and architecture that is future-proof and extensible and that supports a pluggable paradigm to play nicely with the other tools of an ecosystem. By gravitating towards open data formats, open source tooling, and cloud-based architectures with separation of compute and storage, you can dodge the main bullets. Still, there will come a time when an existing platform is no longer sustainable and needs a complete overhaul. Examples we have seen in recent years include migrations from Hadoop-based systems, which are complex and difficult to manage, to cloud-native data platforms. The same is true of expensive data warehousing solutions such as Netezza, Teradata, and Exadata. Migration projects are expensive, time-consuming, and critical to the overall value of business and technology investments, and they need to be planned and executed very carefully.

How will you determine whether to patch an existing system or rebuild it with a newer tech stack? The main driving forces are as follows:

  • A high Total Cost of Ownership (TCO), which continues to grow as the volume of data grows.
  • Inflexible systems that worked well for some use cases but are no longer conducive to newer use cases, either because building them is very difficult or, in some cases, just not possible. For example, leveraging unstructured data for advanced analytics is not something that traditional databases support.
  • Vendor lock-in to proprietary formats and technologies where integration with other parts of a data ecosystem is difficult or unsupported.

To mitigate risk and ensure an on-time and on-cost migration, a phased approach is typically followed:

  1. A discovery phase where the existing workload is examined using a profiler and consultative approach to benchmark what workloads were running in older environments and bucket them into three areas of straightforward, moderate, and high complexity.
  2. The next phase is that of assessing the best fit of people, processes, and technology. This is where tooling and partners are lined up to examine what automation can be achieved by existing tooling and accelerators.
  3. This is followed by a migration workshop phase where all concerned stakeholders come together to draw up a master plan and strategy to execute upon. A technology mapping exercise is done at a finer granularity to determine which are the lift and shift jobs and which ones need architecting in the new environment, along with ballpark effort and cost figures.
  4. Next comes the pilot planning phase, where a representative use case is chosen to execute upon, the reference architecture is drawn up, and a follow-up roadmap is defined.
  5. Everything up to this point has been a paper exercise. The pilot implementation phase is where the rubber meets the road: common pain points are addressed and serve as lessons learned for subsequent iterations. The two environments remain up in parallel, allowing for easy validation.
  6. The same is done for all workloads, and rigorous testing and data reconciliation activities are done to prove the migration completion point. After a cutover, the older environment is gradually shut down and permanently retired.

The main technology mappings to consider are as follows:

  • Data storage
  • Metadata storage
  • Code migration around compatible libraries and APIs
  • Data processing and transformations
  • Security
  • Orchestration of jobs and workflows

If data is not in Delta format, files can be converted to Delta using one of these options:

  • Convert a Parquet table to Delta:

    CONVERT TO DELTA <parquet table>

  • Convert files to Delta format and create a table using that data:

    CONVERT TO DELTA parquet.`/data-path/`

    CREATE TABLE <delta table> USING DELTA LOCATION '/data-path/'

  • Convert a non-Parquet format such as ORC to Parquet and then to Delta:

    CREATE TABLE <parquet table> USING PARQUET OPTIONS (path '/data-path/')

    CONVERT TO DELTA <parquet table>

One thing to keep in mind while porting Delta tables is to avoid bypassing the transaction log; it is the source of truth, and writing around it can cause inaccuracies. Of course, running VACUUM on a Delta table will remove the files backing versions older than the retention period and make it look similar to a plain Parquet dataset. You can also generate a manifest file that can be read by other processing engines, such as Presto and Athena, using the following:

GENERATE symlink_format_manifest FOR TABLE <delta table>
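
For teams working in notebooks rather than pure SQL, the same conversion and manifest steps can be expressed with the Delta Lake Python API. The following is a minimal sketch, with paths as placeholders and a SparkSession assumed to be available:

    # Equivalent conversion steps using the delta-spark Python API.
    # Paths are placeholders; a SparkSession named `spark` is assumed.
    from delta.tables import DeltaTable

    # Convert an existing Parquet directory in place (pass a partition schema
    # string as the third argument if the data is partitioned).
    DeltaTable.convertToDelta(spark, "parquet.`/data-path/`")

    # Generate a symlink manifest so engines such as Presto or Athena can read the table.
    DeltaTable.forPath(spark, "/data-path/").generate("symlink_format_manifest")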

In the next section, we will look at the need to establish a center of excellence within an enterprise and its roles and responsibilities.

COE best practices

Establishing an internal steering committee/team as a Center of Excellence (COE) for advanced analytics is a complex process. Its primary purpose is to provide a blueprint to onboard data teams and enable them with technical and operational practices, support for handling issues and tickets, and executive alignment to ensure that technical investments align with business objectives and that value can be realized and quantified. The role is that of an enabler and governance overseer, but never to the point of becoming a bottleneck. In some organizations, the COE team is responsible for managing all or part of the infrastructure and the shared data ingestion process, and for ratifying vendor tools and frameworks for internal consumption. They are either funded directly or compensated through a chargeback model by the individual lines of business that they service.

The foundational blocks include the following aspects:

  • Cloud strategy: Which cloud to use, whether multi-cloud should be considered, and which workloads belong where, depending on special regulatory and compliance guidelines such as FedRAMP
  • Architecture blueprints: For common data patterns so that reusable assets are created once, hardened, and used multiple times
  • Security and governance: The deployment model and the mapping of entitlements to principals so that privileges are never misused or misinterpreted

The next set of concerns focus on data ingestion and network connectivity:

  • What tools, platforms, and licenses are approved in the areas of ingestion, ETL, streaming, migrations, warehousing, data exploration/visualization, and reporting.
  • Individual members from LOBs request assistance via a prescribed process, such as a ServiceNow ticket.
  • Providing scalable operations using Infrastructure As Code (IaC). For example, a data user files a ticket to request assistance in debugging a connectivity issue or enabling a new feature or a new environment. What is the set of prerequisites to facilitate the process?
  • Best practices for logging, monitoring, cost, and performance tuning.
  • Training and enablement sessions for all data personas. All documentation and training material should be maintained in internal-facing wikis, SharePoint, or similar.
  • Building central dashboards where downstream teams can view their usage and billing.

The final set of concerns focus on specialized AI/ML and data science activities:

  • Guidance around combining business domain knowledge with ML activities so that models are interpretable and explainable. Is business value being realized?
  • Guidelines around data asset sharing (features data, model, and insights).
  • Model acceptance criteria that not only the line of business but also the entire enterprise is accountable for. For example, an insurance model approving loan applications has to be fair and balanced; otherwise, the credibility of the entire organization is at stake.

Usually, a COE is a cross-functional group of members who serve a wide range of users with varying skill sets, from novice data citizens to more advanced players. Their responsibilities fall into the following main buckets:

  • Governance and setup
  • Deploying infrastructure. Different organizations have different policies around the types of environments to provide. Typical ones include the following:
    • Sandbox: Users can bring in their own datasets and experiment in a shared environment. Usually, non-sensitive data is allowed.
    • Developer: This is typically for a data team and can be shared. Access to data is well guarded.
    • Staging: This mimics production. You may have read-only access to production data and access to other secure non-production data.
    • Production: Usually, this is off limits to all data teams. All jobs run as service principals and are deployed through CI/CD pipelines.
  • Security, monitoring, and alerts/actions:
    • Data and intellectual property loss are the main concerns addressed here; logging, monitoring, and auditing ensure that only people with the right privileges have access to sensitive data.
    • DevOps and application life cycle management to ensure a high level of automation.
  • Approving data projects:
    • Typically, a data team comes with its proposal, which includes what datasets it plans to use, the volume of data, its ETL and consumption strategies, and the use cases it intends to solve, along with an approximate cost estimate in terms of time, resources, and money. A score is assigned to the project to indicate its priority, COE resources are then assigned to help onboard the team, and a chargeback model ensures that individual teams manage their own operational budgets.
  • Nurture and growth:
    • Enable the data community with training and how-tos to facilitate adoption.
    • Provide support to unblock users around tooling and infrastructure.
    • Create reusable assets to facilitate the adoption journey.
  • Attain executive sponsorship:
    • This ensures that the vision and mission are clearly articulated with well-defined success criteria. Process efficiencies and accountability are put in place early on.

The establishment of a COE is no guarantee that all data and ML initiatives will be successful, but it is a foundational piece in any enterprise’s data journey to ensure a sound governance body for better data management and decision making.

Summary

The previous chapters focused on the role of data engineers, data scientists, business analysts, and DevOps/MLOps personas. This chapter focused on the admin persona who plays a pivotal role in an organization’s data journey by enabling the infrastructure, onboarding users, and providing data governance and security constructs so that self-service can be fully and safely democratized. We looked into various tasks, such as COE duties and responsibilities and data migration efforts, among others, which require admins to do a lot of the heavy lifting. Consolidating data into a common, open format such as Parquet with a transactional protocol such as Delta helps in use cases involving data sharing and migrations. It is important to keep in mind that technology and business users need to plan an enterprise’s data initiatives together to make sure that the insights generated are relevant and useful for the enterprise.
