In the previous chapters, we looked at the roles and responsibilities of the primary data personas: data engineers, data scientists, ML practitioners, business analysts, and DevOps/MLOps personas. One persona that we have not talked about much is the administrator. Administrators are the gatekeepers who hold the keys to the platform: deploying infrastructure, enabling users and principals, setting ground rules on who can do what, handling version upgrades, patches, security, and new feature enablement, and providing direction for business continuity and disaster recovery. What does all this look like in a multitenant ecosystem where some lines of business have shared access to data and others don’t?
In particular, we will look at the following topics:
Let’s look at how an administrator goes about planning these various responsibilities. Note that Delta relates to some of these sub-areas more directly than others.
The administrator is tasked with setting up the infrastructure for the tenants of an environment. One question that often arises is what the optimum balance of collaboration and isolation should be. Creating a single deployment and putting everyone in it could lead to hitting rate limits and is not a sustainable strategy. Since we have the luxury of cloud elasticity, we can turn on as many environments as we wish to isolate data and users and provide better blast radius control in case of a security breach. Conversely, creating too many environments leads to harder governance and maintenance challenges, collaboration suffers, and the enablement cycle could be much longer.
Let’s examine the various scenarios:
Within an environment, there will be several data personas who need access to compute and storage. Sometimes, these resources can be shared, and at other times, they need to be isolated. Some workloads need regulatory compliance, such as Payment Card Industry (PCI), Health Insurance Portability and Accountability Act (HIPAA), System and Organization Control (SOC), and General Data Protection Regulation (GDPR), which require more attention. Other times, there is a risk of Intellectual Property (IP) exposure and data exfiltration that needs additional sensitivity around both the compute and storage handling.
No matter what multi-tenancy strategy you choose, here are some key considerations:
Delta, being an open format, can be accessed from any environment if the access privileges allow for it. In the next section, we will look at the roles and responsibilities of an admin for a given environment.
If everything is locked down, then there is no threat of exposure. However, locking everything down is not the agenda of a data organization. Getting the relevant data into the hands of the right privileged audience helps a company innovate by allowing people to explore and discover new, meaningful ways to add business value from the data. IT should not be the bottleneck in the process of data democratization. If new datasets are brought in, IT should not be overwhelmed with tickets from every part of the organization requesting access to them. So, enabling self-service with appropriate security guardrails is an important responsibility of an administrator. This is where policies play an important role in policing an environment, either preventing an unintended situation from taking place or reporting on it by running scans to detect patterns, so that bad actors or novices can be corrected in time.
Policies can be of several types; some typical examples include the following:
There might be one-off cases where the provided policy templates do not work for a particular team, such as a team genuinely needing 200 nodes. This is an exception rather than the norm, and it has not been accounted for because it is very specific and rare. Teams may get blocked and reach a stalemate, demanding that policies be loosened to accommodate these special situations. You should not buckle and give in to these demands, as doing so would hurt the majority of scenarios; the policies were put in place for better control and governance. A better way to handle it is a process whereby the affected team gets an exemption approved by a higher-level business executive who justifies that usage, and a new policy is then created only for that specific group.
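The exemption process described above can be sketched as a simple pre-flight check: a global policy caps cluster size, and an exemption list (populated only through executive sign-off) grants a specific team a higher limit. The team names, limits, and function names here are hypothetical illustrations, not a real platform API:

```python
# Hypothetical policy guardrail: a global node cap, with per-team exemptions
# that are only added after executive approval (as described in the text).
DEFAULT_MAX_NODES = 50

# Exemptions approved through the sign-off process; hypothetical team name.
EXEMPTIONS = {"genomics-research": 200}

def allowed_nodes(team: str) -> int:
    """Return the node cap that applies to a given team."""
    return EXEMPTIONS.get(team, DEFAULT_MAX_NODES)

def validate_cluster_request(team: str, requested_nodes: int) -> bool:
    """Reject requests above the team's cap instead of loosening the global policy."""
    return requested_nodes <= allowed_nodes(team)
```

The key design point is that the global default never changes; only the narrowly scoped exemption list grows, keeping the blast radius of each exception small.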
A good place to examine how to define policy is to look at a feature offering from a managed platform such as Databricks. So, now, we’ve established ground rules that others in an environment are going to play by, although rules are made to be broken. In the next section, we will see how to audit the adherence to policies and report against non-compliance.
Data volumes are constantly growing. Capacity planning is the art and science of arriving at the right infrastructure that caters to the current and future needs of a business. It has several inputs, including the incoming data volume, the volume of historical data that needs to be retained, the SLAs for end-to-end latency, and the kind of processing and transformations that are done on the data. It is directly linked to your ability to sustain scalable growth at a manageable cost point. We may be tempted to think that leveraging the elasticity of cloud infrastructure absolves us from capacity planning, which is incorrect!
So, how do you go about forecasting demand? The simplest way is to take a sliver of data, establish a pilot workstream, take the memory, compute, and storage metrics, and project them out for the full workload, adding some buffer for growth. You then repeat this for every known use case, while keeping a further buffer for unplanned activity. This exercise needs to cover a 12-month period; in some cases, it may be longer.
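The projection arithmetic described above can be sketched as follows; all of the numbers and buffer percentages are hypothetical placeholders, not recommendations:

```python
def project_capacity(pilot_metric: float, pilot_fraction: float,
                     growth_buffer: float = 0.2,
                     unplanned_buffer: float = 0.1) -> float:
    """Scale a pilot measurement (e.g., node-hours or GB stored) taken on a
    sliver of data up to the full workload, then pad it with buffers for
    growth and unplanned activity."""
    full_workload = pilot_metric / pilot_fraction
    return full_workload * (1 + growth_buffer) * (1 + unplanned_buffer)

# Hypothetical pilot: processing 5% of the data consumed 12 node-hours,
# so the full workload with buffers lands at roughly 317 node-hours.
estimate = project_capacity(12, 0.05)
```

In practice, you would run this per use case and sum the results, then repeat the exercise month by month across the planning horizon.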
The next thing is to determine what percentage of this capacity to allocate to lower environments, such as development and staging, or to a production environment in a different region set up for business continuity or disaster recovery purposes. ML planning is a little harder because it is a true scientific experiment, and data scientists typically run hundreds of architectures before converging on one. In that case, using time and compute budgets to establish baselines is a more pragmatic approach. The worst thing that can happen is to be surprised by high costs and have a use case turned down or completely shut off because the initial projections were too low.
Every organization has policies around data access and data use that need to be honored. In addition, some regulated industries have compliance guidelines that require proof of adherence, in the form of an audit trail of user access and manipulation of the data. Hence, there is a need to be able to set the controls in place, detect whether something has been changed, and provide a transparent audit trail. This includes access to raw data as well as via tables that are an artifact on top of the data.
The metrics collected from these logs need to be compared over a period of time to understand trend lines. Delta’s versioning capability comes in handy to monitor not only the operations done on a table but also the metrics logged against it. These metrics need more permanence, so they should be recorded with a date/time stamp.
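As a minimal sketch of timestamped metric logging, the snippet below records each metric with a UTC date/time stamp so that values can later be compared over time; the in-memory list and field names stand in for what would, in practice, be a durable table (for example, a Delta table):

```python
import datetime

metrics_log = []  # stand-in for a durable, append-only metrics table

def log_metric(name: str, value: float) -> dict:
    """Record a metric with a UTC timestamp so trend lines can be built later."""
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "metric": name,
        "value": value,
    }
    metrics_log.append(entry)
    return entry
```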
There are several types of logs in a system. The main ones include the following:
Certain activities such as SSO provisioning, SCIM integration, and audit logging are provisioned once for all environments. Others such as cluster usage and billing logs are typically collected per environment but may benefit from rolling up as a centralized view. Marrying the audit logs and cluster logs can be very powerful to understand your most expensive workloads and users and help with demand forecasting, as well as tuning activities. This is done by directing them to a centralized cloud storage location, segregated by environment. This may look like a lot of work, but the price of failure and non-compliance is very high. So, all mature organizations should plan to have a solid strategy around logging and monitoring, as they provide a lot of telltale signals about the health of pipelines and how well they are being governed and managed.
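The "marrying" of audit logs and cluster billing logs described above boils down to a join on a shared cluster identifier followed by a roll-up per user. The sketch below uses plain Python dictionaries in place of real log records, and all field names and values are illustrative assumptions:

```python
from collections import defaultdict

# Illustrative records: audit logs attribute cluster runs to users,
# while billing logs record the cost of each cluster run.
audit_logs = [
    {"cluster_id": "c1", "user": "alice"},
    {"cluster_id": "c2", "user": "bob"},
    {"cluster_id": "c3", "user": "alice"},
]
billing_logs = [
    {"cluster_id": "c1", "cost_usd": 120.0},
    {"cluster_id": "c2", "cost_usd": 45.0},
    {"cluster_id": "c3", "cost_usd": 60.0},
]

def cost_by_user(audit, billing):
    """Join the two log streams on cluster_id and roll costs up per user."""
    owner = {rec["cluster_id"]: rec["user"] for rec in audit}
    totals = defaultdict(float)
    for rec in billing:
        totals[owner[rec["cluster_id"]]] += rec["cost_usd"]
    return dict(totals)
```

At scale, the same join would run as a scheduled query over the centralized log storage location, feeding both chargeback reports and the demand forecasting exercise discussed earlier.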
This is in tension with many of the collaboration and isolation concepts we reviewed in earlier sections. When groups or lines of business have a lot of data dependencies, they are usually housed together to facilitate better collaboration, and if they do not have any operational dependencies, they can be segregated into their own environments – for example, HR and marketing may be in their own domain meshes. However, what happens if there is a need for them to share some insights? There should be a way to promote this, as it leads to better stakeholder engagement that improves enterprise value. Yet all the painful architecting done to prevent accidental exposure would now have to be reconsidered, which means a lot of unnecessary complexity and re-architecting. Also, replicating data to a shared location will lead to the two copies getting out of sync. Thankfully, Delta Sharing comes to the rescue.
A simple, open, and secure way to share data can be achieved through Delta Sharing, without requiring multiple copies of the data or any vendor lock-in. We covered this in a previous chapter, so we will not go into the mechanics again. Suffice it to say that the Delta Sharing server brokers the exchange between the data provider and the data recipient, and it helps facilitate any BI/AI use case, using any tool on any cloud, including on-premises.
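For context, a recipient typically connects with nothing more than a small profile file issued by the data provider. The sketch below shows the general shape of such a Delta Sharing profile; the endpoint and token values are placeholders, not working credentials:

```json
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "<token issued by the data provider>"
}
```

Because the profile carries everything the client needs, the provider never has to copy data out or provision accounts in the recipient's environment.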
But first, let’s see how it is different from other similar offerings:
Now, let’s explore the main use cases benefiting from it:
In the next section, we will look at another opportunity for large-scale data movement. However, unlike the other scenarios, this is usually a one-time operation.
Technologies are constantly evolving. It is important to choose a platform and architecture that is future-proof and extensible and supports a pluggable paradigm to play nicely with the other tools of an ecosystem. By gravitating towards open data formats, open source tooling, and cloud-based architectures with separation of compute and storage, you can dodge the main bullets. Still, there will come a time when the current setup is no longer sustainable and the whole data platform needs a refreshing overhaul. Some examples of this that we’ve seen in recent years are migrations from Hadoop-based systems, which are complex and difficult to manage, to cloud-native data platforms. The same is true of expensive data warehousing solutions such as Netezza, Teradata, and Exadata. Migration projects are expensive, time-consuming, and critical to the overall value of a business’s tech investments, and they need to be planned and executed very carefully.
How will you determine whether to patch an existing system or rebuild it with a newer tech stack? The main driving forces are as follows:
To mitigate risk and ensure an on-time and on-cost migration, a phased approach is typically followed:
The main technology mappings to consider are as follows:
If data is not in Delta format, files can be converted to Delta using one of these options:
CONVERT TO DELTA <parquet table>
CONVERT TO DELTA parquet.`/data-path/`
CREATE TABLE <delta table> USING DELTA LOCATION '/data-path/'
CREATE TABLE <parquet table> USING PARQUET OPTIONS (path '/data-path/')
CONVERT TO DELTA <parquet table>
One thing to keep in mind while porting Delta tables is to avoid bypassing the transaction log, as it retains the source of truth; bypassing it can cause inaccuracies. Of course, running VACUUM on a Delta table will remove the files belonging to previous versions and make it look similar to a plain Parquet table. You can also generate a manifest file that can be read by other processing engines, such as Presto and Athena, using the following:
GENERATE symlink_format_manifest FOR TABLE <delta table>
In the next section, we will look at the need to establish a center of excellence within an enterprise and its roles and responsibilities.
Establishing an internal steering committee/team as the Center of Excellence (COE) for advanced analytics is a complex process. Its primary purpose is to provide a blueprint to onboard data teams and enable them with technical and operational practices, support for handling issues and tickets, and executive alignment to ensure that technical investments align with business objectives and that value can be realized and quantified. The role is that of an enabler and a governance overseer, but never to the point of becoming a bottleneck. In some organizations, the COE team is responsible for managing all or part of the infrastructure and the shared data ingestion process, and for ratifying vendor tools and frameworks for internal consumption. They are either funded directly or compensated via a chargeback model by the individual lines of business that they service.
The foundational blocks include the following aspects:
The next set of concerns focus on data ingestion and network connectivity:
The final set of concerns focus on specialized AI/ML and data science activities:
Usually, a COE is a cross-functional group of members who serve a wide range of users with varying skill sets, from novice data citizens to more advanced players. Their responsibilities fall into the following main buckets:
Different organizations have different policies around the types of environment to provide. Typical ones include the following:
The establishment of a COE is no guarantee that all data and ML initiatives will be successful, but it is a foundational piece in any enterprise’s data journey to ensure a sound governance body for better data management and decision making.
The previous chapters focused on the role of data engineers, data scientists, business analysts, and DevOps/MLOps personas. This chapter focused on the admin persona who plays a pivotal role in an organization’s data journey by enabling the infrastructure, onboarding users, and providing data governance and security constructs so that self-service can be fully and safely democratized. We looked into various tasks, such as COE duties and responsibilities and data migration efforts, among others, which require admins to do a lot of the heavy lifting. Consolidating data into a common, open format such as Parquet with a transactional protocol such as Delta helps in use cases involving data sharing and migrations. It is important to keep in mind that technology and business users need to plan an enterprise’s data initiatives together to make sure that the insights generated are relevant and useful for the enterprise.