This Book Comes with Free Online Content
With this book, you get unlimited access to web-based CTA exam prep tools like flashcards and exam tips.
Figure 2.1 – CTA online resources dashboard
To unlock the content, you’ll need to create an account using your unique sign-up code provided with this book. Refer to the Instructions for Unlocking the Online Content section in the Preface on how to do that.
Accessing the Online Content
If you’ve already created your account using those instructions, visit packt.link/ctabookwebsite or scan the following QR code to quickly open the website.
Figure 2.2 – QR Code to access CTA online resources
Once there, click the Login link in the top-right corner of the page to access the content using your credentials.
As you learned in the previous chapter, the expectations are high for a Salesforce CTA. The architect is expected to have a deep understanding of the platform’s features and capabilities. In addition to a deep understanding of particular architectural concepts, the architect is also expected to have a breadth of knowledge across many architectural domains. Data is one of the key architectural domains that architects need to master. Data is all around you in your daily life, and every day, you discover new potential for it. From sales to services, marketing, business intelligence, and artificial intelligence, data is at the heart of today’s modern systems.
In this chapter, you are going to cover the following main topics:
People have used databases in their day-to-day activities for centuries. Although they have only been given the name “databases” recently, they have been developed for years with more and more use cases invented for them. Most modern applications utilize a database of some sort. Theoretically, a database is simply a collection of related data. The software system that manages this data is called a database management system (DBMS). The DBMS is also responsible for controlling access to the database.
Databases have evolved over the years from simple file-based systems to sophisticated cloud-based relational database management systems and in-memory databases.
Understanding the problems of file-based systems could help you avoid challenges that could occur in modern database systems. File-based systems were designed for a specific set of use cases. This was primarily driven by an attempt to digitalize the activities that are usually done. Take the manual filing system as an example. Enterprises and some libraries used to have an organized way to store the different files and books they owned. A typical arrangement would see these assets stored in labeled cabinets. The cabinets themselves might have locks to control who could open them and who could not, and an entire set of cabinets could be kept in secured rooms or areas to ensure the right security measures were taken. The simplest way to find a document in this arrangement would be to go through all the documents one by one until you found what you were after.
Eventually, indexing systems were used to help locate a specific file more quickly. In other words, you could have divisions in the filing system or a summarized index sheet that pointed to the location of each stored document.
This system works as long as only a small number of items are stored, or if all you want to do is simply store and retrieve items. You can probably imagine the complexity of retrieving cross-reference data or trying to get any intelligence out of the gathered data.
Relational databases became very popular at the end of the last century. They dominated the enterprise landscape and are still relevant today. The data in RDBs is split across multiple related tables, where each row of these tables has a unique identifier called a primary key. Related tables can be linked by referencing those primary keys with foreign keys. Eventually, the concept of an alternate key or secondary key was introduced. Like the primary key, a secondary key contains a unique identifier for each row. However, for one reason or another, it has not been chosen as the primary key for the table.
In the Salesforce world, each object has a unique ID field that represents a primary key. Lookup and master-detail relationship fields act as foreign keys for the child object. Unique external keys are a good example of alternate keys.
Data is normally accessed using a structured query language (such as SQL or SOQL in Salesforce). The speed of retrieving this data is impacted by several factors, such as the underlying infrastructure, the way data is identified and indexed, and the amount of data that can be stored and retrieved while the data is unindexed.
Relational databases transactions are defined by the following four characteristics (you can remember these with the acronym ACID):
Relational databases are ideal for complex operations and data analysis tasks. They are designed to value consistency over availability and provide a rigid structure that the data must fit into.
In the 1990s, when the internet took off, a new challenge arose since web applications started to produce data that was not necessarily structured or organized and was sometimes difficult to fit into such a rigid structure. With this came the rise of non-relational databases, which are now known as not only SQL or NoSQL databases.
Non-relational databases share the following three qualities (you can remember these with the acronym BASE):
Non-relational databases are highly scalable, though this comes at a cost. Data consistency is not guaranteed at every point in time, which means that different users might see different versions of the same data at the same time – even though the data would eventually end up in a consistent state. Non-relational databases – quite the opposite of relational databases – value availability over consistency.
While designing your Salesforce solution, you might find that you are presented with a challenge that requires a highly available and scalable database system. Heroku supports some of the most popular NoSQL databases as add-ons, such as MongoDB and CouchDB.
The following use cases will help you understand more about the usage of NoSQL:
The Salesforce Platform is not designed for use cases where you need to receive and ingest tons of incoming data. Think of an IoT scenario where there is normally a need for a highly available and scalable platform to receive, ingest, and aggregate the data before processing it or transferring it to another platform that would handle the processing. This is a use case where a platform such as Heroku can add a lot of value to your solution, especially since it has a built-in connector to the Salesforce Platform via Heroku Connect. During the review board, avoid mistakes that would end up overstretching one of your solution components, such as driving IoT inbound communications into Salesforce directly.
Salesforce is slightly different from regular RDBMS. As an experienced Salesforce Architect, you are already aware that it is a multi-tenant database where the data of different tenants is separated using OrgIDs. However, you need to know about some of the concepts that could be considered suboptimal in standard RDBMS design but are fine in the Salesforce world, such as the following:
However, these functionalities normally create some supporting data behind the scenes in order to drive visibility requirements. You need to understand how they work and anticipate what kind of challenges you might come across while using each different sharing approach. You will also cover this in Chapter 6, Formulating a Secure Architecture in Salesforce.
Now that you understand the main differences between classic RDBMS and Salesforce, move on and explore a common topic between the two, which is also key to the success of any enterprise’s data strategy—data governance.
Data governance, in an enterprise context, is a data management concept that aims to ensure a high level of data quality throughout the complete life cycle of the data.
The data governance concept can be extended to several focus areas. Enterprises typically focus on topics such as data usability, availability, security, and integrity. This includes any required processes that need to be followed during the different stages of the data life cycle, such as data stewardship, which ensures that the quality of the data is always up to a high standard, and other activities that ensure the data is accessible and available for all consuming applications and entities.
Data governance aims to do the following:
Data governance bodies usually create and maintain the following artifacts:
A well-designed data governance program typically includes a combination of data stewards and a team that acts as a governing body. They work together to create the required standards and policies for governing the data, as well as implementing and executing the planned activities and procedures. This is mainly carried out by the data stewards.
A data steward is a role within the enterprise and is responsible for maintaining and using the organization’s data governance processes to ensure the availability and quality of both the data and metadata. The data steward also has the responsibility to utilize policies, guidelines, and processes to administer the organization’s data in compliance with given policy and/or regulatory obligations. The data steward and the data custodian may share some responsibilities.
A data custodian is a role within the enterprise and is responsible for transporting and storing data, rather than topics such as what data is going into the system and why. Data stewards are normally responsible for what is stored in datasets, while data custodians cover the technical details, such as environment and database structure. Data custodians are sometimes referred to as database administrators or extract transform load (ETL) developers.
Now that you understand the activities that are covered by the data governing body, as well as data stewards and custodians, move on to have a look at one of the key topics they need to cover in their data strategy—data security.
Data security is one of the greatest concerns for enterprises today, especially with the ever-increasing amount and value of collected data. It is all about protecting digital data from the actions of unauthorized users (such as data leaks or breaches) or from destructive forces. As part of that, there are a few concepts that you need to become familiar with, including encryption, backup and restore, data masking, and data erasure. You will get to know each of them in the next sections.
Data encryption can be applied at multiple levels and stages of the data life cycle. This includes when the data is stored at its final data store (encryption at rest) and while data is in motion, moving from one system to another (encryption in transit).
Encryption in transit is typically achieved by encrypting the message before it is transmitted and decrypted at the destination. This process intends to protect data while being transferred against attackers who could intercept the transmission or what are sometimes referred to as man-in-the-middle attacks. This is normally achieved by utilizing a secure channel such as HTTPS, although higher levels of security can be applied. You will do a deep dive into this topic in Chapter 3, Core Architectural Concepts: Integration and Cryptography, to better understand how encryption algorithms work and how they are used to exchange data in a secure manner.
Encryption at rest is all about storing the data that has been encrypted. This makes it impossible to read and display the decrypted version of it without having access to a specific encryption key. Some applications or platforms provide this out of the box. This is a protection mechanism against attackers who can gain access to the database or to the physical disk where the data is stored.
Salesforce Shield provides an encryption solution for encrypting data at rest. This is applicable to the filesystem, the database, and the search index files. If you are planning to use Salesforce shield as part of your solution, you need to highlight that clearly in your landscape architecture.
Backup and restore solutions are used to ensure data is available in a safe location/source in case there is a need to restore or recover it. In most industries, it is essential to keep a backup of any operational data. And most importantly, you must have a clear restoration strategy. Data restoration is typically more challenging than backing it up as it comes with additional challenges, such as restoring partial data, reference data, and parent-child records and relationships.
Note
Salesforce announced that effective July 31, 2020, data recovery as a paid feature would be deprecated and no longer available as a service. However, based on customers’ feedback, Salesforce decided to reinstate its data recovery service. Then, during Autumn 2021, Salesforce announced a new built-in platform with a native backup and restore capability.
Due to this, it is important to create a comprehensive data backup and restore strategy as part of your data governance strategy. There are several tools that can be used to back up and restore data from and to the Salesforce Platform, including Salesforce’s Backup and Restore, in addition to some AppExchange products. A custom-made solution through implementing ETL tools is also possible, despite the additional build cost associated with it. As an architect, you are expected to be able to walk your stakeholders through the various options that are available, as well as the potential pros and cons.
Note
During the review board, you are expected to come up with the best possible solution technically. Cost should not be a consideration unless clearly mentioned in the scenario. Buy versus build decisions always tend to pick the buy option due to its quick return on investment.
Data masking (also known as data obfuscation) of structured data is the process of covering the original data with modified content. This is mainly done to protect data that is classified as personally identifiable information (PII) or sensitive commercial or personal data. An example is masking national identity numbers to display only the last four digits while replacing all other digits with a static character, such as a wildcard. Data is normally obfuscated to protect it from users, such as internal agents, external customers, or even developers (who normally need real production-like data to test specific use cases or fix a particle bug) to be compliant with regulatory requirements.
There are two common techniques for data obfuscation, namely pseudonymization, and anonymization. Here is a brief description of the two:
Which technique amongst these two you should choose depends on the degree of risk associated with the masked data and how the data will be processed. Pseudonymous data still allows some sort of reidentification (even if it is remote or indirect), while anonymous data cannot be reidentified. A common way to anonymize data is by scrambling data, a process that can sometimes be reversible; for example, “London” could become “ndooln.” This masking technique allows a part of the data to be hidden with a static or random character. On the other hand, data blurring uses an approximation of data values to make it impossible to identify a person or to make the data’s meaning obsolete.
Data erasure (also referred to as data clearing, data destruction, or data wiping) is a software-based activity where specific data is overwritten with other values to destroy electronic data and make it unrecoverable. This is different from data deletion, even though they sound the same.
Data deletion can leave data in a recoverable format (for example, by simply removing the reference to it from an index table while still maintaining it on the storage disk). Data erasure, on the other hand, is permanent and particularly important for highly sensitive data. It is important to understand the difference between these terms so that you can suggest the best strategy to your stakeholders while also taking into consideration the limited control they have over how the data is ultimately stored in Salesforce.
It is worth mentioning that encrypted data can be destroyed/erased permanently by simply destroying the encryption keys.
Another key topic the data governing body needs to cover is data regulatory compliance. With the increased amount of gathered customer and business data, it has become essential to introduce rules that govern the use of that data. As an architect, you must be aware of these regulations to design a fully compliant solution. You will likely need to work with subject matter experts to ensure your solution fulfills all regulatory requirements, but you should still be able to cover a good amount of that by yourself. You also need to be able to explain how your solution is compliant with these regulations to your stakeholders.
There are several data regulations in place based on industry standards and government regulations. These have been created to mitigate the risk of unauthorized access to corporate, personal, or government data. Some of the regulations that you may come across more frequently these days while designing a Salesforce solution include the following:
Failure to comply with these regulations can result in hefty fines and public damage to the enterprise’s reputation, in addition to civil and criminal liability. The details of these regulations are not covered in this book, but you are encouraged to seek additional knowledge online and from other books. You will come across requirements that must be resolved, along with considerations regarding some of the previously mentioned regulations, in the chapters to come.
PCI DSS, also known as PCI compliance, is an information security standard that’s applicable to all companies involved in activities related to credit card information, such as accepting, processing, storing, or transmitting credit card information from credit card providers. Salesforce billing became PCI level 1 compliant in 2012 and has retained its compliance every year since.
You maintain compliance with PCI by never storing any credit card information in Salesforce at all payment method collection stages (before, during, and after). The payment card’s information should only be transmitted to the payment processors via a token and never stored within Salesforce. Each of these tokens is unique per customer, payment card, merchant, and payment processor. Once the token has been submitted, the payment processor can link it to the actual personal account number stored.
Using tokens, Salesforce can store a representation of the customer’s payment card without storing the actual details of the card itself. If the token falls into the wrong hands, it is useless because it only works when it is being used by the original merchant and payment processor.
The information that can be stored in Salesforce while still maintaining PCI compliance includes data such as the following:
It is important to understand how online payment mechanisms work and how to select the right one for your solution. Suggesting a non-PCI-compliant solution during the review board could open the door to many other discussions that you do not want to have. When you suggest a suboptimal solution, the judges will try to understand the rationale behind your suggestion. Failing to show enough knowledge and understanding of regulations and compliance could lead to exam failure.
For online payments, you need to know that there are generally two main ways to integrate with a payment gateway. You can either create a checkout page that is hosted on your own website (for example, Salesforce communities) or utilize a checkout page that is provided and hosted by your payment gateway.
In terms of the externally hosted checkout approach, your checkout page is hosted on a different website provided by the payment provider. When the customer clicks on the checkout button on your website, they are redirected to the payment page, which is hosted on a different domain name. The order information and cart total amount will normally be shown on the payment page. The customer will enter their card details on this page and then hit the submit button. Notice that the card details are entered on the external payment website only; your pages and forms are never used to capture such details. Once the payment is successful, the customer is redirected to a landing page on your website. A confirmation or transaction code will also be passed back to your website.
The hosted checkout approach has the following advantages:
Regarding the onsite checkout approach, the customer provides their payment details on a page hosted on their own website (for example, Salesforce communities). You have full control over the look and feel of this page, but you are also responsible for the implementation of every required functionality, validation, and security measure.
The onsite checkout approach has the following advantages:
If you are taking payments via a hosted page, it is much easier to fill in and submit the required forms to stay PCI-compliant. This also makes the maintenance process simpler. If there is no clear need to go with an onsite checkout approach, then it is always safer to suggest an externally hosted page approach.
Now that you have covered some of the key topics you need to know about regarding data regulations, move on to understand the different data categories that are available.
Reference data and master data are two common data categories that an architect would typically come across in most projects. In addition to these, in this section, you will also discover the characteristics of transactional data, reporting data, metadata, big data, and unstructured data. Having a deeper understanding of these different data categories will help you craft your overall data strategy, including data governance. This will also help you speak the same language your data architects prefer to use. Have a look at each of them closely.
Transactional data is generated by regular business transactions. It describes business events. Normally, it is the most frequently changing data in the enterprise. Transactional data events could include the following:
Transactional data is normally generated and managed by operational systems, such as CRM, ERM, and HR applications.
Enterprises normally provide key business information that supports daily transactions. Such data normally describes customers, products, locations, and so on.
Such data is called master data, and it is commonly referred to as parties (employees, customers, suppliers, and so on), places (sites, regions, and so on), and things (products, assets, vehicles, and so on).
The usual business operations normally author/create and use master data as part of the normal course of business processes. However, operational applications are usually designed for an application-specific use case for the master data. This could result in a misalignment with the overall enterprise requirement of high-quality, commonly used master data. This would result in the following:
Master data management (MDM) is a concept widely used to describe the discipline where IT and business work together to ensure the accuracy and uniformity of the enterprise master data using specific tools and technologies. Maintaining a single version of the truth is one of the highest-priority topics on the agenda of most organizations.
A master data management tool is used to detect and remove duplicates, mass maintain data, and incorporate rules to prevent incorrect data from being entered.
There are different MDM implementation styles, such as the registry, consolidation, and coexistence styles. These styles are the foundation that MDM tools are based on. The business type, its data management strategy, and its situation will largely impact which style is selected.
The main difference between these implementation styles is in the way they deal with data, as well as the role of the MDM tool itself (it is a hub that controls the data or synchronizes data with other data stores). Here are three common ones in use:
In this approach, data is not sent back to the source systems. Changes to master data will continue to take place in source systems. The MDM tool assumes that the source systems can manage their own data quality. When a 360-degree view of a particular data asset is required (for example, the customer needs it), the MDM tool uses each reference system to build the 360-degree views in real time using the unique global identifier. There is normally a need for a continuous process to ensure the unique global identifier is still valid for a given dataset.
During these processes, the tool might use configured field weights. Fields with higher weights are normally considered more important factors when it comes to determining the attributes that would eventually form the golden record.
It is also good to understand how the matching logic normally works. There are two main matching mechanisms:
As a Salesforce Architect, you need to understand the capabilities and limitations of the out-of-the-box tools available in the platform, in addition to the capabilities that are delivered by key MDM players in the market—particularly those that have direct integration with Salesforce via an AppExchange application.
In your review board presentation, as well as in real life, you are expected to guide your stakeholders and explain the different options they have. You should be able to suggest a name for a suitable tool. It is not enough to mention that you need to use an MDM tool as you need to be more specific and provide a suggested product name. This is applicable to all third-party products you may suggest. You may also be asked to explain how these MDM tools will be used to solve a particular challenge. You need to understand the MDM implementation style that is being adopted by your proposed tool.
Think of data such as order status (created, approved, delivered, and so on) or a list of country names with their ISO code. You can also think of a business account type such as silver, gold, platinum, and so on. Both are examples of reference data.
Reference data is typically static or slowly changing data that is used to categorize or classify other data. Reference datasets are sometimes referred to as lookup data. Some of this reference data can be universal (such as the countries with ISO codes, as mentioned earlier), while others might be domain-specific or enterprise-specific.
Reference data is different from master data. They both provide context to business processes and transactions. However, reference data is mostly concerned with categorization and classification, while master data is mainly related to business entities (for example, customers).
Data organized in a specific way to facilitate reporting and business intelligence is referred to as reporting data. Data used for operational reporting either in an aggregated or non-aggregated fashion belongs in this category.
Reporting data is created by combining master data, reference data, and transactional data.
Metadata has a very cool definition; it is known as the data that describes other data. eXtensible Markup Language (XML) is a commonly used metadata format.
A simple example is the properties of a computer file, that is, its size, type, author, and creation date. Salesforce utilizes metadata heavily since most of its features can be described using well-structured metadata. Custom objects, workflows, report structures, validation rules, page layouts, and many more features can all be extracted from the Salesforce Platform as metadata. This enables automated merge and deployment processes, something that you will cover in the chapters to come.
Big data refers to datasets that are too massive to be handled by traditional databases or data processing applications; that is, datasets containing hundreds of millions (even billions) of rows. Big data became popular in the past decade, especially with the decreased cost of data storage and increased processing capacity.
Businesses are increasingly understanding the importance of their data and the significant benefits they can gain from it. They do not want to throw away their data and want to make use of it for many purposes, including AI-driven decisions and personalized customer-centric services and journeys. It is worth mentioning that there have been some critics of the big data approach. Some describe it as simply dumping the data somewhere in the hope that it will prove to be useful someday. However, it is also clear that when big data is used in the right way, it can prove to be a differentiator for a particular business. The future seems promising for big data, which is a key ingredient for machine learning algorithms.
As an architect, you should be able to guide the client through combining the technologies required to manage a massive quantity and variety of data (likely in a non-relational database) versus relational data (to handle complex business logic).
Unstructured data became more popular due to the internet boom. This data does not have a predefined structure and therefore cannot be fit into structured RDBMSs. In most cases, this type of data is presented as text data; for example, a PDF file where text mining would help with extracting structure and relevant data from this unstructured document.
The rise of big data and unstructured data takes us straight to our next topic—data warehousing and data lakes.
Data warehouse (DW or DWH) is a central repository of current and historical data that has been integrated from one or more disparate sources. The DWH (also referred to as an enterprise data warehouse (EDW)) is a system that is used for data analysis and reporting. It is usually considered the core of an enterprise business intelligence strategy.
Data stored in a DWH comes from multiple systems, including operational systems (such as CRM systems). The data may need to undergo a set of data cleansing activities before it can be uploaded into the DWH to ensure data quality.
Some DWH tools have built-in extract, transform, and load (ETL) capabilities, while others rely on external third-party tools (you will cover ETL tools and other integration middleware in Chapter 3, Core Architectural Concepts: Integration and Cryptography). This ETL capability will ensure that the ingested data has a specific quality and structure. Data might be staged in a specific staging area before it is loaded into the DWH. You need to become familiar with the names of some popular data warehousing solutions, such as Amazon Redshift, which is a fully managed, cloud-based data warehouse.
A data lake is a system or repository of data that is stored in an unstructured way. Data is held in its rawest form; it is not processed, modified, or analyzed. Data lakes accept and store all kinds of data from all sources. Structured, semi-structured, processed, and transformed data can also be stored in data lakes (such as XML data or data coming from databases). The data that is gathered will be used for reporting, visualization, business intelligence, machine learning, and advanced analytics. You need to become familiar with the names of some popular data lake solutions, such as AWS Lake Formation.
You also need to know which solution you should propose and the reason. If you need to offer a platform that can provide historical trending reports, deep data analysis capabilities, and the ability to report on a massive amount of data, then a DWH is more suitable for your use case. Keep in mind that data that has been extracted from Salesforce is mostly in a structured format (files are an exception).
In many review board scenarios, you will come across the need to archive the platform data. You will discover the options that are available in Chapter 7, Designing a Scalable Salesforce Data Architecture, but at the moment, all you need to know is that one of these options is a DWH.
Data lakes could be the right solution if you need to store structured and unstructured data in one place to facilitate tasks such as machine learning or advanced analytics.
Another common requirement in today’s enterprise solutions is document management systems. You will learn more about these in the next section.
Electronic documents are everywhere nowadays in enterprises. An electronic document can simply be thought of as a form of electronic media content that can be used either in its electronic form or as printed output. This does not include system files or computer applications. A document management system (DMS) is a computer application that is used to store, track, and manage electronic documents throughout their life cycles. This includes activities related to these documents, such as versioning and workflows.
Document management overlaps with content management, even though they were originally different in nature. They are often seen combined today as enterprise content management (ECM) systems, mainly because many of today’s electronic documents that are available for a particular enterprise are not necessarily generated by the enterprise itself.
Document management systems today differ in terms of complexity, features, and scope; from simple standalone applications to enterprise-scale systems that can serve massive audiences. Some of the key capabilities available in today’s ECM systems are as follows:
As a Salesforce Architect, it is important to understand the capabilities of document management systems and enterprise content management. You must also understand what additional tools are needed to incorporate a full-fledged DMS into your architecture.
Salesforce comes with built-in features that can deliver some DMS capabilities. It is important to understand the capabilities and limitations of each (including availability since some features may retire with time). These features are as follows:
Most enterprises will already have a DMS of some sort; for example, enterprise-wide applications such as Microsoft SharePoint, Box, Google Drive, One Drive, and others. You will most likely need to integrate with these systems to access and manage external documents within the Salesforce UI rather than create things from scratch.
To do that, you can use another out-of-the-box tool in Salesforce called files connect. It is important to get some hands-on experience with files connect to get a better understanding of how it works. And, since it has its own limitations, it is also important to become familiar with some of the other alternatives available on AppExchange. You are not expected to know the names and capabilities of all the tools under the sun but just make yourself familiar with some of them and understand exactly what they are capable of.
As an architect, you will come across several DMS requirements and challenges. You will also definitely come across business requirements that require your skills and experience to design a solid data architecture. Moving on to the next section, you will learn more about key data architecture concepts.
Previously, you understood the importance of data, as well as the importance of data architecture for the enterprise. The data architect—whom you, as a Salesforce Architect, need to work closely with or, in smaller projects, act as one—needs to tackle the data architecture in a similar fashion to normal building architecture. And to a good extent, similar principles should be followed in software architecture.
First, it is imperative to understand the business processes and create a conceptual and logical blueprint. Then, you need to know the underlying technology to build a detailed design and implementation. To understand these better, take a closer look at the three stages/levels of data architecture design—the conceptual, logical, and physical levels.
The data architect, at the conceptual-level stage, needs to gain a deep understanding of business knowledge, business processes, and operations. This includes knowledge related to data flows, business rules, and the way data is used or expected to be used in the enterprise. This could be extended to areas such as financial, product, marketing, and industry-specific knowledge domains (such as manufacturing, insurance, and so on). The data architect can then build a blueprint by designing data entities that represent each of the enterprise’s business domains.
The architect will also define the various taxonomies and data flow behind each business process. The data blueprint is crucial for building a robust and successful data architecture. On many occasions, this activity is done at the project level rather than the enterprise-wide level. And it is usually included as part of the business analyst’s activities. Typically, the following areas need to be covered:
This data architecture design is sometimes also referred to as data modeling. At this stage, the architect will start considering the type of technology being dealt with and define the data formats to use. The logical data model connects business requirements to the technology and database’s structure. The data architect should consider the standards, capabilities, best practices, and limitations of each of the underlying systems or databases. Data flows must be defined clearly at this stage. Typically, the following areas need to be covered:
This data architecture design is also known as the internal level. This is the lowest level of data design. At this stage, the architect is concerned with the actual technical aspects of the underlying system or database. It can even go as far as defining how the data is stored in the storage devices (such as within the specific folder structure).
As a Salesforce Architect, you are likely going to be involved in the conceptual design and, more likely, expected to be involved in the logical design. Your involvement in the physical design is not going to be as intensive and mostly be in an advisory context.
In Chapter 1, Starting Your Journey as a CTA, you covered the importance of creating a data model diagram during the review board exam. This is likely to be based on the logical mode (you will see several examples throughout this book). In real life, a logical model diagram is also required during the design phase of your project, while a physical model diagram is a must-have during the implementation phase.
You are expected to document (or lead the activity of documenting) the full Salesforce physical data model of your solution. This includes providing a full description of each standard and custom object used, how it is used (even for standard objects, it is not enough to mention that you are using the account object; for example, you must clearly explain how it is used), and the source of data in each integrated field, with a clear description of all involved data flows.
Next, you will understand more about the principles behind designing and properly documenting a data model.
The importance of your data model’s design cannot be emphasized enough. It might get overlooked in many cases in the Salesforce world due to the fact that Salesforce comes with a pre-built data model, in addition to great flexibility for creating and changing custom fields and objects.
However, the proper data model design could be the difference between a smart, flexible, and scalable solution (that delivers valuable real-time insight that justifies all efforts required to enter the data) and a poorly designed database with tons of dumped data (that gives an overall feeling that data entry is an overhead).
Your data model is the foundation of your solution. Several solution aspects depend on your data models, such as the data sharing and visibility strategy and your ability to identify large data volume (LDV) objects.
You can read through the solution by understanding its data model. Your solution will be as strong as its foundation. You need to get your data model design right, and to do so, you need to understand key concepts in data modeling. This starts with normalization versus denormalization, going through three of the standard normal forms for database design, and finally, understanding the common relationship types between database tables and how that is reflected in Salesforce.
Normalization is the process of arranging data in a database efficiently based on its relationships. This approach aims to remove data redundancy as it wastes disk space, slows down queries, and costs more processing time to execute create, read, update, and delete (CRUD) operations. Moreover, redundant data might also increase data inconsistency. For example, when the same data persists in multiple places and gets updated in one of them, you need to ensure that the change gets reflected in all other occurrences; otherwise, you may risk having data inconsistencies. Normalization should aim to get rid of data redundancy, but not at the cost of losing data integrity.
Normalization is based on the concept of what is known as normal forms. The dataset must meet specific criteria in order to be considered in one of these normal forms. There are three main normal forms (1NF, 2NF, and 3NF), all of which you will cover shortly, and others such as BCNF, 4NF, 5NF, and so on, which you are not going to cover in this book for brevity.
Denormalization can be considered the opposite of the normalization process. In a denormalized dataset, you intentionally use redundant information. This is done for several purposes but mostly to improve performance while executing queries and performing analytics. There is an associated overhead, of course, in keeping all the redundant data consistent and aligned.
The denormalization process reduces the number of tables (though it consumes more storage) and simplifies complicated table joins, which effectively enhances the performance while querying data that resides in multiple tables. The concept adopted by denormalization is that by placing all the data in one place, you could simplify the search process as it only needs to be executed on one table.
While designing your data model, you may come across use cases that could be better solved using a normalized set of tables. In other cases, denormalization could be the answer. You would understand the difference better with reference to these user stories.
The focus of this user story is on analytics/reports. Considering the standard reporting capabilities in Salesforce, it makes sense to store the account and account address details in two separate tables. This will also enable us to link the shipment records to the right address straight away, as well as build the desired report with minimal effort, even though Salesforce is doing additional processes behind the scenes to query data from multiple tables.
The following figure represents the proposed data model and an example of the data that is stored:
Figure 2.3 – Data model in a normalized form example
Now, explore the second user story.
The focus here is on the user experience while entering or viewing the data. In this case, it makes sense to use a denormalized dataset. These denormalized fields can easily be added to list views and page layouts. They can also be edited using fewer clicks.
You will come across more complicated scenarios where you could utilize a denormalized dataset to reduce data storage throughout this book. Although theoretically, a denormalized dataset consumes more storage data, in Salesforce, the data storage for the records of most objects is roughly 2 KB (with a few exceptions, such as person accounts and articles). This is true regardless of the number of fields in it, as well as if these fields are filled in or not (some field types are exceptions, such as rich text fields). As mentioned earlier in this book, there are some concepts of data modeling that may look different in Salesforce.
The following figure represents the proposed data model and an example of the data that is stored:
Figure 2.4 – Example of a data model in a denormalized form
The differences between normalized versus denormalized datasets can be summarized as follows:
The standard Salesforce data model is in normalized form by default. To further understand the normalization process, you need to understand the three main different types of normal forms.
As you explored earlier, normalization is all about arranging data in a database efficiently based on its relationships. There are three common forms of data normalization. Explore each of these next.
A database is considered in the first normal form if it meets the following conditions:
Figure 2.5 – Table 1, which does not meet the first normal form (1NF)
Figure 2.6 – Table 2, which does not meet the first normal form (1NF)
To bring the table shown in Figure 2.5 into the 1NF, you must split the table into the following two tables:
Figure 2.7 – Table 1 from Figure 2.5 modified to meet the first normal form (1NF)
A database is considered in the second normal form if it meets the following conditions:
Figure 2.8 – A table that does not meet the second normal form (2NF)
To bring the table into 2NF, you must split the table into the following two tables:
Figure 2.9 – The table from Figure 2.8 modified to meet the second normal form (2NF)
A database is considered in the third normal form if it meets the following conditions:
Figure 2.10 – A table that does not meet the third normal form (3NF)
To bring this table into 3NF, you must split the table into the following two tables:
Figure 2.11 – The table from Figure 2.10 modified to meet the third normal form (3NF)
Now that you have covered the three normalization normal forms, you will explore the types of relationships that can be created between the different database tables.
One of the goals of the normalization process is to remove data redundancy. To do that, you must divide tables into multiple related subject-based tables. There are three main types of relationships—one-to-one, one-to-many, and many-to-many. Take a look at these in more detail:
In Salesforce, this is particularly useful considering the rich sharing model available. You can implement one-to-one relationships in many ways, with the most common being to create a lookup relationship from each object to the other, with an automation script (trigger, workflow, process builder, and others) to ensure both lookup fields are always populated with the right values and each object is linked to its counterpart:
Figure 2.12 – One-to-one relationship
To have this relationship, you would need to use what are known as primary keys and foreign keys. In the child table, the primary ID would be the unique identifier for each record in that table, while the foreign key would refer to the unique identifier of the parent’s primary key. This type of relationship can be created in Salesforce using lookup fields or master-detail fields. This book assumes you already know the differences between the two. You will also cover some particular differences that you need to remember in later chapters:
Figure 2.13 – One-to-many relationship
To create a many-to-many relationship between two tables, you need to create a third table, often called a bridge table or a junction table. This many-to-many relationship would effectively be broken into two one-to-many relationships. The junction table’s records would hold the value of two foreign keys, each pointing to a record from the two main tables.
Junction objects can be created in Salesforce by creating a custom object with two master-detail fields, or with two lookup fields (in this case, it is more often referred to as a bridge rather than a junction). Both approaches are valid and have their most suitable use cases.
You can even create a junction object using a mix of master-detail fields and a lookup field. This is assuming you put all the validation and business rules required in place to ensure they are always populated with the right values. You will come across examples of and use cases for Salesforce junction objects in later chapters:
Figure 2.14 – Many-to-many relationship
Keep in mind that the cardinality of these relationships can take different forms. For example, a one-to-many relationship might be better represented as one-to-zero-or-many if there are occurrences where the first table will have zero related records from the second table, as illustrated in the students and courses example earlier.
The following diagram illustrates a list of other relationships’ cardinalities. The one-and-only-one relationship could be a bit confusing. It is used when the relationship between two records from two different tables cannot be modified—for example, a user and a login ID. Once a login ID is associated with a user, it cannot be assigned to any other user after that. And it can be associated with one user only. This differs from the ownership relationship between a user and a mobile phone, for example, where the relationship itself might get updated in the future if the phone were to be sold to another owner:
Figure 2.15 – Table relationship cardinalities
In this chapter, you learned some historical information about database design, as well as how a proper design can be crucial to the overall Salesforce solution. You also learned about some key concepts, such as reference data, reporting data, and big data, in addition to general information about document management systems.
You then covered various data modeling principles and activities, including the process of normalizing and denormalizing data. You also explored other common architectural knowledge areas, such as normal forms, database relationships, and cardinality.
All this knowledge will be required when you are designing your secure and scalable Salesforce solution. In the next chapter, you will acquire another set of essential common architectural skills. This time, you will focus on the ever-growing and always-interesting integration domain.
Before you proceed to the next chapter, it is recommended that you go through the flashcards from this chapter first. These flashcards condense all the chapter concepts into smaller and easily manageable chunks that will help you with quick review and retention. By engaging with these flashcards, you will strengthen your understanding of key topics, identify areas that require further study, and build your confidence before moving on to new concepts.
The following image shows an example of the flashcards interface.
Figure 2.16 – CTA flashcards
To access the end-of-chapter flashcards from this chapter, follow these steps:
Figure 2.17 – Chapter summary and login
You can also scan the following QR code to access the website:
Figure 2.18 – QR code to access Chapter 2 flashcards
After a successful login, you will see the following screen:
Figure 2.19 – Chapter summary and end-of-chapter flashcards