CHAPTER 5

Vertical Federated Learning

We learned from Chapter 4 that horizontal federated learning (HFL) is applicable to scenarios in which participants' datasets share the same feature space but differ in sample space. Thus, HFL is well suited to building applications powered by massive numbers of mobile devices [McMahan et al., 2016a, McMahan and Ramage, 2017]. In those cases, the parties being federated are the individual consumers of applications, which can be considered a B2C (business-to-consumer) paradigm. However, in many practical scenarios, the participants of federated learning are organizations that collect different data features for the same group of people in pursuit of different business goals. Those organizations often have strong incentives to cooperate to improve business efficiency, which can be considered a B2B (business-to-business) paradigm.

For example, assume that a user has some records at a bank that reflect the user's revenue, expenditure behavior, and credit rating. The same user also has information stored at an e-commerce marketplace recording the user's online browsing and purchasing history. Although the feature spaces of the two organizations are quite different, they are closely associated with each other. For instance, the user's purchasing history may to some extent determine the user's credit rating. Such scenarios are common in real life. Retailers may partner with banks to offer personalized services or products based on the purchasing history and the expenditure behavior of the same user. Hospitals can collaborate with pharmaceutical companies to make use of the medical records of common patients so as to treat chronic diseases and reduce the risk of future hospitalization.

We categorize federated learning among participants whose datasets share the same sample space but differ in feature space as Vertical Federated Learning (VFL). The word "vertical" comes from the term "vertical partition," which is widely used in the context of the traditional tabular view of a database (e.g., the columns of a table are vertically partitioned into different groups, and each column represents a feature of all samples). In this chapter, we introduce VFL, covering its concept, architecture, algorithms, and open research challenges.

5.1    THE DEFINITION OF VFL

The datasets maintained by different organizations with different business goals usually have different feature spaces, while those organizations may share a large pool of common users, as illustrated in Figure 5.1. With VFL, also called feature-partitioned federated learning, we can leverage the heterogeneous feature spaces of the distributed datasets maintained by those organizations to build better machine learning (ML) models without exchanging or exposing the private data.

Figure 5.1: Illustration of VFL, a.k.a., feature-partitioned federated learning [Yang et al., 2019].

Under the federated learning framework, the identity and the status of each participating party are the same, and the federation helps every party establish a "commonwealth" strategy, which is why this paradigm is called "federated learning." For such a VFL system, we have:

    X_i ≠ X_j,  Y_i ≠ Y_j,  I_i = I_j,  ∀ D_i, D_j, i ≠ j,    (5.1)

where X and Y denote the feature space and the label space, respectively, I is the sample ID space, and the matrix D represents the data held by the different parties [Yang et al., 2019]. The objective for all parties is to collaboratively build a shared ML model by exploiting all the features collected by the participating parties.
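
To make the vertical partition concrete, the toy snippet below shows two parties holding different feature columns for a partially overlapping set of user IDs, which is exactly the setting described by Equation (5.1). All column names and values are invented for illustration.

import pandas as pd

# Toy illustration only: party A (a bank) and party B (an e-commerce site)
# hold different feature columns for a partially overlapping set of user IDs.
bank = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "income": [5000, 7200, 3100, 8800],
    "credit_rating": [680, 720, 590, 750],
})
shop = pd.DataFrame({
    "user_id": [2, 3, 4, 5],
    "monthly_purchases": [14, 3, 22, 9],
    "avg_basket_value": [35.0, 80.5, 12.9, 55.2],
})

# Same sample (ID) space after alignment, different feature spaces.
common_ids = sorted(set(bank["user_id"]) & set(shop["user_id"]))
print(common_ids)                                   # [2, 3, 4]
print(list(bank.columns.difference(shop.columns)))  # features only the bank holds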

In VFL settings, there are several underlying assumptions for achieving security and preserving privacy. First, it is assumed that the participants are honest-but-curious; that is, they attempt to deduce as much as possible from the information received from other participants, although they abide by the protocol without disturbing it in any way. Since they also intend to build a more accurate model, they do not collude with one another. Second, it is assumed that the information transmission process is secure and reliable enough to defend against attacks. It is further assumed that the communication is lossless and the intermediate results are not tampered with. A semi-honest third party (STP) may also join to assist the two parties. The STP is independent of both parties. The STP collects the intermediate results to compute the gradients and loss, and distributes the results to each party. The information that the STP receives from the participants is either encrypted or obfuscated. The participants' raw data are not exposed to each other, and each participant only receives the model parameters related to its own features.

Security definition of a VFL system. A VFL system typically assumes honest-but-curious participants. In a two-party case, for example, the two parties are non-colluding and at most one of them is compromised by an adversary. The security definition is that the adversary can only learn data from the party that it corrupted, but not data from the other party beyond what is revealed by the input and output. To facilitate secure computations between the two parties, an STP is sometimes introduced, in which case it is assumed that the STP does not collude with either party. Secure multi-party computation (MPC) provides formal privacy proofs for such protocols [Goldreich et al., 1987]. At the end of learning, each party only holds the model parameters associated with its own features; therefore, at inference time, the two parties also need to collaborate to generate the output.

5.2    ARCHITECTURE OF VFL

For ease of elaboration, we use an example to describe the architecture of VFL. Suppose that companies A and B would like to jointly train an ML model. Each of them holds its own data. In addition, B also holds the labels that the model needs for its prediction task. For user privacy and data security reasons, A and B cannot directly exchange data. To ensure data confidentiality during the training process, a third-party collaborator C can be involved. Here, we assume that C is honest and does not collude with A or B, while A and B are honest-but-curious. A trusted third party C is a legitimate assumption, since the role of C can be played by authorities such as governments or replaced by secure computing nodes such as Intel Software Guard Extensions (SGX) [Bahmani et al., 2017]. An example of a vertical federated learning architecture is illustrated in Figure 5.2a [Yang et al., 2019, Liu et al., 2019]. The training process of a VFL system typically consists of two parts. It first aligns the entities that share the same IDs across the two parties. Then, the encrypted (or privacy-preserving) model training is conducted on those aligned entities.

Part 1: Encrypted entity alignment. Since the user groups of the two companies A and B are not identical, the system uses an encryption-based user ID alignment technique, such as those of Liang and Chawathe [2004] and Scannapieco et al. [2007], to confirm the common users shared by both parties without A or B exposing their respective raw data. During entity alignment, the system does not expose users who belong to only one of the two companies, as shown in Figure 5.3.
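
As a rough sketch of the idea, the following code uses a keyed hash (HMAC) as a stand-in for the encryption-based alignment protocols cited above. It only illustrates the interface (each party blinds its IDs locally and only pseudonyms are compared); it does not provide the cryptographic guarantees of a real private set intersection scheme such as that of Liang and Chawathe [2004]. The key and the IDs are made up.

import hashlib
import hmac

def blind(user_ids, shared_key: bytes):
    # Map each raw ID to a keyed pseudonym so raw IDs are never exchanged.
    return {hmac.new(shared_key, uid.encode(), hashlib.sha256).hexdigest(): uid
            for uid in user_ids}

# Each party blinds its own IDs locally with a jointly derived key;
# how that key is agreed upon is outside the scope of this sketch.
key = b"jointly-derived-secret"            # placeholder value
a_side = blind(["u1", "u2", "u3"], key)    # computed at company A
b_side = blind(["u2", "u3", "u4"], key)    # computed at company B

# Only pseudonyms are compared; users outside the overlap are not revealed
# by the intersection, and each side maps matches back to its own raw IDs.
common = set(a_side) & set(b_side)
print(sorted(a_side[h] for h in common))   # ['u2', 'u3']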

Part 2: Encrypted model training. After determining the common entities, we can use these common entities' data to train a joint ML model. The training process can be divided into the following four steps (as illustrated in Figure 5.2b); a minimal code sketch of this message flow follows the step list.

Figure 5.2: Architecture for a vertical federated learning system [Yang et al., 2019].

Figure 5.3: Illustration of encrypted entity alignment [Cheng et al., 2019].

•  Step 1: C creates an encryption key pair, and sends the public key to A and B.

•  Step 2: A and B encrypt and exchange the intermediate results for gradient and loss calculations.

•  Step 3: A and B compute encrypted gradients and add an additional mask, respectively. B also computes the encrypted loss. A and B send encrypted results to C.

•  Step 4: C decrypts gradients and loss and sends the results back to A and B. A and B unmask the gradients, and update the model parameters accordingly.
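
The following minimal sketch walks through Steps 1–4 for a single toy value, assuming the python-paillier (phe) package for additively homomorphic encryption. The numbers, variable names, and the placeholder "gradient" expression are ours; the sketch only illustrates the message flow, not the concrete algorithm of Section 5.3.

# pip install phe  (the python-paillier package, assumed available)
from phe import paillier

# Step 1: collaborator C creates the key pair and distributes the public key.
pub, priv = paillier.generate_paillier_keypair(n_length=1024)   # priv stays with C

# Step 2: A and B exchange encrypted intermediate results. The numbers here
# are stand-ins; in Section 5.3.1 they are partial predictions and residuals.
partial_a = 0.6                        # computed by A from its own features
partial_b = -1.1                       # computed by B from its features and labels
enc_from_a = pub.encrypt(partial_a)    # ciphertext sent from A to B
enc_combined = enc_from_a + partial_b  # B combines the results under encryption

# Step 3: each party forms its encrypted (toy) gradient and adds a random
# mask before sending it to C; only A's side is shown.
mask_a = 0.37                          # random value known only to A
enc_masked_grad_a = enc_combined * 2.0 + mask_a

# Step 4: C decrypts and returns the masked value; A removes its mask and
# updates its local model parameters with the recovered gradient.
masked_grad_a = priv.decrypt(enc_masked_grad_a)    # done by C
grad_a = masked_grad_a - mask_a                    # done by A
print(grad_a)                          # equals (partial_a + partial_b) * 2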

5.3    ALGORITHMS OF VFL

In this section, we elaborate on two VFL algorithms to help readers better understand how VFL works.

5.3.1    SECURE FEDERATED LINEAR REGRESSION

The first algorithm is federated linear regression, which was first presented in Yang et al. [2019]. This algorithm exploits homomorphic encryption to protect the privacy of the local data belonging to each participating party during the training of the federated linear regression model. For ease of reference, the notations used in this section are summarized in Table 5.1.

Table 5.1: Notations

Symbol            Meaning
η                 Learning rate
λ                 Regularization parameter
⟦·⟧               Additively homomorphic encryption of a value
x_i^A, x_i^B      Features of sample i held by party A and party B, respectively
y_i               Label of sample i, held by party B
Θ_A, Θ_B          Model parameters associated with x_i^A and x_i^B
u_i^A, u_i^B      Partial predictions, u_i^A = Θ_A x_i^A and u_i^B = Θ_B x_i^B
d_i               u_i^A + (u_i^B − y_i)
N_A, n_A          Number of samples and number of features held by party A

To train a linear regression model with gradient descent methods, we need a secure way to compute the model loss and gradients. Given a learning rate η, a regularization parameter λ, a dataset {x_i^A}_{i∈D_A} held by party A and {x_i^B, y_i}_{i∈D_B} held by party B, and model parameters Θ_A and Θ_B corresponding to the feature spaces of x_i^A and x_i^B, respectively, the training objective can be written as:

    min_{Θ_A, Θ_B} Σ_i ‖Θ_A x_i^A + Θ_B x_i^B − y_i‖² + (λ/2)(‖Θ_A‖² + ‖Θ_B‖²).    (5.2)

Let u_i^A = Θ_A x_i^A and u_i^B = Θ_B x_i^B; the encrypted loss is:

    ⟦L⟧ = ⟦ Σ_i (u_i^A + u_i^B − y_i)² + (λ/2)(‖Θ_A‖² + ‖Θ_B‖²) ⟧,    (5.3)

where the additive homomorphic encryption operation is denoted by ⟦·⟧. Let ⟦L_A⟧ = ⟦ Σ_i (u_i^A)² + (λ/2)‖Θ_A‖² ⟧ and ⟦L_B⟧ = ⟦ Σ_i (u_i^B − y_i)² + (λ/2)‖Θ_B‖² ⟧; then

    ⟦L⟧ = ⟦L_A⟧ + ⟦L_B⟧ + 2 Σ_i ⟦u_i^A⟧ (u_i^B − y_i).    (5.4)

Similarly, let ⟦d_i⟧ = ⟦u_i^A⟧ + ⟦u_i^B − y_i⟧. Then, the gradients of the loss function with respect to the training parameters are given by

    ⟦∂L/∂Θ_A⟧ = Σ_i ⟦d_i⟧ x_i^A + ⟦λ Θ_A⟧,    (5.5)
    ⟦∂L/∂Θ_B⟧ = Σ_i ⟦d_i⟧ x_i^B + ⟦λ Θ_B⟧.    (5.6)

Note that party A and party B are able to compute u_i^A and u_i^B − y_i, respectively, using only their own local information. However, the term d_i involves both u_i^A and u_i^B − y_i, so it cannot be computed by either party alone. As a result, party A and party B should collaboratively compute d_i while preserving the privacy of u_i^A and u_i^B − y_i against the other party. In the homomorphic encryption setting, to prevent party A and party B from peeking into u_i^B − y_i and u_i^A, respectively, u_i^A and u_i^B − y_i are encrypted via the public key distributed by a third party C. The third party C in the loop is mainly responsible for decrypting the encrypted information passed from party A and party B, and for orchestrating the training and inference processes.

Introducing a third party into the loop is not always feasible in practical scenarios where the legitimacy and accountability of the third party are hard to guarantee. Secure multi-party computation techniques, such as secret sharing, can be applied to remove the third party and decentralize federated learning. We refer interested readers to Mohassel and Zhang [2017] for details. Here, we continue the elaboration with the architecture that includes a third party.

The Training Process

We summarize the detailed steps of training the federated linear regression model in Table 5.2. During entity alignment and model training, the data owned by parties A and B are stored locally, and the interactions in model training do not lead to data privacy leakage. Note that potential information leakage to party C may or may not be considered a privacy violation, since party C is a trusted party. To further prevent party C from learning information about parties A or B in this case, parties A and B can hide their gradients from party C by adding encrypted random masks.

Table 5.2: Training steps for secure linear regression

Step 1: Party C creates an encryption key pair and sends the public key to parties A and B.
Step 2: Party A initializes Θ_A, computes ⟦u_i^A⟧ and ⟦L_A⟧, and sends them to party B; party B initializes Θ_B, computes ⟦u_i^B⟧, ⟦d_i⟧, and ⟦L⟧, sends ⟦d_i⟧ to party A, and sends ⟦L⟧ to party C.
Step 3: Party A computes ⟦∂L/∂Θ_A⟧, adds an encrypted random mask ⟦R_A⟧, and sends the result to party C; party B computes ⟦∂L/∂Θ_B⟧, adds an encrypted random mask ⟦R_B⟧, and sends the result to party C.
Step 4: Party C decrypts the loss and the masked gradients and sends them back to parties A and B, respectively; parties A and B remove their masks and update Θ_A and Θ_B with learning rate η.

The training protocol shown in Table 5.2 does not reveal any information to C, because all that C learns are the masked gradients, and the randomness and secrecy of the masked matrix are guaranteed [Du et al., 2004]. In the above protocol, party A learns its gradient at each step, but this is not enough for A to learn any information about B according to Equation (5.5), because the security of the scalar product protocol is well established based on the inability to definitively solve n equations in more than n unknowns [Du et al., 2004, Vaidya and Clifton, 2002]. Here, we assume that the number of samples N_A is much greater than n_A, the number of features held by party A. Similarly, party B cannot learn any information about A. Therefore, the security of the protocol is proven.

Note that we have assumed both parties to be semi-honest. If a party is malicious and cheats the system by faking its input, for example, if party A submits only one non-zero sample with only one non-zero feature, then it can tell the value of u_i^B − y_i for that sample from its gradient component for that feature. It still cannot compromise x_i^B or Θ_B, though, and the deviation will distort the results of the next iteration, thereby alerting the other party, which can then terminate the learning process in response. At the end of the training process, each party remains oblivious to the data structure of the other party, and it obtains the model parameters associated only with its own features.

Because the loss and gradients each party receives are exactly the same as the loss and gradients they would receive if the model were jointly built with all data gathered in one place without privacy constraints, this collaboratively trained model is lossless and its optimality is guaranteed.
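
The losslessness claim can be checked numerically: on toy plaintext data (encryption does not change the computed values), the gradient assembled from the two parties' local pieces equals the gradient computed on the pooled data. The sketch below assumes NumPy and uses made-up dimensions and random values.

import numpy as np

rng = np.random.default_rng(0)
n, dA, dB = 6, 3, 2                                # made-up sizes
XA, XB = rng.normal(size=(n, dA)), rng.normal(size=(n, dB))
y, lam = rng.normal(size=n), 0.1
thetaA, thetaB = rng.normal(size=dA), rng.normal(size=dB)

# Centralized computation on pooled data.
X, theta = np.hstack([XA, XB]), np.concatenate([thetaA, thetaB])
d_central = X @ theta - y
grad_central_A = d_central @ XA + lam * thetaA

# Federated decomposition: each party contributes only its local piece.
d_federated = XA @ thetaA + (XB @ thetaB - y)      # d_i = u_i^A + (u_i^B - y_i)
grad_federated_A = d_federated @ XA + lam * thetaA

assert np.allclose(d_central, d_federated)
assert np.allclose(grad_central_A, grad_federated_A)   # identical, hence lossless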

The efficiency of the model training depends on the communication overhead and the computation overhead incurred by encrypting the data. In each iteration, the amount of information sent between A and B grows with the number of overlapping samples. The efficiency of this algorithm can thus be further improved by adopting distributed parallel computing techniques.

The Inference Process

At inference time, the two parties need to collaboratively compute the inference result, following the steps summarized in Table 5.3. During inference, the data belonging to each party are not exposed to the other.

Table 5.3: Inference steps for secure linear regression

Step 1: Party C sends the ID i of the queried sample to parties A and B.
Step 2: Party A computes u_i^A and sends it to party C; party B computes u_i^B and sends it to party C.
Step 3: Party C combines the partial results into the prediction u_i^A + u_i^B and returns it.
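
A plaintext sketch of this collaborative inference is given below (toy parameter and feature values, encryption not shown): each party evaluates only its own share of the linear model, and only the partial scores are combined.

import numpy as np

# Trained parameters stay with their owners (toy values, made up).
thetaA, thetaB = np.array([0.1, -0.2]), np.array([0.3])

def partial_score(theta, x_local):
    # Each party evaluates only its own share of the linear model.
    return float(theta @ x_local)

# For a queried user ID, each party looks up its own features locally.
uA = partial_score(thetaA, np.array([0.5, 1.2]))   # computed at party A
uB = partial_score(thetaB, np.array([2.0]))        # computed at party B

# The coordinator C sums the partial scores; neither raw features nor the
# other party's parameters ever leave their owner.
prediction = uA + uB
print(prediction)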

5.3.2    SECURE FEDERATED TREE-BOOSTING

The second example is secure federated tree-boosting (termed SecureBoost), which was first studied in Cheng et al. [2019] in the setting of VFL. This work proves that SecureBoost is as accurate as non-federated gradient tree-boosting algorithms that require the data to be gathered in one central place. That is, SecureBoost provides the same level of accuracy as its non-privacy-preserving variants while revealing no information about any private data owner. Note that, in the description of this example, the coordinator corresponds to the so-called active party originally defined in Cheng et al. [2019], which holds both data and labels.

Secure Entity Alignment

Similar to the secure federated linear regression described in Section 5.3.1, SecureBoost consists of two major steps. First, it aligns the data under the privacy constraint. Second, it collaboratively learns a shared gradient tree-boosting model while keeping all the training data secret over multiple private parties.

The first step in the SecureBoost framework is entity alignment, which finds a common set of data samples (i.e., the common users) among all participating parties so as to build a joint ML model. When the data are vertically partitioned over multiple parties, different parties hold different but partially overlapping sets of users. The common users can be identified by their unique user IDs. In particular, we can align the data samples under an encryption scheme by using a privacy-preserving protocol for inter-database intersections, see, e.g., Liang and Chawathe [2004].

Review of XGBoost

After aligning the data across the different parties under the privacy constraints, we now consider the problem of jointly building the tree ensemble model over multiple parties without violating privacy through VFL. To achieve this goal, three key questions need to be answered.

•  How can each party compute an updated model based on its local data without reference to class labels?

•  How can the coordinator aggregate all the updated models and obtain a new global model?

•  How can the updated global model be shared among all parties without leaking any private information at inference time?

To help answer these questions, we first give a quick review of the tree ensemble model XGBoost [Chen and Guestrin, 2016] in the non-federated setting.

Given a dataset X ∈ ℝ^{n×d} with n samples and d features, XGBoost predicts the output by using K regression trees:

    ŷ_i = Σ_{k=1}^{K} f_k(x_i),  i = 1, …, n,    (5.7)

where each f_k is a regression tree.

To avoid getting bogged down in the mathematical details of tree boosting, we bypass the derivation of the loss function for learning the set of regression trees used in Equation (5.7). Instead, we introduce the training rules in plain words, hoping that this is more accessible. The objective of learning the regression tree ensemble is to find the best set of trees that gives a small classification loss as well as low model complexity. In gradient tree boosting, this objective is approached by iteratively optimizing the loss (e.g., the squared loss or a Taylor approximation of a loss function) between the labels and the predictions. In each iteration, we try to add a new tree that reduces the loss as much as possible while not introducing too much complexity. Hence, the objective function at the t-th iteration can be written as

    L^(t) ≃ Σ_{i=1}^{n} [ g_i f_t(x_i) + (1/2) h_i f_t²(x_i) ] + Ω(f_t),    (5.8)

where g_i and h_i are the first- and second-order derivatives of the loss with respect to the prediction ŷ_i^(t−1) from the previous iteration, and Ω(f_t) is the complexity penalty of the new tree f_t. This indicates that we only need to find a new tree that optimizes this objective function, and it should be easy to find an optimal tree given this objective function, with the optimal score denoted as obj_t.

Hence, the construction of a newly added regression tree is a process of building a tree that minimizes obj_t, i.e., starting from a depth of 0 and deciding the split of each leaf node until the maximum depth is reached. The question now lies in how to decide an optimal split of a leaf node at each level of the tree. The quality of a "split" is measured by the split gain, which can be calculated from the aforementioned variables g_i and h_i. We have the following observations (a code sketch of the corresponding split-gain computation follows the observations).

(i)

The evaluation of the split candidates and the calculation of the optimal weight of a leaf node depend only on the variables g_i and h_i.

(ii)

The class label is needed for the calculation of g_i and h_i, and the class label can easily be recovered from g_i and h_i once the prediction ŷ_i^(t−1) from the (t − 1)-th iteration is known.
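
To make observation (i) concrete, the sketch below computes the standard XGBoost split gain [Chen and Guestrin, 2016] from the sums of g_i and h_i on the two sides of a candidate split. The toy numbers assume the squared loss, for which g_i = ŷ_i^(t−1) − y_i and h_i = 1 up to a constant factor; lam and gamma stand for XGBoost's regularization parameters λ and γ.

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Standard XGBoost gain for splitting a node into left/right children;
    # G_* and H_* are the sums of g_i and h_i over the instances on each side.
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Toy per-instance statistics for the squared loss.
g = [0.3, -1.2, 0.8, -0.1]
h = [1.0, 1.0, 1.0, 1.0]
left = {0, 1}                                   # candidate split assignment
G_L, H_L = sum(g[i] for i in left), sum(h[i] for i in left)
G_R, H_R = sum(g) - G_L, sum(h) - H_L
print(split_gain(G_L, H_L, G_R, H_R))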

Algorithm 5.1 Aggregate Encrypted Gradient Statistics (adapted from Cheng et al. [2019])

The Training Process of SecureBoost

We know from the above observations that each party can determine the locally optimal split independently using only its local data once it obtains g_i and h_i. The optimal split can be found if the split gain can be calculated for every possible split, using the sums of g_i and h_i over groups of instances.

In order to keep g_i and h_i confidential and avoid privacy leakage, g_i and h_i shall be encrypted before being sent to the other parties. As an example, we show here how to calculate the candidate split gains with encrypted g_i and h_i using the additive homomorphic encryption scheme [Paillier, 1999].

With the additive homomorphic encryption scheme, the split gain for every split candidate can be computed from the sums of groups of ciphertexts of g_i and h_i, respectively. Therefore, the best split of each party can be found by evaluating all the possible split gains at the coordinator, which can then apply the globally optimal split.

However, this solution is not efficient, since it requires transmitting ciphertexts for all possible split candidates, which incurs an enormous communication overhead. To construct a boosting tree with a lower communication cost, we can take advantage of the approximate framework proposed in Chen and Guestrin [2016]; the detailed calculation is shown in Algorithm 5.1.

Algorithm 5.2 Find Optimal Split (adapted from Cheng et al. [2019])

With Algorithm 5.1, instead of computing ⟦g_l⟧ and ⟦h_l⟧ directly for each split candidate, each party maps its features into buckets and then aggregates the encrypted gradient statistics based on the buckets. In this way, the coordinator only needs to collect the aggregated encrypted gradient statistics from all parties. As a result, it can determine the globally optimal split more efficiently, as described in Algorithm 5.2.
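
The sketch below illustrates the bucket-aggregation idea behind Algorithm 5.1, again assuming the python-paillier (phe) package; it is not a reconstruction of the algorithm itself. For illustration only, both key halves live in one process, whereas in SecureBoost the private key never leaves the active party; the feature values, thresholds, and gradient statistics are made up.

from collections import defaultdict
from phe import paillier

pub, priv = paillier.generate_paillier_keypair(n_length=1024)

# Active party: encrypt the per-instance gradient statistics (toy values).
g = [0.3, -1.2, 0.8, -0.1]
h = [1.0, 1.0, 1.0, 1.0]
enc_g = [pub.encrypt(v) for v in g]
enc_h = [pub.encrypt(v) for v in h]

# Passive party: it knows its own feature values, so it can assign each
# instance to a quantile bucket, but it cannot read g_i or h_i.
feature = [5.0, 2.0, 9.0, 4.0]        # this party's private feature column
thresholds = [3.0, 6.0]               # made-up bucket boundaries

def bucket(x):
    return sum(x > t for t in thresholds)

G = defaultdict(lambda: pub.encrypt(0))
H = defaultdict(lambda: pub.encrypt(0))
for i, x in enumerate(feature):
    b = bucket(x)
    G[b] = G[b] + enc_g[i]            # homomorphic addition of ciphertexts
    H[b] = H[b] + enc_h[i]

# Only the aggregated ciphertexts are returned to the active party, which
# decrypts them and evaluates the candidate split gains.
print({b: round(priv.decrypt(G[b]), 3) for b in G})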

After the coordinator obtains the globally optimal split, represented as [party id (i), feature id (k), threshold id (v)], it returns the feature id k and the threshold id v to the corresponding party i. Party i determines the value of the selected attribute based on k and v. Then, it partitions the current instance space according to the value of the selected attribute. In addition, it builds a lookup table locally to record the value of the selected attribute, [feature, threshold value]. After that, it returns the index of the record and the instance space of the left child node after the split (I_L) back to the active party. The active party splits the current node according to the received instance space and associates the current node with [party id, record id]. This procedure continues until a stopping criterion or the maximum depth is reached. All the leaf nodes are stored at the active party.

In summary, the step-by-step training process of SecureBoost is as follows.

•  Step 1: The active party first calculates g_i and h_i, i ∈ {1, …, N}, and encrypts them using additive homomorphic encryption (AHE).

•  Step 2: The active party sends the encrypted gradient statistics ⟦g_i⟧ and ⟦h_i⟧ to all passive parties.

•  Step 3: Each passive party maps its features into buckets and then aggregates the encrypted gradient statistics based on the buckets, following Algorithm 5.1. The results are sent to the active party.

•  Step 4: The active party decrypts the aggregated results, determines the globally optimal split according to Algorithm 5.2, and returns k_opt and v_opt to the corresponding passive party.

•  Step 5: The passive party determines the attribute’s value according to kopt and vopt received from the active party, and returns the corresponding records to the active party.

•  Step 6: Repeat Steps 2–4 iteratively until the maximum score is reached.

The Inference Process of SecureBoost

For federated model inference, we are able to classify a new sample even though its features are privately distributed across different parties. Since all leaf nodes are stored at the active party, the inference process is coordinated by the active party using information from the passive parties, each of which holds a party-specific lookup table consisting of [feature, threshold value] entries. The inference process consists of the following recursive steps.

•  Step 1: The active party refers to the owner (i.e., party-id) of the root node with the related feature-threshold tuple (i.e., record-id).

•  Step 2: The retrieved party compares the value of the corresponding attribute with the threshold from its lookup table, and decides which child node to retrieve.

•  Step 3: The active party returns the party-id and record-id of the retrieved node.

•  Step 4: Repeat Step 2 until a leaf node is reached.

In each step, the active party only handles the retrieval of the node-id and the corresponding party-id and record-id, and the actual attribute values are only exposed to the parties that own the corresponding attributes. Therefore, the federated inference is not only private but also lossless.
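
The following single-process sketch shows one possible set of data structures behind these steps: the active party keeps the tree skeleton and leaf values, each party keeps its own [feature, threshold] lookup table, and attribute values never leave their owner. All node layouts, names, and values here are hypothetical; in a real deployment each dictionary lives at a different party and the steps above are message exchanges.

# Hypothetical data structures for federated inference.
tree = {                                   # held by the active party
    0: {"party": "B", "record": 0, "left": 1, "right": 2},
    1: {"party": "A", "record": 0, "left": 3, "right": 4},
    2: {"leaf": 0.7}, 3: {"leaf": -0.2}, 4: {"leaf": 0.4},
}
lookup = {                                 # each party keeps only its own table
    "A": {0: ("purchase_count", 10)},
    "B": {0: ("credit_rating", 650)},
}
features = {                               # each party sees only its own columns
    "A": {"purchase_count": 14},
    "B": {"credit_rating": 620},
}

def federated_predict(node_id=0):
    node = tree[node_id]
    if "leaf" in node:                     # leaf values live at the active party
        return node["leaf"]
    party, record = node["party"], node["record"]
    feat, threshold = lookup[party][record]       # answered locally by that party
    go_left = features[party][feat] < threshold   # attribute value stays with its owner
    return federated_predict(node["left"] if go_left else node["right"])

print(federated_predict())                 # 0.4 for the toy values above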

5.4    CHALLENGES AND OUTLOOK

VFL enables participants to build a shared model based on data with heterogeneous features in a privacy-preserving manner. Unlike HFL, in which a common model is shared by all participants, in VFL the model is partitioned into multiple components, each maintained by a participant holding relevant but different data features. Thus, participants in VFL have a closer interdependent relationship with each other. More specifically, the training of each model component must follow a certain computation order specified by the underlying VFL algorithm. In other words, participants have dependent computations and need to interact frequently with each other to exchange intermediate results.

Therefore, VFL is vulnerable to communication failures and thus requires reliable and efficient communication mechanisms. Transferring the intermediate results from one participant to another can be expensive, as long-haul connections must be established between two participants located in different geographical regions. Such slow data transfer, in turn, results in inefficient utilization of computing resources, as a participant cannot start training until it has received all the required intermediate results. To address this issue, we may need to design a streaming communication mechanism that can judiciously schedule the training and communication of each participant to offset the data transfer delay. A fault-tolerant mechanism should also be designed to prevent the VFL process from crashing in the middle of training.

Currently, in the federated learning setting, most of the approaches proposed to reduce information leakage or prevent malicious attacks target HFL. As VFL typically requires closer and more direct interaction between participants, flexible secure protocols that can meet the security requirements of each participant are needed. Previous works [Baldimtsi et al., 2018, Bost et al., 2015] have demonstrated that different secure tools are optimal for different types of computations, e.g., garbled circuits give efficient comparisons, whereas secret sharing and homomorphic encryption yield efficient arithmetic function evaluation. We may explore hybrid strategies that convert among secure techniques, aiming to achieve locally optimal performance for each part of the computation. In addition, efficient secure entity alignment is also worth exploring, since it is a crucial preprocessing component of VFL.
