CHAPTER 3

Distributed Machine Learning

As we know from Chapter 1, federated learning and distributed machine learning (DML) share several common features, e.g., both employ decentralized datasets and distributed training. Federated learning is even regarded by some researchers as a special type of DML, see, e.g., Phong and Phuong [2019], Yu et al. [2018], Konecný et al. [2016b], and Li et al. [2019], or viewed as the future and next step of DML. To gain deeper insights into federated learning, in this chapter we provide an overview of DML, covering both the scalability-motivated and the privacy-motivated paradigms.

DML covers many aspects, including the distributed storage of training data, the distributed operation of computing tasks, and the distributed serving of model results. There exists a large volume of survey papers, books, and book chapters on DML, such as Feunteun [2019], Ben-Nun and Hoefler [2018], Galakatos et al. [2018], Bekkerman et al. [2012], Liu et al. [2018], and Chen et al. [2017]. Hence, we do not intend to provide another comprehensive survey of this topic. We focus here on the aspects of DML that are most relevant to federated learning, and refer the readers to the references for more details.

3.1    INTRODUCTION TO DML

3.1.1    THE DEFINITION OF DML

DML, also known as distributed learning, refers to multi-node machine learning (ML) or deep learning (DL) algorithms and systems that are designed to improve performance, preserve privacy, and scale to more training data and bigger models [Trask, 2019, Liu et al., 2017, Galakatos et al., 2018]. For example, Figure 3.1 illustrates a DML system with three workers (a.k.a. computing nodes) and one parameter server [Li et al., 2014]. The training data are split into disjoint data shards and sent to the workers, and the workers carry out stochastic gradient descent (SGD) locally. The workers send the gradients Δwi or the model weights wi to the parameter server, where the gradients or model weights are aggregated (e.g., by taking a weighted average) to obtain the global gradient Δw or the global model weights w. Both synchronous and asynchronous SGD algorithms can be applied in DML [Ben-Nun and Hoefler, 2018, Chen et al., 2017].
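
To make the aggregation step concrete, the following is a minimal Python/NumPy sketch (the function and variable names are ours, not from any particular DML framework) of a parameter server taking the weighted average of worker gradients, with weights proportional to the local shard sizes, followed by one global SGD update.

    import numpy as np

    def aggregate(gradients, shard_sizes):
        # Weighted average of the workers' gradients; each weight is
        # proportional to the number of training samples on that worker.
        total = sum(shard_sizes)
        return sum((n / total) * g for g, n in zip(gradients, shard_sizes))

    # Gradients reported by three workers for the same model parameters.
    grads = [np.array([0.2, -0.1]), np.array([0.4, 0.0]), np.array([0.1, 0.3])]
    sizes = [1000, 2000, 1000]

    w = np.zeros(2)                      # global model weights
    w -= 0.01 * aggregate(grads, sizes)  # one synchronous SGD step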

DML can generally be categorized into two classes, namely scalability-motivated DML and privacy-motivated DML. Scalability-motivated DML refers to the DML paradigm designed to address the ever-increasing scalability and computation requirements of large-scale ML systems. For example, in the past decades, the scale of the problems that ML and DL methods have to deal with has increased exponentially. Training a sophisticated DL model with a huge amount of data can easily exceed the capability of the traditional ML paradigm, which relies on a single computing entity. One outstanding example is the famous BERT model [Devlin et al., 2019], whose pre-training requires multiple tensor processing units (TPUs) and may take several days even with a fleet of TPUs. To cope with such scenarios, the fast-developing DML methods are considered the answer to the ever-increasing size and complexity of ML models.

Figure 3.1: Illustration of a distributed machine learning (DML) system.

Scalability-motivated DML approaches provide feasible solutions to large-scale ML systems when memory limitation and algorithm complexity are the main obstacles. Besides removing the need for centralized storage of training data, a DML system can also employ more elastic and scalable computing resources, e.g., by adding computing entities on demand. This is particularly helpful in the age of cloud computing, where we can ask for more processors (such as CPUs, GPUs, or even TPUs) and memory in an on-demand manner. In light of this feature, scalability-motivated DML is widely applied in scenarios with horizontally partitioned datasets, where disjoint subsets of training data are stored at different computing entities.

Different from scalability-motivated DML, the primary goal of the privacy-motivated DML paradigm is to preserve user privacy. As user privacy and data security become a global concern (see also Chapter 1 and Appendix A) [Mancuso et al., 2019], privacy-preserving ML is becoming a new trend in the ML community (see also Chapter 2) [Yang et al., 2019]. In a privacy-motivated DML system, there are multiple parties and each of them holds a subset of the training data. Due to privacy concerns, the parties do not wish to expose their data to each other. Thus, distributed learning schemes are required to make use of the data of each participating party to collaboratively train an ML model. The datasets held by different parties may have different attributes, resulting in the so-called vertical partition of training data. That is, privacy-motivated DML is often applied in scenarios with vertically partitioned datasets, where subsets of training data with different attributes are held by different parties.

3.1.2    DML PLATFORMS

Because of the distributed and parallel computing architecture of DML, specialized ML platforms are required in order to reap the benefits of DML. There exist numerous commercial and open-source DML platforms. We introduce here some of the representative frameworks.

One of the most widely used distributed data processing systems for ML is Apache Spark MLlib [Apache MLlib, 2019]. MLlib is Apache Spark's scalable ML library. It is a memory-based DML framework and makes practical ML systems scalable and easy to deploy. MLlib offers distributed implementations of conventional ML algorithms (as opposed to DL), such as classification, regression, and clustering. DeepSpark offers a distributed training framework for DL on top of Apache Spark [DeepSpark, 2019].

Graph-based parallel processing is a relatively new approach for DML, which is also called graph parallelism in the context of DML (see Section 3.2.2). The platform GraphLab [Turi-Create, 2019, Low et al., 2010] offers scalable ML toolkits and implements fundamental algorithms like SGD and gradient descent with superior performance. Another graph parallelism-based computation platform is Apache Spark GraphX, a component of Spark that implements a Pregel-like bulk-synchronous message-passing application programming interface (API) [Apache GraphX, 2019]. Pregel is Google's parallel graph processing library based on the Bulk Synchronous Parallel (BSP) model [Malewicz et al., 2010].

The Distributed ML Toolkit (DMTK) released by Microsoft contains both algorithmic and system innovations [DMTK, 2019]. DMTK supports a unified interface for data parallelization, a hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency.

DL requires training deep neural networks (DNNs) with a massive number of parameters on huge amounts of data. Distributed and parallel computing is a perfect tool to take full advantage of modern hardware. As for distributed DL, in addition to DeepSpark, the popular DL frameworks, such as TensorFlow and PyTorch, all support distributed training and deployment.

TensorFlow supports distributed training of DNNs via tf.distribute, e.g., (i) allowing portions of the computation graph to be computed on different processes or even different servers, and (ii) employing multiple processors or even servers to train the same model over different slices of the input dataset [Distributed TensorFlow, 2019]. TensorFlow makes it possible to split big models over many devices and carry out the training in parallel on different devices when a model is too large to fit in the memory of a single device. This can also be used to distribute computation to servers with powerful GPUs while carrying out other computations on servers with more memory. With distributed TensorFlow, we can scale up distributed model training to hundreds of GPUs, and we can massively reduce experimentation time (e.g., for hyper-parameter tuning) by running many experiments in parallel on many GPUs and servers.
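
As a minimal sketch (the model and data here are placeholders of our own choosing), synchronous data-parallel training across all local GPUs can be set up with tf.distribute.MirroredStrategy roughly as follows.

    import tensorflow as tf

    # Replicate the model on all local GPUs; gradients are combined with
    # an all-reduce at every step (synchronous data parallelism).
    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )

    # model.fit(dataset, epochs=3)  # each replica trains on a different slice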

The distributed package included in PyTorch (i.e., torch.distributed) enables researchers and practitioners to easily parallelize their computations across processes and clusters of machines [Arnold, 2019]. Similar to TensorFlow, distributed PyTorch allows a model to be logically split into several parts (i.e., some layers in one part and some in others) and placed on different computing devices. PyTorch leverages message-passing semantics, allowing each process to communicate data to any of the other processes. As opposed to the multiprocessing package (e.g., torch.multiprocessing), processes in torch.distributed can use different communication backends and are not restricted to being executed on the same machine.
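
The following is a minimal runnable sketch of this message-passing style using the gloo backend on a single machine (the address, port, and tensor sizes are illustrative choices of ours); each spawned process averages a stand-in gradient tensor with its peers via all_reduce.

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        os.environ["MASTER_ADDR"] = "127.0.0.1"   # rendezvous address
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        grad = torch.ones(2) * (rank + 1)          # stand-in local gradient
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        grad /= world_size                          # synchronous averaging
        print(f"rank {rank}: averaged gradient = {grad.tolist()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(2,), nprocs=2)      # two processes, one machine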

3.2    SCALABILITY-MOTIVATED DML

In this section, we provide a brief review of the existing works on scalability-motivated DML methods. Readers are referred to Feunteun [2019], Ben-Nun and Hoefler [2018], Galakatos et al. [2018] and Bekkerman et al. [2012] for comprehensive surveys of DML schemes and the references therein for more technical details.

3.2.1    LARGE-SCALE MACHINE LEARNING

With the emergence of widespread communication and sensing devices, such as smartphones, portable gadgets, IoT sensors, and wireless cameras, data are ubiquitously available in enormous volumes. In this big-data era, the bottleneck of ML methods has shifted from inferring from small training samples to dealing with large-scale, high-dimensional datasets. With this shift, the ML community faces the challenge that computation power and time do not scale well with dataset size, making it infeasible to learn from large-scale training samples with reasonable computational effort and time. We summarize in the following the major challenges that conventional ML methods face when dealing with large-scale datasets.

1.  Memory shortage. Conventional ML methods operate with all the training samples held in the main memory of a single machine. Therefore, if the training samples exceed the capacity of the main memory, the following problems may arise: (i) the trained model may not converge or may deliver poor performance (such as bad precision or recall), and (ii) in the worst-case scenario, the ML model cannot be trained at all due to memory shortage.

2.  Unreasonable training time. The optimization processes of some ML algorithms, such as the Gaussian mixture model (GMM) and polynomial regression, do not scale well with the number of training samples. As a result, when dealing with large-scale training samples, the time consumed by the training process may be too long for practical use. On top of training, tuning the hyper-parameters of ML models also takes a lot of time, as we may need to try many different settings. Hence, if the training process takes too long, hyper-parameter tuning cannot be performed effectively, which may result in poor ML models.

DML algorithms belong to the family of large-scale learning algorithms, which have received considerable attention over the last few years thanks to their ability to distribute the learning process onto several devices and thereby scale up learning algorithms. Recent advances in DML make ML tasks on big data feasible, scalable, flexible, and more efficient.

3.2.2    SCALABILITY-ORIENTED DML SCHEMES

Extensive research efforts have been devoted to developing effective frameworks and methods for dealing with large-scale datasets and ML models. In particular, training large-scale DL models is very time-consuming, with training periods ranging from days to even weeks. More recently, numerous research works have been carried out to push the frontiers of DML, aiming to reduce training time and cope with large-scale DL models. We review here some of the popular scalability-oriented DML schemes, covering data parallelism, model parallelism, graph parallelism, task parallelism, hybrid parallelism, and mixed parallelism.

Data Parallelism

The first instinct in DML is to partition the training data into subsets and place them on multiple computing entities that train the same model in parallel. This is known as the data parallelism approach, also known as the data-centric approach [Jia et al., 2019, Das, 2019, Wang, 2016]. In other words, data parallelism refers to processing multiple pieces (technically called shards) of training data through multiple replicas of the same model on different computing devices and communicating model information periodically. This approach naturally scales well with increasing amounts of training data. However, as a replica of the model (e.g., an entire DNN) has to reside on a single device, it cannot deal with ML models with high memory footprints.

There are mainly two common approaches for data parallelism-based distributed training, namely synchronous training and asynchronous training. With synchronous training, all computing entities train on replicas of the same model over different slices of the training data in synchronization, and the gradients (or model weights) produced by the computing entities are aggregated after each training step carried out by the entities. With asynchronous training, all entities independently train replicas of the same model over subsets of the training data and update model weights or gradients asynchronously. Typically, synchronous training is supported by the AllReduce architecture [Apache MapReduce, 2019, Fukuda, 2019], and asynchronous training by the parameter server architecture [Li et al., 2014].

Data parallelism can be used when the training data are too large to be stored on a single device or when faster training via parallel computing is desired. Much work has been conducted on training DL models with distributed data. For example, the distributed frameworks DistBelief (which was later integrated into TensorFlow) from Google [Dean et al., 2012] and Project Adam [Chilimbi et al., 2014] from Microsoft train large-scale models with thousands of processors by utilizing both data and model parallelism.

Model Parallelism

As DL models are getting larger and larger, e.g., the BERT model [Devlin et al., 2019], we may face the problem that a DNN model cannot be loaded into the memory of a single computing entity. In such scenarios, we need to split the model and put the parts into different entities. This is called the model parallelism approach, also known as the model-centric approach [Jia et al., 2019, Das, 2019, Wang, 2016]. In other words, model parallelism refers to the case where a model (e.g., a DNN) is logically split into several parts (i.e., some layers in one part and some layers in others) that are placed on different computing devices. Although doing so may reduce execution time (through asynchronous processing of data), it is usually employed to address memory constraints. Models with a very large number of parameters, which are difficult to fit into a single system due to their high memory footprint, benefit from this type of strategy. For example, a single layer of a large DNN model may fit into the memory of a single device, and forward and backward propagation then involve communicating outputs from one device to another in a serial fashion. We usually resort to model parallelism only if the model cannot fit into a single machine, not primarily to speed up the training process.
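
As a small PyTorch sketch of this idea (the device names and layer sizes are arbitrary choices of ours), the layers of a network can be pinned to different devices, with activations hopping between devices during the forward pass:

    import torch
    import torch.nn as nn

    class TwoDeviceNet(nn.Module):
        # First half of the layers lives on dev0, second half on dev1;
        # the forward pass moves activations between the devices serially.
        def __init__(self, dev0, dev1):
            super().__init__()
            self.dev0, self.dev1 = dev0, dev1
            self.part1 = nn.Sequential(nn.Linear(784, 4096), nn.ReLU()).to(dev0)
            self.part2 = nn.Linear(4096, 10).to(dev1)

        def forward(self, x):
            h = self.part1(x.to(self.dev0))
            return self.part2(h.to(self.dev1))   # device-to-device hop

    # With two GPUs: net = TwoDeviceNet("cuda:0", "cuda:1")
    net = TwoDeviceNet("cpu", "cpu")             # CPU stand-in for illustration
    out = net(torch.randn(8, 784))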

Existing works on model parallelism-based distributed training are extensive. One outstanding example is AMPNet, studied in Gaunt et al. [2018]. AMPNet was implemented on multi-core CPUs, and it was shown that AMPNet training converged to the same accuracy as conventional synchronous training algorithms in a similar number of epochs, but took a significantly shorter overall training time. A more recent example is OptCNN [Jia et al., 2018], which uses layer-wise parallelism for parallelizing convolutional neural networks (CNNs) with linear computation graphs. OptCNN allows each layer in a CNN to use an individual parallelization strategy. It was shown in Jia et al. [2018] that layer-wise parallelism outperforms state-of-the-art approaches by increasing training throughput, reducing communication costs, and achieving better scalability to multiple GPUs, while maintaining the original model accuracy.

Earlier exemplary works on model parallelism include Dean et al. [2012], Kim et al. [2016], and Jia et al. [2019]. In particular, in Dean et al. [2012], Google presented Downpour SGD, an asynchronous and distributed implementation of SGD. Downpour SGD combines data parallelism and model parallelism: it divides the training samples among different machines, and each machine holds its own replica of the whole or a part of the network. DeepSpark was first proposed in Kim et al. [2016]. It allows distributed execution of both Caffe and Google's TensorFlow DL jobs on an Apache Spark cluster of machines [DeepSpark, 2019]. DeepSpark makes deploying large-scale parallel and distributed DL easy and intuitive for practitioners.

Graph Parallelism

As graph-based ML algorithms are fast-growing [Zhang et al., 2018], graph parallelism-based DML approaches are also receiving more attention. Graph parallelism, also known as the graph-centric approach, is a relatively new technique that partitions and distributes training data and executes ML algorithms in a way that can be orders of magnitude faster than data parallelism-based approaches [Tian et al., 2018, Wang, 2016, Low et al., 2010].

GraphLab, first studied in Low et al. [2010] as an improvement upon abstractions like MapReduce, implements asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency. It achieves excellent parallel performance on large-scale, real-world ML tasks.

More recently, Xiao et al. [2017] proposed a new distributed graph engine called TUX2. TUX2 is optimized for DML, supporting heterogeneity, a Stale Synchronous Parallel model, and a new MEGA (Mini-batch, Exchange, GlobalSync, and Apply) model. TUX2 advances the convergence of graph computation and DML, with a flexible graph model to express ML algorithms efficiently. Advances in graph computation and DML will allow more ML algorithms and optimizations to be expressed and implemented easily and efficiently at scale.

Task Parallelism

Task parallelism, also known as the task-centric approach, refers to the execution of computer programs across multiple processors on the same or multiple machines. It focuses on executing different operations in parallel to fully utilize the available computing resources in the form of processors and memory. One example of task parallelism is an application that creates multiple threads, where each thread is responsible for a different operation. Examples of big data frameworks that utilize task parallelism are Apache Storm [Apache Storm, 2019] and Apache YARN [Apache YARN, 2019].

It is common to combine task parallelism and data parallelism for DML. One outstanding example is Boehm et al. [2016], which presented a systematic approach for combining task parallelism and data parallelism for large-scale ML on top of MapReduce [Apache MapReduce, 2019]. The proposed framework enables users to specify task- and data-parallel ML algorithms in an easy and flexible way via a high-level primitive. Combining task and data parallelism on top of MapReduce also opens ways to share cluster resources with other MapReduce-based systems, since the MapReduce scheduler provides global scheduling.

Hybrid Parallelism and Mixed Parallelism

In practical implementations of DML systems, we often need to combine different types of parallelism, resulting in the so-called hybrid parallelism, such as Apache YARN [Apache YARN, 2019] and SystemML [Pansare et al., 2018, Boehm et al., 2016] for both data and task parallelism. In fact, it is very common in practice to use both data and model parallelism simultaneously, as in Google's Downpour SGD [Dean et al., 2012] and the distributed DL framework proposed in Shrivastava et al. [2017]. Wang et al. [2018] proposed to unify data, model, and hybrid parallelism via tensor tiling. The SOYBEAN system proposed in Wang et al. [2018] is a hybrid of data parallelism and model parallelism, and it performs automatic parallelization.

Under the broader umbrella of hybrid parallelism, there is also mixed parallelism, such as the work of Krizhevsky [2014] and Song et al. [2019]. This kind of parallelism is sometimes adopted for training large-scale DNNs, by distributing some layers using data parallelism and other layers using model parallelism. Readers are referred to Wang et al. [2018] and Song et al. [2019] for more information and related works on hybrid parallelism and mixed parallelism.

3.3    PRIVACY-MOTIVATED DML

DML systems can not only accelerate the processing of large-scale data but also integrate data from different sites. In many practical areas, data are distributed across different clients, entities, and institutions. To improve model performance, companies also collect and analyze data from individuals, which raises issues of user privacy and data security. For example, in medical applications, a hospital or medical institution is forbidden to share medical data by regulations (e.g., HIPAA). Another example is that smart wearable devices constantly collect sensitive individual data that are critical for wearable applications; sharing these data for model training raises concerns about privacy leakage.

In a nutshell, sharing data and distributing computation is a trend in the era of big data, as it can (1) improve computational efficiency and (2) improve model performance. In the meanwhile, the increasing awareness of privacy and data security requires a DML system to take privacy preservation into consideration. Therefore, building privacy-motivated DML systems has become an important research direction. In this section, we start with a privacy-preserving decision tree example and then introduce several privacy-preserving techniques and their applications in DML systems.

A privacy-preserving DML system generally protects some or all of the following information [Vepakomma et al., 2018].

1.  Input training data;

2.  Output predicted labels;

3.  Model information, including parameters, architecture, and loss function; and

4.  Identifiable information, such as which site a record comes from.

3.3.1    PRIVACY-PRESERVING DECISION TREES

The decision tree is an important kind of supervised ML algorithm, which is widely used in classification and regression. The learned model of a decision tree is explainable and understandable to people. Among the many variants of decision trees, ID3 [Quinlan, 1986] is one of the most famous. Distributed decision tree algorithms are normally divided into two categories according to how the data are distributed, formally defined as follows.

1.  Horizontally partitioned datasets:

    $\mathcal{X}_i = \mathcal{X}_j, \quad \mathcal{I}_i \neq \mathcal{I}_j, \quad \forall i \neq j;$

2.  Vertically partitioned datasets:

    $\mathcal{X}_i \neq \mathcal{X}_j, \quad \mathcal{I}_i = \mathcal{I}_j, \quad \forall i \neq j,$

where $\mathcal{X}_i$ and $\mathcal{I}_i$ denote, respectively, the feature space and the sample space (i.e., the identifiers of the samples) of the data held by the i-th party. We next explain this definition.

In the scenario of horizontally partitioned datasets, each participant (referred to as an entity) in the DML system owns different samples, and the samples in all entities have the same attribute categories. For example, the data collected by different wearable devices have the same set of attributes since the sensors of the devices are the same; however, as a result of different environments, the data samples collected by different entities are different.

In the scenario of vertically partitioned datasets, the attribute sets of the data owned by different entities are different, but the samples possibly refer to the same group of users. For example, the medical records of the same patient at different medical institutions record different physiological indices or disease examination results.

For horizontally partitioned DML, the aggregation of samples is equivalent to enlarging the dataset, while in the vertically partitioned scenario, it is akin to augmenting the features of the samples. Either way, distributed training provides a means to expand the dataset.

Different from other ML algorithms, the data partition is critical for decision trees, because learning a decision tree requires determining the attribute splits, which depend on both the attribute categories and the number of samples under a particular attribute with a given class label.

Lindell and Pinkas [2002] first proposed a privacy-preserving distributed decision tree algorithm for horizontally partitioned datasets. They introduced an oblivious secure protocol that computes (v1 + v2) log(v1 + v2) without revealing either value to the other participant. This secure computation allows the distributed decision tree to privately calculate the node splits across the samples of different participants. Wang et al. [2006] and Du and Zhan [2002] first addressed the problem of designing a vertically partitioned distributed decision tree with privacy protection. However, their solutions assume that all participants own the class attribute.

Fang and Yang [2008] extended this line of work by requiring only one entity in the vertically partitioned distributed system to hold the class attribute. Their work is based on the ID3 tree, where the learning of the tree is decomposed into components including attribute checking, distribution counts, class checking, attribute information gain checking, and information gain computation. Each part of the distributed computation is protected by secure protocols. Besides, this work provides a loosely secure version and a completely secure version, offering a trade-off between efficiency and security.

Cheng et al. [2019] proposed a vertically distributed boosting tree based on a secure set intersection protocol and partially homomorphic encryption. They proved that their method is not only secure and privacy-preserving but also lossless. There are also privacy-preserving distributed decision trees that use differential privacy (DP) to protect individual privacy by adding noise to the statistics [Jagannathan et al., 2009].

The development of privacy-motivated decision tree algorithms considers the data partition and utility of privacy-preserving tools. As a preliminary of privacy-preserving distributed ML systems, we briefly introduce some commonly used tools for privacy and security protections in the next subsection.

3.3.2    PRIVACY-PRESERVING TECHNIQUES

In privacy-motivated DML systems, the popular tools to protect data privacy can be roughly grouped into the following two categories.

1.  Obfuscation: randomizing or modifying the data to guarantee a certain level of privacy, e.g., DP.

2.  Cryptographic methods: securing the distributed computation process without revealing the input values to other participants, e.g., secure multi-party computation (MPC), including oblivious transfer (OT), secret sharing (SS), garbled circuits (GC), and homomorphic encryption (HE).

Please refer to Section 2.4 for a quick review of the aforementioned privacy-preserving techniques.

3.3.3    PRIVACY-PRESERVING DML SCHEMES

In the following, we give a quick review of representative works on privacy-preserving DML, emphasizing how they utilize the privacy protection tools mentioned above to protect data and model security in a distributed environment. Following the above categorization, we first summarize DML algorithms that use obfuscation and then introduce those that use cryptographic methods.

Chaudhuri and Monteleoni [2009] proposed a privacy-preserving logistic regression algorithm based on DP. They tackle optimization over randomized data, making it possible to strike a balance between model performance and privacy protection and to tighten the privacy bound. Following the definition given by Dwork [2008], they proved that their method guarantees ε-differential privacy and provided a novel algorithm with better performance. In the proposed work, a random vector generated using the Gamma distribution participates in the optimization of the logistic regression parameter θ. Moreover, they concluded that their work reveals a relation between perturbation-based privacy protection and regularization.

Wild and Mangasarian [2007] and Mangasarian et al. [2008] studied privacy-preserving support vector machines (PPSVMs) on horizontally and vertically partitioned datasets, respectively. They concealed the originally learned kernel with a randomly generated kernel, achieving performance comparable to non-private SVMs. The privacy proof is based on the fact that infinitely many possible inputs are consistent with the perturbed kernel; therefore, sharing the perturbed kernel does not cause privacy leakage. However, these methods require the participants to share the randomly generated kernel, which limits their applicability.

Apart from logistic regression, SVMs, and decision trees, perturbation-based privacy protection methods are also widely used in DL systems. With reference to the survey by Zhang et al. [2018], we selectively introduce some representative works. Song et al. [2013] proposed a differentially private stochastic gradient descent algorithm (DP-SGD), which clips the gradients and injects noise into them during training, so that the learned DL model preserves (ε, δ)-differential privacy. Utilizing another obfuscation method, Shokri and Shmatikov [2015] proposed a distributed selective stochastic gradient descent algorithm, which allows each local model to selectively share a part of its parameters, avoiding information leakage while preserving the performance of the jointly learned model. Beyond jointly learning a prediction model, Dwork [2011] proposed a differentially private autoencoder to learn representations of local data.
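
A minimal NumPy sketch of the clip-and-noise pattern follows (the function name and constants are ours; calibrating σ to a target (ε, δ) via composition analysis is omitted): each per-example gradient is clipped to a bounded L2 norm, Gaussian noise scaled to the clipping bound is added to the sum, and the noisy average drives the update.

    import numpy as np

    def dp_sgd_step(w, per_example_grads, lr=0.1, clip=1.0, sigma=1.0, rng=None):
        # Clip each per-example gradient to L2 norm <= clip, sum them,
        # add Gaussian noise scaled to the clipping bound, then average.
        rng = rng or np.random.default_rng()
        clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
                   for g in per_example_grads]
        noisy_sum = sum(clipped) + rng.normal(0.0, sigma * clip, size=w.shape)
        return w - lr * noisy_sum / len(per_example_grads)

    w = np.zeros(3)
    grads = [np.array([3.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.5])]
    w = dp_sgd_step(w, grads)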

DP is also used for unsupervised learning. Park et al. [2016] proposed a differentially private EM algorithm (DP-EM) based on moment perturbation. They utilize the moments accountant [Abadi et al., 2016] and zero-concentrated DP (zCDP) to reduce the magnitude of the noise added to the EM process while maintaining the same level of privacy protection as the original analysis technique. In their work, they compare different randomization mechanisms and their composition settings, and find that in DP-EM, using the Gaussian mechanism in every stage of EM achieves the tightest privacy budget.

In short, owing to their computational efficiency and implementation convenience, obfuscation-based privacy-preserving techniques are popular in privacy-motivated DML systems. Meanwhile, perturbation affects the utility of the data and the model, so in practice researchers have to make a trade-off between privacy protection and performance. Compared to obfuscation-based methods, cryptographic methods do not need to sacrifice data accuracy or model performance.

In Aono et al. [2016], the authors utilize HE to protect the data during the training of logistic regression. Their method uses a degree-two polynomial approximation of the logistic loss function, making the training process compatible with additive HE, which improves computational efficiency while maintaining comparable performance. Besides, they claim that their output is compatible with DP. They also analyze the storage and computation complexity of their system, showing that it supports large-scale distributed computation. Fienberg et al. [2006] also considered linear regression (LR) on horizontally partitioned datasets, utilizing an MPC method to aggregate the computation. However, in their setting the features are categorical, which means the computation space is small. Slavkovic et al. [2007] made significant progress, using a secure summation protocol and secure matrix multiplication for aggregation in the distributed learning of LR, supporting both vertical and horizontal data partitions.

Vaidya and Clifton [2004] designed a secure parameter sharing mechanism for privacy-preserving naive Bayes classifiers on vertically partitioned data, where each participant contributes a conditionally independent probability. Each individual parameter is indistinguishable from random noise, and only the aggregation is meaningful. However, the extra computational cost of the secure computation is also significant. Yu et al. [2006] and Zhan and Matwin [2007] introduced secure dot product and secure summation protocols to protect the data during kernel computation in SVMs. Xu et al. [2015] embedded the secure summation protocol into the Reducer of the Hadoop system, implementing an efficient distributed SVM system that supports large-scale data. Similarly, Lin et al. [2005] used the secure summation protocol to aggregate local computation results for the distributed EM algorithm, preventing data leakage from the local results.

As for DL, one of the most representative works is Secure Aggregation, proposed by Bonawitz et al. [2016]. This work builds on FedAvg, the federated learning algorithm first adopted by Google [McMahan et al., 2016b], and further introduces secret sharing and oblivious transfer into FedAvg for a complex mobile environment where communication is expensive and the dropout and join-in of clients are frequent. To ensure data security, every piece of data leaving a device is randomized with masks, and only the aggregation of the masked updates is meaningful. To reduce the communication cost, the parties exchange only random seeds via a secure key-exchange protocol instead of exchanging the random masks themselves. To handle clients dropping out unexpectedly, secret sharing is used so that even if some clients are lost, the system can still recover the aggregate from the remaining shares. Besides Secure Aggregation, there are other DL algorithms that use MPC methods [Liu et al., 2016, Shokri and Shmatikov, 2015].
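
The core masking idea can be sketched in a few lines of NumPy (a toy version of ours that omits the key exchange, the secret sharing used for dropouts, and the finite-field arithmetic of the real protocol): each pair of parties derives the same pseudorandom mask from a shared seed, one party adds it and the other subtracts it, so all masks cancel in the coordinator's sum.

    import numpy as np

    def pairwise_mask(i, j, dim, seeds):
        # Pseudorandom mask shared by parties i and j, derived from a
        # pairwise seed (in practice agreed via a secure key exchange).
        return np.random.default_rng(seeds[(min(i, j), max(i, j))]).normal(size=dim)

    n, dim = 3, 4
    seeds = {(i, j): 1000 * i + j for i in range(n) for j in range(i + 1, n)}
    updates = [np.full(dim, float(i + 1)) for i in range(n)]   # private inputs

    masked = []
    for i in range(n):
        y = updates[i].copy()
        for j in range(n):
            if j != i:
                m = pairwise_mask(i, j, dim, seeds)
                y += m if i < j else -m    # masks cancel pairwise in the sum
        masked.append(y)

    # The coordinator sees only masked vectors, yet recovers the exact sum.
    assert np.allclose(sum(masked), sum(updates))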

In addition, Mohassel and Zhang [2017] proposed a set of privacy-preserving ML algorithms based on MPC methods, supporting LR, logistic regression, and SGD-based neural network training, and provided a C++ implementation. Different from obfuscation-based privacy-preserving DML algorithms, cryptographic methods emphasize striking a balance among computational complexity, communication complexity, and security.

In the above, we have briefly introduced some representative privacy-motivated DML algorithms, as well as some widely used privacy protection tools. For a more detailed treatment of this topic, the surveys by Vepakomma et al. [2018], Zhang et al. [2018], and Mendes and Vilela [2017] are recommended.

3.4    PRIVACY-PRESERVING GRADIENT DESCENT

The gradient descent method is one of the central algorithms in ML, and privacy-preserving gradient descent methods have been widely studied. In this section, we review different privacy-preserving techniques proposed for gradient descent. There is a trade-off among efficiency, accuracy, and privacy protection, and developing good privacy-preserving gradient descent methods requires an artful balance of this efficiency-accuracy-privacy trade-off.

Methods designed for settings with higher efficiency requirements and fewer privacy concerns may sacrifice data privacy for computational efficiency. For example, in the gradient averaging approach [McMahan et al., 2016b], the gradients are sent to a coordinator in clear text for model update, trading privacy for efficiency without degrading the global learning accuracy. Methods aiming for the highest data privacy and security generally choose HE and MPC, which in turn lead to high computational complexity and communication overhead.

In addition to the approaches discussed in Chapter 2, there also exist other privacy-preserving approaches with different privacy guarantees. In some cases, the privacy model only aims to guarantee that the raw form of each party's input data cannot be revealed by the adversary. By providing such weak privacy guarantees, researchers aim to trade privacy for efficiency. The proposed approaches vary significantly.

Typical privacy-preserving gradient descent approaches include naive federated learning, algebraic approaches, sparse gradient update approaches, obfuscation approaches, and cryptographic approaches (e.g., HE and MPC). Obfuscation approaches are based on randomization, generalization, or suppression mechanisms (e.g., gradient quantization, DP, and k-anonymity). Generally, in naive federated learning, algebraic approaches, and sparse gradient update approaches, each party sends clear-text gradients to the coordinator for model update, which protects only the raw data and yields a weak privacy guarantee with quite high efficiency. The sparse gradient update approaches also trade accuracy for efficiency and privacy by updating only a subset of the entries in the gradient. Approaches based on randomization mechanisms, such as DP and Gaussian random projection (GRP), trade accuracy for privacy by adding random noise to the data or the gradients. Approaches based on generalization and suppression likewise trade accuracy for privacy by generalizing attributes or removing some instances. We review here some approaches for privacy-preserving gradient descent, in order of roughly increasing privacy guarantees.

3.4.1    VANILLA FEDERATED LEARNING

Federated averaging (FedAvg) was first employed for federated learning over horizontally partitioned datasets. In federated averaging, each party uploads clear-text gradients to a coordinator (or a trusted dealer, or a parameter server) independently; the coordinator then computes the average of the gradients and updates the model. Finally, the coordinator sends the clear-text updated model back to each party [McMahan et al., 2016b]. When the dataset is vertically partitioned, the model is distributed among the parties. In gradient descent methods, the objective function can be decomposed into a differentiable function and a linearly separable function [Wan, 2007]. To conduct gradient descent, each party applies its data to its partial model to obtain intermediate results and sends them to the coordinator in clear text. The coordinator accumulates the intermediate results and evaluates the differentiable function to compute the loss and the gradient. Finally, the coordinator updates the whole model and sends each updated partial model back to the corresponding party. It is assumed that the coordinator is honest and incurious and does not collude with any party; if the coordinator is corrupted, the gradient information of each party can be disclosed. Although the raw form of the training data is not likely to be inferred from the gradient of each party, it has been demonstrated that considerable information can be inferred from the uploaded gradients [Aono et al., 2018].
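
To make the horizontally partitioned case concrete, here is a minimal NumPy sketch of FedAvg of our own construction, with a least-squares linear model standing in for a generic learner: each party runs a local SGD step from the current global model, and the coordinator takes the sample-size-weighted average of the returned models.

    import numpy as np

    def local_update(w, data, lr=0.1):
        # One local SGD step on a party's shard (least-squares loss).
        X, y = data
        return w - lr * 2 * X.T @ (X @ w - y) / len(y)

    def fedavg(parties, rounds=50, dim=3):
        w = np.zeros(dim)
        sizes = [len(y) for _, y in parties]
        total = sum(sizes)
        for _ in range(rounds):
            local_models = [local_update(w.copy(), d) for d in parties]
            # Coordinator: weighted average of the clear-text local models.
            w = sum((n / total) * wl for wl, n in zip(local_models, sizes))
        return w

    rng = np.random.default_rng(1)
    w_true = np.array([1.0, -2.0, 0.5])
    parties = [(X, X @ w_true + 0.01 * rng.normal(size=50))
               for X in (rng.normal(size=(50, 3)) for _ in range(3))]
    print(fedavg(parties))   # approaches w_true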

3.4.2    PRIVACY-PRESERVING METHODS

Algebraic Approaches

Algebraic approaches aim to protect the raw training data by leveraging algebraic properties of the transmitted data. They preserve privacy by guaranteeing that there exist infinitely many valid input-output pairs for each honest party given the inputs and outputs of the adversaries; that is, the raw form of the input data is protected. Wan [2007] proposed a secure gradient descent approach for two-party vertical federated learning by decomposing the target function into a differentiable function and a linearly separable function. In this approach, the model parameters of the two parties are mutually masked, and only the clear-text gradient is disclosed and used for model update. Such approaches implicitly assume that each party has no knowledge of the records of the other party. If a small subset of records is disclosed to the other party (e.g., by data poisoning), the model can easily be disclosed via an equation-solving attack.

To defend against the equation-solving attack, a secure two-party computation approach for vertical federated linear regression and classification was proposed by introducing the concept of k-secure [Du et al., 2004]. In this approach, the instances are aligned and the dependent attribute is public. Arithmetic secret sharing is used; as the addition operation in arithmetic secret sharing is done locally, it is information-theoretically secure. The two-party multiplication is conducted as follows. First, the two parties jointly generate a random invertible matrix M. Party A multiplies its input matrix A by the left-side and right-side sub-matrices of M in turn, obtaining A1 and A2, and sends A1 to party B. Party B multiplies the top-side and bottom-side sub-matrices of M⁻¹ by its input matrix B in turn, obtaining B1 and B2, and sends B2 to party A. Finally, party B computes Va = A1 · B1 and party A computes Vb = A2 · B2, where Va + Vb = A · B.

The security of this protocol is based on the algebraic property that 2n² equations cannot determine n × N unknowns when N ≫ n. Although the gradient descent approach is not demonstrated in this study, it is straightforward to implement it according to what is discussed in Chapter 2.
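
The protocol can be checked numerically in a few lines of NumPy (a toy demonstration of ours; the real protocol operates on aligned data matrices and adds further safeguards):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p = 4, 6, 3                 # n is even so that M splits in half
    A = rng.normal(size=(m, n))       # party A's private matrix
    B = rng.normal(size=(n, p))       # party B's private matrix

    M = rng.normal(size=(n, n))       # jointly generated invertible matrix
    Minv = np.linalg.inv(M)

    Ml, Mr = M[:, :n // 2], M[:, n // 2:]         # left/right halves of M
    Mt, Mb = Minv[:n // 2, :], Minv[n // 2:, :]   # top/bottom halves of M^-1

    A1, A2 = A @ Ml, A @ Mr           # computed by A; A1 is sent to B
    B1, B2 = Mt @ B, Mb @ B           # computed by B; B2 is sent to A

    Va = A1 @ B1                      # computed by party B
    Vb = A2 @ B2                      # computed by party A
    assert np.allclose(Va + Vb, A @ B)   # the shares sum to the true product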

Sparse Gradient Update Approaches

The sparse gradient update approaches preserve privacy by updating only a subset of the gradients. Such approaches trade accuracy for efficiency and provide only weak privacy, since the transmitted gradient entries remain in clear text. For example, Shokri and Shmatikov [2015] disclose only a subset of clear-text gradient parameters to the coordinator for model update. Strategies for improving communication efficiency include the structured update and the sketched update [Konecný et al., 2016a]. The structured update strategy updates only a sparse or low-rank gradient matrix, and the sketched update strategy utilizes subsampling and quantization to reduce the volume of the gradients.

Multi-objective evolutionary computation has also been explored in federated learning to learn sparse model parameters [Zhu and Jin, 2018]. Sparse gradient update and gradient compression are widely studied; however, there is little formal analysis of the privacy actually preserved by gradient compression.
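
As a simple illustration of the structured/sketched update idea (a sketch of ours, not the exact method of the cited works), a top-k sparsifier transmits only the largest-magnitude gradient entries:

    import numpy as np

    def top_k_sparsify(grad, k):
        # Keep only the k largest-magnitude entries; only their
        # (index, value) pairs need to be transmitted.
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        sparse = np.zeros_like(grad)
        sparse[idx] = grad[idx]
        return idx, grad[idx], sparse

    g = np.array([0.02, -1.3, 0.4, 0.001, 0.9, -0.05])
    idx, vals, g_sparse = top_k_sparsify(g, k=2)
    # 2 of 6 entries are sent; the untransmitted residual g - g_sparse
    # can be accumulated locally and added to later rounds.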

Obfuscation Approaches

Obfuscation approaches obfuscate data via randomization, generalization, or suppression, improving privacy at the cost of degraded accuracy. In federated learning, local differential privacy (LDP) can be used by applying an additive noise mask to the gradient of each party. Jiang et al. [2019] proposed an approach that applies independent Gaussian random projection (GRP) to protect the raw form of the training data: each participant first generates a Gaussian random matrix and projects its raw training data through it for obfuscation, and the obfuscated data are then sent to the coordinator for model training. Only the raw form of each participant's training data is protected, and the GRP used here also faces a scalability problem in terms of both the number of participants and the attribute dimension. The gradient quantization strategy quantizes each gradient value to an adjacent value, trading model accuracy for efficiency and privacy [Konecný et al., 2016a].
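
The gradient quantization strategy can be sketched as follows (a toy example of ours; real schemes use more careful encodings): each entry is stochastically rounded onto a small uniform grid so that the quantized gradient remains unbiased in expectation.

    import numpy as np

    def stochastic_quantize(grad, levels=4, rng=None):
        # Stochastically round each entry onto a uniform grid between
        # min and max (assumes max > min); unbiased: E[quantized] == grad.
        rng = rng or np.random.default_rng()
        lo, hi = grad.min(), grad.max()
        scale = (hi - lo) / (levels - 1)
        pos = (grad - lo) / scale          # fractional grid position
        floor = np.floor(pos)
        q = floor + (rng.random(grad.shape) < pos - floor)  # round up w.p. frac
        return lo + q * scale

    g = np.array([-0.31, 0.12, 0.97, -0.55])
    print(stochastic_quantize(g))          # each entry snapped to a grid point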

Cryptographic Approaches

The approaches discussed above disclose the clear-text gradient of each party to the coordinator or to the other parties. In contrast, cryptographic approaches leverage HE and MPC to preserve the privacy of each party's gradient during the gradient descent procedure. The security models vary from security against honest-but-curious adversaries to security against malicious adversaries, and the corruption assumptions also vary considerably. In addition to the security model, the information disclosed by each approach varies as well. Cryptographic approaches trade efficiency for privacy; as they can be very inefficient in computation or communication, approximations of nonlinear functions are generally introduced to further trade accuracy for efficiency.

The secure aggregation approaches introduce a coordinator that is only allowed to learn the clear-text average of a group of gradients. Secure aggregation preserves the privacy of each party's gradient via Shamir's threshold secret sharing scheme, so that the coordinator can only learn the average of a group of gradients [Bonawitz et al., 2016]. However, when the coordinator and n − 1 parties in a group of n parties collude, the remaining party's input gradient can easily be disclosed. Moreover, as such secure aggregation is anonymous, extreme poisoning attacks may occur. Bonawitz et al. [2016] propose the "trimmed mean," where the gradients are trimmed coordinate-wise to limit the influence of extreme poisoning adversaries.
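
A coordinate-wise trimmed mean can be sketched in NumPy as follows (a toy version of ours): in each coordinate, the most extreme values are dropped before averaging, which bounds the influence of a poisoned update.

    import numpy as np

    def trimmed_mean(updates, trim=1):
        # Drop the `trim` smallest and largest values in each coordinate,
        # then average the rest.
        stacked = np.sort(np.stack(updates), axis=0)
        return stacked[trim:len(updates) - trim].mean(axis=0)

    grads = [np.array([0.1, 0.2]), np.array([0.0, 0.3]),
             np.array([0.2, 0.1]), np.array([100.0, -100.0])]  # one poisoned
    print(trimmed_mean(grads))   # the extreme update is discarded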

Other cryptographic approaches introduce one or more non-colluding coordinators, while the coordinators are not allowed to learn anything about the gradient and the model. For HE-based approaches, this can be done by adding a random mask to the value to be decrypted [Liu et al., 2019]. For MPC-based approaches, the trusted dealer can be asked to generate computation-independent materials (e.g., Beaver triples) [Beaver, 1991]. When there are multiple non-colluding coordinators introduced, each party generates secret sharing of its private data and shares it with each coordinator accordingly [Mohassel and Zhang, 2017, Wagh et al., 2018]. The gradient descent is then conducted between the coordinators.

When it is infeasible to introduce a coordinator, coordinator-free cryptographic approaches can be adopted, where secure multi-party gradient descent is conducted so that each party learns nothing beyond its own input and output. Existing MPC protocols secure against a majority of colluding malicious adversaries include SPDZ [Damgård et al., 2011], SPDZ2k, Overdrive [Keller et al., 2018], and MASCOT [Keller et al., 2016]. In such approaches, an offline phase is implemented in which the Beaver triples are generated by MPC before the secure multi-party gradient descent. Bonawitz et al. [2016] demonstrate an actively secure MPC protocol based on SPDZ2k for decision trees and SVMs. The gradient descent functions can also be evaluated similarly based on the SPDZ2k protocol.

3.5    SUMMARY

This chapter presented a brief introduction to scalability-motivated DML and privacy-motivated DML. Scalability-motivated DML is widely employed for addressing the computation resource and memory limitations of large-scale ML problems. Parallelism techniques (such as data parallelism, model parallelism, and hybrid parallelism) are the major choices for implementing and expanding scalability-motivated DML systems. Privacy-motivated DML is primarily adopted for preserving user privacy and ensuring data security with decentralized data sources. MPC, HE, and DP are the main privacy-preserving techniques for realizing privacy-motivated DML. Privacy-preserving gradient descent methods have also been widely used for empowering privacy-motivated DML.

DML has received abundant attention in the past years and has rapidly developed into open-source and commercial products. Yet, there are still practical challenges that existing DML systems cannot address. Federated learning is a special type of DML, and it can further address the issues that conventional DML systems face, enabling us to build privacy-preserving AI. We will elaborate on federated learning in detail in the subsequent chapters.
