9
A Secure Data Learning Scheme for Big Data Applications in the Smart Grid

In this chapter, a secure data learning scheme is proposed for big data applications in the information and communication technology infrastructure of the smart grid. The proposed scheme allows multiple parties to find predictive models from their overall data without revealing their own private data to one another. Instead of deploying a centralized data learning process, the scheme distributes the learning tasks to the local learning parties as their own local data learning tasks, so that value is learned from local data and privacy is preserved. In addition, an associated secure scheme is proposed to guarantee the privacy of learning results during the information reassembly and value-response process. An evaluation is performed to verify the privacy of the training data set, as well as the accuracy of the learned weights. A case study is presented based on an open metering data analysis.

9.1 Background and Related Work

9.1.1 Motivation and Background

Nowadays, information and communication technology (ICT) infrastructure has been enabled by modern control techniques to tremendously improve the efficiency, reliability, and security of information systems [145]. Big data is an emerging topic due to its various applications [146]. However, it would not be so prevalent without the underlying support of the ICT, due to its extremely large volume of data and computing complexity. When a huge volume of data is quickly generated, it is crucial to process, aggregate, store, and analyze such a massive amount of data in real time [147, 148].

For large data sets acquired by ICT companies, the overall data set does not always reside on a single database; instead, it may be spread across disparate databases at various company locations. The challenge is that data analytics must cope with this big data problem, and many ICT companies lack the infrastructure to support such needs. In terms of possible approaches, a data-learning task carried out by a decision maker from an ICT company can be categorized as either a centralized learning task or a distributed local learning task. Currently, most schemes are designed and dedicated to managing traditional small-scale amounts of data in a centralized approach. Figure 9.1 shows an example of a centralized learning process for big data applications. However, due to the performance bottleneck of a central hub, access to the data on the central hub is relatively inefficient in the traditional scheme: data preprocessing can be slow, and information privacy can easily be jeopardized. In the traditional approach, the central hub cannot be relied upon to perform heavily loaded analytics on massive amounts of data. Meanwhile, few schemes have effectively incorporated strong security and privacy protection measures into the process of big data learning among multiple entities. Security and privacy are without doubt the top concerns, no matter which learning approach a decision maker adopts, because the aggregated data may contain crucial information about users' personal data and usage patterns.

Diagrammatic illustration of a traditional centralized learning process for big data applications.

Figure 9.1 Traditional centralized learning process for big data applications.

To close the gap, this chapter proposes a secure data learning scheme for multiple entities in an environment of big data applications. It focuses on overcoming the disadvantages of traditional centralized learning tasks with large data sets. Specifically, the scheme can push the major learning processes from the site of a central hub to the sites of multiple learning entities (i.e. every origin of aggregated data) in a way that reduces the communication overhead. Moreover, based on the current learning algorithms, a penalty term is enforced on the objective function to avoid the overfitting problem, which a learning scheme may encounter while solving common regression problems. Additionally, we design an associated secure scheme with privacy preservation and identity protection to obtain the final learning result. The major security objectives of our proposed scheme are as follows:

  1. Privacy. During the data learning process, only learning results are revealed by the learning entities, and even these revealed results must be controlled so as to maintain robust security against statistical attacks.
  2. Malicious learning entity detection. The secure learning scheme should be able to detect when an entity has been compromised and subjected to a malicious attack, so that the system can maintain a high level of reliability.

9.1.2 Related Work

Several learning algorithms have been proposed to target the privacy and security issues in the data learning process. For example, in [149], the authors studied a privacy‐preserving secure 2‐entity computation framework based on linear regression and classification. In [150], based on constructing artificial neural networks, the authors developed a deep learning system to enable multiple entities to learn an objective without sharing their own data sets. In [151], the authors surveyed the basic paradigms of secure multiple entities computation and discussed their relevance to the field of privacy‐preserving data mining schemes. In [152], the authors discussed approaches for privacy‐preserving learning of distributed data in different applications and presented a solution of combined components for specific privacy‐preserving data learning applications. In [153], the authors presented a cryptographically secure protocol for privacy‐preserving decision trees. In [154], based on a differential privacy model, the authors constructed a privacy‐preserving naïve Bayes classifier in a centralized scenario, where data miners could access a data set in a centralized way, and the data miner could deploy a classifier on the premise that the private information of data owners could not be inferred from the classification model.

9.2 Preliminaries

In this section, the fundamentals of generalized linear learning models are illustrated. The security model for big data applications is also given in this section.

9.2.1 Classic Centralized Learning Scheme

In this study, we adopt a supervised learning algorithm to evaluate the centralized data learning process. Given a large data set $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, the objective of a centralized learning scheme is to minimize a cost function $J(\theta)$, avoid the phenomenon of overfitting by penalizing heavily weighted parameters of complex models, and provide high accuracy of the predicted $\hat{y}$. In real-world applications, a preprocessed data set possessed by a learning entity is a matrix $X \in \mathbb{R}^{m \times n}$ with $m$ observations (rows) and $n$ features (columns).

Based on the applications, the main learning tasks for our proposed scheme are carried out under supervised learning algorithms. From the existing data set $\mathcal{D}$ as the training data set, our objectives are as follows:

  • To find a proper hypothesis function $h_\theta(x)$ to solve a regression problem, so that $h_\theta(x) \approx y$.
  • To minimize the cost function $J(\theta)$ (i.e. the empirical risk) of this hypothesis function $h_\theta$ such that $\theta^\ast = \arg\min_\theta J(\theta)$.

Note that a hypothesis function $h_\theta$ of a major supervised learning algorithm is mainly parameterized by the hypothesis weights $\theta$. In a normal case, a cost function $J(\theta)$ is applied to approximate how well a hypothesis function $h_\theta$ performs when the true output is $y$. For a data set possessed by one learning entity, we define the "true" risk $R(h)$ as

(9.1) $R(h) = \mathbb{E}_{(x, y) \sim \mathcal{P}}\big[L(h(x), y)\big]$

where $\mathcal{P}$ is the true distribution over $x$ and $y$. However, the risk $R(h)$ cannot be computed directly, since the distribution $\mathcal{P}$ is unknown to the learning algorithm. Therefore, we apply empirical risk minimization [155] to compute an approximation of $R(h)$ by averaging the loss function over a single training data set:

(9.2) $R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} L\big(h(x^{(i)}), y^{(i)}\big)$

Finally, the learning algorithm selects a hypothesis function $\hat{h}$ that minimizes the empirical risk as the optimal learning model of the centralized learning scheme. The hypothesis is defined as follows:

(9.3) $\hat{h} = \arg\min_{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$
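
To make the empirical risk minimization concrete, the following Python sketch evaluates Eq. (9.2) for a linear hypothesis and selects the minimizer in the spirit of Eq. (9.3). The quadratic loss, the synthetic data, and the finite candidate set are illustrative assumptions; the chapter's scheme searches over hypothesis weights with descent methods rather than by enumeration.

```python
# Minimal sketch of empirical risk minimization (Eqs. 9.2 and 9.3),
# assuming a linear hypothesis and a quadratic loss.
import numpy as np

def empirical_risk(theta, X, y):
    """Average quadratic loss of h_theta(x) = theta^T x over the data set."""
    residuals = X @ theta - y
    return 0.5 * np.mean(residuals ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # m = 100 observations, n = 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # synthetic targets

# Eq. (9.3): pick the candidate hypothesis with the smallest empirical risk.
candidates = [rng.normal(size=3) for _ in range(50)]
best = min(candidates, key=lambda th: empirical_risk(th, X, y))
print(best, empirical_risk(best, X, y))
```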

9.2.2 Supervised Learning Models

9.2.2.1 Supervised Regression Learning Model

The quadratic cost function $L$ for supervised linear regression algorithms is defined as follows:

(9.4) $L\big(h_\theta(x), y\big) = \frac{1}{2}\big(h_\theta(x) - y\big)^2$

For instance, if linear regression is used as the training model to solve this problem, then the hypothesis function $h_\theta(x)$ for the training data set is presented as follows:

(9.5) $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^{T} x$

where $\mathcal{H}$ is defined as the space of linear functions mapping from $\mathbb{R}^{n}$ to $\mathbb{R}$, and $x_j$ is the value of the $j$-th feature in a training sample. The quadratic cost function for the training data set is defined as follows:

(9.6) $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$

The cost function measures the distance between a prediction $h_\theta(x^{(i)})$ and the corresponding true output $y^{(i)}$ for a given input $x^{(i)}$.
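
As a small illustration of Eqs. (9.5) and (9.6), the sketch below implements the linear hypothesis with an intercept term $\theta_0$ and its quadratic cost; the example data is made up purely for demonstration.

```python
# Hedged sketch of the linear hypothesis (Eq. 9.5) and its quadratic
# cost over the training set (Eq. 9.6); data values are illustrative.
import numpy as np

def hypothesis(theta, X):
    # Prepend a column of ones so theta[0] acts as the intercept theta_0.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return Xb @ theta

def quadratic_cost(theta, X, y):
    m = X.shape[0]
    return np.sum((hypothesis(theta, X) - y) ** 2) / (2.0 * m)

# Three observations, two features, theta = (theta_0, theta_1, theta_2).
X = np.array([[1.0, 2.0], [0.5, 1.0], [2.0, 0.0]])
y = np.array([3.0, 1.5, 2.0])
theta = np.array([0.0, 1.0, 1.0])
print(quadratic_cost(theta, X, y))     # 0.0: this theta fits exactly
```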

9.2.2.2 Regularization Term

For generalized linear learning models, the cost function $J(\theta)$ is usually treated as convex [156]; thus a summation of multiple cost functions remains convex. In order to solve these specific optimization problems, we first transform them into unconstrained minimization problems. Readers should refer to [157] for detailed methods of solving an unconstrained problem.

With a relatively small local data set, learning models may encounter an overfitting problem [158], especially when empirical risk minimization is adopted in the models. In order to prevent overfitting while minimizing the empirical risk $R_{\mathrm{emp}}(h)$, a penalty term is added to the learning model to form an $\ell_2$ ridge regularized regression [159]. This process aims at minimizing the squares of the parameters. Performing $\ell_2$ ridge regularization not only effectively prevents the overfitting problem but also regularizes the unconstrained minimization problem. The penalty approach used in this scheme is defined as follows:

(9.7) $J_{\mathrm{reg}}(\theta) = J(\theta) + \lambda \lVert \theta \rVert_2^2$

where $\lambda$ is treated as the penalty coefficient of the $\ell_2$ ridge penalty term $\lVert \theta \rVert_2^2$. Normally, an unconstrained minimization problem with a convex function can be solved by existing descent optimization algorithms, e.g. the gradient descent method and the classic Newton's method. However, the learning rate $\alpha$ used in a descent method is hard to choose. In this study, we are inspired by the stochastic gradient descent method [160] to perform fast computation.
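
A minimal sketch of the ridge-regularized objective in Eq. (9.7) and a single stochastic-gradient update follows; the learning rate $\alpha$ and penalty coefficient $\lambda$ values here are illustrative choices, not values prescribed by the chapter.

```python
# Hedged sketch of Eq. (9.7) plus one stochastic-gradient step per
# observation; alpha and lam are illustrative assumptions.
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) + lam * ||theta||_2^2 for a linear hypothesis."""
    m = X.shape[0]
    return np.sum((X @ theta - y) ** 2) / (2.0 * m) + lam * np.sum(theta ** 2)

def sgd_step(theta, x_i, y_i, lam, alpha=0.01):
    """Update theta from a single observation (x_i, y_i)."""
    grad = (x_i @ theta - y_i) * x_i + 2.0 * lam * theta  # loss + penalty
    return theta - alpha * grad

rng = np.random.default_rng(0)
X, theta = rng.normal(size=(50, 4)), np.zeros(4)
y = X @ np.array([1.0, 0.5, -1.0, 2.0])
for i in range(X.shape[0]):               # one pass over the local data
    theta = sgd_step(theta, X[i], y[i], lam=0.01)
print(ridge_cost(theta, X, y, lam=0.01))
```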

9.2.3 Security Model

The security model used for analysis is a semimalicious model, which is a subcategory of secure multiple-entity computation [161]. In the semimalicious model, all the entities are preregistered with the central hub. In particular, the central hub is a trusted learning entity, while some of the other entities are treated as potentially malicious identities compromised by an adversarial group. A compromised, malicious learning entity may behave as follows:

  1. Refuse to participate in the learning scheme.
  2. Deliberately substitute or falsify its local learning result.
  3. In the worst case, viciously abort the learning scheme prematurely.

From the preceding discussion, there are two major security and privacy objectives in designing such a learning scheme: (i) during the learning process, only the final training result is revealed, with protection; local data information must remain private under all circumstances; and (ii) during the learning process, a compromised, malicious learning entity must be detectable by the other normal entities while the learning scheme is running. In the next section, we illustrate how the secure learning scheme performs the distributed data learning tasks while identifying malicious entities and guaranteeing that the security and privacy of data information are not viciously exposed to others.

9.3 Secure Data Learning Scheme

In this section, we present the secure data learning scheme. We first introduce the proposed data learning scheme, with a focus on the algorithm design. We then introduce the associated security scheme with respect to data privacy and identity protection.

9.3.1 Data Learning Scheme

A learning entity $P_i$ fits its hypothesis function $h_{\theta_i}$ when a data set $\mathcal{D}_i$ arrives during a fixed time interval $T$. For this learning entity $P_i$, the cost function of its hypothesis function $h_{\theta_i}$ is denoted $J_i(\theta_i)$. In light of the principles of descent algorithms, during the $t$-th iteration of this learning study, we formulate this as a minimization problem for a learning entity $P_i$, with the objective of minimizing the cost function

(9.8) $\min_{\theta_i} \; J_i(\theta_i) = \frac{1}{2m_i} \sum_{j=1}^{m_i} \big(h_{\theta_i}(x^{(j)}) - y^{(j)}\big)^2 + \lambda \lVert \theta_i \rVert_2^2$

In the beginning, each learning entity trains only its own private local data to fit its own training model. The local training model is relatively suitable for the local learning entity; however, it would most likely be a poor fit as a training model for the overall data set. Hence, the objective becomes to solve Eq. (9.8) and to find the optimal value of the hypothesis weight vector $\theta_i$ from the local data $X_i$ and $y_i$.

In order to find the optimal value of the hypothesis weight vector $\theta_i$ from $X_i$ and $y_i$, we need to find the solution to Eq. (9.8) by calculating the partial derivatives of $J_i(\theta_i)$. Note that some of the learning models discussed before can be applied to the hypothesis function to solve certain regression problems. As a general process, during the $t$-th data learning iteration, the learning model of entity $P_i$ updates its hypothesis function $h_{\theta_i}$ based on the empirical risk minimization discussed before. Inspired by the stochastic gradient descent algorithm, a detailed learning algorithm to obtain the hypothesis weight vector $\theta_i^{(t)}$ for an entity $P_i$ in the $t$-th iteration is presented in Algorithm 9.1. For simplicity, we use $\theta_i^{(t)}$ as the learning weight vector for a local training data set $\mathcal{D}_i$, and write $J_i^{(t)}$ and $h_i^{(t)}$ for the cost and hypothesis function of $P_i$ during the $t$-th iteration, respectively.

Algorithm 9.1 Local data learning for an entity $P_i$ during the $t$-th iteration (stochastic gradient descent with an $\ell_2$ ridge penalty).

In Algorithm 9.1, $\theta_i^\ast$ represents the current optimal weight before the $\ell_2$ ridge penalty term is applied. For different supervised learning models, there are different hypothesis functions $h_\theta$. For example, if a polynomial regression problem is encountered, the gradient of the cost function is $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)}$, where $(i, j)$ indexes each observation and feature of the data set. If a classification problem is encountered, then the gradient of a 0–1 logistic regression cost function takes the same form, with the sigmoid hypothesis $h_\theta(x) = 1 / (1 + e^{-\theta^{T} x})$. Looking into the hypothesis weight vector $\theta_i^{(t)}$, we can see that the output of Algorithm 9.1 depends on the chosen supervised learning algorithm: it minimizes the cost function and makes the training result better fit the learning model.
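
Since Algorithm 9.1 itself is given as a figure, the following Python sketch reconstructs its gist from the surrounding prose: per-observation stochastic-gradient updates with an $\ell_2$ ridge penalty over the local data set. The shuffling, epoch count, and stopping rule are assumptions not stated in the text.

```python
# Hedged reconstruction of Algorithm 9.1: entity P_i sweeps its local
# data D_i one observation at a time, applying the per-sample gradient
# and the l2 ridge shrinkage. Hyperparameters are illustrative.
import numpy as np

def local_learning_round(theta, X, y, lam=0.1, alpha=0.01, epochs=1, seed=0):
    """Return the local weight vector theta_i^(t) after one learning round."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):       # stochastic order
            grad = (X[i] @ theta - y[i]) * X[i]     # per-sample gradient
            theta = theta - alpha * (grad + 2.0 * lam * theta)
    return theta
```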

9.3.2 The Proposed Security Scheme

9.3.2.1 Privacy Scheme

In this part we present the proposed privacy scheme. As shown in Figure 9.1, the feature normalization step for a large aggregated data set is crucial. When features differ by orders of magnitude, a standard feature normalization process performs feature scaling so that descent algorithms can converge faster. During the aggregation process, our objective is to find an associated privacy scheme that can be incorporated into the feature normalization process.

For a joint data set, the goal of $z$-score normalization is to rescale the features so that they have the properties of a standard normal distribution with $\mu = 0$ and $\sigma = 1$; the standard score of each data entry is calculated as $z = (x - \mu)/\sigma$. By subtracting the mean value of each feature from the data set and scaling the feature values by their respective standard deviations, the values in all data entries are brought into a comparable range centered at zero. In light of this, the exact values of the data content in every observation and feature have been normalized and differ from the original data. By performing $z$-score normalization, we can expedite the process of data learning as the sample size and the number of learning entities increase. After the $z$-score normalization, the local value $X_i$ can likewise be represented as a $z$-score-based matrix.

$z$-score normalization has already obscured the original data content from malicious users. However, in order to fully achieve the goal of preserving data privacy, it is also necessary to prevent exposure of the real size of the data set. A noise $\epsilon$ between 0 and 1 is therefore added during normalization in each learning entity $P_i$ to increase security robustness against statistical analysis. When the features are normalized, it is important to protect the related values used for the normalization computation; otherwise, after obtaining the parameters of the model, malicious attackers would still be able to recover the raw values, perform predictions, and compromise the integrity of the data. A well-set noise serves as a good disguise against such passive attacks. A sketch of the combined step is shown below.
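
The following is a hedged sketch of the combined normalization-and-noise step: each feature is $z$-score normalized while a noise $\epsilon \in (0, 1)$ is folded into the sample size to mask the true $m$. The exact point at which $\epsilon$ enters the statistics is an assumption; Section 9.3.4 describes how $\epsilon$ is derived.

```python
# Hedged sketch, assuming the noise epsilon perturbs the effective
# sample size used in the z-score statistics; this placement is an
# assumption, not the chapter's exact construction.
import numpy as np

def noisy_z_score(X, epsilon):
    m_noisy = X.shape[0] + epsilon           # masked sample size, m + eps
    mu = X.sum(axis=0) / m_noisy             # perturbed feature means
    sigma = np.sqrt(((X - mu) ** 2).sum(axis=0) / m_noisy)  # perturbed stds
    return (X - mu) / sigma

X = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))
Z = noisy_z_score(X, epsilon=0.37)           # epsilon in (0, 1)
print(Z.mean(axis=0), Z.std(axis=0))         # approximately 0 and 1
```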

9.3.2.2 Identity Protection

If a learning entity $P_i$ is compromised by a malicious party, the trusted central hub should be able to detect the abnormality immediately. Once compromised, a $P_i$ normally will not reveal its malicious identity to the central hub; instead, it keeps its malicious identity anonymous and deliberately falsifies or substitutes the local learning result it submits to the central hub, with the purpose of jeopardizing the learning scheme and the overall learning result. Therefore, our objective is to determine whether the identity of a $P_i$ is legitimate, without requesting any identity information from a malicious $P_i$ itself, and to report a malicious $P_i$ to the system administrator.

Specifically, a fast security solution inspired by the zero-knowledge proof [162] is proposed. A $P_i$ first computes the 512-bit hash value $H(\theta_i^{(t)})$ of its local learning result during the $t$-th iteration. $P_i$ then combines it with a time-stamp $TS_1$ as a joint message $M_1$, encrypts the ticket using its own private key $K_i^{-}$, and sends the encrypted message $E_{K_i^{-}}(M_1)$ to the central hub. Once the central hub receives the encrypted ticket, it sends the 512-bit hash value $H(\theta^{(t)})$ of the current overall learning value back to $P_i$. After receiving it, $P_i$ concatenates $H(\theta_i^{(t)})$, $H(\theta^{(t)})$, and an updated time-stamp $TS_2$ into a joint message $M_2$, encrypts it with its private key $K_i^{-}$, and sends $E_{K_i^{-}}(M_2)$ to the central hub again. Since all the entities are preregistered, the central hub then looks up $P_i$'s public key $K_i^{+}$ in a repository of public keys for all entities located at the trusted side. Once $K_i^{+}$ is obtained, the central hub can decrypt the encrypted joint message to perform the identity check and determine whether $P_i$ is truly legitimate. If $P_i$ is legitimate, the central hub grants access so that $P_i$ submits its local learning result to the central hub for the $t$-th iteration.
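
The exchange can be sketched as follows; `sign()` is a placeholder standing in for the chapter's encryption with the entity's private key (a real deployment would use RSA or ECDSA signatures verified with the registered public key), and the message layout is an assumption.

```python
# Hedged sketch of the Figure 9.2 exchange; sign() is a keyed-hash
# placeholder for real private-key operations, illustrative only.
import hashlib
import time

def h512(data: bytes) -> str:
    """512-bit hash value, as used for both local and overall results."""
    return hashlib.sha512(data).hexdigest()

def sign(message: str, private_key: str) -> str:
    # Placeholder "signature": keyed hash of the message.
    return h512((private_key + message).encode())

entity_key = "P_i-private-key"            # assumed per-entity secret

# Step 1: P_i hashes its local result theta_i^(t), adds time-stamp TS1.
m1 = h512(b"theta_i_t") + "|" + str(time.time())
ticket1 = (m1, sign(m1, entity_key))      # sent to the central hub

# Step 2: the hub replies with the hash of the overall learning value.
hub_hash = h512(b"theta_t_overall")

# Step 3: P_i concatenates both hashes with a fresh time-stamp TS2.
m2 = m1 + "|" + hub_hash + "|" + str(time.time())
ticket2 = (m2, sign(m2, entity_key))      # sent to the central hub

# Step 4: the hub verifies the ticket; with the placeholder sign() this
# is recomputation, whereas real verification would use P_i's public key.
assert ticket2[1] == sign(ticket2[0], entity_key)
print("identity check passed")
```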

Diagrammatic illustration of a proposed security scheme based on zero-knowledge proof.

Figure 9.2 Proposed security scheme based on zero‐knowledge proof.

The proposed security scheme based on zero-knowledge proof is illustrated in Figure 9.2. Note that, to address security concerns, the scheme should be executed at the beginning of each data learning iteration, so that any abnormal operation occurring in an iteration can be detected. A nice property of this zero-knowledge-proof-based scheme is that an entity can easily be verified to possess certain information based solely on its basic public information; an entity's challenge can be proven without revealing the secret information itself. A security analysis of this associated security scheme is presented in Section 9.3.4 to demonstrate its robustness.

9.3.3 Analysis of the Learning Process

In the proposed algorithm, we repeatedly run through the data set, and each time we encounter a learning example, we update the parameters according to the gradient of the error term with respect to that single learning example only. The algorithm can therefore start the learning process immediately and keep track of it with each observation. As discussed in Section 9.3.1, this stochastic-gradient-descent-inspired approach reaches the optimal weights much faster and is a reasonably good choice when facing a large data set.

We would also expect that, as the number of iterations increases, the minimized value of the cost function converges. For a convex cost function $J$, this also means that the hypothesis function $h_{\theta^{(t)}}$ approaches the optimal hypothesis $h_{\theta^\ast}$, and $J(\theta^{(t)})$ approaches $J(\theta^\ast)$ within a very small tolerance.

9.3.4 Analysis of the Security

In order to fully achieve the goal of preserving data privacy, the protection of the noise $\epsilon$ is crucial, in terms of both its generation and its computation during the $z$-score normalization. Here we consider $\epsilon$ to be generated from an irreversible hash function, whose output is converted into a decimal value; the noise $\epsilon$ is then fixed into the range of 0 and 1 by scaling that decimal value.

During the verification process, a reverse computation can be adopted to check the hashed form of the weights $\theta$ against the decimal hash value. Since the hash function is computationally irreversible, the range of the noise $\epsilon$ is guaranteed in this case. When it comes to anonymizing the sample size of a data set, $\epsilon$ plays an important role in providing anonymity for the local value. Even if an adversary obtains the dimension information of a possible local value, the attack would still be in vain, since $\epsilon$ is not computationally accessible to the adversary. Additionally, $\epsilon$ lying in the range of 0 and 1 retains the good computational efficiency inherited from the $z$-score normalization.
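
A minimal sketch of this noise generation is given below, assuming a per-entity secret as the hash input (the chapter does not specify the input) and a simple scaling of the digest into $[0, 1)$.

```python
# Hedged sketch of noise generation per Section 9.3.4; the hash input
# and the scaling into [0, 1) are assumptions.
import hashlib

def make_noise(secret: bytes) -> float:
    digest = hashlib.sha512(secret).digest()       # irreversible hash
    d = int.from_bytes(digest, "big")              # decimal value of the hash
    return d / 2 ** (8 * len(digest))              # scale into [0, 1)

eps = make_noise(b"entity-1-secret")
assert 0.0 <= eps < 1.0
print(eps)
```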

9.4 Smart Metering Data Set Analysis—A Case Study

In this section, we describe a case study on using the proposed secure data learning scheme to solve the regression problem of smart metering data sets from the UMass Trace Repository [163]. We first introduce the networking model of a smart grid AMI, then perform regression work on the data sets.

9.4.1 Smart Grid AMI and Metering Data Set

As one of the most critical infrastructures, the smart grid has adopted advanced and powerful ICT to improve many aspects of the power grid [7, 164–171, 147]. In a smart grid advanced metering infrastructure (AMI), as shown in Figure 9.3, a data aggregation point (DAP) aggregates massive amounts of smart metering data from distributed smart meters at each household, preprocesses the aggregated data, and forwards the preprocessed data to the operation center via the smart grid's wide-area networks (WAN) through a local master gateway of the designated AMI concentrator. This high volume of collected data, which contains summaries of power usage and personal behavior patterns, is analyzed by a big data platform and a metering data management system (MDMS) for accurate real-time control and optimized resource allocation.

Diagrammatic illustration of the ICT architecture of advanced metering infrastructure (AMI) in the smart grid.

Figure 9.3 The ICT architecture of AMI in the smart grid.

Table 9.1 A sample set of smart metering data.

Name      | Meter # | Current | Previous | ⋯ | Total (kWh)
Suite 101 | 45600   | 321.2   | 485.2    | ⋯ | 258.7
Suite 102 | 45601   | 320.8   | 483.9    | ⋯ | 341.0
Suite 103 | 45602   | 322.7   | 486.1    | ⋯ | 493.2
⋮         | ⋮       | ⋮       | ⋮        | ⋱ | ⋮
Suite 899 | 47212   | 320.3   | 482.9    | ⋯ | 234.6

An example smart metering data set, shown in Table 9.1, is based on frequent metering reports from smart meters. Normally, in order to make efficient energy-buying decisions based on usage patterns, to perform power-theft detection, and to correct its service performance, a utility company maintains a corresponding matrix $Y$ with the same $m$ observations and one feature, for example, power usage quality. As stated in the system model, a DAP $P_i$ trains its hypothesis function $h_{\theta_i}$ when a data set $\mathcal{D}_i$ arrives during a fixed time interval $T$. During this interval, each smart meter reports its metering data to its own DAP. Since each DAP governs a different number of smart meters, we define the DAP $P_i$ as governing $s_i$ smart meters. Therefore, the total data matrix that a DAP $P_i$ receives during a time interval $T$ has $m_i$ observations and $n$ features. Considering that all the metering data is reported within a fixed short time interval, the sample size is likely to be relatively small. As discussed in Section 9.2.2.2, this may lead to an overfitting problem for learning models, especially when empirical risk minimization is adopted.

In a DAP $P_i$, the cost function can be written as $J_i(\theta_i)$. In this case study, during the $t$-th iteration we formulate the task as a minimization problem for a DAP learning entity $P_i$, with the objective function

(9.9) $\min_{\theta} \; \frac{1}{2m} \sum_{j=1}^{m} \big(h_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \lVert \theta \rVert_2^2$

where $m$ represents the total number of observations of the joint data set among the $N$ DAP learning entities during time interval $T$, and $x^{(j)}$ represents the $j$-th observation in the data set. In this study, the concentrator is surrounded by $N$ DAPs forming a mesh-based graph model [166], meeting the requirement of robust and reliable neighborhood-area communications [167].

9.4.2 Regression Study

A regression study is conducted based on the data sets from [163]. In this scenario, 30 data learning entities are involved in training their own local data, and a central-hub-based model is established to perform the proposed scheme. In the data set, metering information is reported every five minutes throughout a day. Several features, such as temperature, humidity, wind degree and direction, etc., are contained in the environmental data set, forming the matrix $X$, while the power usage is recorded as $y$ in watts. Six features, including "insideTemp," "outsideTemp," "outsideHumidity," "windSpeed," "windDirectionDegrees," and "windGust," are selected as the learning features, and a ridge-based regression task is conducted among all 30 local entities.

Without loss of generality, we show the hypothesis weights of learning entities 1, 5, and 6 in Figures 9.4, 9.5, and 9.6, respectively, with respect to the ridge coefficient $\lambda$. In each figure, six lines represent the coefficient paths of the six chosen features. In the beginning, each weight vector $\theta_i$ of the different learning entities is randomly sampled from a Gaussian distribution, as seen in each figure's different starting points. As $\lambda$ increases toward a certain threshold, the weights found by the regression models of the different learning parties begin to stabilize. Meanwhile, the regularized weights become smaller and eventually converge to zero.
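
The regularization-path behavior in Figures 9.4–9.6 can be reproduced in spirit with closed-form ridge solutions on synthetic data; the data below merely stands in for the UMass metering features, and the $\lambda$ grid is an arbitrary choice.

```python
# Hedged sketch of a ridge regularization path on synthetic data with
# six features; it illustrates weights shrinking as lambda grows, not
# the chapter's actual experiment.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))              # six environmental features
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=200)

for lam in (1e-3, 1e-1, 1e1, 1e3):
    # Closed-form ridge estimate: (X^T X + lam I)^{-1} X^T y
    theta = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)
    print(f"lambda={lam:g}  ||theta||={np.linalg.norm(theta):.4f}")
```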

Graphical illustration of the regularization results using learning entity 1.

Figure 9.4 Regularization results using learning entity 1.

Graphical illustration of the regularization results using learning entity 5.

Figure 9.5 Regularization results using learning entity 5.

Graphical illustration of the regularization results using learning entity 6.

Figure 9.6 Regularization results using learning entity 6.

Once the initial hypothesis weights from the ridge-based regression are obtained locally, the iteration process starts as well. By adopting the modified gradient descent algorithm, the values of the cost functions $J_i$ decrease as the iterations proceed, as shown in Figure 9.7. After approximately 50 iterations, convergence is achieved among all the learning entities. As a comparison, the overall data set, aggregated through the aggregation protocol and processed through the centralized data learning scheme, retains a larger cost, because the sample size of the overall data set is larger than that of any local data set.

Graphical illustration of convergence of the values of the cost functions $J_i$.

Figure 9.7 Convergence of the values of the cost functions $J_i$.

Moreover, we can see that the cost value of each function $J_i$ remains quite large. The results could be improved if a nonlinear regression model were carefully chosen. On the local side, each entity performs with almost the same cost, which reflects that the fixed-frequency power consumption reports produce local data sets of regular, similar size. In summary, the behavior follows the nature of descent algorithms: when the value of the cost function $J$ converges, the hypothesis weights $\theta$ converge.

9.5 Conclusion and Future Work

In this chapter, we presented a secure data learning scheme for multiple entities in an environment of big data applications. We considered the scenario where multiple entities, acting as data holders, intend to find predictive models from their overall data. We proposed a scheme in which the scalability of data learning is used as an alternative computation technique to provide security to multiple entities. The proposed scheme was inspired by descent algorithms and considered ways to avoid the overfitting problem. Additionally, an associated secure scheme was proposed to protect the privacy of local information against content leakage and statistical attacks, as well as to protect identities. Both a theoretical learning analysis and a security analysis were provided for this scheme. A case study was conducted by adopting the proposed scheme to solve a regression problem and investigate smart metering data in the smart grid. The proposed scheme targets the computation and security drawbacks of centralized training tasks in big data applications. Future research in this area may consider massive data sets and other supervised learning models to provide more accurate predictions.
