9
A Secure Data Learning Scheme for Big Data Applications in the Smart Grid

In this chapter, a secure data learning scheme is proposed for big data applications in the information and communication technology infrastructure of the smart grid. The proposed scheme allows multiple parties to find predictive models from their overall data without revealing their own private data to one another. Instead of deploying a centralized data learning process, the scheme distributes the learning tasks to the local learning parties as their own local data learning tasks, so that value is learned from local data and privacy is preserved. In addition, an associated secure scheme is proposed to guarantee the privacy of learning results during the information reassembly and value-response process. An evaluation is performed to verify the privacy of the training data set, as well as the accuracy of the learned weights. A case study is presented based on an open metering data analysis.

9.1 Background and Related Work

9.1.1 Motivation and Background

Nowadays, information and communication technology (ICT) infrastructure has been enabled by modern control techniques to tremendously improve the efficiency, reliability, and security of information systems [145]. Big data is an emerging topic due to its various applications [146]. However, it would not be so prevalent without the underlying support of the ICT, due to its extremely large volume of data and computing complexity. When a huge volume of data is quickly generated, it is crucial to process, aggregate, store, and analyze such a massive amount of data in real time [147, 148].

For large data sets acquired by ICT companies, the overall data set does not always reside on a single database; instead, it may be spread across disparate databases at various company locations. The challenge is that data analytics must cope with this big data problem, and many ICT companies lack the infrastructure to support such needs. In terms of possible approaches, a data-learning task carried out by a decision maker from an ICT company can be categorized as either a centralized learning task or a distributed local learning task. Currently, most schemes are designed and dedicated to managing traditional small-scale amounts of data in a centralized approach. Figure 9.1 shows an example of a centralized learning process for big data applications. However, due to the performance bottleneck of a central hub, access to the data on the central hub is relatively inefficient in the traditional scheme: data preprocessing can be slow, and information privacy can easily be jeopardized. In the traditional approach, the central hub cannot be relied upon to perform heavily loaded analytics on massive amounts of data. Meanwhile, few schemes have effectively incorporated strong security and privacy protection measures into the process of big data learning among multiple entities. Security and privacy are without doubt the top concerns, no matter which learning approach a decision maker adopts, because the aggregated data may contain crucial information about users' personal data and usage patterns.

Diagrammatic illustration of a traditional centralized learning process for big data applications.

Figure 9.1 Traditional centralized learning process for big data applications.

To close the gap, this chapter proposes a secure data learning scheme for multiple entities in an environment of big data applications. It focuses on overcoming the disadvantages of traditional centralized learning tasks with large data sets. Specifically, the scheme can push the major learning processes from the site of a central hub to the sites of multiple learning entities (i.e. every origin of aggregated data) in a way that reduces the communication overhead. Moreover, based on the current learning algorithms, a penalty term is enforced on the objective function to avoid the overfitting problem, which a learning scheme may encounter while solving common regression problems. Additionally, we design an associated secure scheme with privacy preservation and identity protection to obtain the final learning result. The major security objectives of our proposed scheme are as follows:

  1. Privacy. During the data learning process, only learning results are revealed by the learning entities, and even these revealed results must be controlled so as to maintain robust security against statistical attacks.
  2. Malicious learning entity detection. The secure learning scheme should be able to detect when an entity has been compromised and subjected to a malicious attack, so that the system can maintain a high level of reliability.

9.1.2 Related Work

Several learning algorithms have been proposed to target the privacy and security issues in the data learning process. For example, in [149], the authors studied a privacy‐preserving secure 2‐entity computation framework based on linear regression and classification. In [150], based on constructing artificial neural networks, the authors developed a deep learning system to enable multiple entities to learn an objective without sharing their own data sets. In [151], the authors surveyed the basic paradigms of secure multiple entities computation and discussed their relevance to the field of privacy‐preserving data mining schemes. In [152], the authors discussed approaches for privacy‐preserving learning of distributed data in different applications and presented a solution of combined components for specific privacy‐preserving data learning applications. In [153], the authors presented a cryptographically secure protocol for privacy‐preserving decision trees. In [154], based on a differential privacy model, the authors constructed a privacy‐preserving naïve Bayes classifier in a centralized scenario, where data miners could access a data set in a centralized way, and the data miner could deploy a classifier on the premise that the private information of data owners could not be inferred from the classification model.

9.2 Preliminaries

In this section, the fundamentals of generalized linear learning models are illustrated. The security model for big data applications is also given in this section.

9.2.1 Classic Centralized Learning Scheme

In this study, we adopt a supervised learning algorithm to evaluate the centralized data learning process. Given a large data set $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, the objective of a centralized learning scheme is to minimize a cost function $J(\theta)$, avoid the phenomenon of overfitting by penalizing heavily weighted parameters of complex models, and provide high accuracy of the predicted $\hat{y}$. In real-world applications, a preprocessed data set possessed by a learning entity is a matrix $X \in \mathbb{R}^{m \times n}$ with $m$ observations (rows) and $n$ features (columns).

Based on the applications, the main learning tasks for our proposed scheme are carried out under supervised learning algorithms. From the existing data set $\mathcal{D}$ as the training data set, our objectives are as follows:

  • To find a proper hypothesis function $h_\theta(x)$ to solve a regression problem, so that $h_\theta(x) \approx y$.
  • To minimize the cost function $J(\theta)$ (i.e. the empirical risk) of this hypothesis function $h_\theta$ such that $\theta^\ast = \arg\min_\theta J(\theta)$.

Note that a hypothesis function $h_\theta$ of a major supervised learning algorithm is mainly parameterized by the hypothesis weights $\theta$. In a normal case, a cost function $J(\theta)$ is applied to approximate how well a hypothesis function $h_\theta$ performs when the true output is $y$. For a data set possessed by one learning entity, we define the "true" risk $R(h)$ as

(9.1) $R(h) = \mathbb{E}_{(x, y) \sim \mathcal{P}}\big[L(h(x), y)\big]$

where $\mathcal{P}$ is the true distribution over $x$ and $y$. However, the risk $R(h)$ cannot be computed directly, since the distribution $\mathcal{P}$ is unknown to the learning algorithm. Therefore, we apply empirical risk minimization [155] to compute an approximation of $R(h)$ by averaging the loss function over a single training data set:

(9.2) $R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} L\big(h(x^{(i)}), y^{(i)}\big)$

Finally, the learning algorithm selects a hypothesis function $\hat{h}$ that minimizes the empirical risk as the optimal learning model of the centralized learning scheme. The hypothesis is defined as follows:

(9.3) $\hat{h} = \arg\min_{h \in \mathcal{H}} R_{\mathrm{emp}}(h)$
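
To make the empirical risk minimization concrete, the following Python sketch evaluates Eq. (9.2) for a linear hypothesis and selects the minimizer in the spirit of Eq. (9.3). The quadratic loss, the synthetic data, and the finite candidate set are illustrative assumptions; the chapter's scheme searches over hypothesis weights with descent methods rather than by enumeration.

```python
# Minimal sketch of empirical risk minimization (Eqs. 9.2 and 9.3),
# assuming a linear hypothesis and a quadratic loss.
import numpy as np

def empirical_risk(theta, X, y):
    """Average quadratic loss of h_theta(x) = theta^T x over the data set."""
    residuals = X @ theta - y
    return 0.5 * np.mean(residuals ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # m = 100 observations, n = 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # synthetic targets

# Eq. (9.3): pick the candidate hypothesis with the smallest empirical risk.
candidates = [rng.normal(size=3) for _ in range(50)]
best = min(candidates, key=lambda th: empirical_risk(th, X, y))
print(best, empirical_risk(best, X, y))
```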

9.2.2 Supervised Learning Models

9.2.2.1 Supervised Regression Learning Model

The quadratic cost function $L$ for supervised linear regression algorithms is defined as follows:

(9.4) $L\big(h_\theta(x), y\big) = \frac{1}{2}\big(h_\theta(x) - y\big)^2$

For instance, if linear regression is used as the training model to solve this problem, then the hypothesis function $h_\theta(x)$ for the training data set is presented as follows:

(9.5) $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n = \theta^{T} x$

where $\mathcal{H}$ is defined as the space of linear functions mapping from $\mathbb{R}^{n}$ to $\mathbb{R}$, and $x_j$ is the value of the $j$-th feature in a training sample. The quadratic cost function for the training data set is defined as follows:

(9.6) $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$

The cost function measures the distance between a prediction $h_\theta(x^{(i)})$ and the corresponding true output $y^{(i)}$ for a given input $x^{(i)}$.
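
As a small illustration of Eqs. (9.5) and (9.6), the sketch below implements the linear hypothesis with an intercept term $\theta_0$ and its quadratic cost; the example data is made up purely for demonstration.

```python
# Hedged sketch of the linear hypothesis (Eq. 9.5) and its quadratic
# cost over the training set (Eq. 9.6); data values are illustrative.
import numpy as np

def hypothesis(theta, X):
    # Prepend a column of ones so theta[0] acts as the intercept theta_0.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return Xb @ theta

def quadratic_cost(theta, X, y):
    m = X.shape[0]
    return np.sum((hypothesis(theta, X) - y) ** 2) / (2.0 * m)

# Three observations, two features, theta = (theta_0, theta_1, theta_2).
X = np.array([[1.0, 2.0], [0.5, 1.0], [2.0, 0.0]])
y = np.array([3.0, 1.5, 2.0])
theta = np.array([0.0, 1.0, 1.0])
print(quadratic_cost(theta, X, y))     # 0.0: this theta fits exactly
```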

9.2.2.2 Regularization Term

For generalized linear learning models, the cost function $J(\theta)$ is usually treated as convex [156]; thus a summation of multiple cost functions remains convex. In order to solve these specific optimization problems, we first transform them into unconstrained minimization problems. Readers should refer to [157] for detailed methods of solving an unconstrained problem.

With a relatively small local data set, learning models may encounter an overfitting problem [158], especially when empirical risk minimization is adopted in the models. In order to prevent overfitting while minimizing the empirical risk $R_{\mathrm{emp}}(h)$, a penalty term is added to the learning model to form an $\ell_2$ ridge regularized regression [159]. This process aims at minimizing the squares of the parameters. Performing $\ell_2$ ridge regularization not only effectively prevents the overfitting problem but also regularizes the unconstrained minimization problem. The penalty approach used in this scheme is defined as follows:

(9.7) $J_{\mathrm{reg}}(\theta) = J(\theta) + \lambda \lVert \theta \rVert_2^2$

where $\lambda$ is treated as the penalty coefficient of the $\ell_2$ ridge penalty term $\lVert \theta \rVert_2^2$. Normally, an unconstrained minimization problem with a convex function can be solved by existing descent optimization algorithms, e.g. the gradient descent method and the classic Newton's method. However, the learning rate $\alpha$ used in a descent method is hard to choose. In this study, we are inspired by the stochastic gradient descent method [160] to perform fast computation.
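
A minimal sketch of the ridge-regularized objective in Eq. (9.7) and a single stochastic-gradient update follows; the learning rate $\alpha$ and penalty coefficient $\lambda$ values here are illustrative choices, not values prescribed by the chapter.

```python
# Hedged sketch of Eq. (9.7) plus one stochastic-gradient step per
# observation; alpha and lam are illustrative assumptions.
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) + lam * ||theta||_2^2 for a linear hypothesis."""
    m = X.shape[0]
    return np.sum((X @ theta - y) ** 2) / (2.0 * m) + lam * np.sum(theta ** 2)

def sgd_step(theta, x_i, y_i, lam, alpha=0.01):
    """Update theta from a single observation (x_i, y_i)."""
    grad = (x_i @ theta - y_i) * x_i + 2.0 * lam * theta  # loss + penalty
    return theta - alpha * grad

rng = np.random.default_rng(0)
X, theta = rng.normal(size=(50, 4)), np.zeros(4)
y = X @ np.array([1.0, 0.5, -1.0, 2.0])
for i in range(X.shape[0]):               # one pass over the local data
    theta = sgd_step(theta, X[i], y[i], lam=0.01)
print(ridge_cost(theta, X, y, lam=0.01))
```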

9.2.3 Security Model

The security model used for analysis is a semimalicious model, which is a subcategory of secure multiple-entity computation [161]. In the semimalicious model, all the entities are preregistered with the central hub. In particular, the central hub is a trusted learning entity, while some of the other entities are treated as potentially malicious identities compromised by an adversarial group. A compromised, malicious learning entity may behave as follows:

  1. Refuse to participate in the learning scheme.
  2. Deliberately substitute or falsify its local learning result.
  3. In the worst case, viciously abort the learning scheme prematurely.

From the preceding discussion, there are two major security and privacy objectives in designing such a learning scheme: (i) during the learning process, only the final training result is revealed, with protection; local data information must remain private under all circumstances; and (ii) during the learning process, a compromised, malicious learning entity must be detectable by the other normal entities while the learning scheme is running. In the next section, we illustrate how the secure learning scheme performs the distributed data learning tasks while identifying malicious entities and guaranteeing that the security and privacy of data information are not viciously exposed to others.

9.3 Secure Data Learning Scheme

In this section, we present the secure data learning scheme. We first introduce the proposed data learning scheme, with a focus on the algorithm design. We then introduce the associated security scheme with respect to data privacy and identity protection.

9.3.1 Data Learning Scheme

A learning entity $P_i$ fits its hypothesis function $h_{\theta_i}$ when a data set $\mathcal{D}_i$ arrives during a fixed time interval $T$. For this learning entity $P_i$, the cost function of its hypothesis function $h_{\theta_i}$ is denoted $J_i(\theta_i)$. In light of the principles of descent algorithms, during the $t$-th iteration of this learning study, we formulate this as a minimization problem for a learning entity $P_i$, with the objective of minimizing the cost function

(9.8) $\min_{\theta_i} \; J_i(\theta_i) = \frac{1}{2m_i} \sum_{j=1}^{m_i} \big(h_{\theta_i}(x^{(j)}) - y^{(j)}\big)^2 + \lambda \lVert \theta_i \rVert_2^2$

In the beginning, each learning entity trains only its own private local data to fit its own training model. The local training model is relatively suitable for the local learning entity; however, it would most likely be a poor fit as a training model for the overall data set. Hence, the objective becomes to solve Eq. (9.8) and to find the optimal value of the hypothesis weight vector $\theta_i$ from the local data $X_i$ and $y_i$.

In order to find the optimal value of the hypothesis weight vector $\theta_i$ from $X_i$ and $y_i$, we need to find the solution to Eq. (9.8) by calculating the partial derivatives of $J_i(\theta_i)$. Note that some of the learning models discussed before can be applied to the hypothesis function to solve certain regression problems. As a general process, during the $t$-th data learning iteration, the learning model of entity $P_i$ updates its hypothesis function $h_{\theta_i}$ based on the empirical risk minimization discussed before. Inspired by the stochastic gradient descent algorithm, a detailed learning algorithm to obtain the hypothesis weight vector $\theta_i^{(t)}$ for an entity $P_i$ in the $t$-th iteration is presented in Algorithm 9.1. For simplicity, we use $\theta_i^{(t)}$ as the learning weight vector for a local training data set $\mathcal{D}_i$, and write $J_i^{(t)}$ and $h_i^{(t)}$ for the cost and hypothesis function of $P_i$ during the $t$-th iteration, respectively.

Algorithm 9.1 Local data learning for an entity $P_i$ during the $t$-th iteration (stochastic gradient descent with an $\ell_2$ ridge penalty).

In Algorithm 9.1, $\theta_i^\ast$ represents the current optimal weight before the $\ell_2$ ridge penalty term is applied. For different supervised learning models, there are different hypothesis functions $h_\theta$. For example, if a polynomial regression problem is encountered, the gradient of the cost function is $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)}$, where $(i, j)$ indexes each observation and feature of the data set. If a classification problem is encountered, then the gradient of a 0–1 logistic regression cost function takes the same form, with the sigmoid hypothesis $h_\theta(x) = 1 / (1 + e^{-\theta^{T} x})$. Looking into the hypothesis weight vector $\theta_i^{(t)}$, we can see that the output of Algorithm 9.1 depends on the chosen supervised learning algorithm: it minimizes the cost function and makes the training result better fit the learning model.
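
Since Algorithm 9.1 itself is given as a figure, the following Python sketch reconstructs its gist from the surrounding prose: per-observation stochastic-gradient updates with an $\ell_2$ ridge penalty over the local data set. The shuffling, epoch count, and stopping rule are assumptions not stated in the text.

```python
# Hedged reconstruction of Algorithm 9.1: entity P_i sweeps its local
# data D_i one observation at a time, applying the per-sample gradient
# and the l2 ridge shrinkage. Hyperparameters are illustrative.
import numpy as np

def local_learning_round(theta, X, y, lam=0.1, alpha=0.01, epochs=1, seed=0):
    """Return the local weight vector theta_i^(t) after one learning round."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):       # stochastic order
            grad = (X[i] @ theta - y[i]) * X[i]     # per-sample gradient
            theta = theta - alpha * (grad + 2.0 * lam * theta)
    return theta
```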

9.3.2 The Proposed Security Scheme

9.3.2.1 Privacy Scheme

In this part we present the proposed privacy scheme. As shown in Figure 9.1, the feature normalization step for a large aggregated data set is crucial. When features differ by orders of magnitude, a standard feature normalization process performs feature scaling so that descent algorithms can converge faster. During the aggregation process, our objective is to find an associated privacy scheme that can be incorporated into the feature normalization process.

For a joint data set, the goal of $z$-score normalization is to rescale the features so that they have the properties of a standard normal distribution with $\mu = 0$ and $\sigma = 1$; the standard score of each data entry is calculated as $z = (x - \mu)/\sigma$. By subtracting the mean value of each feature from the data set and scaling the feature values by their respective standard deviations, the values in all data entries are brought into a comparable range centered at zero. In light of this, the exact values of the data content in every observation and feature have been normalized and differ from the original data. By performing $z$-score normalization, we can expedite the process of data learning as the sample size and the number of learning entities increase. After the $z$-score normalization, the local value $X_i$ can likewise be represented as a $z$-score-based matrix.

$z$-score normalization has already obscured the original data content from malicious users. However, in order to fully achieve the goal of preserving data privacy, it is also necessary to prevent exposure of the real size of the data set. A noise $\epsilon$ between 0 and 1 is therefore added during normalization in each learning entity $P_i$ to increase security robustness against statistical analysis. When the features are normalized, it is important to protect the related values used for the normalization computation; otherwise, after obtaining the parameters of the model, malicious attackers would still be able to recover the raw values, perform predictions, and compromise the integrity of the data. A well-set noise serves as a good disguise against such passive attacks. A sketch of the combined step is shown below.
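
The following is a hedged sketch of the combined normalization-and-noise step: each feature is $z$-score normalized while a noise $\epsilon \in (0, 1)$ is folded into the sample size to mask the true $m$. The exact point at which $\epsilon$ enters the statistics is an assumption; Section 9.3.4 describes how $\epsilon$ is derived.

```python
# Hedged sketch, assuming the noise epsilon perturbs the effective
# sample size used in the z-score statistics; this placement is an
# assumption, not the chapter's exact construction.
import numpy as np

def noisy_z_score(X, epsilon):
    m_noisy = X.shape[0] + epsilon           # masked sample size, m + eps
    mu = X.sum(axis=0) / m_noisy             # perturbed feature means
    sigma = np.sqrt(((X - mu) ** 2).sum(axis=0) / m_noisy)  # perturbed stds
    return (X - mu) / sigma

X = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))
Z = noisy_z_score(X, epsilon=0.37)           # epsilon in (0, 1)
print(Z.mean(axis=0), Z.std(axis=0))         # approximately 0 and 1
```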

9.3.2.2 Identity Protection

If a learning entity $P_i$ is compromised by a malicious party, the trusted central hub should be able to detect the abnormality immediately. Once compromised, a $P_i$ normally will not reveal its malicious identity to the central hub; instead, it keeps its malicious identity anonymous and deliberately falsifies or substitutes the local learning result it submits to the central hub, with the purpose of jeopardizing the learning scheme and the overall learning result. Therefore, our objective is to determine whether the identity of a $P_i$ is legitimate, without requesting any identity information from a malicious $P_i$ itself, and to report a malicious $P_i$ to the system administrator.

Specifically, a fast security solution inspired by the zero-knowledge proof [162] is proposed. A $P_i$ first computes the 512-bit hash value $H(\theta_i^{(t)})$ of its local learning result during the $t$-th iteration. $P_i$ then combines it with a time-stamp $TS_1$ as a joint message $M_1$, encrypts the ticket using its own private key $K_i^{-}$, and sends the encrypted message $E_{K_i^{-}}(M_1)$ to the central hub. Once the central hub receives the encrypted ticket, it sends the 512-bit hash value $H(\theta^{(t)})$ of the current overall learning value back to $P_i$. After receiving it, $P_i$ concatenates $H(\theta_i^{(t)})$, $H(\theta^{(t)})$, and an updated time-stamp $TS_2$ into a joint message $M_2$, encrypts it with its private key $K_i^{-}$, and sends $E_{K_i^{-}}(M_2)$ to the central hub again. Since all the entities are preregistered, the central hub then looks up $P_i$'s public key $K_i^{+}$ in a repository of public keys for all entities located at the trusted side. Once $K_i^{+}$ is obtained, the central hub can decrypt the encrypted joint message to perform the identity check and determine whether $P_i$ is truly legitimate. If $P_i$ is legitimate, the central hub grants access so that $P_i$ submits its local learning result to the central hub for the $t$-th iteration.
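
The exchange can be sketched as follows; `sign()` is a placeholder standing in for the chapter's encryption with the entity's private key (a real deployment would use RSA or ECDSA signatures verified with the registered public key), and the message layout is an assumption.

```python
# Hedged sketch of the Figure 9.2 exchange; sign() is a keyed-hash
# placeholder for real private-key operations, illustrative only.
import hashlib
import time

def h512(data: bytes) -> str:
    """512-bit hash value, as used for both local and overall results."""
    return hashlib.sha512(data).hexdigest()

def sign(message: str, private_key: str) -> str:
    # Placeholder "signature": keyed hash of the message.
    return h512((private_key + message).encode())

entity_key = "P_i-private-key"            # assumed per-entity secret

# Step 1: P_i hashes its local result theta_i^(t), adds time-stamp TS1.
m1 = h512(b"theta_i_t") + "|" + str(time.time())
ticket1 = (m1, sign(m1, entity_key))      # sent to the central hub

# Step 2: the hub replies with the hash of the overall learning value.
hub_hash = h512(b"theta_t_overall")

# Step 3: P_i concatenates both hashes with a fresh time-stamp TS2.
m2 = m1 + "|" + hub_hash + "|" + str(time.time())
ticket2 = (m2, sign(m2, entity_key))      # sent to the central hub

# Step 4: the hub verifies the ticket; with the placeholder sign() this
# is recomputation, whereas real verification would use P_i's public key.
assert ticket2[1] == sign(ticket2[0], entity_key)
print("identity check passed")
```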

Diagrammatic illustration of a proposed security scheme based on zero-knowledge proof.

Figure 9.2 Proposed security scheme based on zero‐knowledge proof.

The proposed security scheme based on zero-knowledge proof is illustrated in Figure 9.2. Note that, to address security concerns, the scheme should be executed at the beginning of each data learning iteration, so that any abnormal operation occurring in an iteration can be detected. A nice property of this zero-knowledge-proof-based scheme is that an entity can easily be verified to possess certain information based solely on its basic public information; an entity's challenge can be proven without revealing the secret information itself. A security analysis of this associated security scheme is presented in Section 9.3.4 to demonstrate its robustness.

9.3.3 Analysis of the Learning Process

In the proposed algorithm, we repeatedly run through the data set, and each time we encounter a learning example, we update the parameters according to the gradient of the error term with respect to that single learning example only. The algorithm can therefore start the learning process immediately and keep track of it with each observation. As discussed in Section 9.3.1, this stochastic-gradient-descent-inspired approach reaches the optimal weights much faster and is a reasonably good choice when facing a large data set.

We would also expect that, as the number of iterations increases, the minimized value of the cost function converges. For a convex cost function $J$, this also means that the hypothesis function $h_{\theta^{(t)}}$ approaches the optimal hypothesis $h_{\theta^\ast}$, and $J(\theta^{(t)})$ approaches $J(\theta^\ast)$ within a very small tolerance.

9.3.4 Analysis of the Security

In order to fully achieve the goal of preserving data privacy, the protection of the noise $\epsilon$ is crucial, in terms of both its generation and its computation during the $z$-score normalization. Here we consider $\epsilon$ to be generated from an irreversible hash function, whose output is converted into a decimal value; the noise $\epsilon$ is then fixed into the range of 0 and 1 by scaling that decimal value.

During the verification process, a reverse computation can be adopted to check the hashed form of the weights $\theta$ against the decimal hash value. Since the hash function is computationally irreversible, the range of the noise $\epsilon$ is guaranteed in this case. When it comes to anonymizing the sample size of a data set, $\epsilon$ plays an important role in providing anonymity for the local value. Even if an adversary obtains the dimension information of a possible local value, the attack would still be in vain, since $\epsilon$ is not computationally accessible to the adversary. Additionally, $\epsilon$ lying in the range of 0 and 1 retains the good computational efficiency inherited from the $z$-score normalization.
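
A minimal sketch of this noise generation is given below, assuming a per-entity secret as the hash input (the chapter does not specify the input) and a simple scaling of the digest into $[0, 1)$.

```python
# Hedged sketch of noise generation per Section 9.3.4; the hash input
# and the scaling into [0, 1) are assumptions.
import hashlib

def make_noise(secret: bytes) -> float:
    digest = hashlib.sha512(secret).digest()       # irreversible hash
    d = int.from_bytes(digest, "big")              # decimal value of the hash
    return d / 2 ** (8 * len(digest))              # scale into [0, 1)

eps = make_noise(b"entity-1-secret")
assert 0.0 <= eps < 1.0
print(eps)
```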

9.4 Smart Metering Data Set Analysis—A Case Study

In this section, we describe a case study on using the proposed secure data learning scheme to solve the regression problem of smart metering data sets from the UMass Trace Repository [163]. We first introduce the networking model of a smart grid AMI, then perform regression work on the data sets.

9.4.1 Smart Grid AMI and Metering Data Set

As one of the most critical infrastructures, the smart grid has adopted advanced and powerful ICT to improve many aspects of the power grid [7, 164–171, 147]. In a smart grid advanced metering infrastructure (AMI), as shown in Figure 9.3, a data aggregation point (DAP) aggregates massive amounts of smart metering data from distributed smart meters at each household, preprocesses the aggregated data, and forwards the preprocessed data to the operation center via the smart grid's wide-area networks (WAN) through a local master gateway of the designated AMI concentrator. This high volume of collected data, which contains summaries of power usage and personal behavior patterns, is analyzed by a big data platform and a metering data management system (MDMS) for accurate real-time control and optimized resource allocation.

Diagrammatic illustration of the ICT architecture of advanced metering infrastructure (AMI) in the smart grid.

Figure 9.3 The ICT architecture of AMI in the smart grid.

Table 9.1 A sample set of smart metering data.

Name      | Meter # | Current | Previous | ⋯ | Total (kWh)
Suite 101 | 45600   | 321.2   | 485.2    | ⋯ | 258.7
Suite 102 | 45601   | 320.8   | 483.9    | ⋯ | 341.0
Suite 103 | 45602   | 322.7   | 486.1    | ⋯ | 493.2
⋮         | ⋮       | ⋮       | ⋮        | ⋱ | ⋮
Suite 899 | 47212   | 320.3   | 482.9    | ⋯ | 234.6

An example smart metering data set, shown in Table 9.1, is based on frequent metering reports from smart meters. Normally, in order to make efficient energy-buying decisions based on usage patterns, to perform power-theft detection, and to correct its service performance, a utility company maintains a corresponding matrix $Y$ with the same $m$ observations and one feature, for example, power usage quality. As stated in the system model, a DAP $P_i$ trains its hypothesis function $h_{\theta_i}$ when a data set $\mathcal{D}_i$ arrives during a fixed time interval $T$. During this interval, each smart meter reports its metering data to its own DAP. Since each DAP governs a different number of smart meters, we define the DAP $P_i$ as governing $s_i$ smart meters. Therefore, the total data matrix that a DAP $P_i$ receives during a time interval $T$ has $m_i$ observations and $n$ features. Considering that all the metering data is reported within a fixed short time interval, the sample size is likely to be relatively small. As discussed in Section 9.2.2.2, this may lead to an overfitting problem for learning models, especially when empirical risk minimization is adopted.

In a DAP $P_i$, the cost function can be written as $J_i(\theta_i)$. In this case study, during the $t$-th iteration we formulate the task as a minimization problem for a DAP learning entity $P_i$, with the objective function

(9.9) $\min_{\theta} \; \frac{1}{2m} \sum_{j=1}^{m} \big(h_\theta(x^{(j)}) - y^{(j)}\big)^2 + \lambda \lVert \theta \rVert_2^2$

where $m$ represents the total number of observations of the joint data set among the $N$ DAP learning entities during time interval $T$, and $x^{(j)}$ represents the $j$-th observation in the data set. In this study, the concentrator is surrounded by $N$ DAPs forming a mesh-based graph model [166], meeting the requirement of robust and reliable neighborhood-area communications [167].

9.4.2 Regression Study

A regression study is conducted based on the data sets from [163]. In this scenario, 30 data learning entities are involved in training their own local data, and a central-hub-based model is established to perform the proposed scheme. In the data set, metering information is reported every five minutes throughout a day. Several features, such as temperature, humidity, wind degree and direction, etc., are contained in the environmental data set, forming the matrix $X$, while the power usage is recorded as $y$ in watts. Six features, including "insideTemp," "outsideTemp," "outsideHumidity," "windSpeed," "windDirectionDegrees," and "windGust," are selected as the learning features, and a ridge-based regression task is conducted among all 30 local entities.

Without loss of generality, we show the hypothesis weights of learning entities 1, 5, and 6 in Figures 9.4, 9.5, and 9.6, respectively, with respect to the ridge coefficient $\lambda$. In each figure, six lines represent the coefficient paths of the six chosen features. In the beginning, each weight vector $\theta_i$ of the different learning entities is randomly sampled from a Gaussian distribution, as seen in each figure's different starting points. As $\lambda$ increases toward a certain threshold, the weights found by the regression models of the different learning parties begin to stabilize. Meanwhile, the regularized weights become smaller and eventually converge to zero.
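
The regularization-path behavior in Figures 9.4–9.6 can be reproduced in spirit with closed-form ridge solutions on synthetic data; the data below merely stands in for the UMass metering features, and the $\lambda$ grid is an arbitrary choice.

```python
# Hedged sketch of a ridge regularization path on synthetic data with
# six features; it illustrates weights shrinking as lambda grows, not
# the chapter's actual experiment.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))              # six environmental features
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=200)

for lam in (1e-3, 1e-1, 1e1, 1e3):
    # Closed-form ridge estimate: (X^T X + lam I)^{-1} X^T y
    theta = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ y)
    print(f"lambda={lam:g}  ||theta||={np.linalg.norm(theta):.4f}")
```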

Graphical illustration of the regularization results using learning entity 1.

Figure 9.4 Regularization results using learning entity 1.

Graphical illustration of the regularization results using learning entity 5.

Figure 9.5 Regularization results using learning entity 5.

Graphical illustration of the regularization results using learning entity 6.

Figure 9.6 Regularization results using learning entity 6.

Once the initial hypothesis weights from the ridge-based regression are obtained locally, the iteration process starts as well. By adopting the modified gradient descent algorithm, the values of the cost functions $J_i$ decrease as the iterations proceed, as shown in Figure 9.7. After approximately 50 iterations, convergence is achieved among all the learning entities. As a comparison, the overall data set, aggregated through the aggregation protocol and processed through the centralized data learning scheme, retains a larger cost, because the sample size of the overall data set is larger than that of any local data set.

Graphical illustration of convergence of the values of the cost functions $J_i$.

Figure 9.7 Convergence of the values of the cost functions $J_i$.

Moreover, we can see that the cost value of each function $J_i$ remains quite large. The results could be improved if a nonlinear regression model were carefully chosen. On the local side, each entity performs with almost the same cost, which reflects that the fixed-frequency power consumption reports produce local data sets of regular, similar size. In summary, the behavior follows the nature of descent algorithms: when the value of the cost function $J$ converges, the hypothesis weights $\theta$ converge.

9.5 Conclusion and Future Work

In this chapter, we presented a secure data learning scheme for multiple entities in an environment of big data applications. We considered the scenario where multiple entities, acting as data holders, intend to find predictive models from their overall data. We proposed a scheme in which the scalability of data learning is used as an alternative computation technique to provide security to multiple entities. The proposed scheme was inspired by descent algorithms and considered ways to avoid the overfitting problem. Additionally, an associated secure scheme was proposed to protect the privacy of local information against content leakage and statistical attacks, as well as to protect identities. Both a theoretical learning analysis and a security analysis were provided for this scheme. A case study was conducted by adopting the proposed scheme to solve a regression problem and investigate smart metering data in the smart grid. The proposed scheme targets the computation and security drawbacks of centralized training tasks in big data applications. Future research in this area may consider massive data sets and other supervised learning models to provide more accurate predictions.
