In this section, we will describe some good machine learning practices to follow before developing a machine learning application of particular interest, as described in Figure 7:
A scalable and accurate ML application demands a systematic approach to its development, from problem definition to presenting the results. The process can be summarized in four steps: problem definition and formulation, data preparation, finding suitable machine learning algorithms, and finally, presenting the results after the machine learning model is deployed. These steps are depicted in Figure 6.
The learning of a machine learning system can be formulated as the sum of representation, evaluation, and optimization. In other words, according to Pedro Domingos (A Few Useful Things to Know about Machine Learning, https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf):
Learning = Representation + Evaluation + Optimization
Taking this formulation into consideration, we will provide some recommendations for practitioners before getting into ML application development.
So what do we need for effective machine learning application development? We actually need four things in our arsenal before we start developing an ML application, including:
That means that, before you start your machine learning voyage, if you can identify whether your problem really is a machine learning problem, you will be able to find suitable algorithms for developing your ML application. Of course, in practice, most machine learning applications cannot be reduced to simple optimization problems. Therefore, it is the duty of a data scientist like you to manage and maintain complex datasets. After that, you will have to handle further issues, such as the analytical problems that arise when engineering the machine learning pipeline to tackle the issues we mentioned earlier.
Therefore, the best practice is to use the Spark MLlib, Spark ML, GraphX, and Spark Core APIs together with sound data science heuristics when developing your machine learning applications. The benefits of doing so are obvious, and they are as follows:
In best practice, feature engineering should be considered one of the most important parts of machine learning. The goal is to find a better representation of the features in the experimental dataset. Equally important is the choice of learning algorithms and techniques, and, of course, parameter tuning; however, the final choice is mostly a matter of experimentation with the ML model you will be developing.
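To make this concrete, here is a minimal pure-Python sketch of feature engineering: deriving better representations (a standardized value and a ratio) from raw records. The records and feature names (height_z, bmi) are hypothetical, chosen only for illustration:

```python
import statistics

def engineer_features(rows):
    """Derive new features from raw (height_cm, weight_kg) records."""
    heights = [h for h, _ in rows]
    mean_h = statistics.mean(heights)
    stdev_h = statistics.stdev(heights)
    features = []
    for h, w in rows:
        features.append({
            "height_z": (h - mean_h) / stdev_h,  # standardized height
            "bmi": w / (h / 100) ** 2,           # derived ratio feature
        })
    return features

rows = [(160, 55), (175, 80), (190, 95)]
feats = engineer_features(rows)
```

The point is not the specific features but the habit: transform raw values into representations the learning algorithm can exploit, then experiment.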
In practice, it is straightforward to establish a naive performance baseline by means of an out-of-the-box method (also referred to as OOTB functionality: a feature of a product that works straight away after installation or configuration) and good data pre-processing. It is worth doing this regularly, so that you know where the baseline is and whether its performance is already satisfactory for your requirements.
Once you have trained all of your out-of-the-box methods, it is always a good idea to try bagging them together. Moreover, to solve ML problems, you often have to face the reality that computationally hard problems (shown in section 2, for example) need domain-specific knowledge, a lot of digging into the data, or both. Consequently, combining widely accepted feature engineering techniques with domain-specific knowledge will help your ML algorithm, application, or system solve prediction-related problems.
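As a simple illustration of bagging, the following pure-Python sketch (not Spark code) trains deliberately naive threshold models on bootstrap resamples and combines them by majority vote; the dataset and the stump model are made up purely for illustration:

```python
import random
import statistics

def train_stump(sample):
    """A deliberately naive 'out-of-the-box' model: predict 1 when the
    input exceeds the mean of the sample's feature values."""
    threshold = statistics.mean(x for x, _ in sample)
    return lambda x: 1 if x > threshold else 0

def bag(data, n_models=25, seed=42):
    """Train each model on a bootstrap resample; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # bootstrap resample
        models.append(train_stump(sample))
    def predict(x):
        votes = sum(m(x) for m in models)
        return 1 if votes > n_models / 2 else 0
    return predict

data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
predict = bag(data)
```

Averaging many weak, high-variance models in this way typically yields a more stable predictor than any single one of them.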
In a nutshell, if you have the required dataset and a robust algorithm that can take advantage of it by learning complex features, success is almost guaranteed. Furthermore, domain experts can sometimes be wrong in selecting good features; therefore, incorporating multiple domain experts (problem-domain experts), better-structured data, and ML expertise is always helpful.
Last but not least, we recommend considering the error rate, not only the accuracy. For example, on an imbalanced dataset, an ML system with 99% overall accuracy that misses 50% of the positive cases is worse than one with 90% accuracy that misses only 25% of them.
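The following sketch, using made-up imbalanced labels, illustrates why overall accuracy can hide a high per-class error rate:

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def false_negative_rate(y_true, y_pred):
    """Fraction of true positives that the model misses."""
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 0 for _, p in positives) / len(positives)

# Imbalanced data: 98 negatives, 2 positives.
y_true = [0] * 98 + [1] * 2
always_negative = [0] * 100          # 98% accurate, misses every positive
catches_one = [0] * 98 + [1, 0]      # 99% accurate, misses half of them
```

Here, the 98%-accurate model is useless for finding positives (100% false negatives), so accuracy alone is a misleading yardstick.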
A common mistake made by novice data scientists is falling victim to overfitting while building an ML model: memorizing the training data without generalizing from it. More technically, if you evaluate your model on the training data instead of the test or validation data, you probably won't be able to tell whether your model is overfitting. The common symptoms are:
Sometimes the ML model itself becomes underfit for a particular tuning or set of data points, which means the model has become too simplistic. Our recommendation (which we believe others share) is as follows:
Hastie et al. (Hastie Trevor, Tibshirani Robert, Friedman Jerome, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, 2009), on the other hand, recommend splitting a large-scale dataset into three sets: a training set (50%), a validation set (25%), and a test set (25%), roughly. They also suggest building the model using the training set and calculating the prediction errors using the validation set. The test set should be used only to assess the generalization error of the final model.
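Their rough 50/25/25 split can be sketched in a few lines of plain Python (the shuffling seed here is arbitrary):

```python
import random

def train_val_test_split(data, seed=0):
    """Shuffle, then split roughly 50% / 25% / 25%, per Hastie et al."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n // 2
    n_val = n // 4
    return (shuffled[:n_train],                    # training set
            shuffled[n_train:n_train + n_val],     # validation set
            shuffled[n_train + n_val:])            # test set

train, val, test = train_val_test_split(list(range(100)))
```

Shuffling before splitting matters: if the data is ordered (by time, class, or source), a naive contiguous split gives unrepresentative subsets.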
If the amount of labeled data available for supervised learning is small, it is not recommended to split the dataset this way. In that case, use cross-validation or train/test split techniques (these will be discussed in Chapter 7, Tuning Machine Learning Models, with several examples). More specifically, divide the dataset into 10 parts of (roughly) equal size; then train the classifier iteratively, each time holding out one of the ten parts to test the model.
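The hold-one-part-out scheme above can be sketched in plain Python as a generator of index sets for 10-fold cross-validation:

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs: each fold is held out once
    for testing while the remaining k-1 folds train the model."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(25, k=10))
```

Every example is used for testing exactly once and for training k-1 times, which makes the most of a small labeled dataset.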
The first step of pipeline design is to create the building blocks (as a directed or undirected graph consisting of nodes and edges) and to link those blocks together. Nevertheless, as a data scientist, you should also focus on scaling and optimizing the nodes (primitives), so that your application can scale up to handle large-scale datasets at a later stage and your ML pipeline performs consistently. The pipeline process will also help you make your model adaptable to new datasets. However, some of these primitives might be explicitly tied to particular domains and data types (for example, text, images, video, audio, and spatiotemporal data).
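The building-block idea can be sketched in plain Python (not Spark), with each primitive as a dataset-to-dataset function and the pipeline as their composition; the clean and normalize primitives here are hypothetical examples:

```python
from functools import reduce

def clean(rows):
    """Primitive node: drop missing values."""
    return [r for r in rows if r is not None]

def normalize(rows):
    """Primitive node: rescale values into [0, 1]."""
    lo, hi = min(rows), max(rows)
    return [(r - lo) / (hi - lo) for r in rows]

def make_pipeline(*nodes):
    """Link primitive nodes into a single callable pipeline."""
    return lambda data: reduce(lambda acc, node: node(acc), nodes, data)

pipeline = make_pipeline(clean, normalize)
result = pipeline([None, 2.0, 4.0, 10.0])
```

Because each node has the same interface, nodes can be optimized, swapped, or reordered independently, which is exactly what makes the pipeline transparent and reusable.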
Beyond these data types, the primitives should also work for general-purpose domain statistics or mathematics. Casting your ML model in terms of these primitives will make your workflow more transparent, interpretable, accessible, and explainable. A recent example is ML-Matrix, a distributed matrix library that can be used on top of Spark:
As we stated in the previous section, as a developer you can seamlessly combine the implementation techniques in Spark MLlib with the algorithms developed in Spark ML, Spark SQL, GraphX, and Spark Streaming to build hybrid or interoperable ML applications on top of RDDs, DataFrames, and Datasets, as shown in Figure 8. For example, an IoT-based real-time application could be developed using such a hybrid model. Therefore, the recommendation here is to stay synchronized with the latest technologies for the betterment of your ML application.
Another good and frequently used practice when building your ML pipeline is to make the ML system modular. Some supervised learning problems can be solved using very simple models, commonly referred to as generalized linear models; whether this works, however, depends on the data you are using, and some problems simply cannot be solved this way.
Therefore, to combine a series of simple linear binary classifiers, try to employ a lightweight modular architecture, either at the workflow level or at the algorithm level. The advantages are obvious, since a modular architecture lets your application handle massive data flows in a parallel and distributed way. Consequently, we suggest adopting the three key mechanisms of weighted threshold sampling, logistic calibration, and intelligent data partitioning described in the literature (for example, Yu Jin, Nick Duffield, Jeffrey Erman, Patrick Haffner, Subhabrata Sen, Zhi-Li Zhang, A Modular Machine Learning System for Flow-Level Traffic Classification in Large Networks, ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 1, March 2012). The target is to achieve scalability and high throughput while attaining high accuracy in the predictions from your ML application or system. While primitives can serve as building blocks, you still need other tools that enable users to build ML pipelines.
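A minimal sketch of such a modular design in plain Python: each simple linear binary classifier is an independent module scoring one class, and a thin combiner turns their scores into a final decision. The labels and weights below are invented purely for illustration:

```python
def linear_score(weights, bias):
    """One module: a linear binary classifier returning a raw score."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + bias

# Each class gets its own independent module (one-vs-rest style);
# modules can be retrained or replaced without touching the others.
modules = {
    "spam": linear_score([2.0, -1.0], -0.5),
    "ham":  linear_score([-2.0, 1.0], 0.5),
}

def predict(x):
    """The combiner: pick the class whose module scores highest."""
    scores = {label: score(x) for label, score in modules.items()}
    return max(scores, key=scores.get)
```

For example, `predict([1.0, 0.0])` returns "spam" because the spam module scores 1.5 against the ham module's -1.5. Since every module is self-contained, the same structure parallelizes naturally across a cluster.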
Consequently, workflow tools have become more common these days; such tools exist for data engineers, data scientists, and even business analysts, for example Alteryx, RapidMiner, Alpine Data, and Dataiku. We stress the business analysts here because, in the very last phase, your target customer will be a business that values your ML model, right? The latest release of Spark comes with the Spark ML API for building machine learning pipelines and provides a domain-specific language (see https://en.wikipedia.org/wiki/Domain-specific_language) for pipelines.
The viewpoint behind machine learning is to automate the creation of analytical models by developing algorithms that learn the models continuously from the available data. Continuously evolving models produce increasingly good results and reduce the need for human interaction. This enables ML models to produce reliable and repeatable predictions automatically.
More technically, suppose you are planning to develop a recommender system using ML algorithms. What is the target of developing that recommender system, and what are some innovative ideas for product development in machine learning? These are two typical questions to consider before you start developing your ML application or system. Consistent innovation can be challenging, especially when pushing forward with new ideas, and it can also be tough to comprehend where the greatest benefit lies. Machine learning can support innovation along a variety of paths, such as uncovering weaknesses in current products, predictive analysis, or identifying previously concealed patterns.
As a result, you will have to think of large-scale computing to train your ML model offline, and later your recommender system has to be able to serve online recommendations much like a conventional search engine. Thus, your ML application will be valued by a business if your system:
As shown in Figure 9, new business models are the unavoidable extension of available data utilization, so considering big data and its business value can make a business analyst's job, life, and thinking smarter, which results in your targeted company delivering more value to its customers. In addition, you will also have to investigate (analyze, to be more exact) rival or better-positioned companies.
Now the question is, how do you collect and use enterprise data? Big data is not only about size (volume); it also involves velocity, veracity, variety, and value. Of these complexities, velocity, for example, can be addressed using Spark Streaming, since streaming data is also big data that needs a real-time analytical approach. Other dimensions, such as volume and variety, can be handled with Spark Core and Spark MLlib/ML for big data processing.
You will have to manage the data by hook or by crook. If you can manage the data, the insights it yields can really shake up the way businesses operate, using the following features of big data:
At this point, data alone is not enough (see Pedro Domingos, A Few Useful Things to Know about Machine Learning, https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf); extracting meaningful features from the data and putting the semantics of the data into the model is more important. This is what tech giants such as LinkedIn are doing with large-scale machine learning frameworks, for example for feature targeting within their community, which is more or less a supervised learning technique. The workflow is as follows:
So what's next? Your model should also be adaptable to large-scale dynamic data, such as real-time streaming IoT data. Real-time feedback is important too, so that your ML system can learn from its mistakes. The next subsection discusses this.
The reasons are obvious, since machine learning brings concrete and dynamic aspects to IoT projects. Recently, machine learning has experienced a surge in popularity among industrial companies, and they profit from it out of the box. As a result, almost every IT vendor is hastily announcing IoT platforms and consulting services. But achieving financial benefits through IoT data is not an easy job, and many businesses have failed to clearly determine which areas will change with the implementation of an IoT strategy.
Considering these positive and negative issues together, your ML model should adapt to large dynamic data, since large-scale data means billions of records, large feature spaces, and low positive rates caused by sparsity. Moreover, the data is dynamic, so the ML models have to be adaptive enough; otherwise, you will have a bad experience or be lost in a black hole.
The typical steps that constitute best practice after an ML model or system has been developed are: visualization for understanding the predicted values, model validation, error and accuracy analysis, model tuning, model adaptation, and scaling up to handle large-scale datasets with ease.
Visualization provides an interactive interface for keeping the ML model itself tuned. Without visualizing the predictive results, it becomes difficult to further improve the performance of an ML application. The best practice could be something like this:
Use visualization tools such as Plot.ly (please refer to https://plot.ly/) and D3.js (please refer to https://d3js.org/).

As algorithms become more prevalent, we need better tools for building complex yet robust and stable machine learning systems. A popular distributed framework such as Apache Spark brings these ideas to extremely large datasets for a wider audience. Therefore, it would be better if we could bound the approximation errors and convergence rates for the layered pipelines.
Assuming we can compute error bars for individual nodes, the next step would be a mechanism for extracting error bars for entire pipelines. In practice, when the ML model is deployed to production, we might need tools to confirm that the pipeline will work, will not malfunction or stop halfway through, and can provide some expected measure of the errors.
Devising one or two algorithms that perform solidly well on a simple problem is a good kick-off. However, sometimes you may be hungry for the best accuracy, even at the cost of valuable time and available computational resources. Tuning is the smarter way forward: it helps you not only to squeeze out extra performance but also to improve the accuracy of the results you were getting from the machine learning algorithms you designed previously. To do that, when you tune the model and the related algorithm, you essentially must have high confidence in the results.
Obviously, those results are only available after you specify the testing and validation procedure. This means you should use techniques that reduce the variance of the performance measure, so that you can assess the algorithms more reliably.
In parallel, like most data practitioners, we suggest you use the cross-validation technique (also often called rotation estimation) with a reasonably high number of folds, that is, K-fold cross-validation, where a single subsample is used as the validation dataset for testing the model and the remaining K-1 subsamples are used to train it. Although the exact number of folds, K, depends on your dataset, 10-fold cross-validation is commonly used, and most often the value of K is left unfixed. We will mention three strategies here that you will need to tune your machine learning model:
As shown in Figure 10, adaptive learning conglomerates the previous generations of rule-based, simple machine learning, and deep learning approaches to machine intelligence, according to Rob Munro (The fourth generation of machine learning: Adaptive learning, http://idibon.com/the-fourth-generation-of-machine-learning-adaptive-learning/#comment-175958).
Research also shows that adaptive learning is 95% accurate in predicting people's intention to purchase a car, for example (please refer to Rob Munro, The fourth generation of machine learning: Adaptive learning, http://idibon.com/the-fourth-generation-of-machine-learning-adaptive-learning/#comment-175958). Moreover, if your ML application adapts to new environments and new data, then, given enough infrastructure, your ML system can be scaled up for increasing data loads.