Glossary

Accuracy (Rate): A commonly used metric for evaluating a classification system across all of the classes it predicts. It denotes the proportion of data points predicted correctly. Good for balanced datasets, but misleading for unbalanced ones.

Anomaly Detection: A data science methodology that focuses on identifying abnormal data points. These belong to a class of interest and are generally significantly fewer than the data points of any other class of the dataset. Anomaly detection is sometimes referred to as novelty detection.

Area Under Curve (AUC) metric: A metric for a binary classifier’s performance based on the ROC curve. It takes into account the confidence of the classifier and is generally considered a robust performance index.

Artificial Creativity: An application of AI where the AI system emulates human creativity in a variety of domains, including painting, poetry, music composition, and even problem-solving.

Artificial Intelligence (AI): A field of computer science dealing with the emulation of human intelligence using computer systems and its applications in a variety of domains. The application of AI in data science has been a noteworthy and important factor in the field since the 2000s.

Artificial Neural Network (ANN): A graph-based artificial intelligence system which implements the universal approximator idea. Although ANNs started as machine learning systems focusing on predictive analytics, they have expanded over the years to cover a large variety of tasks. They are composed of a series of nodes called neurons, which are organized in layers. The first layer corresponds to all the inputs, the final layer to all the outputs, and the intermediary layers to a series of meta-features the ANN creates, each having a corresponding weight. ANNs are stochastic in nature, so every time they are trained over a set of data, the resulting weights are noticeably different.
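A minimal sketch of the forward pass of such a network, assuming a single hidden layer, randomly initialized weights, and a sigmoid transfer function (training, which adjusts the weights, is omitted; the data is made up):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))     # 5 data points with 3 input features (made-up data)
    W1 = rng.normal(size=(3, 4))    # weights: input layer -> hidden layer (4 meta-features)
    W2 = rng.normal(size=(4, 1))    # weights: hidden layer -> output layer
    hidden = sigmoid(X @ W1)        # meta-features created by the hidden layer
    output = sigmoid(hidden @ W2)   # the network's output for each data point
    print(output.ravel())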

Association Rules: Empirical rules derived from a set of data aimed at connecting different entities in that data. Usually the data is unlabeled, and this methodology is part of data exploration.

Autoencoder: An artificial neural network system designed to represent codings in a very efficient manner. Autoencoders are a popular artificial intelligence system used for dimensionality reduction.

Big Data: Datasets that are so large and/or complex that it is virtually impossible to process them with traditional data processing systems. Challenges include querying, analysis, capture, search, sharing, storage, transfer, and visualization. The ability to process big data can lead to decisions that are more confident, more cost-effective, less risky, operationally more efficient, and generally better overall.

Binning: Also known as discretization, binning refers to the transformation of a continuous variable into a discrete one.
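For example, a continuous variable can be binned with NumPy as in the following sketch (the values and bin edges are made up):

    import numpy as np

    ages = np.array([15, 22, 37, 45, 63, 81])
    edges = [18, 40, 65]                 # arbitrary cut-off points
    bins = np.digitize(ages, edges)      # 0: under 18, 1: 18-39, 2: 40-64, 3: 65 and over
    print(bins)                          # [0 1 1 2 2 3]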

Bootstrapping: A resampling method for performing sensitivity analysis, drawing samples (with replacement) from the same sample repeatedly in order to better approximate the population it represents and provide an estimate of the stability of the metric we have based on this sample.
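A minimal sketch of bootstrapping the mean of a small sample with NumPy (the sample values are made up):

    import numpy as np

    rng = np.random.default_rng(42)
    sample = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4])
    boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
                  for _ in range(1000)]
    # The spread of the bootstrap means indicates how stable the original estimate is.
    print(np.mean(boot_means), np.percentile(boot_means, [2.5, 97.5]))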

Bug (in programming): An issue with an algorithm or its implementation. The process of fixing them is called debugging.

Business Intelligence (BI): A sub-field of data analytics focusing on basic data analysis of business-produced data for the purpose of improving the function of a business. BI is not the same as data science, though it does rely mainly on statistics as a framework.

Butterfly Effect: A phenomenon studied in chaos theory where a minute change in the original inputs of a system yields a substantial change in its outputs. Originally the butterfly effect only applied to highly complex systems (e.g. weather forecasts), but it has been observed in other domains, including data science.

Chatbot: An artificial intelligence system that emulates a person on a chat application. A chatbot takes text as its input, processes it in an efficient manner, and yields a reply in text form. A chatbot may also carry out simple tasks based on its inputs, and it can reply with a question in order to clarify the objective involved.

Classification: A very popular data science methodology under the predictive analytics umbrella. Classification aims to solve the problem of assigning a label (class) to a data point based on pre-existing knowledge of categorized data available in the training set.

Cloud (computing): A model that enables easy, on-demand access to a network of shareable computing resources that can be configured and customized to the application at hand. The cloud is a very popular resource in large-scale data analytics and a common resource for data science applications.

Clustering: A data exploration methodology that aims to find groupings in the data, yielding labels based on these groupings. Clustering is very popular when processing unlabeled data, and in some cases the labels it provides are used for classification afterwards.

Computer Vision: An application of artificial intelligence where a computer is able to discern a variety of visual inputs and effectively “see” many different real-world objects in real time. Computer vision is an essential component of all modern robotics systems.

Confidence: A metric that aims to reflect the probability of another metric being correct. Usually it takes values between 0 and 1 (inclusive). Confidence is linked to statistics but it lends itself to heuristics and machine learning systems as well.

Confidentiality: The aspect of information security that has to do with keeping privileged information accessible to only those who should have access to it. Confidentiality is linked to privacy, though it encompasses other things, such as data anonymization and data security.

Confusion Matrix: A k-by-k matrix depicting the hits and misses of a classifier for a problem involving k classes. For a binary problem (involving two classes only), the matrix is comprised of various combinations of hits (trues) and misses (falses) referred to as true positives (cases of value 1 predicted as 1), true negatives (cases of value 0 predicted as 0), false positives (cases of value 0 predicted as 1), and false negatives (cases of value 1 predicted as 0). The confusion matrix is the basis for many evaluation metrics.
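As a small illustration, here is how these counts, and some of the metrics built on them (accuracy, precision, recall, and F1), can be computed for a binary problem; the label vectors are made up:

    import numpy as np

    actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
    predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

    tp = np.sum((actual == 1) & (predicted == 1))   # true positives
    tn = np.sum((actual == 0) & (predicted == 0))   # true negatives
    fp = np.sum((actual == 0) & (predicted == 1))   # false positives
    fn = np.sum((actual == 1) & (predicted == 0))   # false negatives

    accuracy  = (tp + tn) / actual.size
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(np.array([[tn, fp], [fn, tp]]), accuracy, precision, recall, f1)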

Correlation (coefficient): A metric of how closely related two continuous variables are in a linear manner.
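A minimal sketch of its calculation with NumPy (the two variables are made up):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    r = np.corrcoef(x, y)[0, 1]     # Pearson correlation coefficient
    print(r)                        # close to 1: a strong positive linear relationship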

Cost Function: A function for evaluating the total damage that all misclassifications amount to, based on individual costs pre-assigned to the different kinds of errors. A cost function is a popular performance metric for complex classification problems.

Cross-entropy: A metric of how the addition of a variable affects the entropy of another variable.

Dark Data: Unstructured data, or any form of data whose information is practically unusable. Dark data constitutes the majority of available data today.

Data Analyst: Anyone performing basic data analysis, usually using statistical approaches only and without scaling to larger and/or more complex datasets. Data analysts usually rely on a spreadsheet application and/or basic statistics software for their work.

Data Analytics: A general term to describe the field involving data analysis as its main component. Data analytics is more general than data science, although the two terms are often used interchangeably.

Data Anonymization: The process of removing or obscuring personally identifiable information (PII) from the data analyzed, so that the data cannot be used to identify any particular individual.

Data Cleansing: An important part of data preparation, it involves removing corrupt or otherwise problematic data (e.g. unnecessary outliers) to ensure a stronger signal. After data cleansing, data starts to take the form of a dataset.

Data Discovery: The part of the data modeling stage in the data science pipeline that has to do with pinpointing patterns in the data that may lead to building a more relevant and more accurate model in the stages that follow.

Data Engineering: The first stage of the data science pipeline, responsible for cleaning, exploring, and processing the data so that it can become structured and useful in a model developed in the following stage of the pipeline.

Data Exploration: The part of the data engineering stage in the data science pipeline that has to do with getting a better understanding of the data through plots and descriptive statistics, as well as other methods, such as clustering. The visuals produced here are for the benefit of the data scientists involved, and may not be used in the later parts of the pipeline.

Data Frame: A data structure similar to a database table that is capable of containing different types of variables and performing advanced operations on its elements.
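A minimal sketch using the pandas library (one common implementation of this data structure; the column names and values are made up):

    import pandas as pd

    df = pd.DataFrame({
        "age":    [23, 45, 31, 52],
        "income": [38000, 72000, 50000, 91000],
        "group":  ["A", "B", "A", "B"],
    })
    print(df.dtypes)                               # different variable types in one structure
    print(df.groupby("group")["income"].mean())    # an advanced operation on its elements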

Data Governance: Managing data (particularly big data) in an efficient manner so that it is stored, transferred, and processed effectively. This is done with frameworks like Hadoop and Spark.

Data Learning: A crucial step in the data science pipeline, focusing on training and testing a model for providing insights and/or being part of a data product. Data learning is in the data modeling stage of the pipeline.

Data Mining: The process of finding patterns in data, usually in an automated way. Data mining is a data exploration methodology.

Data Modeling: A crucial stage in the data science pipeline, involving the creation of a model through data discovery and data learning.

Data Point: A single row in a dataset, corresponding to a single record of a database.

Data Preparation: A part of the data engineering stage in the data science pipeline focusing on setting up the data for the stages that follow. Data preparation involves data cleansing and normalization, among other things.

Data Representation: A part of the data engineering stage in the data science pipeline, focusing on using the most appropriate data types for the variables involved, as well as the coding of the relevant information in a set of features.

Data Science: The interdisciplinary field undertaking data analytics work on all kinds of data, with a focus on big data, for the purpose of mining insights and/or building data products.

Data Security: An aspect of confidentiality that involves keeping data secure from dangers and external threats (e.g. malware).

Data Structure: A collection of data points in a structured form used in programming as well as various parts of the data science pipeline.

Data Visualization: A part of the data science pipeline focusing on generating visuals (plots) of the data, the model’s performance, and the insights found. The visuals produced here are mainly for the stakeholders of the project.

Database: An organized system for storing and retrieving data using a specialized language. The data can be structured or unstructured, corresponding to SQL and NoSQL databases, respectively. Accessing databases is a key process for acquiring data for a data science project.

Dataset: A structured data collection, usually directly usable in a data science model. Datasets may still benefit considerably from data engineering.

Deep Learning (DL): An artificial intelligence methodology employing large artificial neural networks to tackle highly complex problems. DL systems require a lot of data in order to yield a real advantage in terms of performance.

Dimensionality Reduction: A fairly common method in data analytics aiming to reduce the number of variables in a dataset. This can be accomplished either with meta-features, each one condensing the information of a number of features, or with the elimination of several features of low quality.
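One common technique for building such meta-features is principal component analysis (PCA); a minimal sketch via NumPy's singular value decomposition on made-up data:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 10))       # 100 data points, 10 original features
    Xc = X - X.mean(axis=0)              # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:2].T                    # project onto the top 2 principal components
    print(Z.shape)                       # (100, 2): two meta-features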

Discretization: See binning.

Encryption: The process of turning comprehensible and/or useful data into gibberish using a reversible process (encryption system) and a key. The latter is usually a password, a passphrase, or a whole file. Encryption is a key aspect of data security.

Ensemble: A set of predictive analytics models bundled together in order to improve performance. An ensemble can be comprised of a set of models of the same category, but it can also consist of different model types.

Entropy: A metric of how much disorder exists in a given variable. This is defined for all kinds of variables.
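For a discrete variable this is commonly computed as Shannon entropy, H = -Σ pᵢ log₂(pᵢ), over the relative frequencies of its values; a minimal sketch with made-up values:

    import numpy as np

    def entropy(values):
        _, counts = np.unique(values, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))        # in bits

    print(entropy(["a", "a", "b", "b"]))      # 1.0: maximum disorder for two equally likely values
    print(entropy(["a", "a", "a", "b"]))      # about 0.81: less disorder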

Error Rate: Denotes the proportion of data points predicted incorrectly. Good for balanced datasets.

Ethics: A code of conduct for a professional. In data science, ethics revolves around things like data security, privacy, and proper handling of the insights derived from the data analyzed.

Experiment (data science related): A process involving the application of the scientific method on a data science question or problem.

F1 Metric: A popular performance metric for classification systems defined as the harmonic mean of precision and recall; just like them, it corresponds to a particular class. In cases of unbalanced datasets, it is more meaningful than the accuracy rate. F1 belongs to a family of similar metrics, each one being a function of precision (P) and recall (R) of the form Fβ = (1 + β²)(P · R) / (β² · P + R), where β is a coefficient reflecting how much weight recall is given relative to precision in the particular aggregation metric Fβ.

False Negative: In a binary classification problem, it is a data point of class 1, predicted as class 0. See confusion matrix for more context.

False Positive: In a binary classification problem, it is a data point of class 0, predicted as class 1. See confusion matrix for more context.

Feature: A processed variable capable of being used in a data science model. Features are generally the columns of a dataset.

Fitness Function: An essential part of most artificial intelligence systems, particularly those related to optimization. It depicts how close the system is getting to the desired outcome and helps it adjust its course accordingly.

Functional Programming: A programming paradigm where the programming language is focused on functions rather than objects or processes, thereby eliminating the need for a global variable space. Scripts in functional languages are modular and easy to debug.

Fusion: Usually used in conjunction with feature (e.g. feature fusion), this relates to the merging of a set of features into a single meta-feature that encapsulates all, or at least most, of the information in those features.

Fuzzy Logic: An artificial intelligence methodology that involves a flexible approach to the states a variable takes. For example, instead of having only the states “hot” and “cold” in the variable “temperature,” Fuzzy Logic allows for different degrees of “hotness,” making for a more human kind of reasoning. For more information about Fuzzy Logic, check out MathWorks’ webpage on the topic: http://bit.ly/2sBVQ3M.

Generalization: A key characteristic of a data science model where the system is able to handle data beyond its training set in a reliable way.

Git: A version control system that is popular among developers and data scientists alike. Unlike some other systems, Git is decentralized, making it more robust.

GitHub: A cloud-based hosting service for Git repositories, accessible through a web browser.

Graph Analytics: A data science methodology making use of Graph Theory to tackle problems through the analysis of the relationships among the entities involved.

Hadoop: An established data governance framework for both managing and storing big data on a local computer cluster or a cloud setting.

HDFS: Short for Hadoop Distributed File System, HDFS enables the storage and access of data across several computers for easier processing through a data governance system (not just Hadoop).

Heuristic: An empirical metric or function that aims to provide some useful tool or insight to facilitate a data science method or project.

Hypothesis: An educated guess related to the data at hand about a number of scenarios, such as the relationship between two variables or the difference between two samples. Hypotheses are tested via experiments to determine their validity.

IDE: Short for Integrated Development Environment, an IDE greatly facilitates the creation and running of scripts as well as their debugging.

Index of Discernibility: A family of heuristics created by the author that aim to evaluate features (and in some cases individual data points) for classification problems.

Information Distillation: A stage of the data science pipeline that involves the creation of data products and/or the delivery of insights and visuals based on the analysis conducted in the project.

Insight: A non-obvious and useful piece of information derived from the use of a data science model on some data.

Internet of Things (IoT): A technological framework that enables all kinds of devices (even common appliances) to have Internet connectivity. This greatly increases the amount of data collected and usable in various aspects of everyday life.

Julia: A modern programming language of the functional programming paradigm, combining characteristics of both high-level and low-level languages. Its ease of use, high speed, scalability, and sufficient number of packages make it a robust language well-suited for data science.

Jupyter: A popular browser-based IDE for various data science languages, such as Python and Julia.

Kaggle: A data science competition site focusing on the data modeling part of the pipeline. It also has a community and a job board.

K-fold Cross Validation: A fundamental data science experiment technique for building a model and ensuring that it has a reliable generalization potential.
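A minimal sketch of how the data indices can be split into k folds manually with NumPy (the model training and evaluation calls are left as placeholder comments):

    import numpy as np

    def kfold_indices(n_points, k, seed=0):
        idx = np.random.default_rng(seed).permutation(n_points)
        return np.array_split(idx, k)    # k roughly equal, randomly shuffled folds

    folds = kfold_indices(n_points=20, k=5)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Train the model on train_idx, evaluate it on test_idx, and record the metric.
        print(f"fold {i}: {test_idx.size} test points, {train_idx.size} training points")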

Labels: A set of values corresponding to the points of a dataset, providing information about the dataset’s structure. Labels usually take the form of classes, often linked to classification applications. The variable containing the labels is typically used as the target variable of the dataset.

Layer: A set of neurons in an artificial neural network. Inner layers are usually referred to as hidden layers and consist mainly of meta-features created by the system.

Library: See package.

Machine Learning (ML): A set of algorithms and programs that aim to process data without relying on statistical methods. ML methods are fast, and some of them are significantly more accurate than the corresponding statistical ones while making fewer assumptions about the data. There is a noticeable overlap between ML and artificial intelligence systems designed for data science.

Mean Squared Error (MSE): A popular metric for evaluating the performance of regression systems, computed by taking the difference between each prediction and the corresponding value of the target variable (the error), squaring it, and averaging the results. The model having the smallest such error is usually considered the optimal one.
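A minimal sketch with made-up predictions and targets:

    import numpy as np

    def mse(predictions, targets):
        return np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2)

    print(mse([2.5, 0.0, 2.1], [3.0, -0.5, 2.0]))   # about 0.17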

Mentoring: The process of someone knowledgeable and adept in a field sharing his experience and advice with others newer to the field. Mentoring can be a formal endeavor or something circumstantial, depending on the commitment of the people involved.

Metadata: Data about a piece of data. Examples of metadata are: timestamps, geolocation data, data about the data’s creator, and notes.

Meta-features (super features or synthetic features): High-quality features that encapsulate large amounts of information, which would usually be represented by a series of conventional features. Meta-features are either synthesized in an artificial intelligence system or created through dimensionality reduction.

Monte Carlo Simulation: A simulation technique for estimating probabilities around a phenomenon, without making assumptions about the phenomenon itself. Monte Carlo simulations have a variety of applications, from estimating functions to sensitivity analysis.
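As a minimal sketch, here is a Monte Carlo estimate of the probability that the sum of two fair dice is at least 10 (the true value is 6/36, roughly 0.167):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 100_000
    dice = rng.integers(1, 7, size=(n, 2))      # two dice per simulated trial
    estimate = np.mean(dice.sum(axis=1) >= 10)  # fraction of trials with a sum of 10 or more
    print(estimate)                             # close to 0.167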

Natural Language Processing (NLP): A text analytics methodology focusing on categorizing the various parts of speech for a more in-depth analysis of the text involved.

Neuron: A fundamental component of an artificial neural network, usually representing an input (feature), a meta-feature, or an output. Neurons are organized in layers.

Non-negative Matrix Factorization (NMF or NNMF): An algebraic technique for splitting a matrix containing only positive values and zeros into a pair of matrices that correspond to meaningful data, useful for recommender systems.

Normalization: The process of transforming a variable so that it is of the same range as the other variables in a dataset. This is done through statistical methods primarily and is part of the data engineering stage in the data science pipeline.
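Two common approaches are min-max scaling to the [0, 1] range and z-score standardization; a minimal sketch with a made-up feature:

    import numpy as np

    x = np.array([12.0, 15.0, 9.0, 30.0, 21.0])
    min_max = (x - x.min()) / (x.max() - x.min())   # rescaled to the [0, 1] range
    z_score = (x - x.mean()) / x.std()              # zero mean, unit standard deviation
    print(min_max, z_score)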

NoSQL Database: A database designed for unstructured data. Such a database is able to handle structured data as well, as NoSQL stands for Not Only SQL.

Novelty Detection: See anomaly detection.

Object-Oriented Programming (OOP): A programming paradigm where all structures, be it data or code, are handled as objects. In the case of data, objects can have various fields (referred to as attributes), while when referring to code, objects can have various procedures (referred to as methods).

Optimization: An artificial intelligence process aimed at finding the best value of a function (usually referred to as the fitness function), given a set of restrictions. Optimization is key in all modern data science systems.

Outlier: An abnormal data point, often holding particular significance. Outliers are not always extreme values, as they can exist near the center of the dataset as well.

Over-fitting: Making the model too specialized to a particular dataset. Its main characteristic is great performance for the training set and poor performance for any other dataset.

Package: A set of programs designed for a specific set of related tasks, sharing the same data structures and freely available to the users of a given programming language. Packages may require other packages (called dependencies) in order to function. Once installed, a package can be imported in the programming language and used in scripts.

Paradigm: An established way of doing things, as well as the set of similar methodologies in a particular field. Paradigms change very slowly, but when they do, they are accompanied by a change of mindset and often new scientific theory.

Pipeline: Also known as workflow, it is a conceptual process involving a variety of steps, each one of which can be comprised of several other processes. A pipeline is essential for organizing the tasks needed to perform any complex procedure (often non-linear) and is very applicable in data science (this application is known as the data science pipeline).

Population: The theoretical total of all the data points for a given dataset. As this is usually not accessible, an approximate representation of it is obtained through sampling.

Precision: A performance metric for classification systems focusing on a particular class. It is defined as the ratio of the true positives of that class over the total number of data points predicted as belonging to that class.

Predictive Analytics: A set of methodologies of data science related to the prediction of certain variables. It includes a variety of techniques, such as classification, regression, time-series analysis, and more. Predictive analytics are a key data science methodology.

Privacy: An aspect of confidentiality that involves keeping certain pieces of information private.

Recall: A performance metric for classification systems focusing on a particular class. It is defined as the ratio of the true positives of that class over the total number of data points actually belonging to that class.

Recommender System (RS): Also known as a recommendation engine, a RS is a data science system designed to provide a set of similar entities to the ones described in a given dataset based on the known values of the features of these entities. Each entity is represented as a data point in the RS dataset.

Regression: A very popular data science methodology under the predictive analytics umbrella. Regression aims to solve the problem of predicting the values of a continuous variable corresponding to a set of inputs based on pre-existing knowledge of similar data, available in the training set.

Resampling: The process of sampling repeatedly in order to ensure more stable results in a question or a model. Resampling is a popular methodology for sensitivity analysis.

ROC Curve: A curve representing the trade-off between true positives and false positives for a binary classification problem, useful for evaluating the classifier used. The ROC curve is usually a zig-zag line depicting the true positive rate for each false positive rate value.

Sample: A limited portion of the data available, useful for building a model and (ideally) representative of the population it belongs to.

Sampling: The process of acquiring a sample of a population using a specialized technique. Sampling must be done properly to ensure that the resulting sample is representative of the population studied. Sampling needs to be random and unbiased.

Scala: A functional programming language, closely related to Java (it runs on the Java Virtual Machine), that is used in data science. The big data framework Spark is written in Scala.

Scientific Process: The process of forming a hypothesis, processing the available data, and reaching a conclusion in a rigorous and reproducible manner. The conclusions reached are never 100% certain. Every scientific field, including data science, applies the scientific process.

Sensitivity Analysis: The process of establishing the stability of a result or how prone a model’s performance is to change, if the initial data is different. It involves several methods, such as resampling and “what if” questions.

Sentiment Analysis: A text analytics method that involves inferring the sentiment polarity of a piece of text using its words and some metadata that may be attached to it.

Signal: A piece of valuable information within a collection of data. Insights derived from the analysis of the data tend to reflect the various signals identified in the data.

Spark: A big data framework focusing on managing and processing data through a series of specialized modules. Spark does not handle data storage, only data processing.

SQL: Short for Structured Query Language, SQL is a basic programming language used in databases containing structured data. Although it does not apply to big data, many modern databases use query languages based on SQL.

Statistical Test: A test for establishing relationships between two samples based on statistics concepts. Each statistical test relies on a few underlying assumptions.

Statistics: A sub-field of mathematics that focuses on data analysis using probability theory, a variety of distributions, and tests. Statistics involves a series of assumptions about the data involved. There are two main types of statistics: descriptive and inferential. The former deals with describing the data at hand, while the latter with making predictions using statistical models.

Steganography: The process of hiding a file within another, much larger file (usually a photo, an audio clip, or a video) using specialized software. The process does not change how the carrier file looks or sounds. Steganography is a data security methodology.

Stochastic: Something that is probabilistic in nature (i.e. not deterministic). Stochastic processes are common in most artificial intelligence systems and other advanced machine learning systems.

Structured Data: Data that has a form that enables it to be used in all kinds of data analytics models. Structured data usually takes the form of a dataset.

Target Variable: The variable of a dataset that is the target of a predictive analytics system, such as a classifier or a regressor.

Testing Set: The part of the dataset that is used for testing a predictive analytics model after it has been trained and before it is deployed. The testing set usually corresponds to a small portion of the original dataset.

Text Analytics: The sub-field of data science that focuses on all text-related problems. It includes natural language processing (NLP), among other things.

Time-series Analysis: A data science methodology aiming to tackle dynamic data problems, where the values of a target variable change over time. In time-series analysis, the target variable is also used as an input in the model.

Training Set: The part of the dataset that is used for training a predictive analytics model before it is tested and deployed. The training set usually corresponds to the largest portion of the original dataset.

Transfer Function: The function applied on the output of a neuron in an artificial neural network.

True Negative: In a binary classification problem, it is a data point of class 0, predicted as such. See confusion matrix for more context.

True Positive: In a binary classification problem, it is a data point of class 1, predicted as such. See confusion matrix for more context.

Unstructured Data: Data that lacks any structural frame (e.g. free-form text) or data from various sources. The majority of big data is unstructured data and requires significant processing before it is usable in a model.

Versatilist: A professional who is an expert in one skill, but has a variety of related skills, usually in a tech-related field, allowing him to perform several roles in an organization. Data scientists tend to have a versatilist mentality.

Version Control System (VCS): A programming tool aiming to keep various versions of your documents (usually programming scripts and data files) accessible and easy to maintain, allowing variants of them to co-exist with the original ones. A VCS is great for enabling several people to collaborate on the same files.
