Glossary of Computer and Big Data Terminology

Big data terminology has developed rapidly over the last few years. This glossary alphabetically lists some big data definitions, along with related computer terms that a newcomer to the field will find useful. A basic understanding of computers is needed to make full use of the information in this glossary.

A

Aggregation – the process through which data is searched, gathered and presented.

Algorithm – a mathematical process that can perform a specific analysis or transformation on a piece of data.

Analytics – the discovery and communication of insights derived from data, or the use of software-based algorithms and statistics to derive meaning from data.

Analytics Platform – software and/or hardware that provide the tools and computational power needed to build and perform many different analytical queries.

Anomaly Detection – the systematic search for data items in a dataset that deviate from a projected pattern or expected behavior. Anomalies are often referred to as outliers, exceptions, surprises or contaminants, and they usually provide critical and actionable information.
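
A minimal sketch of the idea in Python, assuming a simple z-score rule: values more than a chosen number of standard deviations from the mean are flagged as anomalies. The sensor readings and the threshold are invented for illustration.

```python
# Minimal sketch: flag values that deviate strongly from the mean (z-score rule).
# The threshold is an illustrative assumption, not a fixed rule.
from statistics import mean, stdev

def detect_anomalies(values, threshold=3.0):
    """Return the items whose z-score exceeds the given threshold."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if sigma > 0 and abs(v - mu) / sigma > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]      # 55.0 is the outlier
print(detect_anomalies(readings, threshold=2.0))          # -> [55.0]
```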

Application (App) – a program designed to perform information processing tasks for a specific purpose or activity.

Artificial Intelligence (A.I.) – the field of computer science related to the development of machines and software that are capable of perceiving their environment and taking appropriate action when required (in real-time), even learning from those actions. Some A.I. algorithms are widely used in data science.

B

Behavioral Analytics – analytics that reveal how, why and what occurs in data related to human behavior, instead of just who and when. Behavioral analytics investigates patterns of human behavior in the data.

Big Data – data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process them within a tolerable elapsed time. Big data sizes are a constantly moving target, ranging from a few dozen terabytes to many petabytes of data in a single data set. Big data is characterized by its 4 Vs: volume, velocity, variety and veracity.

Big Data Scientist – an IT professional who is able to use/develop the essential algorithms to make sense out of big data and communicate the derived information effectively to anyone interested. Also known as a data scientist.

Big Data Startup – a young company that has developed new big data technology.

Business Intelligence – the theories, methodologies and processes to make data, particularly business-related data, understandable and more actionable.

Byte (B) – a sequence of 8 bits that represents a single character; the term is sometimes explained as short for “binary term.”

C

Central Processing Unit (CPU) – the brains of an information processing system; the processing component that controls the interpretation and execution of instructions in a computer.

Classification Analysis – a systematic process for obtaining important and relevant information about data using classification algorithms.

Cloud – a broad term that refers to any Internet-based application or service that is hosted remotely.

Cloud Computing – a computing paradigm in which processing and storage are distributed over a network of remote servers (server farms) rather than handled on local machines (see also, data centers).

Clustering Analysis – the process of identifying objects that are similar to each other and grouping them in order to understand the differences and the similarities within the data. Clustering is a form of unsupervised learning and a fundamental part of data exploration and data discovery.
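
A minimal sketch of clustering in Python using a hand-rolled k-means on one-dimensional data; real projects would typically rely on a dedicated library. The data, the choice of k = 2 and the fixed iteration count are illustrative assumptions.

```python
# Minimal sketch: group numeric values into k clusters by iteratively
# refining cluster centers (a hand-rolled one-dimensional k-means).

def kmeans_1d(values, k=2, iterations=10):
    centers = values[:k]                          # naive initialization: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:                          # assign each value to the nearest center
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]   # recompute the centers
                   for i, c in enumerate(clusters)]
    return clusters

data = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
print(kmeans_1d(data, k=2))   # -> two groups: the values near 1 and the values near 8
```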

Comparative Analysis – a process that follows a step-by-step procedure of comparisons and calculations to detect patterns within very large data sets.

Complex Structured Data – data that is composed of two or more complex, complicated and interrelated parts that cannot be easily interpreted by structured query languages and tools.

Computer Generated Data – data generated by computers, such as log files. Such data constitutes a large part of the big data in the world today.

Concurrency – performing and executing multiple tasks and processes at the same time.

Correlation Analysis – a statistical technique for determining a relationship between variables and whether that relationship is negative or positive. Although it does not imply causation, correlation analysis can yield very useful information about the data and help the data scientist handle it more effectively.
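
A minimal sketch in Python of the Pearson correlation coefficient computed from its textbook formula; the advertising and sales figures are invented for illustration.

```python
# Minimal sketch: Pearson correlation between two variables.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

ad_spend = [10, 20, 30, 40, 50]
sales    = [12, 25, 31, 41, 48]
print(round(pearson(ad_spend, sales), 3))   # close to +1: a strong positive relationship
```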

Customer Relationship Management (CRM) – the practice of managing a company’s interactions with its customers, including the related sales and business processes. Big data is expected to affect CRM strategies.

D

Dashboard – a graphical representation of the analyses performed by algorithms, usually in the form of plots and gauges.

Data – a quantitative or qualitative value. Common types of data include sales figures, marketing research results, readings from monitoring equipment, user actions on a website, market growth projections, demographic information and customer lists.

Data Access – the act or method of viewing or retrieving stored data.

Data Aggregation Tools – tools and methods for transforming scattered data from numerous sources into a new, single source.

Data Analytics – the application of software to derive information or meaning from data. The end result might be a report, an indication of status or an action taken automatically based on the information received.

Data Analyst – someone who analyzes, models, cleanses, and/or processes data. Data analysts usually don’t perform predictive analytics, and when they do, it’s usually through the use of a simple statistical model.

Data Architecture and Design – the way enterprise data is structured. The actual structure or design varies depending on the eventual end result required. Data architecture has three stages or processes: conceptual representation of business entities, the logical representation of the relationships among those entities and the physical construction of the system to support the functionality.

Database – a digital collection of data and the structure in which the data is organized (structured). The data is typically entered into and accessed via a database management system (DBMS).

Database Administrator (DBA) – a person who is responsible for supporting and maintaining the integrity of the structure and content of a database.

Database-as-a-Service (DaaS) – a database hosted in the cloud and sold on a metered basis. Examples include Heroku Postgres and Amazon Relational Database Service.

Database Management System (DBMS) – integrated software for collecting, storing and providing access to data, designed to be practical to use even for non-specialists.

Data Center – a physical location that houses the servers for storing data. Data centers might belong to a single organization or sell their services to many organizations.

Data Cleansing – the process of reviewing and revising data in order to delete duplicates, correct errors and provide consistency.
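
A minimal sketch in Python of typical cleansing steps, assuming a small set of invented customer records: whitespace is trimmed, formatting is made consistent and duplicates are dropped.

```python
# Minimal sketch of data cleansing: normalize formatting and remove duplicates.
raw_customers = [
    {"name": " Alice ", "country": "usa"},
    {"name": "Bob",     "country": "U.S.A."},
    {"name": "alice",   "country": "USA"},      # duplicate of the first record
]

def cleanse(records):
    seen, cleaned = set(), []
    for rec in records:
        name = rec["name"].strip().title()                    # consistent formatting
        country = rec["country"].replace(".", "").upper()     # "U.S.A." -> "USA"
        key = (name, country)
        if key not in seen:                                   # drop duplicates
            seen.add(key)
            cleaned.append({"name": name, "country": country})
    return cleaned

print(cleanse(raw_customers))
# -> [{'name': 'Alice', 'country': 'USA'}, {'name': 'Bob', 'country': 'USA'}]
```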

Data Collection – any process that captures any type of data.

Data Custodian – a person responsible for the database structure and the technical environment including the storage of data.

Data-Directed Decision Making – using data to support making crucial decisions.

Data Exhaust – the data that a person creates as a byproduct of a common activity: for example, a cell call log or Web search history.

Data Governance – a set of processes or rules that ensure the integrity of the data and that data management best practices are met.

Data Integration – the process of combining data from different sources and presenting it in a single view.

Data Integrity – the measure of trust an organization has in the accuracy, completeness, timeliness and validity of the data.

Data Management Association (DAMA) – a non-profit international organization for technical and business professionals “dedicated to advancing the concepts and practices of information and data management.”

Data Management – according to the Data Management Association, data management incorporates the following practices needed to manage the full data lifecycle in an enterprise:

  • data governance
  • data architecture, analysis and design
  • database management
  • data security management
  • data quality management
  • reference and master data management
  • data warehousing and business intelligence management
  • document, record and content management
  • metadata management
  • contact data management

Data Migration – the process of moving data between different storage types or formats or between different computer systems.

Data Mining – the process of finding certain patterns or information from data sets in an automated way. This is one popular way to perform data exploration.

Data Modeling – the development of a graphic representation that defines the structure of data. It is used to communicate the data needed for business processes between functional and technical people, or to communicate among application development team members a plan for how data will be stored and accessed.

Data Science – a recent term that has multiple definitions but is generally accepted as a discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning and database engineering to solve complex problems.

Data Scientist – a practitioner of data science. Also known as big data scientist.

Data Security – the practice of protecting data from destruction or unauthorized access.

Data Set – a collection of data, usually in a structured form. Data sets are represented as data frame objects in R.

Data Structure – a specific way of storing and organizing data.

Data Visualization – a visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.

Data Virtualization – a data integration approach that lets applications retrieve and manipulate data without needing to know technical details such as where the data is physically located or how it is formatted. It typically spans databases, applications, file systems, websites, big data platforms, etc.

Discriminant Analysis – a statistical analysis that takes advantage of known groups or clusters in data to derive the classification rule. It involves cataloguing the data as well as distributing it into groups, classes or categories.

Distributed File System – a system that offers simplified, highly available access to storing, analyzing and processing data.

Distributed Processing System – a form of local area network in which each user has a fully functional computer, but all users can share data and application software. The data and software are distributed among the linked computers, not stored in one central computer.

Document Store Database – a document-oriented database especially designed to store, manage and retrieve documents; the documents it holds are typically semi-structured data.

E

Enterprise Resource Planning (ERP) – a software system that allows an organization to coordinate and manage all its resources, information and business functions.

E-Science – traditionally defined as computationally intensive science involving large data sets. More recently broadened to include all aspects and types of research that are performed digitally.

Event Analytics – a process that shows the series of steps that led to an action.

Exploratory Analysis – finding patterns within data without standard procedures or methods. It is a means of discovering the data and finding the data set’s main characteristics. Usually referred to as data exploration, it constitutes an important part of the data science process.

Exabyte – approximately 1000 petabytes or 1 billion gigabytes. Today, we create one exabyte of new information globally on a daily basis.

Extract, Transform and Load (ETL) – a process for populating data in a database and data warehouse by extracting the data from various sources, transforming it to fit operational needs and loading it into the database.
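
A minimal sketch of an ETL step in Python using only the standard library (csv and sqlite3); the file name sales.csv, its columns and the target table are assumptions made for the example.

```python
# Minimal ETL sketch: extract rows from a CSV file, transform them,
# and load them into a SQLite table.
import csv
import sqlite3

def etl(csv_path="sales.csv", db_path="warehouse.db"):
    # Extract
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))          # expects columns: product, amount

    # Transform: normalize product names and convert amounts to numbers
    records = [(r["product"].strip().lower(), float(r["amount"])) for r in rows]

    # Load
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", records)
    con.commit()
    con.close()

# etl()   # run once a sales.csv file with the expected columns exists
```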

F

Failover – switching automatically to a different server or node if one fails. This is a very useful property of a computer cluster and helps ensure high availability in data analysis processes.

Fault-Tolerant Design – a system designed to continue working even if certain parts fail.

Federal Information Security Management Act (FISMA) – a US federal law that requires all federal agencies to meet certain standards of information security across their systems.

File Transfer Protocol (FTP) – a set of guidelines or standards that establishes the format in which files can be transmitted from one computer to another.

G

Gamification – using game elements in a non-game context. This is a very useful way to create data; for that reason it has been called the friendly scout of big data.

Gigabyte – a measurement of the storage capacity of a computer. One gigabyte represents more than 1 billion bytes. Gigabyte may be abbreviated G, GB or Gig; however, GB is clearer since G also stands for the metric prefix giga (meaning 1 billion).

Graph Database – a database that uses graph structures (a finite set of ordered pairs or certain entities), with nodes, edges and properties, for data storage. Graph databases provide index-free adjacency, meaning every element is directly linked to its neighboring elements.

Grid Computing – connecting different computer systems from various locations, often via a cloud, to reach a common goal.

H

Hadoop – an open-source framework built to enable the processing and storage of big data across a distributed file system. Hadoop is currently the most widespread and most developed big data platform available.

Hadoop Distributed File System (HDFS) – a distributed file system designed to run on commodity hardware.

HBase – an open source, non-relational, distributed database running in conjunction with Hadoop. It is particularly useful for archiving purposes.

High-Performance-Computing (HPC) – using supercomputers to solve highly complex and advanced computing problems.

Hypertext – a technology that links text in one part of a document with related text in another part of the document or in other documents. A user can quickly find the related text by clicking on the appropriate keyword, key phrase, icon or button.

Hypertext Transfer Protocol (HTTP) – the protocol used on the World Wide Web that permits Web clients (Web browsers) to communicate with Web servers. Documents delivered over HTTP are typically written in hypertext markup language (HTML), which allows programmers to embed hyperlinks in them.

I

Indexing – the ability of a program to accumulate a list of words or phrases that appear in a document, along with their corresponding page numbers, and to print or display the list in alphabetical order.
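
A minimal sketch in Python of the indexing idea: each word is mapped to the pages on which it appears; the page contents are invented for illustration.

```python
# Minimal sketch: build an index mapping each word to its page numbers.
from collections import defaultdict

pages = {
    1: "big data needs new tools",
    2: "data science uses statistics and algorithms",
    3: "big data and data science overlap",
}

index = defaultdict(set)
for page_number, text in pages.items():
    for word in text.split():
        index[word].add(page_number)

for word in sorted(index):                 # display the index alphabetically
    print(word, sorted(index[word]))
# e.g. "data" -> [1, 2, 3], "science" -> [2, 3]
```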

Information Processing – the coordination of people, equipment and procedures to handle the storage, retrieval, distribution and communication of information. The term information processing embraces the entire field of processing words, figures, graphics, videos and voice input by electronic means.

In-Database Analytics – the integration of data analytics into the data warehouse.

Information Management – the practice of collecting, managing and distributing information of all types: digital, paper-based, structured and unstructured.

In-Memory Data Grid (IMDG) – the storage of data in memory, across multiple servers, for the purpose of greater scalability and faster access or analytics.

In-Memory Database – a database management system that stores data in the main memory instead of on the disk, resulting in very fast processing, storing and loading of the data.

Internet – a system that links existing computer networks into a worldwide network. The Internet may be accessed by means of commercial online services (such as America Online) and Internet service providers (ISPs).

Internet of Things (IoT) – ordinary devices that are connected to the Internet at any time and from anywhere via sensors. IoT is expected to contribute substantially to the growth of big data.

Internet Service Provider (ISP) – an organization that provides access to the Internet for a fee. Companies like America Online are more properly referred to as commercial online services because they offer many other services in addition to Internet access.

Intranet – a private network established by an organization for the exclusive use of its employees. Firewalls prevent outsiders from gaining access to an organization’s intranet.

J

Juridical Data Compliance – the need to comply with the laws of the country where your data is stored. This is particularly relevant when you use cloud solutions and the data is stored in a different country or continent.

K

Key Value Database – a database in which each record is stored and retrieved through a unique key, making lookups easy and fast. The value stored under a key is normally a primitive type of the programming language (such as a string or a number) or an object.
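
A minimal sketch in Python of the key-value idea, using a plain dictionary as a stand-in for a real store such as Redis or Amazon DynamoDB; the keys and records are invented.

```python
# Minimal sketch: every record is written and read through its unique key.
store = {}

def put(key, value):
    store[key] = value                     # write (or overwrite) the record for this key

def get(key, default=None):
    return store.get(key, default)         # fast lookup by key

put("user:1001", {"name": "Alice", "plan": "premium"})
put("user:1002", {"name": "Bob", "plan": "free"})
print(get("user:1001"))   # -> {'name': 'Alice', 'plan': 'premium'}
```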

Kilobyte – a measurement of the storage capacity of a computer. One kilobyte represents 1024 bytes. Kilobyte may be abbreviated K or KB; however, KB is the clearer abbreviation, since K also stands for the metric prefix kilo (meaning 1000).

L

Latency – a measure of time delay in a system.

Legacy System – an old system, technology or computer system that is no longer supported.

Load Balancing – distributing workload across multiple computers or servers in order to achieve optimal results and utilization of the system.

Location Data (Geo-Location Data) – GPS data describing a geographical location. Very useful for data visualization among other things.

Log File – a file that a computer, network or application creates automatically to record events that occur during operation (e.g., the time a file is accessed).

M

Machine Data – data created by machines via sensors or algorithms.

Machine Learning (ML) – the field of computer science related to the development and use of algorithms to enable machines to learn from what they are doing and become better over time. Although there is a large overlap between ML and artificial intelligence, they are not the same. ML algorithms are an integral part of data science.

MapReduce – a software framework for processing vast amounts of data using parallelization.
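
A minimal sketch in Python of the MapReduce idea applied to word counting; a real framework such as Hadoop MapReduce would distribute the map and reduce phases across many machines, whereas here both run locally for illustration.

```python
# Minimal sketch of the MapReduce pattern: map, shuffle (group by key), reduce.
from collections import defaultdict

documents = ["big data big insights", "data drives insights"]

# Map phase: emit (word, 1) pairs from every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the pairs by key (the word)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # -> {'big': 2, 'data': 2, 'insights': 2, 'drives': 1}
```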

Massively Parallel Processing (MPP) – using many different processors (or computers) to perform certain computational tasks at the same time.

Master Data Management (MDM) – management of core non-transactional data that is critical to the operation of a business to ensure consistency, quality and availability. Examples of master data are customer or supplier data, product information, employee data, etc.

Megabyte – a measurement of the storage capacity of a computer. One megabyte represents more than 1 million bytes. Megabyte may be abbreviated M or MB; however, MB is clearer since M also stands for the metric prefix mega (meaning 1 million).

Memory – the part of a computer that stores information. Often synonymous with Random Access Memory (RAM), the temporary memory that allows information to be stored randomly and accessed quickly and directly, without the need to go through intervening data.

Metadata – any data used to describe other data; for example, a data file’s size or date of creation.

MongoDB – a popular open-source NoSQL database.

MPP Database – a database optimized to work in a massively parallel processing environment.

Multi-Dimensional Database – a database optimized for online analytical processing (OLAP) applications and for data warehousing.

Multi-Threading – the act of breaking up an operation within a single computer system into multiple threads for faster execution. Multi-threading lets a single PC with a modern multi-core CPU make use of all of its cores, somewhat like a small computer cluster.
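
A minimal sketch of multi-threading in Python using the standard library’s ThreadPoolExecutor; the URLs and the one-second simulated download are placeholders.

```python
# Minimal sketch: several downloads (simulated with a sleep) run concurrently
# instead of one after another.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    time.sleep(1)               # stands in for waiting on a network response
    return f"fetched {url}"

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

start = time.time()
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))
print(results)
print(f"elapsed: {time.time() - start:.1f}s")   # about 1s instead of about 3s
```

Note that in Python specifically, threads mainly help with I/O-bound work such as waiting on the network; CPU-bound work is usually spread across cores with the multiprocessing module instead.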

MultiValue Database – a type of NoSQL, multidimensional database that can handle 3-dimensional data directly. These databases primarily store data as large strings, which makes them well suited to manipulating HTML and XML strings directly.

Memetic Algorithm – a special type of evolutionary algorithm that combines a steady state genetic algorithm with local search for real-valued parameter optimization.

N

Natural Language Processing (NLP) – a field of computer science involved with interactions between computers and human languages. NLP is widely used in text analytics and is a popular subfield of data science.

Network Analysis – analyzing the connections and the strength of the ties between nodes in a network; relationships among the nodes are viewed in terms of network or graph theory.

NewSQL – a class of modern relational database systems that aim to combine the scalability of NoSQL systems with the SQL query language and the transactional guarantees of traditional relational databases. As the name suggests, it is even newer than NoSQL.

NoSQL – a class of database management system that does not use the relational model. NoSQL databases are designed to handle very large data volumes that do not follow a fixed schema and do not require the relational model. The name is sometimes read as “Not only SQL” because such databases do not adhere to traditional relational database structures; they typically relax strict consistency in order to achieve higher availability and horizontal scaling.

Normalization – the process of transforming a numeric variable so that its values are in the same range as other normalized variables. This allows for easier comparisons and more efficient ways of handling a set of variables.
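
A minimal sketch in Python of min-max normalization, which rescales each variable to the 0–1 range so that variables measured in different units become directly comparable; the age and income values are invented.

```python
# Minimal sketch of min-max normalization.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]            # constant variable: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

ages    = [18, 35, 52, 70]
incomes = [20_000, 48_000, 95_000, 150_000]
print(min_max_normalize(ages))      # -> [0.0, 0.327..., 0.653..., 1.0]
print(min_max_normalize(incomes))   # -> [0.0, 0.215..., 0.576..., 1.0]
```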

O

Object Database – databases that store data in the form of objects as used by object-oriented programming. They are different from relational or graph databases, and most of them offer a query language that allows objects to be found with a declarative programming approach.

Online Analytical Processing (OLAP) – the process of analyzing multidimensional data using three operations: consolidation (the aggregation of available data), drill-down (the ability for users to see the underlying details) and slice and dice (the ability for users to select subsets and view them from different perspectives).

Online Transactional Processing (OLTP) – the process of managing transaction-oriented workloads, allowing users to enter, update and retrieve large amounts of transactional data (such as orders or payments) in real time.

Open Data Center Alliance (ODCA) – a consortium of global IT organizations whose goal is to speed the migration to cloud computing.

Open Source – a type of software code that has been made freely available for download, modification and redistribution.

Operational Database – databases that record the regular operations of an organization; they are generally very important to a business. Organizations generally use online transaction processing, which allows them to enter, collect and retrieve specific information about the company.

Optimization Analysis – the process of optimization during the design cycle of products done by algorithms. It allows companies to virtually design many different variations of a product and to test that product against pre-set variables.

Ontology – a representation of knowledge as a set of concepts within a domain and the relationships among those concepts. Ontologies are very useful when designing a database.

Outlier Detection – the identification of outliers: objects that deviate significantly from the general average within a dataset or a combination of data. An outlier is numerically distant from the rest of the data and therefore indicates that something unusual is going on that requires additional analysis. Also referred to as anomaly detection.

P

Parallel Data Analysis – breaking up an analytical problem into smaller components and running algorithms on each of those components at the same time. Parallel data analysis can occur within the same system or across multiple systems.

Parallel Method Invocation (PMI) – the ability to allow programming code to call multiple functions in parallel.

Parallel Processing – the ability to execute multiple tasks at the same time.

Parallel Query – a query that is executed over multiple system threads for faster performance.

Pattern Recognition – identifying patterns in data via algorithms to make predictions of new data coming from the same source. Pattern recognition is also referred to as supervised learning and constitutes a major part of machine learning.
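
A minimal sketch in Python of supervised pattern recognition using a one-nearest-neighbor rule: a new point is labeled by copying the label of the closest training example. The tiny labeled training set (height in cm, weight in kg) is invented for illustration.

```python
# Minimal sketch: 1-nearest-neighbor classification.
from math import dist

training = [
    ((150, 45), "child"),
    ((155, 50), "child"),
    ((175, 80), "adult"),
    ((182, 90), "adult"),
]

def classify(point):
    _, label = min(training, key=lambda item: dist(item[0], point))
    return label

print(classify((160, 52)))   # -> "child"  (closest to (155, 50))
print(classify((178, 85)))   # -> "adult"
```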

Performance Management – the process of monitoring system or business performance against predefined goals to identify areas that need attention.

Petabyte – 1024 terabytes or approximately 1 million gigabytes. The CERN Large Hadron Collider generates approximately 1 petabyte per second.

Predictive Analysis (Predictive Analytics) – often considered the most valuable kind of big data analysis, as it helps predict what someone is likely to buy, visit or do, and how someone will behave in the (near) future. It uses a variety of data sets, such as historical, transactional, social or customer profile data, to identify risks and opportunities.

Predictive Modeling – the process of developing a model to predict a trend or outcome.

Program – an established sequence of instructions that tells a computer what to do. The term program means the same thing as software.

Protocol – a set of standards that permits computers to exchange information and communicate with each other.

Q

Quantified Self – a modern movement related to the use of applications to track one’s every move during the day in order to gain a better understanding of one’s behavior.

Query – asking for information to answer a certain question, usually in a database context.

Query Analysis – the process of analyzing a search query for the purpose of optimizing it for the best possible result.

R

R – an open-source programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. R’s popularity has increased substantially in recent years.

Real Time – a descriptor for events, data streams or processes that have an action performed on them as they occur.

Real-Time Data – data that is created, processed, stored, analyzed and visualized within milliseconds of its creation.

Recommendation Engine (Recommender System) – an algorithm that analyzes a user’s purchases and actions on an e-commerce site and then uses that data to recommend complementary products.

Record – a collection of all the information pertaining to a particular subject.

Records Management – the process of managing an organization’s records throughout their entire lifecycle from creation to disposal.

Reference Data – data that describes an object and its properties. The object may be physical or virtual.

Regression Analysis – a statistical technique for estimating the relationship between a dependent (response) variable and one or more explanatory variables. It models a one-way effect from the explanatory variables to the response, but does not by itself establish causation.
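
A minimal sketch in Python of simple linear regression, fitting y = a + b·x by ordinary least squares using the closed-form formulas; the temperature and sales figures are invented.

```python
# Minimal sketch: simple linear regression by ordinary least squares.
def linear_regression(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                    # slope
    a = mean_y - b * mean_x          # intercept
    return a, b

temperature = [20, 25, 30, 35]
sales       = [110, 160, 210, 260]
a, b = linear_regression(temperature, sales)
print(a, b)           # -> -90.0 10.0 : each extra degree adds about 10 sales
print(a + b * 28)     # predicted sales at 28 degrees -> 190.0
```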

Report – the presentation of information derived from a query against a dataset, usually in a predetermined format.

Risk Analysis – the application of statistical methods on one or more datasets to determine the likely risk of a project, action or decision.

Root-Cause Analysis – the process of determining the main cause of an event or problem.

Routing Analysis – using many different variables to find the optimal route for a certain means of transport in order to decrease fuel costs and increase efficiency.

S

Scalability – the ability of a system or process to maintain acceptable performance levels as workload or scope increases.

Schema – the structure that defines the organization of data in a database system.

Semi-Structured Data – a form of structured data that does not conform to a formal structure the way fully structured data does. It contains tags or other markers to enforce a hierarchy of records; JSON objects are a common example of semi-structured data.

Server – a physical or virtual computer that serves requests for a software application and delivers the results over a network.

Signal Analysis – the analysis of measurements of time-varying or spatially varying physical quantities, often to assess the performance of a product. Signal analysis is frequently applied to sensor data.

Similarity Searches – finding the closest object to a query in a database where the data object can be of any type of data.

Simulation Analysis – a simulation is the imitation of the operation of a real-world process or system. A simulation analysis helps to ensure optimal product performance by taking into account many different variables.

Smart Grid – the smart grid refers to the concept of adding intelligence to the world’s electrical transmission systems with the goal of optimizing energy efficiency. Enabling the smart grid will rely heavily on collecting, analyzing and acting on large volumes of data.

Software-as-a-Service (SaaS) – application software that is used over the Web by a thin client or Web browser. Salesforce is a well-known example of SaaS.

Solid-State Drive (SSD) – also called a solid-state disk; a device that uses memory ICs to persistently store data.

Spatial Analysis – the process of analyzing spatial data such as geographic or topological data to identify and understand patterns and regularities within data distributed in geographic space. This is usually performed in a special type of system called a geographic information system (GIS).

Storm – an open source distributed computation system designed for processing multiple data streams in real time.

Structured Data – data that is identifiable because it is organized in a structure such as rows and columns. The data resides in fixed fields within a record or file, or the data is tagged correctly and can be accurately identified.

Structured Query Language (SQL) – a programming language for retrieving and managing data in a relational database. Traditional SQL databases are not directly suited to the big data domain, although SQL-like query layers now exist on top of big data platforms.
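
A minimal sketch of an SQL query issued from Python against an in-memory SQLite database (the sqlite3 module is part of the standard library); the table and its rows are invented for illustration.

```python
# Minimal sketch: create a table, insert rows, and run a typical SQL query.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (name TEXT, country TEXT, spend REAL)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    ("Alice", "USA", 120.0),
    ("Bob",   "UK",   80.0),
    ("Carol", "USA",  60.0),
])

# Total spend per country, highest first
for row in con.execute("""
        SELECT country, SUM(spend) AS total
        FROM customers
        GROUP BY country
        ORDER BY total DESC"""):
    print(row)        # -> ('USA', 180.0) then ('UK', 80.0)
con.close()
```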

T

Terabyte – approximately 1000 gigabytes. A terabyte is the data volume of about 300 hours of high-definition video.

Text Analytics – the application of statistical, linguistic and machine learning techniques on text-based sources to derive meaning or insight.

Thread – a series of posted messages that represents an ongoing discussion of a specific topic in a bulletin board system, a newsgroup or a Web site. In programming, a thread is also an independent sequence of instructions executing within a program (see Multi-Threading).

Time Series Analysis – the process of analyzing well-defined data obtained through repeated measurements over time. The data has to be well defined and measured at successive points in time, spaced at identical intervals.
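
A minimal sketch in Python of one common time series technique, a simple moving average that smooths equally spaced measurements to expose the underlying trend; the monthly figures and the window size of 3 are illustrative assumptions.

```python
# Minimal sketch: a simple moving average over a time series.
def moving_average(series, window=3):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

monthly_visits = [100, 120, 90, 130, 150, 140, 170]
print(moving_average(monthly_visits))
# -> [103.33..., 113.33..., 123.33..., 140.0, 153.33...]
```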

Topological Data Analysis – focusing on the shape of complex data and identifying clusters and any statistical significance that is present within that data.

Transmission Control Protocol/Internet Protocol (TCP/IP) – a collection of over 100 protocols that are used to connect computers and networks.

Transactional Data – data that describes an event or transaction that took place.

Transparency – operating in such a way that whatever is taking place is open and apparent to whomever is interested.

U

Unstructured Data – data that does not follow a predefined data model; it is generally text heavy but may also contain dates, numbers and facts.

V

Value – the benefits that organizations can reap from analysis of big data.

Variability – one of the characteristics of big data, variability means that the meaning of the data can change (and rapidly). For example, in multiple tweets the same word can have totally different meanings.

Variety – one of the major characteristics of big data. Data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data.

Velocity – one of the major characteristics of big data. The speed at which the data is created, stored, analyzed and visualized.

Veracity – one of the major characteristics of big data, veracity refers to the correctness of the data. Organizations need to ensure that both the data and the analyses performed on it are correct.

Visualization – visualizations are complex graphs that can include many variables of data while still remaining understandable and readable. With the right visualizations, raw data can be put to use.

Volume – one of the major characteristics of big data. It refers to the total quantity of data, beginning at terabytes and growing higher over time.

W

Weather Data – an important open, public data source that can provide organizations with a lot of insights when combined with other sources.

X

XML Database – databases that allow data to be stored with its markup tags. XML databases are often linked to document-oriented databases. The data stored in an XML database can be queried, exported and serialized into any format needed.

Y

Yottabyte – approximately 1000 zettabytes, or roughly 250 trillion DVDs. The entire digital universe has not yet reached one yottabyte, even though it is estimated to double in size roughly every 18 months.

Z

Zettabyte – approximately 1000 exabytes or 1 billion terabytes. It was forecast that in 2016, more than 1 zettabyte of data would cross our networks globally on an annual basis.
