EXAMCRAM

The CompTIA® Data+ DA0-001 Cram Sheet

Domain 1.0: Data Concepts and Environments

  1. A database is a system for saving and processing data in a well-organized structure such that the information can be easily manipulated and accessed.

  2. Structured Query Language (SQL) is used to perform various operations on databases. These operations include inserting, updating, and deleting data by leveraging INSERT, UPDATE, and DELETE statements to manipulate the values stored in a table.
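
  A minimal Python sketch of these statements using the built-in sqlite3 module; the employees table and its values are hypothetical, shown only to illustrate INSERT, UPDATE, and DELETE:

    import sqlite3

    # In-memory database used only for illustration
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")

    # INSERT adds a new row to the table
    cur.execute("INSERT INTO employees (id, name, dept) VALUES (1, 'Avery', 'Sales')")

    # UPDATE changes values in existing rows
    cur.execute("UPDATE employees SET dept = 'Marketing' WHERE id = 1")

    # DELETE removes rows that match the condition
    cur.execute("DELETE FROM employees WHERE id = 1")

    conn.commit()
    conn.close()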

  3. A relational database stores data in rows and columns. Examples of relational database management systems include Microsoft SQL Server, MySQL, and PostgreSQL.

  4. A non-relational database (also known as a NoSQL database) is a database that does not use tables, fields, and columns for structured data but uses other mechanisms, such as documents, graphs, and key/value pairs. Examples of non-relational databases include Redis, MongoDB, and Cassandra.

  5. Online transactional processing (OLTP) enables the real-time execution of large numbers of database transactions (e.g., everyday transactions such as online or in-store purchases).

  6. Online analytical processing (OLAP) enables multidimensional analysis at high speeds on large volumes of data and is frequently used in data analytics.

  7. A data warehouse is a centralized repository that aggregates data from one or more different data sources. It is used for analysis of collated information. Data warehouses can only store structured data.

  8. A data mart is a subset of a data warehouse that is pertinent to a specific line of business.

  9. A data lake is a large data storage repository where structured, semi-structured, and unstructured data from multiple sources can be stored and leveraged for data analytics.

  10. Video can be encoded in multiple formats, including AVI, MPEG, MP4, and WebM.

  11. Images can be broadly categorized as raster or vector images. Image formats include JPEG, PNG, GIF, SVG, and PDF.

  12. Audio can be analog or digital. Human ears hear only analog sound waves. Audio formats include MP3, WAV, WMA, OGG, and AAC.

  13. Discrete data is numeric data that consists of complete or whole numbers with fixed, specific values.

  14. Continuous data is data that cannot be measured in absolute terms but can be measured over a period of time.

  15. Characters include numeric digits (0–9), upper- and lowercase letters (a–z and A–Z), and special characters and symbols.

  16. Integers are whole numbers that can have negative, zero, or positive values.

  17. A number that includes a fractional and/or decimal portion is known as a floating point number, or a float.

  18. An array is a linear data structure composed of a set of data elements of similar data type.

  19. A string is a data type that contains alphanumeric data.

  20. A data schema gives information about the structure of data and relationships among tables or models. There are three types of schemas: logical, physical, and view schemas.

  21. A star schema has a single central (primary) fact table connected to a number of associated dimension tables, giving the schema its star shape.

  22. A snowflake schema is an expansion/extension of the star schema in which the dimension tables are further connected to subdimension tables.

  23. Slowly changing dimensions (SCDs) refers to the concept of data dimensions in a data warehouse containing both current and historical data and the data dimensions changing with time. SCDs can be Type 0–4.

  24. Structured data is data that is stored in a fixed field within a table in a well-defined structure (that is, with rows and columns).

  25. Unstructured data does not conform to any data model or schema.

  26. Semi-structured data can be seen as being between structured and unstructured data, and it shares the characteristics of both structured and unstructured data.

  27. Metadata is data about data and contains a description and the context of the data.

  28. Data file formats vary from text to tab delimited to comma delimited.

  29. XML is a platform-agnostic standard markup language that consists of rules for formatting and encoding data so that it can be exchanged across platforms.

  30. JSON is a lightweight text-based file format used for storing and transporting data that is often used when data is sent from a web server to a client web page.
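
  A short Python illustration of JSON using the standard json module; the payload is a made-up example of data a web server might send to a client page:

    import json

    # Serialize a Python dict to a JSON string (as a server might before sending it)
    payload = {"customer": "Acme Corp", "orders": [101, 102], "active": True}
    text = json.dumps(payload)

    # Parse the JSON text back into Python objects (as a client might on receipt)
    data = json.loads(text)
    print(data["orders"][0])  # 101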

Domain 2.0: Data Mining

  1. Data integration combines business and technical processes for collating data from different sources into valuable and meaningful datasets.

  2. Extract, transform, load (ETL) enables data engineers to extract data from multiple source systems, transform the raw data into a more usable/workable dataset, and finally load the data into a storage system so end users can access meaningful data in reports or dashboards.

  3. Extract, load, transform (ELT) enables data engineers to extract the data from data sources, load it into a target data store, and transform it as queries are executed to get insights in reports or dashboards.

  4. Delta loading refers to the process of extracting the delta, or difference in the data compared to what was previously extracted as part of the ETL process.

  5. An application programming interface (API) provides a programmable interface for interacting with applications and infrastructure and acts as a middleware integration layer.

  6. APIs enable organizations to selectively share their applications in terms of data and functionality with internal stakeholders (developers and users) as well as external stakeholders, such as business partners, third-party developers, and vendors.

  7. Web scraping, also known as web data extraction or web harvesting, is a method used to extract data from websites.

  8. Surveys are commonly used to collect data from respondents.

  9. Sampling is the process of collecting data from a subdivision/subset of a given population to get insights that represent the whole population.
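
  A minimal sketch of simple random sampling using Python's random module; the population of customer IDs is purely illustrative:

    import random

    # A made-up population of 1,000 customer IDs
    population = list(range(1, 1001))

    # Draw a simple random sample of 50 IDs without replacement
    sample = random.sample(population, k=50)
    print(len(sample), sample[:5])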

  10. A derived variable is defined by a parameter or an expression related to existing variables in a dataset.

  11. The process of recoding a variable can be used to transform a current variable into a different one, based on certain criteria and business requirements.

  12. Data merging simplifies data analysis by merging multiple datasets into one larger dataset.

  13. Data blending brings together data from multiple sources that may be very dissimilar.

  14. Duplicate data leads to multiple entries with the same data values being created in the database/warehouse.

  15. Data appending refers to adding new data elements to an existing dataset/database.

  16. Imputation is helpful in filling in missing values. Imputation can be based on logical rules, on related observations, on the last observation carried forward, or on creating new variable categories.
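
  One possible sketch of mean imputation in plain Python (other strategies, such as last observation carried forward, follow the same pattern); the observations are hypothetical:

    # Hypothetical observations with missing values represented as None
    values = [12.0, None, 15.5, 14.0, None, 13.5]

    # Mean of the observed (non-missing) values
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)

    # Impute: replace each missing value with that mean
    imputed = [v if v is not None else mean for v in values]
    print(imputed)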

  17. Data reduction is a data manipulation technique that is used to minimize the size of a dataset by aggregating, clustering, or removing any redundant features.

  18. Data redundancy occurs when the same datasets are stored in multiple data sources.

  19. Data manipulation is an important step for business operation and optimization when dealing with data and analysis. Data analysts and engineers can manipulate data so that analysis can be performed on cleansed, focused, and more accurate datasets.

  20. Normalization is aimed at removing redundant information from a database and ensuring that only related data is stored in a table.

  21. Many data functions are available to help collate or get focused insights from data. Some examples are aggregate functions, logical functions, sorting, and filtering.

  22. Missing data is one of the key issues with data accuracy and consistency.

  23. Specification mismatch is caused by data at the source being a mismatch for data at the destination due to unrecognized symbols, bad data entry, invalid calculations, or mismatching of units/labels.

  24. A data outlier in a dataset is an observation that is inconsistent or very dissimilar to the remaining information.

  25. Invalid data refers to values that were initially generated inaccurately.

  26. Non-parametric data is data that does not fit a well-defined or well-stated distribution.

  27. Data type validation ensures that data has the correct data type before it is leveraged at the destination system.

  28. An execution plan works behind the scenes to ensure that a query gets all the needed resources and is executed; it outlines the steps for execution of the query from start through output.

  29. A parameterized query makes it possible to use placeholders for parameters, where the parameter values are supplied at execution time.
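
  A minimal sketch of a parameterized query using Python's sqlite3 module, where ? placeholders receive their values at execution time; the orders table and values are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
    cur.execute("INSERT INTO orders VALUES (1, 'West', 250.0)")

    # The ? placeholders are parameters; their values are supplied separately,
    # which also helps protect against SQL injection.
    region, minimum = "West", 100.0
    cur.execute("SELECT id, amount FROM orders WHERE region = ? AND amount >= ?",
                (region, minimum))
    print(cur.fetchall())
    conn.close()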

  30. Indexing speeds up the execution of queries by rapidly finding records and delivering all the columns requested by the query without executing full table scans.

  31. A B-tree is formed of nodes: the tree starts at a root node that has no parent, and every other node in the tree has exactly one parent node and might or might not have child nodes.

  32. A clustered index determines how the records in the table are physically stored (sorted), whereas a non-clustered index stores the index in one place and the records in another, acting like a pointer to the data.

  33. Temporary tables offer workspace for transitional results when processing data.

  34. There are two types of temporary tables that you can create in Microsoft SQL: global and local.

  35. A subset is a smaller set of data from a larger database or a data warehouse that allows you to focus on only the relevant information.

  36. Data subsetting can be performed by using two methods: data sharding and data partitioning. Data sharding involves creating logical horizontal partitions in a database to quickly access the data of interest. Partitioning involves creating logical vertical partitions in a database.

Domain 3.0: Data Analysis

  1. Python offers the capability to manage huge amounts of information and to create and manage data structures rapidly.

  2. Microsoft Excel enables data analysis via data models and queries.

  3. R is an open source data analysis tool that is used for statistics, data visualization, and data science projects.

  4. Tableau is a popular data visualization solution that has an intuitive user interface.

  5. Power BI is Microsoft’s business analytics solution that offers reporting and visualization.

  6. Amazon Web Services (AWS) QuickSight is a machine learning–powered cloud-native BI service.

  7. Central tendency of a dataset can be identified by using measures such as mode, median, and mean.

  8. Mean indicates a dataset’s average value. The formula for the mean is μ = (X1 + X2 + X3 + . . . + Xn) / n, where X is a value and n is the number of values.

  9. The median is the middle value of a dataset whose values are arranged in ascending or descending order.

  10. The mode is the value that occurs the most frequently in a dataset.
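
  A quick check of all three central tendency measures (items 7–10) using Python's statistics module on a made-up dataset:

    import statistics

    data = [2, 3, 3, 5, 7, 10]

    print(statistics.mean(data))    # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5.0
    print(statistics.median(data))  # middle of the sorted values: (3 + 5) / 2 = 4.0
    print(statistics.mode(data))    # most frequently occurring value: 3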

  11. The measure of dispersion describes the scattering of data and explains the variation in data points.

  12. Mean deviation denotes the arithmetic mean of the absolute deviations of the observations from a measure of central tendency.

  13. Standard deviation is the square root of the arithmetic mean (AM) of the squared deviations of the provided values from their arithmetic mean. The formula for standard deviation is

    σ = √( Σ|x − μ|² / N )

  14. The square of standard deviation, σ2, is known as variance.
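
  A short sketch of population standard deviation and variance (items 13–14) using Python's statistics module; the values are illustrative:

    import statistics

    data = [4, 8, 6, 5, 7]

    sigma = statistics.pstdev(data)        # population standard deviation
    variance = statistics.pvariance(data)  # population variance (sigma squared)

    print(sigma, variance)  # 1.4142... 2.0 (variance == sigma ** 2)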

  15. Frequency is the number of times an observation of a specific value occurs in data.

  16. Percent change can be used for comparing old values with new values. Percent change is calculated by the following formula:

    ((V2 − V1) / |V1|) × 100
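
  A small Python helper implementing this formula; the example values are hypothetical:

    def percent_change(v1, v2):
        """Percent change from an old value v1 to a new value v2."""
        return (v2 - v1) / abs(v1) * 100

    print(percent_change(200, 250))  # 25.0 (a 25% increase)
    print(percent_change(250, 200))  # -20.0 (a 20% decrease)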

  17. Descriptive statistics helps summarize data in a meaningful way such that statisticians can identify patterns in the data collected.

  18. Confidence intervals are used to measure the degree of certainty in a sampling method. A confidence interval can be calculated by using the following formula:

    CI = X̄ ± Z × (σ / √n)
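
  A minimal Python sketch of this calculation, assuming a known Z value (1.96 for roughly 95% confidence); the sample figures are made up:

    import math

    def confidence_interval(sample_mean, sigma, n, z=1.96):
        """Return the lower and upper bounds of the confidence interval."""
        margin = z * sigma / math.sqrt(n)
        return sample_mean - margin, sample_mean + margin

    # Hypothetical sample: mean 50, standard deviation 8, sample size 100
    print(confidence_interval(50, 8, 100))  # approximately (48.432, 51.568)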

  19. Inferential analysis helps in making inferences and predictions by performing analysis on a sample population of data from an original/larger dataset.

  20. A Z-score, also referred to as a standard score, describes how distant from the mean a data point is. A Z-score is calculated using the following formula: Z = (x – μ) / σ

  21. A t-test helps identify whether there is a significant difference between the mean values of two groups.

  22. A p-value (probability value) is leveraged as part of hypothesis testing to help accept or reject the null hypothesis.

  23. Chi-square testing is used to test a hypothesis about the observed distributions in various categories. Chi-square can be calculated using the following formula:

    χc² = Σ (Obi − Ei)² / Ei

    where:
    c denotes the degrees of freedom,
    E represents the expected value, and
    Ob represents the observed value
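
  A plain-Python sketch of computing the chi-square statistic from observed and expected counts; the counts are hypothetical (statistical libraries such as SciPy provide equivalent functions):

    # Hypothetical observed and expected counts for four categories
    observed = [18, 22, 30, 30]
    expected = [25, 25, 25, 25]

    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    print(round(chi_square, 2))  # 4.32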

  24. A hypothesis describes a perception or an idea about a value that can be tested given sample data from a population. The null hypothesis (H0) is where things are happening as expected, and there is no difference from the expected outcome. The alternative hypothesis (Ha) is where things differ from the expected outcome.

  25. A type I error occurs when the null hypothesis is incorrectly rejected when, in fact, it is true. A type II error occurs when the null hypothesis is not rejected when, in fact, it is false.

  26. Simple linear regression helps describe a relationship between two variables through an equation of a straight line. This line is sometimes also called the line of best fit.
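
  A brief sketch of fitting a line of best fit with statistics.linear_regression (available in Python 3.10 and later); the paired x/y values are made up:

    import statistics

    # Hypothetical paired observations
    x = [1, 2, 3, 4, 5]
    y = [2.1, 4.0, 6.2, 7.9, 10.1]

    slope, intercept = statistics.linear_regression(x, y)
    print(round(slope, 2), round(intercept, 2))  # 1.99 0.09

    # Predicted value on the line of best fit for x = 6
    print(round(slope * 6 + intercept, 2))  # 12.03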

  27. Correlation is a statistical measure that denotes the degree to which two values or variables are associated.

  28. To transform data into business insights and drive decisions, it is important to prepare and follow a set of questions that give direction to the analysis performed.

  29. Data sources can be broadly categorized as primary and secondary sources and as internal or external.

  30. A gap analysis is usually performed to compare the current state to a desirable future state, with an action plan to get from the current state to the desired state.

  31. A number of analysis techniques may be leveraged based on the business needs. These include text, statistical, diagnostic, prescriptive, and predictive analysis techniques.

  32. Trend analysis is based on the comparison of data over specific time periods in order to spot a pattern or a trend.

  33. Performance analysis is used to study or compare the performance of a particular activity or process in order to identify the strengths and weaknesses.

  34. Data scientists and statisticians use exploratory data analysis to analyze and investigate datasets and summarize their major characteristics.

  35. Performance analysis can be used to set a baseline to drive performance measurements across organizations.

  36. Link analysis allows analysts to identify connections and association patterns within the nodes and links of a network.

Domain 4.0: Visualization

  1. A report can be broadly categorized as a formal report or an informal report, as well as an analytical report or an informational report, based on its content and intended audience.

  2. Filters in a report or a dashboard are very useful for showing only the data of interest.

  3. A view is a stored query that can help pull in data from tables as well as other views.

  4. When data is refreshed at the source, a report should be refreshed with updated data as well.

  5. There can be three major audience categories for a report: primary, secondary, and tertiary.

  6. A report’s executive summary is presented to C-level executives.

  7. A report can have multiple design elements that make the report more intuitive, including color schemes, font sizes and styles, charts, and labels.

  8. A color wheel offers a multitude of colors that can be chosen to represent the information in a report.

  9. A report should have a version number to ensure that the most recent report is being used.

  10. You should reference the source(s) of the data used in a report, especially when you use an external document or article for text, figures, articles, findings, diagrams, maps, etc.

  11. An FAQ (frequently asked questions) is a list of commonly asked questions and answers pertinent to a report.

  12. A dashboard is a single screen that offers insights about key performance indicators (KPIs).

  13. There are several types of data attributes: nominal, ordinal, binary, numeric, interval, ratio, discrete, and continuous.

  14. Live-data-feed, or continuous, dashboards permit live inputs for making decisions in near real time. Static dashboards enable users to look at information from a particular time.

  15. A mockup is essentially a replica of a final dashboard that can be used to demonstrate the look and feel before the dashboard development work begins.

  16. The steps in the dashboard development process are as follows: Requirements gathering, Ideation, Storyboard creation, Design layout, Testing, Deployment, and Feedback & maintenance

  17. Dashboard subscriptions are a great way to keep consumers up to date on the data that matters most to them.

  18. Drill-down allows users to move from an overview of data to a more granular view of the same dataset.

  19. Dashboards can be scheduled for delivery over email on a periodic basis.

  20. Roll-up allows users to get real-time insights from multiple sources of information in one place and take action on the insights presented.

  21. A line chart, which is also known as a line graph, shows a set of points of information connected by a continuous line.

  22. A pie chart uses slices of data categories that together make up the whole chart, totaling 100%.

  23. A bar chart, also known as a bar graph, categorizes information into a graphic with bars of different lengths, where the length of each vertical bar is relative to the quantity or amount of the information it represents.

  24. Scatter plots are usually used to find the relationship between two variables.

  25. A bubble chart is an extension of a scatter plot graph.

  26. A heat map is a visualization used for showing differences in information based on colors.

  27. A waterfall chart visualizes the data by denoting how a value is modified as it moves between two points.

  28. A histogram represents a data distribution over a defined period or a continuous interval.

  29. Geographic map data visualizations, also known as choropleth maps or thematic maps, are a novel way to show comparative values across states, countries, or regions.

  30. Tree maps are used to capture data values in a hierarchical structure to present the information visually.

  31. A stacked chart, also known as stacked bar graph, is an extension of the usual bar chart in which each bar is divided into a number of sub-bars stacked together.

  32. Infographics, or information graphics, are visual illustrations of data that make information clear.

  33. A word cloud, also known as a tag cloud, is a visual design in which textual content is shown in a random arrangement, with each word sized relative to its frequency.

  34. Ad hoc reporting, sometimes also called one-time reporting, enables users to generate reports on the fly so they don’t have to wait for the usual/scheduled reports.

  35. Static reports offer insights about data or trends at a point in time or over a specific time period.

  36. Dynamic reports are graphical outlays that combine live and static reporting elements so that users can click through links to the different types of data needed on demand.

  37. Self-service reports can be generated by end users without IT assistance.

  38. Recurring reports are reports that are scheduled and generated based on predetermined KPIs.

  39. A research report is used in the field of research, where a researcher reports their findings on a topic of interest.

Domain 5.0: Data Governance, Quality, and Controls

  1. Role-based access control (RBAC) permits or restricts data access based on an individual’s role within an organization.

  2. A data use agreement (DUA) is used when moving protected data from one party to another.

  3. Encryption can be broadly classified as symmetric or asymmetric. With symmetric encryption, the encryption (scrambling) and decryption (unscrambling) keys are the same. With asymmetric encryption, the encrypting key is different from the decrypting key.

  4. Symmetric key encryption is also known as shared key encryption, and asymmetric encryption is also known as public key cryptography.

  5. Data is considered in transit or in motion when it is moving between devices or from one database/data warehouse/data lake to another database/data warehouse/data lake.

  6. Data masking enables some of the data value fields or even parts of value fields to be hidden. It is not encryption, just hiding the actual field values. Data masking can be static or dynamic.
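
  A toy Python illustration of static masking, hiding all but the last four characters of a value; the card number shown is fake:

    def mask_value(value, visible=4, mask_char="*"):
        """Mask all but the last `visible` characters of a string."""
        return mask_char * (len(value) - visible) + value[-visible:]

    print(mask_value("4111111111111111"))  # ************1111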

  7. Local storage refers to data storage on a drive (hard-disk drive, solid-state drive, or zip drive) in a computer network on an organization’s own premises.

  8. Cloud-based storage enables users to store data in cloud provider storage (such as AWS S3 or EBS, Azure Blobs, or GCP Cloud Storage).

  9. An organization may ask new members and/or vendors as well as suppliers to sign an acceptable use policy (AUP) prior to giving them access to internal systems or information.

  10. Data deletion can be primarily of two types: user request based or regulation enforced.

  11. A data retention policy drafted and enforced by an organization states the directions for storing, holding, and deleting data.

  12. Data constraints are rules to effectively enforce the type of data that can be inserted, updated, or deleted in a table or a column.

  13. Data classification is a key step in helping ensure that the governance and protection of data against unauthorized access are executed as required and that there is no alteration, disclosure, or damage to the data.

  14. Personally identifiable information (PII) is regulated data that could identify an individual.

  15. Personal health information (PHI) is very similar to PII in that PHI is any individually identifiable health information.

  16. Payment Card Industry (PCI) regulations protect an individual’s credit card information.

  17. Data breaches may be caused by human error (e.g., failure to apply an appropriate security control) or may be due to hacking of targeted systems where the hackers are after specific data.

  18. Data accuracy refers to the information being correct at the time it is used for analytics.

  19. Data integrity is a measure of overall consistency, accuracy, and completeness of data stored in a database, data lake, or data warehouse over the life cycle of data.

  20. Data validity refers to the way the data is entered in a system to begin with, with the right inputs from end users in terms of the data types and format expected.

  21. Data transformation can be broadly categorized as pass-through or conversion.

  22. In the data transformation process, transformation can be active or passive.

  23. Data quality rules provide a guide that allows data engineers and data analysts to ensure that the data being considered for analytics is fit for its intended purpose.

  24. Master data management (MDM) can help eliminate duplicate records by merging them together into a single, consolidated record.

  25. MDM enables management, organization, categorization, synchronization, and localization of all organizational data.

  26. Data standardization can be performed based on predefined business standards using rules or by leveraging third-party tools.

  27. Data conformity measures the alignment of data types and formats against defined standards.

  28. Data profiling helps discover any discrepancies, imprecisions, and missing data so that data engineers can correct the data before it leads to incorrect outcomes.

  29. Cross-validation is based on splitting data into training and test sets.
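
  A minimal sketch of a random train/test split in plain Python (an 80/20 split is assumed here; cross-validation repeats this over multiple folds):

    import random

    # Hypothetical dataset of 10 records
    records = list(range(10))

    random.shuffle(records)
    cut = int(len(records) * 0.8)
    train, test = records[:cut], records[cut:]

    print(len(train), len(test))  # 8 2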

  30. A data dictionary is a centralized store for metadata.

  31. MDM helps in integrating new data sources and in creating a single master source and is particularly useful during M&A. An MDM hub contains the golden record.

  32. MDM offers a cohesive view of customers across an organization.

  33. Data dictionaries can make it simpler to navigate the tons of data that gets processed from multiple sources in MDM.

  34. MDM provides unification of data, which directly relates to how an organization can direct its people, processes, and technology.

  35. Data transformation can be broadly categorized as pass-through or conversion.

  36. The CIA triad has three pillars: confidentiality, integrity, and availability.

  37. Wired Equivalent Privacy (WEP) is considered to be weak and is not recommended for deployment today.

  38. Mobile device management (MDM) enables organizations to allow employees to bring their own devices for work.

  39. Encrypting data during transfer is sometimes also referred to as end-to-end encryption.

  40. Data loss prevention (DLP) can be used to ensure that any classified and protected data cannot be exfiltrated from the organization without proper approval.

  41. Shared drives can be both on-premises and in the cloud.

  42. An entity relationship diagram is a visual method of explaining the relationships among the various entities in a database.

  43. Examples of common data classifications are public, private, confidential, controlled, restricted, sensitive, and internal use only.
