An accuracy score is an evaluation metric used to estimate a machine learning model’s performance. It is the ratio of the number of correct predictions to the total number of predictions.
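As a minimal sketch of this ratio, the hypothetical helper `accuracy_score` below counts matching labels in plain Python. The function name and the sample labels are illustrative assumptions, not part of any particular library.

```python
def accuracy_score(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


labels = [1, 0, 1, 1]        # illustrative actual labels
predictions = [1, 0, 0, 1]   # illustrative model predictions
print(accuracy_score(labels, predictions))  # 3 correct out of 4 -> 0.75
```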
An activation function is a function used in artificial neural networks (ANN) that determines whether a neuron should be activated by calculating its output to the next hidden layer (or output layer) based on the input from the previous layer (or input layer). The activation function is responsible for the non-linear transformation in a neural network.
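To make this concrete, here is a small sketch of two widely used activation functions, sigmoid and ReLU, applied to a single neuron. The helper `neuron_output` and the example weights are assumptions chosen for illustration only.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Pass positive values through and clip negative values to zero."""
    return max(0.0, x)


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return activation(z)


# Hypothetical inputs and weights: 1.0 * 0.5 + 2.0 * 0.25 + 0.0 = 1.0
print(neuron_output([1.0, 2.0], [0.5, 0.25], 0.0, relu))  # 1.0
print(sigmoid(0.0))                                       # 0.5
```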
An algorithm is a sequence of repeatable steps, often expressed mathematically, written by a human and executed by a computer, to solve a certain type of data science problem. Algorithms range from very simple to extremely complex. Different algorithms are suitable for different tasks and technologies. The main concept is that an algorithm takes in some input and produces an output, and, for a deterministic algorithm, the same input will always produce the same output. In machine learning, algorithms take in input in the form of data and hyperparameters, identify and learn common patterns from the data, and produce outputs in the form of predictions.
Apache Spark is an open-source multifunctional parallel processing framework for analyzing and modeling big data. Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets because each node only works with a small amount of data. As each node works on its own subset of the total data, it also carries out a part of the total calculations required, so that both data processing and computation are performed in parallel across the nodes in the cluster. Parallel computation can make certain types of programming tasks much faster.
API is an acronym for Application Programming Interface, a software intermediary that ensures a connection between applications or computers. An example of an API is embedding Google Maps in a Rideshare application. Data scientists often work with APIs to access data (e.g., Twitter API to download tweets), or to package a solution they made (e.g., an API that calls a machine learning model in production).
Artificial Intelligence (AI)
Artificial intelligence is a branch of computer science that involves using machine learning, programming, and data science techniques that enable computers to behave intelligently. AI systems are broad and have varying degrees of complexity. They range from rule-based systems to machine learning-based systems, and can perform tasks such as fraud detection, object recognition, language translation, stock price prediction, and much more.
Artificial Neural Networks (ANN)
An Artificial Neural Network is a machine learning model that is loosely inspired by biological neural networks in human brains. Neural networks consist of up to hundreds of layers of interconnected units called neurons. Conceptually, an artificial neural network has the following types of layers: input, output, and hidden layers used to filter the data through, process it with an activation function, and make predictions at the output. ANN are the building blocks of a subset of machine learning called deep learning, which delivers complex outputs such as image or sound recognition, object detection, language translation, and more.
Backpropagation is a technique used in training deep learning networks. It is based on gradient descent and iteratively tunes the weights and biases to improve the accuracy of a network. The algorithm calculates the error of the output in each training iteration, and then propagates it back through the network, thus enabling it to reduce the error in future training iterations.
A Bayesian Network is a probabilistic graph showing the relationship between random variables for an uncertain domain, where the graph nodes represent those variables, and the links between each pair of nodes (the edges) represent the conditional probability for the corresponding variables. An example of Bayesian Networks is in medical diagnoses, where researchers predict health outcomes while taking into account all the factors that may affect an outcome.
Bayes' Theorem is a mathematical equation for calculation of conditional probability, i.e., the probability of event B occurring given that related event A has already happened. One of the applications of this theorem in data science is building Bayesian networks for large datasets.
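The theorem can be written as P(B|A) = P(A|B) * P(B) / P(A). The sketch below applies it to a made-up diagnostic test; the function name `bayes_posterior` and all probabilities are illustrative assumptions, not real medical figures.

```python
def bayes_posterior(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) * P(B) / P(A)."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_a_given_b * p_b / p_a


# Hypothetical numbers: B = "has the condition", A = "test is positive".
p_b = 0.01              # prior probability of the condition
p_a_given_b = 0.90      # probability of a positive test given the condition
p_a_given_not_b = 0.05  # probability of a positive test without the condition

# Total probability of a positive test.
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

print(round(bayes_posterior(p_a_given_b, p_b, p_a), 4))  # 0.1538
```

Even with a fairly reliable test, the posterior stays low here because the assumed prior is small.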
Bias refers to the tendency of models to underfit the data, leading to inaccurate predictions in machine learning and data science. This is the definition of bias that is often discussed in the bias-variance tradeoff. Bias can also mean algorithmic bias, which refers to the propensity of machine learning models to reproduce harmful societal biases by treating different groups of individuals differently based on protected attributes such as race, sexual orientation, gender identity, age, pregnancy, veteran status, and more.
The bias-variance tradeoff is the tradeoff between bias and variance when creating a machine learning model. Bias and variance are two types of prediction error when creating machine learning models—where a high bias indicates model underfitting, and high variance indicates model overfitting. Minimizing both of these factors to an optimal level decreases the overall error of predictions.
Big Data is the field that revolves around processing, treating, and extracting information from data sets that are too large or complex for traditional data processing tools. Big data is defined by the five Vs: velocity (the speed of data generation); volume (the amount of data generated); variety (the variety of data types, e.g., text, images, tabular data, etc.); veracity (the quality and truthfulness of the data); and value (the data’s propensity to be translated into valuable business insights).
Binomial distribution is the discrete probability distribution of outcomes of independent trials, with two mutually exclusive possible outcomes (success and failure), a finite number of trials, and a constant probability of success. In simple terms, a binomial distribution can be thought of as the probability of a particular outcome (success or failure) in an event that is repeated multiple times (e.g., the probability of rolling a 3 a given number of times when a die is rolled 5 times).
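As a worked example, the sketch below computes the binomial probability of exactly k successes in n trials using the standard formula C(n, k) * p^k * (1 - p)^(n - k). The helper name `binomial_pmf` is an assumption for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of rolling a 3 exactly once in 5 rolls of a fair die.
print(round(binomial_pmf(1, 5, 1 / 6), 4))  # 0.4019
```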
Business analysts are responsible for tying data insights to actionable results that increase profitability or efficiency. They have deep knowledge of the business domain and often use SQL alongside non-coding tools to communicate insights derived from data.
Business Analytics (BA)
Business analytics is a subfield of analytics focused on using historical and current data to discover valuable operational insights, anticipate possible future trends, and make data-driven business decisions. The toolkit of business analytics usually includes statistical analysis, descriptive analytics and data visualization, and may cross into predictive analytics and machine learning.
Business Intelligence (BI)
Business intelligence is a subfield of analytics combining descriptive analytics, business analytics, data visualization, statistical analysis, reporting, and more, aimed at helping organizations make data-driven decisions. BI usually leverages non-coding tools such as Tableau and Power BI to explore trends in historical and current data. Unlike business analytics, the main focus of BI is on descriptive analytics.
A categorical variable is a variable that can have one of a limited number of possible values (categories) without any intrinsic ordering involved. An example of a categorical variable would be marital status (e.g., married, single, divorced). It is also called a nominal or qualitative variable.
Classification is a supervised learning problem in which categorical outcomes are predicted based on input features. Examples of classification problems are fraud detection (e.g., is this transaction fraudulent given the set of input features?) and email spam filters (e.g., is this email spam or not?). Commonly used classification algorithms are k-nearest neighbors, decision trees, random forest, etc.
Clustering is an unsupervised learning problem concerned with grouping all the observations of a dataset according to their similarity by some common characteristics. Unlike a classification problem, these groups (called clusters) are not predefined by humans, but identified by machine learning algorithms while learning from the input data. The elements in each cluster are similar among themselves and different from all the others. Common clustering algorithms are k-means, hierarchical clustering, spectral clustering, etc.
Computer Science is a multifaceted field of study focused on theoretical and practical aspects of processing information in digital computers, designing computer hardware and software, and the applications of computers. In particular, computer science deals with artificial intelligence, computational systems, algorithms, data structures, data modeling, security, computer and network design, etc.
Computer vision is an area of computer science concerned with enabling computers to achieve high-level understanding from digital images or videos, near to how humans can see them. Computer vision became especially popular with the evolution of deep learning and the accumulation of big data. Some of its applications are object and facial recognition, motion analysis, self-driving cars, and optical character recognition.
A confusion matrix is a table illustrating the predictive performance of a classification model. Usually, a confusion matrix is created for a binary output (i.e., prediction problems with only two types of prediction, e.g., whether a transaction is fraudulent or not), so the resulting table is a two-by-two table. A confusion matrix represents the relationship between predictions and actual labels for both classes. It easily surfaces the number of accurate predictions made (true positives and true negatives), and the number of false positives (type I error) and false negatives (type II error) made.
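The four cells of a binary confusion matrix can be tallied with a few comparisons. The sketch below is a minimal plain-Python version; the function name, the dictionary keys, and the sample labels are illustrative assumptions.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            counts["tp"] += 1   # correctly predicted positive
        elif predicted == positive:
            counts["fp"] += 1   # type I error
        elif actual == positive:
            counts["fn"] += 1   # type II error
        else:
            counts["tn"] += 1   # correctly predicted negative
    return counts


actual = [1, 0, 1, 1, 0, 0]      # illustrative labels (1 = fraudulent)
predicted = [1, 1, 0, 1, 0, 0]   # illustrative predictions
print(confusion_matrix(actual, predicted))
# {'tp': 2, 'fp': 1, 'tn': 2, 'fn': 1}
```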
A continuous variable is a variable that can take on an infinite set of values within a particular range. Examples of continuous variables are height and weight.
Correlation is the strength and direction of the relationship between two or more variables, measured by a correlation coefficient, or Pearson coefficient. Statistically, a correlation coefficient is the ratio of the covariance of two variables to the product of their standard deviations. It can take the values from -1 (a perfect negative correlation) to 1 (a perfect positive correlation). The presence of correlation between two variables does not imply cause-effect relation.
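The ratio described above can be sketched directly: compute the sample covariance, then divide by the product of the two standard deviations. The helper names `covariance` and `pearson` are assumptions for illustration.

```python
import math


def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    std_x = math.sqrt(covariance(x, x))  # variance is the covariance with itself
    std_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (std_x * std_y)


print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0 (perfect positive)
print(round(pearson([1, 2, 3], [3, 2, 1]), 6))        # -1.0 (perfect negative)
```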
A cost function is a function used in machine learning to measure the average of the differences between the predicted and actual values over the training set. Training aims to minimize it.
Covariance is a measure of the relationship between two variables. Unlike variance that measures variations inside the same variable, covariance shows how variations in one variable influence changes in the second one. Covariance is used for calculating a correlation coefficient.
Cross-Validation
Cross-validation is a resampling method used when training machine learning models that splits labeled data into training and test sets. In each iteration of cross-validation, different parts of the data are used to train and test the model. The training set is used to train a model, and the test set is used to make predictions and compare them with the actual labels for those entries. Afterwards, an overall accuracy metric is calculated to estimate the predictive performance of the resulting model.
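One common variant is k-fold cross-validation, where the data is cut into k parts and each part serves as the test set once. The sketch below only produces the index splits; the function name `k_fold_indices` is a hypothetical helper, and real projects usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


for train_idx, test_idx in k_fold_indices(5, 2):
    print("train:", train_idx, "test:", test_idx)
# train: [3, 4] test: [0, 1, 2]
# train: [0, 1, 2] test: [3, 4]
```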
A dashboard is an interactive, graphical user interface used to visualize, summarize, and report key performance indicators (KPI), progress metrics, and information around business processes that allows the target audience to easily grasp the most important insights at many levels. Dashboards are built using non-coding tools such as Excel, Tableau, or PowerBI, or even coding tools such as Python and R. A dashboard is often linked to regularly updated databases and services.
Data Analysis (DA)
Data analysis is a discipline focused on cleaning, transforming, visualizing, and exploring data with the purpose of extracting meaningful patterns and insights, and communicating the results to the interested parties. Data analysis is usually the first milestone in all data science projects, but it can also represent a stand-alone project. Unlike data science, though, it deals more with descriptive analytics than with predictive analytics.
Similar to business analysts, data analysts are responsible for analyzing data and reporting insights from their analysis. They have a deep understanding of the data analysis workflow and derive and report their insights using a combination of coding and non-coding tools.
A database is a structured storage space where the data is organized in many different tables in a way such that the necessary information can be easily accessed and summarized. Databases are mostly used with a relational database management system (RDBMS) such as Oracle or PostgreSQL. The most common programming language used to interact with the data from a database is SQL.
Database Management System (DBMS)
A database management system is a software package used to easily perform different operations on the data: accessing, manipulating, retrieving, managing, and storing the data in a database. Based on the way the data is organized and structured, there are different types of DBMS: relational, graph, hierarchical, etc. Some examples of DBMS: Oracle, MySQL, PostgreSQL, Microsoft SQL Server, MongoDB.
Data consumers are often in non-technical roles, but consume data insights and analytics provided by data professionals to make data-driven decisions. Data consumers often need to have conversations with data professionals and should be able to distinguish when data can and cannot be used to answer business questions.
A data engineer is a specialist responsible for putting the right data into the hands of data scientists and data analysts. They design and maintain the storage infrastructure and data pipelines that bring large amounts of raw data from various sources into one centralized location with clean, correctly formatted data that is relevant for the organization.
Data Engineering (DE)
Data engineering is a specialization that focuses on scaling data access within the organization. Data engineers work on data acquisition, collection, management, and storage, as well as setting up data pipelines and transforming data into high-quality data usable by the rest of the organization.
Data enrichment is the process of enhancing, refining, and augmenting raw data, to make it more useful for the organization and, as a result, to obtain more meaningful business insights and optimize predictive analytics.
A dataframe is a tabular data structure with labeled axes (rows and columns) whose columns can be of different types.
DAMA defines Data Governance as “The planning, oversight, and control over management of data and data-related sources”. Data governance sets the roles, responsibilities, and processes for ensuring data availability, relevance, quality, usability, integrity, and security. Data governance includes a governing body, a framework of rules and practices to meet the company's information needs, and a program to perform these practices.
Data journalism is a type of journalism concerned with processing and analyzing large amounts of numerical data, with the purpose of creating a story about the data, or information derived from this data. This role appeared as the result of a consistently growing information flow and the increasing interaction of modern journalism with such spheres as statistics, information technology, and data science.
A data lake is a single storage repository containing a large amount of raw, unprocessed data of any kind from various sources, which as yet has no defined purpose. A data lake includes both structured data of different structures without any relationship between one another and, more often, unstructured data, such as documents and text files. The raw data is kept as an original source of information, without the necessity to structure and wrangle it unless the data is needed.
Data literacy is the ability to read, write, analyze, communicate, and reason with data to make better data-driven decisions. From an organizational perspective, it is a spectrum of data skills ranging from data-driven decision making, to advanced technical skills in data science, data engineering, and machine learning, resulting in everyone in the organization having the relevant competencies and generating value from data at scale.
Data mining is the process of collecting relevant data from various sources, cleaning and transforming it to the right format, detecting and extracting meaningful hidden trends, patterns, and interconnections between the data, and communicating actionable insights to help the organization make data-driven decisions and develop better strategies. For this purpose, various analytical and modeling techniques are used, including statistical analysis, data visualization, regression and classification.
Data modeling is the process of developing a visual representation of either a whole IT system, or parts of it, to communicate connections between data points and structures. Data models showcase the types of data used and stored within the system, the relationships between these different data sources, and how data is grouped according to different attributes and characteristics. A slightly adjusted definition you may encounter in data science for data modeling is: building reliable models that transform the raw data into predictive, consistent, and actionable insights. The main intent is to clearly understand the critical business needs, available data sources and deadlines, and to provide a relevant data-driven and correctly formatted framework for meeting those needs.
A data pipeline is a set of data processing scripts that are linked, thus automating the flow of data through an organization where data is extracted, transformed, and loaded so it’s ready to be used.
Data Science (DS)
Data science is a multifaceted interdisciplinary field of study that uses various scientific methods, advanced analytics techniques, and predictive modeling algorithms to extract meaningful insights from data, to help answer strategic business or scientific questions in many spheres. It combines a wide range of technical and non-technical skills and usually requires solid domain knowledge in the particular industry where it is applied, to be able to correctly interpret the available data and the obtained results.
Data Scientists investigate, extract, and report meaningful insights in the organization’s data. They communicate these insights to non-technical stakeholders, and have a good understanding of machine learning workflows and how to tie them back to business applications. They work almost exclusively with coding tools, conduct analysis, and often work with big data tools.
A dataset is a collection of data of one or many types representing real-life or synthetically generated observations, and used for statistical analysis or data modeling. The data of a dataset can be gathered from many sources and is typically stored in some kind of data structure, most commonly a table, where the columns correspond to different variables, and the rows to different data entries.
A data structure is a way of organizing and storing data so that it can be accessed and worked with efficiently. Data structure defines the relationship between the data and the operations that can be performed on it. Common data structures encountered in data science are dataframes, lists, arrays, and more.
Data visualization is an interdisciplinary field that deals with condensing and representing information in visual format. Data can be visualized using a variety of charts such as maps, histograms, bar plots, and line charts, and these can be combined into infographics, dashboards, and more. Data visualization is often used to help the target audience better understand the underlying data and the obtained results.
A data warehouse is a central repository for storing structured, cleaned, and transformed data collected from multiple sources by means of the ETL (extraction, transformation, loading) process. Data professionals can easily access the necessary information from the data warehouse through business intelligence tools, SQL queries, etc., and use it for further analysis and modeling to answer business questions.
Data wrangling is also called data munging. Data wrangling tasks are concerned with data cleaning, restructuring, merging, aggregating, and transforming to an appropriate format for a specific purpose. All in all, it is a process of data preparation for further easy access and data analysis.
A decision tree is a supervised machine learning algorithm used mostly for classification, but also regression problems. Decision trees make a sequence of if-else questions about individual features, with the objective of inferring the class labels. A decision tree benefits from a possible graphical tree-like representation, the imitation of human decision-making ability, and intuitively understandable logic, but this type of model tends to overfit.
Deep Learning (DL)
Deep learning is a subset of machine learning algorithms based on multilayered artificial neural networks (ANN) that are largely inspired by the structure of the brain. ANN are very flexible and can learn from huge amounts of data, to deliver highly accurate outputs. They are often behind some data science and machine learning use cases such as image or sound recognition, language translation, and other advanced problems.
Dimensionality reduction is the process of reducing the number of features of the training set, leaving only the most relevant features that capture most of the variation, in order to enhance the model's performance. Dimensionality reduction is especially useful for large datasets with a lot of variables. It helps optimize storage space and computational time, and can also mitigate multicollinearity. A popular technique of dimensionality reduction is PCA (principal component analysis).
EDA is an acronym for exploratory data analysis and refers to the first phase of data analysis, focused on basic exploration of the available data; summarizing its main characteristics, and finding initial patterns and trends, issues to fix, and questions to further investigate. At this stage, a data analyst or data scientist draws a general understanding of the data as a basis for the subsequent, more detailed data analysis.
ELT (extract, load, transform) is a data pipeline system designed by data engineers, an alternative to the more popular ETL (extract, transform, load) approach. Before applying any transformation, the raw data is loaded to the data lake and then transformed in place. The advantages of ELT over ETL are that it requires less time, is suited to processing large datasets, and is more cost-effective.
ETL (extract, transform, load) is a data pipeline system designed by data engineers. The data is extracted from multiple sources, transformed from its raw form into a proper format to be consistent with the data from other sources, and loaded into the target data warehouse. From here, it can be used for further data analysis and modeling to solve various business problems.
Evaluation metrics are a collection of metrics used to estimate the performance of a statistical or machine learning model. Some examples of evaluation metrics are accuracy score, f-score, recall and RMSE.
False Negative (FN, Type II Error)
A false negative is an outcome when a classification model incorrectly predicts the negative class for a binary target variable. For example, if we’re predicting customer churn, a false negative is generating a “won’t churn” prediction whereas the actual label is “will churn”.
False Positive (FP, Type I Error)
A false positive is an outcome when a classification model incorrectly predicts the positive class for a binary target variable. For example, if we’re predicting customer churn, a false positive generates a “will churn” prediction whereas the actual label is “won’t churn”.
A feature is an independent variable used as an input in a machine learning model. For example, if we’re predicting the likelihood of diabetes using height, weight, and sugar intake—height, weight, and sugar intake are features.
Feature engineering is the process of using domain knowledge and subject matter expertise to transform raw features into features that better reflect the underlying problem, and are better suited for machine learning algorithms. It includes extracting new features from the available data, or manipulating the existing features. For example, if we’re trying to predict a health outcome such as the likelihood of having diabetes, calculating a BMI feature using height and weight features is feature engineering.
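The BMI example can be sketched as a small transformation on a record. The field names `height_cm`, `weight_kg`, and `bmi`, and the helper `add_bmi`, are assumptions made for illustration.

```python
def add_bmi(record):
    """Return a copy of the record with an engineered BMI feature added."""
    height_m = record["height_cm"] / 100
    bmi = record["weight_kg"] / height_m**2   # BMI = weight (kg) / height (m)^2
    return {**record, "bmi": bmi}


patient = {"height_cm": 200, "weight_kg": 80}   # illustrative raw features
print(add_bmi(patient))
# {'height_cm': 200, 'weight_kg': 80, 'bmi': 20.0}
```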
Feature selection is the process of selecting a subset of features from the dataset that are the most relevant for predicting the target variable. A smart feature selection process is especially important for large datasets since it reduces model complexity, overfitting, and computational time, and increases the model accuracy.
F-Score is an evaluation metric for estimating the model performance that combines both precision and recall. Usually, the F1-score is used, which is the harmonic mean of the precision and recall. The more generic case is Fβ where additional weight is applied to either precision or recall.
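The general Fβ formula is (1 + β²) * precision * recall / (β² * precision + recall), which reduces to the harmonic mean when β = 1. The sketch below implements it; the function name `f_beta` and the sample values are illustrative assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0   # avoid division by zero when both are zero
    b2 = beta**2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(f_beta(0.5, 0.5))                    # 0.5
print(round(f_beta(1.0, 0.5), 4))          # 0.6667 (F1)
print(round(f_beta(0.5, 1.0, beta=2), 4))  # 0.8333 (F2, recall weighted more)
```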
Gradient descent is an iterative optimization process used in machine learning to minimize the cost function by finding the optimal values to the parameters of the function.
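A minimal sketch of the idea: repeatedly step against the gradient until the parameter settles near the minimum. The cost function f(w) = (w - 3)², the learning rate, and the step count below are assumptions chosen to keep the example small.

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_steps=100):
    """Move a single parameter against its gradient for a fixed number of steps."""
    w = start
    for _ in range(n_steps):
        w -= learning_rate * gradient(w)
    return w


# Illustrative cost f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(round(w_min, 4))  # 3.0
```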
Hadoop is an open-source Java-based software framework that enables parallel processing and distributed storage of big data across clusters of many computers. Hadoop saves time and handles much larger amounts of data than would be possible using only one computer.
Hyperparameters are attributes pertaining to a machine learning model whose value is set manually before starting the training process. Unlike the other parameters, hyperparameters cannot be estimated or learned directly from the data. By tuning hyperparameters and estimating the resulting model performance, we can determine their optimal values for obtaining the most accurate model. Intuitively speaking, tuning a hyperparameter is akin to tuning a radio knob when trying to reach a perfect signal. An example of a hyperparameter is the number of trees in the random forest algorithm.
A hypothesis is an assumption about some problem or event that has to be tested and, depending on the outcome of the experiment, is either supported or rejected.
Imputation is the process of filling in missing values in a dataset. Imputation techniques can be either statistical (mean/mode imputation) or machine learning techniques (KNN imputation).
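Mean imputation, the simplest statistical technique mentioned above, can be sketched as follows. Missing values are represented by `None` here, and the function name `impute_mean` is an assumption for illustration.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```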
K-Means is a popular clustering algorithm that identifies K cluster centers (called centroids) with tentative coordinates in the data, iteratively assigns each observation to the nearest centroid based on its features, and recomputes the centroids until they converge. Data points are similar inside a cluster and different from the data points in the other clusters.
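The assign-then-update loop can be sketched on one-dimensional data. This is a simplified illustration, not a production implementation: the function name `k_means_1d`, the starting centroids, and the sample points are all assumptions, and real data usually has several features.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # illustrative data
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```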
K-Nearest Neighbors (KNN)
K-nearest neighbors is a supervised learning algorithm that classifies observations based on their similarity to their nearest neighbors. The most important parameters of KNN that can be tuned are the number of nearest neighbors and the distance metric (Minkowski, Euclidean, Manhattan, etc.).
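A small sketch of the idea using Euclidean distance and a majority vote among the k closest training points. The function name `knn_predict` and the toy data are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)   # Euclidean distance to each point
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Illustrative training data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5)))  # a
print(knn_predict(points, labels, (5.5, 5.5)))  # b
```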
Linear algebra is a branch of mathematics concerned with linear systems: lines, planes, vector spaces, matrices, and operations on them, such as addition or multiplication. Linear algebra is very useful in data science and machine learning, since datasets and many machine learning models can be represented in a matrix form.
Linear regression is a regression algorithm that deals with modeling a linear relationship between a continuous target variable and one or several continuous features. A typical example of data science using linear regression is price prediction based on various input attributes.
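For a single feature, the best-fit line can be found with ordinary least squares: the slope is the covariance of x and y divided by the variance of x. The sketch below assumes this one-feature case; the function name `fit_line` and the data are illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    spread_x = sum((a - mean_x) ** 2 for a in x)
    if spread_x == 0:
        raise ValueError("x values must not all be identical")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / spread_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3], [2, 4, 6])   # illustrative data, y = 2x
print(slope, intercept)  # 2.0 0.0
```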
Logistic regression is a regression algorithm that uses a logistic function on the input features to predict the class probability or directly the class label for the target variable. In the second case, the output represents a set of categories instead of continuous values, meaning that the logistic regression acts here as a classification technique. A typical data science use case for logistic regression is predicting the likelihood of customer churn.
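Both outputs, the class probability and the class label, can be sketched with the logistic function applied to a weighted sum of the features. The weights, bias, and 0.5 threshold below are assumptions for illustration; in practice the weights are learned from data.

```python
import math


def predict_proba(features, weights, bias):
    """Probability of the positive class via the logistic function."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into a class label (1 or 0)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


print(predict_proba([0.0], [1.0], 0.0))   # 0.5
print(predict_label([2.0], [1.0], 0.0))   # 1
print(predict_label([-2.0], [1.0], 0.0))  # 0
```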
Machine Learning (ML)
Machine learning is a branch of artificial intelligence (AI) that provides a set of algorithms designed to learn patterns and trends from historical data. The purpose of ML is to predict future outcomes, and generalize beyond the data points of the training set without being explicitly programmed. There are two main types of machine learning algorithms: supervised and unsupervised, each represented by numerous techniques applicable for different use cases.
The mean is the arithmetic average value for a set of numbers, i.e., the sum of all the values divided by the number of values. It is usually used together with other statistics to get an overall understanding of the entire dataset.
Mean Absolute Error (MAE)
The mean absolute error (MAE) is the arithmetic average of all absolute errors of the predicted values compared to the actual values.
Mean Squared Error (MSE)
The mean squared error (MSE) is the arithmetic average of the squares of all the errors of the predicted values compared to the actual values.
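Both error metrics can be sketched side by side, since they differ only in whether each error is taken as an absolute value or squared. The function names and sample values are illustrative assumptions.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [3.0, 5.0, 2.0]      # illustrative actual values
predicted = [2.0, 5.0, 4.0]   # errors: 1, 0, -2

print(mean_absolute_error(actual, predicted))           # 1.0
print(round(mean_squared_error(actual, predicted), 4))  # 1.6667
```

Because errors are squared, MSE penalizes the single large error (2) more heavily than MAE does.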
The median is the middle value in a set of numbers sorted in ascending or descending order. If there is an even number of values in the set, the median is the arithmetic average of the two middle values. The median is usually used together with other statistics to get an overall understanding of the entire dataset, and is especially useful for detecting possible outliers.
The mode is the most frequent value (or values) in a set of data.
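The mean, median, and mode can all be computed with Python's standard `statistics` module. The wrapper `summarize` and the sample values below are assumptions for illustration.

```python
import statistics


def summarize(values):
    """Return the mean, median, and mode(s) of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # list, since ties are possible
    }


print(summarize([2, 3, 3, 5, 7, 10]))
# {'mean': 5, 'median': 4.0, 'modes': [3]}
```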
Model tuning is the process of adjusting the hyperparameters in order to maximize the model's accuracy without overfitting it.
Multivariate modeling is the process of modeling the relationship between multiple variables (predictors) defined at the feature selection step, and the target variable.
Naive Bayes is a group of classification algorithms based on Bayes' theorem and an assumption of independence between features used in the classifier. Although in reality features are not always independent, the Naive Bayes algorithms can be successfully applied to various data science use cases, such as spam filtering or sentiment analysis.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is a branch of computer science that deals with making computer applications understand and analyze written or spoken human language. NLP techniques take input text data, usually unstructured, convert it to a structured form, look for linguistic and contextual patterns, categorize them, and extract valuable insights from this data. NLP also involves leveraging machine learning and deep learning to generate language, categorize it, and complete other cognitive tasks using language. Some examples of NLP applications are chatbots, speech-to-text converters, sentiment analysis, and automatic translation.
Normalization is the process of rescaling the data so that all the attributes have the same scale. Normalization is necessary for making a meaningful comparison between the attributes and also is required for some machine learning algorithms.
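One common way to put attributes on the same scale is min-max scaling, which maps values into the range 0 to 1. This is only one of several rescaling approaches; the function name `min_max_scale` is an assumption for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0 for _ in values]   # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```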
NoSQL stands for "not only SQL". It refers to database management systems used for the storage and retrieval of non-relational (i.e., non-tabular) data. Some examples of non-relational data models are graph, document, and key-value databases. NoSQL systems benefit from high flexibility and operational speed, and from the possibility of being scaled across many servers.
The null hypothesis is a type of hypothesis that states the opposite of the alternative hypothesis to be checked, i.e., that no statistically significant relationship exists between the two variables and that the observed results are due to chance. During a statistical test, a null hypothesis can be rejected or fail to be rejected; it is never proven true.
Open source refers to software and resources released under licenses that allow free use, modification, and sharing. Open-source tools facilitate collaboration among users and are often more stable, since contributors can add helpful new features or fix technical issues and bugs reported by the community.
An ordinal variable is a variable that can have one of a limited number of possible values with an intrinsic ordering involved. An example would be a survey response column where responses are ordered by intensity (e.g., “strongly disagree”, “disagree”, “neutral”, “agree”, or “strongly agree”).
An outlier is an abnormal value in a dataset that deviates considerably from the rest of the observations. Outliers can be evidence of a measurement error or extraordinary event.
Overfitting occurs when a model learns too much from the training set, including potential noise and outliers. As a result, it becomes too complex, too conditioned on the particular training set, and fails to perform adequately on unseen data. Overfitted models have high variance in the bias-variance tradeoff.
A parameter is a named variable passed into a function in programming and data science. In machine learning, parameters are an internal component of an algorithm to be learned from the data. Some machine learning algorithms are parametric with a fixed set of parameters (e.g., linear and logistic regressions), whilst others are non-parametric (e.g., k-nearest neighbors).
Precision is an evaluation metric used to estimate a machine learning model’s performance, showing the ratio of the number of correctly predicted positive cases to the total number of predicted positive cases.
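The ratio described above can be sketched in a few lines of plain Python. The `precision` helper and the label lists are hypothetical; labels are encoded as 1 (positive) and 0 (negative).

```python
def precision(actual, predicted):
    """Correctly predicted positives divided by all predicted positives."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if p == 1 and a == 1)
    false_pos = sum(1 for a, p in pairs if p == 1 and a == 0)
    predicted_pos = true_pos + false_pos
    # Convention assumed here: return 0.0 when nothing was predicted positive.
    return true_pos / predicted_pos if predicted_pos else 0.0

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 1, 1, 0]

precision_score = precision(actual, predicted)
print(precision_score)  # 3 of the 5 predicted positives are correct
```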
Predictive analytics is the process of analyzing historical data using various tools of statistical analysis, data mining, data visualization, and machine learning, to make predictions about future events in a particular business.
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a statistical technique of factor analysis and dimensionality reduction that transforms a set of possibly correlated initial features into a smaller set of linearly uncorrelated features called principal components. In this way, PCA preserves as much variance in the dataset as possible, while minimizing the number of features.
Python is an open-source, object-oriented, high-level programming language. Python is very popular in the field of data science, but also widely used for general-purpose programming in computer science. It is intuitively understandable, and easy to learn and use, whilst remaining a very powerful tool for solving complex problems. Python provides an extensive standard library and many additional useful modules, and is constantly developed, improved, and expanded.
R is a popular programming language and free software widely used for solving data science and machine learning problems, especially famous for its statistical computing power and awesome data visualization solutions. It includes numerous data science tools and packages, can be used in many operating systems, and has a strong online community.
Random forest is a supervised learning algorithm used for regression or classification problems. It combines the outputs of many decision trees in a single model. The prediction of a random forest is essentially the average of the individual trees' predictions (for regression) or their majority vote (for classification), which is why this algorithm usually provides more accurate results than a single decision tree.
Recall is an evaluation metric used to estimate a machine learning model’s performance, showing the ratio of the number of correctly predicted positive cases to the total number of actual positive cases.
Regression is a supervised learning problem where it is necessary to predict continuous outcomes based on input features. A regression model learns the relationship between one or several independent features and the target variable, and then uses the established function to predict unseen data. Examples of regression algorithms are linear regression and ridge regression. A typical regression problem is price prediction.
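A plain-Python sketch of the ratio above. The `recall` helper and the label lists are hypothetical; labels are encoded as 1 (positive) and 0 (negative).

```python
def recall(actual, predicted):
    """Correctly predicted positives divided by all actual positives."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    actual_pos = true_pos + false_neg
    # Convention assumed here: return 0.0 when there are no actual positives.
    return true_pos / actual_pos if actual_pos else 0.0

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 1, 1, 0]

recall_score = recall(actual, predicted)
print(recall_score)  # 3 of the 4 actual positives are found
```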
Reinforcement Learning (RL)
Reinforcement learning (RL) is a stand-alone branch of machine learning (neither supervised nor unsupervised) where an algorithm gradually learns by interacting with an environment. RL makes decisions based on its past experience about which actions can bring it closer to a stated goal. By receiving rewards for correct actions and penalties for wrong ones, the algorithm finds out the optimal strategy to maximize its performance. Examples of RL algorithms include game-playing machine learning systems such as chess engines and video game agents.
A relational database is a type of database that stores data in several tables related to one another by means of unique IDs (keys) from which the data can be accessed, extracted, summarized, or reassembled in different ways.
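To illustrate tables related by keys, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database, nothing written to disk
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- key linking the tables
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# Reassemble and summarize the data by joining the tables on the shared key.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()

print(totals)
```

The `customer_id` column plays the role of the unique ID that relates one table to the other.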
Root Mean Squared Error (RMSE)
The root mean squared error (RMSE) is the square root of the mean squared error (MSE). This evaluation metric is more intuitive than MSE because the result is easier to interpret: it is expressed in the same units of measurement as the original data.
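A minimal sketch of the calculation, assuming two equal-length lists of actual and predicted values; the `rmse` helper and the price figures are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mse = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mse)

actual_prices = [100, 200, 300, 400]
predicted_prices = [110, 190, 320, 380]

# Squared errors are 100, 100, 400, 400, so the MSE is 250.
error = rmse(actual_prices, predicted_prices)
print(round(error, 2))
```

The MSE here is 250 in squared price units, while the RMSE of roughly 15.8 is in the same units as the prices themselves.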
Sampling error is the statistical difference between the entire population of the data and its subset (a sample), due to the fact that the sample does not include all elements of the entire population.
SQL (structured query language) is a programming language designed to interact with relational database management systems (RDBMS). SQL has several flavors, including SQLite, PostgreSQL and MySQL. Some of these are free and open source. All the flavors have a rather similar syntax, with minor variations in additional functionality.
Standard deviation is the square root of the variance of a population. The standard deviation shows the amount of dispersion of the values and is more intuitive than the variance because it is in the same units of measurement as the data.
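The relationship to the variance can be checked with Python's standard `statistics` module, which provides population versions of both quantities; the sample values are illustrative.

```python
import math
from statistics import pstdev, pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

population_variance = pvariance(values)  # in squared units of the data
population_std = pstdev(values)          # in the same units as the data

print(population_variance, population_std)
print(math.isclose(population_std, math.sqrt(population_variance)))
```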
Supervised learning is a branch of machine learning concerned with teaching a model on a labeled training set of historical data. Supervised learning learns the relationship between inputs and outputs and then measures how accurately it predicts the outputs for a test set with the known actual outputs. In this way it can be used later to make predictions on completely new data. Supervised learning algorithms include linear and logistic regressions, decision trees, and SVM. Examples of common tasks include predicting house prices and classifying messages as spam or ham.
SVM (Support Vector Machine) is a supervised learning algorithm used mostly for classification, but also regression problems. In a classification problem, SVM provides an optimal hyperplane that separates the observations of both classes (in the case of a multiclass classification, the algorithm breaks down the problem into a set of binary problems). In a regression problem, SVM provides the best fit hyperplane within a defined threshold.
Synthetic data is artificially created data. Synthetic data usually reflects the statistical properties of the initial dataset, so it can be used in high-privacy spheres such as banking and healthcare, or for augmenting an existing dataset with additional statistically representative data observations.
A target variable (also called dependent variable) is the variable to be predicted in a machine learning algorithm by using features. For example, if we’re predicting the likelihood of diabetes using height, weight, and sugar intake, diabetes status is the target variable we want to predict.
A test set is a subset of the available data isolated before building a model, usually 20 to 30% of the entire dataset. Test sets are used for evaluating the accuracy of models fitted on a training set.
A time series is a sequence of observations of a variable taken at different times and sorted in time order. Usually, time series measurements are taken at successive, equally spaced points in time. Some examples of time series are stock market prices or the temperature over a certain period of time.
A training set is a subset of the available data isolated before building a model, usually from 70 to 80% of the entire dataset. A training set is used for fitting the model that will later be tested on the test set.
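A simple way to carve a dataset into training and test sets is to shuffle the rows and slice off a fraction. The sketch below is one possible approach using only the standard library; `split_train_test`, its parameters, and the toy data are hypothetical.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off a test set of the given size."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))
train_set, test_set = split_train_test(data, test_fraction=0.2)

print(len(train_set), len(test_set))
```

Every row ends up in exactly one of the two sets, so the model is never evaluated on data it was fitted on.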
True Negative (TN)
True negative (TN) is an outcome where the model correctly predicts the negative class for a binary target variable (i.e., it predicts “False” for an actual label of False).
True Positive (TP)
True positive (TP) is an outcome where the model correctly predicts the positive class for a binary target variable (i.e., it predicts “True” for an actual label of True).
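The two outcomes above can be counted directly from actual and predicted labels. This sketch also tallies the two error outcomes, false positives (FP) and false negatives (FN), for completeness; `confusion_counts` and the label lists are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count the four possible outcomes for a binary target variable."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1   # predicted True, actually True
        elif not a and not p:
            counts["TN"] += 1   # predicted False, actually False
        elif p:
            counts["FP"] += 1   # predicted True, actually False
        else:
            counts["FN"] += 1   # predicted False, actually True
    return counts

actual    = [True, False, True, True, False, False]
predicted = [True, False, False, True, True, False]

counts = confusion_counts(actual, predicted)
print(counts)
```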
Underfitting occurs when a model is unable to detect the patterns in the training set because it was built on insufficient information. As a result, the model is too simple and performs poorly on both unseen data and the training set itself. Underfitted models have high bias.
Univariate modeling is the process of modeling the relationship between a single variable (a predictor) and the target variable. Univariate modeling is usually used with time series.
Unstructured data is any data that does not fit a predefined data structure such as the typical row-column structure of a database. Examples of such data are images, emails, text documents, videos and audio.
Unsupervised learning is a class of machine learning algorithms that learn the underlying structure of a dataset without being provided a target variable. Unsupervised learning is used to discover common patterns in data, group the values based on their attributes, and then later make predictions on unseen data. One of the most common unsupervised learning algorithms is k-means. Examples of common tasks are anomaly detection and customer segmentation based on common characteristics.
In mathematics and statistics, the variance is the average squared difference between individual values and the mean of the entire set of values. In other words, variance shows how spread out the values are. In machine learning, the variance is an error caused by the sensitivity of a model to small variations in the training set. High variance reflects a tendency of the model to learn random noise from the input features, leading to model overfitting.
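The statistical definition can be written out step by step. The sketch below computes the population variance by hand and compares it with `statistics.pvariance` from the standard library; the helper name and sample values are illustrative.

```python
from statistics import pvariance

def population_variance(values):
    """Average squared difference between each value and the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

# Squared differences are 9, 1, 1, 1, 0, 0, 4, 16, which sum to 32.
manual = population_variance(values)
library = pvariance(values)

print(manual, library)
```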
Web scraping is the process of extracting specific data from websites for further usage. Web scraping can be done automatically by writing a program to capture the necessary information from a website.
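As a rough sketch of the extraction step, the snippet below pulls link targets out of an HTML string with Python's built-in `html.parser`. The page content is hard-coded and invented for the example; a real scraper would first download the page and should respect the site's terms of use.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content standing in for a downloaded website.
page = '<html><body><a href="/a">A</a><p>text</p><a href="/b">B</a></body></html>'

collector = LinkCollector()
collector.feed(page)
links = collector.links

print(links)
```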
The z-score (also called the standardized score, standard score, or normal score) is the number of standard deviations by which the value of a data observation is above or below the mean value of the entire set of values. A z-score equal to 0 means that the data observation is equal to the mean.
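A brief sketch of the calculation, assuming the population standard deviation is the one intended; it uses the standard `statistics` module and made-up sample values.

```python
from statistics import mean, pstdev

values = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(values)       # 5
sigma = pstdev(values)  # population standard deviation, 2.0 for this sample

# Each z-score is the distance from the mean in units of standard deviation.
z_scores = [(v - mu) / sigma for v in values]

print(z_scores)
```

Observations below the mean get negative z-scores, and observations above it get positive ones.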