Data analysis
1.96
1.96 is the approximate value of the 97.5 percentile point of the normal distribution used in probability and statistics.
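As a quick check of this value, the following minimal sketch uses Python's standard-library `statistics.NormalDist`, assuming the standard normal distribution (mean 0, standard deviation 1) is meant:

```python
from statistics import NormalDist

# 97.5th percentile of the standard normal distribution
# (assumption: mean 0 and standard deviation 1).
z = NormalDist().inv_cdf(0.975)
print(round(z, 4))
```

Rounded to two decimal places, the result is the familiar 1.96.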
68-95-99.7 rule
In statistics, the 68-95-99.7 rule — or three-sigma rule, or empirical rule — states that for a normal distribution, nearly all values lie within 3 standard deviations of the mean.
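The three percentages in the rule's name can be reproduced from the normal cumulative distribution function. This is an illustrative sketch; the helper name `coverage` is hypothetical:

```python
from statistics import NormalDist

def coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    std = NormalDist()  # standard normal: the result is the same for any mean and sd
    return std.cdf(k) - std.cdf(-k)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
```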
Algorithmic inference
Algorithmic inference gathers new developments in the statistical inference methods made feasible by the powerful computing devices widely available to any data analyst.
ANOVA-simultaneous component analysis
ASCA, ANOVA-SCA, or analysis of variance – simultaneous component analysis is a method that partitions variation and enables interpretation of these partitions by SCA, a method that ...
Anscombe transform
In statistics, the Anscombe transform, named after Francis Anscombe, is a variance-stabilizing transformation that transforms a random variable with a Poisson distribution into one with an appro...
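A minimal sketch of the transform, assuming the commonly cited form 2·sqrt(x + 3/8); the function name `anscombe` is chosen here for illustration only:

```python
import math

def anscombe(x: float) -> float:
    """Anscombe transform of a count x (assumed form: 2 * sqrt(x + 3/8))."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)

# Transform a few example counts.
print([round(anscombe(k), 3) for k in (0, 1, 4, 10)])
```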
Barnard's test
In statistics, Barnard's test is an exact test of the null hypothesis of independence of rows and columns in a contingency table.
Barnardisation
Barnardisation is a method of disclosure control for tables of counts that involves randomly adding or subtracting 1 from some cells in the table.
Boolean analysis
Boolean analysis was introduced by Flament (1976).
Bootstrapping (statistics)
In statistics, bootstrapping is a method for assigning measures of accuracy to sample estimates (Efron and Tibshirani 1993).
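One common measure of accuracy obtained this way is the standard error of a statistic. The sketch below resamples the data with replacement and reports the spread of the recomputed statistic; the function name, the number of resamples and the example data are illustrative assumptions, not taken from the cited reference:

```python
import random
import statistics

def bootstrap_se(sample, stat=statistics.mean, n_resamples=2000, seed=0):
    """Bootstrap estimate of the standard error of `stat` on `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    # Recompute the statistic on resamples drawn with replacement.
    estimates = [stat(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # made-up example values
se = bootstrap_se(data)
print(se)
```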
Cluster analysis
Cluster analysis or clustering is the task of assigning a set of objects into groups so that the objects in the same cluster are more similar to each other than to those in other clusters.
Cluster-weighted modeling
In statistics, cluster-weighted modeling (CWM) is an algorithm-based approach to non-linear prediction of outputs (dependent variables) from inputs (independent variables) based on density estim...
Clustering high-dimensional data
Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions.
Collocation (remote sensing)
Collocation is a procedure used in remote sensing to match measurements from two or more different instruments.
Combinatorial data analysis
Combinatorial data analysis (CDA) is the study of data sets where the arrangement of objects is important.
Common-method variance
In science, common-method variance is a specific type of variance that may cause errors in analyzing statistical data.
Contingency table
In statistics, a contingency table is a type of table in a matrix format that displays the frequency distribution of the variables.
Correlation clustering
Correlation clustering operates in a scenario where the relationship between the objects is known instead of the actual representation of the objects.
Correspondence analysis
Correspondence analysis (CA) is a multivariate statistical technique developed by Jean-Paul Benzécri.
Counternull
In statistics, and especially in the statistical analysis of psychological data, the counternull is a statistic used to aid the understanding and presentation of research results.
Covariance matrix
In probability theory and statistics, a covariance matrix is a matrix whose element in the i, j position is the covariance between the i-th and j-th element...
Cross tabulation
Cross tabulation is the process of creating a contingency table from the multivariate frequency distribution of statistical variables.
Cumulative frequency analysis
Cumulative frequency analysis is the application of estimation theory to exceedance probability.
Data analysis
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making.
Data classification (business intelligence)
In business intelligence, data classification has close ties to data clustering, but where data clustering is descriptive, data classification is predictive.
Data Discovery and Query Builder
Data Discovery and Query Builder (DDQB) is a data abstraction technology, developed by IBM, that allows users to retrieve information from a data warehouse, in terms of the user's specific area ...
Data fusion
Data fusion is the process of integrating multiple data and knowledge representing the same real-world object into a consistent, accurate, and useful representation.
Data mining
Data mining, a relatively young and interdisciplinary field of computer science, is the process that results in the discovery of new patterns in large data sets.
Data reduction
Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.
Data transformation (statistics)
In statistics, data transformation refers to the application of a deterministic mathematical function to each point in a data set — that is, each data point z_i is replaced with the...
Data visualization
Data visualization is the study of the visual representation of data, meaning "information which has been abstracted in some schematic form, including attributes or variables for the units of in...
Double mass analysis
Double mass analysis is a commonly used data analysis approach for investigating the behaviour of records made of hydrological or meteorological data at a number of locations.
Dynamic mode decomposition
Dynamic mode decomposition is a mathematical method for extracting the relevant modes from experimental data, without any recourse to the governing equations.
Empirical distribution function
In statistics, the empirical distribution function, or empirical cdf, is the cumulative distribution function associated with the empirical measure of the sample.
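Concretely, the empirical cdf at x is the fraction of sample values less than or equal to x. A small sketch, with the helper name `ecdf` chosen for illustration:

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical cdf F of `sample`, where F(x) = (# values <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F

F = ecdf([3, 1, 2, 2])
print(F(0), F(1), F(2), F(3))
```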
Evolutionary data mining
Evolutionary data mining, or genetic data mining, is an umbrella term for any data mining using evolutionary algorithms.
Experimental uncertainty analysis
The purpose of this introductory article is to discuss the experimental uncertainty analysis of a derived quantity, based on the uncertainties in the experimentally measured quantities t...
Explained variation
In statistics, explained variation or explained randomness measures the proportion to which a mathematical model accounts for the variation (= apparent randomness) of a given data set.
Exploratory data analysis
Exploratory data analysis (EDA) is an approach to analysing data for the purpose of formulating hypotheses worth testing, complementing the tools of conventional statistics for testing hypotheses.
Exponential smoothing
Exponential smoothing is a technique that can be applied to time series data, either to produce smoothed data for presentation, or to make forecasts.
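The simplest variant can be sketched as follows, assuming the single-parameter recurrence s_t = alpha * x_t + (1 - alpha) * s_(t-1) with the first observation as the starting value; the function name and the initialisation are illustrative choices:

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing of `series` with smoothing factor 0 < alpha <= 1."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # assumption: initialise with the first observation
    for x in series[1:]:
        # Blend the new observation with the previous smoothed value.
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print(exponential_smoothing([10, 20, 30], 0.5))
```

Under this scheme the last smoothed value can serve as a one-step-ahead forecast.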
Forecasting
Forecasting is the process of making statements about events whose actual outcomes have not yet been observed.
Functional data analysis
Functional data analysis is a branch of statistics that analyzes data providing information about curves, surfaces or anything else varying over a continuum.
Geometric data analysis
Geometric data analysis can refer to geometric aspects of image analysis, pattern analysis and shape analysis or the approach of multivariate statistics that treats arbitrary data sets as clou...
German tank problem
In the statistical theory of estimation, the problem of estimating the maximum of a discrete uniform distribution from sampling without replacement, is known - in the English-speaking world - as the '...
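A sketch of one well-known point estimate for this problem, assuming k distinct serial numbers drawn without replacement from 1..N and using the estimator m + m/k - 1, where m is the largest observed serial; the function name and sample values are illustrative:

```python
def estimate_population_max(serials):
    """Estimate N from distinct serial numbers sampled without replacement from 1..N."""
    k = len(serials)   # sample size
    m = max(serials)   # largest serial number observed
    return m + m / k - 1

print(estimate_population_max([19, 40, 42, 60]))
```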
Grand mean
The grand mean is the mean of the means of several subsamples.
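A short sketch of this definition; `grand_mean` is a hypothetical helper name and the subsamples are made up:

```python
from statistics import mean

def grand_mean(subsamples):
    """Mean of the subsample means (each subsample counts equally)."""
    return mean(mean(s) for s in subsamples)

groups = [[1, 2, 3], [10, 20]]
print(grand_mean(groups))
```

When the subsamples differ in size, as here, this value generally differs from the mean of all observations pooled together.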
Grouped data
Grouped data is a statistical term used in data analysis.
Health care analytics
Health care analytics is a rapidly evolving field of health care business solutions that makes extensive use of data, statistical and qualitative analysis, explanatory and predictive modeling.
Hit selection
The process of selecting hits is called hit selection.
Imputation (statistics)
In statistics, imputation is the substitution of some value for a missing data point or a missing component of a data point.
Independent component analysis
Independent component analysis is a computational method for separating a multivariate signal into additive subcomponents supposing the mutual statistical independence of the non-Gaussian source...
Index of dispersion
In probability theory and statistics, the index of dispersion, dispersion index, coefficient of dispersion, or variance-to-mean ratio (VMR), like the coefficient of variation, ...
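As the name variance-to-mean ratio suggests, the quantity can be sketched as the variance divided by the mean. The example below uses the population variance, which is an assumption made for illustration:

```python
from statistics import mean, pvariance

def index_of_dispersion(values):
    """Variance-to-mean ratio of `values` (population variance assumed)."""
    return pvariance(values) / mean(values)

print(index_of_dispersion([2, 4, 4, 4, 5, 5, 7, 9]))
```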
Inverse Mills ratio
In statistics, the inverse Mills ratio, named after John P. Mills, is the ratio of the probability density function to the cumulative distribution function of a distribution.
Inverse-variance weighting
In statistics, inverse-variance weighting is a method of aggregating two or more random variables to minimize the variance of the weighted average.
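A minimal sketch, assuming independent estimates of the same quantity with known variances; each estimate is weighted by the reciprocal of its variance, and the function name is illustrative:

```python
def inverse_variance_weighted(estimates, variances):
    """Combine independent estimates; returns (combined estimate, its variance)."""
    weights = [1.0 / v for v in variances]  # more precise estimates get more weight
    total = sum(weights)
    combined = sum(w * x for w, x in zip(weights, estimates)) / total
    return combined, 1.0 / total

value, variance = inverse_variance_weighted([10.0, 20.0], [1.0, 4.0])
print(value, variance)
```

The combined variance is never larger than the smallest input variance, which is the point of the weighting.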
Item tree analysis
Item tree analysis is a data analytical method which allows constructing a hierarchical structure on the items of a questionnaire or test from observed response patterns.
k-medoids
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoidshift algorithm.
Klecka's tau
Klecka's tau is a statistic which is used to test whether a given classification analysis improves one's classification to groups over a random allocation to the various groups under consideration.
Limited dependent variable
A limited dependent variable is a variable whose range of possible values is "restricted in some important way."
Lincoln index
The Lincoln index is a statistical measure used in several fields to estimate the number of cases that have not yet been observed, based on two independent sets of observed cases.
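A sketch of the usual calculation, assuming the estimate of the total is n1 * n2 / m, where n1 and n2 are the sizes of the two observed sets and m is the number of cases found in both; the function name and numbers are illustrative:

```python
def lincoln_index(n1, n2, overlap):
    """Estimate the total number of cases from two independent sets of observations."""
    if overlap <= 0:
        raise ValueError("the two sets must share at least one case")
    return n1 * n2 / overlap

total = lincoln_index(20, 15, 5)
observed = 20 + 15 - 5          # distinct cases seen in either set
print(total, total - observed)  # estimated total and estimated unobserved cases
```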
LISREL
LISREL, an acronym for linear structural relations, is a statistical software package used in structural equation modeling.
Local convex hull
Local convex hull (LoCoH) is a method for estimating the size of the home range of an animal or a group of animals (e.g.
Missing data
In statistics, missing data, or missing values, occur when no data value is stored for the variable in the current observation.
Multiple correspondence analysis
In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set.
Multiscale geometric analysis
Multiscale geometric analysis or geometric multiscale analysis is an emerging area of high-dimensional signal processing and data analysis.
Multitrait-multimethod matrix
The multitrait-multimethod matrix is an approach to examining construct validity developed by Campbell and Fiske.
Natural Language Toolkit
Natural Language Toolkit or, more commonly, NLTK is a suite of libraries and programs for symbolic and statistical natural language processing for the Python programming language.
Neighbourhood components analysis
Neighbourhood components analysis is a supervised learning method for clustering multivariate data into distinct classes according to a given distance metric over the data.
Normal score
The term normal score is used with two different meanings in statistics.
Outlier
In statistics, an outlier is an observation that is numerically distant from the rest of the data.
Overdispersion
In statistics, overdispersion is the presence of greater variability (statistical dispersion) in a data set than would be expected based on a given simple statistical model.
Oversampling and undersampling in data analysis
Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e.
Photoanalysis
Photoanalysis (or photo analysis) refers to the study of pictures to compile various types of data, for example, to measure the size distribution of virtually anything that can be captured by photo.
Political forecasting
Political forecasting aims at predicting the outcome of elections.
Post-hoc analysis
In the design and analysis of experiments, post-hoc analysis (from Latin post hoc, "after this") consists of looking at the data—after the experiment has concluded—for patterns that we...
Power transform
In statistics, the power transform is from a family of functions that are applied to create a rank-preserving transformation of data using power functions.
Principal component analysis
Principal component analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly...
Principal geodesic analysis
In geometric data analysis and statistical shape analysis, principal geodesic analysis is a generalization of principal component analysis to a non-Euclidean, non-linear setting of manifolds sui...
Probit
In probability theory and statistics, the probit function is the inverse cumulative distribution function, or quantile function associated with the standard normal distribution.
Proxy (statistics)
In statistics, a proxy variable is something that is probably not in itself of any great interest, but from which a variable of interest can be obtained.
Quantile normalization
In statistics, quantile normalization is a technique for making two distributions identical in statistical properties.
Reification (statistics)
In statistics, reification is the use of an idealized model of a statistical process.
Report mining
Report mining is the extraction of data from human-readable computer reports.
Segmented regression
Segmented regression is a method in regression analysis in which the independent variable is partitioned into intervals and a separate line segment is fit to each interval.
Self-modeling mixture analysis
Self-modeling mixture analysis is a class of data analysis techniques, also termed blind signal separation or blind source separation, which are used to separate pure data compon...
Shape of the distribution
In statistics, the concept of the shape of the distribution refers to the shape of a probability distribution and it most often arises in questions of finding an appropriate distribution to use ...
Silhouette (clustering)
Silhouette refers to a method of interpretation and validation of clusters of data.
Smoothing
In statistics and image processing, to smooth a data set is to create an approximating function that attempts to capture important patterns in the data, while leaving out noise or other fine-sca...
SSMD
The SSMD is short for "Strictly Standardized Mean Difference", a measure of statistical effect size.
Standard deviation
Standard deviation is a widely used measure of variability or diversity used in statistics and probability theory.
Standard score
In statistics, a standard score indicates by how many standard deviations an observation or datum is above or below the mean.
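In code, this amounts to subtracting the mean and dividing by the standard deviation. The sketch below treats the data as a whole population (so it uses `pstdev`), which is an assumption made for illustration:

```python
from statistics import mean, pstdev

def standard_scores(values):
    """Standard score (z-score) of each value relative to the data's own mean and sd."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [(x - mu) / sigma for x in values]

scores = standard_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(scores)
```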
Standardized rate
Standardized rates are a statistical measure of any rates in a population.
Stationary subspace analysis
Stationary Subspace Analysis (SSA) is a blind source separation algorithm which factorizes a multivariate time series into stationary and non-stationary components.
Statistical assumption
Statistical assumptions are general assumptions about statistical populations.
Strictly standardized mean difference
In statistics, the strictly standardized mean difference (SSMD) is a measure of effect size.
Structured data analysis (statistics)
Structured data analysis is the statistical data analysis of structured data.
Subgroup analysis
Subgroup analysis, in the context of design and analysis of experiments, refers to looking for patterns in a subset of the subjects.
Taylor's law
Taylor's law is an empirical law in ecology that relates the between-sample variance in density to the overall mean density of a sample of organisms in a study area.
Test set
A test set is a set of data used in various areas of information science to assess the strength and utility of a predictive relationship.
Text analytics
The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence,...
Text mining
Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text.
TinkerPlots
TinkerPlots is exploratory data analysis software designed for use by students in grades 4-8.
Topological data analysis
Topological data analysis is a new area of study aimed at having applications in areas such as data mining and computer vision.
Training set
A training set is a set of data used in various areas of information science to discover potentially predictive relationships.
Variance
In probability theory and statistics, the variance is a measure of how far a set of numbers is spread out.
Variance-stabilizing transformation
In applied statistics, a variance-stabilizing transformation is a data transformation that is specifically chosen either to simplify considerations in graphical exploratory data analysis or to a...
Visual comparison
A visual comparison is the comparison of two or more things by eye.
Visual inspection
Visual inspection is a common method of quality control, data acquisition, and data analysis.
Wide and narrow data
Wide and narrow (sometimes un-stacked and stacked) are terms used to describe two different presentations for tabular data.
Window function
In signal processing, a window function is a mathematical function that is zero-valued outside of some chosen interval.