Statistical natural language processing
Additive smoothing
In statistics, additive smoothing, also called Laplace smoothing (not to be confused with Laplacian smoothing), or Lidstone smoothing, is a technique used to smooth categorical data.
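As a rough illustration of the idea, the sketch below adds a pseudocount `alpha` to every category count before normalizing, so that unseen categories still receive nonzero probability. The function name, parameter names, and toy data are assumptions made for this example; `alpha = 1` corresponds to the add-one (Laplace) case, and other positive values to the more general Lidstone case.

```python
from collections import Counter


def additive_smoothing(counts, vocabulary, alpha=1.0):
    """Return smoothed probabilities for every item in `vocabulary`.

    Each estimate is (count + alpha) / (total + alpha * d),
    where d is the number of categories in the vocabulary.
    """
    total = sum(counts.get(item, 0) for item in vocabulary)
    d = len(vocabulary)
    return {
        item: (counts.get(item, 0) + alpha) / (total + alpha * d)
        for item in vocabulary
    }


# Toy data (hypothetical): "c" is never observed but still gets probability mass.
counts = Counter(["a", "a", "b"])
probs = additive_smoothing(counts, ["a", "b", "c"], alpha=1.0)
```

With these toy counts the unseen category `c` receives probability 1/6 rather than 0, and the three estimates still sum to 1.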
Dissociated press
Dissociated press is an algorithm for generating text based on another text.
Dynamic topic model
Dynamic topic models are generative models that can be used to analyze the evolution of (unobserved) topics of a collection of documents over time.
F1 score
In statistics, the F1 score is a measure of a test's accuracy, defined as the harmonic mean of precision and recall.
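A minimal sketch of the computation from raw counts is shown below. The function name and the choice to return 0.0 when there are no true positives are assumptions made for this example, not a fixed convention.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_positives == 0:
        # Assumption for this sketch: report 0.0 instead of dividing by zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
score = f1_score(8, 2, 2)
```

Here precision and recall are both 0.8, so the F1 score is also 0.8.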
Factored language model
The factored language model (FLM) is an extension of a conventional language model.
Frederick Jelinek
Frederick Jelinek (18 November 1932 – 14 September 2010) was a Czech-American researcher in information theory, automatic speech recognition, and natural language processing.
Glottochronology
Glottochronology (from Attic Greek γλῶττα “tongue, language” and χρόνος “time”) is the part of lexicostatistics that deals with the chronological relationship between languages.
Herdan's law
Herdan's law, also known as Heaps' law, is an empirical law stating that the number of distinct words in a document grows as a power function of the document's length.
Interactive machine translation
Interactive machine translation (IMT) is a specific sub-field of computer-aided translation.
Katz's back-off model
Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram.
Latent Dirichlet allocation
In statistics, latent Dirichlet allocation is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
Markov information source
In mathematics, a Markov information source, or simply, a Markov source, is an information source whose underlying dynamics are given by a stationary finite Markov chain.
Markovian discrimination
Markovian discrimination in spam filtering is a method used in CRM114 and other spam filters to model the statistical behaviors of spam and nonspam more accurately than in simple Bayesian methods.
Maximum entropy Markov model
In machine learning, a maximum entropy Markov model (MEMM), or conditional Markov model (CMM), is a graphical model for sequence labeling that combines features of hidden Marko...
Maximum-entropy Markov model
In machine learning, a maximum-entropy Markov model, or conditional Markov model, is a graphical model for sequence labeling that combines features of hidden Markov models and maximum entr...
Natural Language Toolkit
The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for the Python programming language.
Noisy channel model
The noisy channel model is a framework used, for instance, in spell checkers, question answering, speech recognition, and machine translation.
Noisy text analytics
Noisy text analytics is a process of information extraction whose goal is to automatically extract structured or semistructured information from noisy unstructured text data.
OpenNLP
Apache OpenNLP is a machine-learning-based toolkit for the processing of natural language text.
Pachinko allocation
In machine learning and natural language processing, the pachinko allocation model (PAM) is a topic model, i.e. a generative statistical model for discovering the abstract "topics" that occur in...
Probabilistic latent semantic analysis
Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles), is a statistical technique for...
Sinkov statistic
Sinkov statistics, also known as log-weight statistics, is a specialized field of statistics that was developed by Abraham Sinkov, while working for the small Signal Intelligence Service o...
Statistical machine translation
Statistical machine translation is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual ...
Statistical parsing
Statistical parsing is a group of parsing methods within natural language processing.
Statistical semantics
Statistical semantics is the study of "how the statistical patterns of human word usage can be used to figure out what people mean, at least to a level sufficient for information access" (Furnas...
Stochastic context-free grammar
A stochastic context-free grammar (SCFG; also probabilistic context-free grammar, PCFG) is a context-free grammar in which each production is augmented with a probability.
Synchronous context-free grammar
Synchronous context-free grammars (SynCFG or SCFG; not to be confused with stochastic CFGs) constitute a formal model of natural language syntax, developed in the area of statistical ...
Text analytics
The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence,...
Text mining
Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text.
tf*idf
The tf*idf weight is a numerical statistic which reflects how important a word is to a document in a collection or corpus.
tf-idf
The tf–idf weight is a weight often used in information retrieval and text mining.
tf–idf
The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining.
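The sketch below shows one common way to compute such a weight, assuming the basic variant in which term frequency is the relative frequency of a term within a document and inverse document frequency is the natural log of (number of documents / number of documents containing the term). Many other weighting variants exist; the function name and toy corpus are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf  = occurrences of the term in the document / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: count each term at most once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


# Toy corpus (hypothetical): "the" occurs in every document, so its weight is 0.
docs = [["the", "cat"], ["the", "dog"]]
weights = tf_idf(docs)
```

A term that appears in every document gets an idf of ln(1) = 0 and therefore a weight of 0, while terms concentrated in few documents are weighted more heavily.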
Topic model
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.
Trigram tagger
A trigram tagger is a statistical part-of-speech tagger based on second-order Markov models.
Variable rules analysis
In linguistics, variable rules analysis is a set of statistical analysis methods commonly used in sociolinguistics and historical linguistics to describe patterns of variation between alternativ...
Writer invariant
Writer invariant, also called authorial invariant or author's invariant, is a property of a text which is invariant of its author, that is, it will be similar in all texts of a given...