|Title||Predicting the Out-of-Vocabulary Rate and the Required Vocabulary Size for Speech Processing Applications|
|Authors||Johannes Müller, Holger Stahl, Manfred Lang|
|Type||Scientific Conference Paper|
|Abstract||This paper describes an approach for predicting both the vocabulary size and the resulting out-of-vocabulary rate (OOV-rate) for a hypothetical extension of an existing text corpus. When the original corpus is split into two different sub-corpora, the vocabulary and the OOV-rate can be determined for that particular split. Average values are calculated over all combinations of sub-corpora and can be approximated by analytic functions. These functions allow the vocabulary size and the OOV-rate to be predicted easily. The predictions have a relative error below 4.6%.
Keywords: out-of-vocabulary rate, OOV-rate, vocabulary size, text corpus, test corpus, training corpus|
|Reference||Proceedings ICSLP 96 (Philadelphia, USA, 1996), pp. 658-661|
|Download||Scientific Conference Paper as pdf file (48 kByte)|
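
The averaging step described in the abstract can be illustrated with a minimal sketch. It assumes the corpus has already been cut into tokenized segments, that one group of segments acts as the training corpus (which defines the vocabulary) and the rest as the test corpus, and that the OOV-rate is the fraction of test tokens missing from the vocabulary. The function names `vocab_and_oov` and `average_over_splits` and the toy sentences are hypothetical and not taken from the paper; the fit of an analytic function to the averages is not shown, since the abstract does not state its form.

```python
from itertools import combinations
from statistics import mean


def vocab_and_oov(train_tokens, test_tokens):
    """Return (vocabulary size, OOV-rate) for one training/test split."""
    vocab = set(train_tokens)
    if not test_tokens:
        return len(vocab), 0.0
    # A test token is out-of-vocabulary if it never occurs in the training part.
    oov = sum(1 for tok in test_tokens if tok not in vocab)
    return len(vocab), oov / len(test_tokens)


def average_over_splits(segments, n_train):
    """Average vocabulary size and OOV-rate over all ways of choosing
    n_train segments for training; the remaining segments form the test part."""
    sizes, rates = [], []
    indices = range(len(segments))
    for chosen in combinations(indices, n_train):
        train = [t for i in chosen for t in segments[i]]
        test = [t for i in indices if i not in chosen for t in segments[i]]
        size, rate = vocab_and_oov(train, test)
        sizes.append(size)
        rates.append(rate)
    return mean(sizes), mean(rates)


# Toy corpus, invented for illustration only (assumption, not from the paper).
segments = [
    "show me flights to boston".split(),
    "show me flights to denver".split(),
    "list fares to boston".split(),
]
for n_train in (1, 2):
    size, rate = average_over_splits(segments, n_train)
    print(n_train, round(size, 2), round(rate, 3))
```

Plotting the averaged values against the training size gives the kind of curves that, according to the abstract, can then be approximated by analytic functions and extrapolated beyond the size of the existing corpus.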