A corpus-based word frequency list of Turkish: Evidence from the subcorpora of Turkish National Corpus project
The Szeged Conference : proceedings of the 15th International Conference on Turkish Linguistics held on August 20-22, 2010 in Szeged 49
|Summary:||Word frequency studies play a central role in various disciplines, such as linguistics, cognitive psychology, natural language processing, and computational linguistics. Developments in computer technology and information processing help researchers compile comprehensive word lists on the basis of digitally constructed language corpora. Since Kucera and Francis derived the first corpus-based word frequency lists from the Brown Corpus (1967), numerous studies have been conducted on general or specialized corpora to obtain the rank frequency order and distribution of words in various Indo-European languages (Johansson & Hofland 1989; Leech et al. 2001; Baroni et al. 2004; Ha et al. 2006; Davies & Gardner 2010). For Turkish, Goz's dictionary (2003), which is based on a 1-million-word general corpus, is the only work on word frequency. The lexical properties of Turkish in general, and of text collections representing different registers of Turkish in particular, need to be described through corpus-based word frequency lists. With this need in mind, this study has two aims: (1) to produce word frequency lists of Turkish on the basis of two subcorpora, namely the Corpus of Contemporary Turkish Fiction and the Corpus of Contemporary Turkish News Texts, for which frequency lists of both root types and word classes are prepared; and (2) to compare these two corpora using frequency profiling information. The paper is organized as follows. First, we explain basic concepts and review the literature on word frequency studies. Then, we describe the construction of the two subcorpora used to derive the word lists and explain the steps followed in tokenization and in the root type mapping scheme on which the token and root counts are based. Finally, we compare the rank frequency and word class lists of the Turkish Fiction and Turkish News Texts corpora.|