Seema

Iterative type-token analyser
Download Link

Seema is developed by LDC-IL to iteratively analyze the type and token counts of a text corpus. Input data can be provided in XML or TXT format. Seema calculates token and type counts at user-defined intervals and computes the increment rate for each iteration.
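The interval-based counting described above can be sketched as follows. This is a minimal illustration, not Seema's actual implementation: the function name `type_token_growth` is hypothetical, and the page does not define "increment rate", so the sketch assumes it means the number of new types found in an interval divided by the number of tokens in that interval.

```python
def type_token_growth(tokens, interval):
    """Report cumulative token and type counts every `interval` tokens.

    Each row is (tokens_seen, types_seen, increment_rate), where
    increment_rate is ASSUMED to be new types / tokens in the interval.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive integer")

    seen = set()        # distinct tokens (types) encountered so far
    rows = []
    count = 0           # total tokens read so far
    last_count = 0      # token count at the previous checkpoint
    last_types = 0      # type count at the previous checkpoint

    def checkpoint(current):
        nonlocal last_count, last_types
        new_types = len(seen) - last_types
        rows.append((current, len(seen), new_types / (current - last_count)))
        last_count, last_types = current, len(seen)

    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        if count % interval == 0:
            checkpoint(count)

    # Report a final, shorter interval if tokens remain after the last checkpoint.
    if count > last_count:
        checkpoint(count)
    return rows


if __name__ == "__main__":
    sample = "a b a c b d a a".split()
    for row in type_token_growth(sample, interval=4):
        print(row)
```

With the eight-token sample and an interval of 4, the first row reports 4 tokens and 3 types, and the second reports 8 tokens and 4 types. A falling rate across iterations is the pattern one would expect as a corpus approaches lexical saturation.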

Seema counts only tokens written in the user-defined script. Numeric tokens are excluded, and punctuation marks are disregarded during type counting.
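One way to approximate this filtering is with Unicode character names and categories, as in the sketch below. The function `tokens_in_script` is hypothetical and the rules are assumptions: a token is kept only if, after punctuation is removed, every character's Unicode name starts with the chosen script name (for example `KANNADA`, `DEVANAGARI`, or `LATIN`) and no character is a number. Seema's own rules may differ, for instance in how it treats mixed tokens or zero-width joiners.

```python
import unicodedata


def tokens_in_script(text, script):
    """Return whitespace-separated tokens written entirely in `script`.

    `script` is the prefix of the Unicode character names to accept,
    e.g. "KANNADA" or "LATIN" (an assumption made for this sketch).
    """
    prefix = script.upper() + " "
    kept = []
    for raw in text.split():
        # Disregard punctuation: drop every character in a Unicode "P*" category.
        token = "".join(
            ch for ch in raw if not unicodedata.category(ch).startswith("P")
        )
        if not token:
            continue
        # Exclude tokens containing any numeric character (categories Nd, Nl, No).
        if any(unicodedata.category(ch).startswith("N") for ch in token):
            continue
        # Keep the token only if every character belongs to the chosen script.
        if all(unicodedata.name(ch, "").startswith(prefix) for ch in token):
            kept.append(token)
    return kept


if __name__ == "__main__":
    print(tokens_in_script("Hello, world! 42 x2 привет hello", "LATIN"))
```

In this example the Cyrillic word, the number, and the mixed token `x2` are dropped, leaving `Hello`, `world`, and `hello`. The sketch does not fold case, since the page does not say whether Seema does.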

To get unbiased results, files must be selected at random. Seema achieves this by employing multiple threads to process the TXT or XML files. It runs 100 threads concurrently, with each thread selecting files spaced at 100-file intervals. This approach improves efficiency and introduces a level of randomness into file selection, accounting for variations in corpus file sizes and processing times. Semaphores are used to safeguard the accuracy of type-token counts among competing threads. The results are displayed in a grid format.
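The strided, semaphore-guarded scheme described above might look roughly like this. It is only a sketch under stated assumptions: `count_types_tokens` is a hypothetical name, documents are passed as in-memory strings instead of being read from TXT or XML files, and tokens are split on whitespace. Worker `i` handles documents `i`, `i + n_threads`, `i + 2 * n_threads`, and so on, and a binary semaphore serialises every update to the shared counts.

```python
import threading


def count_types_tokens(documents, n_threads=100):
    """Count tokens and types across `documents` using strided worker threads."""
    types = set()               # shared set of distinct tokens
    totals = {"tokens": 0}      # shared running token count
    guard = threading.Semaphore(1)  # binary semaphore protecting shared state

    def worker(start):
        # Each worker takes every n_threads-th document, beginning at `start`.
        for doc in documents[start::n_threads]:
            tokens = doc.split()    # tokenise outside the critical section
            with guard:             # only one thread updates the counts at a time
                totals["tokens"] += len(tokens)
                types.update(tokens)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals["tokens"], len(types)


if __name__ == "__main__":
    docs = ["a b", "b c", "c d e"] * 50
    print(count_types_tokens(docs))
```

The final totals do not depend on thread scheduling, but the order in which documents contribute to the running counts does, which is where the element of randomness in interval-by-interval results would come from.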

Type-token analysis can reveal differences in language use across various texts, genres, or time periods. By comparing type-token ratios between different corpora, researchers can identify linguistic trends, stylistic differences, or changes in language over time. The saturation of lexical items in a corpus can also be assessed by the analysis.
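As a small illustration of such a comparison, the sketch below computes the type-token ratio (types divided by tokens) for two token lists. The function name `type_token_ratio` and the sample sentences are made up for this example. Because the ratio tends to fall as a text grows, comparisons are more meaningful between samples of equal token length.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens divided by the total number of tokens."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Two equal-length samples (illustrative data only).
    sample_a = "the cat sat on the mat".split()
    sample_b = "the the the cat cat sat".split()
    print(type_token_ratio(sample_a))  # 5 types over 6 tokens
    print(type_token_ratio(sample_b))  # 3 types over 6 tokens
```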

Seema helps researchers understand the richness and diversity of vocabulary within a corpus by providing counts of both unique words (types) and total words (tokens). This information can be used to assess vocabulary size, lexical variety, and word usage patterns, to support quantitative linguistic analysis, and to understand the complexities of language usage within corpora.

Credits: Rajesha N, Linguistic Data Consortium for Indian Languages (LDC-IL), Central Institute of Indian Languages, Mysore.

Seema Interface: