
Material in the databases

We give a description of the material included in the two sub-bases.

BCMS

The verbs for BCMS were selected using the corpora srWaC, hrWaC, bsWaC and meWaC, all of which are part of Clarin.si’s infrastructure, which uses NoSketch Engine to search and analyze different corpora. The criterion was frequency: the 3000 most frequent verbs from each corpus were included. The four lists overlapped substantially, which is why the number of included verbs is not 12000, as it would be without any overlap, but 5300, with a number of verbs repeated in regional variants. Different forms that the same verb takes in two or more of the varieties were entered as separate entries and annotated as variants of one verb. Typical examples of variants are ekavian and ijekavian versions (e.g. verovati and vjerovati 'to believe'), or versions that arise from using different native suffixes to adapt borrowed verbs (e.g. lajk-a-ti and lajk-ova-ti 'to like (on social media)').
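The merging step described above can be illustrated with a minimal sketch. It assumes that each corpus yields a list of verb lemmas already sorted by descending frequency. The function name `merge_top_verbs`, the toy data, and the cutoff of 2 are hypothetical and chosen for illustration only; this is not the procedure actually used to build the database.

```python
from collections import defaultdict


def merge_top_verbs(freq_lists, top_n=3000):
    """Merge the top_n lemmas of each corpus into one inventory.

    freq_lists maps a corpus name to a list of lemmas sorted by
    descending frequency (an assumed input format). The result maps
    each lemma to the set of corpora whose top_n list contains it.
    """
    sources = defaultdict(set)
    for corpus, lemmas in freq_lists.items():
        for lemma in lemmas[:top_n]:
            sources[lemma].add(corpus)
    return dict(sources)


# Toy data (hypothetical rankings, not real corpus counts).
toy = {
    "srWaC": ["biti", "verovati", "raditi"],
    "hrWaC": ["biti", "vjerovati", "raditi"],
}
merged = merge_top_verbs(toy, top_n=2)
print(len(merged))  # fewer entries than 2 lists x 2 verbs, because "biti" is shared
```

In this toy case, verovati and vjerovati remain separate entries, which mirrors the decision to list regional variants separately. Linking them as variants of one verb is a further annotation step that the sketch does not attempt.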

Slovenian

The list of the 3000 most common Slovenian verbs was compiled using Clarin.si’s infrastructure, which uses NoSketch Engine to search and analyze different corpora. For the purposes of this database, we used the Gigafida 2.0 corpus. You can find general information about the corpus here and its website here.

Some general notes

Items that made it onto the list because of annotation mistakes in the corpus were excluded from our list and replaced by the next verb on the list of most common verbs. One such example from Slovenian is ‘Hoče’. Hoče is indeed the third-person singular form of the verb hoteti, but it is also the name of a Slovenian municipality. Since hoteti ‘to want’ was independently on the list, the form ‘Hoče’ was excluded.
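The exclusion-and-replacement step can be sketched as follows. The sketch assumes a frequency-ranked list of candidate lemmas and a manually compiled set of misannotated items. The function name `top_verbs` and the toy ranking are hypothetical, and the code is an illustration rather than the tool that was actually used.

```python
def top_verbs(ranked, excluded, n):
    """Return the n most frequent lemmas after dropping excluded items.

    Because excluded items are filtered out before the cutoff is
    applied, each one is implicitly replaced by the next verb in the
    ranking.
    """
    kept = [lemma for lemma in ranked if lemma not in excluded]
    return kept[:n]


# Toy ranking (hypothetical order, not real corpus frequencies).
ranked = ["biti", "imeti", "Hoče", "hoteti", "iti"]
selected = top_verbs(ranked, excluded={"Hoče"}, n=4)
print(selected)
```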

The list of verbs includes several homophonous verbs. Since the corpus is not annotated for meaning, homophonous verbs are counted as one verb. For example, in Slovenian, the verb brati can mean ‘read’ or ‘gather, collect’. In such cases the annotators annotated the verb for the properties associated with what they took to be its more frequent use. The same goes for prefixed versions (prebrati ‘to finish reading’ or ‘pick through’), but note that not all meanings appear with all prefixes (odbrati means only ‘collect some items from a set, separate’).

Annotation

The following properties were annotated for each verb. If a property is only applicable to one sub-base, this is indicated.
