
Material in the databases

We give a description of the material included in the two sub-bases.

BCMS

The verbs for BCMS were selected using the corpora srWaC, hrWaC, bsWaC and meWaC, all of which are part of Clarin.si’s infrastructure, which uses NoSketch Engine to search and analyze corpora. The criterion was frequency: the 3000 most frequent verbs from each corpus were included. The four lists overlap substantially, which is why the number of included verbs is not 12000, as it would be without any overlap, but 5300, with a number of verbs repeated in regional variants. Different forms that the same verb has in two or more of the varieties were entered as separate entries and annotated as variants of one verb. Some typical examples of variants are ekavian and ijekavian versions (e.g. verovati and vjerovati 'to believe'), or versions that arise when different native suffixes are used to adapt borrowed verbs (e.g. lajk-a-ti and lajk-ova-ti 'to like (on social media)').
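The selection step described above can be sketched as taking the top-N list from each corpus and merging the lists, so that a verb shared by several corpora is counted once. This is only a minimal illustration: the function names, the toy frequencies and the tie-breaking rule are assumptions, not the procedure or data actually used for the database.

```python
def top_verbs(freqs: dict[str, int], n: int) -> list[str]:
    """Return the n most frequent lemmas (ties broken alphabetically; an assumption)."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [lemma for lemma, _ in ranked[:n]]


def merge_selections(corpora: dict[str, dict[str, int]], n: int) -> dict[str, set[str]]:
    """Map each selected lemma to the set of corpora whose top-n list contains it."""
    merged: dict[str, set[str]] = {}
    for name, freqs in corpora.items():
        for lemma in top_verbs(freqs, n):
            merged.setdefault(lemma, set()).add(name)
    return merged


# Toy frequency lists; the counts are invented for illustration only.
toy = {
    "srWaC": {"biti": 900, "verovati": 50, "lajkovati": 5},
    "hrWaC": {"biti": 950, "vjerovati": 60, "lajkati": 7},
}
merged = merge_selections(toy, n=2)
print(len(merged))  # fewer than 2 * 2 entries, because "biti" is shared
```

With real data, the same merge explains why four lists of 3000 verbs yield fewer than 12000 entries. Linking regional variants such as verovati and vjerovati to one verb is a separate annotation step that this sketch does not attempt.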

Slovenian

The list of the 3000 most common Slovenian verbs was compiled using Clarin.si’s infrastructure, which uses NoSketch Engine to search and analyze corpora. For this database, we used the Gigafida 2.0 corpus. You can find general information about the corpus here and its website here.

Some general notes

Items that made it onto the list because of annotation mistakes in the corpus were excluded and replaced by the next verb on the list of most common verbs. One such example from Slovenian is ‘Hoče’. Hoče is indeed the 3rd person singular form of the verb hoteti, but it is also the name of a Slovenian municipality. Since hoteti ‘to want’ was on the list independently, the form ‘Hoče’ was excluded.
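The exclusion rule can be pictured as walking down the ranked list, skipping flagged items, and continuing until the target number of verbs is reached. The sketch below assumes a hypothetical ranked list and a hand-compiled exclusion set; the ranks shown are invented.

```python
def select_with_exclusions(ranked: list[str], excluded: set[str], n: int) -> list[str]:
    """Take the first n items of a ranked list, skipping excluded ones.

    Each skipped item is effectively replaced by the next verb on the list.
    """
    selected: list[str] = []
    for lemma in ranked:
        if lemma in excluded:
            continue
        selected.append(lemma)
        if len(selected) == n:
            break
    return selected


# Hypothetical ranking; only the exclusion of 'Hoče' comes from the text.
ranked = ["biti", "imeti", "Hoče", "hoteti", "iti"]
selection = select_with_exclusions(ranked, excluded={"Hoče"}, n=4)
print(selection)
```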

The list of verbs includes several homophonous verbs. Since the corpus is not annotated for meaning, homophonous verbs are counted as one verb. For example, in Slovenian, the verb brati can mean ‘read’ or ‘gather, collect’. In such cases, the annotators annotated the verb for the properties associated with what they took to be its more frequent use. The same goes for prefixed versions (prebrati ‘to finish reading’ or ‘pick through’), but note that not all meanings appear with all prefixes (odbrati only means ‘collect some items from a set, separate’).

Annotation

The following properties were annotated for each verb. If a property is only applicable to one sub-base, this is indicated.

  • the verb’s regional variant,
  • its variants regarding the realization of the phoneme yat (for SC),
  • the base of the verb (the chunk preceding the theme vowel),
  • the 3rd person singular present tense form,
  • the theme vowel,
  • frequency in the corpus (tokens per million words),
  • availability of imperfective interpretation,
  • prefix (the rightmost one),
  • second prefix (second rightmost),
  • third prefix,
  • whether the verb can be intransitive,
  • whether the verb can be intransitive with an external argument,
  • whether the verb denotes a state,
  • whether the verb can take an argument (each type of argument annotated as a separate property): in the accusative case, in the dative case, in the genitive case, in the instrumental case, a clausal argument, a PP argument, an obligatory reflexive accusative,
  • whether there are two verbs, one without and another with the reflexive accusative,
  • the verb’s aspectual pair,
  • whether each of the following morphological operations applies to the verb to get to the aspectual pair (descriptively speaking; the application of each operation to derive the aspectual pair annotated as a separate property): adding a suffix, removing a suffix, adding a prefix, removing a prefix, apophony, theme vowel change, suppletion,
  • whether the verb includes each of the following suffixes (availability of each suffix annotated as a separate property, the suffixes in bold only in SC): ava/uje, ova/uje, ava, eva/uje, eva, nu/ne (in Slovenian ni/ne), ka, ta, ca, j, isa, ira, iva/uje, va/je,
  • whether the verb is simplex (root + theme vowel + inflection),
  • whether the verb is derived from a word of another category,
  • whether the verb involves root allomorphy and the list of root allomorphs,
  • for each of the following positions, whether it bears prosodic prominence and, in SC, whether it contains a long vowel (marked separately for the infinitive and for the 3rd person singular present tense; each combination is annotated as one property, for instance, whether the verb has a long syllable on the theme vowel is one property): the inflection, the theme vowel, the syllable preceding the theme vowel, the syllable two slots before the theme vowel, any syllable three or more slots before the theme vowel,
  • the passive participle,
  • the -nje nominalization.
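To make the shape of an entry concrete, a few of the properties listed above could be held in a record like the one below. This is a sketch under stated assumptions: the class name, field names and example values are hypothetical and do not reflect the actual database schema, and only a small subset of the properties is shown.

```python
from dataclasses import dataclass, field
from typing import Optional


def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw token count to tokens per million words."""
    return count * 1_000_000 / corpus_size


@dataclass
class VerbEntry:
    # All field names are hypothetical; they mirror a subset of the listed properties.
    lemma: str                      # infinitive
    base: str                       # the chunk preceding the theme vowel
    theme_vowel: str
    present_3sg: str                # 3rd person singular present tense form
    freq_per_million: float
    prefixes: list[str] = field(default_factory=list)   # rightmost prefix first
    variants: list[str] = field(default_factory=list)   # other forms of the same verb
    takes_accusative: Optional[bool] = None              # None = not annotated


# Illustrative values; the count and corpus size are invented.
entry = VerbEntry(
    lemma="lajkati",
    base="lajk",
    theme_vowel="a",
    present_3sg="lajka",
    freq_per_million=per_million(250, 1_000_000_000),
    variants=["lajkovati"],
)
print(entry.freq_per_million)
```

Keeping each yes/no property as its own field matches the description above, where every argument type, suffix and morphological operation is annotated as a separate property.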


material_in_the_databases.txt · Last modified: 2023/09/06 09:49 by pm