material_in_the_databases

Differences

This shows you the differences between two versions of the page.

material_in_the_databases [2022/03/17 19:34] stefanparma
material_in_the_databases [2023/09/06 09:49] (current) pm
Line 2: Line 2:
 We give a description of the material included in the two sub-bases.
  
-===== BCS =====
+===== BCMS =====
  
-The verb selection for BCS was conducted using the corpora [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=srwac&struct_attr_stats=1|srWac]], [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=hrwac&struct_attr_stats=1|hrWaC]], [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=bswac&struct_attr_stats=1|bsWaC]] and [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=mewac&struct_attr_stats=1|meWaC]], all of which are part of [[https://www.clarin.si/noske/index.html|Clarin.si]]’s infrastructure that uses [[https://nlp.fi.muni.cz/trac/noske|NoSketch Engine]] to search and analyze different corpora. The criterion was frequency: the 3000 most frequent verbs from each of the corpora were included. The corpora of SC had substantial overlap, which is why the number of included verbs is not 12000, as expected without any overlap, but 5300, with a number of verbs repeated in regional variants.
+The verb selection for BCMS was conducted using the corpora [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=srwac&struct_attr_stats=1|srWac]], [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=hrwac&struct_attr_stats=1|hrWaC]], [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=bswac&struct_attr_stats=1|bsWaC]] and [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=mewac&struct_attr_stats=1|meWaC]], all of which are part of [[https://www.clarin.si/noske/index.html|Clarin.si]]’s infrastructure that uses [[https://nlp.fi.muni.cz/trac/noske|NoSketch Engine]] to search and analyze different corpora. The criterion was frequency: the 3000 most frequent verbs from each of the corpora were included. The corpora of BCMS had substantial overlap, which is why the number of included verbs is not 12000, as expected without any overlap, but 5300, with a number of verbs repeated in regional variants.
 Different shapes that the same verbs have in two or each of the varieties were introduced as separate entries and annotated as variants of one verb. Some typical examples of variants are ekavian and ijekavian versions (e.g. //verovati// and //vjerovati// 'to believe'), or versions emerging from using different native suffixes to adopt
 borrowed verbs (e.g. //lajk-a-ti// and //lajk-ova-ti// 'to like (on social media)').
- 
-**Some general notes**\\ 
-FIXME 
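The selection step described for BCMS (take the top-N verbs from each corpus, then merge the lists) can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the database: the function names `top_n` and `merge_top_verbs` are hypothetical, and the frequency counts are invented toy values rather than figures from the WaC corpora.

```python
from collections import defaultdict


def top_n(freqs: dict[str, int], n: int) -> list[str]:
    """Return the n most frequent lemmas; ties are broken alphabetically."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [lemma for lemma, _ in ranked[:n]]


def merge_top_verbs(corpora: dict[str, dict[str, int]], n: int) -> dict[str, set[str]]:
    """Map each selected verb to the set of corpora whose top-n list contains it."""
    sources: dict[str, set[str]] = defaultdict(set)
    for name, freqs in corpora.items():
        for lemma in top_n(freqs, n):
            sources[lemma].add(name)
    return dict(sources)


# Toy frequency counts, invented for illustration only.
toy = {
    "srWaC": {"verovati": 90, "raditi": 80, "lajkovati": 5},
    "hrWaC": {"vjerovati": 85, "raditi": 70, "lajkati": 4},
}

selected = merge_top_verbs(toy, n=2)
# Two lists of two verbs would give four entries without overlap;
# the shared verb "raditi" is counted once, so the merged set is smaller.
print(len(selected))
```

With real data, the same merge over four lists of 3000 verbs would yield fewer than 12000 entries whenever the lists share verbs. Regional variants such as //verovati// and //vjerovati// remain separate entries in this sketch, since linking them requires manual annotation.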
  
 ===== Slovenian =====
  
-The list of 3000 most common Slovenian verbs was made using [[https://www.clarin.si/noske/index.html|Clarin.si]]’s infrastructure that uses [[https://nlp.fi.muni.cz/trac/noske|NoSketch Engine]] to search and analyze different corpora. For the purposes of this database, we used  the Gigafida 2.0 corpora. You can find general information abot the corpora [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida20_dedup&struct_attr_stats=1&subcorpora=1|here]] and its website [[https://viri.cjvt.si/gigafida/|here]].
+The list of 3000 most common Slovenian verbs was made using [[https://www.clarin.si/noske/index.html|Clarin.si]]’s infrastructure that uses [[https://nlp.fi.muni.cz/trac/noske|NoSketch Engine]] to search and analyze different corpora. For the purposes of this database, we used  the Gigafida 2.0 corpora. You can find general information about the corpora [[https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida20_dedup&struct_attr_stats=1&subcorpora=1|here]] and its website [[https://viri.cjvt.si/gigafida/|here]].
  
 ===== Some general notes =====
Line 19: Line 16:
 Items that got on the list due to annotation mistakes in the corpus were excluded from our list and replaced by the next verb on the list of most common verbs. One such example from Slovenian is ‘//Hoče//’. //Hoče// is indeed the third-person singular form of the verb //hoteti//, but it is also the proper name of a Slovenian municipality. Since //hoteti// ‘to want’ was independently on the list, the form ‘//Hoče//’ was excluded. \\
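The exclusion rule above (drop a mis-annotated item and let the next verb on the frequency-ranked list take its place) can be sketched like this. The function name `select_verbs` and the short ranking are assumptions made for illustration; the real ranking comes from the corpus, and the excluded items were identified by hand.

```python
from itertools import islice


def select_verbs(ranked: list[str], excluded: set[str], n: int) -> list[str]:
    """Take the first n items of a frequency-ranked list, skipping excluded ones."""
    kept = (item for item in ranked if item not in excluded)
    return list(islice(kept, n))


# Toy ranking, invented for illustration; not real corpus output.
ranked = ["biti", "imeti", "Hoče", "hoteti", "brati"]

result = select_verbs(ranked, excluded={"Hoče"}, n=3)
print(result)
```

Because excluded items are skipped before the cut-off is applied, the list keeps its target length: here //hoteti// moves up into the slot that ‘//Hoče//’ would otherwise have occupied.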
  
-The list of verbs includes several homophonous verbs. Since the corpus is not annotated for meaning, homophonous verbs are counted as one verb. For example, in Slovenian, the verb //brati// can mean ‘read’ or ‘gather, collect’. In such cases the annotators annotated the verb for the propertes associated with what they took to be the more frequent use of the verb. Same goes for prefixed versions (//prebrati// ‘to finish reading’ or ‘pick through’) but note that not all meanings appear with all prefixes (//odbrati// just ‘collect some items from a set, separate’).
+The list of verbs includes several homophonous verbs. Since the corpus is not annotated for meaning, homophonous verbs are counted as one verb. For example, in Slovenian, the verb //brati// can mean ‘read’ or ‘gather, collect’. In such cases the annotators annotated the verb for the properties associated with what they took to be the more frequent use of the verb. Same goes for prefixed versions (//prebrati// ‘to finish reading’ or ‘pick through’) but note that not all meanings appear with all prefixes (//odbrati// just ‘collect some items from a set, separate’).
  
  
material_in_the_databases.1647542051.txt.gz · Last modified: 2022/03/17 19:34 by stefanparma