Digital repository of Slovenian research organisations

Show document

Title:Korpus CVET 1.0 : Izdelava, opis in analiza zbirke starejših besedil v verski periodiki
Authors:Košir, Diana (Author)
Erjavec, Tomaž (Author)
Files:PDF - Presentation file (21.69 MB)
MD5: 4F3B47542E4F6641935966099EF65AE2
URL:Source URL, https://zenodo.org/records/13912515
 
Language:Slovenian
Typology:1.08 - Published Scientific Conference Contribution
Organization:ZRS Koper - Science and Research Centre Koper
Abstract:V prispevku je predstavljen proces izdelave in jezikoslovnega označevanja korpusa CVET 1.0, ki vsebuje besedila patra Hijacinta Repiča v starejšem slovenskem jeziku, objavljena v verskem glasilu Cvetje z vertov sv. Frančiška v obdobju 1881–1916. Besedila so bila v obliki PDF pridobljena s portala dLib, urejena v urejevalniku Word in nato pretvorjena v zapis TEI. Starejše besedje je bilo z odprtokodnim orodjem za normalizacijo avtomatsko posodobljeno, kar olajša iskanje po korpusu in nadaljnjo analizo gradiva. V članku so izpostavljene nekatere napake, ki so nastale pri posodabljanju in bodo v naslednji verziji korpusa ročno popravljene. Posodobljena besedila so bila nato še avtomatsko jezikoslovno označena z oblikoskladnjo in skladnjo po sistemu Universal Dependencies. Zapis TEI smo pretvorili v več izvedenih formatov in zbirko objavili pod odprto licenco na repozitoriju in konkordančnikih CLARIN.SI, ki so primerni za jezikoslovne analize gradiva. V drugem delu prispevka je prikazan primer analize avtorjevega pripovednega stila, opravljene s konkordančnikom noSketch Engine, ki temelji na frekvenčnih spremenljivkah najpogostejših in najmanj pogostih besed ter ključnih besed.
Keywords:starejša slovenščina, verski tisk, TEI, normalizacija, stilistična analiza, leksika
Publication version:Version of Record
Year of publishing:2024
Number of pages:pp. 184-204
PID:20.500.12556/DiRROS-21294
UDC:81'32
COBISS.SI-ID:223670531
Note:Title from the title screen; description of the resource as of 23. 12. 2024
Publication date in DiRROS:23.01.2025
Views:466
Downloads:310



Record is a part of a monograph

Title:Jezikovne tehnologije in digitalna humanistika : zbornik konference
Editors:Špela Arhar Holdt, Tomaž Erjavec
Place of publishing:Ljubljana
Publisher:Inštitut za novejšo zgodovino = Institute of Contemporary History
Year of publishing:2024
ISBN:978-961-7104-40-0
COBISS.SI-ID:211315971

Licences

License:CC BY-SA 4.0, Creative Commons Attribution-ShareAlike 4.0 International
Link:http://creativecommons.org/licenses/by-sa/4.0/
Description:This Creative Commons license is very similar to the regular Attribution license, but requires the release of all derivative works under this same license.

Secondary language

Language:English
Title:Corpus CVET 1.0: Creation, description and analysis of a collection of older texts in religious periodicals
Abstract:The paper presents the process of creation and linguistic tagging of the CVET 1.0 corpus, which contains the texts of Father Hijacint Repič in the older Slovenian language, published in the religious journal Cvetje z vertov sv. Frančiška in the period 1881–1916. The texts were obtained in PDF format from the dLib portal, edited in the Word editor and then converted to TEI. Older words were automatically updated using an open-source normalisation tool, which facilitates corpus search and further analysis of the material. The article points out some errors that occurred during normalisation, which will be corrected manually in the next version of the corpus (e.g. keterim > ketim* > katerim; kesneje > kosno* > kasneje; sobrat > zbrat* > sobrat). The updated texts were then automatically linguistically annotated, including morphosyntactic annotations as well as morphological and syntactic annotations according to the Universal Dependencies formalism for Slovenian. We converted the TEI-encoded versions into various formats and published the collection under an open licence in the CLARIN.SI repository and concordancers suitable for linguistic analysis of the material. The second part of the paper presents an example of the analysis of the author's narrative style performed with noSketch Engine, based on the frequency variables of the most and least frequent words and keywords.
Keywords:historical Slovenian language, religious texts, TEI, normalisation, stylistic analysis, lexis

