Korpus CVET 1.0 : Izdelava, opis in analiza zbirke starejših besedil v verski periodiki

Košir, Diana; Erjavec, Tomaž


Title:	Korpus CVET 1.0 : Izdelava, opis in analiza zbirke starejših besedil v verski periodiki
Authors:	Košir, Diana (Author); Erjavec, Tomaž (Author)
Files:	PDF - Presentation file, download (21,69 MB) MD5: 4F3B47542E4F6641935966099EF65AE2 URL - Source URL, visit https://zenodo.org/records/13912515
Language:	Slovenian
Typology:	1.08 - Published Scientific Conference Contribution
Organization:	ZRS Koper - Science and Research Centre Koper
Abstract:	V prispevku je predstavljen proces izdelave in jezikoslovnega označevanja korpusa CVET 1.0, ki vsebuje besedila patra Hijacinta Repiča v starejšem slovenskem jeziku, objavljena v verskem glasilu Cvetje z vertov sv. Frančiška v obdobju 1881–1916. Besedila so bila v obliki PDF pridobljena s portala dLib, urejena v urejevalniku Word in nato pretvorjena v zapis TEI. Starejše besedje je bilo z odprtokodnim orodjem za normalizacijo avtomatsko posodobljeno, kar olajša iskanje po korpusu in nadaljnjo analizo gradiva. V članku so izpostavljene nekatere napake, ki so nastale pri posodabljanju in bodo v naslednji verziji korpusa ročno popravljene. Posodobljena besedila so bila nato še avtomatsko jezikoslovno označena z oblikoskladnjo in skladnjo po sistemu Universal Dependencies. Zapis TEI smo pretvorili v več izvedenih formatov in zbirko objavili pod odprto licenco na repozitoriju in konkordančnikih CLARIN.SI, ki so primerni za jezikoslovne analize gradiva. V drugem delu prispevka je prikazan primer analize avtorjevega pripovednega stila, opravljene s konkordančnikom noSketch Engine, ki temelji na frekvenčnih spremenljivkah najpogostejših in najmanj pogostih besed ter ključnih besed.
Keywords:	starejša slovenščina, verski tisk, TEI, normalizacija, stilistična analiza, leksika
Publication version:	Version of Record
Year of publishing:	2024
Number of pages:	pp. 184-204
PID:	20.500.12556/DiRROS-21294
UDC:	81'32
COBISS.SI-ID:	223670531
Note:	Title from the title screen; source described as of 23. 12. 2024
Publication date in DiRROS:	23.01.2025
Views:	485
Downloads:	316

Record is a part of a monograph

Title:	Jezikovne tehnologije in digitalna humanistika : zbornik konference
Editors:	Špela Arhar Holdt, Tomaž Erjavec
Place of publishing:	Ljubljana
Publisher:	Inštitut za novejšo zgodovino = Institute of Contemporary History
Year of publishing:	2024
ISBN:	978-961-7104-40-0
COBISS.SI-ID:	211315971

Licences

License:	CC BY-SA 4.0, Creative Commons Attribution-ShareAlike 4.0 International

Link:	http://creativecommons.org/licenses/by-sa/4.0/
Description:	This Creative Commons license is very similar to the regular Attribution license, but requires the release of all derivative works under this same license.

Secondary language

Language:	English
Title:	Corpus CVET 1.0: Creation, description and analysis of a collection of older texts in religious periodicals
Abstract:	The paper presents the process of creating and linguistically annotating the CVET 1.0 corpus, which contains the texts of Father Hijacint Repič in the older Slovenian language, published in the religious journal Cvetje z vertov sv. Frančiška in the period 1881–1916. The texts were obtained in PDF format from the dLib portal, edited in the Word editor and then converted to TEI. Older words were automatically modernised using an open-source normalisation tool, which facilitates corpus search and further analysis of the material. The article points out some errors that occurred during normalisation, which will be corrected manually in the next version of the corpus (e.g. keterim > ketim* > katerim; kesneje > kosno* > kasneje; sobrat > zbrat* > sobrat). The modernised texts were then automatically linguistically annotated, including morphosyntactic annotations as well as morphological and syntactic annotations according to the Universal Dependencies formalism for Slovenian. We converted the TEI-encoded versions into various formats and published the collection under an open licence in the CLARIN.SI repository and concordancers, which are suitable for linguistic analysis of the material. The second part of the paper presents an example of the analysis of the author's narrative style performed with noSketch Engine, based on the frequency variables of the most and least frequent words and keywords.
Keywords:	historical Slovenian language, religious texts, TEI, normalisation, stylistic analysis, lexis
