Jezik in slovstvo


Abstracts

Tomaz Erjavec

Racunalniske zbirke besedil
Computerized Text Collections

Slovene synopsis

Ordered computerized text collections, or corpora, are becoming an indispensable source of linguistic data. No publicly available corpora exist yet for Slovene. The article gives a historical overview of the development of computer corpora, their typology, and their fields of application. It then discusses two issues in more detail: the standardization of corpus encoding and the tools for developing and exploiting corpora. The second part of the article is devoted to the MULTEXT-East project (Multilingual Text Tools and Corpora for Central and Eastern European Languages), which also includes Slovene. Most attention goes to presenting the corpus and the morphological and syntactic descriptions developed within the project, together with the results currently available. The concluding part discusses some possibilities for developing corpus linguistics in Slovenia.

English synopsis

Ordered and computerized text collections --- corpora --- are becoming an indispensable source of linguistic data. Freely available corpora of the Slovene language do not yet exist. The article gives a historical overview of the development of computer corpora, their typology and fields of application. Two aspects of corpora are discussed next: the standardization of their encoding and the tools for their development and exploitation. The second part of the article gives an overview of the MULTEXT-East project (Multilingual Text Tools and Corpora for Central and Eastern European Languages), which also includes the Slovene language. The focus of the presentation is on the corpus and morphosyntactic descriptions developed in the project and on its currently available results. Finally, some possibilities for developing the field of corpus linguistics in Slovenia are discussed.

English summary

Ordered and computerized text collections --- corpora --- are becoming an indispensable source of linguistic data. Freely available corpora of the Slovene language do not yet exist. The article gives a historical overview of the development of computer corpora, their typology and fields of application. Two aspects of corpora are discussed next: the standardization of their encoding and the tools for their development and exploitation. The second part of the article gives an overview of the MULTEXT-East project (Multilingual Text Tools and Corpora for Central and Eastern European Languages), which also includes the Slovene language. The focus of the presentation is on the corpus and morphosyntactic descriptions developed in the project and on its currently available results. Finally, some possibilities for developing the field of corpus linguistics in Slovenia are discussed.