Jezik in slovstvo

Abstracts

Tomaž Erjavec

Računalniške zbirke besedil
Computerized Text Collections

Slovene synopsis

Organized computerized text collections, or corpora, are becoming an indispensable source of linguistic data. No publicly available corpora yet exist for Slovene. The article gives a historical overview of the development of computer corpora, their typology, and their areas of application. It discusses two issues in more detail: the standardization of corpus encoding and the tools for developing and exploiting corpora. The second part of the article is devoted to the MULTEXT-East project (Multilingual Text Tools and Corpora for Central and Eastern European Languages), which also covers Slovene. It focuses on the corpus and the morphological and syntactic descriptions developed within the project, as well as on the currently available results. The concluding part discusses some possibilities for developing corpus linguistics in Slovenia.

English synopsis

Ordered and computerized text collections (corpora) are becoming an indispensable source of linguistic data. Freely available corpora of the Slovene language do not yet exist. The article gives a historical overview of the development of computer corpora, their typology and fields of application. Two aspects of corpora are discussed next: the standardization of their encoding and the tools for their development and exploitation. The second part of the article gives an overview of the MULTEXT-East project (Multilingual Text Tools and Corpora for Central and Eastern European Languages), which also includes the Slovene language. The focus of the presentation is on the corpus and morphosyntactic descriptions developed in the project and on its currently available results. Finally, some possibilities for developing the field of corpus linguistics in Slovenia are discussed.
