“Unification of the systems of grammatical annotation in the Turkic languages corpora (UniTurk seminar)”

The creation of electronic linguistic corpora presents their developers with a wide range of problems and challenges. Solving them successfully requires combining the results of linguistic research with modern computational methods for analysing linguistic data. The capabilities of a corpus are largely determined by its system of annotation (markup).

In the context of globalization and the integration of scientific research, unifying the representation of linguistic information, in particular the annotation systems for grammatical categories, gains special importance, above all for groups of related languages. An analysis of the current situation shows that Turkic corpus linguistics, despite the genetic, structural and typological commonality of the Turkic languages, has not yet developed common principles and approaches to the linguistic annotation of texts. In the long term this will considerably hinder comparative studies, the development of Turkic parallel corpora and multilingual text processing systems, and the solution of other theoretical and applied problems.

Although most experts are convinced of the need for a system of tags when texts are recorded in written form, there is no single unified format for representing linguistic information in corpora. The differences concern the inventory of grammatical categories, the metalanguage used to describe them, and the composition of the necessary layers of representation. The same morphological categories are designated differently in different studies of Turkic languages. Corpus developers use annotation systems created for other, primarily Indo-European, languages, which do not always adequately reflect the specific features of the Turkic languages. The development of an annotation system for this particular group of languages is therefore very important today.
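The problem of one morphological category carrying different labels in different corpora can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the corpus names (`corpus_a`, `corpus_b`), the source tags, the unified tag names and the `unify` helper are hypothetical and do not reproduce the annotation scheme of any existing Turkic corpus.

```python
# Hypothetical per-corpus tag inventories mapped onto one shared tag set.
# None of these labels are taken from a real corpus; they only illustrate
# how differently named tags for the same category could be reconciled.
TAG_MAPS = {
    "corpus_a": {"PL": "PL", "POSS.1SG": "POSS_1SG", "ABL": "ABL"},
    "corpus_b": {"Plur": "PL", "P1sg": "POSS_1SG", "Abl": "ABL"},
}


def unify(corpus, tags):
    """Translate a sequence of corpus-specific tags into the shared tag set.

    Tags with no known counterpart are kept, prefixed with "UNK:", so that
    gaps in the mapping stay visible instead of being silently dropped.
    """
    mapping = TAG_MAPS[corpus]
    return [mapping.get(tag, "UNK:" + tag) for tag in tags]


# The same analysis (plural + ablative), labelled differently in two corpora,
# becomes directly comparable once both are mapped to the shared tags.
from_a = unify("corpus_a", ["PL", "ABL"])
from_b = unify("corpus_b", ["Plur", "Abl"])
print(from_a == from_b)
```

A lookup table of this kind only works where the categories themselves coincide. Where the inventories of categories differ, as noted above, a simple one-to-one mapping is not sufficient, which is why an agreed standard is needed.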

The lack of uniformity in corpus annotation has both objective, substantive scientific causes (the complexity of the natural language system, the non-isomorphism of the grammatical structures of different languages, differences in the degree of transparency of morphonological processes, etc.) and organizational ones (the lack of a single coordinating centre or of standards for developing grammatical and semantic annotation for corpora and linguistic databases, etc.).

The participants of the scientific-practical seminar “Unification of the systems of grammatical annotation in the Turkic languages corpora (UniTurk seminar)” state that one of the most important tasks of Turkic linguistics is to develop a standard for representing linguistic information that would allow the existing corpora of Turkic languages, and those now being created, to be organized into a single information space for a wide range of users: specialists in Turkic studies, typologists and lay users.

We suggest:

  1. to support the work of the initiative group on creating a version of the table of annotations for grammatical categories in Turkic languages, and to adopt it as the basis for developing a unified system of grammatical annotation of the corpora;
  2. to form a working group to develop general recommendations on the representation of linguistic information in Turkic languages corpora and standards for the grammatical annotation of corpora, and to appoint the Scientific Research Institute “Applied Semiotics” of the Academy of Sciences of the Republic of Tatarstan as coordinator of the group (responsible person: Airat Gatiatullin, email: ayrat.gatiatullin@gmail.ru);
  3. to develop standards for the representation of grammatical information in the corpora;
  4. to publish the progress and results of the discussion of problems of unifying grammatical annotation systems in Turkic languages corpora on the website (http://TurkLang.tatar/);
  5. to hold the UniTurk seminar on a regular basis (including by videoconference);
  6. to prepare a Kazan agreement on cooperation in the field of unification of the Turkic languages.