Automatic classification of large archives of astronomical texts

Astronomical knowledge of interest to the Virtual Observatory (VO) consists not only of catalogues, images and spectra, but also of a huge amount of purely textual information: articles, preprints, abstracts etc. However, the large archives of astronomical texts currently in operation are neither keyword-based nor semantically structured, in clear contradiction with VO principles. There is, of course, PACS (the Physics and Astronomy Classification Scheme), used by major refereed journals, but it is arbitrary and relies on manual assignment of keywords. Manual classification of the whole corpus is clearly impossible; we need a way to perform it in an automatic, uniform and well-defined manner.

There are two possible solutions to this task. The first is to use existing thesauri or already defined semantic maps of the field, and to classify texts according to the appearance of their terms as exact phrases. The obvious problems with this approach, however, are that such word lists are arbitrary and incomplete, and inevitably lag behind the emergence of new scientific topics of interest.
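As a rough illustration of the first approach, the sketch below counts exact-phrase hits from a thesaurus in a single text. The thesaurus contents and the names `THESAURUS` and `classify` are hypothetical and chosen only for this example; a real thesaurus would be far larger.

```python
import re

# Hypothetical thesaurus: category -> exact phrases (illustrative only).
THESAURUS = {
    "gamma-ray bursts": ["gamma-ray burst", "afterglow"],
    "cosmology": ["dark energy", "cosmic microwave background"],
}

def classify(text, thesaurus=THESAURUS):
    """Return {category: number of exact-phrase hits} for one text."""
    lowered = text.lower()
    hits = {}
    for category, phrases in thesaurus.items():
        count = 0
        for phrase in phrases:
            # Word boundaries keep a phrase from matching inside a longer word.
            pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
            count += len(re.findall(pattern, lowered))
        if count:
            hits[category] = count
    return hits
```

Note that with exact matching a text mentioning only "gamma-ray bursts" (plural) is not assigned to the category, because the listed phrase is the singular form. This is a small instance of the incompleteness problem described above.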

The second, more general, approach is to build an ontology from scratch, by linguistic analysis of the texts and their relations. Extracting phrases that are common in the corpus but uncommon in some reference text base yields a list of keywords: key concepts, popular object names etc. Then, analysis of how often keywords appear together in the same article, or in articles by the same author, may reveal the relations between keywords. As a result, we effectively build a complete semantic map of the given text corpus.
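The two steps of this approach can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual linguistic analysis: phrases are taken to be word n-grams of up to two words, "common in the corpus and uncommon in the reference" is approximated by a ratio of document frequencies, and relations are approximated by counting keyword pairs within the same text. All function names and the thresholds `min_docs` and `min_ratio` are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def ngrams(text, n_max=2):
    """Yield lowercase word n-grams (1..n_max words) from a text."""
    words = re.findall(r"[a-z0-9\-]+", text.lower())
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def document_frequency(texts, n_max=2):
    """Count in how many texts each phrase occurs at least once."""
    df = Counter()
    for text in texts:
        df.update(set(ngrams(text, n_max)))
    return df

def extract_keywords(corpus, reference, min_docs=2, min_ratio=3.0):
    """Return {phrase: ratio} for phrases common in corpus, rare in reference."""
    corpus_df = document_frequency(corpus)
    ref_df = document_frequency(reference)
    keywords = {}
    for phrase, count in corpus_df.items():
        if count < min_docs:
            continue
        corpus_rate = count / len(corpus)
        # Add-one smoothing: phrases absent from the reference get a small
        # non-zero rate, which avoids division by zero.
        ref_rate = (ref_df[phrase] + 1) / (len(reference) + 1)
        ratio = corpus_rate / ref_rate
        if ratio >= min_ratio:
            keywords[phrase] = ratio
    return keywords

def cooccurrence(corpus, keywords):
    """Count how many texts contain each pair of keywords together."""
    wanted = set(keywords)
    pairs = Counter()
    for text in corpus:
        present = sorted(set(ngrams(text)) & wanted)
        pairs.update(combinations(present, 2))
    return pairs
```

Here `corpus` would be a list of abstracts and `reference` a list of general-purpose texts. The pair counts returned by `cooccurrence` can be read as edge weights of a keyword graph; counting pairs across articles by the same author would follow the same pattern, grouping texts by author first.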

Here we demonstrate how this approach works for the collection of arXiv.org astro-ph abstracts. We also briefly discuss the possibility of revealing semantic relations between the extracted keywords using the Wikipedia database.

Last modified: Mon Jan 19 17:12:10 GMT 2009