A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

A reproducible workflow was developed for building and evaluating a Universal Dependencies-style parsing resource for Katharevousa parliamentary quest

Deep Analysis

Background

The study addresses the challenge of processing Katharevousa, an archaizing formal register of Modern Greek that is common in legal, administrative, and parliamentary archives. Existing natural language processing (NLP) pipelines handle this register poorly. The workflow aims to fill that gap by building a reproducible, Universal Dependencies-style parsing resource.

Key Points

The workflow involves several steps:

  1. OCR-Aware Reconstruction: Correcting OCR errors in scanned documents.
  2. Schema-Constrained LLM-Assisted Annotation: Using language models to annotate sentences while adhering to predefined schemas.
  3. Automatic Validation: Running automated checks on the annotations so that malformed or schema-violating output is caught.
  4. Deterministic CoNLL-U Snapshotting: Serializing the validated annotations to CoNLL-U in a repeatable way, so that the reference set can be frozen.
  5. Fixed-Split Evaluation: Using a single fixed train/test split so that every model is trained and evaluated on the same sentences.
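Steps 3 and 4 can be illustrated with a minimal Python sketch. It is not the paper's code: the function names `validate_sentence` and `snapshot`, the specific checks, and the example sentence are assumptions made for illustration, and multiword tokens and empty nodes are deliberately not handled.

```python
import hashlib

# The 17 universal POS tags defined by Universal Dependencies.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def validate_sentence(rows):
    """Return a list of problems for one sentence.

    rows: list of 10-column CoNLL-U token lines, already split on tabs.
    (Hypothetical helper; the checks in the actual pipeline may differ.)
    """
    problems = []
    n = len(rows)
    heads = {}
    for expected_id, cols in enumerate(rows, start=1):
        if len(cols) != 10:
            problems.append(f"token {expected_id}: expected 10 columns, got {len(cols)}")
            continue
        tok_id, upos, head = cols[0], cols[3], cols[6]
        if tok_id != str(expected_id):
            problems.append(f"token {expected_id}: unexpected ID {tok_id!r}")
        if upos not in UPOS_TAGS:
            problems.append(f"token {expected_id}: unknown UPOS {upos!r}")
        if not head.isdigit() or int(head) > n:
            problems.append(f"token {expected_id}: HEAD {head!r} out of range")
        else:
            heads[expected_id] = int(head)
    roots = [i for i, h in heads.items() if h == 0]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    # Follow each head chain upward; revisiting a node means there is a cycle.
    for start in heads:
        seen, node = set(), start
        while node != 0 and node in heads:
            if node in seen:
                problems.append(f"token {start}: head chain contains a cycle")
                break
            seen.add(node)
            node = heads[node]
    return problems

def snapshot(sentences):
    """Serialize {sent_id: rows} to CoNLL-U text in sorted order, plus a SHA-256 digest."""
    parts = []
    for sent_id in sorted(sentences):  # sorted IDs make the output order-independent
        lines = [f"# sent_id = {sent_id}"]
        lines += ["\t".join(cols) for cols in sentences[sent_id]]
        parts.append("\n".join(lines) + "\n\n")  # blank line ends each sentence
    text = "".join(parts)
    return text, hashlib.sha256(text.encode("utf-8")).hexdigest()

# Made-up example sentence, not taken from the reference set.
EXAMPLE = [
    ["1", "Η", "_", "DET", "_", "_", "2", "det", "_", "_"],
    ["2", "Βουλή", "_", "NOUN", "_", "_", "3", "nsubj", "_", "_"],
    ["3", "συνεδριάζει", "_", "VERB", "_", "_", "0", "root", "_", "_"],
    ["4", ".", "_", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]

print(validate_sentence(EXAMPLE))  # [] (no problems found)
text, digest = snapshot({"s0001": EXAMPLE})
print(digest)
```

Recording the digest alongside the frozen file gives a cheap way to confirm later that a copy of the reference set is byte-identical to the one that was evaluated.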

The workflow resulted in a frozen reference set containing 1,697 sentences, with 1,357 used for training and 340 reserved for testing (roughly an 80/20 split). This resource was used to benchmark several parsers: off-the-shelf Greek and Ancient Greek parsers, a feature-based parser, multilingual BERT (mBERT), XLM-R, and a custom-trained Stanza model.
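One way to obtain a split that stays fixed across machines and runs is to order sentence IDs by a salted hash and cut at a fixed size. The sketch below is an assumption about how such a split could be produced, not the paper's procedure; `fixed_split`, the salt value, and the ID scheme are hypothetical, and only the sizes (1,697 sentences, 340 for testing) come from the summary above.

```python
import hashlib

def fixed_split(sentence_ids, test_size, salt):
    """Split IDs into (train, test) in a way that ignores input order.

    Each ID is ranked by the SHA-256 of "salt:id", so the result depends
    only on the set of IDs, the salt, and test_size.
    """
    def rank(sid):
        return hashlib.sha256(f"{salt}:{sid}".encode("utf-8")).hexdigest()

    ordered = sorted(set(sentence_ids), key=lambda sid: (rank(sid), sid))
    test = sorted(ordered[:test_size])
    train = sorted(ordered[test_size:])
    return train, test

# Hypothetical sentence IDs; the real reference set uses its own identifiers.
ids = [f"s{i:04d}" for i in range(1, 1698)]
train_ids, test_ids = fixed_split(ids, test_size=340, salt="katharevousa-v1")
print(len(train_ids), len(test_ids))  # 1357 340
```

Whatever method produces the split, storing the resulting ID lists with the frozen snapshot is what actually guarantees that later experiments use the same sentences.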

The study found that:

  • Off-the-Shelf Systems: Showed significant register mismatch, with the best external baseline, spaCy Greek, achieving 0.4183 LAS.
  • XLM-R Model Performance: The strongest model overall, reaching UPOS accuracy of 0.8893, dependency-relation F1 of 0.7250, UAS of 0.6098, and LAS of 0.5162. This is a reported absolute gain of 0.0980 in LAS over the best external baseline (0.4183).
  • Feature-Based Model: Continued to be competitive for UPOS and relation labeling, suggesting that transparent lexical-context features remain relevant.
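For readers unfamiliar with the attachment metrics quoted above: UAS is the fraction of tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. The sketch below shows that standard definition on a made-up four-token sentence; `attachment_scores` is a hypothetical helper, and the study's own evaluation script may differ in details such as punctuation handling.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for token-aligned lists of (head, deprel) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    # UAS: head matches; LAS: head and label both match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / total, las_hits / total

# Made-up gold and predicted analyses for a four-token sentence.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "punct")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "punct")]

uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```

Here the second token has the right head but the wrong label, so it counts toward UAS only, and the fourth token has the wrong head, so it counts toward neither. LAS can therefore never exceed UAS.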

Significance

The paper's methodology provides a robust approach to turning challenging historical OCR data into reusable NLP infrastructure. The entire pipeline, including code, schema, frozen reference annotations, fixed train/test split, and per-model benchmark reports, is open-access, making it available for further research and development in the field of Katharevousa Greek parsing.

The findings underscore the importance of tailor-made models for specific registers and historical texts, highlighting that off-the-shelf solutions may not be sufficient. The transparent nature of the workflow also allows for reproducibility and validation by other researchers, contributing to the advancement of NLP tools for less-resourced languages and historical corpora.

Disclaimer: The above content is generated by AI and is for reference only.

NLP, Dependency Parsing, Universal Resources, Evaluation, OCR-Aware Reconstruction