BlonD: An Automatic Evaluation Metric for Document-level Machine Translation
Yuchen Jiang, Shuming Ma, Dongdong Zhang, Ming Zhou
Standard automatic metrics such as BLEU are problematic for document-level MT evaluation: they can neither distinguish document-level improvements from sentence-level ones nor identify which specific phenomena lead to translation errors.
To address these problems, we propose BlonD, an automatic metric for document-level machine translation evaluation. BlonD takes discourse coherence into consideration by calculating the recall and distance of check-pointing phrases and tags, and further provides comprehensive evaluation scores by combining these with n-grams.
We conduct an extensive comparison between BlonD and existing evaluation metrics to illustrate their critical distinctions. Experimental results show a large margin in sensitivity to document-level improvements between BlonD and those metrics. Human evaluation also reveals high Pearson R correlation values between BlonD scores and human judgments of translation quality.
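To make the idea of "recall of check-pointing phrases" concrete, the following is a minimal sketch and not the BlonD definition, which this abstract does not give. It assumes that a document is a list of sentences aligned one-to-one with the reference, that the check-pointing phrases are supplied as a plain list, and that matches are clipped per sentence in the way BLEU clips n-gram counts. The names `count_phrase` and `checkpoint_recall` are hypothetical, and the distance term and the combination with n-grams are deliberately left out.

```python
def count_phrase(sentence, phrase):
    """Count occurrences of `phrase` as a contiguous token sequence in `sentence`."""
    tokens = sentence.lower().split()
    target = phrase.lower().split()
    if not target:
        return 0
    width = len(target)
    return sum(
        1
        for i in range(len(tokens) - width + 1)
        if tokens[i:i + width] == target
    )


def checkpoint_recall(ref_sents, hyp_sents, checkpoints):
    """Document-level clipped recall of checkpoint phrases (illustrative only).

    Assumption: counts are clipped per aligned sentence pair and then summed
    over the whole document before dividing.
    """
    matched = 0
    total = 0
    for ref, hyp in zip(ref_sents, hyp_sents):
        for phrase in checkpoints:
            ref_count = count_phrase(ref, phrase)
            hyp_count = count_phrase(hyp, phrase)
            matched += min(ref_count, hyp_count)  # clip to the reference count
            total += ref_count
    return matched / total if total else 0.0


# Toy example: the hypothesis uses the wrong pronoun in the first sentence.
reference = ["she said she was tired", "he left"]
hypothesis = ["he said he was tired", "he left"]
score = checkpoint_recall(reference, hypothesis, ["she", "he"])
print(round(score, 3))
```

In this toy case only one of the three reference pronouns is recovered, so the sketch penalizes the pronoun error at the document level, which is the kind of discourse phenomenon the abstract says a document-level metric should expose.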