浅谈BLEU评分

Last updated on 2019-10-17…

BLEU，全称为Bilingual Evaluation Understudy（双语评估替换），是一种对生成语句进行评估的指标，用于比较候选文本翻译与其他一个或多个参考翻译的评价分数。

BLEU评分是由Kishore Papineni等人2002年的论文《BLEU: a Method for Automatic Evaluation of Machine Translation》中提出的。

这种评分标准是为了评估自动机器翻译系统的预测结果而开发的。

优点：计算速度快、计算成本低、容易理解、与具体语言无关、和人类给的评估高度相关。
缺点：不考虑语言表达（语法）上的准确性；测评精度会受常用词的干扰；短译句的测评精度有时会较高；没有考虑同义词或相似表达的情况，可能会导致合理翻译被否定。

除了翻译之外，BLEU评分结合深度学习方法可应用于其他的语言生成问题，例如：语言生成、图片标题生成、文本摘要、语音识别。

本文主要参考文章：链接1、链接2、链接3

n单位片段精确度

n-gram precision

1、n单位片段(n-gram)：指一个语句里面连续的n个单词组成的片段，一个18单词的语句有18个1-gram，每个单词都是一个1-gram；有17个2-gram，这个很好理解。

2、精确度(precision)：指Candidate语句里面的n-gram在所有Reference语句里面出现的概率。

Example 1：

Candidate 1：It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party .

在Example 1的Candidate 1 语句中，18个单词共有17个单词出现过，所以1-gram的precision是17/18；17个2-gram片段总共有10个出现过，所以2-gram的precision是10/17。

Example 2：

Candidate: the the the the the the the.

Reference 1: The cat is on the mat.

Reference 2: There is a cat on the mat.

在Example 2的Candidate 语句1-gram的Precision是7/7。

所以我们发现这个方法存在一个问题，就是可能Reference里面的单词会被重复利用，这是不合理的。所以有了“修正的n-单位精确度”(modified n-gram recision)。

modified n-gram recision

修正的n-单位精确度：主要思路是Reference语句里面如果一个单词片段已经被匹配，那么这个片段就不能再次被匹配，并且一个单词片段只能取一个Reference语句中出现次数的最大值，比如7个the分别在Reference 1 和 2中出现2和1次，所以取2而不是两者相加的3。

利用以上方法，每一个句子都可以得到一个modified n-gram recision，一个句子不能代表文本翻译的水平高低，于是把一段话或者所有翻译句子的结果综合起来可以得到p_n

简而言之，就是把所有句子的modified n-gram precision的分子加起来除以分母加起来。

BP值和BLEU值

对于不同的长度n都会有一个p_n，那么如何将不同n的p_n结合起来得到最终的Bleu值。研究者们还考虑到一种情况，就是待测译文翻译不完全不完整的情况，这个问题在机器翻译中是不能忽略的，而简单的p_n值不能反映这个问题，例如Example 3。

Example 3：

Candidate: of the

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

这个问题也不能用recall来解决，例如Example 4. 显然Candidate 1的回召率比Candidate 2要高，但是显然Candidate 1的翻译不如Candidate 2。所以recall并不能解决这个问题。

Example 4：

Candidate 1: I always invariably perpetually do.

Candidate 2: I always do.

Reference 1: I always do.

Reference 2: I invariably do.

Reference 3: I perpetually do.

首先引入BP值(Brevity Penalty)，作者指定当待评价译文同任意一个参考译文长度相等或超过参考译文长度时，BP值为1，当待评价译文的长度较短时，则用一个算法得出BP值。以c来表示待评价译文的长度，r来表示参考译文的文字长度，则

BLEU值(Bilingual Evaluation Understudy)计算为

在对数情况下，计算变得更加简便

通常这个N取4，w_n=1/4，这就是很多论文里面的一个经典指标Bleu4

计算BLEU分数（python）

Python自然语言工具包库（NLTK）提供了BLEU评分的实现，你可以使用它来评估生成的文本，通过与参考文本对比。

语句BLEU分数

NLTK提供了sentence_bleu()函数，用于根据一个或多个参考语句来评估候选语句。

参考语句必须作为语句列表来提供，其中每个语句是一个记号列表。

候选语句作为一个记号列表被提供。

from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'a', 'test'], ['this', 'is' 'test']]
candidate = ['this', 'is', 'a', 'test']

score = sentence_bleu(reference, candidate)
print(score)

运行这个例子，会输出一个满分，因为候选语句完全匹配其中一个参考语句

1.0

语料库BLEU分数

NLTK还提供了一个称为corpus_bleu()的函数来计算多个句子（如段落或文档）的BLEU分数。

参考文本必须被指定为文档列表，其中每个文档是一个参考语句列表，每个可替换的参考语句是记号列表。

候选文档必须被指定为列表，其中每个文件是一个记号列表。

from nltk.translate.bleu_score import corpus_bleu

references = [[['this', 'is', 'a', 'test'], ['this', 'is' 'test']]]
candidates = [['this', 'is', 'a', 'test']]

score = corpus_bleu(references, candidates)
print(score)

运行这个例子就像之前一样输出满分

1.0

此外，NLTK中提供的BLEU评分方法允许你在计算BLEU分数时为不同的n元组指定权重。这使你可以灵活地计算不同类型的BLEU分数，如单独和累加的n-gram分数。

单独的N-Gram分数

单独的N-gram分数是对特定顺序的匹配n元组的评分，例如单个单词（称为1-gram）或单词对（称为2-gram或bigram）。

权重被指定为一个数组，其中每个索引对应相应次序的n元组。仅要计算1-gram匹配的BLEU分数，你可以指定1-gram权重为1，对于2元,3元和4元指定权重为0，也就是权重为（1,0,0,0）。例如：

from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']

score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
print(score)

输出

0.75

我们可以重复这个例子，对于从1元到4元的各个n-gram运行语句如下所示：

# n-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'a', 'test']

print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1))

运行该示例，结果如下所示：

Individual 1-gram: 1.000000
Individual 2-gram: 1.000000
Individual 3-gram: 1.000000
Individual 4-gram: 1.000000

虽然我们可以计算出单独的BLEU分数，但这并不是使用这个方法的初衷，而且得出的分数也没有过多的含义，或者看起来具有说明性。

累加的N-Gram分数

累加分数是指对从1到n的所有单独n-gram分数的计算，通过计算加权几何平均值来对它们进行加权计算。

默认情况下，sentence_bleu（）和corpus_bleu（）分数计算累加的4元组BLEU分数，也称为BLEU-4分数。

BLEU-4对1元组，2元组，3元组和4元组分数的权重为1/4（25％）或0.25。例如：

# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']

score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score) 

运行这个例子，输出下面的分数：

0.707106781187

累加的和单独的1元组BLEU使用相同的权重，也就是（1,0,0,0）。计算累加的2元组BLEU分数为1元组和2元组分别赋50％的权重，计算累加的3元组BLEU为1元组，2元组和3元组分别为赋33％的权重。

让我们通过计算BLEU-1，BLEU-2，BLEU-3和BLEU-4的累加得分来具体说明：

# cumulative BLEU scores
from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']

print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))

运行该示例输出下面的分数。结果的差别很大，比单独的n-gram分数更具有表达性。

Cumulative 1-gram: 0.750000
Cumulative 2-gram: 0.500000
Cumulative 3-gram: 0.632878
Cumulative 4-gram: 0.707107

在描述文本生成系统的性能时，通常会报告从BLEU-1到BLEU-4的累加分数。

更详细的实验代码点这个链接。

From Neural Machine Translation