=== INTRODUCTION

These instructions explain how to build simple tools for restoring
punctuation and case information in your machine translation (MT)
output. We assume here that the output is in English.

The tools are based on commands of the SRI LM Toolkit, which can be
freely downloaded from http://www.speech.sri.com/projects/srilm. In
particular, the following commands of the SRI LM Toolkit will be used:

- ngram-count  , to estimate and build n-gram language models (LMs)
- hidden-ngram , to insert missing punctuation marks
- disambig     , to perform case restoration 

Remark 1: as training data you should use English texts compliant with
          the restrictions set by the evaluation track you are
          participating in.

Remark 2: the proposed tools are based on statistical models and are
          therefore prone to errors. Performance can likely be
          improved by using more refined statistical models and more
          training data.

=== PUNCTUATION INFORMATION RESTORATION

You need to train an n-gram LM including punctuation marks but no case
information. In order to better recover question marks, we suggest
formatting the training data for the LM as follows:

  <s> ? can i sit here ? </s>
  <s> . make some room for us , please . </s>
  <s> ? can we share the table ? </s>
  <s> . pass the salt , please . </s>

That is, you should repeat the final punctuation mark at the beginning
of each sentence. If train-en.txt is the training corpus, you can
create a 3-gram LM file (train-en.lm) by running the command:

  $> ngram-count -text train-en.txt -lm train-en.lm -order 3 -unk -kndiscount1 -kndiscount2 -kndiscount3
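
If your corpus is not yet in the format shown above, it can be
converted with a short filter. The following is only a minimal sketch:
the function name add_sentence_marks and the input file raw-en.txt are
illustrative assumptions, and the input is assumed to contain one
lowercased, tokenized sentence per line, ending with its final
punctuation mark as a separate token.

```shell
# add_sentence_marks: read one sentence per line on stdin and write
#   <s> MARK sentence </s>
# where MARK is a copy of the last token of the line.
# Assumption: the last token of every line is its final punctuation mark.
add_sentence_marks() {
  awk 'NF > 0 {
         $1 = $1                      # normalize spacing between tokens
         print "<s>", $NF, $0, "</s>"
       }'
}
```

It could be used as follows before running ngram-count:

  $> add_sentence_marks < raw-en.txt > train-en.txt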

Next, let test-en.txt be a text file without punctuation marks, in
the following format:

  <s> i'd like coffee with cream please </s>
  <s> i'd like decaffeinated coffee please </s>
  <s> may i have some butter </s>

Punctuation marks can be inserted by running the following command,
where punct.txt is a text file containing the punctuation marks to be
inserted.

  $> hidden-ngram -text test-en.txt -lm train-en.lm -hidden-vocab punct.txt -order 3 -keep-unk > test-en.punc.txt

Note: punct.txt must contain exactly one punctuation mark per line.
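
As an illustration, such a file can be written with a single printf
call. This is a sketch only: the function name write_punct_vocab is an
assumption, and the three marks are simply those that appear in the
examples of this document. Extend the list to match your data.

```shell
# write_punct_vocab FILE: write one punctuation mark per line to FILE.
# Assumption: only the marks used in the examples (. , ?) are needed.
write_punct_vocab() {
  printf '%s\n' '.' ',' '?' > "$1"
}
```

For example:

  $> write_punct_vocab punct.txt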

The output of the above command should look like this:

  <s> . i'd like coffee with cream , please . </s>
  <s> . i'd like decaffeinated coffee , please . </s>
  <s> ? may i have some butter . </s>

Note that the first and last punctuation marks may disagree, as in the
third sentence above. In that case, we suggest replacing the last
punctuation mark with the first one. After this post-processing, the
first punctuation mark should be removed.
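
This post-processing can be sketched as a small awk filter. The
function name fix_final_mark and the output file name are illustrative
assumptions. The sketch also assumes that every line has the form
"<s> MARK words ... MARK </s>", i.e. that hidden-ngram inserted a mark
both right after <s> and right before </s>; lines without both tags
are passed through unchanged.

```shell
# fix_final_mark: for each line of the form
#   <s> FIRST w1 ... wn LAST </s>
# overwrite LAST with FIRST, then drop FIRST.
# Assumption: field 2 is the sentence-initial mark and the field before
# </s> is the sentence-final mark.
fix_final_mark() {
  awk 'NF >= 4 && $1 == "<s>" && $NF == "</s>" {
         $(NF - 1) = $2                 # last mark := first mark
         out = $1
         for (i = 3; i <= NF; i++)      # copy everything except field 2
           out = out " " $i
         print out
         next
       }
       { print }'
}
```

For example:

  $> fix_final_mark < test-en.punc.txt > test-en.punc.fixed.txt

With the third output line above, this yields
"<s> may i have some butter ? </s>".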

=== CASE INFORMATION RESTORATION

First, build a case-sensitive LM (Train-En.lm), including punctuation
marks, from a corpus (Train-En.txt) with the following format:

  <s> Please input your pin number . </s>
  <s> This is my first time diving . </s>
  <s> I've never heard of this address around here . </s>
  <s> I have a sore pain here . </s>
  <s> Go straight until you see a drugstore . </s>

Build the LM with the command:

  $> ngram-count -order 3 -text Train-En.txt -lm Train-En.lm -unk -kndiscount1 -kndiscount2 -kndiscount3

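Second, prepare a file (Map.txt) listing, for each lowercased word,
all of its case variants observed in the training corpus, one word per
line. You can collect the variants with the following Perl script,
which reads the corpus from standard input and writes the map to
standard output: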

--
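#! /usr/bin/perl
# Reads the training corpus on standard input and prints, for each
# lowercased word, the list of case variants observed for it.
use strict;
use warnings;

my %map;

# Split the input into one token per line.
open(my $inp, "tr -s ' ' '\\n' |") or die "cannot run tr: $!";

while (my $tk = <$inp>) {
  chomp $tk;                 # remove the newline only, never a letter
  next if $tk eq "";         # skip empty tokens
  (my $nctk = $tk) =~ tr/A-Z/a-z/;
  $map{$nctk}{$tk}++;
}
close($inp);

# One line per lowercased word: the word, then its case variants.
foreach my $nctk (sort keys %map) {
  print "$nctk";
  foreach my $tk (sort keys %{$map{$nctk}}) { printf(" %s", $tk); }
  print "\n";
}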
--

Finally, pass the case-insensitive text (test-en.txt), including
punctuation marks, through the following command:

  $> disambig -lm Train-En.lm -order 3 -map Map.txt -text test-en.txt -keep-unk -fb
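
The keys of Map.txt are obtained by lowercasing the letters A-Z only,
so the text given to disambig should be lowercased in the same way. If
your text still carries case information, it can be normalized first.
A minimal sketch, where the function name lowercase_ascii and the file
name test-en.cased.txt are illustrative assumptions:

```shell
# lowercase_ascii: lowercase the ASCII letters A-Z on stdin, leaving all
# other characters (including the <s> and </s> tags) unchanged.
lowercase_ascii() {
  tr 'A-Z' 'a-z'
}
```

For example:

  $> lowercase_ascii < test-en.cased.txt > test-en.txt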

---------------------
Marcello Federico
ITC-irst, Trento

July 3, 2006