segrev
Chenchen Ding & Masao Utiyama
Thu Jul 21 16:03:50 JST 2016

* Introduction

segrev.cpp is the software used in

  Chenchen Ding, Masao Utiyama and Eiichiro Sumita. (2015)
  Improving fast_align by Reordering. EMNLP

segrev.cpp is released under the MIT license.

* Compiling

  g++ -o segrev segrev.cpp

* Usage

  segrev -s train.src -t train.tgt -a align.src-tgt -ri r_index -rs r_src -ra r_align

    train.{src, tgt} : parallel corpus for training
    align.src-tgt    : alignment generated by a word aligner from train.*.
                       It is compatible with the output of fast_align.
    r_index          : permuted word indices of train.src after seg_rev
    r_src            : permuted words of train.src after seg_rev (only used for checking)
    r_align          : permuted source indices of align.src-tgt after seg_rev (only used for checking)

  segrev -s r_src -i r_index -a align.r_src-tgt -ri rec_index -rs rec_src -ra rec_align -rec

    r_{src, index}   : as above
    align.r_src-tgt  : alignment generated by a word aligner from r_src and train.tgt
    rec_align        : final word alignment between the original train.*
    rec_{index, src} : only used for checking

* Example

A demo of using segrev to generate word alignment for en->ja from train.en and train.ja.
We use fast_align as the word aligner.

Auxiliary processing:

  for_fa.py            : a script that prepares the input format required by fast_align.
  clean_empty_align.py : segrev crashes when a sentence pair has no aligned word,
                         so this script inserts a dummy 0-0 link for such pairs.

Please prepare these scripts according to your needs.
The following commands may be used for that purpose.

  paste train.en train.ja | awk -F'\t' '{print $1, "|||", $2}'
  awk '{if(NF==0){print "0-0"}else{print}}'

Note that the word alignment in the ja->en direction should be prepared in the same way;
the symmetrization heuristics can then be applied.

#
# round-1
#
for_fa.py train.en train.ja > train.en-ja.0
fast_align -i train.en-ja.0 -d -o -v | clean_empty_align.py > align.en-ja.0
segrev -s train.en -t train.ja -a align.en-ja.0 -ri r_i.0 -rs r_s.0 -ra r_a.0
segrev -s r_s.0 -t train.ja -a r_a.0 -ri rr_i.0 -rs rr_s.0 -ra rr_a.0

#
# round-2
#
for_fa.py rr_s.0 train.ja > train.en-ja.1
fast_align -i train.en-ja.1 -d -o -v | clean_empty_align.py > align.en-ja.1
segrev -s rr_s.0 -i rr_i.0 -a align.en-ja.1 -ri rrec_i.0 -rs rrec_s.0 -ra rrec_a.0 -rec
segrev -s rrec_s.0 -i r_i.0 -a rrec_a.0 -ri rec_i.0 -rs rec_s.0 -ra rec_align.en-ja.1 -rec
segrev -s train.en -t train.ja -a rec_align.en-ja.1 -ri r_i.1 -rs r_s.1 -ra r_a.1
segrev -s r_s.1 -t train.ja -a r_a.1 -ri rr_i.1 -rs rr_s.1 -ra rr_a.1

#
# round-3
#
for_fa.py rr_s.1 train.ja > train.en-ja.2
fast_align -i train.en-ja.2 -d -o -v | clean_empty_align.py > align.en-ja.2
segrev -s rr_s.1 -i rr_i.1 -a align.en-ja.2 -ri rrec_i.1 -rs rrec_s.1 -ra rrec_a.1 -rec
segrev -s rrec_s.1 -i r_i.1 -a rrec_a.1 -ri rec_i.1 -rs rec_s.1 -ra rec_align.en-ja.2 -rec
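
The two auxiliary scripts are not distributed with this README, so the following is only a minimal Python sketch of what they might look like. It mirrors the two awk one-liners given under "Auxiliary processing". The function names (to_fast_align_line, to_fast_align_lines, fill_empty_alignment, fill_empty_alignments) are hypothetical and chosen for illustration; they are not part of segrev or fast_align.

```python
from typing import Iterable, Iterator


def to_fast_align_line(src: str, tgt: str) -> str:
    """Join one sentence pair as 'source ||| target' (hypothetical for_fa.py core)."""
    return f"{src.strip()} ||| {tgt.strip()}"


def to_fast_align_lines(src_lines: Iterable[str],
                        tgt_lines: Iterable[str]) -> Iterator[str]:
    """Pair up two line-aligned corpora, one output line per sentence pair."""
    for src, tgt in zip(src_lines, tgt_lines):
        yield to_fast_align_line(src, tgt)


def fill_empty_alignment(line: str) -> str:
    """Return '0-0' for a line with no alignment links, else the line itself
    without its trailing newline (hypothetical clean_empty_align.py core)."""
    if not line.split():
        return "0-0"
    return line.rstrip("\n")


def fill_empty_alignments(lines: Iterable[str]) -> Iterator[str]:
    """Apply fill_empty_alignment to every alignment line."""
    for line in lines:
        yield fill_empty_alignment(line)
```

To use these as command-line filters, one could wrap to_fast_align_lines around two opened corpus files and fill_empty_alignments around sys.stdin, printing each yielded line. Note that zip stops at the shorter input, so this sketch assumes both corpus files have the same number of lines.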