Alignment of Reuters Corpora
We aligned English sentences in RCV1 with Japanese sentences in
RCV2. Both corpora are available from NIST.
We make this data available to the public under the Permitted Uses provision quoted below:
"Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is not possible to reconstruct the information from these summaries."
Sample
Download
- reuters-je-11.txt.gz (56782 one-to-one sentence alignments, 7548538 bytes, euc-jp-unix)
- reuters-je-nm.txt.gz (13338 one-to-many sentence alignments, 2386892 bytes, euc-jp-unix)
- split.zip (the same Japanese and English sentences as above, stored in separate shift-jis-dos and ascii-dos files; the Japanese sentences were analyzed with ChaSen; 8099454 bytes)
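The .txt.gz files above are gzip-compressed text in EUC-JP with Unix line endings, so they need to be decoded explicitly on most systems. The following is a minimal Python sketch of one way to read them. The helper name iter_lines is hypothetical, and the sketch makes no assumption about how each line is laid out inside the files; it only yields decoded lines.

```python
import gzip
from typing import Iterator


def iter_lines(path: str, encoding: str = "euc_jp") -> Iterator[str]:
    """Yield decoded lines from a gzip-compressed text file, without line endings.

    `iter_lines` is a hypothetical helper, not part of the distributed data.
    """
    # Text mode ("rt") decodes the bytes with the given codec.
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\r\n")


# Example use (assumes reuters-je-11.txt.gz has been downloaded locally):
# for number, line in enumerate(iter_lines("reuters-je-11.txt.gz")):
#     if number >= 5:
#         break
#     print(line)
```

The files inside split.zip are not gzip-compressed, so this helper does not apply to them directly. After unzipping, the Japanese files can presumably be read with Python's built-in open using the shift_jis codec, and the English files with ascii.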
How to cite this data
The following article should be cited:
Please cite it instead of the web page you are now reading, because
this data was created using the method it describes. Other JE corpora
are available from the first author's
web site.
Last updated: 2005/11/18