In the name of God, the Most Gracious, the Most Merciful
python run_giz.py [giza_bin_dir] [tokenizer_script_path(input null, if text is pre-tokenized)] [cleaner_script_path] [build_dir] [src_file] [trg_file] [src_lang_type] [trgt_lang_type] [min_len] [max_len]
This script was created by Mohammad Sadegh Rasooli during his PhD research at Columbia University. It lets you run Giza++ (along with MKCLS) without having to run the Moses decoder. The output is the word alignment from the target language to the source language and vice versa.
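As a rough illustration of the usage line above, the following sketch assembles the ten positional arguments in the documented order. The helper name build_run_giz_command and every path and language value in the example are hypothetical placeholders, not part of the script; only the argument order and the literal "null" for pre-tokenized text come from the usage line.

```python
def build_run_giz_command(giza_bin_dir, cleaner_script, build_dir,
                          src_file, trg_file, src_lang, trg_lang,
                          min_len, max_len, tokenizer_script=None):
    """Assemble the argument list in the order given by the usage line.

    This is a hypothetical helper for illustration; run_giz.py itself
    only sees the resulting positional arguments.
    """
    # The usage line says to pass the literal string "null" when the
    # text is already tokenized.
    tokenizer_arg = tokenizer_script if tokenizer_script else "null"
    return [
        "python", "run_giz.py",
        giza_bin_dir, tokenizer_arg, cleaner_script, build_dir,
        src_file, trg_file, src_lang, trg_lang,
        str(min_len), str(max_len),
    ]


if __name__ == "__main__":
    # All values below are placeholders chosen for the example.
    cmd = build_run_giz_command(
        giza_bin_dir="/path/to/giza/bin",
        cleaner_script="/path/to/cleaner_script",
        build_dir="build",
        src_file="corpus.src",
        trg_file="corpus.trg",
        src_lang="src",
        trg_lang="trg",
        min_len=1,
        max_len=80,
    )
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch the script, assuming run_giz.py is in the current directory.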
The following preprocessing steps are done while running the script:
corpus.tok.clean.[lang_id] are the files in [build_dir] after the corpus has been cleaned by sentence length, and corpus.tok.clean.lower.[lang_id] are the same files after lowercasing. The alignment is run on the lowercased files. If you want the original casing, you can apply the alignments directly to the corpus.tok.clean.[lang_id] files, since lowercasing changes neither the line order nor the token positions.
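The paragraph above relies on the cased and lowercased files being parallel line by line and token by token. A minimal sketch of how one might verify that and then read alignment links against the cased tokens is given below. The function names are hypothetical, and the representation of alignment links as 0-based (source index, target index) pairs is an assumption made for the example; the actual output format of the script is not described here and would need to be parsed into that form first.

```python
def check_token_parallel(cased_lines, lowered_lines):
    """Return the indices of lines whose token counts differ.

    An empty result means positions in the lowercased file can be used
    directly as positions in the cased file.
    """
    cased_lines = list(cased_lines)
    lowered_lines = list(lowered_lines)
    if len(cased_lines) != len(lowered_lines):
        raise ValueError("files have different numbers of lines")
    mismatches = []
    for index, (cased, lowered) in enumerate(zip(cased_lines, lowered_lines)):
        if len(cased.split()) != len(lowered.split()):
            mismatches.append(index)
    return mismatches


def recase_pairs(cased_src_tokens, cased_trg_tokens, links):
    """Turn index links into pairs of cased tokens.

    `links` is assumed to hold 0-based (source_index, target_index)
    pairs; this is an illustrative format, not the script's own.
    """
    return [(cased_src_tokens[i], cased_trg_tokens[j]) for i, j in links]


if __name__ == "__main__":
    cased = ["The House is big"]
    lowered = ["the house is big"]
    print(check_token_parallel(cased, lowered))

    src_tokens = "The House is big".split()
    trg_tokens = "Das Haus ist gross".split()
    links = [(0, 0), (1, 1), (2, 2), (3, 3)]
    print(recase_pairs(src_tokens, trg_tokens, links))
```

In practice the line lists would be read from corpus.tok.clean.[lang_id] and corpus.tok.clean.lower.[lang_id] in [build_dir].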
In [build_dir] you can find the following file patterns:
"Wikily" Neural Machine Translation Tailored to Cross-Lingual Tasks.
A rule-based analyzer for Persian verbs
Feed-forward POS tagger
A semi-supervised tagger (with partial tagging and without tagging dictionary)
Yara K-Beam Arc-Eager Dependency Parser
A semi-supervised parser
A full pipeline for YaraParser