An implementation of Multi-Criteria Chinese Word Segmentation with Transformer, built with fastNLP, a lightweight NLP toolkit that aims to reduce the amount of engineering code in user projects (data-processing loops, training loops, multi-GPU execution, and so on).

We use the same datasets listed in the paper.
First, install OpenCC to convert between Traditional Chinese and Simplified Chinese:

```bash
pip install opencc-python-reimplemented
```
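For reference, this is how conversion works with opencc-python-reimplemented; the data-preparation scripts use it to bring the Traditional Chinese corpora (AS, CKIP, CITYU) in line with the Simplified ones:

```python
# Convert Traditional Chinese to Simplified with opencc-python-reimplemented.
from opencc import OpenCC

cc = OpenCC('t2s')  # 't2s' = Traditional-to-Simplified conversion profile
print(cc.convert('漢語分詞'))  # -> 汉语分词
```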
Then set a path for the processed data and run the shell script to process it:

```bash
export DATA_DIR=path/to/processed-data
bash make_data.sh path/to/sighan2005 path/to/sighan2008
```

Processing takes a few minutes to finish.
We use a Transformer encoder to build the model, as described in the paper.
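For orientation, below is a minimal PyTorch sketch of a Transformer encoder applied to CWS as character-level sequence labeling (BMES tags). It is illustrative only, omits details such as positional encodings, and is not the actual model in model.py; see the paper for the real architecture and hyper-parameters:

```python
import torch
import torch.nn as nn

class TransformerSegmenter(nn.Module):
    def __init__(self, vocab_size, num_tags=4, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_tags)  # one score per B/M/E/S tag

    def forward(self, chars):                  # chars: (batch, seq_len) character IDs
        x = self.embed(chars).transpose(0, 1)  # default layout is (seq, batch, d_model)
        h = self.encoder(x).transpose(0, 1)    # back to (batch, seq, d_model)
        return self.classifier(h)              # (batch, seq, num_tags) tag logits

# Example: tag logits for a batch of two 5-character sentences.
logits = TransformerSegmenter(vocab_size=8000)(torch.randint(0, 8000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 4])
```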
Finally, run the shell script to train the model. `train.sh` takes one argument, the comma-separated GPU IDs to use. For example:

```bash
bash train.sh 0,1
```

This command uses the GPUs with IDs 0 and 1.
Note: please refer to the paper for the details of the hyper-parameters, and modify the settings in `train.sh` to match your experiment environment.
Run

```bash
python main.py --help
```

to see all the arguments that can be specified for training.
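As a hedged sketch, this is how a training entry point commonly consumes a comma-separated GPU list like the one passed to `train.sh`; the `--devices` flag is illustrative, not necessarily an argument of main.py (check `python main.py --help` for the real ones):

```python
import argparse
import os

parser = argparse.ArgumentParser(description="Train the multi-criteria CWS model")
parser.add_argument("--devices", default="0", help="comma-separated GPU IDs, e.g. 0,1")
args = parser.parse_args()

# Setting CUDA_VISIBLE_DEVICES before any CUDA context is created makes
# only the listed GPUs visible to the process.
os.environ["CUDA_VISIBLE_DEVICES"] = args.devices
```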
Results on the test sets of the eight CWS datasets with multi-criteria learning:
Dataset | MSRA | AS | PKU | CTB | CKIP | CITYU | NCC | SXU | Avg. |
---|---|---|---|---|---|---|---|---|---|
Original paper | 98.05 | 96.44 | 96.41 | 96.99 | 96.51 | 96.91 | 96.04 | 97.61 | 96.87 |
Ours | 96.92 | 95.71 | 95.65 | 95.96 | 96.00 | 96.09 | 94.61 | 96.64 | 95.95 |
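As a quick sanity check, the Avg. column is the unweighted mean of the eight per-dataset scores:

```python
# Reproduce the Avg. cell of the "Ours" row.
ours = [96.92, 95.71, 95.65, 95.96, 96.00, 96.09, 94.61, 96.64]
print(round(sum(ours) / len(ours), 2))  # 95.95
```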