An implementation of Multi-Criteria Chinese Word Segmentation with Transformer, built with fastNLP.

We use the same datasets listed in the paper.
First, install OpenCC to convert between Traditional Chinese and Simplified Chinese:

```bash
pip install opencc-python-reimplemented
```
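As a quick sanity check that the conversion works (this snippet is only an illustration, not part of the data pipeline; `t2s` is one of OpenCC's built-in Traditional-to-Simplified configs):

```python
from opencc import OpenCC

# 't2s': Traditional Chinese -> Simplified Chinese
cc = OpenCC('t2s')
print(cc.convert('漢語分詞'))  # -> 汉语分词
```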
Then, set a path to save the processed data, and run the shell script to process the data:

```bash
export DATA_DIR=path/to/processed-data
bash make_data.sh path/to/sighan2005 path/to/sighan2008
```

The whole process takes a few minutes to finish.
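For reference, the heart of CWS preprocessing is turning space-segmented sentences into character-level BMES tags; the sketch below shows that conversion (the helper `to_bmes` is hypothetical, not a function from `data-process.py`):

```python
def to_bmes(sentence):
    """Convert a space-segmented sentence into (char, tag) pairs.

    Tags: B = begin of word, M = middle, E = end, S = single-char word.
    """
    pairs = []
    for word in sentence.split():
        if len(word) == 1:
            pairs.append((word, 'S'))
        else:
            pairs.append((word[0], 'B'))
            for ch in word[1:-1]:
                pairs.append((ch, 'M'))
            pairs.append((word[-1], 'E'))
    return pairs

print(to_bmes('自然 语言 处理 好'))
# [('自', 'B'), ('然', 'E'), ('语', 'B'), ('言', 'E'),
#  ('处', 'B'), ('理', 'E'), ('好', 'S')]
```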
We use the Transformer to build the model, as described in the paper.
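To give a rough picture of the architecture (a minimal PyTorch sketch under assumed hyper-parameters, not the repo's `transformer.py` or the exact model from the paper), the model is essentially a character-level Transformer encoder topped with a per-character tag classifier:

```python
import torch
import torch.nn as nn

class TransformerTagger(nn.Module):
    """Character-level Transformer encoder with a BMES tagging head (sketch)."""

    def __init__(self, vocab_size, num_tags=4, d_model=256, nhead=4,
                 num_layers=6, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_tags)

    def forward(self, char_ids, pad_mask=None):
        # char_ids: (batch, seq_len); pad_mask: True at padding positions
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        h = self.embed(char_ids) + self.pos(positions)
        h = self.encoder(h, src_key_padding_mask=pad_mask)
        return self.classifier(h)  # (batch, seq_len, num_tags)

# Toy forward pass: batch of 2 sentences, 5 characters each
model = TransformerTagger(vocab_size=5000)
print(model(torch.randint(0, 5000, (2, 5))).shape)  # torch.Size([2, 5, 4])
```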
Finally, to train the model, run the shell script. `train.sh` takes one argument, the GPU IDs to use, for example:

```bash
bash train.sh 0,1
```

This command uses the GPUs with IDs 0 and 1.
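Conceptually, training the tagger boils down to cross-entropy over per-character tags, with padded positions masked out (a generic sketch continuing the model above, not the repo's `train.py`):

```python
import torch
import torch.nn as nn

# Generic training step for the tagger sketched above; not the repo's train.py.
criterion = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks padded positions

def train_step(model, optimizer, char_ids, tags):
    # char_ids, tags: (batch, seq_len); padded tag positions set to -100
    optimizer.zero_grad()
    logits = model(char_ids)                      # (batch, seq_len, num_tags)
    loss = criterion(logits.reshape(-1, logits.size(-1)), tags.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```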
Note: please refer to the paper for details of the hyper-parameters, and modify the settings in `train.sh` to match your experiment environment.
Type

```bash
python main.py --help
```

to see all the arguments that can be specified for training.
Results on the test sets of eight CWS datasets with multi-criteria learning:

| Dataset | MSRA | AS | PKU | CTB | CKIP | CITYU | NCC | SXU | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Original paper | 98.05 | 96.44 | 96.41 | 96.99 | 96.51 | 96.91 | 96.04 | 97.61 | 96.87 |
| Ours | 96.92 | 95.71 | 95.65 | 95.96 | 96.00 | 96.09 | 94.61 | 96.64 | 95.95 |