fastNLP

Introduction

This is a PyTorch implementation of the paper Convolutional Neural Networks for Sentence Classification (Kim, 2014).

Dataset and model: the MR dataset with the non-static model (word2vec vectors trained by Mikolov et al. (2013) on 100 billion words of Google News).
It runs on both CPU and GPU.
The best accuracy is 82.62%, which is higher than the 81.5% reported in the paper.
(By Jingyuan Liu @ Fudan University; email: fdjingyuan@outlook.com. Discussion is welcome!)

Requirement

python 3.6
pytorch > 0.1
numpy
gensim

Run

STEP 1
Install the required packages, for example gensim (the other packages are installed the same way):

pip install gensim

STEP 2
Download the MR dataset and the word2vec vectors:

MR dataset: download it from (https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz)
word2vec: download the file from (https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)

Because the word2vec file is larger than 1.5 GB, it is not included in the repository. After downloading it, remember to update the default path in the function def word_embeddings(path = './GoogleNews-vectors-negative300.bin/'):
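
For illustration, here is a minimal sketch of how such a loader could check the path before reading the vectors. The helper name load_word2vec is hypothetical and is not the repository's word_embeddings function; the sketch also assumes the path points at the binary file itself. It relies on gensim's KeyedVectors.load_word2vec_format, which reads the word2vec binary format.

```python
import os


def load_word2vec(path="./GoogleNews-vectors-negative300.bin"):
    """Load pre-trained word2vec vectors (hypothetical helper, not the repo's code)."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"word2vec file not found at {path!r}; download it and update the path"
        )
    # Imported lazily so the path check works even before gensim is installed.
    from gensim.models import KeyedVectors

    # binary=True because the GoogleNews file uses word2vec's binary format.
    return KeyedVectors.load_word2vec_format(path, binary=True)
```

Failing early with a clear message avoids a confusing error partway through training when the large file has not been downloaded yet.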

STEP 3
train the model

python train.py

Training progress is printed to the screen, for example:

Epoch [1/20], Iter [100/192] Loss: 0.7008
Test Accuracy: 71.869159 %
Epoch [2/20], Iter [100/192] Loss: 0.5957
Test Accuracy: 75.700935 %
Epoch [3/20], Iter [100/192] Loss: 0.4934
Test Accuracy: 78.130841 %

......
Epoch [20/20], Iter [100/192] Loss: 0.0364
Test Accuracy: 81.495327 %
Best Accuracy: 82.616822 %
Best Model: models/cnn.pkl

Hyperparameters

According to the paper and experiment, I set:

Epoch	Kernel Size	dropout	learning rate	batch size
20	(h,300,100)	0.5	0.0001	50

h = [3,4,5]
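
To make the kernel sizes concrete, the sketch below works out the feature sizes they imply. It reads (h,300,100) as a window of h words over 300-dimensional embeddings with 100 filters per window size, and it assumes stride 1, no padding, and max-over-time pooling as in Kim's paper; these are assumptions, not details confirmed by this README. The function name conv_feature_sizes is hypothetical.

```python
def conv_feature_sizes(sentence_len, window_sizes=(3, 4, 5), num_filters=100):
    """Return per-window feature-map lengths and the pooled feature size."""
    if sentence_len < max(window_sizes):
        raise ValueError("sentence must be at least as long as the widest window")
    # A window of h words slides over the sentence: sentence_len - h + 1 positions
    # (assumes stride 1 and no padding).
    map_lengths = {h: sentence_len - h + 1 for h in window_sizes}
    # Max-over-time pooling keeps one value per filter, whatever the map length.
    pooled_size = len(window_sizes) * num_filters
    return map_lengths, pooled_size


lengths, pooled = conv_feature_sizes(sentence_len=20)
print(lengths)  # {3: 18, 4: 17, 5: 16}
print(pooled)   # 300
```

Under these assumptions the classifier always sees a 300-dimensional vector (3 window sizes x 100 filters), regardless of sentence length.
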
If the accuracy does not improve, the learning rate is multiplied by 0.8.
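
The decay rule can be sketched in plain Python as follows. This is only an illustration: it assumes "not improved" means "not better than the best accuracy seen so far" and that the check happens once per epoch, neither of which is stated here. The name decay_on_plateau is hypothetical.

```python
def decay_on_plateau(accuracies, lr=0.0001, factor=0.8):
    """Replay per-epoch accuracies; return the best accuracy and the lr after each epoch."""
    best = float("-inf")
    history = []
    for acc in accuracies:
        if acc > best:
            best = acc    # improvement: keep the current learning rate
        else:
            lr *= factor  # no improvement: shrink the learning rate
        history.append(lr)
    return best, history


best, history = decay_on_plateau([71.87, 75.70, 75.10, 78.13])
print(best)     # 78.13
print(history)  # the rate drops once, after the third epoch
```

In a real training loop the updated value would then be written back to the optimizer before the next epoch.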

Result

I have tried only one dataset: MR. (The paper evaluates on six other datasets, including SST-1, SST-2, TREC, CR, and MPQA.)
The paper describes four models: CNN-rand, CNN-static, CNN-non-static, and CNN-multichannel.
I have tried CNN-non-static: a model with pre-trained vectors from word2vec.
All word vectors, including the randomly initialized vectors for unknown words and the pre-trained ones, are fine-tuned for each task.
(Among the four models, it has almost the best performance and is the most difficult to implement.)
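
The initialization described above can be sketched as follows. This is a minimal illustration using plain lists: the function name build_embedding_table, the uniform range of 0.25, and the toy 3-dimensional vectors are all assumptions, not taken from the repository.

```python
import random


def build_embedding_table(vocab, pretrained, dim=300, scale=0.25, seed=0):
    """Return one vector per vocabulary word, in vocabulary order."""
    rng = random.Random(seed)
    table = []
    for word in vocab:
        if word in pretrained:
            # Known word: start from its pre-trained vector.
            table.append(list(pretrained[word]))
        else:
            # Unknown word: random initialization (the range is an assumption).
            table.append([rng.uniform(-scale, scale) for _ in range(dim)])
    return table


pretrained = {"good": [0.1, 0.2, 0.3]}  # toy 3-dimensional vectors
table = build_embedding_table(["good", "zzyzx"], pretrained, dim=3)
print(table[0])  # [0.1, 0.2, 0.3]
```

For the non-static variant, a table like this would typically be converted to a tensor and loaded into a trainable embedding layer, for example with torch.nn.Embedding.from_pretrained(tensor, freeze=False), so that every row is updated during training.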

Dataset	Class Size	Best Result	Kim's Paper Result
MR	2	82.617%(CNN-non-static)	81.5%(CNN-nonstatic)

Reference
