diff --git a/HAN-document_classification/README.md b/HAN-document_classification/README.md new file mode 100644 index 00000000..11a93297 --- /dev/null +++ b/HAN-document_classification/README.md @@ -0,0 +1,36 @@ +## Introduction +This is the implementation of [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf) paper in PyTorch. +* Dataset is 600k documents extracted from [Yelp 2018](https://www.yelp.com/dataset) customer reviews +* Use [NLTK](http://www.nltk.org/) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) to tokenize documents and sentences +* Both CPU & GPU support +* The best accuracy is 71%, reaching the same performance in the paper + +## Requirement +* python 3.6 +* pytorch = 0.3.0 +* numpy +* gensim +* nltk +* coreNLP + +## Parameters +According to the paper and experiment, I set model parameters: +|word embedding dimension|GRU hidden size|GRU layer|word/sentence context vector dimension| +|---|---|---|---| +|200|50|1|100| + +And the training parameters: +|Epoch|learning rate|momentum|batch size| +|---|---|---|---| +|3|0.01|0.9|64| + +## Run +1. Prepare dataset. Download the [data set](https://www.yelp.com/dataset), and unzip the custom reviews as a file. Use preprocess.py to transform file into data set foe model input. +2. Train the model. Word enbedding of train data in 'yelp.word2vec'. The model will trained and autosaved in 'model.dict' +``` +python train +``` +3. Test the model. +``` +python evaluate +``` \ No newline at end of file diff --git a/HAN-document_classification/data/test_samples.pkl b/HAN-document_classification/data/test_samples.pkl new file mode 100644 index 00000000..7a904f09 Binary files /dev/null and b/HAN-document_classification/data/test_samples.pkl differ diff --git a/HAN-document_classification/data/train_samples.pkl b/HAN-document_classification/data/train_samples.pkl new file mode 100644 index 00000000..a4ad849b Binary files /dev/null and b/HAN-document_classification/data/train_samples.pkl differ diff --git a/HAN-document_classification/data/yelp.word2vec b/HAN-document_classification/data/yelp.word2vec new file mode 100644 index 00000000..38eaf2b4 Binary files /dev/null and b/HAN-document_classification/data/yelp.word2vec differ