Introduction
This is a PyTorch implementation of the paper Hierarchical Attention Networks for Document Classification (Yang et al., NAACL 2016); a minimal sketch of the model architecture follows the feature list below.
- The dataset is 600k documents extracted from Yelp 2018 customer reviews
- NLTK and Stanford CoreNLP are used to split documents into sentences and tokenize sentences into words
- Both CPU and GPU are supported
- The best accuracy is 71%, matching the performance reported in the paper
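For orientation, here is a minimal sketch of the hierarchical architecture described in the paper: a bidirectional word-level GRU with attention produces sentence vectors, and a bidirectional sentence-level GRU with attention produces the document vector. Class and variable names are illustrative, not the exact code in this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Attention pooling with a learned context vector, as in the paper."""
    def __init__(self, hidden_dim, context_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, context_dim)          # u = tanh(W h + b)
        self.context = nn.Parameter(torch.randn(context_dim))   # word/sentence context vector

    def forward(self, h):                                  # h: (batch, seq_len, hidden_dim)
        u = torch.tanh(self.proj(h))                       # (batch, seq_len, context_dim)
        alpha = F.softmax(u.matmul(self.context), dim=1)   # attention weights over the sequence
        return (alpha.unsqueeze(-1) * h).sum(dim=1)        # weighted sum -> (batch, hidden_dim)

class HAN(nn.Module):
    def __init__(self, vocab_size, num_classes,
                 embed_dim=200, gru_hidden=50, context_dim=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.word_gru = nn.GRU(embed_dim, gru_hidden, num_layers=1,
                               bidirectional=True, batch_first=True)
        self.word_attn = AttentionPool(2 * gru_hidden, context_dim)
        self.sent_gru = nn.GRU(2 * gru_hidden, gru_hidden, num_layers=1,
                               bidirectional=True, batch_first=True)
        self.sent_attn = AttentionPool(2 * gru_hidden, context_dim)
        self.classifier = nn.Linear(2 * gru_hidden, num_classes)

    def forward(self, docs):                    # docs: (batch, n_sents, n_words) word-id tensor
        batch, n_sents, n_words = docs.size()
        words = self.embedding(docs.view(batch * n_sents, n_words))
        word_h, _ = self.word_gru(words)        # word annotations for every sentence
        sent_vecs = self.word_attn(word_h).view(batch, n_sents, -1)   # one vector per sentence
        sent_h, _ = self.sent_gru(sent_vecs)    # sentence annotations for every document
        return self.classifier(self.sent_attn(sent_h))    # unnormalized class scores
```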
Requirements
- python 3.6
- pytorch = 0.3.0
- numpy
- gensim
- nltk (see the setup note after this list)
- Stanford CoreNLP
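If NLTK's tokenizers are used during preprocessing, the punkt sentence-tokenizer data usually has to be downloaded once (PyTorch 0.3.0 and Stanford CoreNLP are installed separately following their own instructions):

```python
import nltk

# One-time download of the 'punkt' models used by nltk.sent_tokenize / nltk.word_tokenize
nltk.download('punkt')
```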
Parameters
Following the paper and my own experiments, the model parameters are set as follows:
word embedding dimension | GRU hidden size | GRU layers | word/sentence context vector dimension
--- | --- | --- | ---
200 | 50 | 1 | 100
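Note that with a bidirectional GRU, a hidden size of 50 yields 100-dimensional word and sentence annotations, which lines up with the 100-dimensional context vectors. These values map directly onto the constructor of the HAN sketch in the Introduction, for example (the vocabulary size is an illustrative placeholder; 5 classes correspond to the Yelp star ratings):

```python
# Illustrative instantiation with the table's values; vocab_size is a placeholder
model = HAN(vocab_size=50000, num_classes=5,
            embed_dim=200, gru_hidden=50, context_dim=100)
```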
And the training parameters:
Epochs | learning rate | momentum | batch size
--- | --- | --- | ---
3 | 0.01 | 0.9 | 64
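A training loop with these settings might look roughly like the following; SGD with momentum is used as in the paper, `model` refers to the HAN sketch from the Introduction, and `train_dataset` is a placeholder for whatever dataset object the preprocessing produces:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # train_dataset: placeholder

for epoch in range(3):
    for docs, labels in train_loader:          # docs: (batch, n_sents, n_words) word ids
        optimizer.zero_grad()
        loss = criterion(model(docs), labels)
        loss.backward()
        optimizer.step()
    torch.save(model.state_dict(), 'model.dict')   # checkpoint after each epoch
```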
Run
- Prepare the dataset. Download the dataset, unzip the customer reviews into a single file, and run preprocess.py to transform that file into the data set the model takes as input (a minimal tokenization sketch is given after this list).
- Train the model. The word embeddings of the training data are stored in 'yelp.word2vec'. The model will be trained and automatically saved to 'model.dict'.
python train
- Test the model (an evaluation sketch follows this list).
python evaluate
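The actual preprocessing lives in preprocess.py; as a rough illustration of the hierarchical tokenization it has to produce (documents split into sentences, sentences into word tokens) and of how the 'yelp.word2vec' embeddings might be built with gensim, here is a minimal sketch. The exact steps in preprocess.py may differ.

```python
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec

def tokenize_document(text):
    """Split a review into sentences, then each sentence into lower-cased word tokens."""
    return [word_tokenize(sentence.lower()) for sentence in sent_tokenize(text)]

docs = ["Great food. The service was slow, though."]      # placeholder for the Yelp reviews
tokenized = [tokenize_document(text) for text in docs]
# -> [[['great', 'food', '.'], ['the', 'service', 'was', 'slow', ',', 'though', '.']]]

# 200-dimensional word2vec embeddings trained on all tokenized sentences
# (gensim >= 4.0 API; older versions use size= instead of vector_size=)
sentences = [sent for doc in tokenized for sent in doc]
w2v = Word2Vec(sentences, vector_size=200, min_count=1)
w2v.save('yelp.word2vec')
```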
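For evaluation, a rough sketch of loading 'model.dict' and computing test accuracy (written against a recent PyTorch API; `model` and `test_loader` are placeholders, not the names used in this repository):

```python
import torch

model.load_state_dict(torch.load('model.dict'))   # restore the trained weights
model.eval()

correct = total = 0
with torch.no_grad():
    for docs, labels in test_loader:               # test_loader: placeholder DataLoader
        preds = model(docs).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print('accuracy: %.2f%%' % (100.0 * correct / total))
```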