
add data samples

tags/v0.1.0
choosewhatulike 6 years ago
parent
commit
245d7cd2cc
4 changed files with 36 additions and 0 deletions
  1. +36 -0 HAN-document_classification/README.md
  2. BIN HAN-document_classification/data/test_samples.pkl
  3. BIN HAN-document_classification/data/train_samples.pkl
  4. BIN HAN-document_classification/data/yelp.word2vec

+36 -0
HAN-document_classification/README.md

@@ -0,0 +1,36 @@
## Introduction
This is a PyTorch implementation of the paper [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf).
* The dataset is 600k documents extracted from [Yelp 2018](https://www.yelp.com/dataset) customer reviews
* Uses [NLTK](http://www.nltk.org/) and [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) to tokenize documents into sentences and words
* Supports both CPU and GPU
* The best accuracy is 71%, matching the performance reported in the paper

## Requirement
* python 3.6
* pytorch 0.3.0
* numpy
* gensim
* nltk
* Stanford CoreNLP

## Parameters
Following the paper and my experiments, the model parameters are set as follows:
|word embedding dimension|GRU hidden size|GRU layer|word/sentence context vector dimension|
|---|---|---|---|
|200|50|1|100|
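The attention pooling that the context vector is used for (u_t = tanh(W h_t + b), alpha_t = softmax(u_t · u_w), s = Σ alpha_t h_t, as described in the paper) can be sketched in pure Python with toy dimensions. The matrix `W`, bias `b`, and context vector `u_w` below are illustrative stand-ins for learned parameters, not values from this repository:

```python
import math

def attention_pool(hidden_states, W, b, u_w):
    """Attention pooling from the HAN paper:
    u_t = tanh(W h_t + b); alpha_t = softmax(u_t . u_w); s = sum_t alpha_t h_t."""
    # Project each hidden state through a one-layer MLP.
    proj = []
    for h in hidden_states:
        u = [math.tanh(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
             for i in range(len(b))]
        proj.append(u)
    # Score each position against the context vector u_w.
    scores = [sum(u[i] * u_w[i] for i in range(len(u_w))) for u in proj]
    # Softmax over positions (max-subtracted for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    # Weighted sum of the original hidden states.
    dim = len(hidden_states[0])
    pooled = [sum(alphas[t] * hidden_states[t][i] for t in range(len(hidden_states)))
              for i in range(dim)]
    return pooled, alphas

# Toy example: 3 time steps, hidden size 2, context dimension 2.
hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity projection (illustrative)
b = [0.0, 0.0]
u_w = [1.0, 1.0]
s, alphas = attention_pool(hs, W, b, u_w)
```

In the real model the same mechanism is applied twice, with the dimensions from the table: once over word GRU outputs (hidden size 50, bidirectional, so 100-dimensional states) to form sentence vectors, and once over sentence GRU outputs to form the document vector.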

And the training parameters:
|Epoch|learning rate|momentum|batch size|
|---|---|---|---|
|3|0.01|0.9|64|
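The momentum SGD update these settings imply (learning rate 0.01, momentum 0.9) can be sketched in pure Python; the toy quadratic objective below is illustrative only:

```python
def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """Classic momentum update: v = momentum * v + grad; p = p - lr * v."""
    for i in range(len(params)):
        velocity[i] = momentum * velocity[i] + grads[i]
        params[i] -= lr * velocity[i]
    return params, velocity

# Toy example: minimise f(p) = p^2 (gradient 2p) starting from p = 1.0.
params, velocity = [1.0], [0.0]
for _ in range(200):
    grads = [2.0 * params[0]]
    params, velocity = sgd_momentum_step(params, grads, velocity)
```

In PyTorch this corresponds to `torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)` with a batch size of 64 for 3 epochs.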

## Run
1. Prepare the dataset. Download the [dataset](https://www.yelp.com/dataset) and unzip the customer reviews into a file. Use preprocess.py to transform the file into a dataset for model input.
2. Train the model. The word embeddings for the training data are stored in 'yelp.word2vec'. The model will be trained and automatically saved to 'model.dict'.
```
python train.py
```
3. Test the model.
```
python evaluate.py
```

BIN
HAN-document_classification/data/test_samples.pkl


BIN
HAN-document_classification/data/train_samples.pkl


BIN
HAN-document_classification/data/yelp.word2vec

