@@ -0,0 +1,188 @@
# Jiagu Natural Language Processing Toolkit
> Jiagu is built on models such as BiLSTM and trained on large-scale corpora. It provides common Chinese NLP functions: word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, text summarization, new word discovery, and more. It draws on the strengths and weaknesses of the major existing toolkits, and is given back to the community.
## Contents
* [Installation](#installation)
* [Usage](#usage)
* [Evaluation](#evaluation)
* [Appendix](#appendix)
---
Provided features:
* Chinese word segmentation
* Part-of-speech tagging
* Named entity recognition
* Sentiment analysis (model still in training)
* Keyword extraction
* Text summarization
* New word discovery
* and more...
---
### Installation
Install with pip:
```shell
pip install jiagu
```
Install from source:
```shell
git clone https://github.com/ownthink/Jiagu
cd Jiagu
python3 setup.py install
```
### Usage
1. Quick start: word segmentation, POS tagging, and named entity recognition
```python3
import jiagu

# jiagu.init()  # models can be initialized manually here, or lazily on first use

text = '厦门明天会不会下雨'

words = jiagu.seg(text)  # word segmentation
print(words)

pos = jiagu.pos(words)  # part-of-speech tagging (takes the word list)
print(pos)

ner = jiagu.ner(text)  # named entity recognition (takes the raw text)
print(ner)
```
2. Chinese word segmentation
Usage of the different segmentation modes:
```python3
import jiagu

text = '汉服和服装'

words = jiagu.seg(text)  # default mode
print(words)

words = jiagu.seg([text, text, text], input='batch')  # batch mode, faster for many sentences
print(words)

words = jiagu.seg(text, model='mmseg')  # segment with the mmseg algorithm (returns a generator)
print(list(words))
```
Custom segmentation models (separate models trained on the msr, pku and cnc segmentation standards will be provided):
```python3
import jiagu

# Paths of the standard-specific models:
# msr: test/extra_data/model/msr.model
# pku: test/extra_data/model/pku.model
# cnc: test/extra_data/model/cnc.model

jiagu.load_model('test/extra_data/model/cnc.model')  # use the CNC (State Language Commission) segmentation standard
words = jiagu.seg('结婚的和尚未结婚的')
print(words)
```
3. Keyword extraction
```python3
import jiagu

text = '''
该研究主持者之一、波士顿大学地球与环境科学系博士陈池(音)表示,“尽管中国和印度国土面积仅占全球陆地的9%,但两国为这一绿化过程贡献超过三分之一。考虑到人口过多的国家一般存在对土地过度利用的问题,这个发现令人吃惊。”
NASA埃姆斯研究中心的科学家拉玛·内曼尼(Rama Nemani)说,“这一长期数据能让我们深入分析地表绿化背后的影响因素。我们一开始以为,植被增加是由于更多二氧化碳排放,导致气候更加温暖、潮湿,适宜生长。”
“MODIS的数据让我们能在非常小的尺度上理解这一现象,我们发现人类活动也作出了贡献。”
NASA文章介绍,在中国为全球绿化进程做出的贡献中,有42%来源于植树造林工程,对于减少土壤侵蚀、空气污染与气候变化发挥了作用。
据观察者网过往报道,2017年我国全国共完成造林736.2万公顷、森林抚育830.2万公顷。其中,天然林资源保护工程完成造林26万公顷,退耕还林工程完成造林91.2万公顷。京津风沙源治理工程完成造林18.5万公顷。三北及长江流域等重点防护林体系工程完成造林99.1万公顷。完成国家储备林建设任务68万公顷。
'''

keywords = jiagu.keywords(text, 5)  # top-5 keywords
print(keywords)
```
4. Text summarization
```python3
import jiagu

fin = open('input.txt', 'r', encoding='utf-8')
text = fin.read()
fin.close()

summarize = jiagu.summarize(text, 3)  # top-3 summary sentences
print(summarize)
```
5. New word discovery (see the usage sketch below)
```python3
import jiagu

jiagu.findword('input.txt', 'output.txt')  # discover new words in a text corpus using information entropy
```
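For reference, `findword` reads a plain UTF-8 text file and writes one candidate per line as `word<TAB>frequency` (see `jiagu/findword.py`). The sketch below only shows the call and how to read the output; the file names are placeholders, and a corpus of substantial size is needed before the frequency and entropy thresholds yield any candidates.
```python3
import jiagu

# corpus.txt: a large raw-text corpus (placeholder name)
jiagu.findword('corpus.txt', 'new_words.txt')

with open('new_words.txt', 'r', encoding='utf-8') as f:
    for line in f:
        word, freq = line.rstrip('\n').split('\t')
        print(word, freq)
```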
### Evaluation
1. Results on the MSR test set
![msr test](https://github.com/ownthink/Jiagu/blob/master/test/extra_data/img/monitor.png)
## Appendix
1. Part-of-speech tag set
```text
n    common noun
nt   temporal noun
nd   locative noun (direction)
nl   locative noun (place)
nh   person name
nhf  surname
nhs  given name
ns   place name
nn   ethnic group name
ni   organization name
nz   other proper noun
v    verb
vd   directional verb
vl   linking verb
vu   modal verb
a    adjective
f    distinguishing word (non-predicate adjective)
m    numeral
q    measure word
d    adverb
r    pronoun
p    preposition
c    conjunction
u    particle
e    interjection
o    onomatopoeia
i    idiom
j    abbreviation
h    prefix component
k    suffix component
g    morpheme character
x    non-morpheme character
w    punctuation
ws   non-Chinese character string
wu   other unknown symbol
```
2. Named entity tag set (BIO scheme; see the example after this list)
```text
B-PER, I-PER    person name
B-LOC, I-LOC    place name
B-ORG, I-ORG    organization name
```
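The snippet below shows how the two tag sets above line up with Jiagu's output: POS tags come back in the same order as the word list, and NER labels follow the BIO scheme with one label per character. The small `bio_to_entities` helper is only an illustration of how BIO labels can be grouped into entity spans; it is not part of Jiagu's API.
```python3
import jiagu

text = '厦门明天会不会下雨'
words = jiagu.seg(text)
tags = jiagu.pos(words)
print(list(zip(words, tags)))  # word/tag pairs, e.g. ('厦门', 'ns')

def bio_to_entities(chars, labels):
    """Group per-character BIO labels into (entity_text, entity_type) spans."""
    entities, buf, etype = [], '', None
    for ch, label in zip(chars, labels):
        if label.startswith('B-'):
            if buf:
                entities.append((buf, etype))
            buf, etype = ch, label[2:]
        elif label.startswith('I-') and buf:
            buf += ch
        else:
            if buf:
                entities.append((buf, etype))
            buf, etype = '', None
    if buf:
        entities.append((buf, etype))
    return entities

ner_labels = jiagu.ner(text)  # one BIO label per character
print(bio_to_entities(text, ner_labels))
```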
### Join us
OwnThink (思知) AI QQ group: 90780053. For the WeChat group, contact the author on WeChat: MrYener. Email: help@ownthink.com
<p>Support the author (your encouragement is the biggest motivation for keeping this project open source!): <a href="https://github.com/ownthink/Jiagu/wiki/donation" target="_blank">Donation acknowledgements</a></p>
![donation QR code](https://github.com/ownthink/Jiagu/blob/master/test/extra_data/img/donation.png)
### Contributors:
1. [Yener](https://github.com/ownthink)
2. [zengbin93](https://github.com/zengbin93)
3. [dirtdust](https://github.com/dirtdust)
@@ -0,0 +1,38 @@
import jiagu

# jiagu.init()  # models can be initialized manually here, or lazily on first use

text = '厦门明天会不会下雨'

words = jiagu.seg(text)  # segmentation; use the model argument to pick the mode: default when omitted, 'mmseg' for the mmseg algorithm
print(words)

pos = jiagu.pos(words)  # part-of-speech tagging
print(pos)

ner = jiagu.ner(text)  # named entity recognition
print(ner)

text = '''
该研究主持者之一、波士顿大学地球与环境科学系博士陈池(音)表示,“尽管中国和印度国土面积仅占全球陆地的9%,但两国为这一绿化过程贡献超过三分之一。考虑到人口过多的国家一般存在对土地过度利用的问题,这个发现令人吃惊。”
NASA埃姆斯研究中心的科学家拉玛·内曼尼(Rama Nemani)说,“这一长期数据能让我们深入分析地表绿化背后的影响因素。我们一开始以为,植被增加是由于更多二氧化碳排放,导致气候更加温暖、潮湿,适宜生长。”
“MODIS的数据让我们能在非常小的尺度上理解这一现象,我们发现人类活动也作出了贡献。”
NASA文章介绍,在中国为全球绿化进程做出的贡献中,有42%来源于植树造林工程,对于减少土壤侵蚀、空气污染与气候变化发挥了作用。
据观察者网过往报道,2017年我国全国共完成造林736.2万公顷、森林抚育830.2万公顷。其中,天然林资源保护工程完成造林26万公顷,退耕还林工程完成造林91.2万公顷。京津风沙源治理工程完成造林18.5万公顷。三北及长江流域等重点防护林体系工程完成造林99.1万公顷。完成国家储备林建设任务68万公顷。
'''

keywords = jiagu.keywords(text, 5)  # keyword extraction
print(keywords)

summarize = jiagu.summarize(text, 3)  # summarization
print(summarize)

# jiagu.findword('input.txt', 'output.txt')  # discover new words in a large corpus using information entropy
@@ -0,0 +1,45 @@
#!/usr/bin/env python3
# -*-coding:utf-8-*-
"""
 * Copyright (C) 2018 OwnThink.
 *
 * Name : __init__.py
 * Author : Yener <yener@ownthink.com>
 * Version : 0.01
 * Description :
"""
from jiagu import analyze

any = analyze.Analyze()

init = any.init

# word segmentation
seg = any.cws
cws = any.cws
cut = any.cws

# part-of-speech tagging
pos = any.pos

# named entity recognition
ner = any.ner

# dependency parsing (not yet available)
# parser

# user dictionary loading (not yet available)
# load_userdict

# custom segmentation model
load_model = any.load_model

# keyword extraction
keywords = any.keywords

# Chinese text summarization
summarize = any.summarize

# new word discovery
findword = any.findword
@@ -0,0 +1,11 @@
#!/usr/bin/env python3
# -*-coding:utf-8-*-
"""
 * Copyright (C) 2018 OwnThink.
 *
 * Name : __main__.py
 * Author : Yener <yener@ownthink.com>
 * Version : 0.01
 * Description :
"""
@@ -0,0 +1,156 @@
#!/usr/bin/env python3
# -*-coding:utf-8-*-
"""
 * Copyright (C) 2018 OwnThink.
 *
 * Name : analyze.py - analysis module
 * Author : Yener <yener@ownthink.com>
 * Version : 0.01
 * Description :
"""
import os
from jiagu import mmseg
from jiagu import findword
from jiagu import bilstm_crf
from jiagu.textrank import Keywords
from jiagu.textrank import Summarize


def add_curr_dir(name):
    return os.path.join(os.path.dirname(__file__), name)
class Analyze(object):
    def __init__(self):
        self.seg_model = None
        self.pos_model = None
        self.ner_model = None
        self.seg_mmseg = None

        self.keywords_model = None
        self.summarize_model = None

    def init(self):
        self.init_cws()
        self.init_pos()
        self.init_ner()

    def init_cws(self):
        if self.seg_model is None:
            self.seg_model = bilstm_crf.Predict(add_curr_dir('model/cws.model'))

    def load_model(self, model_path):
        self.seg_model = bilstm_crf.Predict(model_path)

    def init_pos(self):
        if self.pos_model is None:
            self.pos_model = bilstm_crf.Predict(add_curr_dir('model/pos.model'))

    def init_ner(self):
        if self.ner_model is None:
            self.ner_model = bilstm_crf.Predict(add_curr_dir('model/ner.model'))

    def init_mmseg(self):
        if self.seg_mmseg is None:
            self.seg_mmseg = mmseg.MMSeg()

    @staticmethod
    def __lab2word(sentence, labels):
        # Merge characters into words according to B/M/E-style sequence labels.
        sen_len = len(sentence)
        tmp_word = ""
        words = []
        for i in range(sen_len):
            label = labels[i]
            w = sentence[i]
            if label == "B":
                tmp_word += w
            elif label == "M":
                tmp_word += w
            elif label == "E":
                tmp_word += w
                words.append(tmp_word)
                tmp_word = ""
            else:
                tmp_word = ""
                words.append(w)
        if tmp_word:
            words.append(tmp_word)
        return words
    def cws_text(self, sentence):
        if sentence == '':
            return ['']
        labels = self.seg_model.predict([sentence])[0]
        return self.__lab2word(sentence, labels)

    def cws_list(self, sentences):
        text_list = sentences
        all_labels = self.seg_model.predict(text_list)
        sent_words = []
        for ti, text in enumerate(text_list):
            seg_labels = all_labels[ti]
            sent_words.append(self.__lab2word(text, seg_labels))
        return sent_words

    def cws(self, sentence, input='text', model='default'):
        """Chinese word segmentation.

        :param sentence: str or list
            a single text or a list of texts, depending on `input`
        :param input: str
            input format: 'text' for a single string (default), 'batch' for a list of strings
        :param model: str
            segmentation mode: 'default' for the BiLSTM model, 'mmseg' for the mmseg algorithm
        :return: a list of words, a list of word lists in batch mode, or a generator in mmseg mode
        """
        if model == 'default':
            self.init_cws()

            if input == 'batch':
                words_list = self.cws_list(sentence)
                return words_list
            else:
                words = self.cws_text(sentence)
                return words
        elif model == 'mmseg':
            self.init_mmseg()

            words = self.seg_mmseg.cws(sentence)
            return words
        else:
            pass
        return []
    def pos(self, sentence, input='words'):  # takes a word list
        self.init_pos()

        if input == 'batch':
            all_labels = self.pos_model.predict(sentence)
            return all_labels
        else:
            labels = self.pos_model.predict([sentence])[0]
            return labels

    def ner(self, sentence, input='text'):  # takes raw text
        self.init_ner()

        if input == 'batch':
            all_labels = self.ner_model.predict(sentence)
            return all_labels
        else:
            labels = self.ner_model.predict([sentence])[0]
            return labels

    def keywords(self, text, topkey=5):
        if self.keywords_model is None:
            self.keywords_model = Keywords(tol=0.0001, window=2)
        return self.keywords_model.keywords(text, topkey)

    def summarize(self, text, topsen=5):
        if self.summarize_model is None:
            self.summarize_model = Summarize(tol=0.0001)
        return self.summarize_model.summarize(text, topsen)

    def findword(self, input, output):
        findword.new_word_find(input, output)
@@ -0,0 +1,77 @@
#!/usr/bin/env python3
# -*-coding:utf-8-*-
"""
 * Copyright (C) 2018 OwnThink.
 *
 * Name : bilstm_crf.py - prediction
 * Author : Yener <yener@ownthink.com>
 * Version : 0.01
 * Description :
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # silence TensorFlow's INFO/WARNING log output
import pickle
import numpy as np
import tensorflow as tf
from tensorflow.contrib.crf import viterbi_decode
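# The pickled model file bundles three objects: a frozen TensorFlow GraphDef
# (serialized bytes), the char-to-id vocabulary, and the id-to-tag mapping.
# Predict restores the graph, looks up its input/output tensors by name, and
# decodes the emission scores together with the learned CRF transition matrix
# using Viterbi decoding.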
class Predict(object):
    def __init__(self, model_file):
        with open(model_file, 'rb') as f:
            model, char_to_id, id_to_tag = pickle.load(f)

        self.char_to_id = char_to_id
        self.id_to_tag = {int(k): v for k, v in id_to_tag.items()}
        self.num_class = len(self.id_to_tag)

        graph_def = tf.GraphDef()
        graph_def.ParseFromString(model)
        with tf.Graph().as_default() as graph:
            tf.import_graph_def(graph_def, name="prefix")

        self.input_x = graph.get_tensor_by_name("prefix/char_inputs:0")
        self.lengths = graph.get_tensor_by_name("prefix/lengths:0")
        self.dropout = graph.get_tensor_by_name("prefix/dropout:0")
        self.logits = graph.get_tensor_by_name("prefix/project/logits:0")
        self.trans = graph.get_tensor_by_name("prefix/crf_loss/transitions:0")

        self.sess = tf.Session(graph=graph)
        self.sess.as_default()

    def decode(self, logits, trans, sequence_lengths, tag_num):
        # Viterbi-decode each sequence, adding an auxiliary start tag so the
        # scores line up with the (tag_num + 1)-state transition matrix.
        small = -1000.0
        viterbi_sequences = []
        start = np.asarray([[small] * tag_num + [0]])
        for logit, length in zip(logits, sequence_lengths):
            score = logit[:length]
            pad = small * np.ones([length, 1])
            score = np.concatenate([score, pad], axis=1)
            score = np.concatenate([start, score], axis=0)
            viterbi_seq, viterbi_score = viterbi_decode(score, trans)
            viterbi_sequences.append(viterbi_seq[1:])
        return viterbi_sequences

    def predict(self, sents):
        inputs = []
        lengths = [len(text) for text in sents]
        max_len = max(lengths)
        for sent in sents:
            sent_ids = [self.char_to_id.get(w, self.char_to_id.get("<OOV>")) for w in sent]
            padding = [0] * (max_len - len(sent_ids))
            sent_ids += padding
            inputs.append(sent_ids)
        inputs = np.array(inputs, dtype=np.int32)

        feed_dict = {
            self.input_x: inputs,
            self.lengths: lengths,
            self.dropout: 1.0
        }
        logits, trans = self.sess.run([self.logits, self.trans], feed_dict=feed_dict)
        path = self.decode(logits, trans, lengths, self.num_class)
        labels = [[self.id_to_tag.get(l) for l in p] for p in path]
        return labels
@@ -0,0 +1,145 @@
# -*- encoding:utf-8 -*-
"""
 * Copyright (C) 2017 OwnThink.
 *
 * Name : findword.py - new word discovery
 * Author : Yener <yener@ownthink.com>
 * Version : 0.01
 * Description : implementation of the new word discovery algorithm
                 special thanks to
                 http://www.matrix67.com/blog/archives/5044
                 https://github.com/zoulala/New_words_find
"""
import re
from math import log
from collections import Counter

max_word_len = 6
re_chinese = re.compile(r"[\w]+", re.U)


def count_words(input_file):
    word_freq = Counter()
    fin = open(input_file, 'r', encoding='utf8')
    for index, line in enumerate(fin):
        words = []
        for sentence in re_chinese.findall(line):
            length = len(sentence)
            for i in range(length):
                words += [sentence[i: j + i] for j in range(1, min(length - i + 1, max_word_len + 1))]
        word_freq.update(words)
    fin.close()
    return word_freq
def lrg_info(word_freq, total_word, min_freq, min_mtro):
    l_dict = {}
    r_dict = {}
    k = 0
    for word, freq in word_freq.items():
        k += 1
        if len(word) < 3:
            continue

        left_word = word[:-1]
        ml = word_freq[left_word]
        if ml > min_freq:
            mul_info1 = ml * total_word / (word_freq[left_word[1:]] * word_freq[left_word[0]])
            mul_info2 = ml * total_word / (word_freq[left_word[-1]] * word_freq[left_word[:-1]])
            mul_info = min(mul_info1, mul_info2)
            if mul_info > min_mtro:
                if left_word in l_dict:
                    l_dict[left_word].append(freq)
                else:
                    l_dict[left_word] = [ml, freq]

        right_word = word[1:]
        mr = word_freq[right_word]
        if mr > min_freq:
            mul_info1 = mr * total_word / (word_freq[right_word[1:]] * word_freq[right_word[0]])
            mul_info2 = mr * total_word / (word_freq[right_word[-1]] * word_freq[right_word[:-1]])
            mul_info = min(mul_info1, mul_info2)
            if mul_info > min_mtro:
                if right_word in r_dict:
                    r_dict[right_word].append(freq)
                else:
                    r_dict[right_word] = [mr, freq]
    return l_dict, r_dict


def cal_entro(r_dict):
    entro_r_dict = {}
    for word in r_dict:
        m_list = r_dict[word]

        r_list = m_list[1:]
        fm = m_list[0]

        entro_r = 0
        krm = fm - sum(r_list)
        if krm > 0:
            entro_r -= 1 / fm * log(1 / fm, 2) * krm

        for rm in r_list:
            entro_r -= rm / fm * log(rm / fm, 2)

        entro_r_dict[word] = entro_r
    return entro_r_dict


def entro_lr_fusion(entro_r_dict, entro_l_dict):
    entro_in_rl_dict = {}
    entro_in_r_dict = {}
    entro_in_l_dict = entro_l_dict.copy()
    for word in entro_r_dict:
        if word in entro_l_dict:
            entro_in_rl_dict[word] = [entro_l_dict[word], entro_r_dict[word]]
            entro_in_l_dict.pop(word)
        else:
            entro_in_r_dict[word] = entro_r_dict[word]
    return entro_in_rl_dict, entro_in_l_dict, entro_in_r_dict


def entro_filter(entro_in_rl_dict, entro_in_l_dict, entro_in_r_dict, word_freq, min_entro):
    entro_dict = {}
    l, r, rl = 0, 0, 0
    for word in entro_in_rl_dict:
        if entro_in_rl_dict[word][0] > min_entro and entro_in_rl_dict[word][1] > min_entro:
            entro_dict[word] = word_freq[word]
            rl += 1

    for word in entro_in_l_dict:
        if entro_in_l_dict[word] > min_entro:
            entro_dict[word] = word_freq[word]
            l += 1

    for word in entro_in_r_dict:
        if entro_in_r_dict[word] > min_entro:
            entro_dict[word] = word_freq[word]
            r += 1
    return entro_dict


def new_word_find(input_file, output_file):
    min_freq = 10
    min_mtro = 80
    min_entro = 3

    word_freq = count_words(input_file)
    total_word = sum(word_freq.values())

    l_dict, r_dict = lrg_info(word_freq, total_word, min_freq, min_mtro)

    entro_r_dict = cal_entro(l_dict)
    entro_l_dict = cal_entro(r_dict)

    entro_in_rl_dict, entro_in_l_dict, entro_in_r_dict = entro_lr_fusion(entro_r_dict, entro_l_dict)
    entro_dict = entro_filter(entro_in_rl_dict, entro_in_l_dict, entro_in_r_dict, word_freq, min_entro)
    result = sorted(entro_dict.items(), key=lambda x: x[1], reverse=True)

    with open(output_file, 'w', encoding='utf-8') as kf:
        for w, m in result:
            kf.write(w + '\t%d\n' % m)
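# Illustrative command-line entry point (an addition mirroring mmseg.py, not part
# of the original module); the file names are placeholders for a real corpus and
# an output path.
if __name__ == '__main__':
    new_word_find('input.txt', 'output.txt')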
@@ -0,0 +1,121 @@
#!/usr/bin/env python
# encoding: utf-8
"""
 * Copyright (C) 2018 OwnThink.
 *
 * Name : mmseg.py
 * Author : Leo <1162441289@qq.com>
 * Version : 0.01
 * Description : mmseg segmentation; the current implementation is relatively slow and still being optimized
"""
import os
import pickle
from math import log
from collections import defaultdict


def add_curr_dir(name):
    return os.path.join(os.path.dirname(__file__), name)


class Trie(object):
    def __init__(self):
        self.root = {}
        self.value = "value"
        self.trie_file_path = os.path.join(os.path.dirname(__file__), "data/Trie.pkl")

    def get_matches(self, word):
        ret = []
        node = self.root
        for c in word:
            if c not in node:
                break
            node = node[c]
            if self.value in node:
                ret.append(node[self.value])
        return ret

    def load(self):
        with open(self.trie_file_path, "rb") as f:
            data = pickle.load(f)
        self.root = data


class Chunk:
    def __init__(self, words_list, chrs, word_freq):
        # self.sentence_sep = ['?', '!', ';', '?', '!', '。', ';', '……', '…', ",", ",", "."]
        self.words = words_list
        # Materialize the lengths as a list: a map() iterator would be exhausted
        # by the first sum() in Python 3, making the variance always come out as 0.
        self.lens_list = [len(x) for x in words_list]
        self.length = sum(self.lens_list)
        self.mean = float(self.length) / len(words_list)
        self.var = sum(map(lambda x: (x - self.mean) ** 2, self.lens_list)) / len(self.words)
        self.entropy = sum([log(float(chrs.get(x, 1))) for x in words_list])
        # entropy-style score based on word frequencies
        self.word_entropy = sum([log(float(word_freq.get(x, 1))) for x in words_list])

    def __lt__(self, other):
        return (self.length, self.mean, -self.var, self.entropy, self.word_entropy) < \
               (other.length, other.mean, -other.var, other.entropy, other.word_entropy)


class MMSeg:
    def __init__(self):
        # load the word dictionary (a pickled trie)
        trie = Trie()
        trie.load()
        self.words_dic = trie
        # load the character frequency dictionary
        self.chrs_dic = self._load_freq(filename="data/chars.dic")
        # load the word frequency dictionary
        self.word_freq = self._load_freq(filename="data/words.dic")

    def _load_freq(self, filename):
        chrs_dic = defaultdict()
        with open(add_curr_dir(filename), "r", encoding="utf-8") as f:
            for line in f:
                if line:
                    key, value = line.strip().split(" ")
                    chrs_dic.setdefault(key, int(value))
        return chrs_dic

    def __get_start_words(self, sentence):
        match_words = self.words_dic.get_matches(sentence)
        if sentence:
            if not match_words:
                return [sentence[0]]
            else:
                return match_words
        else:
            return False

    def __get_chunks(self, sentence):
        # build chunks of at most three words starting at the head of the sentence
        ret = []

        def _iter_chunk(sentence, num, tmp_seg_words):
            match_words = self.__get_start_words(sentence)
            if (not match_words or num == 0) and tmp_seg_words:
                ret.append(Chunk(tmp_seg_words, self.chrs_dic, self.word_freq))
            else:
                for word in match_words:
                    _iter_chunk(sentence[len(word):], num - 1, tmp_seg_words + [word])

        _iter_chunk(sentence, num=3, tmp_seg_words=[])
        return ret

    def cws(self, sentence):
        """
        :param sentence: input text
        :return: a generator over the segmented words
        """
        while sentence:
            chunks = self.__get_chunks(sentence)
            word = max(chunks).words[0]
            sentence = sentence[len(word):]
            yield word


if __name__ == "__main__":
    mmseg = MMSeg()
    print(list(mmseg.cws("武汉市长江大桥上的日落非常好看,很喜欢看日出日落。")))
    print(list(mmseg.cws("人要是行干一行行一行.")))
@@ -0,0 +1,197 @@
# -*- encoding:utf-8 -*-
"""
 * Copyright (C) 2017 OwnThink.
 *
 * Name : textrank.py
 * Author : zengbin93 <zeng_bin8888@163.com>
 * Version : 0.01
 * Description : TextRank algorithm implementation
                 special thanks to https://github.com/ArtistScript/FastTextRank
"""
import sys
import numpy as np
from jiagu import utils
from heapq import nlargest
from collections import defaultdict
from itertools import count, product


class Keywords(object):
    def __init__(self,
                 use_stopword=True,
                 stop_words_file=utils.default_stopwords_file(),
                 max_iter=100,
                 tol=0.0001,
                 window=2):
        self.__use_stopword = use_stopword
        self.__max_iter = max_iter
        self.__tol = tol
        self.__window = window
        self.__stop_words = set()
        self.__stop_words_file = utils.default_stopwords_file()
        if stop_words_file:
            self.__stop_words_file = stop_words_file
        if use_stopword:
            with open(self.__stop_words_file, 'r', encoding='utf-8') as f:
                for word in f:
                    self.__stop_words.add(word.strip())
        np.seterr(all='warn')

    @staticmethod
    def build_vocab(sents):
        word_index = {}
        index_word = {}
        words_number = 0
        for word_list in sents:
            for word in word_list:
                if word not in word_index:
                    word_index[word] = words_number
                    index_word[words_number] = word
                    words_number += 1
        return word_index, index_word, words_number

    @staticmethod
    def create_graph(sents, words_number, word_index, window=2):
        graph = [[0.0 for _ in range(words_number)] for _ in range(words_number)]
        for word_list in sents:
            for w1, w2 in utils.combine(word_list, window):
                if w1 in word_index and w2 in word_index:
                    index1 = word_index[w1]
                    index2 = word_index[w2]
                    graph[index1][index2] += 1.0
                    graph[index2][index1] += 1.0
        return graph

    def keywords(self, text, n):
        text = text.replace('\n', '')
        text = text.replace('\r', '')
        text = utils.as_text(text)
        tokens = utils.cut_sentences(text)
        sentences, sents = utils.psegcut_filter_words(tokens,
                                                      self.__stop_words,
                                                      self.__use_stopword)
        word_index, index_word, words_number = self.build_vocab(sents)
        graph = self.create_graph(sents, words_number,
                                  word_index, window=self.__window)
        scores = utils.weight_map_rank(graph, max_iter=self.__max_iter,
                                       tol=self.__tol)
        sent_selected = nlargest(n, zip(scores, count()))
        sent_index = []
        for i in range(n):
            sent_index.append(sent_selected[i][1])
        return [index_word[i] for i in sent_index]
class Summarize(object):
    def __init__(self, use_stopword=True,
                 stop_words_file=None,
                 dict_path=None,
                 max_iter=100,
                 tol=0.0001):
        if dict_path:
            raise RuntimeError("dict_path is not supported in this version")
        self.__use_stopword = use_stopword
        self.__dict_path = dict_path
        self.__max_iter = max_iter
        self.__tol = tol
        self.__stop_words = set()
        self.__stop_words_file = utils.default_stopwords_file()
        if stop_words_file:
            self.__stop_words_file = stop_words_file
        if use_stopword:
            for word in open(self.__stop_words_file, 'r', encoding='utf-8'):
                self.__stop_words.add(word.strip())
        np.seterr(all='warn')

    def filter_dictword(self, sents):
        # NOTE: self.__word2vec is never initialized in this class, so this helper
        # is unused in the current pipeline.
        _sents = []
        dele = set()
        for sentence in sents:
            for word in sentence:
                if word not in self.__word2vec:
                    dele.add(word)
            if sentence:
                _sents.append([word for word in sentence if word not in dele])
        return _sents

    def summarize(self, text, n):
        text = text.replace('\n', '')
        text = text.replace('\r', '')
        text = utils.as_text(text)
        tokens = utils.cut_sentences(text)
        sentences, sents = utils.cut_filter_words(tokens, self.__stop_words, self.__use_stopword)
        graph = self.create_graph(sents)
        scores = utils.weight_map_rank(graph, self.__max_iter, self.__tol)
        sent_selected = nlargest(n, zip(scores, count()))
        sent_index = []
        for i in range(n):
            sent_index.append(sent_selected[i][1])
        return [sentences[i] for i in sent_index]

    @staticmethod
    def create_graph(word_sent):
        num = len(word_sent)
        board = [[0.0 for _ in range(num)] for _ in range(num)]
        for i, j in product(range(num), repeat=2):
            if i != j:
                board[i][j] = utils.sentences_similarity(word_sent[i], word_sent[j])
        return board

    def compute_similarity_by_avg(self, sents_1, sents_2):
        # NOTE: also depends on the uninitialized self.__word2vec; kept for completeness, unused here.
        if len(sents_1) == 0 or len(sents_2) == 0:
            return 0.0
        vec1 = self.__word2vec[sents_1[0]]
        for word1 in sents_1[1:]:
            vec1 = vec1 + self.__word2vec[word1]
        vec2 = self.__word2vec[sents_2[0]]
        for word2 in sents_2[1:]:
            vec2 = vec2 + self.__word2vec[word2]
        similarity = utils.cosine_similarity(vec1 / len(sents_1),
                                             vec2 / len(sents_2))
        return similarity
class TextRank:
    d = 0.85

    def __init__(self):
        self.graph = defaultdict(list)

    def add_edge(self, start, end, weight=1):
        self.graph[start].append((start, end, weight))
        self.graph[end].append((end, start, weight))

    def rank(self):
        ws = defaultdict(float)
        out_sum = defaultdict(float)

        wsdef = 1.0 / (len(self.graph) or 1.0)
        for n, out in self.graph.items():
            ws[n] = wsdef
            out_sum[n] = sum((e[2] for e in out), 0.0)

        sorted_keys = sorted(self.graph.keys())
        for x in range(10):
            for n in sorted_keys:
                s = 0
                for e in self.graph[n]:
                    s += e[2] / out_sum[e[1]] * ws[e[1]]
                ws[n] = (1 - self.d) + self.d * s

        min_rank, max_rank = sys.float_info[0], sys.float_info[3]
        for w in ws.values():
            if w < min_rank:
                min_rank = w
            if w > max_rank:
                max_rank = w

        for n, w in ws.items():
            ws[n] = (w - min_rank / 10.0) / (max_rank - min_rank / 10.0)

        return ws
@@ -0,0 +1,182 @@
# -*- encoding:utf-8 -*-
"""
 * Copyright (C) 2017 OwnThink.
 *
 * Name : utils.py
 * Author : zengbin93 <zeng_bin8888@163.com>
 * Version : 0.01
 * Description : common utility functions
"""
import os
import jiagu
import math
import numpy as np


def default_stopwords_file():
    d = os.path.dirname(os.path.realpath(__file__))
    return os.path.join(d, 'data/stopwords.txt')


sentence_delimiters = ['。', '?', '!', '…']
allow_speech_tags = ['an', 'i', 'j', 'l', 'n', 'nr', 'nrfg', 'ns',
                     'nt', 'nz', 't', 'v', 'vd', 'vn', 'eng']


def as_text(v):
    """Coerce the input to a unicode string."""
    if v is None:
        return None
    elif isinstance(v, bytes):
        return v.decode('utf-8', errors='ignore')
    elif isinstance(v, str):
        return v
    else:
        raise ValueError('Unknown type %r' % type(v))


def is_text(v):
    return isinstance(v, str)


def cut_sentences(sentence):
    tmp = []
    for ch in sentence:  # iterate over the characters of the text
        tmp.append(ch)
        if ch in sentence_delimiters:
            yield ''.join(tmp)
            tmp = []
    yield ''.join(tmp)


def cut_filter_words(cutted_sentences, stopwords, use_stopwords=False):
    sentences = []
    sents = []
    for sent in cutted_sentences:
        sentences.append(sent)
        if use_stopwords:
            sents.append([word for word in jiagu.cut(sent) if word and word not in stopwords])  # split the sentence into words
        else:
            sents.append([word for word in jiagu.cut(sent) if word])
    return sentences, sents


def psegcut_filter_words(cutted_sentences, stopwords, use_stopwords=True):
    sents = []
    sentences = []
    for sent in cutted_sentences:
        sentences.append(sent)
        word_list = jiagu.seg(sent)
        word_list = [word for word in word_list if len(word) > 0]
        if use_stopwords:
            word_list = [word.strip() for word in word_list if word.strip() not in stopwords]
        sents.append(word_list)
    return sentences, sents
def weight_map_rank(weight_graph, max_iter, tol):
    # Initialize every node's score to 0.5 and iterate until the scores converge
    # (or max_iter is reached).
    scores = [0.5 for _ in range(len(weight_graph))]
    old_scores = [0.0 for _ in range(len(weight_graph))]
    denominator = get_degree(weight_graph)

    count = 0
    while different(scores, old_scores, tol):
        for i in range(len(weight_graph)):
            old_scores[i] = scores[i]
        # recompute the score of every node
        for i in range(len(weight_graph)):
            scores[i] = get_score(weight_graph, denominator, i)
        count += 1
        if count > max_iter:
            break
    return scores


def get_degree(weight_graph):
    length = len(weight_graph)
    denominator = [0.0 for _ in range(len(weight_graph))]
    for j in range(length):
        for k in range(length):
            denominator[j] += weight_graph[j][k]
        if denominator[j] == 0:
            denominator[j] = 1.0
    return denominator
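# get_score below computes a TextRank-style update for node i:
#
#     score(V_i) = (1 - d) + d * sum_j( w_ji / sum_k(w_jk) )
#
# where d = 0.85 and w_ji is the weight of the edge from node j to node i,
# normalized by j's total outgoing weight (the `denominator`). Note that, unlike
# the textbook TextRank formula, the current score of node j is not factored
# into the sum.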
def get_score(weight_graph, denominator, i):
    """
    :param weight_graph: adjacency matrix of edge weights
    :param denominator: per-node sum of outgoing edge weights
    :param i: int
        index of the node (sentence or word) being scored
    :return: float
    """
    length = len(weight_graph)
    d = 0.85
    added_score = 0.0

    for j in range(length):
        # weight_graph[j][i] is the weight of the edge from node j to node i
        fraction = weight_graph[j][i] * 1.0
        # normalize by the out-degree of j
        added_score += fraction / denominator[j]
    weighted_score = (1 - d) + d * added_score
    return weighted_score


def different(scores, old_scores, tol=0.0001):
    flag = False
    for i in range(len(scores)):
        if math.fabs(scores[i] - old_scores[i]) >= tol:
            flag = True
            break
    return flag


def cosine_similarity(vec1, vec2):
    """Compute the cosine similarity of two vectors.

    :param vec1: list or np.array
    :param vec2: list or np.array
    :return: float
    """
    tx = np.array(vec1)
    ty = np.array(vec2)
    cos1 = np.sum(tx * ty)
    cos21 = np.sqrt(sum(tx ** 2))
    cos22 = np.sqrt(sum(ty ** 2))
    cosine_value = cos1 / float(cos21 * cos22)
    return cosine_value


def combine(word_list, window=2):
    if window < 2:
        window = 2
    for x in range(1, window):
        if x >= len(word_list):
            break
        word_list2 = word_list[x:]
        res = zip(word_list, word_list2)
        for r in res:
            yield r


def sentences_similarity(s1, s2):
    """Compute the similarity of two sentences (given as word lists).

    :param s1: list
    :param s2: list
    :return: float
    """
    counter = 0
    for sent in s1:
        if sent in s2:
            counter += 1
    if counter == 0:
        return 0
    return counter / (math.log(len(s1) + len(s2)))
| @@ -0,0 +1,20 @@ | |||||
| The MIT License (MIT) | |||||
| Copyright (c) 2018 OwnThink | |||||
| Permission is hereby granted, free of charge, to any person obtaining a copy of | |||||
| this software and associated documentation files (the "Software"), to deal in | |||||
| the Software without restriction, including without limitation the rights to | |||||
| use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of | |||||
| the Software, and to permit persons to whom the Software is furnished to do so, | |||||
| subject to the following conditions: | |||||
| The above copyright notice and this permission notice shall be included in all | |||||
| copies or substantial portions of the Software. | |||||
| THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | |||||
| IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS | |||||
| FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR | |||||
| COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER | |||||
| IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN | |||||
| CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. | |||||
| @@ -0,0 +1,2 @@ | |||||
| tensorflow>=1.4.0 | |||||
| numpy>=1.12.1 | |||||
@@ -0,0 +1,16 @@
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from setuptools import setup

setup(name='jiagu',
      version='0.1.6',
      description='Jiagu Natural Language Processing',
      author='Yener(Zheng Wenyu)',
      author_email='help@ownthink.com',
      url='https://github.com/ownthink/Jiagu',
      license='MIT',
      install_requires=['tensorflow>=1.4.0', 'numpy>=1.12.1'],
      packages=['jiagu'],
      package_dir={'jiagu': 'jiagu'},
      package_data={'jiagu': ['*.*', 'model/*', 'data/*']}
      )
@@ -0,0 +1,15 @@
import jiagu

# jiagu.init()  # models can be initialized manually here, or lazily on first use

text = '厦门明天会不会下雨'

words = jiagu.seg(text)  # word segmentation
print(words)

pos = jiagu.pos(words)  # part-of-speech tagging
print(pos)

ner = jiagu.ner(text)  # named entity recognition
print(ner)
@@ -0,0 +1,29 @@
# -*- encoding:utf-8 -*-
"""
 * Copyright (C) 2017 OwnThink.
 *
 * Name : test_findword.py
 * Author : zengbin93 <zeng_bin8888@163.com>
 * Version : 0.01
 * Description : unit test for the new word discovery algorithm
"""
import os
import jiagu
import unittest


class TestFindWord(unittest.TestCase):
    def setUp(self):
        self.input_file = r"C:\迅雷下载\test_msr.txt"
        self.output_file = self.input_file.replace(".txt", '_words.txt')

    def tearDown(self):
        os.remove(self.output_file)

    def test_findword(self):
        jiagu.findword(self.input_file, self.output_file)
        self.assertTrue(os.path.exists(self.output_file))
@@ -0,0 +1,35 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
 * Copyright (C) 2018.
 *
 * Name : test_mmseg.py
 * Author : Leo <1162441289@qq.com>
 * Version : 0.01
 * Description : tests for the mmseg segmentation mode
"""
import unittest
import jiagu


class TestTextRank(unittest.TestCase):
    def setUp(self):
        pass

    def tearDown(self):
        pass

    def test_seg_one(self):
        sentence = "人要是行干一行行一行"
        words = jiagu.seg(sentence, model="mmseg")
        self.assertTrue(list(words) == ['人', '要是', '行', '干一行', '行', '一行'])

    def test_seg_two(self):
        sentence = "武汉市长江大桥上的日落非常好看,很喜欢看日出日落。"
        words = jiagu.seg(sentence, model="mmseg")
        self.assertTrue(list(words) == ['武汉市', '长江大桥', '上', '的', '日落', '非常', '好看', ',', '很', '喜欢', '看', '日出日落', '。'])


if __name__ == '__main__':
    unittest.main()
@@ -0,0 +1,10 @@
import jiagu

text = '厦门明天的天气怎么样'

words = jiagu.seg(text)  # word segmentation
print(words)

pos = jiagu.pos(words)  # part-of-speech tagging
print(pos)
@@ -0,0 +1,127 @@
# -*- encoding:utf-8 -*-
"""
 * Copyright (C) 2017 OwnThink.
 *
 * Name : test_textrank.py
 * Author : zengbin93 <zeng_bin8888@163.com>
 * Version : 0.01
 * Description : unit tests for the TextRank-based keyword extraction and summarization
"""
import unittest
import re
from jiagu import utils
import jiagu


class TestTextRank(unittest.TestCase):
    def setUp(self):
        pass

    def tearDown(self):
        pass

    def test_default_stopwords(self):
        file = utils.default_stopwords_file()
        stopwords = [x.strip() for x in open(file, 'r', encoding='utf-8')]
        self.assertTrue(len(stopwords) > 0)

    def test_keywords(self):
        text = '''
| 该研究主持者之一、波士顿大学地球与环境科学系博士陈池(音)表示, | |||||
| “尽管中国和印度国土面积仅占全球陆地的9%,但两国为这一绿化过程 | |||||
| 贡献超过三分之一。考虑到人口过多的国家一般存在对土地过度利用的 | |||||
| 问题,这个发现令人吃惊。” | |||||
| NASA埃姆斯研究中心的科学家拉玛·内曼尼(Rama Nemani)说,“这一 | |||||
| 长期数据能让我们深入分析地表绿化背后的影响因素。我们一开始以为, | |||||
| 植被增加是由于更多二氧化碳排放,导致气候更加温暖、潮湿,适宜生长。” | |||||
| “MODIS的数据让我们能在非常小的尺度上理解这一现象,我们发现人类活动 | |||||
| 也作出了贡献。” | |||||
| NASA文章介绍,在中国为全球绿化进程做出的贡献中,有42%来源于植树造林 | |||||
| 工程,对于减少土壤侵蚀、空气污染与气候变化发挥了作用。 | |||||
| 据观察者网过往报道,2017年我国全国共完成造林736.2万公顷、森林抚育 | |||||
| 830.2万公顷。其中,天然林资源保护工程完成造林26万公顷,退耕还林工程 | |||||
| 完成造林91.2万公顷。京津风沙源治理工程完成造林18.5万公顷。三北及长江 | |||||
| 流域等重点防护林体系工程完成造林99.1万公顷。完成国家储备林建设任务68万公顷。 | |||||
        '''
        keywords = jiagu.keywords(text, 5)  # keyword extraction
        self.assertTrue(len(keywords) == 5)

    def test_summarize(self):
        text = ''''江西省上饶市信州区人民法院 刑事判决书 (2016)赣1102刑初274号 公诉机关
| 上饶市信州区人民检察院。 被告人曾榴仙,女,1954年11月22日出生于江西省上饶市信州区, | |||||
| 汉族,文盲,无业,家住上饶市信州区,因涉嫌过失致人死亡罪,2016年4月27日被上饶市公 | |||||
| 安局信州区分局刑事拘留,2016年6月1日被执行逮捕。 辩护人毛巧云,江西盛义律师事务所 | |||||
| 律师。 上饶市信州区人民检察院以饶信检公诉刑诉[2016]260号起诉书指控被告人曾榴仙犯 | |||||
| 过失致人死亡罪,于2016年8月22日向本院提起公诉。本院依法组成合议庭,公开开庭审理了 | |||||
| 本案。上饶市信州区人民检察院指派检察员苏雪莉出庭支持公诉,被告人曾榴仙及辩护人毛巧云, | |||||
| 到庭参加诉讼。现已审理终结。 公诉机关指控: 被告人曾榴仙与被害人祝某两家系位于信州区 | |||||
| 沙溪镇向阳村柘阳的多年邻居,被害人祝某有多年心脏病史,被告人曾榴仙对此事明知。 | |||||
| 2016年4月27日7时许,被告人曾榴仙的丈夫徐某1在修理两家相邻路埂时因权属问题遭到对方阻拦, | |||||
| 被告人曾榴仙和其丈夫徐某1分别与被害人祝某及其丈夫徐某2发生争吵、拉扯,被告人曾榴仙与 | |||||
| 被害人祝某拉扯至祝某家的厕所边,后被害人祝某心脏病发作倒地,在送往医院途中死亡。 | |||||
| 经江西上饶司法鉴定中心鉴定,被害人祝某额部及全身多处皮肤因外力作用致软组织挫擦伤, | |||||
| 生前患有心肌肥大,心瓣膜病等器质性疾病导致心源性猝死是主因,本次与他人发生争吵、 | |||||
| 拉扯系导致心源性猝死的诱因。 2016年4月27日,被告人曾榴仙得知祝某死亡的消息后, | |||||
| 在其丈夫徐某1的陪同下到沙溪派出所投案自首。 被告人曾榴仙对起诉书指控的犯罪事实不持异议。 | |||||
| 辩护人毛巧云提出辩护意见,其对起诉书指控的犯罪事实不持异议,但认为本案系意外事件。 | |||||
| 理由如下:1、被告人曾榴仙是否知道祝某有心脏病;2、被告人曾榴仙即便是知道祝某有心脏病, | |||||
| 这一明知并不能等同于对死亡结果有预见。同时认为被告人曾榴仙具有如下量刑情节: | |||||
| 1、自首;2、当庭认罪;3、一贯表现良好;4、有悔罪表现。 经审理查明: | |||||
| 被告人曾榴仙与被害人祝某两家系位于信州区沙溪镇向阳村柘阳的多年邻居,被害人祝某有多年 | |||||
| 心脏病史,被告人曾榴仙对此事明知。2016年4月27日7时许,被告人曾榴仙的丈夫徐某1在修理 | |||||
| 两家相邻路埂时因权属问题遭到对方阻拦,被告人曾榴仙和其丈夫徐某1分别与被害人祝某及其 | |||||
| 丈夫徐某2发生争吵、拉扯,被告人曾榴仙与被害人祝某拉扯至祝某家的厕所边,后被害人 | |||||
| 祝某心脏病发作倒地,在送往医院途中死亡。经江西上饶司法鉴定中心鉴定,被害人祝某额 | |||||
| 部及全身多处皮肤因外力作用致软组织挫擦伤,生前患有心肌肥大,心瓣膜病等器质性疾病 | |||||
| 导致心源性猝死是主因,本次与他人发生争吵、拉扯系导致心源性猝死的诱因。 | |||||
| 2016年4月27日,被告人曾榴仙得知祝某死亡的消息后,在其丈夫徐某1的陪同下 | |||||
| 到沙溪派出所投案自首。 本案在审理过程中,被告人曾榴仙家属赔偿了被害人祝某家属的损失, | |||||
| 并取得了谅解。 上述事实,被告人曾榴仙在开庭审理过程中亦无异议,且有被告人曾榴仙的 | |||||
| 常住人口信息,归案情况说明,证人徐某1、冯某、徐某3、徐某2、黄某、郑某的证言, | |||||
| 被告曾榴仙的供述及辨认笔录,鉴定意见,现场勘查笔录等证据证实,足以认定。 本院认为, | |||||
| 被告人曾榴仙明知被害人祝某有心脏病,应当预见其行为可能导致祝某病发死亡的后果, | |||||
| 因轻信能够避免而与被害人祝某发生争吵和拉扯,导致被害人病发死亡。其行为已触犯刑法, | |||||
| 构成过失致人死亡罪。公诉机关指控的罪名成立,本院予以支持。辩护人毛巧云辩称该案系意外 | |||||
| 事件的意见本院不予支持。案发后,被告人曾榴仙主动到公安机关投案,并如实供述自己的罪行, | |||||
| 系自首,依法具备可以从轻或减轻处罚情节;被告人曾榴仙家属赔偿了被害人祝某家属的损失, | |||||
| 并取得了谅解,被告人曾榴仙具备酌情从轻处罚情节。本案系因邻里纠纷矛盾激化引发,被告人 | |||||
| 曾榴仙具备酌情从轻处罚情节。依照《中华人民共和国刑法》第二百三十三条、 | |||||
| 第六十七条第一款、第七十二条第一款、第七十三条第二款、第三款的规定, | |||||
| 判决如下: 被告人曾榴仙犯过失致人死亡罪,判处有期徒刑一年,缓刑一年。 | |||||
| (缓刑考验期限,从判决确定之日起计算) 如不服本判决,可在接到判决书的第二日起十日内, | |||||
| 通过本院或者直接向江西省上饶市中级人民法院提出上诉。书面上诉的, | |||||
| 应当提交上诉状正本一份,副本二份。 审判长程明 人民陪审员钱进 人民陪审员郑艳 | |||||
| 二〇一六年十一月十四日 书记员郭建锋 " value="江西省上饶市信州区人民法院 | |||||
| 刑事判决书 (2016)赣1102刑初274号 公诉机关上饶市信州区人民检察院。 | |||||
| 被告人曾榴仙,女,1954年11月22日出生于江西省上饶市信州区,汉族,文盲,无业, | |||||
| 家住上饶市信州区,因涉嫌过失致人死亡罪,2016年4月27日被上饶市公安局信州区分局刑事 | |||||
| 拘留,2016年6月1日被执行逮捕。 辩护人毛巧云,江西盛义律师事务所律师。 | |||||
| 上饶市信州区人民检察院以饶信检公诉刑诉[2016]260号起诉书指控被告人 | |||||
| 曾榴仙犯过失致人死亡罪,于2016年8月22日向本院提起公诉。本院依法组成合议庭, | |||||
| 公开开庭审理了本案。上饶市信州区人民检察院指派检察员苏雪莉出庭支持公诉, | |||||
| 被告人曾榴仙及辩护人毛巧云,到庭参加诉讼。现已审理终结。 公诉机关指控: | |||||
| 被告人曾榴仙与被害人祝某两家系位于信州区沙溪镇向阳村柘阳的多年邻居, | |||||
| 被害人祝某有多年心脏病史,被告人曾榴仙对此事明知。 | |||||
| 2016年4月27日7时许,被告人曾榴仙的丈夫徐某1在修理两家相邻路埂时因权属问题遭到 | |||||
| 对方阻拦,被告人曾榴仙和其丈夫徐某1分别与被害人祝某及其丈夫徐某2发生争吵、拉扯, | |||||
| 被告人曾榴仙与被害人祝某拉扯至祝某家的厕所边,后被害人祝某心脏病发作倒地, | |||||
| 在送往医院途中死亡。经江西上饶司法鉴定中心鉴定,被害人祝某额部及全身多处皮肤因外力 | |||||
| 作用致软组织挫擦伤,生前患有心肌肥大,心瓣膜病等器质性疾病导致心源性猝死是主因, | |||||
| 本次与他人发生争吵、拉扯系导致心源性猝死的诱因。 2016年4月27日,被告人曾榴仙得 | |||||
| 知祝某死亡的消息后,在其丈夫徐某1的陪同下到沙溪派出所投案自首。 被告人曾榴仙对 | |||||
| 起诉书指控的犯罪事实不持异议。 辩护人毛巧云提出辩护意见,其对起诉书指控的犯罪事 | |||||
| 实不持异议,但认为本案系意外事件。理由如下:1、被告人曾榴仙是否知道祝某有心脏病; | |||||
| 2、被告人曾榴仙即便是知道祝某有心脏病,这一明知并不能等同于对死亡结果有预见。 | |||||
| 同时认为被告人曾榴仙具有如下量刑情节:1、自首;2、当庭认罪;3、一贯表现良好; | |||||
| 4、有悔罪表现。 经审理查明: 被告人曾榴仙与被害人祝某两家系位于信州区沙溪镇向阳村 | |||||
| 柘阳的多年邻居,被害人祝某有多年心脏病史,被告人曾榴仙对此事明知。''' | |||||
        text = re.sub('\\n| ', '', text)
        summarize = jiagu.summarize(text, 3)  # summarization
        print(summarize)
        self.assertTrue(len(summarize) == 3)


if __name__ == '__main__':
    unittest.main()