We conducted experiments on the widely used 20-newsgroup text classification benchmark.
20-newsgroup has a hierarchical label structure with five superclasses: {comp, rec, sci, talk, misc}.
In the submitting stage, we enumerated all combinations of three superclasses out of the five, randomly sampling 50% of the training-set data belonging to each combination to create the datasets for 50 uploaders.
In the deploying stage, we considered all combinations of two superclasses out of the five, taking all testing-set data belonging to each combination as the test dataset of one user. This resulted in 10 users.
The user's own training data was generated with the same sampling procedure as the user's test data, but drawn from the training set instead of the testing set.
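The enumeration of superclass combinations described above can be sketched as follows. This is a minimal illustration, not the repository's actual code; the item format (dicts with a `superclass` field) and the helper `sample_uploader_data` are assumptions for demonstration.

```python
from itertools import combinations
import random

SUPERCLASSES = ["comp", "rec", "sci", "talk", "misc"]

# Submitting stage: every 3-superclass combination (C(5,3) = 10 combinations).
uploader_combos = list(combinations(SUPERCLASSES, 3))

# Deploying stage: every 2-superclass combination (C(5,2) = 10 users).
user_combos = list(combinations(SUPERCLASSES, 2))

def sample_uploader_data(train_pool, combo, rate=0.5, seed=0):
    """Randomly sample `rate` of the training items whose superclass is in `combo`.

    `train_pool` is assumed to be a list of dicts with a "superclass" key.
    """
    rng = random.Random(seed)
    candidates = [x for x in train_pool if x["superclass"] in combo]
    return rng.sample(candidates, int(len(candidates) * rate))
```

Sampling 50% of each 3-superclass combination repeatedly (with different random draws) yields the 50 uploader datasets.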
Model training comprised two parts: the first part trained a TF-IDF feature extractor, and the second part trained a naive Bayes classifier on the extracted text feature vectors.
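The two-part training procedure corresponds to a standard TF-IDF plus naive Bayes pipeline. The sketch below uses scikit-learn as a generic illustration of that setup, not the repository's exact code; the toy corpus is invented for demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Part 1: TF-IDF feature extractor; Part 2: naive Bayes classifier on those features.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Toy corpus standing in for 20-newsgroup documents (illustrative only).
texts = [
    "the gpu driver crashed again",
    "the team won the game",
    "new kernel module released",
    "great match last night",
]
labels = ["comp", "rec", "comp", "rec"]

model.fit(texts, labels)
pred = model.predict(["kernel driver update"])
```

Fitting the pipeline fits the vectorizer on the raw texts and then trains the classifier on the resulting sparse TF-IDF matrix in one call.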
Our experiments comprise two components:
- `unlabeled_text_example` evaluates performance when users possess only testing data, searching for and reusing learnware available in the market.
- `labeled_text_example` assesses performance when users have both testing data and limited training data, searching for and reusing learnware from the market instead of training a model from scratch. This helps quantify how much training data is saved for the user.
Run the following command to start `unlabeled_text_example`:

```shell
python workflow.py unlabeled_text_example
```

Run the following command to start `labeled_text_example`:

```shell
python workflow.py labeled_text_example
```
`unlabeled_text_example`: The table below presents the mean accuracy of search and reuse across all users:
| Setting | Accuracy |
|---|---|
| Mean in Market (Single) | 0.507 |
| Best in Market (Single) | 0.859 |
| Top-1 Reuse (Single) | 0.846 |
| Job Selector Reuse (Multiple) | 0.845 |
| Average Ensemble Reuse (Multiple) | 0.862 |
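Average Ensemble Reuse combines the searched learnwares by averaging their predicted class probabilities and taking the argmax. A minimal sketch of that averaging step, using hypothetical models that expose a `predict_proba` method (an assumption for illustration):

```python
import numpy as np

def average_ensemble_predict(models, X):
    """Average per-class probabilities across models, then pick the argmax class."""
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)

class FixedModel:
    # Toy stand-in for a learnware model: returns the same probabilities for every input.
    def __init__(self, p):
        self.p = np.asarray(p, dtype=float)

    def predict_proba(self, X):
        return np.tile(self.p, (len(X), 1))

models = [FixedModel([0.6, 0.4]), FixedModel([0.2, 0.8]), FixedModel([0.3, 0.7])]
preds = average_ensemble_predict(models, [[0.0], [1.0]])
# Mean probabilities are about [0.367, 0.633], so class 1 wins for every sample.
```

Averaging probabilities rather than hard votes lets a confident minority model outweigh uncertain majority models.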
`labeled_text_example`: We plot the classification error rates of the user's self-trained model and of multiple learnware reuse (EnsemblePrune) on the user's test data as the amount of user training data increases. The average results across 10 users are depicted below:
From the figure above, it is evident that when the user's own training data is limited, multiple learnware reuse outperforms the user's self-trained model. As the user's training data grows, the user's model is expected to eventually surpass learnware reuse. This underscores the value of reusing learnware: when user training data is limited, it significantly conserves training data while achieving superior performance.
Based on the learnware paradigm, this repository provides end-to-end support for learnware uploading, checking, organization, searching, deployment, and reuse. It also serves as the engine of the Beimingwu system, supporting the system's core functionality.