Merge pull request #391 from luosiqi/proposal

Proposal and tutorial of unstructured lifelong learning
2 years ago · 0197964b2b
--- a/docs/proposals/images/lifelong-cloud-robotics.png
+++ b/docs/proposals/images/lifelong-cloud-robotics.png
--- a/docs/proposals/images/unstructured-lifelong-learning-algorithm-procedure.png
+++ b/docs/proposals/images/unstructured-lifelong-learning-algorithm-procedure.png
--- a/docs/proposals/lifelong-learning/images/aachen_000000_000019_gtFine_color.png
+++ b/docs/proposals/lifelong-learning/images/aachen_000000_000019_gtFine_color.png
--- a/docs/proposals/lifelong-learning/images/aachen_000000_000019_leftImg8bit.png
+++ b/docs/proposals/lifelong-learning/images/aachen_000000_000019_leftImg8bit.png
--- a/docs/proposals/lifelong-learning/images/s_000_22-06-2016_17-35-02_000000.png
+++ b/docs/proposals/lifelong-learning/images/s_000_22-06-2016_17-35-02_000000.png
--- a/docs/proposals/lifelong-learning/images/s_000_22-06-2016_17-35-02_000000_G.png_direct.png
+++ b/docs/proposals/lifelong-learning/images/s_000_22-06-2016_17-35-02_000000_G.png_direct.png
--- a/docs/proposals/lifelong-learning/lifelong-learning.md
+++ b/docs/proposals/lifelong-learning/lifelong-learning.md
@@ -0,0 +1,9 @@
 # Lifelong Learning

 At present, edge-cloud synergy machine learning is confronted with the challenge of heterogeneous data distributions in complex scenarios and small samples on the edge. The edge-cloud synergy lifelong learning is accordingly proposed: 1) In order to learn with shared knowledge between historical scenarios, the scheme is essentially the combination of another two learning schemes, i.e., multi-task learning and incremental learning; 2) The cloud knowledge base in lifelong learning empowers the scheme with memory ability, which helps to adapt historical knowledge to new and unseen situations on the edge. Joining the forces of multi-task learning, incremental learning and the knowledge base, the lifelong learning scheme seeks to fundamentally overcome the above challenges of edge-cloud synergy machine learning.

 To show the detailed design of Sedna lifelong learning across iterative versions, we list the proposals below.

 - [Structured lifelong learning](./structured-lifelong-learning.md): it realizes edge-cloud collaborative continuous learning, knowledge sharing across the edge and the cloud, and automatic discovery and transfer learning of new knowledge.
 - [Unstructured lifelong learning](./unstructured-lifelong-learning.md): compared to structured lifelong learning, this version accomplishes unseen task detection, online unseen task processing and offline unseen task training, in order to tackle data heterogeneity and data deficiency at the edge.

--- a/docs/proposals/lifelong-learning/structured-lifelong-learning.md
+++ b/docs/proposals/lifelong-learning/structured-lifelong-learning.md
@@ -1,7 +1,5 @@
 * [Lifelong Learning](#lifelong-learning)
   * [Motivation](#motivation)
     * [Goals](#goals)
     * [Non\-goals](#non-goals)
   * [Goals](#goals)
   * [Proposal](#proposal)
     * [Use Cases](#use-cases)
   * [Design Details](#design-details)
@@ -18,12 +16,8 @@
   * [Workers Communication](#workers-communication)

 # Lifelong Learning
 ## Motivation


 At present, edge-cloud synergy machine learning is confronted with the challenge of heterogeneous data distributions in complex scenarios and small samples on the edge. The edge-cloud synergy lifelong learning is accordingly proposed: 1) In order to learn with shared knowledge between historical scenarios, the scheme is essentially the combination of another two learning schemes, i.e., multi-task learning and incremental learning; 2) The cloud knowledge base in lifelong learning empowers the scheme with memory ability, which helps to adapt historical knowledge to new and unseen situations on the edge. Joining the forces of multi-task learning, incremental learning and the knowledge base, the lifelong learning scheme seeks to fundamentally overcome the above challenges of edge-cloud synergy machine learning.
 ### Goals

 ## Goals
 In this version of Sedna lifelong learning framework, we realize the following features:

 * edge-cloud collaborative continuous learning.
 * Knowledge sharing across the edge of the cloud.
@@ -33,7 +27,7 @@ At present, edge-cloud synergy machine learning is confronted with the challenge
 We propose using Kubernetes Custom Resource Definitions (CRDs) to describe 
 the lifelong learning specification/status and a controller to synchronize these updates between edge and cloud.

 ![](./images/lifelong-learning-job-crd.png)
 ![](../images/lifelong-learning-job-crd.png)

 ### Use Cases

@@ -91,7 +85,7 @@ These are not separate controllers as such but named here for clarity.
 - upstream: synchronize the lifelong-learning-job updates from the edge to the cloud node.

 ### Lifelong Learning Controller
 ![](./images/lifelong-learning-controller.png)
 ![](../images/lifelong-learning-controller.png)

 The lifelong-learning controller watches for the updates of lifelong-learning jobs and the corresponding pods against the K8S API server.<br/>
 Updates are categorized below along with the possible actions:
@@ -104,7 +98,7 @@ Updates are categorized below along with the possible actions:
 |The corresponding pod created/running/completed/failed                 | Update the status of lifelong-learning job.|

 ### Downstream Controller
 ![](./images/lifelong-learning-downstream-controller.png)
 ![](../images/lifelong-learning-downstream-controller.png)

 The downstream controller watches for the lifelong-learning job updates against the K8S API server.<br/>
 Updates are categorized below along with the possible actions that the downstream controller can take:
@@ -115,7 +109,7 @@ Updates are categorized below along with the possible actions that the downstrea
 |Lifelong-learning-job Deleted                 | The controller sends the delete event to LCs.|

 ### Upstream Controller
 ![](./images/lifelong-learning-upstream-controller.png)
 ![](../images/lifelong-learning-upstream-controller.png)

 The upstream controller watches for the lifelong-learning job updates from the edge node and applies these updates against the API server in the cloud.<br/>
 Updates are categorized below along with the possible actions that the upstream controller can take:
@@ -130,19 +124,19 @@ Updates are categorized below along with the possible actions that the upstream
 ### The flows of lifelong learning job
 - Flow of the job creation:

 ![](./images/lifelong-learning-flow-creation.png)
 ![](../images/lifelong-learning-flow-creation.png)

 - Flow of the `train` stage:

 ![](./images/lifelong-learning-flow-train-stage.png)
 ![](../images/lifelong-learning-flow-train-stage.png)

 - Flow of the `eval` stage:

 ![](./images/lifelong-learning-flow-eval-stage.png)
 ![](../images/lifelong-learning-flow-eval-stage.png)

 - Flow of the `deploy` stage:

 ![](./images/lifelong-learning-flow-deploy-stage.png)
 ![](../images/lifelong-learning-flow-deploy-stage.png)

 ## Workers Communication
 No need to communicate between workers.
--- a/docs/proposals/lifelong-learning/unstructured-lifelong-learning.md
+++ b/docs/proposals/lifelong-learning/unstructured-lifelong-learning.md
@@ -0,0 +1,253 @@
 * [Unstructured Lifelong Learning](#unstructured-lifelong-learning)
   * [Motivation](#motivation)
     * [Goals](#goals)
     * [Algorithm Process Design](#algorithm-process-design)
     * [Scenario](#scenario)
     * [Dataset](#dataset)
   * [Proposal](#proposal)
     * [Use Cases](#use-cases)
   * [Design Details](#design-details)
     * [CRD API Group and Version](#crd-api-group-and-version)
     * [Lifelong learning CRD](#lifelong-learning-crd)
     * [Lifelong learning type definition](#lifelong-learning-job-type-definition)
     * [Lifelong learning sample](#lifelong-learning-job-sample)
     * [Validation](#validation)
   * [Controller Design](#controller-design)
     * [Lifelong Learning Controller](#lifelong-learning-controller)
     * [Downstream Controller](#downstream-controller)
     * [Upstream Controller](#upstream-controller)
     * [Details of api between GM(cloud) and LC(edge)](#details-of-api-between-gmcloud-and-lcedge)
   * [Workers Communication](#workers-communication)

 # Unstructured Lifelong Learning
 ## Motivation

 Edge-cloud synergy lifelong learning has been proposed and realized in Sedna through the [ATCII example](https://github.com/kubeedge/sedna/blob/main/examples/lifelong_learning/atcii/README.md), which handles structured data. However, the lifelong learning scheme in [version v0.5.1](https://github.com/kubeedge/sedna/releases/tag/v0.5.1) of Sedna has the following limitations.

 * **It cannot be directly applied to unstructured data scenarios** such as images, text, and voice, which are what the majority of AI applications work on. Adapting to unstructured data is essential for Sedna lifelong learning to support more diversified applications.

 * **It is not able to detect or tackle unseen samples**, also called heterogeneous samples, at the inference stage. Non-IID data is a prevailing situation in distributed AI. When heterogeneous data is encountered, the real-time inference performance of the original model drops sharply, which might cause unacceptable loss in certain scenarios. Hence, how to detect heterogeneous or unseen data in advance and process it properly in real time becomes a significant topic.

 * **It has not realized efficient unseen task processing**. As unseen samples are collected, turning them into seen samples to make the model smarter is both challenging and important. Beyond retraining on all the data, efficient training on unseen samples is worth studying.

 In this version of lifelong learning, we seek to fundamentally overcome the above challenges of edge-cloud distributed machine learning.
 ### Goals

 * Supports unstructured lifelong learning to complete semantic segmentation based on RGB-D images.
 * Supports unseen task detection before inference to avoid model performance deterioration.
 * Supports unseen task inference. When an unseen task is detected that should not be inferred by the base model, we provide extra means such as manual intervention, mechanism methods, etc., to guarantee that the unseen task is well processed.
 * Supports unseen task training. In addition to retraining on all the data without a pretrained model, more efficient unseen task training is supported with different strategies, and a pretrained model can be configured to accelerate and improve the training process.
 * Supports knowledge base content exhibition. Namely, Sedna will provide knowledge base information, e.g., number of tasks or models, number of unseen samples, etc., to users through Kubernetes commands.

 ### Algorithm Process Design
 * Initial training stage: create knowledge base with initial dataset by multi-task learning. After that, seen tasks of the first round are stored in knowledge base and wait to be deployed to inference workers.

 * Evaluation stage: before inference starts, this stage finds out proper seen tasks for specified edge inference workers by configured evaluation mechanism, in order to improve inference performance.

 * Inference stage: at this stage, we introduce unseen task detection and unseen task inference to cope with heterogeneous data. With real-time unseen task processing, economic loss can be reduced in certain industrial scenarios such as quality inspection.

 * Update: after enough unseen samples are collected and labeled, knowledge base update is carried out via unseen task training.
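
The four stages above can be sketched as a small loop around a knowledge base. This is only an illustrative toy, not Sedna's implementation: the `KnowledgeBase` class, the function names, the `"task"` key, and the `min_samples` threshold are all hypothetical, and "training" is reduced to registering a model name per task.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class KnowledgeBase:
    """Toy stand-in for the cloud knowledge base (hypothetical structure)."""
    seen_tasks: Dict[str, str] = field(default_factory=dict)  # task name -> model id
    unseen_samples: List[dict] = field(default_factory=list)


def initial_training(kb: KnowledgeBase, dataset: List[dict]) -> None:
    # Initial training stage: group samples by task and store one model per task.
    for sample in dataset:
        task = sample["task"]
        kb.seen_tasks.setdefault(task, f"model-{task}")


def evaluate(kb: KnowledgeBase, edge_tasks: Set[str]) -> Dict[str, str]:
    # Evaluation stage: select the seen tasks that fit a given edge inference worker.
    return {t: m for t, m in kb.seen_tasks.items() if t in edge_tasks}


def infer(kb: KnowledgeBase, deployed: Dict[str, str], sample: dict) -> str:
    # Inference stage: a sample whose task is not deployed is treated as unseen.
    task = sample["task"]
    if task in deployed:
        return f"predicted by {deployed[task]}"
    kb.unseen_samples.append(sample)
    return "unseen: handed over to unseen task inference"


def update(kb: KnowledgeBase, min_samples: int) -> bool:
    # Update stage: once enough unseen samples are collected, train them into seen tasks.
    if len(kb.unseen_samples) < min_samples:
        return False
    initial_training(kb, kb.unseen_samples)
    kb.unseen_samples.clear()
    return True


kb = KnowledgeBase()
initial_training(kb, [{"task": "real"}, {"task": "sim"}])
deployed = evaluate(kb, {"real"})
results = [infer(kb, deployed, {"task": t}) for t in ("real", "night", "night")]
updated = update(kb, min_samples=2)
```

In this toy run the two `night` samples are collected as unseen, and the update step then turns `night` into a seen task.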

 ![](../images/unstructured-lifelong-learning-algorithm-procedure.png)

 ### Scenario
 In this proposal, we achieve intelligent navigation based on Sedna lifelong perception of cloud robotics. 

 ![](../images/lifelong-cloud-robotics.png)

 Robots are increasingly used for delivery and inspection tasks. However, a robot's laser radar usually fails to detect low obstacles, which can cause the robot to fall. AI visual detection helps to compensate for this laser radar deficiency and to recognize the environment accurately. Given the limited resources of a robot, edge-cloud synergy AI makes robots smarter at detecting low obstacles, offers more accurate models from the cloud, and helps robots make intelligent decisions. However, cloud robotics still faces challenges from edge data heterogeneity and edge data deficiency.

 - Edge data heterogeneity: when inference samples come from unseen spots, severe weather or different brightness, model performance deteriorates greatly.
 - Edge data deficiency: it is hard to train an accurate new model quickly for a new spot because of the huge labeling cost and the few samples at a single edge site. Usually model training requires a cold start.

 Lifelong learning can help to improve model performance, tackle edge data heterogeneity and data deficiency, and greatly reduce labeling cost in cloud robotics scenarios. It proposes the following three modules to solve the above challenges.

 - Unseen task detection: it recognizes edge heterogeneous data (unseen tasks) and saves it to the cloud. On one hand, we can alert the robot to stop. On the other hand, unseen tasks stored in the cloud can be used for further training.
 - Unseen task inference: it processes edge heterogeneous data in real time by notifying manual intervention, teleoperation, etc.
 - Unseen task training: it trains new models for unseen samples by multi-task learning and transfer learning, and turns unseen tasks into seen ones to address edge data deficiency.
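
As a rough illustration of how detection and real-time handling fit together, the sketch below flags a sample as unseen when the model's top class confidence is low. The proposal does not specify the detection criterion, so the confidence rule, the threshold value, and the names `is_unseen` and `route` are assumptions made only for this example.

```python
from typing import List

CONFIDENCE_THRESHOLD = 0.6  # hypothetical value, not taken from Sedna


def is_unseen(class_probabilities: List[float],
              threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Assumed criterion: a low top confidence marks the sample as unseen."""
    return max(class_probabilities) < threshold


def route(sample_id: str, class_probabilities: List[float],
          cloud_buffer: List[str]) -> str:
    """Return the action for one sample and keep unseen samples for training."""
    if is_unseen(class_probabilities):
        cloud_buffer.append(sample_id)  # saved for later unseen task training
        return "stop_and_request_intervention"  # unseen task inference
    return "continue_with_model_prediction"


cloud_buffer: List[str] = []
actions = [
    route("frame-001", [0.05, 0.90, 0.05], cloud_buffer),
    route("frame-002", [0.40, 0.35, 0.25], cloud_buffer),
]
```

Here `frame-002` has no confident class, so it is stored for training and the robot is asked to stop, while `frame-001` is handled by the deployed model.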

 ### Dataset
 In this example, we use two datasets, **CITYSCAPES** and **SYNTHIA**, to apply semantic segmentation to cloud robotics and realize intelligent environment perception.

 #### [CITYSCAPES](https://www.cityscapes-dataset.com)
 CITYSCAPES dataset contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5000 frames in addition to a larger set of 20 000 weakly annotated frames. The dataset is thus an order of magnitude larger than similar previous attempts.

 The Cityscapes Dataset is intended for assessing the performance of vision algorithms for major tasks of semantic urban scene understanding: pixel-level, instance-level, and panoptic semantic labeling; supporting research that aims to exploit large volumes of (weakly) annotated data, e.g. for training deep neural networks.

 Below are examples of high quality dense pixel annotations. 
 ![](./images/aachen_000000_000019_leftImg8bit.png)<center>CITYSCAPES Image</center>

 ![](./images/aachen_000000_000019_gtFine_color.png)<center>CITYSCAPES Annotation</center>

 #### [SYNTHIA](https://synthia-dataset.net/)
 SYNTHIA, the SYNTHetic collection of Imagery and Annotations, is a dataset that has been generated with the purpose of aiding semantic segmentation and related scene understanding problems in the context of driving scenarios. SYNTHIA consists of a collection of photo-realistic frames rendered from a virtual city and comes with precise pixel-level semantic annotations for 13 classes: misc, sky, building, road, sidewalk, fence, vegetation, pole, car, sign, pedestrian, cyclist, lane-marking.

 Below are examples of SYNTHIA images and pixel annotations.
 ![](./images/s_000_22-06-2016_17-35-02_000000.png)<center>SYNTHIA Image</center>

 ![](./images/s_000_22-06-2016_17-35-02_000000_G.png_direct.png)<center>SYNTHIA Annotation</center>

 #### Re-organized dataset of lifelong cloud robotics
 We also provide a [re-organized dataset](https://kubeedge.obs.cn-north-1.myhuaweicloud.com/examples/robo_dog_delivery/segmentation_data.zip) built from CITYSCAPES and SYNTHIA to run this example. After decompression, the data is organized as follows:

 <table class="tg">
 <thead>
  <tr>
    <th class="tg-0pky" colspan="2">Dataset</th>
    <th class="tg-0pky">Number</th>
  </tr>
 </thead>
 <tbody>
  <tr>
    <td class="tg-0pky" rowspan="3">CITYSCAPES</td>
    <td class="tg-0pky">rgb</td>
    <td class="tg-0pky">250</td>
  </tr>
  <tr>
    <td class="tg-0pky">disparity</td>
    <td class="tg-0pky">250</td>
  </tr>
  <tr>
    <td class="tg-0pky">groundtruth annotation</td>
    <td class="tg-0pky">250</td>
  </tr>
  <tr>
    <td class="tg-0pky" rowspan="3">SYNTHIA</td>
    <td class="tg-0pky">rgb</td>
    <td class="tg-0pky">250</td>
  </tr>
  <tr>
    <td class="tg-0pky">disparity</td>
    <td class="tg-0pky">250</td>
  </tr>
  <tr>
    <td class="tg-0pky">groundtruth annotation</td>
    <td class="tg-0pky">250</td>
  </tr>
  <tr>
    <td class="tg-0pky" colspan="2">Total</td>
    <td class="tg-0pky">1500</td>
  </tr>
 </tbody>
 </table>
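
The total in the table follows from two datasets, three kinds of files each, and 250 files per kind. The short check below simply reproduces that arithmetic; the variable names are illustrative only.

```python
datasets = ("CITYSCAPES", "SYNTHIA")
modalities = ("rgb", "disparity", "groundtruth annotation")
files_per_modality = 250

# One entry per (dataset, modality) pair, as in the table rows.
counts = {(d, m): files_per_modality for d in datasets for m in modalities}
total = sum(counts.values())
print(total)  # 1500
```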

 **Note:** the following sections, **Proposal**, **Design Details**, **Controller Design** and **Workers Communication**, are the same as those of version v0.5.1.

 ## Proposal
 We propose using Kubernetes Custom Resource Definitions (CRDs) to describe 
 the lifelong learning specification/status and a controller to synchronize these updates between edge and cloud.

 ![](../images/lifelong-learning-job-crd.png)

 ### Use Cases

 * Users can create lifelong learning jobs by providing training scripts and training datasets, and by configuring training hyperparameters and the training and deployment triggers.


 ## Design Details
 There are three stages in a lifelong learning job: train/eval/deploy.

 Each stage contains the following states:
 1. Waiting: waiting for the trigger to be satisfied, i.e. waiting to train/eval/deploy
 1. Ready: the corresponding trigger is satisfied; the stage is now ready to train/eval/deploy
 1. Starting: the corresponding stage is starting
 1. Running: the corresponding stage is running
 1. Failed: the corresponding stage failed
 1. Completed: the corresponding stage completed
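
The stages and states listed above can be written down as two enumerations. The sketch below is illustrative only: the proposal names the states but does not spell out the allowed transitions, so the `NEXT_STATES` table and the `can_transition` helper are assumptions, not the controller's actual state machine.

```python
from enum import Enum


class Stage(Enum):
    TRAIN = "train"
    EVAL = "eval"
    DEPLOY = "deploy"


class State(Enum):
    WAITING = "Waiting"
    READY = "Ready"
    STARTING = "Starting"
    RUNNING = "Running"
    FAILED = "Failed"
    COMPLETED = "Completed"


# Assumed transitions within one stage; the proposal lists the states, not the edges.
NEXT_STATES = {
    State.WAITING: {State.READY},
    State.READY: {State.STARTING},
    State.STARTING: {State.RUNNING, State.FAILED},
    State.RUNNING: {State.COMPLETED, State.FAILED},
    State.FAILED: set(),
    State.COMPLETED: set(),
}


def can_transition(current: State, target: State) -> bool:
    """Check a move against the assumed transition table."""
    return target in NEXT_STATES[current]
```

For example, `can_transition(State.WAITING, State.READY)` holds under this table, while a completed stage cannot go back to `Running`.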

 ### CRD API Group and Version
 The `LifelongLearningJob` CRD will be namespace-scoped.
 The tables below summarize the group, kind and API version details for the CRD.

 * LifelongLearningJob

 | Field                 | Description             |
 |-----------------------|-------------------------|
 |Group                  | sedna.io     |
 |APIVersion             | v1alpha1                |
 |Kind                   | LifelongLearningJob             |

 ### Lifelong learning CRD
 See the [crd source](/build/crds/sedna/sedna.io_lifelonglearningjobs.yaml) for details.

 ### Lifelong learning job type definition

 See the [golang source](/pkg/apis/sedna/v1alpha1/lifelonglearningjob_types.go) for details.

 #### Validation
 [Open API v3 Schema based validation](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#validation) can be used to guard against bad requests.
 Invalid values for fields (for example, a string value for a boolean field) can be validated using this.

 Here is a list of validations we need to support:
 1. The `dataset` specified in the crd should exist in k8s.
 1. The edgenode name specified in the crd should exist in k8s.
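
The two existence checks above could look roughly like the following. This is a hedged sketch rather than Sedna's validation code: `validate_job` is a hypothetical helper, the known datasets and nodes are passed in as plain sets instead of being queried from the Kubernetes API, and the field layout (`dataset.name`, `trainSpec`/`evalSpec`/`deploySpec` with `template.spec.nodeName`) is assumed from the job sample.

```python
from typing import List, Set


def validate_job(spec: dict, known_datasets: Set[str],
                 known_nodes: Set[str]) -> List[str]:
    """Return a list of validation errors; an empty list means the spec passed."""
    errors = []
    dataset = spec.get("dataset", {}).get("name")
    if dataset not in known_datasets:
        errors.append(f"dataset {dataset!r} does not exist")
    for stage in ("trainSpec", "evalSpec", "deploySpec"):
        pod_spec = spec.get(stage, {}).get("template", {}).get("spec", {})
        node = pod_spec.get("nodeName")
        if node is not None and node not in known_nodes:
            errors.append(f"{stage}: node {node!r} does not exist")
    return errors


spec = {
    "dataset": {"name": "lifelong-robo-dataset"},
    "trainSpec": {"template": {"spec": {"nodeName": "edge-1"}}},
    "evalSpec": {"template": {"spec": {"nodeName": "missing-node"}}},
}
errors = validate_job(spec, {"lifelong-robo-dataset"}, {"edge-1"})
```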

 ### Lifelong learning job sample
 See the [source](/build/crd-samples/sedna/lifelonglearningjob_v1alpha1.yaml) for an example.
    
 ## Controller Design

 The lifelong learning controller starts three separate goroutines called `upstream`, `downstream` and `Lifelonglearningjob` controller.<br/>
 These are not separate controllers as such but named here for clarity.
 - Lifelong learning: watch the updates of lifelong-learning job crds, and create the workers depending on the state machine.
 - downstream: synchronize the lifelong-learning-job updates from the cloud to the edge node.
 - upstream: synchronize the lifelong-learning-job updates from the edge to the cloud node.

 ### Lifelong Learning Controller
 ![](../images/lifelong-learning-controller.png)

 The lifelong-learning controller watches for the updates of lifelong-learning jobs and the corresponding pods against the K8S API server.<br/>
 Updates are categorized below along with the possible actions:

 | Update Type                    | Action                                       |
 |-------------------------------|---------------------------------------------- |
 |New lifelong-learning-job Created             | Wait for the train trigger to be satisfied|
 |lifelong-learning-job Deleted                 | NA. These workers will be deleted by [k8s gc](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/).|
 |The Status of lifelong-learning-job Updated               | Create the train/eval worker if it's ready.|
 |The corresponding pod created/running/completed/failed                 | Update the status of lifelong-learning job.|

 ### Downstream Controller
 ![](../images/lifelong-learning-downstream-controller.png)

 The downstream controller watches for the lifelong-learning job updates against the K8S API server.<br/>
 Updates are categorized below along with the possible actions that the downstream controller can take:

 | Update Type                    | Action                                       |
 |-------------------------------|---------------------------------------------- |
 |New Lifelong-learning-job Created             |Sends the job information to LCs.|
 |Lifelong-learning-job Deleted                 | The controller sends the delete event to LCs.|

 ### Upstream Controller
 ![](../images/lifelong-learning-upstream-controller.png)

 The upstream controller watches for the lifelong-learning job updates from the edge node and applies these updates against the API server in the cloud.<br/>
 Updates are categorized below along with the possible actions that the upstream controller can take:

 | Update Type                        | Action                                        |
 |-------------------------------     |---------------------------------------------- |
 |Lifelong-learning-job Reported State Updated    |  The controller appends the reported status of the job by LC in the cloud. |

 ### Details of api between GM(cloud) and LC(edge)
 [Reference](https://github.com/kubeedge/sedna/blob/main/docs/proposals/incremental-learning.md#details-of-api-between-gmcloud-and-lcedge)

 ### The flows of lifelong learning job
 - Flow of the job creation:

 ![](../images/lifelong-learning-flow-creation.png)

 - Flow of the `train` stage:

 ![](../images/lifelong-learning-flow-train-stage.png)

 - Flow of the `eval` stage:

 ![](../images/lifelong-learning-flow-eval-stage.png)

 - Flow of the `deploy` stage:

 ![](../images/lifelong-learning-flow-deploy-stage.png)

 ## Workers Communication
 No need to communicate between workers.
--- a/examples/lifelong_learning/cityscapes/cityscapes-segmentation-lifelong-learning-tutorial.md
+++ b/examples/lifelong_learning/cityscapes/cityscapes-segmentation-lifelong-learning-tutorial.md
@@ -0,0 +1,593 @@
 This tutorial targets the lifelong learning job in a smart environment perception scenario. It covers how to run the default example with customized configurations, as well as how to develop and integrate user-defined modules.

 # 1 Configure Default Example
 With Custom Resource Definitions (CRDs) of Kubernetes, developers are able to configure the default lifelong learning process using the following configurations.
 ## 1.1 Install Sedna
 Follow the [Sedna installation document](https://sedna.readthedocs.io/en/v0.5.0/setup/install.html) to install Sedna.

 ## 1.2 Prepare Dataset
 Users can use semantic segmentation datasets from [CITYSCAPES](https://www.cityscapes-dataset.com/). We also provide a re-organized [dataset segmentation_data.zip](https://kubeedge.obs.cn-north-1.myhuaweicloud.com/examples/robo_dog_delivery/segmentation_data.zip) of CITYSCAPES as an example for training and evaluation.

 Download and unzip segmentation_data.zip by executing the following commands. 
 ```
 mkdir /data
 cd /data
 wget https://kubeedge.obs.cn-north-1.myhuaweicloud.com/examples/robo_dog_delivery/segmentation_data.zip
 unzip segmentation_data.zip
 ```

 ## 1.3 Create Dataset CRD
 | Property | Required | Description |
 |----------|----------|-------------|
 |name|yes|Dataset name defined in metadata|
 |url|yes|URL of the dataset index file, which is generally stored on the data node|
 |format|yes|Format of dataset index file|
 |nodeName|yes|Name of data node that stores data and dataset index file|

 After preparing the dataset and index file, users can configure them as in the following example for training and evaluation. Data will then be automatically downloaded from the location the index file indicates to the corresponding pods.
 ```
 DATA_NODE="cloud-node"
 ```

 ```
 kubectl create -f - << EOF
 apiVersion: sedna.io/v1alpha1
 kind: Dataset
 metadata:
  name: lifelong-robo-dataset
 spec:
  url: "$data_url"
  format: "txt"
  nodeName: "$DATA_NODE"
 EOF
 ```

 ## 1.4 Start Lifelong Learning Job
 To run lifelong learning jobs, users need to configure their own lifelong learning CRDs in training, evaluation, and inference phases. The configuration process for these three phases is similar.

 | Property | Required | Description |
 |----------|----------|-------------|
 |nodeName|yes|Name of the node where worker runs|
 |dnsPolicy|yes|DNS policy set at pod level|
 |imagePullPolicy|yes|Image pulling policy when local image does not exist|
 |args|yes|Arguments to run images. In this example, it is the startup file of each stage| 
 |env|no|Environment variables passed to each stage |
 |trigger|yes|Configuration for when training begins|
 |resources|yes|Limited or required resources of CPU and memory|
 |volumeMounts|no|Specified path to be mounted to the host|
 |volumes|no|Directory in the node which file systems in a worker are mounted to|

 First, configure parameters for the lifelong learning job as follows.

 ```
 local_prefix=/data
 cloud_image=docker.io/luosiqi/sedna-robo:v0.1.2
 edge_image=docker.io/luosiqi/sedna-robo:v0.1.2
 data_url=$local_prefix/segmentation_data/data.txt

 WORKER_NODE=sedna-mini-control-plane

 DATA_NODE=$WORKER_NODE 
 TRAIN_NODE=$WORKER_NODE 
 EVAL_NODE=$WORKER_NODE 
 INFER_NODE=$WORKER_NODE 
 OUTPUT=$local_prefix/lifelonglearningjob/output
 job_name=robo-demo
 ```

 Second, use the following YAML configuration to create and run the lifelong learning job.

 ```
 kubectl create -f - <<EOF
 apiVersion: sedna.io/v1alpha1
 kind: LifelongLearningJob
 metadata:
  name: $job_name
 spec:
  dataset:
    name: "lifelong-robo-dataset"
    trainProb: 0.8
  trainSpec:
    template:
      spec:
        nodeName: $TRAIN_NODE
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - image: $cloud_image
            name:  train-worker
            imagePullPolicy: IfNotPresent       
            args: ["train.py"]
            env:
              - name: "num_class"
                value: "24"
              - name: "epoches"
                value: "1"
              - name: "attribute"
                value: "real, sim"
              - name: "city"
                value: "berlin"
              - name: "BACKEND_TYPE"
                value: "PYTORCH"
            resources:
              limits:
                cpu: 6
                memory: 12Gi
              requests:
                cpu: 4
                memory: 12Gi
            volumeMounts:
            - mountPath: /dev/shm
              name: cache-volume
        volumes:
        - emptyDir:
            medium: Memory
            sizeLimit: 256Mi
          name: cache-volume
    trigger:
      checkPeriodSeconds: 30
      timer:
        start: 00:00
        end: 24:00
      condition:
        operator: ">"
        threshold: 100
        metric: num_of_samples
  evalSpec:
    template:
      spec:
        nodeName: $EVAL_NODE
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - image: $cloud_image
            name:  eval-worker
            imagePullPolicy: IfNotPresent
            args: ["evaluate.py"]
            env:
              - name: "operator"
                value: "<"
              - name: "model_threshold"
                value: "0"
              - name: "num_class"
                value: "24"
              - name: "BACKEND_TYPE"
                value: "PYTORCH"
            resources:
              limits:
                cpu: 6
                memory: 12Gi
              requests:
                cpu: 4
                memory: 12Gi
  deploySpec:
    template:
      spec:
        nodeName: $INFER_NODE
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        containers:
        - image: $edge_image
          name:  infer-worker
          imagePullPolicy: IfNotPresent
          args: ["predict.py"]
          env:
            - name: "test_data"
              value: "/data/test_data"
            - name: "num_class"
              value: "24"
            - name: "unseen_save_url"
              value: "/data/unseen_samples"
            - name: "INFERENCE_RESULT_DIR"
              value: "/data/infer_results"
            - name: "BACKEND_TYPE"
              value: "PYTORCH"
          volumeMounts:
          - name: unseenurl
            mountPath: /data/unseen_samples
          - name: inferdata
            mountPath: /data/infer_results
          - name: testdata
            mountPath: /data/test_data
          resources:
            limits:
              cpu: 6
              memory: 12Gi
            requests:
              cpu: 4
              memory: 12Gi
        volumes:
          - name: unseenurl
            hostPath:
              path: /data/unseen_samples
              type: DirectoryOrCreate
          - name: inferdata
            hostPath:
              path: /data/infer_results
              type: DirectoryOrCreate
          - name: testdata
            hostPath:
              path: /data/test_data
              type: DirectoryOrCreate
  outputDir: $OUTPUT/$job_name
 EOF
 ```
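
One way to read the `trigger` block in the manifest above is: every `checkPeriodSeconds`, check whether the current time falls within the `timer` window and whether the `condition` holds for the named metric. The sketch below encodes that reading only; it is not Sedna's implementation, and the helper names `to_minutes` and `trigger_satisfied` as well as the supported operator set are assumptions.

```python
import operator

# Assumed operator set; the manifest above only uses ">".
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def to_minutes(hhmm: str) -> int:
    """Convert an "HH:MM" string to minutes since midnight ("24:00" -> 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def trigger_satisfied(trigger: dict, now_hhmm: str, metrics: dict) -> bool:
    """True when the time is inside the timer window and the condition holds."""
    timer = trigger["timer"]
    now = to_minutes(now_hhmm)
    in_window = to_minutes(timer["start"]) <= now < to_minutes(timer["end"])
    cond = trigger["condition"]
    compare = OPERATORS[cond["operator"]]
    return in_window and compare(metrics[cond["metric"]], cond["threshold"])


trigger = {
    "checkPeriodSeconds": 30,
    "timer": {"start": "00:00", "end": "24:00"},
    "condition": {"operator": ">", "threshold": 100, "metric": "num_of_samples"},
}
```

Under this reading, training would start once more than 100 samples are available, for example `trigger_satisfied(trigger, "09:30", {"num_of_samples": 101})`.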

 ## 1.5 Check Lifelong Learning Job
 **(1). Query lifelong learning service status**

 ```
 kubectl get lifelonglearningjob robo-demo
 ```

 **(2). View pods related to lifelong learning job**

 ```
 kubectl get pod
 ```

 **(3). View result files**

 Knowledge base contents, including multi-task learning models, dataset, etc., can be found in `outputDir`. Inference results are stored in `/data/infer_results` on `$INFER_NODE`.

 # 2 Develop and Integrate Customized Modules

 Before starting development, you should prepare the [development environment](https://github.com/kubeedge/sedna/blob/main/docs/contributing/prepare-environment.md) and learn about the [interface design of Sedna](https://sedna.readthedocs.io/en/latest/autoapi/lib/sedna/index.html).

 ## 2.1 Develop Sedna AI Module

 The Sedna framework components are decoupled and the registration mechanism is used to combine functional components to facilitate function and algorithm expansion. For details about the Sedna architecture and main mechanisms, see [Lib README](https://github.com/kubeedge/sedna/blob/51219027a0ec915bf3afb266dc5f9a7fb3880074/lib/sedna/README.md).

 The following content explains how to develop customized AI modules of a Sedna project, including **dataset**, **base model**, **algorithms**, etc.

 ### 2.1.1 Import Service Datasets

 During Sedna application development, the first problem users encounter is how to import service datasets to Sedna. Sedna provides interfaces and public methods related to data conversion and sampling in the [Dataset class](https://github.com/kubeedge/sedna/blob/c763c1a90e74b4ff1ab0afa06fb976fbb5efa512/lib/sedna/datasources/__init__.py). 

 All dataset classes of Sedna inherit from the base class `sedna.datasources.BaseDataSource`. This base class defines the interfaces and attributes for customized dataset processing and provides default implementations. Derived classes can override these default implementations as required.

 We take a `txt`-format dataset index file that lists a set of images as an example.

 **(1). Inherit from BaseDataSource**

 ```python
 class BaseDataSource:
    """
    An abstract class representing a :class:`BaseDataSource`.
    All datasets that represent a map from keys to data samples should subclass
    it. All subclasses should override `parse`, which supports getting
    train/eval/infer data by a function. Subclasses may also override `__len__`,
    which is expected to return the size of the dataset. Set `x` to the
    feature embedding and `y` to the target label.
    Parameters
    ----------
    data_type : str
        define the datasource is train/eval/test
    func: function
        function use to parse an iter object batch by batch
    """

    def __init__(self, data_type="train", func=None):
        self.data_type = data_type  # sample type: train/eval/test
        self.process_func = None
        if callable(func):
            self.process_func = func
        elif func:
            self.process_func = ClassFactory.get_cls(
                ClassType.CALLBACK, func)()
        self.x = None  # sample feature
        self.y = None  # sample label
        self.meta_attr = None  # special in lifelong learning

    def num_examples(self) -> int:
        return len(self.x)

    def __len__(self):
        return self.num_examples()

    def parse(self, *args, **kwargs):
        raise NotImplementedError

    @property
    def is_test_data(self):
        return self.data_type == "test"

    def save(self, output=""):
        return FileOps.dump(self, output)


 class TxtDataParse(BaseDataSource, ABC):
    """
    txt file which contain image list parser
    """

    def __init__(self, data_type, func=None):
        super(TxtDataParse, self).__init__(data_type=data_type, func=func)

    def parse(self, *args, **kwargs):
        x_data = []
        y_data = []
        use_raw = kwargs.get("use_raw")
        for f in args:
            if not (f and FileOps.exists(f)):
                continue
            with open(f) as fin:
                if self.process_func:
                    res = list(map(self.process_func, [
                               line.strip() for line in fin.readlines()]))
                else:
                    res = [line.strip().split() for line in fin.readlines()]
            for tup in res:
                if not len(tup):
                    continue
                if use_raw:
                    x_data.append(tup)
                else:
                    x_data.append(tup[0])
                    if not self.is_test_data:
                        if len(tup) > 1:
                            y_data.append(tup[1])
                        else:
                            y_data.append(0)
        self.x = np.array(x_data)
        self.y = np.array(y_data)
 ```

 Users can then load and preprocess the dataset index file with `sedna.datasources.TxtDataParse` as follows. In particular, the `func` argument of `TxtDataParse` defines a customized preprocessing function that is applied to each row of the dataset index file.

 ```python
 def _load_txt_dataset(dataset_url):
    # use original dataset url
    original_dataset_url = Context.get_parameters('original_dataset_url', "")
    dataset_urls = dataset_url.split()
    dataset_urls = [
        os.path.join(
            os.path.dirname(original_dataset_url),
            dataset_url) for dataset_url in dataset_urls]
    return dataset_urls[:-1], dataset_urls[-1]

 def run():
    estimator = Estimator(num_class=int(Context.get_parameters("num_class", 24)),
                      epochs=int(Context.get_parameters("epoches", 1)))
    train_dataset_url = BaseConfig.train_dataset_url
    train_data = TxtDataParse(data_type="train", func=_load_txt_dataset)
    train_data.parse(train_dataset_url, use_raw=False)

    train(estimator, train_data)


 if __name__ == '__main__':
    run()
 ```
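To make the effect of `func` more concrete, the following self-contained sketch mimics the non-raw branch of `TxtDataParse.parse` without importing Sedna. The helpers `resolve_row` and `parse_index`, the base directory `/data/robo`, and the file names are hypothetical; the sketch assumes that the last field of each row is the label file and that all preceding fields are input images.

```python
import os


def resolve_row(line, base_dir):
    """Turn one index-file row into (image_paths, label_path).

    Assumption: the last whitespace-separated field is the label file and
    every preceding field is an input image, all relative to base_dir.
    """
    fields = [os.path.join(base_dir, name) for name in line.split()]
    if not fields:
        return ()  # empty row: nothing to parse
    return fields[:-1], fields[-1]


def parse_index(lines, func):
    """Simplified stand-in for the non-raw branch of TxtDataParse.parse."""
    x_data, y_data = [], []
    for line in lines:
        tup = func(line.strip())
        if not len(tup):
            continue  # skip rows that produced no fields
        x_data.append(tup[0])
        y_data.append(tup[1] if len(tup) > 1 else 0)
    return x_data, y_data


# Hypothetical index rows: two input images followed by one label file.
rows = [
    "rgb/0001.png depth/0001.png gt/0001.png",
    "",
    "rgb/0002.png depth/0002.png gt/0002.png",
]
x, y = parse_index(rows, lambda line: resolve_row(line, "/data/robo"))
print(x[0])
print(y[0])
```

Each entry of `x` is the list of input paths of one row and each entry of `y` is the matching label path, which is the shape the per-row function in the tutorial code returns as well.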

 ### 2.1.2 Modify Base Model

 Estimator is a high-level API that greatly simplifies machine learning programming. An Estimator encapsulates the `train`, `evaluate`, `predict`, `load` and `save` functions, which users implement for their own model.

 **(1). Define an Estimator**

 In the lifelong learning robotics case, the Estimator is defined in `interface.py`, and users can replace the existing base model with the model that best suits their purposes.

 ```python
 class Estimator:
    def __init__(self, **kwargs):
        self.train_args = TrainingArguments(**kwargs)
        self.val_args = EvaluationArguments(**kwargs)

        self.train_args.resume = Context.get_parameters(
            "PRETRAINED_MODEL_URL", None)
        self.trainer = None
        self.train_model_url = None

        label_save_dir = Context.get_parameters(
            "INFERENCE_RESULT_DIR",
            os.path.join(BaseConfig.data_path_prefix,
                         "inference_results"))
        self.val_args.color_label_save_path = os.path.join(
            label_save_dir, "color")
        self.val_args.merge_label_save_path = os.path.join(
            label_save_dir, "merge")
        self.val_args.label_save_path = os.path.join(label_save_dir, "label")
        self.val_args.weight_path = kwargs.get("weight_path")
        self.validator = Validator(self.val_args)

    def train(self, train_data, valid_data=None, **kwargs):
        self.trainer = Trainer(
            self.train_args, train_data=train_data, valid_data=valid_data)
        LOGGER.info("Total epoches: {}".format(self.trainer.args.epochs))
        for epoch in range(
                self.trainer.args.start_epoch,
                self.trainer.args.epochs):
            if epoch == 0 and self.trainer.val_loader:
                self.trainer.validation(epoch)
            self.trainer.training(epoch)

            if self.trainer.args.no_val and \
                (epoch % self.trainer.args.eval_interval ==
                    (self.trainer.args.eval_interval - 1) or
                 epoch == self.trainer.args.epochs - 1):
                # save checkpoint when it meets eval_interval
                # or the training finishes
                is_best = False
                train_model_url = self.trainer.saver.save_checkpoint({
                    'epoch': epoch + 1,
                    'state_dict': self.trainer.model.state_dict(),
                    'optimizer': self.trainer.optimizer.state_dict(),
                    'best_pred': self.trainer.best_pred,
                }, is_best)

        self.trainer.writer.close()
        self.train_model_url = train_model_url

        return {"mIoU": 0 if not valid_data
                else self.trainer.validation(epoch)}

    def predict(self, data, **kwargs):
        if isinstance(data[0], dict):
            data = preprocess_frames(data)

        if isinstance(data[0], np.ndarray):
            data = preprocess_url(data)

        self.validator.test_loader = DataLoader(
            data,
            batch_size=self.val_args.test_batch_size,
            shuffle=False,
            pin_memory=False)

        return self.validator.validate()

    def evaluate(self, data, **kwargs):
        predictions = self.predict(data.x)
        return accuracy(data.y, predictions)

    def load(self, model_url, **kwargs):
        if model_url:
            self.validator.new_state_dict = torch.load(model_url)
            self.validator.model = load_my_state_dict(
                self.validator.model,
                self.validator.new_state_dict['state_dict'])

            self.train_args.resume = model_url
        else:
            raise Exception("model url does not exist.")

    def save(self, model_path=None):
        if not model_path:
            LOGGER.warning("Model path is not specified.")
            return self.train_model_url

        return FileOps.upload(self.train_model_url, model_path)
 ```

 **(2). Initialize a lifelong learning job**

 ```python
 from sedna.core.lifelong_learning import LifelongLearning

 from interface import Estimator


 ll_job = LifelongLearning(
    estimator=Estimator
 )
 ```

 Note that `Estimator` is the base model for your lifelong learning job.
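As a rough illustration of the five-method shape described in this section, the sketch below defines a toy base model that always predicts the most frequent training label. `MajorityEstimator` is a hypothetical stand-in, not part of Sedna, and it assumes that data is passed as a plain `(features, labels)` pair rather than a Sedna data source object.

```python
import pickle
from collections import Counter


class MajorityEstimator:
    """Toy base model: always predicts the most frequent training label.

    Hypothetical stand-in that only mirrors the five method names
    (train, evaluate, predict, load, save) listed in this section.
    """

    def __init__(self, **kwargs):
        self.label = None

    def train(self, train_data, valid_data=None, **kwargs):
        # Assumption: train_data is a (features, labels) pair.
        _, labels = train_data
        self.label = Counter(labels).most_common(1)[0][0]
        return {"majority_label": self.label}

    def predict(self, data, **kwargs):
        # One prediction per input sample.
        return [self.label for _ in data]

    def evaluate(self, data, **kwargs):
        # Plain accuracy over a (features, labels) pair.
        features, labels = data
        predictions = self.predict(features)
        hits = sum(p == t for p, t in zip(predictions, labels))
        return hits / len(labels)

    def save(self, model_path):
        with open(model_path, "wb") as fout:
            pickle.dump(self.label, fout)
        return model_path

    def load(self, model_url, **kwargs):
        with open(model_url, "rb") as fin:
            self.label = pickle.load(fin)


model = MajorityEstimator()
model.train((["a", "b", "c"], ["road", "road", "curb"]))
print(model.evaluate((["d", "e"], ["road", "curb"])))
```

A real base model would replace the counting logic with actual training and inference code while keeping the same method names, so that the job object can drive it without knowing the model internals.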

 ### 2.1.3 Develop Customized Algorithms

 Users may need to develop new algorithms based on the base classes provided by Sedna, such as `unseen task detection` in the lifelong learning example.

 Sedna provides a module called `class_factory.py` in the `common` package, with which only a few lines of changes are required to integrate existing algorithms into Sedna.

 The following takes a task definition algorithm as an example to explain how to add a customized algorithm to the Sedna lifelong learning algorithm library.

 **(1). Start from the `class_factory.py`**

 First, look at `class_factory.py`, which defines two classes: `ClassType` and `ClassFactory`.

 `ClassFactory` registers the modules you want to reuse through decorators. For example, a new task definition algorithm of type `ClassType.STP` in lifelong learning is registered as follows:

 ```python
 @ClassFactory.register(ClassType.STP)
 class TaskDefinitionByOrigin(BaseTaskDefinition):
    """
    Dividing datasets based on their origins.
    Parameters
    ----------
    attr_filed Tuple[Metadata]
        metadata is usually a class feature label with a finite set of values.
    """

    def __init__(self, **kwargs):
        super(TaskDefinitionByOrigin, self).__init__()
        self.attribute = kwargs.get("attribute").split(", ")
        self.city = kwargs.get("city")

    def __call__(self,
                 samples: BaseDataSource, **kwargs) -> Tuple[List[Task],
                                                             Any,
                                                             BaseDataSource]:

        tasks = []
        d_type = samples.data_type

        task_index = dict(zip(self.attribute, range(len(self.attribute))))
        sample_index = range(samples.num_examples())

        _idx = [i for i in sample_index if self.city in samples.y[i]]
        _y = samples.y[_idx]
        _x = samples.x[_idx]
        _sample = BaseDataSource(data_type=d_type)
        _sample.x, _sample.y = _x, _y

        g_attr = f"{self.attribute[0]}.model"
        task_obj = Task(entry=g_attr, samples=_sample,
                        meta_attr=self.attribute[0])
        tasks.append(task_obj)

        _idx = list(set(sample_index) - set(_idx))
        _y = samples.y[_idx]
        _x = samples.x[_idx]
        _sample = BaseDataSource(data_type=d_type)
        _sample.x, _sample.y = _x, _y

        g_attr = f"{self.attribute[-1]}.model"
        task_obj = Task(entry=g_attr, samples=_sample,
                        meta_attr=self.attribute[-1])
        tasks.append(task_obj)

        return tasks, task_index, samples
 ```

 In this step, you have customized a **task definition algorithm**; the `@ClassFactory.register(ClassType.STP)` line completes the registration.
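The general idea of decorator-based registration can be sketched in a few lines. `MiniFactory` and `SplitByKeyword` below are hypothetical, simplified stand-ins and do not reproduce Sedna's actual `ClassFactory` implementation; the sketch only shows how a decorator can store a class under a type and a name so that it can be looked up later.

```python
from collections import defaultdict


class MiniFactory:
    """Hypothetical, simplified registry; not Sedna's ClassFactory."""

    _registry = defaultdict(dict)

    @classmethod
    def register(cls, type_name, alias=None):
        def decorator(target):
            # Store the class under its type and (alias or class) name.
            cls._registry[type_name][alias or target.__name__] = target
            return target
        return decorator

    @classmethod
    def get_cls(cls, type_name, name):
        try:
            return cls._registry[type_name][name]
        except KeyError:
            raise ValueError(
                f"{name!r} is not registered under {type_name!r}") from None


@MiniFactory.register("STP")
class SplitByKeyword:
    """Toy task definition: split labels by whether they contain a keyword."""

    def __init__(self, keyword):
        self.keyword = keyword

    def __call__(self, labels):
        matched = [item for item in labels if self.keyword in item]
        rest = [item for item in labels if self.keyword not in item]
        return matched, rest


algorithm_cls = MiniFactory.get_cls("STP", "SplitByKeyword")
print(algorithm_cls("aachen")(["aachen_01.png", "bonn_07.png"]))
```

Because the decorator returns the class unchanged, registration has no effect on how the class behaves when it is used directly.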

 **(2). Configure algorithm module in Sedna**

 After registration, you only need to configure the task definition algorithm in the corresponding script. Take the following code in `train.py` as an example.

 ```python
 def train(estimator, train_data):
    task_definition = {
        "method": "TaskDefinitionByOrigin",
        "param": {
            "attribute": Context.get_parameters("attribute"),
            "city": Context.get_parameters("city")
        }
    }

    task_allocation = {
        "method": "TaskAllocationByOrigin"
    }

    ll_job = LifelongLearning(estimator,
                              task_definition=task_definition,
                              task_relationship_discovery=None,
                              task_allocation=task_allocation,
                              task_remodeling=None,
                              inference_integrate=None,
                              task_update_decision=None,
                              unseen_task_allocation=None,
                              unseen_sample_recognition=None,
                              unseen_sample_re_recognition=None
                              )

    ll_job.train(train_data)
 ```
 Users configure the task definition algorithm and its parameters in a dictionary and then pass the dictionary to the `LifelongLearning` class when creating the lifelong learning job.
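How such a dictionary might be turned into an algorithm instance can be sketched as follows. This is an assumption about the general pattern, not a description of Sedna's internals: `ALGORITHMS`, `register`, `build_algorithm`, `TaskDefinitionByKeyword` and the parameter values are all hypothetical names used only for illustration.

```python
ALGORITHMS = {}  # hypothetical name -> class table standing in for a registry


def register(target):
    """Record a class under its own name so it can be looked up by string."""
    ALGORITHMS[target.__name__] = target
    return target


@register
class TaskDefinitionByKeyword:
    def __init__(self, **kwargs):
        # Same convention as the tutorial example: "attribute" is a
        # comma-separated string and "city" is a single keyword.
        self.attribute = kwargs.get("attribute", "").split(", ")
        self.city = kwargs.get("city")


def build_algorithm(config):
    """Instantiate the class named by config["method"] with config["param"]."""
    if config is None:
        return None  # nothing configured for this module
    algorithm_cls = ALGORITHMS[config["method"]]
    return algorithm_cls(**config.get("param", {}))


task_definition = {
    "method": "TaskDefinitionByKeyword",
    "param": {"attribute": "real, sim", "city": "berlin"},
}
algorithm = build_algorithm(task_definition)
print(algorithm.attribute, algorithm.city)
```

Keeping the method name and its parameters in plain data means that an algorithm can be swapped by changing the dictionary, or the environment parameters that feed it, without touching the training script.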


 ## 2.2 Run Customized Example

 **(1). Build worker images**

 First, modify `lifelong-learning-cityscapes-segmentation.Dockerfile` according to your development.

 Then build the images with the script [build_image.sh](https://github.com/kubeedge/sedna/blob/main/examples/build_image.sh).

 **(2). Start customized lifelong learning job**

 This process is similar to that in section `1.4`. Remember to modify the dataset (explained in section `1.3`) and to configure the base model and parameters in the YAML file as in section `1.4`.

 ## 2.3 Further Development

 In addition to developing on the lifelong learning case, users can also [develop the control plane](https://github.com/kubeedge/sedna/blob/main/docs/contributing/control-plane/development.md) of the Sedna project, or [add a new synergy feature](https://github.com/kubeedge/sedna/blob/51219027a0ec915bf3afb266dc5f9a7fb3880074/docs/contributing/control-plane/add-a-new-synergy-feature.md).