* [Incremental Learning](#incremental-learning)
  * [Motivation](#motivation)
    * [Goals](#goals)
    * [Non\-goals](#non-goals)
  * [Proposal](#proposal)
    * [Use Cases](#use-cases)
  * [Design Details](#design-details)
    * [CRD API Group and Version](#crd-api-group-and-version)
    * [Incremental learning CRD](#incremental-learning-crd)
    * [Incremental learning job type definition](#incremental-learning-job-type-definition)
    * [Incremental learning job sample](#incremental-learning-job-sample)
    * [Validation](#validation)
  * [Controller Design](#controller-design)
    * [Incremental Learning Controller](#incremental-learning-controller)
    * [Downstream Controller](#downstream-controller)
    * [Upstream Controller](#upstream-controller)
    * [Details of api between GM(cloud) and LC(edge)](#details-of-api-between-gmcloud-and-lcedge)
  * [Workers Communication](#workers-communication)

# Incremental Learning

## Motivation

Data is continuously generated at the edge. Traditionally, this data is collected manually and the model is periodically retrained in the cloud to improve its performance. This approach consumes a lot of human effort, and models are updated infrequently. Incremental learning lets users continuously monitor newly generated data and configure trigger rules that decide when to start training, evaluation, and deployment automatically, so that model performance improves continuously.

### Goals

* Automatically retrain, evaluate, and update models based on the data generated at the edge.
* Support time triggers, sample-size triggers, and precision-based triggers.
* Support manual triggering of training, evaluation, and model update.
* Support hard sample discovery on unlabeled data to reduce the manual labeling workload.
* Support lifelong learning that preserves historical knowledge to avoid frequent retraining and re-fine-tuning, and that handles samples not covered by the historical knowledge base.
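
To make the trigger goals more concrete, here is a minimal sketch of how a sample-size or precision-based trigger could be evaluated. The `Condition` type, its fields, and the metric names are hypothetical and are not taken from the Sedna API; a time trigger is not covered here.

```go
package main

import "fmt"

// Condition is a hypothetical threshold rule such as "num_of_samples >= 500".
type Condition struct {
	Metric    string  // name of the observed value (assumed)
	Operator  string  // one of ">", ">=", "<", "<=", "=="
	Threshold float64 // value to compare against
}

// Satisfied reports whether the observed metrics meet the condition.
// A missing metric or an unknown operator never satisfies it.
func (c Condition) Satisfied(observed map[string]float64) bool {
	v, ok := observed[c.Metric]
	if !ok {
		return false
	}
	switch c.Operator {
	case ">":
		return v > c.Threshold
	case ">=":
		return v >= c.Threshold
	case "<":
		return v < c.Threshold
	case "<=":
		return v <= c.Threshold
	case "==":
		return v == c.Threshold
	}
	return false
}

func main() {
	// Sample-size trigger: start training once 500 new samples have arrived.
	train := Condition{Metric: "num_of_samples", Operator: ">=", Threshold: 500}
	// Precision-based trigger: deploy only if precision improved enough.
	deploy := Condition{Metric: "precision_delta", Operator: ">", Threshold: 0.1}

	fmt.Println(train.Satisfied(map[string]float64{"num_of_samples": 620}))
	fmt.Println(deploy.Satisfied(map[string]float64{"precision_delta": 0.05}))
}
```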

## Proposal

We propose using Kubernetes Custom Resource Definitions (CRDs) to describe
the incremental learning specification/status and a controller to synchronize these updates between edge and cloud.

![](./images/incremental-learning-job-crd.png)

### Use Cases

* Users can create incremental learning jobs by providing training scripts and training datasets, and by configuring training hyperparameters and the training and deployment triggers.

## Design Details

There are three stages in an incremental learning job: train, eval, and deploy.

Each stage can be in one of the following states:
1. Waiting: waiting for the trigger to be satisfied, i.e. waiting to train/eval/deploy
1. Ready: the corresponding trigger is satisfied, and the stage is ready to train/eval/deploy
1. Starting: the corresponding stage is starting
1. Running: the corresponding stage is running
1. Failed: the corresponding stage failed
1. Completed: the corresponding stage completed

![](./images/incremental-learning-state-machine.png)
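
The per-stage states above can be sketched as a small state machine. The transition table below is an assumption based on the order of the list (Waiting, Ready, Starting, Running, then Completed or Failed); the state-machine figure remains the reference, and the names `State` and `CanTransition` are hypothetical.

```go
package main

import "fmt"

// State is a hypothetical name for the per-stage states listed above.
type State string

const (
	Waiting   State = "Waiting"
	Ready     State = "Ready"
	Starting  State = "Starting"
	Running   State = "Running"
	Failed    State = "Failed"
	Completed State = "Completed"
)

// allowed is an assumed transition table for a single stage.
// Failed and Completed have no outgoing transitions in this sketch.
var allowed = map[State][]State{
	Waiting:  {Ready},
	Ready:    {Starting},
	Starting: {Running, Failed},
	Running:  {Completed, Failed},
}

// CanTransition reports whether the assumed table allows moving from one
// state to another within a stage.
func CanTransition(from, to State) bool {
	for _, s := range allowed[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Waiting, Ready))   // allowed by the table
	fmt.Println(CanTransition(Waiting, Running)) // skips Ready and Starting
}
```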

### CRD API Group and Version

The `IncrementalLearningJob` CRD will be namespace-scoped.
The table below summarizes the group, kind and API version details for the CRD.

* IncrementalLearningJob

| Field                 | Description             |
|-----------------------|-------------------------|
|Group                  | sedna.io                |
|APIVersion             | v1alpha1                |
|Kind                   | IncrementalLearningJob  |
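
As a small illustration of how the group, version and kind from the table combine, the sketch below builds the `apiVersion` and `kind` fields that appear at the top of a Kubernetes object. The `TypeMeta` struct here is a local stand-in written for this example, not the type from the Kubernetes libraries.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Values taken from the table above.
const (
	Group   = "sedna.io"
	Version = "v1alpha1"
	Kind    = "IncrementalLearningJob"
)

// TypeMeta is a local stand-in for the two identifying fields that every
// Kubernetes object carries.
type TypeMeta struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
}

// JobTypeMeta returns the identifying fields of an IncrementalLearningJob.
// In Kubernetes, apiVersion is written as "<group>/<version>".
func JobTypeMeta() TypeMeta {
	return TypeMeta{APIVersion: Group + "/" + Version, Kind: Kind}
}

func main() {
	b, err := json.Marshal(JobTypeMeta())
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(b))
}
```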

### Incremental learning CRD

See the [crd source](/build/crds/sedna.io_incrementallearningjobs.yaml) for details.

### Incremental learning job type definition

See the [golang source](/pkg/apis/sedna/v1alpha1/incrementallearningjob_types.go) for details.

#### Validation

[Open API v3 Schema based validation](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#validation) can be used to guard against bad requests.
Invalid values for fields (for example, a string value for a boolean field) can be rejected this way.

Here is a list of validations we need to support:
1. The `dataset` specified in the crd should exist in k8s.
1. The `model` specified in the crd should exist in k8s.
1. The edgenode name specified in the crd should exist in k8s.
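
The three existence checks above could look roughly like the following sketch. It assumes the lookups are abstracted behind simple in-memory sets; in a real controller they would be queries against the Kubernetes API server. The `Refs` and `Cluster` types and their fields are hypothetical.

```go
package main

import "fmt"

// Refs holds the names a job refers to. Field names are hypothetical.
type Refs struct {
	Dataset string
	Model   string
	Node    string
}

// Cluster is a stand-in for lookups against the Kubernetes API server.
type Cluster struct {
	Datasets map[string]bool
	Models   map[string]bool
	Nodes    map[string]bool
}

// Validate returns one message per missing reference.
// An empty result means all references exist.
func Validate(r Refs, c Cluster) []string {
	var problems []string
	if !c.Datasets[r.Dataset] {
		problems = append(problems, fmt.Sprintf("dataset %q not found", r.Dataset))
	}
	if !c.Models[r.Model] {
		problems = append(problems, fmt.Sprintf("model %q not found", r.Model))
	}
	if !c.Nodes[r.Node] {
		problems = append(problems, fmt.Sprintf("node %q not found", r.Node))
	}
	return problems
}

func main() {
	cluster := Cluster{
		Datasets: map[string]bool{"train-data": true},
		Models:   map[string]bool{"initial-model": true},
		Nodes:    map[string]bool{"edge-node-1": true},
	}
	job := Refs{Dataset: "train-data", Model: "missing-model", Node: "edge-node-1"}
	for _, p := range Validate(job, cluster) {
		fmt.Println(p)
	}
}
```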

### Incremental learning job sample

See the [source](/build/crd-samples/sedna/incrementallearningjob_v1alpha1.yaml) for an example.

## Controller Design

The incremental learning controller starts three separate goroutines called the `upstream`, `downstream` and `incrementallearningjob` controllers.<br/>
These are not separate controllers as such but named here for clarity.
- incremental learning: watches for updates of incremental-learning job CRDs and creates the workers according to the state machine.
- downstream: synchronizes the incremental-learning-job updates from the cloud to the edge node.
- upstream: synchronizes the incremental-learning-job updates from the edge node to the cloud.
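
The goroutine layout described above can be sketched as follows. This is only an illustration of starting three named loops and stopping them together through a shared context; the `Loop` type and the helper functions are hypothetical, and the loop bodies are placeholders for the real watch and sync logic.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Loop is a hypothetical name for one long-running controller loop.
type Loop struct {
	Name string
	Run  func(ctx context.Context)
}

// RunAll starts every loop in its own goroutine and blocks until all return.
func RunAll(ctx context.Context, loops []Loop) {
	var wg sync.WaitGroup
	for _, l := range loops {
		wg.Add(1)
		go func(l Loop) {
			defer wg.Done()
			l.Run(ctx)
		}(l)
	}
	wg.Wait()
}

// StartAndStop runs one placeholder loop per name, cancels the shared context
// once all loops have started, and returns how many loops ran.
func StartAndStop(names []string) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var (
		mu      sync.Mutex
		started int
		ready   sync.WaitGroup
	)
	ready.Add(len(names))

	loops := make([]Loop, 0, len(names))
	for _, name := range names {
		loops = append(loops, Loop{Name: name, Run: func(ctx context.Context) {
			mu.Lock()
			started++
			mu.Unlock()
			ready.Done()
			<-ctx.Done() // a real loop would watch and sync here
		}})
	}

	// Cancel once every loop has started, so that RunAll can return.
	go func() {
		ready.Wait()
		cancel()
	}()

	RunAll(ctx, loops)
	return started
}

func main() {
	n := StartAndStop([]string{"incrementallearningjob", "downstream", "upstream"})
	fmt.Println("loops started:", n)
}
```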

### Incremental Learning Controller

![](./images/incremental-learning-controller.png)

The incremental-learning controller watches the K8S API server for updates of incremental-learning jobs and the corresponding pods.<br/>
Updates are categorized below along with the possible actions:

| Update Type                   | Action                                        |
|-------------------------------|---------------------------------------------- |
|New Incremental-learning-job Created | Wait for the train trigger to be satisfied.|
|Incremental-learning-job Deleted | NA. The workers will be deleted by [k8s gc](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/).|
|The Status of Incremental-learning-job Updated | Create the train/eval worker if it's ready.|
|The corresponding pod created/running/completed/failed | Update the status of incremental-learning job.|
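
The table above can be read as a dispatch on the update type. The sketch below encodes it directly; the `Event` type, its string values, and the returned action descriptions are hypothetical simplifications, not the controller's actual types.

```go
package main

import "fmt"

// Event is a hypothetical, simplified view of the updates in the table above.
type Event struct {
	Kind  string // "JobCreated", "JobDeleted", "JobStatusUpdated" or "PodChanged"
	Stage string // "train", "eval" or "deploy"
	State string // state of the stage, e.g. "Ready"
}

// Decide returns a description of the action taken for an event.
func Decide(e Event) string {
	switch e.Kind {
	case "JobCreated":
		return "wait for train trigger"
	case "JobDeleted":
		return "none (workers are garbage-collected)"
	case "JobStatusUpdated":
		// Only the train and eval stages get a worker, and only when ready.
		if e.State == "Ready" && (e.Stage == "train" || e.Stage == "eval") {
			return "create " + e.Stage + " worker"
		}
		return "none"
	case "PodChanged":
		return "update job status"
	}
	return "unknown event"
}

func main() {
	events := []Event{
		{Kind: "JobCreated"},
		{Kind: "JobStatusUpdated", Stage: "train", State: "Ready"},
		{Kind: "PodChanged", Stage: "train", State: "Running"},
		{Kind: "JobDeleted"},
	}
	for _, e := range events {
		fmt.Printf("%-17s -> %s\n", e.Kind, Decide(e))
	}
}
```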

### Downstream Controller

![](./images/incremental-learning-downstream-controller.png)

The downstream controller watches the K8S API server for incremental-learning job updates.<br/>
Updates are categorized below along with the possible actions that the downstream controller can take:

| Update Type                   | Action                                        |
|-------------------------------|---------------------------------------------- |
|New Incremental-learning-job Created | The controller sends the job information to LCs.|
|Incremental-learning-job Deleted | The controller sends the delete event to LCs.|
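
A minimal sketch of the downstream behaviour, assuming the channel to an LC is hidden behind an interface. The `Message` envelope, the `Sender` interface, and the operation names `insert` and `delete` are assumptions made for this example; the in-memory `recorder` only stands in for a real transport.

```go
package main

import "fmt"

// Message is a hypothetical envelope for what GM sends to an LC.
type Message struct {
	Operation string // "insert" or "delete" (assumed names)
	Namespace string
	Name      string
}

// Sender abstracts the channel from GM to the LC on a given edge node.
type Sender interface {
	Send(node string, m Message) error
}

// SyncJob forwards a job event to the LC on the job's node.
// Created jobs are sent as "insert", deleted jobs as "delete".
func SyncJob(s Sender, node, namespace, name string, deleted bool) error {
	op := "insert"
	if deleted {
		op = "delete"
	}
	return s.Send(node, Message{Operation: op, Namespace: namespace, Name: name})
}

// recorder is an in-memory Sender used for the demo.
type recorder struct{ sent []string }

func (r *recorder) Send(node string, m Message) error {
	r.sent = append(r.sent, fmt.Sprintf("%s: %s %s/%s", node, m.Operation, m.Namespace, m.Name))
	return nil
}

func main() {
	r := &recorder{}
	for _, deleted := range []bool{false, true} {
		if err := SyncJob(r, "edge-node-1", "default", "example-job", deleted); err != nil {
			fmt.Println("sync failed:", err)
			return
		}
	}
	for _, line := range r.sent {
		fmt.Println(line)
	}
}
```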

### Upstream Controller

![](./images/incremental-learning-upstream-controller.png)

The upstream controller watches for the incremental-learning job updates from the edge node and applies these updates against the API server in the cloud.<br/>
Updates are categorized below along with the possible actions that the upstream controller can take:

| Update Type                        | Action                                        |
|-------------------------------     |---------------------------------------------- |
|Incremental-learning-job Reported State Updated | The controller appends the status reported by LC to the job in the cloud. |
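
The append behaviour in the table above could be sketched like this. The `Report` and `JobStatus` types are hypothetical simplifications of the job status; a real controller would write the result back through the Kubernetes API.

```go
package main

import "fmt"

// Report is a hypothetical, simplified status reported by an LC.
type Report struct {
	Stage  string // "train", "eval" or "deploy"
	Status string // e.g. "running", "completed", "failed"
}

// JobStatus is a stand-in for the status section of the job in the cloud.
type JobStatus struct {
	Conditions []Report
}

// Apply appends the reported status to the job's conditions.
func (s *JobStatus) Apply(r Report) {
	s.Conditions = append(s.Conditions, r)
}

// Latest returns the most recent report, or false if none was recorded.
func (s *JobStatus) Latest() (Report, bool) {
	if len(s.Conditions) == 0 {
		return Report{}, false
	}
	return s.Conditions[len(s.Conditions)-1], true
}

func main() {
	var status JobStatus
	status.Apply(Report{Stage: "train", Status: "running"})
	status.Apply(Report{Stage: "train", Status: "completed"})

	if last, ok := status.Latest(); ok {
		fmt.Printf("%d reports, latest: %s %s\n", len(status.Conditions), last.Stage, last.Status)
	}
}
```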

### Details of api between GM(cloud) and LC(edge)

1. GM (downstream controller) syncs the job info to LC:

```go
// POST <namespace>/incrementallearningjobs/<job-name>
// The body is the same as the job CRD in the k8s API, omitted here.
```

1. LC uploads the job status reported by the worker to GM (upstream controller):

```go
// POST <namespace>/incrementallearningjobs/<job-name>/status

// WorkerMessage defines the message reported by the training worker. It is sent to GM.
type WorkerMessage struct {
	Phase  string        `json:"phase"`
	Status string        `json:"status"`
	Output *WorkerOutput `json:"output"`
}

// WorkerOutput defines the output reported by the worker.
type WorkerOutput struct {
	Models []*Model `json:"models"`
	// The OwnerInfo type is not shown in this document.
	OwnerInfo *OwnerInfo `json:"ownerInfo"`
}

// Model defines the model information
type Model struct {
	Format string `json:"format"`
	URL    string `json:"url"`
	// Including the metrics, e.g. precision/recall
	Metrics map[string]float64 `json:"metrics"`
}

// TaskInfo defines the task information
type TaskInfo struct {
	// Current training round
	CurrentRound int    `json:"currentRound"`
	UpdateTime   string `json:"updateTime"`
}
```
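
As a rough illustration of the status upload, the sketch below serializes a `WorkerMessage` and builds the corresponding POST request. Several things here are assumptions: the leading slash in the path, the GM base URL `http://gm.example:9000`, the sample values, and the omission of the `OwnerInfo` field, whose type is not shown in this document. The request is only constructed, not sent.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Model, WorkerOutput and WorkerMessage repeat the types above so that the
// example is self-contained. OwnerInfo is left out of this sketch.
type Model struct {
	Format  string             `json:"format"`
	URL     string             `json:"url"`
	Metrics map[string]float64 `json:"metrics"`
}

type WorkerOutput struct {
	Models []*Model `json:"models"`
}

type WorkerMessage struct {
	Phase  string        `json:"phase"`
	Status string        `json:"status"`
	Output *WorkerOutput `json:"output"`
}

// StatusPath builds the status endpoint path for a job.
// The leading slash is an assumption.
func StatusPath(namespace, job string) string {
	return fmt.Sprintf("/%s/incrementallearningjobs/%s/status", namespace, job)
}

// NewStatusRequest builds (but does not send) the POST request that carries
// the worker message as a JSON body.
func NewStatusRequest(gmURL, namespace, job string, msg WorkerMessage) (*http.Request, error) {
	body, err := json.Marshal(msg)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, gmURL+StatusPath(namespace, job), bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	msg := WorkerMessage{
		Phase:  "train",
		Status: "completed",
		Output: &WorkerOutput{Models: []*Model{
			{Format: "pb", URL: "/models/new.pb", Metrics: map[string]float64{"precision": 0.9}},
		}},
	}
	// The base URL is a placeholder for illustration only.
	req, err := NewStatusRequest("http://gm.example:9000", "default", "example-job", msg)
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)
}
```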

### The flows of incremental learning job

- Flow of the job creation:

![](./images/incremental-learning-flow-creation.png)

- Flow of the `train` stage:

![](./images/incremental-learning-flow-train-stage.png)

- Flow of the `eval` stage:

![](./images/incremental-learning-flow-eval-stage.png)

- Flow of the `deploy` stage:

![](./images/incremental-learning-flow-deploy-stage.png)

## Workers Communication

Workers do not need to communicate with each other.