* [Joint Inference](#joint-inference)
  * [Motivation](#motivation)
    * [Goals](#goals)
    * [Non-goals](#non-goals)
  * [Proposal](#proposal)
    * [Use Cases](#use-cases)
  * [Design Details](#design-details)
    * [CRD API Group and Version](#crd-api-group-and-version)
    * [Joint inference CRD](#joint-inference-crd)
    * [Joint inference type definition](#joint-inference-type-definition)
      * [Validation](#validation)
    * [Joint inference sample](#joint-inference-sample)
  * [Controller Design](#controller-design)
    * [Joint Inference Controller](#joint-inference-controller)
    * [Downstream Controller](#downstream-controller)
    * [Upstream Controller](#upstream-controller)
    * [Details of api between GM(cloud) and LC(edge)](#details-of-api-between-gmcloud-and-lcedge)
    * [Details of api between Worker(edge) and LC(edge)](#details-of-api-between-workeredge-and-lcedge)
    * [Flow of Joint Inference](#flow-of-joint-inference)
  * [Workers Communication](#workers-communication)
# Joint Inference
## Motivation
Inference on the edge can achieve lower latency and higher throughput, while inference on the cloud can achieve higher precision.
The collaborative inference technology detects hard samples on the edge and sends them to the cloud for inference.
**In this way, inferring simple samples on the edge ensures latency and throughput, while inferring hard samples on the cloud improves the overall precision.**
### Goals
* Joint inference improves the inference precision without significantly increasing latency or reducing throughput.
## Proposal
We propose using Kubernetes Custom Resource Definitions (CRDs) to describe
the joint inference specification/status and a controller to synchronize these updates between edge and cloud.

![](./images/joint-inference-service-crd.png)

### Use Cases
* Users can create a joint inference service by providing a training script,
specifying the aggregation algorithm, and configuring training hyperparameters
and training datasets.
* Users can get the joint inference status, including the counts of inferences performed at the edge/cloud.
## Design Details
### CRD API Group and Version
The `JointInferenceService` CRD will be namespace-scoped.
The table below summarizes the group, kind and API version details for the CRD; a small Go illustration follows the table.
* JointInferenceService

| Field      | Description           |
|------------|-----------------------|
| Group      | sedna.io              |
| APIVersion | v1alpha1              |
| Kind       | JointInferenceService |
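For reference, the same group/version/kind can be expressed with the Kubernetes `apimachinery` schema types. This is only an illustration of the table above, not code from the Sedna controllers:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// GroupVersionKind of the JointInferenceService CRD, per the table above.
	gvk := schema.GroupVersionKind{
		Group:   "sedna.io",
		Version: "v1alpha1",
		Kind:    "JointInferenceService",
	}
	fmt.Println(gvk) // sedna.io/v1alpha1, Kind=JointInferenceService
}
```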
### Joint inference CRD
see [crd source](/build/crds/sedna.io_jointinferenceservices.yaml)
### Joint inference type definition
see [go source](/pkg/apis/sedna/v1alpha1/jointinferenceservice_types.go)
#### Validation
[Open API v3 Schema based validation](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#validation) can be used to guard against bad requests.
Invalid values for fields (e.g. a string value for a boolean field) can be validated using this.
Here is a list of validations we need to support (an illustrative sketch follows the list):
1. The `dataset` specified in the CRD should exist in k8s.
1. The `model` specified in the CRD should exist in k8s.
1. The edge node name specified in the CRD should exist in k8s.
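To make the intent of these checks concrete, here is a minimal sketch; the `Lister` interface and function name are hypothetical, and a real implementation would query the K8S API server instead:

```go
package validation

import "fmt"

// Lister abstracts "does this named object exist in the cluster" for the
// referenced kinds. This interface is illustrative only.
type Lister interface {
	DatasetExists(namespace, name string) bool
	ModelExists(namespace, name string) bool
	NodeExists(name string) bool
}

// ValidateJointInferenceService sketches the three reference checks listed
// above; it is not the actual Sedna validation code.
func ValidateJointInferenceService(l Lister, namespace, dataset, model, edgeNode string) error {
	if !l.DatasetExists(namespace, dataset) {
		return fmt.Errorf("dataset %q not found in namespace %q", dataset, namespace)
	}
	if !l.ModelExists(namespace, model) {
		return fmt.Errorf("model %q not found in namespace %q", model, namespace)
	}
	if !l.NodeExists(edgeNode) {
		return fmt.Errorf("edge node %q not found", edgeNode)
	}
	return nil
}
```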
### Joint inference sample
see [sample source](/build/crd-samples/sedna/jointinferenceservice_v1alpha1.yaml)
## Controller Design
The joint inference controller starts three separate goroutines called the `upstream`, `downstream` and `joint-inference` controller. These are not separate controllers as such, but are named here for clarity (a wiring sketch follows the list).
- joint inference: watches the updates of joint-inference-task CRDs, and creates the workers to complete the task.
- downstream: synchronizes the joint-inference updates from the cloud to the edge node.
- upstream: synchronizes the joint-inference updates from the edge to the cloud node.
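For illustration, the three loops could be wired up roughly as below; the `Controller` type and method names are assumptions, not the actual Sedna source:

```go
package controller

// Controller groups the three cooperating loops described above.
type Controller struct{}

// runJointInference would watch joint-inference-task CRDs and create workers.
func (c *Controller) runJointInference(stop <-chan struct{}) { <-stop } // stub body

// runDownstream would sync joint-inference updates from cloud to edge.
func (c *Controller) runDownstream(stop <-chan struct{}) { <-stop } // stub body

// runUpstream would sync joint-inference updates from edge to cloud.
func (c *Controller) runUpstream(stop <-chan struct{}) { <-stop } // stub body

// Start launches the three loops as separate goroutines, mirroring the
// upstream/downstream/joint-inference split described above.
func (c *Controller) Start(stop <-chan struct{}) {
	go c.runJointInference(stop)
	go c.runDownstream(stop)
	go c.runUpstream(stop)
}
```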
### Joint Inference Controller
![](./images/joint-inference-controller.png)

The joint-inference controller watches for the updates of joint-inference tasks and the corresponding pods against the K8S API server.
Updates are categorized below along with the possible actions:

| Update Type | Action |
|-------------|--------|
| New Joint-inference-service Created | Create the cloud/edge worker. |
| Joint-inference-service Deleted | NA. These workers will be deleted by GM. |
| The corresponding pod created/running/completed/failed | Update the status of the joint-inference task. |
### Downstream Controller
![](./images/joint-inference-downstream-controller.png)

The downstream controller watches for joint-inference updates against the K8S API server.
Updates are categorized below along with the possible actions that the downstream controller can take:

| Update Type | Action |
|-------------|--------|
| New Joint-inference-service Created | Sends the task information to LCs. |
| Joint-inference-service Deleted | The controller sends the delete event to LCs. |
### Upstream Controller
![](./images/joint-inference-upstream-controller.png)

The upstream controller watches for joint-inference-task updates from the edge node and applies these updates against the API server in the cloud.
Updates are categorized below along with the possible actions that the upstream controller can take:

| Update Type | Action |
|-------------|--------|
| Joint-inference-service Reported State Updated | The controller appends the reported status of the Joint-inference-service in the cloud. |
### Details of api between GM(cloud) and LC(edge)
1. GM(downstream controller) syncs the task info to LC:
```go
// POST <namespace>/sedna/downstream/jointinferenceservices/<name>/insert
// body same as the task CRD of the k8s api, omitted here.
```
1. LC uploads the task status reported by the worker to GM(upstream controller); a client sketch follows the type definitions:
```go
// POST <namespace>/sedna/upstream/jointinferenceservices/<name>/status

// JoinInferenceServiceStatus defines the status sent to GlobalManager.
type JoinInferenceServiceStatus struct {
	Phase  string  `json:"phase"`
	Status string  `json:"status"`
	Output *Output `json:"output"`
}

// Output defines task output information.
type Output struct {
	Models   []Model   `json:"models"`
	TaskInfo *TaskInfo `json:"taskInfo"`
}

// Model defines the model information.
type Model struct {
	Format string `json:"format"`
	URL    string `json:"url"`
}

// TaskInfo defines the task information.
type TaskInfo struct {
	InferenceNumber   int     `json:"inferenceNumber"`
	HardExampleNumber int     `json:"hardExampleNumber"`
	UploadCloudRatio  float64 `json:"uploadCloudRatio"`
	StartTime         string  `json:"startTime"`
	CurrentTime       string  `json:"currentTime"`
}
```
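A minimal sketch of an LC-side client for this endpoint, using the `JoinInferenceServiceStatus` type above; the GM address, plain-HTTP scheme, and the `reportStatus` helper name are illustrative assumptions:

```go
import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// reportStatus is a hypothetical helper: it POSTs the task status to the GM
// upstream endpoint shown above. Assumes the types above are in scope.
func reportStatus(gmAddr, namespace, name string, status *JoinInferenceServiceStatus) error {
	body, err := json.Marshal(status)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("http://%s/%s/sedna/upstream/jointinferenceservices/%s/status",
		gmAddr, namespace, name)
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("report status failed: %s", resp.Status)
	}
	return nil
}
```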
### Details of api between Worker(edge) and LC(edge)
1. Worker sends inference info to LC on the same edge node (a Go mirror of this payload follows the example):
```
// POST /sedna/workers/<worker-name>/info
```
```json
{
    "name": "worker-name",
    "namespace": "default",
    "ownerName": "jointinferenceservice-name",
    "ownerKind": "jointinferenceservice",
    "kind": "inference",
    "status": "completed/failed/running",
    "taskInfo": {
        "inferenceNumber": 1000,
        "hardExampleNumber": 100,
        "uploadCloudRatio": 0.1,
        "startTime": "2020-11-03T08:39:22.517Z",
        "updateTime": "2020-11-03T08:50:22.517Z"
    }
}
```
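For readers following the Go definitions earlier, the same payload can be mirrored as Go types; the struct names below are illustrative, and the JSON example above is the source of truth:

```go
// WorkerInfo mirrors the worker report payload above; names are assumptions.
type WorkerInfo struct {
	Name      string          `json:"name"`
	Namespace string          `json:"namespace"`
	OwnerName string          `json:"ownerName"`
	OwnerKind string          `json:"ownerKind"`
	Kind      string          `json:"kind"`
	Status    string          `json:"status"`
	TaskInfo  *WorkerTaskInfo `json:"taskInfo"`
}

// WorkerTaskInfo mirrors the nested "taskInfo" object. Note that this API
// reports "updateTime", while the LC-to-GM TaskInfo reports "currentTime".
type WorkerTaskInfo struct {
	InferenceNumber   int     `json:"inferenceNumber"`
	HardExampleNumber int     `json:"hardExampleNumber"`
	UploadCloudRatio  float64 `json:"uploadCloudRatio"`
	StartTime         string  `json:"startTime"`
	UpdateTime        string  `json:"updateTime"`
}
```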
### Flow of Joint Inference
- The flow of joint inference service creation:

![](./images/joint-inference-flow-creation.png)

## Workers Communication
![](./images/joint-inference-worker-communication.png)