

* [Joint Inference](#joint-inference)
  * [Motivation](#motivation)
    * [Goals](#goals)
    * [Non-goals](#non-goals)
  * [Proposal](#proposal)
    * [Use Cases](#use-cases)
  * [Design Details](#design-details)
    * [CRD API Group and Version](#crd-api-group-and-version)
    * [Joint inference CRD](#joint-inference-crd)
    * [Joint inference type definition](#joint-inference-type-definition)
    * [Joint inference sample](#joint-inference-sample)
    * [Validation](#validation)
  * [Controller Design](#controller-design)
    * [Joint Inference Controller](#joint-inference-controller)
    * [Downstream Controller](#downstream-controller)
    * [Upstream Controller](#upstream-controller)
    * [Details of api between GM(cloud) and LC(edge)](#details-of-api-between-gmcloud-and-lcedge)
    * [Details of api between Worker(edge) and LC(edge)](#details-of-api-between-workeredge-and-lcedge)
    * [Flow of Joint Inference](#flow-of-joint-inference)
  * [Workers Communication](#workers-communication)
# Joint Inference
## Motivation
Inference on the edge achieves lower latency and higher throughput, while inference on the cloud achieves better precision.
Collaborative inference technology detects hard samples on the edge and sends them to the cloud for inference.
**In this way, inferring simple samples on the edge preserves latency and throughput, while inferring hard samples on the cloud improves the overall precision.**
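As an illustrative sketch only (not part of this proposal), the edge-side routing decision can look like the snippet below: run the small model on the edge, and if the result is judged a hard sample, send it to the cloud. The `isHardExample` helper and the 0.5 confidence threshold are assumptions; real hard-example algorithms (such as the `IBT` algorithm referenced later) are pluggable.

```go
package main

import "fmt"

// isHardExample decides whether an edge inference result should be
// re-inferred on the cloud. The confidence threshold is an illustrative
// assumption, standing in for a pluggable hard-example algorithm.
func isHardExample(confidence, threshold float64) bool {
	return confidence < threshold
}

func main() {
	for _, conf := range []float64{0.9, 0.3} {
		if isHardExample(conf, 0.5) {
			fmt.Printf("confidence %.1f: hard sample, send to cloud\n", conf)
		} else {
			fmt.Printf("confidence %.1f: use edge result\n", conf)
		}
	}
}
```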
### Goals
* Joint inference improves the inference precision without significantly increasing latency or reducing throughput.
## Proposal
We propose using Kubernetes Custom Resource Definitions (CRDs) to describe
the joint inference specification/status and a controller to synchronize these updates between edge and cloud.
![](./images/joint-inference-service-crd.png)
### Use Cases
* Users can create a joint inference service by providing a training script,
specifying the aggregation algorithm, configuring training hyperparameters,
and configuring training datasets.
* Users can get the joint inference status, including the counts of inferences at the edge/cloud.
## Design Details
### CRD API Group and Version
The `JointInferenceService` CRD will be namespace-scoped.
The table below summarizes the group, kind and API version details for the CRD.
* JointInferenceService

| Field      | Description           |
|------------|-----------------------|
| Group      | neptune.io            |
| APIVersion | v1alpha1              |
| Kind       | JointInferenceService |

### Joint inference CRD
![](./images/joint-inference-service-crd-details.png)
Below is the CustomResourceDefinition yaml for `JointInferenceService`:
[crd source](/build/crds/neptune/jointinferenceservice_v1alpha1.yaml)
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: jointinferenceservices.neptune.io
spec:
  group: neptune.io
  names:
    kind: JointInferenceService
    plural: jointinferenceservices
    shortNames:
      - jointinferenceservice
      - jis
  scope: Namespaced
  versions:
    - name: v1alpha1
      subresources:
        # status enables the status subresource.
        status: {}
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required:
                - edgeWorker
                - cloudWorker
              properties:
                edgeWorker:
                  type: object
                  required:
                    - name
                    - model
                    - nodeName
                    - hardExampleAlgorithm
                    - workerSpec
                  properties:
                    name:
                      type: string
                    model:
                      type: object
                      required:
                        - name
                      properties:
                        name:
                          type: string
                    nodeName:
                      type: string
                    hardExampleAlgorithm:
                      type: object
                      required:
                        - name
                      properties:
                        name:
                          type: string
                    workerSpec:
                      type: object
                      required:
                        - scriptDir
                        - scriptBootFile
                        - frameworkType
                        - frameworkVersion
                      properties:
                        scriptDir:
                          type: string
                        scriptBootFile:
                          type: string
                        frameworkType:
                          type: string
                        frameworkVersion:
                          type: string
                        parameters:
                          type: array
                          items:
                            type: object
                            required:
                              - key
                              - value
                            properties:
                              key:
                                type: string
                              value:
                                type: string
                cloudWorker:
                  type: object
                  required:
                    - name
                    - model
                    - nodeName
                    - workerSpec
                  properties:
                    name:
                      type: string
                    model:
                      type: object
                      required:
                        - name
                      properties:
                        name:
                          type: string
                    nodeName:
                      type: string
                    workerSpec:
                      type: object
                      required:
                        - scriptDir
                        - scriptBootFile
                        - frameworkType
                        - frameworkVersion
                      properties:
                        scriptDir:
                          type: string
                        scriptBootFile:
                          type: string
                        frameworkType:
                          type: string
                        frameworkVersion:
                          type: string
                        parameters:
                          type: array
                          items:
                            type: object
                            required:
                              - key
                              - value
                            properties:
                              key:
                                type: string
                              value:
                                type: string
            status:
              type: object
              properties:
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                      lastHeartbeatTime:
                        type: string
                        format: date-time
                      lastTransitionTime:
                        type: string
                        format: date-time
                      reason:
                        type: string
                      message:
                        type: string
                startTime:
                  type: string
                  format: date-time
                active:
                  type: integer
                failed:
                  type: integer
                metrics:
                  type: array
                  items:
                    type: object
                    properties:
                      key:
                        type: string
                      value:
                        type: string
      additionalPrinterColumns:
        - name: status
          type: string
          description: The status of the jointinference service
          jsonPath: ".status.conditions[-1].type"
        - name: active
          type: integer
          description: The number of active workers
          jsonPath: ".status.active"
        - name: failed
          type: integer
          description: The number of failed workers
          jsonPath: ".status.failed"
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
```
### Joint inference type definition
[go source](cloud/pkg/apis/neptune/v1alpha1/jointinferenceservice_types.go)
```go
package v1alpha1

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// JointInferenceService describes the data that a jointinferenceservice resource should have
type JointInferenceService struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata"`

	Spec   JointInferenceServiceSpec   `json:"spec"`
	Status JointInferenceServiceStatus `json:"status,omitempty"`
}

// JointInferenceServiceSpec is a description of a jointinferenceservice
type JointInferenceServiceSpec struct {
	EdgeWorker  EdgeWorker  `json:"edgeWorker"`
	CloudWorker CloudWorker `json:"cloudWorker"`
}

// EdgeWorker describes the data an edge worker should have
type EdgeWorker struct {
	Name                 string               `json:"name"`
	Model                SmallModel           `json:"model"`
	NodeName             string               `json:"nodeName"`
	HardExampleAlgorithm HardExampleAlgorithm `json:"hardExampleAlgorithm"`
	WorkerSpec           CommonWorkerSpec     `json:"workerSpec"`
}

// CloudWorker describes the data a cloud worker should have
type CloudWorker struct {
	Name       string           `json:"name"`
	Model      BigModel         `json:"model"`
	NodeName   string           `json:"nodeName"`
	WorkerSpec CommonWorkerSpec `json:"workerSpec"`
}

type SmallModel struct {
	Name string `json:"name"`
}

type BigModel struct {
	Name string `json:"name"`
}

type HardExampleAlgorithm struct {
	Name string `json:"name"`
}

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// JointInferenceServiceList is a list of JointInferenceServices.
type JointInferenceServiceList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata"`

	Items []JointInferenceService `json:"items"`
}

// JointInferenceServiceStatus represents the current state of a joint inference service.
type JointInferenceServiceStatus struct {
	// The latest available observations of a joint inference service's current state.
	// +optional
	Conditions []JointInferenceServiceCondition `json:"conditions,omitempty"`

	// Represents the time when the service was acknowledged by the service controller.
	// It is not guaranteed to be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	// +optional
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// The number of actively running workers.
	// +optional
	Active int32 `json:"active"`

	// The number of workers which reached phase Failed.
	// +optional
	Failed int32 `json:"failed"`

	// Metrics of the joint inference service.
	Metrics []Metric `json:"metrics,omitempty"`
}

type JointInferenceServiceConditionType string

// These are valid conditions of a service.
const (
	// JointInferenceServiceCondPending means the service has been accepted by the system,
	// but one or more of the workers has not been started.
	JointInferenceServiceCondPending JointInferenceServiceConditionType = "Pending"
	// JointInferenceServiceCondFailed means the service has failed its execution.
	JointInferenceServiceCondFailed JointInferenceServiceConditionType = "Failed"
	// JointInferenceServiceCondRunning means the service is running.
	JointInferenceServiceCondRunning JointInferenceServiceConditionType = "Running"
)

// JointInferenceServiceCondition describes the current state of a service.
type JointInferenceServiceCondition struct {
	// Type of service condition, Complete or Failed.
	Type JointInferenceServiceConditionType `json:"type"`
	// Status of the condition, one of True, False, Unknown.
	Status v1.ConditionStatus `json:"status"`
	// Last time the condition was checked.
	// +optional
	LastHeartbeatTime metav1.Time `json:"lastHeartbeatTime,omitempty"`
	// Last time the condition transitioned from one status to another.
	// +optional
	LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"`
	// (brief) reason for the condition's last transition.
	// +optional
	Reason string `json:"reason,omitempty"`
	// Human readable message indicating details about the last transition.
	// +optional
	Message string `json:"message,omitempty"`
}
```
#### Validation
[Open API v3 Schema based validation](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#validation) can be used to guard against bad requests.
Invalid values for fields (e.g. a string value for a boolean field) can be validated using this.
Here is a list of validations we need to support:
1. The `dataset` specified in the CRD should exist in k8s.
1. The `model` specified in the CRD should exist in k8s.
1. The edge node name specified in the CRD should exist in k8s.
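The existence checks above go beyond what the OpenAPI schema can express, so they need controller-side code. A minimal sketch follows; the `validateService` helper and the in-memory sets are illustrative assumptions standing in for real Kubernetes API lookups.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// validateService mirrors the checks listed above: the referenced model and
// edge node must already exist. The map arguments stand in for lookups
// against the k8s API server and are for illustration only.
func validateService(model, node string, knownModels, knownNodes map[string]bool) error {
	var errs []string
	if !knownModels[model] {
		errs = append(errs, fmt.Sprintf("model %q not found", model))
	}
	if !knownNodes[node] {
		errs = append(errs, fmt.Sprintf("node %q not found", node))
	}
	if len(errs) > 0 {
		return errors.New(strings.Join(errs, "; "))
	}
	return nil
}

func main() {
	models := map[string]bool{"small-model": true, "big-model": true}
	nodes := map[string]bool{"edge0": true}
	fmt.Println(validateService("small-model", "edge0", models, nodes))
	fmt.Println(validateService("tiny-model", "edge1", models, nodes))
}
```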
### Joint inference sample
```yaml
apiVersion: neptune.io/v1alpha1
kind: JointInferenceService
metadata:
  name: helmet-detection-demo
  namespace: default
spec:
  edgeWorker:
    name: "edgeworker"
    model:
      name: "small-model"
    nodeName: "edge0"
    hardExampleAlgorithm:
      name: "IBT"
    workerSpec:
      scriptDir: "/code"
      scriptBootFile: "edge_inference.py"
      frameworkType: "tensorflow"
      frameworkVersion: "1.18"
      parameters:
        - key: "nms_threshold"
          value: "0.6"
  cloudWorker:
    name: "work"
    model:
      name: "big-model"
    nodeName: "solar-corona-cloud"
    workerSpec:
      scriptDir: "/code"
      scriptBootFile: "cloud_inference.py"
      frameworkType: "tensorflow"
      frameworkVersion: "1.18"
      parameters:
        - key: "nms_threshold"
          value: "0.6"
```
## Controller Design
The joint inference controller starts three separate goroutines called the `upstream`, `downstream` and `joint-inference` controllers. These are not separate controllers as such, but are named here for clarity.
- joint inference: watches the updates of joint-inference-task CRDs, and creates the workers to complete the task.
- downstream: synchronizes the joint-inference updates from the cloud to the edge node.
- upstream: synchronizes the joint-inference updates from the edge to the cloud node.
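The three loops above can be launched with a pattern like the following sketch; the `Start` helper and its placeholder loop bodies are illustrative assumptions, not the actual controller code.

```go
package main

import (
	"fmt"
	"sync"
)

// Start launches each controller loop in its own goroutine and waits for
// them to return. In the real controller each loop would watch the cloud
// API server or the edge and run until shutdown.
func Start(loops map[string]func()) {
	var wg sync.WaitGroup
	for name, run := range loops {
		wg.Add(1)
		go func(name string, run func()) {
			defer wg.Done()
			fmt.Println("starting", name, "controller")
			run()
		}(name, run)
	}
	wg.Wait()
}

func main() {
	Start(map[string]func(){
		"joint-inference": func() {},
		"downstream":      func() {},
		"upstream":        func() {},
	})
}
```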
### Joint Inference Controller
![](./images/joint-inference-controller.png)
The joint-inference controller watches for updates of joint-inference tasks and the corresponding pods against the K8S API server.
Updates are categorized below along with the possible actions:

| Update Type | Action |
|-------------|--------|
| New Joint-inference-service created | Create the cloud/edge worker |
| Joint-inference-service deleted | NA. These workers will be deleted by the GM. |
| The corresponding pod created/running/completed/failed | Update the status of the joint-inference task. |
### Downstream Controller
![](./images/joint-inference-downstream-controller.png)
The downstream controller watches for joint-inference updates against the K8S API server.
Updates are categorized below along with the possible actions that the downstream controller can take:

| Update Type | Action |
|-------------|--------|
| New Joint-inference-service created | Sends the task information to LCs. |
| Joint-inference-service deleted | The controller sends the delete event to LCs. |
### Upstream Controller
![](./images/joint-inference-upstream-controller.png)
The upstream controller watches for joint-inference-task updates from the edge node and applies these updates against the API server in the cloud.
Updates are categorized below along with the possible actions that the upstream controller can take:

| Update Type | Action |
|-------------|--------|
| Joint-inference-service reported state updated | The controller appends the reported status of the Joint-inference-service in the cloud. |
### Details of api between GM(cloud) and LC(edge)
1. GM(downstream controller) syncs the task info to LC:
```go
// POST <namespace>/neptune/downstream/jointinferenceservices/<name>/insert
// body is the same as the task CRD of the k8s api, omitted here.
```
1. LC uploads the task status reported by the worker to GM(upstream controller):
```go
// POST <namespace>/neptune/upstream/jointinferenceservices/<name>/status

// JoinInferenceServiceStatus defines the status that is sent to the GlobalManager
type JoinInferenceServiceStatus struct {
	Phase  string  `json:"phase"`
	Status string  `json:"status"`
	Output *Output `json:"output"`
}

// Output defines task output information
type Output struct {
	Models   []Model   `json:"models"`
	TaskInfo *TaskInfo `json:"taskInfo"`
}

// Model defines the model information
type Model struct {
	Format string `json:"format"`
	URL    string `json:"url"`
}

// TaskInfo defines the task information
type TaskInfo struct {
	InferenceNumber   int     `json:"inferenceNumber"`
	HardExampleNumber int     `json:"hardExampleNumber"`
	UploadCloudRatio  float64 `json:"uploadCloudRatio"`
	StartTime         string  `json:"startTime"`
	CurrentTime       string  `json:"currentTime"`
}
```
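As a hedged sketch of how an LC might fill in and upload this status: `uploadCloudRatio` is the fraction of inferences escalated to the cloud, and the status body is POSTed to the upstream endpoint. The `ratio` and `uploadStatus` helpers and the base URL construction are illustrative assumptions; only the endpoint path comes from the API above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// TaskInfo is redeclared from the API above so this sketch compiles standalone.
type TaskInfo struct {
	InferenceNumber   int     `json:"inferenceNumber"`
	HardExampleNumber int     `json:"hardExampleNumber"`
	UploadCloudRatio  float64 `json:"uploadCloudRatio"`
	StartTime         string  `json:"startTime"`
	CurrentTime       string  `json:"currentTime"`
}

// ratio computes uploadCloudRatio from the counters; a zero inference
// count yields 0 to avoid division by zero.
func ratio(hard, total int) float64 {
	if total == 0 {
		return 0
	}
	return float64(hard) / float64(total)
}

// uploadStatus POSTs the status body to the GM upstream endpoint.
// The base URL is an assumption for illustration.
func uploadStatus(base, namespace, name string, body interface{}) error {
	buf, err := json.Marshal(body)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/%s/neptune/upstream/jointinferenceservices/%s/status", base, namespace, name)
	resp, err := http.Post(url, "application/json", bytes.NewReader(buf))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	info := TaskInfo{InferenceNumber: 1000, HardExampleNumber: 100}
	info.UploadCloudRatio = ratio(info.HardExampleNumber, info.InferenceNumber)
	fmt.Printf("uploadCloudRatio=%.2f\n", info.UploadCloudRatio)
}
```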
### Details of api between Worker(edge) and LC(edge)
1. Worker sends inference info to LC on the same edge node:
```
// POST /neptune/workers/<worker-name>/info
```
```json
{
    "name": "worker-name",
    "namespace": "default",
    "ownerName": "jointinferenceservice-name",
    "ownerKind": "jointinferenceservice",
    "kind": "inference",
    "status": "completed/failed/running",
    "taskInfo": {
        "inferenceNumber": 1000,
        "hardExampleNumber": 100,
        "uploadCloudRatio": 0.1,
        "startTime": "2020-11-03T08:39:22.517Z",
        "updateTime": "2020-11-03T08:50:22.517Z"
    }
}
```
### Flow of Joint Inference
- The flow of joint inference service creation:

![](./images/joint-inference-flow-creation.png)
## Workers Communication
![](./images/joint-inference-worker-communication.png)