## Horizontal Pod Autoscaling (HPA) for Sedna Joint Inference
- [Project Requirements](#project-requirements)
  - [Description](#description)
  - [Expected Outcomes](#expected-outcomes)
  - [Recommended Skills](#recommended-skills)
- [KubeEdge Elastic Inference Example](#kubeedge-elastic-inference-example)
  - [Prerequisites](#prerequisites)
  - [Why](#why)
  - [Related Reference Links](#related-reference-links)
  - [Tips](#tips)
  - [Deployment Template](#deployment-template)
  - [Configure HPA](#configure-hpa)
- [Sedna Integrates HPA](#sedna-integrates-hpa)
  - [Why Need HPA](#why-need-hpa)
  - [Overall Architecture](#overall-architecture)
  - [Specific Implementation](#specific-implementation)
  - [The Joint Inference API Adds Support for the Definition of HPA](#the-joint-inference-api-adds-support-for-the-definition-of-hpa)
  - [Sedna Joint Inference Example](#sedna-joint-inference-example)
  - [Actual Demonstration Effect](#actual-demonstration-effect)
### Elastic Inference for Deep Learning Models Using KubeEdge
#### Project Requirements
##### Description
The rapid advancement of AI has led to the widespread application of deep learning models across various fields. However, the resource demands of model inference tasks can fluctuate significantly, especially during peak periods, straining the system's computing capabilities. To cope with this varying load, we propose an elastic inference solution that leverages KubeEdge and Horizontal Pod Autoscaling (HPA) to scale inference tasks dynamically.
KubeEdge is an edge computing framework that extends Kubernetes' capabilities to edge devices, allowing applications to be deployed and managed on edge nodes. With KubeEdge, we can distribute inference tasks across edge devices and cloud resources, achieving efficient resource utilization and task processing.
The core of collaborative inference lies in coordinating computing resources across devices so that inference tasks can be allocated dynamically based on the current load. When the system detects an increase in load, the HPA mechanism automatically scales out the number of inference instances or enhances resource configurations to meet demand. Conversely, when the load decreases, resource allocation is scaled down to reduce operational costs. This approach keeps resource allocation optimal while maintaining inference performance.
##### Expected Outcomes
1. An elastic-scaling AI inference example built on KubeEdge.
2. Elastic scaling for joint inference tasks, developed on KubeEdge and Sedna, with an accompanying example.
3. A blog post describing the work.
##### Recommended Skills
1. Theoretical and practical knowledge of edge and cloud computing, specifically using the KubeEdge and Sedna frameworks.
2. Experience in deploying and managing Kubernetes, including configuring and tuning the HPA mechanism.
3. Expertise in developing and tuning deep learning models.
4. Programming experience, particularly in Python and Go.
#### KubeEdge Elastic Inference Example
##### Prerequisites
- It needs to be used in conjunction with [edgemesh](https://github.com/kubeedge/edgemesh).
- The cluster needs to have [metrics-server](https://github.com/kubernetes-sigs/metrics-server) installed.
- The edge nodes need to be configured with [metrics reporting](https://kubeedge.io/zh/docs/advanced/metrics); a minimal config sketch follows this list.
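In practice, metrics reporting means enabling the stream tunnel on both sides. The following is a minimal sketch, assuming the current KubeEdge config layout (`modules.cloudStream` / `modules.edgeStream`); consult the metrics documentation linked above for the exact fields in your KubeEdge version:
```yaml
# cloudcore.yaml (cloud side): enable cloudStream so that metrics
# requests can be tunneled down to edge nodes.
modules:
  cloudStream:
    enable: true
---
# edgecore.yaml (edge side): enable edgeStream so that metrics-server
# can scrape the edge node and the pods running on it.
modules:
  edgeStream:
    enable: true
```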
##### Why
- Without edgemesh, when a deployment has more than one instance, there is no way to provide load-balancing on the edge side, which makes edge-side HPA of little use.
- The HPA capability needs to monitor the pods' metrics and then scale in and out dynamically according to the user's HPA configuration, which is why metrics-server and edge metrics reporting are required.
##### Related Reference Links
- AI Project Address: [LlamaEdge](https://github.com/LlamaEdge/LlamaEdge)
- HPA Documentation: [horizontal-pod-autoscale](https://kubernetes.io/zh-cn/docs/tasks/run-application/horizontal-pod-autoscale/)
- HPA Example: [hpa-example](https://kubernetes.io/zh-cn/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/)
##### Tips
- Allocate more than 4 CPU cores to the container if possible; otherwise inference will be very slow. At least 1 GB of memory is required.
- The inference service is exposed on port 8080.
##### Deployment Template
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: wasmedge-qwen-2-0-5b-allminilm-2
  name: wasmedge-qwen-2-0-5b-allminilm-2
  namespace: default
spec:
  selector:
    matchLabels:
      app: wasmedge-qwen-2-0-5b-allminilm-2
  template:
    metadata:
      labels:
        app: wasmedge-qwen-2-0-5b-allminilm-2
    spec:
      containers:
        - image: docker.io/secondstate/qwen-2-0.5b-allminilm-2:latest
          imagePullPolicy: IfNotPresent
          name: qwen-2-0-5b-container
          resources:
            limits:
              cpu: 3000m
            requests:
              cpu: 3000m
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - edgenode
      schedulerName: default-scheduler
      nodeName: nvidia-edge-node
      tolerations:
        - key: node-role.kubernetes.io/edge
          operator: Exists
          effect: NoSchedule
# Note:
# 1. The replicas field is not required.
# 2. The resources field is required. If it is not restricted, the resource utilization rate cannot be calculated.
```
The solution above runs the WasmEdge server directly inside a regular container.
**ToDo:** Another option is to run it on a Wasm runtime directly. We can try this later: when building the Wasm image, a Wasm label needs to be added so that containerd selects the Wasm runtime as the underlying runtime.
##### Configure HPA
- Configure with `kubectl`
```shell
kubectl autoscale deployment wasmedge-qwen-2-0-5b-allminilm-2 --cpu-percent=50 --min=1 --max=10
```
- Configure with `yaml`
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa
  namespace: default
spec:
  maxReplicas: 10
  metrics:
    - resource:
        name: cpu
        target:
          averageUtilization: 50
          type: Utilization
      type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wasmedge-qwen-2-0-5b-allminilm-2
```
**Tips**: Applications using `HPA` should be exposed through a `Service`. Otherwise, even after `HPA` scales out, if the service is still accessed via `hostNetwork`, all traffic lands on a single machine and cannot be load-balanced across the other `pods`.
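For example, a plain ClusterIP `Service` in front of the deployment template above would let edgemesh spread requests across all replicas that HPA creates. This is an illustrative sketch, not part of the original template; port 8080 is taken from the Tips section:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: wasmedge-qwen-2-0-5b-allminilm-2
  namespace: default
spec:
  selector:
    app: wasmedge-qwen-2-0-5b-allminilm-2  # must match the pod labels in the deployment
  ports:
    - port: 8080        # port exposed to clients
      targetPort: 8080  # inference service port inside the container
```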
#### Sedna Integrates `HPA`
##### Why Need `HPA`
In large-model inference scenarios, the resource requirements of inference tasks usually grow significantly as the number of requests increases. In the current cloud-edge joint-inference architecture, a fixed single-instance configuration has difficulty coping with such fluctuations, leading to under-utilized resources or performance bottlenecks. By configuring `HPA` (Horizontal Pod Autoscaler) on the `deployment`, the number of inference instances can be adjusted automatically according to real-time traffic, dynamically expanding or shrinking resources. This mechanism adds instances during high-load periods and removes them during low-load periods, improving concurrent processing capability, optimizing resource utilization, and keeping the inference service efficient and scalable.
##### Overall Architecture
![Joint inference HPA architecture](./images/joint-inference-hpa.png)
##### Specific Implementation
Sedna's HPA support is implemented in the `sedna` joint-inference controller.
- With the help of `deployment`, `HPA` is configured to scale its instances dynamically.
- Using `deployment` allows load balancing through a `Service`.
- The cloud and the edge can independently choose whether to enable HPA mode. When enabled, the joint-inference controller automatically creates the HPA resources for the cloud or the edge, which can be viewed with `kubectl get hpa -n {ns}`; a sketch of such a generated object follows this list.
- Since HPA graduated to a stable API (`autoscaling/v2`) in Kubernetes 1.23, the Kubernetes API dependency needs to be upgraded to 1.23.
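For illustration, the object the controller generates for the edge worker of the example further below would look roughly like this sketch (the `hpa-<deployment-name>` naming and the ownerReference follow the controller code in the next section; `uid` fields are omitted):
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-helmet-detection-inference-example-deployment-edge
  namespace: default
  ownerReferences:                 # ties the HPA's lifecycle to its JointInferenceService
    - apiVersion: sedna.io/v1alpha1
      kind: JointInferenceService
      name: helmet-detection-inference-example
      controller: true
spec:
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  scaleTargetRef:                  # the edge worker deployment created by the controller
    apiVersion: apps/v1
    kind: Deployment
    name: helmet-detection-inference-example-deployment-edge
```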
##### The Joint Inference API Adds Support for the Definition of HPA
```go
// HPA describes the desired functionality of the HorizontalPodAutoscaler.
type HPA struct {
	// +optional
	MinReplicas *int32 `json:"minReplicas,omitempty"`
	MaxReplicas int32  `json:"maxReplicas"`
	// +optional
	Metrics []autoscalingv2.MetricSpec `json:"metrics,omitempty"`
	// +optional
	Behavior *autoscalingv2.HorizontalPodAutoscalerBehavior `json:"behavior,omitempty"`
}

// EdgeWorker describes the data an edge worker should have
type EdgeWorker struct {
	Model             SmallModel         `json:"model"`
	HardExampleMining HardExampleMining  `json:"hardExampleMining"`
	Template          v1.PodTemplateSpec `json:"template"`
	// HPA describes the desired functionality of the HorizontalPodAutoscaler.
	// +optional
	HPA *HPA `json:"hpa"`
}

// CloudWorker describes the data a cloud worker should have
type CloudWorker struct {
	Model    BigModel           `json:"model"`
	Template v1.PodTemplateSpec `json:"template"`
	// HPA describes the desired functionality of the HorizontalPodAutoscaler.
	// +optional
	HPA *HPA `json:"hpa"`
}
```
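Since `Metrics` and `Behavior` reuse the upstream `autoscaling/v2` types, any standard metric spec or scaling policy can be passed straight through the Sedna API. Below is a hedged sketch of an `hpa` stanza that also sets a scale-down policy; the `behavior` values are illustrative and not part of the original example:
```yaml
hpa:
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling in
      policies:
        - type: Pods
          value: 1                      # remove at most one replica...
          periodSeconds: 60             # ...per minute
```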
**According to this API definition, the joint-inference controller adds creation, update, and deletion logic for HPA resources.**
```go
func CreateHPA(client kubernetes.Interface, object CommonInterface, kind, scaleTargetRefName, workerType string, hpa *sednav1.HPA) error {
	hpaName := "hpa-" + scaleTargetRefName
	newHPA := &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{
			Name:      hpaName,
			Namespace: object.GetNamespace(),
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(object, object.GroupVersionKind()),
			},
			Labels: generateLabels(object, workerType),
		},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			MaxReplicas: hpa.MaxReplicas,
			Metrics:     hpa.Metrics,
			MinReplicas: hpa.MinReplicas,
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       kind,
				Name:       scaleTargetRefName,
			},
			Behavior: hpa.Behavior,
		},
	}
	_, err := client.AutoscalingV2().HorizontalPodAutoscalers(object.GetNamespace()).Create(context.TODO(), newHPA, metav1.CreateOptions{})
	if err != nil {
		return fmt.Errorf("failed to create hpa for %s %s, err: %s", kind, hpaName, err)
	}
	return nil
}

func UpdateHPA(client kubernetes.Interface, object CommonInterface, kind, scaleTargetRefName, workerType string, hpa *sednav1.HPA) error {
	// get the existing HPA
	hpaName := "hpa-" + scaleTargetRefName
	existingHPA, err := client.AutoscalingV2().HorizontalPodAutoscalers(object.GetNamespace()).Get(context.TODO(), hpaName, metav1.GetOptions{})
	if err != nil {
		// create the HPA if it is not found
		if errors.IsNotFound(err) {
			klog.Info("hpa not found, creating new hpa...")
			return CreateHPA(client, object, kind, scaleTargetRefName, workerType, hpa)
		}
		return fmt.Errorf("failed to get hpa for %s %s, err: %s", kind, hpaName, err)
	}
	// copy the desired state onto the existing object
	existingHPA.ObjectMeta.Labels = generateLabels(object, workerType)
	existingHPA.ObjectMeta.OwnerReferences = []metav1.OwnerReference{
		*metav1.NewControllerRef(object, object.GroupVersionKind()),
	}
	existingHPA.Spec.MaxReplicas = hpa.MaxReplicas
	existingHPA.Spec.MinReplicas = hpa.MinReplicas
	existingHPA.Spec.Metrics = hpa.Metrics
	existingHPA.Spec.ScaleTargetRef = autoscalingv2.CrossVersionObjectReference{
		APIVersion: "apps/v1",
		Kind:       kind,
		Name:       scaleTargetRefName,
	}
	existingHPA.Spec.Behavior = hpa.Behavior
	// push the update
	_, err = client.AutoscalingV2().HorizontalPodAutoscalers(object.GetNamespace()).Update(context.TODO(), existingHPA, metav1.UpdateOptions{})
	if err != nil {
		return fmt.Errorf("failed to update hpa for %s %s, err: %s", kind, hpaName, err)
	}
	return nil
}

func DeleteHPA(client kubernetes.Interface, namespace, name string) error {
	// check whether the HPA exists
	_, err := client.AutoscalingV2().HorizontalPodAutoscalers(namespace).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		// return nil if the HPA is not found
		if errors.IsNotFound(err) {
			return nil
		}
		return fmt.Errorf("failed to get hpa %s in namespace %s, err: %s", name, namespace, err)
	}
	// delete the HPA
	err = client.AutoscalingV2().HorizontalPodAutoscalers(namespace).Delete(context.TODO(), name, metav1.DeleteOptions{})
	if err != nil {
		return fmt.Errorf("failed to delete hpa %s in namespace %s, err: %s", name, namespace, err)
	}
	return nil
}

// create/update the HPA alongside the worker
func (c *Controller) createOrUpdateWorker(service *sednav1.JointInferenceService, workerType string, bigModelHost string, bigModelPort int32, create bool) error {
	...
	var hpa *sednav1.HPA
	...
	if create {
		...
		// create the HPA
		if hpa != nil {
			return runtime.CreateHPA(c.kubeClient, service, "Deployment", deploymentName, workerType, hpa)
		}
	} else {
		...
		// update the HPA, or delete it if the hpa field was removed
		if hpa != nil {
			return runtime.UpdateHPA(c.kubeClient, service, "Deployment", deploymentName, workerType, hpa)
		} else {
			return runtime.DeleteHPA(c.kubeClient, service.GetNamespace(), "hpa-"+deploymentName)
		}
	}
	return err
}
```
##### Sedna Joint Inference Example
- Edge and cloud workers can each choose whether to use HPA; it can be configured for both at once or for either one alone.
```yaml
apiVersion: sedna.io/v1alpha1
kind: JointInferenceService
metadata:
  name: helmet-detection-inference-example
  namespace: default
spec:
  edgeWorker:
    hpa:
      maxReplicas: 2
      metrics:
        - resource:
            name: cpu
            target:
              averageUtilization: 50
              type: Utilization
          type: Resource
      minReplicas: 1
    model:
      name: "helmet-detection-inference-little-model"
    hardExampleMining:
      name: "IBT"
      parameters:
        - key: "threshold_img"
          value: "0.9"
        - key: "threshold_box"
          value: "0.9"
    template:
      spec:
        nodeName: edge1i70kbjod
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - image: kubeedge/sedna-example-joint-inference-helmet-detection-little:v0.5.0
            imagePullPolicy: IfNotPresent
            name: little-model
            env: # user defined environments
              - name: input_shape
                value: "416,736"
              - name: "video_url"
                value: "rtsp://localhost/video"
              - name: "all_examples_inference_output"
                value: "/data/output"
              - name: "hard_example_cloud_inference_output"
                value: "/data/hard_example_cloud_inference_output"
              - name: "hard_example_edge_inference_output"
                value: "/data/hard_example_edge_inference_output"
            resources: # user defined resources
              requests:
                memory: 64M
                cpu: 50m
              limits:
                memory: 2Gi
                cpu: 500m
            volumeMounts:
              - name: outputdir
                mountPath: /data/
        volumes: # user defined volumes
          - name: outputdir
            hostPath:
              # user must create the directory in host
              path: /joint_inference/output
              type: Directory
  cloudWorker:
    hpa:
      maxReplicas: 5
      metrics:
        - resource:
            name: cpu
            target:
              averageUtilization: 20
              type: Utilization
          type: Resource
      minReplicas: 1
    model:
      name: "helmet-detection-inference-big-model"
    template:
      spec:
        nodeName: worker-01
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - image: kubeedge/sedna-example-joint-inference-helmet-detection-big:v0.5.0
            name: big-model
            imagePullPolicy: IfNotPresent
            env: # user defined environments
              - name: "input_shape"
                value: "544,544"
            resources: # user defined resources
              requests:
                cpu: 1024m
                memory: 2Gi
              limits:
                cpu: 1024m
                memory: 2Gi
```
##### Actual Demonstration Effect
```shell
[root@master-01 ~]# kubectl get hpa -w
NAME                                                      REFERENCE                                                         TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
hpa-helmet-detection-inference-example-deployment-cloud   Deployment/helmet-detection-inference-example-deployment-cloud   37%/20%    1         5         3          92s
hpa-helmet-detection-inference-example-deployment-edge    Deployment/helmet-detection-inference-example-deployment-edge    348%/50%   1         2         2          92s
hpa-helmet-detection-inference-example-deployment-cloud   Deployment/helmet-detection-inference-example-deployment-cloud   37%/20%    1         5         4          106s
hpa-helmet-detection-inference-example-deployment-edge    Deployment/helmet-detection-inference-example-deployment-edge    535%/50%   1         2         2          106s
hpa-helmet-detection-inference-example-deployment-cloud   Deployment/helmet-detection-inference-example-deployment-cloud   18%/20%    1         5         4          2m1s
hpa-helmet-detection-inference-example-deployment-edge    Deployment/helmet-detection-inference-example-deployment-edge    769%/50%   1         2         2          2m1s
hpa-helmet-detection-inference-example-deployment-cloud   Deployment/helmet-detection-inference-example-deployment-cloud   12%/20%    1         5         4          2m16s
[root@master-01 jointinference]# kubectl get po
NAME                                                              READY   STATUS    RESTARTS   AGE
helmet-detection-inference-example-deployment-cloud-7dffd47c6fl   1/1     Running   0          4m34s
helmet-detection-inference-example-deployment-cloud-7dffd4dpnnh   1/1     Running   0          2m49s
helmet-detection-inference-example-deployment-cloud-7dffd4f4dtw   1/1     Running   0          4m19s
helmet-detection-inference-example-deployment-cloud-7dffd4kcvwd   1/1     Running   0          5m20s
helmet-detection-inference-example-deployment-cloud-7dffd4shk86   1/1     Running   0          5m50s
helmet-detection-inference-example-deployment-edge-7b6575c52s7k   1/1     Running   0          5m50s
helmet-detection-inference-example-deployment-edge-7b6575c59g48   1/1     Running   0          5m20s
```