- [Operation and maintenance UI development](#operation-and-maintenance-ui-development)
- [Motivation](#motivation)
- [Goals](#goals)
- [Proposal](#proposal)
- [Use Cases](#use-cases)
- [Design Details](#design-details)
- [Log](#log)
- [Metric](#metric)
- [Visualization](#visualization)
- [NetWork](#network)
- [Cache](#cache)
- [Install](#install)

# Operation and maintenance UI development

## Motivation

At present, users can only access information about lifelong learning (logs, status, and metrics) through the command line, which is inconvenient and unfriendly. This proposal provides a Grafana-based UI for metrics, log collection, status monitoring, and management, so that users can get this information through a graphical interface.

### Goals

- Support metrics collection and visualization in lifelong learning
- Support unified management and search of application logs
- Support the management and status monitoring of lifelong learning in a dashboard

## Proposal

We propose using Grafana, Loki, and Prometheus to display the metrics, logs, and status of lifelong learning jobs. Prometheus collects the metrics and status, and Loki collects the logs.

### Use cases

- Users can search logs by specifying a pod and keywords in the log control panel
- Users can view the historical state or metrics of any component in a time-series graph
- Users can view the status of any CRD (Model, DataSet, Lifelonglearningjob) by specifying its name

## Design details

### Architecture

![](./images/lifelong-learning-ops-architecture.png)

1. Log

   In DaemonSet mode, Promtail is deployed on each node (cloud and edge) and monitors the log storage directory on that node (e.g. /var/log/containers). When a log is updated, Promtail pushes the new entries to Loki. These are the same logs that `kubectl logs` returns. Loki is deployed on cloud nodes and stores the collected logs persistently.

2. Metric

   kube-state-metrics generates metrics from Kubernetes API objects without modification, so the features it provides have the same grade of stability as the Kubernetes API objects themselves. In KubeEdge, kube-state-metrics obtains component information the same way `kubectl get` does, so any information that `kubectl` can obtain about edge nodes is also available to kube-state-metrics. Both Prometheus and kube-state-metrics are deployed on cloud nodes. Prometheus pulls the metrics from kube-state-metrics and stores them persistently.

3. Visualization

   Grafana is deployed on the cloud node and visualizes the data persisted in Prometheus and Loki, which are also located on the cloud node.

### Log

At present, the log files are under the directory /var/log/containers.

Promtail has two collection modes. The following table compares their features:

|                            | DaemonSet                 | Sidecar                                   |
|----------------------------|---------------------------|-------------------------------------------|
| Source                     | Stdout + some files       | Files                                     |
| Log classification storage | Mapped by container/path  | Each Pod can be configured separately     |
| Multi-tenant isolation     | Isolated by configuration | Isolated by Pod                           |
| Resource occupancy         | Low                       | High                                      |
| Customizability            | Low                       | High, each Pod is configured individually |
| Applicable scene           | Applicable scene          | Large, mixed cluster                      |
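
Once the logs are in Loki, searching by pod and keyword comes down to a LogQL query with a stream selector and a line filter. The following is a minimal sketch of how such a query and the matching `query_range` request URL could be built. It assumes the Promtail configuration attaches a `pod` label to each stream, and the base URL and pod name in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def build_logql(pod: str, keyword: str) -> str:
    """Build a LogQL query: select one pod's stream, keep lines containing keyword."""

    def esc(text: str) -> str:
        # Escape backslashes and double quotes so the value stays a valid string literal.
        return text.replace("\\", "\\\\").replace('"', '\\"')

    # Assumption: streams carry a `pod` label (depends on the Promtail relabel config).
    return f'{{pod="{esc(pod)}"}} |= "{esc(keyword)}"'


def build_query_url(base_url: str, pod: str, keyword: str, limit: int = 100) -> str:
    """Build the URL for Loki's range-query endpoint (no request is sent here)."""
    params = {"query": build_logql(pod, keyword), "limit": limit}
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"
```

For example, `build_logql("lifelong-job-abc", "error")` yields `{pod="lifelong-job-abc"} |= "error"`, which can also be pasted into Grafana's Explore view.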

### Metric

#### CRD metric

1. LLJob Status

| **Metric**           | **Description**                                                                  |
|----------------------|----------------------------------------------------------------------------------|
| JobStatus            | Status of each job, Enum(True, False, Unknown)                                   |
| StageConditionStatus | Status of each stage, Enum(Waiting, Ready, Starting, Running, Completed, Failed) |
| LearningStage        | Stages of lifelong learning, Enum(Train, Eval, Deploy)                           |
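
Prometheus samples are numeric, so an Enum-valued status such as StageConditionStatus needs a numeric encoding. One common convention, used by kube-state-metrics for metrics such as `kube_pod_status_phase`, is to emit one series per possible state, with value 1 for the current state and 0 for the rest. The sketch below illustrates that convention only. The metric name `lljob_stage_condition_status` and the label names are hypothetical and not defined by this proposal.

```python
from typing import List

# Enum values taken from the StageConditionStatus row above.
STAGE_CONDITIONS = ("Waiting", "Ready", "Starting", "Running", "Completed", "Failed")


def stage_condition_samples(job: str, current: str) -> List[str]:
    """Render one exposition-format line per state: 1 for the current state, else 0."""
    if current not in STAGE_CONDITIONS:
        raise ValueError(f"unknown stage condition: {current!r}")
    return [
        # Hypothetical metric and label names, for illustration only.
        f'lljob_stage_condition_status{{job_name="{job}",condition="{state}"}} '
        f"{int(state == current)}"
        for state in STAGE_CONDITIONS
    ]
```

With this encoding, a Grafana panel can plot the series over time and show when a job moved from one state to the next.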

2. Dataset status

| **Metric**       | **Description**                     |
|------------------|-------------------------------------|
| NumberOfSamples  | The number of samples               |
| StageTrainNumber | The number of samples used by train |
| StageEvalNumber  | The number of samples used by eval  |

3. Model

| **Metric** | **Description**                           |
|------------|-------------------------------------------|
| Key        | The value corresponding to the custom Key |

4. Task

| **Metric**        | **Description** |
|-------------------|-----------------|
| ActiveTasksNumber | The number of running tasks |
| TotalTasksNumber  | The number of total tasks |
| TasksInfo         | Tasks Info<br/>{ <br/>location: "edge &#124; cloud",<br/>work: "training &#124; infering",<br/>startTime: "xxxx-xx-xx xx:xx:xx",<br/>duration: "xxxx s/min/h",<br/>status: "active &#124; dead"<br/>} |

5. Worker

| **Metric**          | **Description** |
|---------------------|-----------------|
| ActiveWorkersNumber | The number of running workers |
| TotalWorkerNumber   | The total number of workers |
| WorkersInfo         | Workers Info<br/>{<br/>location: "edge &#124; cloud",<br/>status: "waiting &#124; active &#124; dead"<br/>} |

6. Number of samples for inference (known tasks, unknown tasks)

| **Metric**           | **Description**                                        |
|----------------------|--------------------------------------------------------|
| InferenceDoneNumber  | The number of samples for which inference is completed |
| InferenceReadyNumber | The number of samples ready for inference              |
| InferenceErrorNumber | The number of samples with inference errors            |
| InferenceRate        | Inference completion percentage: 0-100 (%)             |
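
The table does not say how InferenceRate is derived from the three counters. One plausible reading, shown here purely as an assumption, is the share of completed samples among all samples counted by InferenceDoneNumber, InferenceReadyNumber, and InferenceErrorNumber:

```python
def inference_rate(done: int, ready: int, error: int) -> float:
    """Return the completion percentage (0-100) under the assumed formula.

    Assumption: total samples = done + ready + error. The proposal does not
    define the denominator, so treat this as a sketch, not the specification.
    """
    if min(done, ready, error) < 0:
        raise ValueError("sample counts must be non-negative")
    total = done + ready + error
    if total == 0:
        # No samples yet: report 0 rather than dividing by zero.
        return 0.0
    return 100.0 * done / total
```

If samples with errors should not count toward the total, only the denominator changes.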

7. Knowledge Base server

| **Metric** | **Description**               |
|------------|-------------------------------|
| KBServer   | Address of the knowledge base |

8. Training stage

| **Metric**          | **Description**                                  |
|---------------------|--------------------------------------------------|
| LearningRate        | The learning rate at the training stage          |
| EpochNum            | The number of epochs at the training stage       |
| BatchSize           | The batch size at the training stage             |
| TrainSampleNum      | The number of samples used at the training stage |
| TrainOutputModelUrl | The output URL of the model after training       |

9. Evaluation stage

| **Metric**          | **Description**                                                   |
|---------------------|-------------------------------------------------------------------|
| EvalSampleNum       | The number of samples used at the eval stage                      |
| Score               | Score of the task being evaluated                                 |
| ModelFilterOperator | Type of operator for threshold judgment, Enum(>, <, =, >=, <=)    |
| Threshold           | Threshold for judging whether to deploy the model                 |
| EvalModelUrl        | The URL of the model at the eval stage                            |
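
Score, ModelFilterOperator, and Threshold together describe the deploy decision made after evaluation. The sketch below shows one way to read them, assuming the comparison is applied as `Score <operator> Threshold`. The function and dictionary names are hypothetical.

```python
import operator

# Maps each ModelFilterOperator enum value to the matching comparison function.
MODEL_FILTER_OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


def should_deploy(score: float, model_filter_operator: str, threshold: float) -> bool:
    """Return True when `score <operator> threshold` holds (assumed operand order)."""
    try:
        compare = MODEL_FILTER_OPERATORS[model_filter_operator]
    except KeyError:
        raise ValueError(f"unsupported operator: {model_filter_operator!r}") from None
    return compare(score, threshold)
```

For instance, `should_deploy(0.92, ">", 0.9)` is true, while `should_deploy(0.9, ">", 0.9)` is false because the comparison is strict.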

10. Deploy stage

| **Metric**     | **Description**               |
|----------------|-------------------------------|
| DeployStatus   | Enum(Waiting, Ok, NotOk)      |
| DeployModelUrl | The URL for deploying a model |

#### Example

```yaml
kind: CustomResourceStateMetrics
spec:
  resources:
    -
      groupVersionKind:
        group: myteam.io
        kind: "Foo"
        version: "v1"
      labelsFromPath:
        name: [ metadata, name ]
      metrics:
        - name: "active_count"
          help: "Number Foo Bars active"
          each:
            path: [status, active]
            labelFromKey: type
            labelsFromPath:
              bar: [bar]
            value: [count]
          commonLabels:
            custom_metric: "yes"
          labelsFromPath:
            "*": [metadata, labels] # copy all labels from CR labels
            foo: [metadata, labels, foo] # copy a single label (overrides *)
        - name: "other_count"
          each:
            path: [status, other]
          errorLogV: 5
```
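
In the configuration above, `labelsFromPath` with `name: [ metadata, name ]` attaches the object's name to every series as a `name` label. That label is what lets a dashboard show the status of a single CRD by name. Below is a minimal sketch of the corresponding PromQL selector and Prometheus instant-query URL. The metric name is supplied by the caller because the exposed name depends on the kube-state-metrics version and configuration, and the names used in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def select_by_name(metric: str, name: str) -> str:
    """Build a PromQL selector that matches series whose `name` label equals name."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'{metric}{{name="{escaped}"}}'


def instant_query_url(prometheus_url: str, metric: str, name: str) -> str:
    """Build the URL for Prometheus' instant-query endpoint (no request is sent here)."""
    params = {"query": select_by_name(metric, name)}
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{urlencode(params)}"
```

For example, `select_by_name("kube_foo_active_count", "my-job")` returns `kube_foo_active_count{name="my-job"}`. The same expression can be used in a Grafana panel with a dashboard variable in place of the fixed name.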

### Visualization

Grafana can be used to visualize the monitoring data of Sedna. The visualization in Grafana contains two parts: `metrics` and `logs`.

- Visualization of metrics:

  ![](./images/lifelong-learning-ops-metrics-grafana.png)

- Visualization of logs:

  ![](./images/lifelong-learning-ops-log-grafana.png)

## Install

Installation uses Helm, with values modified from the Loki-Stack chart.

```yaml
loki:
  enabled: true
  persistence:
    enabled: true
    accessModes:
      - ReadWriteOnce
    size: 10Gi
promtail:
  enabled: true
  config:
    lokiAddress: http://{{ .Release.Name }}:3100/loki/api/v1/push
grafana:
  enabled: true
prometheus:
  enabled: true
  isDefault: false
  nodeExporter:
    enabled: true
  kubeStateMetrics:
    enabled: true
```

Related JSON files for the Grafana dashboards will also be provided. The corresponding dashboards can be imported according to user needs.