* [Observability Management](#observability-management)
  * [Motivation](#motivation)
    * [Goals](#goals)
  * [Proposal](#proposal)
  * [Design Details](#design-details)
    * [Monitoring Metrics](#monitoring-metrics)
      * [Common System Metrics](#common-system-metrics)
      * [Algorithm Metrics (Designed Individually For Each Task)](#algorithm-metrics-designed-individually-for-each-task)
        * [Joint Inference](#joint-inference)
        * [Incremental Learning](#incremental-learning)
        * [Federated Learning](#federated-learning)
    * [Collecting Logs](#collecting-logs)
    * [Display](#display)
  * [Key Deliverable](#key-deliverable)
  * [Roadmap](#roadmap)

# Observability Management

## Motivation

Currently, after creating edge-cloud synergy AI tasks with Sedna, users can check the status, parameters and metrics of those tasks only via the command line.
This proposal adds observability management that displays logs and metrics, so that tasks can be monitored in real time and users can easily check their status, parameters and metrics.

### Goals

* The metrics and status of Sedna's components, such as the GlobalManager, LocalControllers and Workers, can be monitored.
* Parameters of edge-cloud synergy AI tasks, such as the inference count, can be collected and displayed in the observability management.
* Logs of all pods created by Sedna can be collected, managed and displayed.
* The collected observability data can be displayed on Grafana clearly and attractively.

## Proposal

We propose using Prometheus and Loki to collect observability data: Prometheus for metrics and Loki for logs.
The collected data is then displayed with Grafana.

## Design Details

![](./images/observability-management-architecture.png)

### Monitoring Metrics

Sedna consists of the GlobalManager, LocalControllers and Workers, which together keep Sedna running.
Observability management monitors these components from several aspects.

#### Common System Metrics

The running status of cluster resource objects such as pods, services and deployments, on both edge and cloud, can be monitored with kube-state-metrics.

| **Metric**           | **Description**                                       |
|----------------------|-------------------------------------------------------|
| PodStatus            | Running / Pending / Failed                            |
| ContainerStatus      | Running / Waiting / Terminated                        |
| K8sJobStatus         | Succeeded / Failed                                    |

For Sedna specifically, the running status of CRDs such as JointInferenceService, IncrementalLearningJob and FederatedLearningJob can also be monitored.

| **Metric**           | **Description**                                           |
|----------------------|-----------------------------------------------------------|
| StageConditionStatus | Waiting / Ready / Starting / Running / Completed / Failed |
| StartTime            | The start time of the task                                |
| CompletionTime       | The completion time of the task                           |
| SampleNum            | The number of samples                                     |

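
A status such as StageConditionStatus has no natural numeric value, so it has to be encoded before Prometheus can store it. One common approach, which kube-state-metrics itself uses for pod phases (`kube_pod_status_phase` with a `phase` label), is to emit one 0/1 gauge sample per possible state. The sketch below renders that encoding in the Prometheus text exposition format. The metric name `sedna_stage_condition_status` and the label names are assumptions made for illustration; the proposal does not define them.

```python
# The six states come from the StageConditionStatus row of the table above.
STAGE_STATES = ["Waiting", "Ready", "Starting", "Running", "Completed", "Failed"]

# Hypothetical metric name; the proposal does not fix an exposition name.
METRIC = "sedna_stage_condition_status"


def render_stage_status(job_kind: str, job_name: str, current: str) -> str:
    """Render one gauge sample per possible state; only the current state is 1."""
    if current not in STAGE_STATES:
        raise ValueError(f"unknown stage condition: {current!r}")
    lines = [
        f"# HELP {METRIC} Current stage condition of a Sedna task.",
        f"# TYPE {METRIC} gauge",
    ]
    for state in STAGE_STATES:
        value = 1 if state == current else 0
        # Label names (kind, name, status) are illustrative assumptions.
        labels = f'kind="{job_kind}",name="{job_name}",status="{state}"'
        lines.append(f"{METRIC}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_stage_status("IncrementalLearningJob", "helmet-detection", "Running"))
```

With this encoding, a dashboard can select the current state by filtering for samples whose value is 1.
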
#### Algorithm Metrics (Designed Individually For Each Task)

For edge-cloud synergy AI tasks, we create customized exporters that collect the metrics each type of task needs.

##### Joint Inference

| **Metric**           | **Description**                                           |
|----------------------|-----------------------------------------------------------|
| EdgeInferenceCount   | The number of inferences performed at the edge            |
| CloudInferenceCount  | The number of inferences performed in the cloud           |
| HardSampleNum        | The number of hard samples                                |

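
As a rough illustration of what a customized exporter for the joint-inference table could track, the sketch below keeps the three values as monotonically increasing counters and renders them in the Prometheus text exposition format. It uses only the standard library. The metric names are hypothetical, and the sketch assumes that every request is inferred at the edge and that each hard sample is forwarded to the cloud exactly once.

```python
import threading


class JointInferenceCounters:
    """Thread-safe counters mirroring the joint-inference metrics table."""

    # Hypothetical exposition names for the three metrics in the table.
    NAMES = {
        "edge": "sedna_joint_inference_edge_inference_total",
        "cloud": "sedna_joint_inference_cloud_inference_total",
        "hard": "sedna_joint_inference_hard_sample_total",
    }

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values = {key: 0 for key in self.NAMES}

    def record(self, is_hard_sample: bool) -> None:
        """Count one edge inference; a hard sample also counts one cloud inference."""
        with self._lock:
            self._values["edge"] += 1
            if is_hard_sample:
                # Assumption: each hard sample is re-inferred in the cloud once.
                self._values["hard"] += 1
                self._values["cloud"] += 1

    def render(self) -> str:
        """Return the counters in Prometheus text exposition format."""
        with self._lock:
            snapshot = dict(self._values)
        lines = []
        for key, name in self.NAMES.items():
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {snapshot[key]}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    counters = JointInferenceCounters()
    for hard in (False, True, False):
        counters.record(is_hard_sample=hard)
    print(counters.render())
```

Serving the output of `render()` from an HTTP `/metrics` endpoint would let Prometheus scrape it like any other target.
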
##### Incremental Learning

| **Metric**                | **Description**                                                                                   |
|---------------------------|---------------------------------------------------------------------------------------------------|
| LearningRate              | The learning rate in the training stage                                                           |
| EpochNum                  | The number of epochs in the training stage                                                        |
| BatchSize                 | The batch size in the training stage                                                              |
| TrainSampleNum            | The number of samples used in the training stage                                                  |
| EvalSampleNum             | The number of samples used in the eval stage                                                      |
| TaskStage                 | Training Stage / Eval Stage / Deploying Stage                                                     |
| CurrentStatus             | The status of the current stage, such as running or waiting (training and eval stages only)       |
| IterationNo               | The current iteration number of incremental learning                                              |
| IterationInferenceCount   | The number of inferences in the current iteration                                                 |
| IterationHardSampleNum    | The number of hard samples in the current iteration                                               |
| EvalNewModelUrl           | The URL of the new model in the eval stage                                                        |
| AccuracyMetricForNewModel | mAP / Precision / Recall / F1-score for the new model                                             |
| AccuracyMetricForOldModel | mAP / Precision / Recall / F1-score for the old model                                             |
| TrainOutputModelUrl       | The output URL of the model after training                                                        |
| DeployModelUrl            | The URL of the model to deploy                                                                    |
| DeployStatus              | Waiting / OK / Not OK                                                                             |
| TrainRemainingTime        | The remaining time of the training stage                                                          |
| TrainLoss                 | The loss in the training stage                                                                    |
| EvalRemainingTime         | The remaining time of the eval stage                                                              |

##### Federated Learning

| **Metric**              | **Description**                                           |
|-------------------------|-----------------------------------------------------------|
| TrainNodes              | The nodes participating in training                       |
| TrainStatus             | The current training status                               |
| NodeSampleNum           | The number of samples at each node                        |
| IterationNo             | The current iteration number of federated learning        |
| AggregationNo           | The current aggregation number of federated learning      |

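
Per-node values such as NodeSampleNum fit naturally into a single metric with one labeled series per node. The sketch below shows one possible rendering in the Prometheus text exposition format, including the escaping that the format requires for label values (backslash, double quote and newline). The metric name and the label names `job_name` and `node` are assumptions; `job_name` is used rather than `job` because Prometheus sets a `job` label of its own at scrape time.

```python
from typing import Dict


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline for a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_node_samples(job_name: str, samples_per_node: Dict[str, int]) -> str:
    """Render one gauge sample per training node, sorted by node name."""
    name = "sedna_federated_learning_node_sample_num"  # hypothetical name
    lines = [f"# TYPE {name} gauge"]
    for node, count in sorted(samples_per_node.items()):
        labels = (
            f'job_name="{escape_label_value(job_name)}",'
            f'node="{escape_label_value(node)}"'
        )
        lines.append(f"{name}{{{labels}}} {count}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_node_samples("surface-defect", {"edge-2": 120, "edge-1": 80}))
```

The number of series returned for this metric then also gives the count of participating nodes, which relates to TrainNodes.
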
### Collecting Logs

We plan to use Loki to collect logs from all pods created by Sedna, such as the LocalControllers, the GlobalManager and the workers derived from tasks.

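
Loki stores logs as streams identified by a set of labels, with each entry being a timestamp and a line. To make that shape concrete, the sketch below builds the JSON body accepted by Loki's push endpoint (`POST /loki/api/v1/push`). In a real deployment a log agent would normally send this on behalf of the pods; the label names and values used here (`namespace`, `pod`) are assumptions and depend on how the agent is configured.

```python
import json
import time
from typing import Dict, List, Optional


def build_loki_push_payload(
    labels: Dict[str, str],
    log_lines: List[str],
    now_ns: Optional[int] = None,
) -> str:
    """Build the JSON body for Loki's push API from one labeled stream."""
    # Loki expects each timestamp as a string of Unix epoch nanoseconds.
    timestamp = str(time.time_ns() if now_ns is None else now_ns)
    payload = {
        "streams": [
            {
                "stream": labels,
                "values": [[timestamp, line] for line in log_lines],
            }
        ]
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Hypothetical labels for a worker pod created by a Sedna task.
    body = build_loki_push_payload(
        {"namespace": "sedna", "pod": "example-worker"},
        ["training started", "epoch 1 finished"],
    )
    print(body)
```

Keeping the label set small and stable (component, namespace, pod) is what makes later LogQL selection by component straightforward.
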
### Display

Grafana supports customized queries over the data collected by Prometheus and Loki, written in PromQL and LogQL respectively.
For metrics, data can be displayed as line graphs, pie charts, histograms and so on.
For logs, queries can show all matching log lines as well as the number of matches over time.

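
The kinds of queries a Grafana panel would issue can be sketched as follows. The example builds request URLs for the Prometheus instant-query endpoint (`/api/v1/query`) and the Loki range-query endpoint (`/loki/api/v1/query_range`), with one PromQL and two LogQL expressions. The server addresses and the `sedna` namespace label are assumptions; `kube_pod_status_phase` is the pod-phase metric exposed by kube-state-metrics.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://prometheus.example:9090"  # hypothetical address
LOKI_URL = "http://loki.example:3100"  # hypothetical address

# PromQL: number of pods per phase in an assumed "sedna" namespace.
PODS_BY_PHASE = 'sum by (phase) (kube_pod_status_phase{namespace="sedna"})'
# LogQL: every log line containing "error" ...
ERROR_LINES = '{namespace="sedna"} |= "error"'
# ... and the number of such lines in each 5-minute window.
ERROR_COUNT = 'sum(count_over_time({namespace="sedna"} |= "error" [5m]))'


def prometheus_instant_query_url(promql: str) -> str:
    """URL for evaluating a PromQL expression at the current time."""
    return f"{PROMETHEUS_URL}/api/v1/query?" + urlencode({"query": promql})


def loki_range_query_url(logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """URL for a LogQL query over [start_ns, end_ns], in epoch nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{LOKI_URL}/loki/api/v1/query_range?" + urlencode(params)


if __name__ == "__main__":
    print(prometheus_instant_query_url(PODS_BY_PHASE))
    print(loki_range_query_url(ERROR_LINES, 1_700_000_000_000_000_000, 1_700_000_300_000_000_000))
    print(loki_range_query_url(ERROR_COUNT, 1_700_000_000_000_000_000, 1_700_000_300_000_000_000))
```

The first LogQL expression corresponds to showing all matching logs, and the second to showing the number of matches at different times.
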
## Key Deliverable

* Open-source software deployment and usage guide
* Code and configuration files
  * Code of the edge-cloud synergy AI task exporters
  * Component configuration files for Prometheus, Loki and Grafana
  * Exporter configuration files for kube-state-metrics
  * JSON configuration files for easy-to-use, good-looking Grafana panels
* End-to-end test cases

## Roadmap

* July 2022:
  * Complete the Prometheus configuration files for node monitoring and cluster resource object monitoring.
  * Finish the Loki configuration files for log management.
* August 2022:
  * Create the customized exporters that collect task metrics.
  * Design easy-to-use, good-looking panels and display the observability data on Grafana.
* September 2022:
  * Design test cases.
  * Write the documentation for deploying observability management.