Observability Management

Motivation

Currently, after creating edge-cloud synergy AI tasks with Sedna, users can check the status, parameters and metrics of those tasks only via the command line.

This proposal adds observability management that displays logs and metrics, so that users can monitor tasks in real time and easily check their status, parameters and metrics.

Goals

The metrics and status of Sedna's components, such as the GlobalManager, LocalControllers and Workers, can be monitored.
The parameters of edge-cloud synergy AI tasks, such as the inference count, can be collected and displayed by the observability management.
Logs of all pods created by Sedna can be collected, managed and displayed.
The collected observability data can be displayed in Grafana in a clear and visually appealing way.

Proposal

We propose using Prometheus to collect metrics and Loki to collect logs.
The collected observability data is then displayed with Grafana.

Design Details

Monitoring Metrics

Sedna consists of the GlobalManager, LocalControllers and Workers.
The observability management monitors these components from several aspects.

Common System Metrics

The running status of cluster resource objects such as pods, services and deployments, on both edge and cloud, can be monitored with kube-state-metrics.

Metric	Description
PodStatus	Running / Pending / Failed
ContainerStatus	Running / Waiting / Terminated
K8sJobStatus	Succeeded / Failed
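
As an illustration of how the PodStatus row could be read, kube-state-metrics exposes one `kube_pod_status_phase` series per pod and phase, with the value 1 on the phase the pod is currently in. The following is a minimal sketch, not part of the proposal, that turns a few such lines into a pod-to-phase mapping. The pod and namespace names are invented for illustration, and the regular expressions only handle the simple line shape shown here, not the full exposition format.

```python
import re

# Sample lines shaped like the kube_pod_status_phase series from
# kube-state-metrics. Pod and namespace names are made up.
SAMPLE = """\
kube_pod_status_phase{namespace="sedna",pod="gm-0",phase="Running"} 1
kube_pod_status_phase{namespace="sedna",pod="gm-0",phase="Pending"} 0
kube_pod_status_phase{namespace="sedna",pod="gm-0",phase="Failed"} 0
kube_pod_status_phase{namespace="default",pod="infer-edge-1",phase="Running"} 0
kube_pod_status_phase{namespace="default",pod="infer-edge-1",phase="Pending"} 1
kube_pod_status_phase{namespace="default",pod="infer-edge-1",phase="Failed"} 0
"""

LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def pod_phases(text):
    """Return {(namespace, pod): phase} for every series whose value is 1."""
    phases = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("name") != "kube_pod_status_phase":
            continue  # skip comments, blank lines and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if float(match.group("value")) == 1.0:
            phases[(labels["namespace"], labels["pod"])] = labels["phase"]
    return phases
```

Calling `pod_phases(SAMPLE)` yields one entry per pod, keyed by namespace and pod name.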

For Sedna specifically, the running status of custom resources such as JointInferenceService, IncrementalLearningJob and FederatedLearningJob can also be monitored.

Metric	Description
StageConditionStatus	Waiting / Ready / Starting / Running / Completed / Failed
StartTime	The start time of tasks
CompletionTime	The completion time of tasks
SampleNum	The number of samples
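
One possible way to expose the rows above is sketched below. It assumes a hypothetical, already-extracted status dictionary (the real layout of the Sedna resource status is not specified in this proposal) and hypothetical metric names prefixed with `sedna_task_`. The enumerated StageConditionStatus is encoded as one series per state, with 1 on the active state, which is a common Prometheus convention for enum-like values; timestamps are converted to Unix seconds.

```python
from datetime import datetime, timezone

# States taken from the StageConditionStatus row of the table.
STAGE_STATES = ["Waiting", "Ready", "Starting", "Running", "Completed", "Failed"]


def to_unix(timestamp):
    """Convert a UTC timestamp such as 2022-07-01T08:00:00Z to Unix seconds."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def task_metrics(kind, name, status):
    """Render exposition-format lines for one task.

    `status` is a hypothetical dict with the keys stage_condition,
    start_time, completion_time (may be None) and sample_num.
    """
    labels = f'kind="{kind}",name="{name}"'
    lines = []
    for state in STAGE_STATES:
        value = 1 if status["stage_condition"] == state else 0
        lines.append(f'sedna_task_stage_condition{{{labels},state="{state}"}} {value}')
    lines.append(
        f'sedna_task_start_time_seconds{{{labels}}} {to_unix(status["start_time"])}'
    )
    if status.get("completion_time"):  # absent while the task is still running
        lines.append(
            f'sedna_task_completion_time_seconds{{{labels}}} '
            f'{to_unix(status["completion_time"])}'
        )
    lines.append(f'sedna_task_sample_num{{{labels}}} {status["sample_num"]}')
    return lines
```

With this encoding, a Grafana panel can select the active state by filtering for the series whose value is 1.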

Algorithm Metrics (Designed Individually For Each Task)

For edge-cloud synergy AI tasks, we create custom exporters that collect the metrics needed for each type of task.

Joint Inference

Metric	Description
EdgeInferenceCount	The number of inferences performed at the edge
CloudInferenceCount	The number of inferences performed in the cloud
HardSampleNum	The number of hard samples
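
A custom exporter for these three counters could look roughly like the sketch below. It uses only the Python standard library and renders the Prometheus text exposition format by hand; a real exporter would more likely use a Prometheus client library. The metric names, the port and the assumption that every hard sample is forwarded to the cloud for a second inference are illustrative choices, not part of the proposal.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class JointInferenceMetrics:
    """Thread-safe counters for a joint inference task (hypothetical names)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {"edge_inference": 0, "cloud_inference": 0, "hard_sample": 0}

    def record(self, hard_sample=False):
        """Record one edge inference; assume hard samples also go to the cloud."""
        with self._lock:
            self._counts["edge_inference"] += 1
            if hard_sample:
                self._counts["hard_sample"] += 1
                self._counts["cloud_inference"] += 1

    def render(self):
        """Render the counters in the Prometheus text exposition format."""
        with self._lock:
            snapshot = dict(self._counts)
        lines = []
        for key, value in snapshot.items():
            name = f"sedna_joint_inference_{key}_total"
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"


METRICS = JointInferenceMetrics()


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

The inference worker would call `METRICS.record(...)` after each request, and `serve()` would run in a background thread so that Prometheus can scrape `/metrics`.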

Incremental Learning

Metric	Description
TrainSampleNum	The number of samples used in the training stage
EvalSampleNum	The number of samples used in the eval stage
CurrentStatus	The status of the current stage, such as running or waiting (training and eval stages only)
IterationNo	The current iteration No. of incremental learning
IterationInferenceCount	The number of inferences in the current iteration
IterationHardSampleNum	The number of hard samples in the current iteration
EvalNewModelUrl	The URL of the new model at the eval stage
AccuracyMetricForNewModel	mAP / Precision / Recall / F1-score for the new model
AccuracyMetricForOldModel	mAP / Precision / Recall / F1-score for the old model
TrainOutputModelUrl	The output URL of the model after training
DeployModelUrl	The URL of the model to deploy
DeployStatus	Waiting / OK / Not OK
TrainRemainingTime	The remaining time at train stage
TrainLoss	Loss at train stage
EvalRemainingTime	The remaining time at eval stage
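
Two of the rows above are derived values rather than raw counts. The sketch below shows one plausible way to compute them: precision, recall and F1-score from true-positive, false-positive and false-negative counts, and a remaining-time estimate that assumes the remaining steps take as long on average as the completed ones. The function names and the linear extrapolation are assumptions made for illustration; the proposal does not specify how these values are obtained.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1); a ratio with a zero denominator is 0.0."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    # F1 = 2PR / (P + R), written in terms of counts to avoid rounding noise.
    denom = 2 * true_pos + false_pos + false_neg
    f1 = 2 * true_pos / denom if denom else 0.0
    return precision, recall, f1


def estimate_remaining_seconds(elapsed_seconds, steps_done, steps_total):
    """Linear estimate of the remaining time; None until one step has finished."""
    if steps_done <= 0:
        return None
    remaining_steps = max(steps_total - steps_done, 0)
    return elapsed_seconds * remaining_steps / steps_done
```

For example, a stage that has completed 30 of 100 steps in 120 seconds would be estimated to need 280 more seconds.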

Incremental Learning Task Basic Metrics

Metric	Description
TaskStage	Training Stage / Eval Stage / Deploying Stage

Federated Learning

Metric	Description
TrainNodes	The nodes participating in training
TrainStatus	Current training status
NodeSampleNum	The number of samples at each node
IterationNo	The current iteration No. of federated learning
AggregationNo	The current aggregation No. of federated learning
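
Because NodeSampleNum is reported per node, it maps naturally onto one labeled series per node. The sketch below renders the federated learning metrics this way, under assumed metric names and an assumed `job_name` label (chosen to avoid the `job` label that Prometheus attaches to scraped targets). Label values are escaped as the text exposition format requires.

```python
def escape_label(value):
    """Escape backslash, double quote and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_federated_metrics(job, iteration_no, aggregation_no, node_samples):
    """Render gauges for one federated learning job.

    `node_samples` maps a node name to its sample count; all metric and
    label names here are hypothetical.
    """
    job_label = f'job_name="{escape_label(job)}"'
    lines = [
        "# TYPE sedna_federated_iteration_no gauge",
        f"sedna_federated_iteration_no{{{job_label}}} {iteration_no}",
        "# TYPE sedna_federated_aggregation_no gauge",
        f"sedna_federated_aggregation_no{{{job_label}}} {aggregation_no}",
        "# TYPE sedna_federated_train_nodes gauge",
        f"sedna_federated_train_nodes{{{job_label}}} {len(node_samples)}",
        "# TYPE sedna_federated_node_sample_num gauge",
    ]
    for node, count in sorted(node_samples.items()):
        node_label = f'node="{escape_label(node)}"'
        lines.append(
            f"sedna_federated_node_sample_num{{{job_label},{node_label}}} {count}"
        )
    return "\n".join(lines) + "\n"
```

Keeping the node name in a label rather than in the metric name lets a single Grafana query sum or compare sample counts across nodes.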

Collecting Logs

We plan to use Loki to collect logs from all pods created by Sedna, such as the LocalControllers, the GlobalManager and the workers spawned by tasks.
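
In a typical deployment a log agent ships pod logs to Loki, so no custom code is needed. To make the data model concrete, the sketch below only builds (and does not send) a request body in the shape accepted by Loki's push endpoint `/loki/api/v1/push`: a list of streams, each identified by a label set and carrying `[timestamp in nanoseconds as a string, log line]` pairs. The label names and values are assumptions for illustration.

```python
import json


def build_loki_push_payload(labels, entries):
    """Build a JSON body for Loki's push API.

    labels:  dict of stream labels, e.g. namespace and pod (assumed names).
    entries: iterable of (unix_nanoseconds: int, line: str) tuples.
    """
    # Loki expects entries within a stream ordered by timestamp,
    # and each timestamp encoded as a string of nanoseconds.
    values = [[str(ts_ns), line] for ts_ns, line in sorted(entries)]
    return json.dumps({"streams": [{"stream": dict(labels), "values": values}]})
```

All lines that share the same label set belong to one stream, which is why labels such as the pod name are the natural way to separate logs from different Sedna components.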

Display

Grafana can present custom queries over the data collected by Prometheus and Loki, written in PromQL and LogQL respectively.
Metrics can be displayed as line graphs, pie charts, histograms and so on.
For logs, queries can show all matching log lines as well as the number of matches over time.
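
The kinds of queries a panel might use can be sketched as follows. The helpers below only assemble query strings: a PromQL expression for the per-second rate of a counter, a LogQL filter for lines containing a substring, and a LogQL expression for the number of matching lines per time window. The metric name and the `namespace` label in the usage note are assumptions, since they depend on how the exporters and the log agent are configured.

```python
def selector(labels):
    """Build a label selector such as {namespace="sedna"}.

    Values are inserted as-is; this sketch does not escape quotes.
    """
    inner = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return "{" + inner + "}"


def promql_rate(metric, labels, window="5m"):
    """PromQL: summed per-second rate of a counter over a time window."""
    return f"sum(rate({metric}{selector(labels)}[{window}]))"


def logql_lines(labels, needle):
    """LogQL: all log lines in the selected streams that contain `needle`."""
    return f'{selector(labels)} |= "{needle}"'


def logql_match_count(labels, needle, window="5m"):
    """LogQL: number of matching log lines in each time window."""
    return f"count_over_time({logql_lines(labels, needle)} [{window}])"
```

For instance, `logql_match_count({"namespace": "sedna"}, "error")` produces a query suitable for a time-series panel, while `logql_lines` with the same arguments suits a logs panel.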

Key Deliverables

Open-Source software deployment and usage guide
Code and configuration files
- Code of edge-cloud synergy AI task exporters
- Component configuration files for Prometheus, Loki and Grafana
- Exporter configuration files for kube-state-metrics
- JSON configuration files for easy-to-use and good-looking panels on Grafana
End-to-end test cases

Roadmap

July 2022:
- Write the Prometheus configuration files for monitoring nodes and cluster resource objects.
- Write the Loki configuration files for log management.
August 2022:
- Create the custom exporters that collect task metrics.
- Design easy-to-use and good-looking panels and display the observability data on Grafana.
September 2022:
- Design test cases.
- Write the deployment documentation for observability management.
