Currently, after creating edge-cloud synergy AI tasks with Sedna, users can check the status, parameters and metrics of those tasks only via the command line.
This proposal provides observability management for displaying logs and metrics to monitor tasks in real time, so that users can easily check the status, parameters and metrics of tasks.
We propose using Prometheus to collect metrics and Loki to collect logs.
The collected observability data can then be displayed in Grafana.
Sedna consists of the GlobalManager, LocalControllers and Workers, which together keep Sedna running.
Observability management monitors these components from several aspects.
The running status of cluster resource objects such as pods, services and deployments, on both edge and cloud, can be monitored with kube-state-metrics.
| Metric | Description |
|---|---|
| PodStatus | Running / Pending / Failed |
| ContainerStatus | Running / Waiting / Terminated |
| K8sJobStatus | Succeeded / Failed |
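
As a rough illustration, the statuses in the table above could be read from series that kube-state-metrics exposes. The sketch below only builds PromQL strings. The `sedna` namespace and the mapping from table rows to series are assumptions made for this example, not part of the proposal.

```python
# Sketch: PromQL expressions for the statuses listed in the table above.
# Assumption: Sedna components and workers run in a namespace named "sedna".


def selector(**labels: str) -> str:
    """Render a PromQL label selector such as {namespace="sedna"}."""
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return "{" + pairs + "}"


def status_queries(namespace: str = "sedna") -> dict[str, str]:
    """Map each status in the table to a PromQL query over kube-state-metrics."""
    ns = selector(namespace=namespace)
    return {
        # kube_pod_status_phase is 1 for the phase a pod is currently in,
        # so summing by phase counts pods per phase.
        "PodStatus": f"sum by (phase) (kube_pod_status_phase{ns})",
        "ContainerStatus/Running": f"sum(kube_pod_container_status_running{ns})",
        "ContainerStatus/Waiting": f"sum(kube_pod_container_status_waiting{ns})",
        "ContainerStatus/Terminated": f"sum(kube_pod_container_status_terminated{ns})",
        "K8sJobStatus/Succeeded": f"sum(kube_job_status_succeeded{ns})",
        "K8sJobStatus/Failed": f"sum(kube_job_status_failed{ns})",
    }


if __name__ == "__main__":
    for name, query in status_queries().items():
        print(f"{name}: {query}")
```

Each string can be pasted into a Grafana panel that uses a Prometheus data source.
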
For Sedna specifically, the running status of custom resources (CRDs) such as JointInferenceService, IncrementalLearningJob and FederatedLearningJob can also be monitored.
| Metric | Description |
|---|---|
| StageConditionStatus | Waiting / Ready / Starting / Running / Completed / Failed |
| StartTime | The start time of tasks |
| CompletionTime | The completion time of tasks |
| SampleNum | The number of samples |
For edge-cloud synergy AI tasks, we create custom exporters to collect the metrics needed for each type of task.
| Metric | Description |
|---|---|
| EdgeInferenceCount | The number of inferences performed at the edge |
| CloudInferenceCount | The number of inferences performed in the cloud |
| HardSampleNum | The number of hard samples |
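
To make the idea of a custom exporter concrete, here is a minimal sketch that keeps the counters from the table above in memory and renders them in the Prometheus text exposition format. It uses only the standard library. The metric names, the `service` label, and the port are hypothetical, and a real exporter would more likely build on an existing Prometheus client library.

```python
# Minimal sketch of a custom exporter for the counters in the table above.
# Metric names, label names and the port are hypothetical. Not thread-safe.
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer


class CounterRegistry:
    """Holds labelled counters and renders them as Prometheus text."""

    def __init__(self) -> None:
        self._help: dict[str, str] = {}
        self._values: dict[str, dict[tuple, float]] = defaultdict(dict)

    def register(self, name: str, help_text: str) -> None:
        self._help[name] = help_text

    def inc(self, name: str, amount: float = 1.0, **labels: str) -> None:
        if name not in self._help:
            raise KeyError(f"counter {name!r} is not registered")
        key = tuple(sorted(labels.items()))
        self._values[name][key] = self._values[name].get(key, 0.0) + amount

    def render(self) -> str:
        lines = []
        for name, help_text in self._help.items():
            lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} counter")
            for key, value in sorted(self._values[name].items()):
                label_str = ",".join(f'{k}="{v}"' for k, v in key)
                suffix = "{" + label_str + "}" if label_str else ""
                lines.append(f"{name}{suffix} {value:g}")
        return "\n".join(lines) + "\n"


def make_handler(registry: CounterRegistry):
    """Build a request handler that serves the registry at /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = registry.render().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler


def serve(registry: CounterRegistry, port: int = 8000) -> None:
    """Block forever, serving metrics for Prometheus to scrape."""
    HTTPServer(("", port), make_handler(registry)).serve_forever()
```

A worker would call `registry.inc("sedna_edge_inference_total", service="my-service")` after each edge inference and run `serve(registry)` in a background thread, so that Prometheus can scrape `/metrics`.
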
| Metric | Description |
|---|---|
| TrainSampleNum | The number of samples used at training stage |
| EvalSampleNum | The number of samples used at eval stage |
| CurrentStatus | The status of current stage like running and waiting (only for training stage and eval stage) |
| IterationNo | The current iteration number of incremental learning |
| IterationInferenceCount | The number of inferences in the current iteration |
| IterationHardSampleNum | The number of hard samples in the current iteration |
| EvalNewModelUrl | The URL of the new model at eval stage |
| AccuracyMetricForNewModel | mAP / Precision / Recall / F1-score for new model |
| AccuracyMetricForOldModel | mAP / Precision / Recall / F1-score for old model |
| TrainOutputModelUrl | The output URL of the model after training |
| DeployModelUrl | The URL of the model to deploy |
| DeployStatus | Waiting / OK / Not OK |
| TrainRemainingTime | The remaining time at train stage |
| TrainLoss | Loss at train stage |
| EvalRemainingTime | The remaining time at eval stage |
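
Prometheus samples are numeric, so enumerated values such as CurrentStatus or DeployStatus need an encoding. One common option, sketched below, is a one-hot set of gauge samples with a `state` label, the same shape kube-state-metrics uses for pod phases. The metric name and the `job` label are hypothetical, and the proposal does not prescribe this encoding.

```python
# Sketch: expose an enumerated status as one-hot gauge samples.
# The metric name and the "job" label are hypothetical.

DEPLOY_STATES = ("Waiting", "OK", "Not OK")


def state_samples(metric: str, states: tuple[str, ...], current: str,
                  **labels: str) -> list[str]:
    """Return one text-format sample per state; only the current state is 1."""
    if current not in states:
        raise ValueError(f"unknown state {current!r}; expected one of {states}")
    lines = []
    for state in states:
        all_labels = {**labels, "state": state}
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(all_labels.items()))
        value = 1 if state == current else 0
        lines.append(f"{metric}{{{label_str}}} {value}")
    return lines


if __name__ == "__main__":
    for line in state_samples("sedna_incremental_deploy_status",
                              DEPLOY_STATES, "OK", job="demo"):
        print(line)
```

With this shape, a PromQL filter on `state` and a comparison with `== 1` selects the jobs currently in a given state.
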
| Metric | Description |
|---|---|
| TaskStage | Training Stage / Eval Stage / Deploying Stage |
| Metric | Description |
|---|---|
| TrainNodes | The nodes participating in training |
| TrainStatus | Current training status |
| NodeSampleNum | The number of samples at each node |
| IterationNo | The current iteration number of federated learning |
| AggregationNo | The current aggregation number of federated learning |
We plan to use Loki to collect logs from all pods created by Sedna, such as the LocalControllers, the GlobalManager and the workers created for tasks.
Grafana can present custom queries over the data collected by Prometheus and Loki, written in PromQL and LogQL respectively.
Metrics can be displayed as line graphs, pie charts, histograms and so on.
For logs, queries can show all matching log lines and the number of matches over time.
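
The two kinds of log query described above can be sketched in LogQL as follows: a line filter returns the matching log lines, and `count_over_time` returns how many lines matched in each time window. The label names (`namespace`, `app`) depend on how the log agent labels streams, and the Loki address is a placeholder. Both are assumptions made for this example.

```python
# Sketch: build LogQL queries and a Loki query_range URL.
# Label names and the base URL are assumptions; nothing is sent over the network.
from urllib.parse import urlencode


def stream_selector(**labels: str) -> str:
    """Render a LogQL stream selector such as {app="gm",namespace="sedna"}."""
    return "{" + ",".join(f'{k}="{v}"' for k, v in sorted(labels.items())) + "}"


def matching_lines(selector: str, needle: str) -> str:
    """Log query: every line in the selected streams that contains needle.

    needle must not contain double quotes in this simple sketch.
    """
    return f'{selector} |= "{needle}"'


def match_count(selector: str, needle: str, window: str = "5m") -> str:
    """Metric query: number of matching lines per window, summed over streams."""
    return f'sum(count_over_time({selector} |= "{needle}" [{window}]))'


def query_range_url(base_url: str, query: str, start_ns: int, end_ns: int,
                    step: str = "60s") -> str:
    """URL for Loki's range query endpoint; start and end are Unix nanoseconds."""
    params = urlencode({"query": query, "start": start_ns, "end": end_ns, "step": step})
    return f"{base_url}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    sel = stream_selector(namespace="sedna", app="gm")
    print(matching_lines(sel, "error"))
    print(match_count(sel, "error"))
    print(query_range_url("http://loki:3100", match_count(sel, "error"),
                          1_700_000_000_000_000_000, 1_700_003_600_000_000_000))
```

In Grafana the first query suits a logs panel and the second a time series panel.
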