At present, users can only access information about lifelong learning jobs, including logs, status, and metrics, through the command line, which is inconvenient and unfriendly. This proposal provides a Grafana-based UI for metrics, log collection, status monitoring, and management, so that users can get this information through a graphical interface.
We propose using Grafana, Loki, and Prometheus to display the metrics, logs, and status of lifelong learning jobs. Prometheus collects the metrics and status, and Loki collects the logs.
At present, the log files are under the directory /var/log/containers.
There are two types of Promtail collection modes. The following table shows the feature comparison:
| | DaemonSet | Sidecar |
|---|---|---|
| Source | Stdout + some files | File |
| Log classification storage | Mapped by container/path | Can be separated per Pod |
| Multi-tenant isolation | Isolated by configuration | Isolated by Pod |
| Resource occupancy | Low | High |
| Customizability | Low | High, each Pod is configured individually |
| Applicable scene | Applicable scene | Large, mixed cluster |

| Metric | Description |
|---|---|
| JobStatus | Status of each job, Enum(True, False, Unknown) |
| StageConditionStatus | Status of each stage, Enum(Waiting, Ready, Starting, Running, Completed, Failed) |
| LearningStage | Stage of lifelong learning, Enum(Train, Eval, Deploy) |
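
Prometheus samples are numeric, so an enumerated status such as StageConditionStatus is commonly exported as one series per possible state, with value 1 for the current state and 0 for the others. The following is a minimal sketch of that encoding in the Prometheus text exposition format. The metric name `sedna_stage_condition_status` and the labels `job` and `status` are hypothetical and are not defined by this proposal.

```python
# Possible values of StageConditionStatus, taken from the table above.
STAGE_CONDITION_STATES = ["Waiting", "Ready", "Starting", "Running", "Completed", "Failed"]


def state_set_samples(metric, job, current, states):
    """Render one exposition-format line per state: 1 for the current state, else 0."""
    if current not in states:
        raise ValueError(f"unknown state: {current!r}")
    lines = []
    for state in states:
        value = 1 if state == current else 0
        lines.append(f'{metric}{{job="{job}",status="{state}"}} {value}')
    return lines


if __name__ == "__main__":
    # Hypothetical metric and job names, for illustration only.
    for line in state_set_samples(
        "sedna_stage_condition_status", "demo-job", "Running", STAGE_CONDITION_STATES
    ):
        print(line)
```

With this layout, a Grafana panel can select the current state by filtering for series whose value is 1.
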

| Metric | Description |
|---|---|
| NumberOfSamples | The number of samples |
| StageTrainNumber | The number of samples used for training |
| StageEvalNumber | The number of samples used for evaluation |

| Metric | Description |
|---|---|
| Key | The value corresponding to the custom Key |

| Metric | Description |
|---|---|
| ActiveTasksNumber | The number of running tasks |
| TotalTasksNumber | The total number of tasks |
| TasksInfo | Tasks info { location: "edge \| cloud", work: "training \| inferring", startTime: "xxxx-xx-xx xx:xx:xx", duration: "xxxx s/min/h", status: "active \| dead" } |

| Metric | Description |
|---|---|
| ActiveWorkersNumber | The number of running workers |
| TotalWorkerNumber | The total number of workers |
| WorkersInfo | Workers info { location: "edge \| cloud", status: "waiting \| active \| dead" } |

| Metric | Description |
|---|---|
| InferenceDoneNumber | The number of samples for which inference is completed |
| InferenceReadyNumber | The number of samples ready for inference |
| InferenceErrorNumber | The number of samples with inference errors |
| InferenceRate | Inference completion percentage, 0-100 (%) |
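
The proposal does not spell out how InferenceRate is derived. One plausible reading, shown here purely as an assumption, is the share of completed samples among all samples counted by the three inference counters. The function name `inference_rate` is hypothetical.

```python
def inference_rate(done, ready, error):
    """Return the completion percentage in the range 0-100.

    Assumption (not stated in the proposal): the total is the sum of
    InferenceDoneNumber, InferenceReadyNumber and InferenceErrorNumber.
    """
    total = done + ready + error
    if total == 0:
        # No samples seen yet; report 0 instead of dividing by zero.
        return 0.0
    return 100.0 * done / total


if __name__ == "__main__":
    print(inference_rate(done=50, ready=30, error=20))
```
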

| Metric | Description |
|---|---|
| KBServer | Address of the knowledge base |

| Metric | Description |
|---|---|
| LearningRate | The learning rate in the training stage |
| EpochNum | The number of epochs in the training stage |
| BatchSize | The batch size in the training stage |
| TrainSampleNum | The number of samples used in the training stage |
| TrainOutputModelUrl | The URL of the model output after training |

| Metric | Description |
|---|---|
| EvalSampleNum | The number of samples used in the eval stage |
| Score | Score of the task being evaluated |
| ModelFilterOperator | Type of operator for threshold judgment, Enum(>, <, =, >=, <=) |
| Threshold | Threshold for judging whether to deploy the model |
| EvalModelUrl | The URL of the model in the eval stage |
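
Score, ModelFilterOperator, and Threshold together describe a deploy decision: the score is compared with the threshold using the configured operator. A minimal sketch of that comparison follows. The function name `should_deploy` is hypothetical, and the actual controller logic may differ.

```python
import operator

# Map each ModelFilterOperator value from the table to a comparison function.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


def should_deploy(score, model_filter_operator, threshold):
    """Return True when `score <operator> threshold` holds."""
    try:
        compare = OPERATORS[model_filter_operator]
    except KeyError:
        raise ValueError(f"unsupported operator: {model_filter_operator!r}") from None
    return compare(score, threshold)


if __name__ == "__main__":
    print(should_deploy(0.9, ">", 0.8))
    print(should_deploy(0.8, ">", 0.8))
```
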

10 Deploy stage

| Metric | Description |
|---|---|
| DeployStatus | Enum(Waiting, Ok, NotOk) |
| DeployModelUrl | The URL of the model to deploy |
kind: CustomResourceStateMetrics
spec:
  resources:
    -
      groupVersionKind:
        group: myteam.io
        kind: "Foo"
        version: "v1"
      labelsFromPath:
        name: [ metadata, name ]
      metrics:
        - name: "active_count"
          help: "Number Foo Bars active"
          each:
            path: [status, active]
            labelFromKey: type
            labelsFromPath:
              bar: [bar]
            value: [count]
          commonLabels:
            custom_metric: "yes"
          labelsFromPath:
            "*": [metadata, labels] # copy all labels from CR labels
            foo: [metadata, labels, foo] # copy a single label (overrides *)
        - name: "other_count"
          each:
            path: [status, other]
          errorLogV: 5
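
To make the path-based configuration above easier to follow, here is a simplified sketch of how the `active_count` entry could be evaluated against a custom resource. It is an illustrative reading of the configuration, not the kube-state-metrics implementation. It assumes `status.active` is a map whose keys become the `type` label, and it omits the `"*"` label copy. The sample resource is hypothetical.

```python
def resolve(obj, path):
    """Walk a list of keys into nested dicts; return None if any key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def active_count_samples(resource):
    """Produce (labels, value) pairs for the "active_count" metric."""
    base_labels = {
        "name": resolve(resource, ["metadata", "name"]),  # resource-level labelsFromPath
        "custom_metric": "yes",                           # commonLabels
    }
    entries = resolve(resource, ["status", "active"])     # each.path
    if not isinstance(entries, dict):
        return []
    samples = []
    for key, entry in entries.items():
        value = resolve(entry, ["count"])                 # each.value
        if value is None:
            continue  # skip entries without a value
        labels = dict(base_labels, type=key, bar=resolve(entry, ["bar"]))
        samples.append((labels, float(value)))
    return samples


if __name__ == "__main__":
    # Hypothetical Foo resource, for illustration only.
    foo = {
        "metadata": {"name": "foo-1"},
        "status": {"active": {"type-a": {"bar": "x", "count": 2}}},
    }
    for labels, value in active_count_samples(foo):
        print(labels, value)
```
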
Grafana can be used to visualize the monitoring data of Sedna. The visualization in Grafana contains 2 parts: metrics and logs.
Installation uses Helm, with a values file modified from the Loki-Stack chart.
loki:
  enabled: true
  persistence:
    enabled: true
    accessModes:
    - ReadWriteOnce
    size: 10Gi
promtail:
  enabled: true
  config:
    lokiAddress: http://{{ .Release.Name }}:3100/loki/api/v1/push
grafana:
  enabled: true
prometheus:
  enabled: true
  isDefault: false
  nodeExporter:
    enabled: true
  kubeStateMetrics:
    enabled: true
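
The `lokiAddress` above points Promtail at Loki's push endpoint, `/loki/api/v1/push`. For reference, the sketch below builds the JSON body that this endpoint accepts: a list of streams, each with a label set and `[timestamp_ns, line]` pairs. The function name and the `job` label are hypothetical. In this setup Promtail does the pushing, so the code is only meant to show the payload shape.

```python
import json
import time


def build_loki_push_payload(labels, lines, timestamp_ns=None):
    """Build the JSON-serializable body for Loki's /loki/api/v1/push endpoint."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    # Loki expects timestamps as strings of Unix epoch nanoseconds.
    # Each line is offset by 1 ns so the entries keep their order.
    values = [[str(timestamp_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


if __name__ == "__main__":
    payload = build_loki_push_payload({"job": "sedna-demo"}, ["train started", "train finished"])
    print(json.dumps(payload, indent=2))
```
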
Related JSON files for the Grafana dashboards will also be provided.
Users can import the dashboards they need.