Operation and maintenance UI development
- Motivation
  - Goals
- Proposal
  - Use cases
- Design details
  - Architecture
  - Log
  - Metric
    - CRD metric
    - Example
  - Visualization
- Install

Operation and maintenance UI development

Motivation

At present, users can only access information about lifelong learning, including logs, status, and metrics, through the command line, which is inconvenient and unfriendly. This proposal provides a Grafana-based UI for metrics, log collection, status monitoring, and management, so that users can get this information through a graphical interface.

Goals

Support metrics collection and visualization in lifelong learning
Support unified management and search of application logs
Support management and status monitoring of lifelong learning in a dashboard

Proposal

We propose using Grafana, Loki, and Prometheus to display the metrics, logs, and status of lifelong learning jobs. Prometheus collects the metrics and status, and Loki collects the logs.

Use cases

Users can search logs by specifying a pod and keywords in the log control panel
Users can view the historical state or metrics of any component in a time-series chart
Users can view the status of any CRD (Model, Dataset, LifelongLearningJob) by specifying its name

Design details

Architecture

Log
In DaemonSet mode, Promtail is deployed on each node (Cloud, Edge) and monitors the log storage directory on that node (e.g. /var/log/containers).
When a log file is updated, Promtail picks up the new entries and pushes them to Loki.
Logs are obtained in the same way as with "kubectl logs".
Loki is deployed on Cloud nodes and persistently stores the logs it receives.
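
As a rough illustration of the DaemonSet-mode log flow described above, a Promtail configuration could look like the following minimal sketch. The Loki address, the job name, the label, and the positions file path are assumptions made for illustration, not values taken from Sedna. The log glob assumes that the node's container logs live under /var/log/containers.

```yaml
# Minimal Promtail sketch; every concrete value here is an assumption.
server:
  http_listen_port: 9080            # Promtail's own HTTP port
positions:
  filename: /tmp/positions.yaml     # assumed path; records how far each file has been read
clients:
  - url: http://loki:3100/loki/api/v1/push   # assumed address of the Loki service on the Cloud node
scrape_configs:
  - job_name: node-container-logs   # hypothetical job name
    static_configs:
      - targets: [localhost]
        labels:
          job: containerlogs        # hypothetical label, usable later as a search filter
          __path__: /var/log/containers/*.log   # files Promtail tails on each node
```

Because the same configuration is mounted into every Promtail pod of the DaemonSet, Cloud and Edge nodes are handled identically.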
Metric
kube-state-metrics generates metrics from Kubernetes API objects without modification. This ensures that the features provided by kube-state-metrics have the same grade of stability as the Kubernetes API objects themselves.
In KubeEdge, component information is obtained in the same way as with "kubectl get": wherever kubectl can obtain information from edge nodes, kube-state-metrics can as well.
Both Prometheus and kube-state-metrics are deployed on Cloud nodes. Prometheus pulls the metrics from kube-state-metrics and stores them persistently.
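
The pull relationship between the two components can be sketched as a Prometheus scrape job. The Service DNS name and the scrape interval below are assumptions; only the port reflects the kube-state-metrics default.

```yaml
# Prometheus scrape sketch; the target name and interval are assumptions.
scrape_configs:
  - job_name: kube-state-metrics
    scrape_interval: 30s              # assumed interval
    static_configs:
      - targets:
          - kube-state-metrics.kube-system.svc:8080   # assumed Service name; 8080 is the default metrics port
```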
Visualization
Grafana is deployed on the Cloud node and visualizes the data persisted in Prometheus and Loki, which are also located on the Cloud node.
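
One possible way to wire Grafana to both backends is a data source provisioning file, sketched below. The two URLs are assumed in-cluster Service addresses and would need to match the actual deployment.

```yaml
# Grafana data source provisioning sketch; both URLs are assumptions.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server:80   # assumed Prometheus Service address
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100              # assumed Loki Service address
```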

Log

At present, the log files are under the directory /var/log/containers.
Promtail has two collection modes. The following table compares their features:

| | DaemonSet | Sidecar |
| --- | --- | --- |
| Source | Stdout + some files | Files |
| Log classification storage | Mapped by container/path | Can be separated per Pod |
| Multi-tenant isolation | Isolated by configuration | Isolated by Pod |
| Resource occupancy | Low | High |
| Customizability | Low | High, each Pod is configured individually |
| Applicable scenario | Applicable scene | Large, mixed clusters |

Metric

CRD metric

LLJob Status

| Metric | Description |
| --- | --- |
| JobStatus | Status of each job, Enum(True, False, Unknown) |
| StageConditionStatus | Status of each stage, Enum(Waiting, Ready, Starting, Running, Completed, Failed) |
| LearningStage | Stages of lifelong learning, Enum(Train, Eval, Deploy) |

Dataset status

| Metric | Description |
| --- | --- |
| NumberOfSamples | The number of samples |
| StageTrainNumber | The number of samples used by train |
| StageEvalNumber | The number of samples used by eval |

Model

| Metric | Description |
| --- | --- |
| Key | The value corresponding to the custom Key |

Task

| Metric | Description |
| --- | --- |
| ActiveTasksNumber | The number of running tasks |
| TotalTasksNumber | The number of total tasks |
| TasksInfo | Tasks Info { location: "edge \| cloud", work: "training \| infering", startTime: "xxxx-xx-xx xx:xx:xx", duration: "xxxx s/min/h", status: "active \| dead" } |

Worker

| Metric | Description |
| --- | --- |
| ActiveWorkersNumber | The number of running workers |
| TotalWorkerNumber | The total number of workers |
| WorkersInfo | Workers Info { location: "edge \| cloud", status: "waiting \| active \| dead" } |

Number of samples for inference (known tasks, unknown tasks)

| Metric | Description |
| --- | --- |
| InferenceDoneNumber | The number of samples for which inference is completed |
| InferenceReadyNumber | The number of samples ready for inference |
| InferenceErrorNumber | The number of samples with inference errors |
| InferenceRate | Inference completion percentage: 0-100 (%) |

Knowledge Base server

| Metric | Description |
| --- | --- |
| KBServer | Address of the knowledge base |

Training stage

| Metric | Description |
| --- | --- |
| LearningRate | The learning rate in the training stage |
| EpochNum | The number of epochs in the training stage |
| BatchSize | The batch size in the training stage |
| TrainSampleNum | The number of samples used in the training stage |
| TrainOutputModelUrl | The output URL of the model after training |

Evaluation stage

| Metric | Description |
| --- | --- |
| EvalSampleNum | The number of samples used in the eval stage |
| Score | Score of the task being evaluated |
| ModelFilterOperator | Type of operator for the threshold judgment, Enum(>, <, =, >=, <=) |
| Threshold | Threshold for judging whether to deploy the model |
| EvalModelUrl | The URL of the model in the eval stage |

Deploy stage

| Metric | Description |
| --- | --- |
| DeployStatus | Enum(Waiting, Ok, NotOk) |
| DeployModelUrl | The URL for deploying the model |

Example

kind: CustomResourceStateMetrics
spec:
  resources:
    -
      groupVersionKind:
        group: myteam.io
        kind: "Foo"
        version: "v1"
      labelsFromPath:
        name: [ metadata, name ]
      metrics:
        - name: "active_count"
          help: "Number Foo Bars active"
          each:
            path: [status, active]
            labelFromKey: type
            labelsFromPath:
              bar: [bar]
            value: [count]
          commonLabels:
            custom_metric: "yes"

          labelsFromPath:
            "*": [metadata, labels]  # copy all labels from CR labels
            foo: [metadata, labels, foo]  # copy a single label (overrides *)

        - name: "other_count"
          each:
            path: [status, other]
          errorLogV: 5
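
The generic example above could be adapted to a Sedna resource along the following lines. This is only a sketch: the API group and version, the metric name, and in particular the status field stageSamples (a map keyed by stage name, each entry holding a count) are hypothetical and do not describe the actual Sedna CRD schema.

```yaml
# Hypothetical adaptation of the generic example to a Sedna Dataset.
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: sedna.io              # assumed API group
        kind: "Dataset"
        version: "v1alpha1"          # assumed version
      labelsFromPath:
        name: [metadata, name]       # expose the Dataset name as a label
      metrics:
        - name: "stage_sample_number"          # hypothetical metric name
          help: "Number of samples used by each stage"
          each:
            path: [status, stageSamples]       # hypothetical map, e.g. keys "train" and "eval"
            labelFromKey: stage                # the map key becomes the "stage" label
            value: [count]                     # hypothetical field holding the sample count
```

With such a layout, StageTrainNumber and StageEvalNumber would be exposed as a single metric distinguished by the stage label rather than as two separate metrics.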

Visualization

Grafana can be used to visualize the monitoring data of Sedna. The visualization in Grafana contains 2 parts: metrics and logs.

Visualization of metrics:

Visualization of logs:

Install

Installation uses Helm, with values modified from the Loki-Stack chart.

loki:
  enabled: true
  persistence:
    enabled: true
    accessModes:
      - ReadWriteOnce
    size: 10Gi

promtail:
  enabled: true
  config:
    lokiAddress: http://{{ .Release.Name }}:3100/loki/api/v1/push

grafana:
  enabled: true

prometheus:
  enabled: true
  isDefault: false
  nodeExporter:
    enabled: true
  kubeStateMetrics:
    enabled: true

Related JSON files for the Grafana dashboards will also be provided.
Users can import the corresponding dashboards as needed.
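
Besides importing the JSON files by hand, the dashboards could also be loaded automatically through a Grafana dashboard provider. The sketch below assumes a provider name, a folder name, and a directory path that are purely illustrative.

```yaml
# Grafana dashboard provider sketch; the names and the path are assumptions.
apiVersion: 1
providers:
  - name: sedna-dashboards          # hypothetical provider name
    folder: Sedna                   # hypothetical Grafana folder
    type: file
    options:
      path: /var/lib/grafana/dashboards/sedna   # assumed directory holding the dashboard JSON files
```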
