diff --git a/sig-ai/KubeEdge AI.md b/sig-ai/KubeEdge AI.md new file mode 100644 index 0000000000000000000000000000000000000000..02a09a54df8b112650b81aca40a04b0881889ded --- /dev/null +++ b/sig-ai/KubeEdge AI.md @@ -0,0 +1,64 @@ +# Edge Cloud Collaborative AI Framework + +## Motivation + +Currently, "Edge AI" in the industry is at an early stage of training on the cloud and inference on the edge. However, the future trend has emerged, and related research and practice are booming, bringing new value growth points for edge computing and AI. Also, edge AI applications have much room for optimization in terms of cost, model effect, and privacy protection. For example: + + +This proposal provides a basic framework for edge-cloud collaborative training and inference, so that AI applications running at the edge can benefit from cost reduction, model performance improvement, and data privacy protection. + + +### Goals + +For AI applications running at the edge, the goals of edge cloud collaborative framework are: +* reducing resource cost on the edge +* improving model performance +* protecting data privacy + + +## Proposal +* What we propose: + * an edge-cloud collaborative AI framework based on KubeEdge + * with embed collaborative training and joint inferencing algorithm + * working with existing AI framework like Tensorflow, etc + +* 3 Features锛� + * joint inference + * incremental learning + * federated learning + +* Targeting Users锛� + * Domain-specific AI Developers: build and publish edge-cloud collaborative AI services/functions easily + * Application Developers: use edge-cloud collaborative AI capabilities. + +* We are NOT: + * to re-invent existing ML framework, i.e., tensorflow, pytorch, mindspore, etc. + * to re-invent existing edge platform, i.e., kubeedge, etc. + * to offer domain/application-specific algorithms, i.e., facial recognition, text classification, etc. + + + +## Design Details + + +### Architecture + + + +* GlobalCoordinator: implements the Edge AI features controllers based on the [k8s operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/) + * Federated Learning Controller: Implements the federated learning feature based on user created CRDs + * Incremental Learning Controller: Implements the incremental learning feature based on user created CRDs + * Joint Inference Controller: Implements the joint inference feature based on user created CRDs + +* LocalController: manages the Edge AI features, the extra dataset/model resources on the edge nodes + +* Workers: includes the training/evaluation/inference/aggregator + * do inference or training, based on existing ML framework + * launch on demand, imagine they are docker containers + * different workers for different features + * could run on edge or cloud + +* Lib: exposes the Edge AI features to applications, i.e. training or inference programs + + + diff --git a/sig-ai/README.md b/sig-ai/README.md new file mode 100644 index 0000000000000000000000000000000000000000..871d8b9b2f6451d23c8ccbcc6bd76ab2aae3d664 --- /dev/null +++ b/sig-ai/README.md @@ -0,0 +1,17 @@ +# SIG AI + +SIG AI is responsible to provide general platform capabilities based on KubeEdge so that AI applications running at the edge can benefit from cost reduction, model performance improvement, and data privacy protection. + + +The [charter](./charter.md) defines the scope and governance of the AI Special Interest Group. + +## Meetings + + +* Regular SIG Meeting: [Thursday at 10:00 UTC+8](https://zoom.us/my/kubeedge) (weekly, starts on Nov. 12th 2020). [Convert to your timezone](https://www.thetimezoneconverter.com/?t=10%3A00%20am&tz=GMT%2B8&). + * [Meeting notes and Agenda](https://docs.google.com/document/d/e/2PACX-1vSKVQSQ4tmBsApIHVdH3sLJluzqSitRpRSr88VEgBMeTMCYczgPnuKhTfdF9srE0Obk9cTygHTste-N/pub). + * [Meeting Calendar](https://calendar.google.com/calendar/u/0?cid=Y19nODluOXAwOG05MzFiYWM3NmZsajgwZzEwOEBncm91cC5jYWxlbmRhci5nb29nbGUuY29t) | [Subscribe](https://calendar.google.com/calendar?cid=OHJqazhvNTE2dmZ0ZTIxcWlidmxhZTNsajRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ) + +## Contact +- Slack: [#sig-ai](to be filled) +- [Mailing list](https://groups.google.com/forum/#!forum/kubeedge) \ No newline at end of file diff --git a/sig-ai/charter.md b/sig-ai/charter.md new file mode 100644 index 0000000000000000000000000000000000000000..1e9747e7b4c7fcdee028772c124fb37a0384a408 --- /dev/null +++ b/sig-ai/charter.md @@ -0,0 +1,32 @@ +# SIG AI Charter + +This charter adheres to the conventions described in [KubeEdge Open Governance](https://github.com/kubeedge/community/blob/master/GOVERNANCE.md) and uses the Roles and Organization Management outlined in the governance doc. + +## Scope + +SIG AI is responsible to provide general platform capabilities based on KubeEdge so that AI applications running at the edge can benefit from cost reduction, model performance improvement, and data privacy protection. + +### In scope + +#### Areas of Focus + +- an edge-cloud collaborative AI framework based on KubeEdge +- with embed collaborative training and joint inference algorithm +- working with existing AI framework like Tensorflow, etc. + +### Out of scope +- to re-invent existing ML framework, i.e., tensorflow, pytorch, mindspore, etc. +- to re-invent existing edge platform, i.e., kubeedge, etc. +- to offer domain/application-specific algorithms, i.e., facial recognition, text classification, etc. + +## Roles and Organization Management + +This SIG follows and adheres to the Roles and Organization Management outlined in KubeEdge Open Governance and opts-in to updates and modifications to KubeEdge Open Governance. + +### Additional responsibilities of Chairs + +- Manage and curate the project boards associated with all sub-projects ahead of every SIG meeting so they may be discussed +- Ensure the agenda is populated 24 hours in advance of the meeting, or the meeting is then cancelled +- Report the SIG status at events and community meetings wherever possible +- Actively promote diversity and inclusion in the SIG +- Uphold the KubeEdge Code of Conduct especially in terms of personal behavior and responsibility \ No newline at end of file diff --git a/sig-ai/dataset-and-model.md b/sig-ai/dataset-and-model.md new file mode 100644 index 0000000000000000000000000000000000000000..fcfb38df07a067d667d003441edeec84d8f0e81e --- /dev/null +++ b/sig-ai/dataset-and-model.md @@ -0,0 +1,297 @@ +* [Dataset and Model](#dataset-and-model) + * [Motivation](#motivation) + * [Goals](#goals) + * [Non\-goals](#non-goals) + * [Proposal](#proposal) + * [Use Cases](#use-cases) + * [Design Details](#design-details) + * [CRD API Group and Version](#crd-api-group-and-version) + * [CRDs](#crds) + * [Type definition](#crd-type-definition) + * [Crd sample](#crd-samples) + * [Controller Design](#controller-design) + +# Dataset and Model + +## Motivation + +Currently, the EdgeAI features depend on the object `dataset` and `model` + + +This proposal provides the definitions of dataset and model as the first class of k8s resources. + +### Goals + +* Metadata of `dataset` and `model` objects. +* Used by the EdgeAI features + +### Non-goals +* The truly format of the AI `dataset`, such as `imagenet`, `coco` or `tf-record` etc. +* The truly format of the AI `model`, such as `ckpt`, `saved_model` of tensorflow etc. + +* The truly operations of the AI `dataset`, such as `shuffle`, `crop` etc. +* The truly operations of the AI `model`, such as `train`, `inference` etc. + + +## Proposal +We propose using Kubernetes Custom Resource Definitions (CRDs) to describe +the dataset/model specification/status and a controller to synchronize these updates between edge and cloud. + + + +### Use Cases + +* Users can create the dataset resource, by providing the `dataset url`, `format` and the `nodeName` which owns the dataset. +* Users can create the model resource by providing the `model url` and `format`. +* Users can show the information of dataset/model. +* Users can delete the dataset/model. + + +## Design Details + +### CRD API Group and Version +The `Dataset` and `Model` CRDs will be namespace-scoped. +The tables below summarize the group, kind and API version details for the CRDs. + +* Dataset + +| Field | Description | +|-----------------------|-------------------------| +|Group | edgeai.io | +|APIVersion | v1alpha1 | +|Kind | Dataset | + +* Model + +| Field | Description | +|-----------------------|-------------------------| +|Group | edgeai.io | +|APIVersion | v1alpha1 | +|Kind | Model | + +### CRDs + +- `Dataset` crd + +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: datasets.edgeai.io +spec: + group: edgeai.io + names: + kind: Dataset + plural: datasets + scope: Namespaced + versions: + - name: v1alpha1 + subresources: + # status enables the status subresource. + status: {} + served: true + storage: true + schema: + openAPIV3Schema: + type: object + properties: + spec: + type: object + properties: + dataUrl: + type: string + format: + type: string + nodeName: + type: string + status: + type: object + properties: + numberOfSamples: + type: integer + updateTime: + type: string + format: datatime + + + additionalPrinterColumns: + - name: NumberOfSamples + type: integer + description: The number of samples in the dataset + jsonPath: ".status.numberOfSamples" + - name: Node + type: string + description: The node name of the dataset + jsonPath: ".spec.nodeName" + - name: spec + type: string + description: The spec of the dataset + jsonPath: ".spec" +``` + +- `Model` crd +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: models.edgeai.io +spec: + group: edgeai.io + names: + kind: Model + plural: models + scope: Namespaced + versions: + - name: v1alpha1 + subresources: + # status enables the status subresource. + status: {} + served: true + storage: true + schema: + openAPIV3Schema: + type: object + properties: + spec: + type: object + properties: + modelUrl: + type: string + status: + type: object + properties: + updateTime: + type: string + format: datetime + metrics: + type: array + items: + type: object + properties: + key: + type: string + value: + type: string + + additionalPrinterColumns: + - name: updateAGE + type: date + description: The update age + jsonPath: ".status.updateTime" + - name: metrics + type: string + description: The metrics + jsonPath: ".status.metrics" +``` + +### CRD type definition +- `Dataset` + +```go +type Dataset struct { + metav1.TypeMeta `json:",inline"` + + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec DatasetSpec `json:"spec"` + Status DatasetStatus `json:"status"` +} + +type DatasetSpec struct { + DataUrl string `json:"dataUrl"` + Format string `json:"format"` + NodeName string `json:"nodeName"` +} + +type DatasetStatus struct { + UpdateTime *metav1.Time `json:"updateTime,omitempty"` + NumberOfSamples int `json:"numberOfSamples"` +} + +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object + +type DatasetList struct { + metav1.TypeMeta `json:",inline"` + metav1.ListMeta `json:"metadata"` + + Items []Dataset `json:"items"` +} +``` + +- `Model` + +```go +// +genclient +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object + +type Model struct { + metav1.TypeMeta `json:",inline"` + + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec ModelSpec `json:"spec"` + Status ModelStatus `json:"status"` +} + +type ModelSpec struct { + ModelUrl string `json:"modelUrl"` + Format string `json:"format"` +} + +type ModelStatus struct { + UpdateTime *metav1.Time `json:"updateTime,omitempty"` + Metrics []ModelMetric `json:"metrics,omitempty"` +} + +type ModelMetric struct { + Key string `json:"key"` + Value string `json:"value"` +} + +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object + +type ModelList struct { + metav1.TypeMeta `json:",inline"` + metav1.ListMeta `json:"metadata"` + + Items []Model `json:"items"` +} +``` + +### Crd samples +- `Dataset` + +```yaml +apiVersion: edgeai.io/v1alpha1 +kind: Dataset +metadata: + name: "dataset-examp" +spec: + dataUrl: "/code/data" + format: "txt" + nodeName: "edge0" +``` + +- `Model` + +```yaml +apiVersion: edgeai.io/v1alpha1 +kind: Model +metadata: + name: model-examp +spec: + modelUrl: "/model/frozen.pb" + format: pb +``` + + +## Controller Design +In the current design there is a controller for `dataset`, no controller for `model`.<br/> + +The dataset controller synchronizes the dataset between the cloud and edge. +- downstream: synchronize the dataset info from the cloud to the edge node. +- upstream: synchronize the dataset status from the edge to the cloud node, such as the information how many samples the dataset has. +<br/> + +Here is the flow of the dataset creation: + \ No newline at end of file diff --git a/sig-ai/federated-learning.md b/sig-ai/federated-learning.md new file mode 100644 index 0000000000000000000000000000000000000000..bb762bcd627d5d462997c9b1b0cd14049b6fa810 --- /dev/null +++ b/sig-ai/federated-learning.md @@ -0,0 +1,588 @@ +* [Federated Learning](#federated-learning) + * [Motivation](#motivation) + * [Goals](#goals) + * [Non\-goals](#non-goals) + * [Proposal](#proposal) + * [Use Cases](#use-cases) + * [Design Details](#design-details) + * [CRD API Group and Version](#crd-api-group-and-version) + * [Federated learning CRD](#federated-learning-crd) + * [Federated learning type definition](#federated-learning-type-definition) + * [Federated learning sample](#federated-learning-sample) + * [Validation](#validation) + * [Controller Design](#controller-design) + * [Federated Learning Controller](#federated-learning-controller) + * [Downstream Controller](#downstream-controller) + * [Upstream Controller](#upstream-controller) + * [Details of api between GC(cloud) and LC(edge)](#details-of-api-between-gccloud-and-lcedge) + * [Workers Communication](#workers-communication) + +# Federated Learning +## Motivation + +For edge AI, data is naturally generated at the edge. based on these assumptions: +* Users are unwilling to upload raw data to the cloud because of data privacy. +* Users do not want to purchase new devices for centralized training at the edge. +* The sample size at the edge is usually small, and it is often difficult to train a good model at a single edge node. + +Therefore, we propose a edge cloud federated learning framework to help to train a model **without uploading raw data**, and **higher precision** and **less convergence time** are also benefits. + + + + +### Goals + +* The framework can combine data on multiple edge nodes to complete training. +* The framework provides the functions of querying the training status and result. +* The framework integrates some common aggregation algorithms, FedAvg and so on. +* The framework integrates some common weight/gradient compression algorithm to reduce the cloud-edge traffic required for aggregation operations. +* The framework integrates some common multi-task migration algorithms to resolve the problem of low precision caused by small size samples. + + +## Proposal +We propose using Kubernetes Custom Resource Definitions (CRDs) to describe +the federated learning specification/status and a controller to synchronize these updates between edge and cloud. + + + +### Use Cases + + +* User can create a federated learning task, with providing a training script, specifying the aggregation algorithm, configuring training hyperparameters, configuring training datasets. + +* Users can get the federated learning status, including the nodes participating in training, current training status, samples size of each node, current iteration times, and current aggregation times. + +* Users can get the saved aggregated model. The model file can be stored on the cloud or edge node. + + + +## Design Details +### CRD API Group and Version +The `FederatedLearningTask` CRD will be namespace-scoped. +The tables below summarize the group, kind and API version details for the CRD. + +* FederatedLearningTask + +| Field | Description | +|-----------------------|-------------------------| +|Group | edgeai.io | +|APIVersion | v1alpha1 | +|Kind | FederatedLearningTask | + + +### Federated learning CRD + + +Notes: +1. We use `WorkerSpec` to represent the worker runtime config which all EdgeAI features use. +1. Currently `WorkerSpec` limits to the code directory on host path or s3-like storage. +We will extend it to the support with `pod template` like k8s deployment. +1. We will add the [resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) support in the future. + +Below is the CustomResourceDefinition yaml for `FederatedLearningTask`: + +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: federatedlearningtasks.edgeai.io +spec: + group: edgeai.io + names: + kind: FederatedLearningTask + plural: federatedlearningtasks + shortNames: + - federatedtask + - ft + scope: Namespaced + versions: + - name: v1alpha1 + subresources: + # status enables the status subresource. + status: {} + served: true + storage: true + schema: + openAPIV3Schema: + type: object + properties: + spec: + type: object + properties: + aggregationWorker: + type: object + properties: + name: + type: string + model: + type: object + properties: + name: + type: string + nodeName: + type: string + workerSpec: + type: object + properties: + scriptDir: + type: string + scriptBootFile: + type: string + frameworkType: + type: string + frameworkVersion: + type: string + parameters: + type: array + items: + type: object + required: + - key + - value + properties: + key: + type: string + value: + type: string + trainingWorkers: + type: array + items: + type: object + properties: + name: + type: string + model: + type: object + properties: + name: + type: string + nodeName: + type: string + workerSpec: + type: object + properties: + dataset: + type: object + properties: + name: + type: string + scriptDir: + type: string + scriptBootFile: + type: string + frameworkType: + type: string + frameworkVersion: + type: string + parameters: + type: array + items: + type: object + required: + - key + - value + + properties: + key: + type: string + value: + type: string + status: + type: object + properties: + conditions: + type: array + items: + type: object + properties: + type: + type: string + status: + type: string + lastProbeTime: + type: string + format: date-time + lastTransitionTime: + type: string + format: date-time + reason: + type: string + message: + type: string + startTime: + type: string + format: date-time + completionTime: + type: string + format: date-time + active: + type: integer + succeeded: + type: integer + failed: + type: integer + phase: + type: string + + + additionalPrinterColumns: + - name: status + type: string + description: The status of the federated learning task + jsonPath: ".status.phase" + - name: Age + type: date + jsonPath: .metadata.creationTimestamp +``` + +### Federated learning type definition +```go +// +genclient +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object +// FederatedLearningTask defines the federatedlearning task which describes the +// federated learning task +type FederatedLearningTask struct { + metav1.TypeMeta `json:",inline"` + + metav1.ObjectMeta `json:"metadata,omitempty"` + + Spec FederatedLearningTaskSpec `json:"spec"` + Status FederatedLearningTaskStatus `json:"status,omitempty"` +} + +// FederatedLearningTaskSpec describes the details configuration of federatedlearningtask +type FederatedLearningTaskSpec struct { + AggregationWorker AggregationWorker `json:"aggregationWorker"` + TrainingWorkers []TrainingWorker `json:"trainingWorkers"` +} + +// AggregationWorker describes the aggregation worker +type AggregationWorker struct { + Name string `json:"name"` + Model modelRefer `json:"model"` + NodeName string `json:"nodeName"` + WorkerSpec AggregationWorkerSpec `json:"workerSpec"` +} + +// TrrainingWorker describes the training worker of each node +type TrainingWorker struct { + Name string `json:"name"` + NodeName string `json:"nodeName"` + Dataset datasetRefer `json:"dataset"` + WorkerSpec TrainingWorkerSpec `json:"workerSpec"` +} + +type AggregationWorkerSpec struct { + ScriptDir string `json:"scriptDir"` + ScriptBootFile string `json:"scriptBootFile"` + FrameworkType string `json:"frameworkType"` + FrameworkVersion string `json:"frameworkVersion"` + Parameters []ParaSpec `json:"parameters"` +} + +type TrainingWorkerSpec struct { + ScriptDir string `json:"scriptDir"` + ScriptBootFile string `json:"scriptBootFile"` + FrameworkType string `json:"frameworkType"` + FrameworkVersion string `json:"frameworkVersion"` + Parameters []ParaSpec `json:"parameters"` +} + +type ParaSpec struct { + Key string `json:"key"` + Value string `json:"value"` +} + +type datasetRefer struct { + Name string `json:"name"` +} + +type modelRefer struct { + Name string `json:"name"` +} + +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object + +// FederatedLearningTaskList is a list of federated learning tasks. +type FederatedLearningTaskList struct { + metav1.TypeMeta `json:",inline"` + metav1.ListMeta `json:"metadata"` + Items []FederatedLearningTask `json:"items"` +} + +// FederatedLearningTaskStatus represents the current state of a federated learning task. +type FederatedLearningTaskStatus struct { + + // The latest available observations of a federated learning task's current state. + // +optional + Conditions []FederatedLearningTaskCondition `json:"conditions,omitempty"` + + // Represents time when the task was acknowledged by the task controller. + // It is not guaranteed to be set in happens-before order across separate operations. + // It is represented in RFC3339 form and is in UTC. + // +optional + StartTime *metav1.Time `json:"startTime,omitempty"` + + // Represents time when the task was completed. It is not guaranteed to + // be set in happens-before order across separate operations. + // It is represented in RFC3339 form and is in UTC. + // +optional + CompletionTime *metav1.Time `json:"completionTime,omitempty"` + + // The number of actively running pods. + // +optional + Active int32 `json:"active,omitempty"` + + // The number of pods which reached phase Succeeded. + // +optional + Succeeded int32 `json:"succeeded,omitempty"` + + // The number of pods which reached phase Failed. + // +optional + Failed int32 `json:"failed,omitempty"` + + // The phase of the federated learning task. + // +optional + Phase FederatedLearningTaskPhase `json:"phase,omitempty"` +} + +type FederatedLearningTaskConditionType string + +// These are valid conditions of a task. +const ( + // FederatedLearningTaskComplete means the task has completed its execution. + FederatedLearningTaskCondComplete FederatedLearningTaskConditionType = "Complete" + // FederatedLearningTaskFailed means the task has failed its execution. + FederatedLearningTaskCondFailed FederatedLearningTaskConditionType = "Failed" + // FederatedLearningTaskTraining means the task has been training. + FederatedLearningTaskCondTraining FederatedLearningTaskConditionType = "Training" +) + +// FederatedLearningTaskCondition describes current state of a task. +type FederatedLearningTaskCondition struct { + // Type of task condition, Complete or Failed. + Type FederatedLearningTaskConditionType `json:"type"` + // Status of the condition, one of True, False, Unknown. + Status v1.ConditionStatus `json:"status"` + // Last time the condition was checked. + // +optional + LastProbeTime metav1.Time `json:"lastProbeTime,omitempty"` + // Last time the condition transit from one status to another. + // +optional + LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"` + // (brief) reason for the condition's last transition. + // +optional + Reason string `json:"reason,omitempty"` + // Human readable message indicating details about last transition. + // +optional + Message string `json:"message,omitempty"` +} + +// FederatedLearningTaskPhase is a label for the condition of a task at the current time. +type FederatedLearningTaskPhase string + +// These are the valid statuses of tasks. +const ( + // FederatedLearningTaskPending means the task has been accepted by the system, but one or more of the pods + // has not been started. This includes time before being bound to a node, as well as time spent + // pulling images onto the host. + FederatedLearningTaskPending FederatedLearningTaskPhase = "Pending" + // FederatedLearningTaskRunning means the task has been bound to a node and all of the pods have been started. + // At least one container is still running or is in the process of being restarted. + FederatedLearningTaskRunning FederatedLearningTaskPhase = "Running" + // FederatedLearningTaskSucceeded means that all pods in the task have voluntarily terminated + // with a container exit code of 0, and the system is not going to restart any of these pods. + FederatedLearningTaskSucceeded FederatedLearningTaskPhase = "Succeeded" + // FederatedLearningTaskFailed means that all pods in the task have terminated, and at least one container has + // terminated in a failure (exited with a non-zero exit code or was stopped by the system). + FederatedLearningTaskFailed FederatedLearningTaskPhase = "Failed" +) +``` + +#### Validation +[Open API v3 Schema based validation](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#validation) can be used to guard against bad requests. +Invalid values for fields ( example string value for a boolean field etc) can be validated using this. + +Here is a list of validations we need to support : +1. The `dataset` specified in the crd should exist in k8s. +1. The `model` specified in the crd should exist in k8s. +1. The edgenode name specified in the crd should exist in k8s. + +### federated learning sample +```yaml +apiVersion: edgeai.io/v1alpha1 +kind: FederatedLearningTask +metadata: + name: magnetic-tile-defect-detection +spec: + aggregationWorker: + name: "aggregationworker" + model: + name: "model-demo1" + nodeName: "solar-corona-cloud" + workerSpec: + scriptDir: "/code" + scriptBootFile: "aggregate.py" + frameworkType: "tensorflow" + frameworkVersion: "1.18" + parameters: + - key: "exit_round" + value: "3" + trainingWorkers: + - name: "work0" + nodeName: "edge0" + workerSpec: + dataset: + name: "dataset-demo0" + scriptDir: "/code" + scriptBootFile: "train.py" + frameworkType: "tensorflow" + frameworkVersion: "1.18" + parameters: + - key: "batch_size" + value: "32" + - key: "learning_rate" + value: "0.001" + - key: "epochs" + value: "1" + - name: "work1" + nodeName: "edge1" + workerSpec: + dataset: + name: "dataset-demo1" + scriptDir: "/code" + scriptBootFile: "train.py" + frameworkType: "tensorflow" + frameworkVersion: "1.18" + parameters: + - key: "batch_size" + value: "32" + - key: "learning_rate" + value: "0.001" + - key: "epochs" + value: "1" + - key: "min_sample_number_per" + value: "500" + - key: "min_node_number" + value: "3" + - key: "rounds_between_valida" + value: "3" + + - name: "work2" + nodeName: "edge2" + workerSpec: + dataset: + name: "dataset-demo2" + scriptDir: "/code" + scriptBootFile: "train.py" + frameworkType: "tensorflow" + frameworkVersion: "1.18" + parameters: + - key: "batch_size" + value: "32" + - key: "learning_rate" + value: "0.001" + - key: "epochs" + value: "1" + - key: "min_sample_number_per" + value: "500" + - key: "min_node_number" + value: "3" + - key: "rounds_between_valida" + value: "3" + +``` + +### Creation of the federated learning task + +## Controller Design +The federated learning controller starts three separate goroutines called `upstream`, `downstream` and `federated-learning`controller. These are not separate controllers as such but named here for clarity. +- federated learning: watch the updates of federated-learning-task crds, and create the workers to complete the task. +- downstream: synchronize the federated-learning updates from the cloud to the edge node. +- upstream: synchronize the federated-learning updates from the edge to the cloud node. + +### Federated Learning Controller + + +The federated-learning controller watches for the updates of federated-learning tasks and the corresponding pods against the K8S API server.<br/> +Updates are categorized below along with the possible actions: + +| Update Type | Action | +|-------------------------------|---------------------------------------------- | +|New Federated-learning-task Created |Create the aggregation worker and these local-training workers| +|Federated-learning-task Deleted | NA. These workers will be deleted by [k8s gc](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/).| +|The corresponding pod created/running/completed/failed | Update the status of federated-learning task.| + + +### Downstream Controller + + +The downstream controller watches for federated-learning updates against the K8S API server.<br/> +Updates are categorized below along with the possible actions that the downstream controller can take: + +| Update Type | Action | +|-------------------------------|---------------------------------------------- | +|New Federated-learning-task Created |Sends the task information to LCs.| +|Federated-learning-task Deleted | The controller sends the delete event to LCs.| + +### Upstream Controller + + +The upstream controller watches for federated-learning-task updates from the edge node and applies these updates against the API server in the cloud. +Updates are categorized below along with the possible actions that the upstream controller can take: + +| Update Type | Action | +|------------------------------- |---------------------------------------------- | +|Federated-learning-task Reported State Updated | The controller appends the reported status of the Federated-learning-task in the cloud. | + +### Details of api between GC(cloud) and LC(edge) +1. GC(downstream controller) syncs the task info to LC: + ```go + // POST <namespace>/federatedlearningtasks/<job-name> + // body same to the task crd of k8s api, omitted here. + ``` + +1. LC uploads the task status which reported by the worker to GC(upstream controller): + ```go + // POST <namespace>/federatedlearningtasks/<job-name>/status + + // WorkerMessage defines the message from that the training worker. It will send to GC. + type WorkerMessage struct { + Phase string `json:"phase"` + Status string `json:"status"` + Output *WorkerOutput `json:"output"` + } + // + type WorkerOutput struct { + Models []*Model `json:"models"` + TaskInfo *TaskInfo `json:"taskInfo"` + } + + // Model defines the model information + type Model struct { + Format string `json:"format"` + URL string `json:"url"` + // Including the metrics, e.g. precision/recall + Metrics map[string]float64 `json:"metrics"` + } + + // TaskInfo defines the task information + type TaskInfo struct { + // Current training round + CurrentRound int `json:"currentRound"` + UpdateTime string `json:"updateTime"` + } + ``` + +### The flow of federated learning task creation + +The federated-learning controller watches the creation of federatedlearningtask crd in the cloud, syncs them to lc via the cloudhub-to-edgehub channel, +and creates the aggregator worker on the cloud nodes and the training workers on the edge nodes specified by the user.<br/> +The aggregator worker is started by the native k8s at the cloud nodes. +These training workers are started by the kubeedge at the edge nodes. + + +## Workers Communication + + +Todo: complete the two restful apis. \ No newline at end of file diff --git a/sig-ai/images/architecture.png b/sig-ai/images/architecture.png new file mode 100644 index 0000000000000000000000000000000000000000..110bdfd5c94063e9af4400e972dbeb83ed13824e Binary files /dev/null and b/sig-ai/images/architecture.png differ diff --git a/sig-ai/images/dataset-creation-flow.png b/sig-ai/images/dataset-creation-flow.png new file mode 100644 index 0000000000000000000000000000000000000000..5192b81e64187ebe28c87cc3510641b08b4716a2 Binary files /dev/null and b/sig-ai/images/dataset-creation-flow.png differ diff --git a/sig-ai/images/dataset-model-crd.png b/sig-ai/images/dataset-model-crd.png new file mode 100644 index 0000000000000000000000000000000000000000..03f36c30216fdb234e1c93b50b6ac5d0835b9510 Binary files /dev/null and b/sig-ai/images/dataset-model-crd.png differ diff --git a/sig-ai/images/federated-learning-controller.png b/sig-ai/images/federated-learning-controller.png new file mode 100644 index 0000000000000000000000000000000000000000..d8a9c2e7bf5bdb855f569d8d02e52e4d957e8e5b Binary files /dev/null and b/sig-ai/images/federated-learning-controller.png differ diff --git a/sig-ai/images/federated-learning-creation-flow.png b/sig-ai/images/federated-learning-creation-flow.png new file mode 100644 index 0000000000000000000000000000000000000000..7c48cfcda9736880db2d6e5f2892b869a73a4831 Binary files /dev/null and b/sig-ai/images/federated-learning-creation-flow.png differ diff --git a/sig-ai/images/federated-learning-downstream-controller.png b/sig-ai/images/federated-learning-downstream-controller.png new file mode 100644 index 0000000000000000000000000000000000000000..902dde4c12f56b8c0df2773cf0a3d054374bbfca Binary files /dev/null and b/sig-ai/images/federated-learning-downstream-controller.png differ diff --git a/sig-ai/images/federated-learning-task-crd-details.png b/sig-ai/images/federated-learning-task-crd-details.png new file mode 100644 index 0000000000000000000000000000000000000000..06385fd423c697f8295cd46a33f7a2f1260072c9 Binary files /dev/null and b/sig-ai/images/federated-learning-task-crd-details.png differ diff --git a/sig-ai/images/federated-learning-task-crd.png b/sig-ai/images/federated-learning-task-crd.png new file mode 100644 index 0000000000000000000000000000000000000000..30e37a674e88dd97f2732f8e570ae7f33ca207c4 Binary files /dev/null and b/sig-ai/images/federated-learning-task-crd.png differ diff --git a/sig-ai/images/federated-learning-upstream-controller.png b/sig-ai/images/federated-learning-upstream-controller.png new file mode 100644 index 0000000000000000000000000000000000000000..dbd9c6d992ceb255c7e4e84473f67e9ab1168896 Binary files /dev/null and b/sig-ai/images/federated-learning-upstream-controller.png differ diff --git a/sig-ai/images/federated-learning-worker-communication.png b/sig-ai/images/federated-learning-worker-communication.png new file mode 100644 index 0000000000000000000000000000000000000000..e4ec3bbff2296edbf6ed3a2b24d0273b9ff79947 Binary files /dev/null and b/sig-ai/images/federated-learning-worker-communication.png differ diff --git a/sig-ai/images/framework.png b/sig-ai/images/framework.png new file mode 100644 index 0000000000000000000000000000000000000000..6cc848badaa72dfe9a44948739467c6b64a669b8 Binary files /dev/null and b/sig-ai/images/framework.png differ diff --git a/sig-ai/images/incremental-learning-controller.png b/sig-ai/images/incremental-learning-controller.png new file mode 100644 index 0000000000000000000000000000000000000000..910acfce632fe819359539f16ba558db77fe0ef2 Binary files /dev/null and b/sig-ai/images/incremental-learning-controller.png differ diff --git a/sig-ai/images/incremental-learning-creation-flow-deploy-stage.png b/sig-ai/images/incremental-learning-creation-flow-deploy-stage.png new file mode 100644 index 0000000000000000000000000000000000000000..1543014da40300fc0b2f3a77ac401d5cae11e937 Binary files /dev/null and b/sig-ai/images/incremental-learning-creation-flow-deploy-stage.png differ diff --git a/sig-ai/images/incremental-learning-creation-flow-eval-stage.png b/sig-ai/images/incremental-learning-creation-flow-eval-stage.png new file mode 100644 index 0000000000000000000000000000000000000000..b729201873d84f29712e504bfeee84ff08ae1359 Binary files /dev/null and b/sig-ai/images/incremental-learning-creation-flow-eval-stage.png differ diff --git a/sig-ai/images/incremental-learning-creation-flow-train-stage.png b/sig-ai/images/incremental-learning-creation-flow-train-stage.png new file mode 100644 index 0000000000000000000000000000000000000000..38ac949a345e3de60bf1c1fc60c75d06f4b056a8 Binary files /dev/null and b/sig-ai/images/incremental-learning-creation-flow-train-stage.png differ diff --git a/sig-ai/images/incremental-learning-creation-flow.png b/sig-ai/images/incremental-learning-creation-flow.png new file mode 100644 index 0000000000000000000000000000000000000000..a1755d4768a154177a6616aea3fe360b2d9a4fcc Binary files /dev/null and b/sig-ai/images/incremental-learning-creation-flow.png differ diff --git a/sig-ai/images/incremental-learning-downstream-controller.png b/sig-ai/images/incremental-learning-downstream-controller.png new file mode 100644 index 0000000000000000000000000000000000000000..f1846f3775c80cd173b31388e394346f90b79190 Binary files /dev/null and b/sig-ai/images/incremental-learning-downstream-controller.png differ diff --git a/sig-ai/images/incremental-learning-job-crd-details.png b/sig-ai/images/incremental-learning-job-crd-details.png new file mode 100644 index 0000000000000000000000000000000000000000..df73e4bbcd53297557f0d75c750182d818dbe383 Binary files /dev/null and b/sig-ai/images/incremental-learning-job-crd-details.png differ diff --git a/sig-ai/images/incremental-learning-job-crd.png b/sig-ai/images/incremental-learning-job-crd.png new file mode 100644 index 0000000000000000000000000000000000000000..c6c6361cf7b0eabec81018443ee10f1410b82751 Binary files /dev/null and b/sig-ai/images/incremental-learning-job-crd.png differ diff --git a/sig-ai/images/incremental-learning-state-machine.png b/sig-ai/images/incremental-learning-state-machine.png new file mode 100644 index 0000000000000000000000000000000000000000..b3685a5056656dd3f187466c67d3cf01b4c53a1c Binary files /dev/null and b/sig-ai/images/incremental-learning-state-machine.png differ diff --git a/sig-ai/images/incremental-learning-upstream-controller.png b/sig-ai/images/incremental-learning-upstream-controller.png new file mode 100644 index 0000000000000000000000000000000000000000..4921e030a4c90c4608fcedac5df8385dbd84fd9b Binary files /dev/null and b/sig-ai/images/incremental-learning-upstream-controller.png differ diff --git a/sig-ai/images/joint-inference-controller.png b/sig-ai/images/joint-inference-controller.png new file mode 100644 index 0000000000000000000000000000000000000000..73e52b186c0b43fdcd03ec63d852ab02ac189c5c Binary files /dev/null and b/sig-ai/images/joint-inference-controller.png differ diff --git a/sig-ai/images/joint-inference-downstream-controller.png b/sig-ai/images/joint-inference-downstream-controller.png new file mode 100644 index 0000000000000000000000000000000000000000..1473eacefd2c3e3a67273c8ae37de0839a706b48 Binary files /dev/null and b/sig-ai/images/joint-inference-downstream-controller.png differ diff --git a/sig-ai/images/joint-inference-service-crd-details.png b/sig-ai/images/joint-inference-service-crd-details.png new file mode 100644 index 0000000000000000000000000000000000000000..6769cf7bfd4aa5cda3f7f70d35a1475d9c1af9d1 Binary files /dev/null and b/sig-ai/images/joint-inference-service-crd-details.png differ diff --git a/sig-ai/images/joint-inference-service-crd.png b/sig-ai/images/joint-inference-service-crd.png new file mode 100644 index 0000000000000000000000000000000000000000..09e94174964fa9cefbae6dcd3710d1cf8f537a71 Binary files /dev/null and b/sig-ai/images/joint-inference-service-crd.png differ diff --git a/sig-ai/images/joint-inference-upstream-controller.png b/sig-ai/images/joint-inference-upstream-controller.png new file mode 100644 index 0000000000000000000000000000000000000000..0e798c3d44e6743cad159a85c3fba479de56600d Binary files /dev/null and b/sig-ai/images/joint-inference-upstream-controller.png differ diff --git a/sig-ai/images/joint-inference-worker-communication.png b/sig-ai/images/joint-inference-worker-communication.png new file mode 100644 index 0000000000000000000000000000000000000000..444237b0dc23abd90ee78e92a8f56b93bdc6795a Binary files /dev/null and b/sig-ai/images/joint-inference-worker-communication.png differ diff --git a/sig-ai/incremental-learning.md b/sig-ai/incremental-learning.md new file mode 100644 index 0000000000000000000000000000000000000000..5498ed19dcc3acedee58242eb2e7147b4e454239 --- /dev/null +++ b/sig-ai/incremental-learning.md @@ -0,0 +1,594 @@ +* [Incremental Learning](#incremental-learning) + * [Motivation](#motivation) + * [Goals](#goals) + * [Non\-goals](#non-goals) + * [Proposal](#proposal) + * [Use Cases](#use-cases) + * [Design Details](#design-details) + * [CRD API Group and Version](#crd-api-group-and-version) + * [Incremental learning CRD](#incremental-learning-crd) + * [Incremental learning type definition](#incremental-learning-job-type-definition) + * [Incremental learning sample](#incremental-learning-job-sample) + * [Validation](#validation) + * [Controller Design](#controller-design) + * [Incremental Learning Controller](#incremental-learning-controller) + * [Downstream Controller](#downstream-controller) + * [Upstream Controller](#upstream-controller) + * [Details of api between GC(cloud) and LC(edge)](#details-of-api-between-gccloud-and-lcedge) + * [Workers Communication](#workers-communication) + +# Incremental Learning +## Motivation + + +Data is continuously generated on the edge side. Traditionally, the data is collected manually and periodically retrained on the cloud to improve the model effect. This method wastes a lot of human resources, and the model update frequency is slow. Incremental learning allows users to continuously monitor the newly generated data and by configuring some triggering rules to determine whether to start training, evaluation, and deployment automatically, and continuously improve the model performance. + + + + +### Goals + + +* Automatically retrains, evaluates, and updates models based on the data generated at the edge. +* Support time trigger, sample size trigger, and precision-based trigger. +* Support manual triggering of training, evaluation, and model update. +* support hard sample discovering of unlabeled data, for reducing the manual labeling workload. +* Support lifelong learning that reserves historical knowledge to avoid frequent re-training/ re-fine-tuning, and tackles samples uncovered in historical knowledge base. + + +## Proposal +We propose using Kubernetes Custom Resource Definitions (CRDs) to describe +the incremental learning specification/status and a controller to synchronize these updates between edge and cloud. + + + +### Use Cases + +* Users can create the incremental learning jobs, by providing training scripts, configuring training hyperparameters, providing training datasets, configuring training and deployment triggers. + + + +## Design Details +We use the `job` word to represent the **Constantly Iterative Update** tasks including the train/eval/deploy task.<br/> +There are three stages in a job: train/eval/deploy. + +Each stage contains these below states: +1. Waiting: wait to trigger satisfied, i.e. wait to train/eval/deploy +1. Ready: the corresponding trigger satisfied, now ready to train/eval/deploy +1. Starting: the corresponding stage is starting +1. Running: the corresponding stage is running +1. Failed: the corresponding stage failed +1. Completed: the corresponding stage completed + + + +### CRD API Group and Version +The `IncrementalLearningJob` CRD will be namespace-scoped. +The tables below summarize the group, kind and API version details for the CRD. + +* IncrementalLearningJob + +| Field | Description | +|-----------------------|-------------------------| +|Group | edgeai.io | +|APIVersion | v1alpha1 | +|Kind | IncrementalLearningJob | + +### Incremental learning CRD + + +Below is the CustomResourceDefinition yaml for `IncrementalLearningJob`: +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: incrementallearningjobs.edgeai.io +spec: + group: edgeai.io + names: + kind: IncrementalLearningJob + plural: incrementallearningjobs + shortNames: + - incrementaljob + - ij + scope: Namespaced + versions: + - name: v1alpha1 + subresources: + # status enables the status subresource. + status: {} + served: true + storage: true + schema: + openAPIV3Schema: + type: object + properties: + spec: + type: object + properties: + dataset: + type: object + properties: + name: + type: string + trainProb: + type: number + nodeName: + type: string + outputDir: + type: string + initialModel: + type: object + properties: + name: + type: string + trainSpec: + type: object + properties: + workerSpec: + type: object + properties: + scriptDir: + type: string + scriptBootFile: + type: string + frameworkType: + type: string + frameworkVersion: + type: string + parameters: + type: array + items: + type: object + properties: + key: + type: string + value: + type: string + trigger: + type: object + properties: + checkPeriodSeconds: + type: integer + timer: + type: object + properties: + start: + type: string + end: + type: string + condition: + type: object + properties: + operator: + type: string + enum: [">=",">","=","==","<=","<","ge","gt","eq","le","lt"] + threshold: + type: number + metric: + type: string + evalSpec: + type: object + properties: + workerSpec: + type: object + properties: + scriptDir: + type: string + scriptBootFile: + type: string + frameworkType: + type: string + frameworkVersion: + type: string + parameters: + type: array + items: + type: object + properties: + key: + type: string + value: + type: string + deploySpec: + type: object + properties: + model: + type: object + properties: + name: + type: string + trigger: + type: object + properties: + checkPeriodSeconds: + type: integer + timer: + type: object + properties: + start: + type: string + end: + type: string + condition: + type: object + properties: + operator: + type: string + enum: [">=",">","=","==","<=","<","ge","gt","eq","le","lt"] + threshold: + type: number + metric: + type: string + + status: + type: object + properties: + conditions: + type: array + items: + type: object + properties: + type: + type: string + status: + type: string + lastProbeTime: + type: string + format: date-time + lastTransitionTime: + type: string + format: date-time + reason: + type: string + message: + type: string + data: + type: string + stage: + type: string + startTime: + type: string + format: date-time + completionTime: + type: string + format: date-time + active: + type: integer + succeeded: + type: integer + failed: + type: integer + + + additionalPrinterColumns: + - name: stage + type: string + description: The stage of the incremental learning job + jsonPath: ".status.conditions[-1].stage" + - name: status + type: string + description: The status of the incremental learning job + jsonPath: ".status.conditions[-1].type" + - name: Age + type: date + jsonPath: .metadata.creationTimestamp +``` + +### Incremental learning job type definition +```go +// +genclient +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object +// IncrementalLearningJob defines the incremental learing job crd +type IncrementalLearningJob struct { + metav1.TypeMeta `json:",inline"` + + metav1.ObjectMeta `json:"metadata"` + + Spec IncrementalLearningJobSpec `json:"spec"` + Status IncrementalLearningJobStatus `json:"status"` +} + +//IncrementalLearningJobSpec describes the details configuration of incremental learning job +type IncrementalLearningJobSpec struct { + Dataset IncrementalDataset `json:"dataset"` + OutputDir string `json:"outputDir"` + NodeName string `json:"nodeName"` + InitialModel InitialModel `json:"initialModel"` + TrainSpec TrainSpec `json:"trainSpec"` + EvalSpec EvalSpec `json:"evalSpec"` + DeploySpec DeploySpec `json:"deploySpec"` +} + +// TrainSpec describes the train worker +type TrainSpec struct { + WorkerSpec WorkerSpec `json:"workerSpec"` + Trigger Trigger `json:"trigger"` +} + +// EvalSpec describes the train worker +type EvalSpec struct { + WorkerSpec WorkerSpec `json:"workerSpec"` +} + +// DeploySpec describes the deploy model to be updated +type DeploySpec struct { + Model DeployModel `json:"model"` + Trigger Trigger `json:"trigger"` +} + +// WorkerSpec describes the details to run the worker +type WorkerSpec struct { + ScriptDir string `json:"scriptDir"` + ScriptBootFile string `json:"scriptBootFile"` + FrameworkType string `json:"frameworkType"` + FrameworkVersion string `json:"frameworkVersion"` + Parameters []ParaSpec `json:"parameters"` +} + +type Trigger struct { + CheckPeriodSeconds int `json:"checkPeriodSeconds,omitempty"` + Timer *Timer `json:"timer,omitempty"` + Condition Condition `json:"condition"` +} + +type Timer struct { + Start string `json:"start"` + End string `json:"end"` +} + +type Condition struct { + Operator string `json:"operator"` + Threshold float64 `json:"threshold"` + Metric string `json:"metric"` +} + +type IncrementalDataset struct { + Name string `json:"name"` + TrainProb float64 `json:"trainProb"` +} + +type InitialModel struct { + Name string `json:"name"` +} + +type DeployModel struct { + Name string `json:"name"` +} + +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object + +// IncrementalLearningJobList is a list of incremental learning jobs. +type IncrementalLearningJobList struct { + metav1.TypeMeta `json:",inline"` + metav1.ListMeta `json:"metadata"` + Items []IncrementalLearningJob `json:"items"` +} + +// IncrementalLearningJobStatus represents the current state of a incremental learning job +type IncrementalLearningJobStatus struct { + // The latest available observations of a incrementl job's current state. + // +optional + Conditions []IncrementalLearningJobCondition `json:"conditions,omitempty"` + + // Represents time when the job was acknowledged by the job controller. + // It is not guaranteed to be set in happens-before order across separate operations. + // It is represented in RFC3339 form and is in UTC. + // +optional + StartTime *metav1.Time `json:"startTime,omitempty"` + + // Represents time when the job was completed. It is not guaranteed to + // be set in happens-before order across separate operations. + // It is represented in RFC3339 form and is in UTC. + // +optional + CompletionTime *metav1.Time `json:"completionTime,omitempty"` +} + +type IncrementalStageConditionType string + +// These are valid stage conditions of a job. +const ( + IncrementalStageCondWaiting IncrementalStageConditionType = "Waiting" + IncrementalStageCondReady IncrementalStageConditionType = "Ready" + IncrementalStageCondStarting IncrementalStageConditionType = "Starting" + IncrementalStageCondRunning IncrementalStageConditionType = "Running" + IncrementalStageCondCompleted IncrementalStageConditionType = "Completed" + IncrementalStageCondFailed IncrementalStageConditionType = "Failed" +) + +// IncrementalLearningJobCondition describes current state of a job. +type IncrementalLearningJobCondition struct { + // Type of job condition, Complete or Failed. + Type IncrementalStageConditionType `json:"type"` + // Status of the condition, one of True, False, Unknown. + Status v1.ConditionStatus `json:"status"` + // Stage of the condition + Stage IncrementalLearningJobStage `json:"stage"` + // Last time the condition was checked. + // +optional + LastProbeTime metav1.Time `json:"lastProbeTime,omitempty"` + // Last time the condition transit from one status to another. + // +optional + LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"` + // (brief) reason for the condition's last transition. + // +optional + Reason string `json:"reason,omitempty"` + // Human readable message indicating details about last transition. + // +optional + Message string `json:"message,omitempty"` + // The json data related to this condition + // +optional + Data string `json:"data,omitempty"` +} + +// IncrementalLearningJobStage is a label for the stage of a job at the current time. +type IncrementalLearningJobStage string + +const ( + IncrementalLearningJobTrain IncrementalLearningJobStage = "Train" + IncrementalLearningJobEval IncrementalLearningJobStage = "Eval" + IncrementalLearningJobDeploy IncrementalLearningJobStage = "Deploy" +) + + +``` + +#### Validation +[Open API v3 Schema based validation](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#validation) can be used to guard against bad requests. +Invalid values for fields (example string value for a boolean field etc) can be validated using this. + +Here is a list of validations we need to support : +1. The `dataset` specified in the crd should exist in k8s. +1. The `model` specified in the crd should exist in k8s. +1. The edgenode name specified in the crd should exist in k8s. + +### Incremental learning job sample +```yaml +apiVersion: edgeai.io/v1alpha1 +kind: IncrementalLearningJob +metadata: + name: helmet-detection-demo +spec: + initialModel: + name: "initial-model" + dataset: + name: "incremental-dataset" + trainProb: 0.8 + trainSpec: + workerSpec: + scriptDir: "/model_train/yolov3_algorithms/" + scriptBootFile: "train.py" + frameworkType: "tensorflow" + frameworkVersion: "1.18" + parameters: + - key: "batch_size" + value: "32" + - key: "learning_rate" + value: "0.001" + - key: "max_epochs" + value: "100" + + trigger: + checkPeriodSeconds: 60 + timer: + start: 02:00 + end: 04:00 + condition: + operator: ">" + threshold: 500 + metric: num_of_samples + evalSpec: + workerSpec: + scriptDir: "/model_train/yolov3_algorithms/" + scriptBootFile: "eval.py" + frameworkType: "tensorflow" + frameworkVersion: "1.18" + + deploySpec: + model: + name: "deploy-model" + trigger: + condition: + operator: ">" + threshold: 0.1 + metric: precision_delta + + nodeName: edge1 + outputDir: "/helmet-detection/" +``` + +## Controller Design + +The incremental learning controller starts three separate goroutines called `upstream`, `downstream` and `incrementallearningjob`controller.<br/> +These are not separate controllers as such but named here for clarity. +- incremental learning: watch the updates of incremental-learning job crds, and create the workers depending on the state machine. +- downstream: synchronize the incremental-learning-job updates from the cloud to the edge node. +- upstream: synchronize the incremental-learning-job updates from the edge to the cloud node. + +### Incremental Learning Controller + + +The incremental-learning controller watches for the updates of incremental-learning jobs and the corresponding pods against the K8S API server.<br/> +Updates are categorized below along with the possible actions: + +| Update Type | Action | +|-------------------------------|---------------------------------------------- | +|New Incremental-learning-job Created | Wait to train trigger satisfied| +|Incremental-learning-job Deleted | NA. These workers will be deleted by [k8s gc](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/).| +|The Status of Incremental-learning-job Updated | Create the train/eval worker if it's ready.| +|The corresponding pod created/running/completed/failed | Update the status of incremental-learning job.| + +### Downstream Controller + + +The downstream controller watches for the incremental-learning job updates against the K8S API server.<br/> +Updates are categorized below along with the possible actions that the downstream controller can take: + +| Update Type | Action | +|-------------------------------|---------------------------------------------- | +|New Incremental-learning-job Created |Sends the job information to LCs.| +|Incremental-learning-job Deleted | The controller sends the delete event to LCs.| + +### Upstream Controller + + +The upstream controller watches for the incremental-learning job updates from the edge node and applies these updates against the API server in the cloud.<br/> +Updates are categorized below along with the possible actions that the upstream controller can take: + +| Update Type | Action | +|------------------------------- |---------------------------------------------- | +|Incremental-learning-job Reported State Updated | The controller appends the reported status of the job by LC in the cloud. | + +### Details of api between GC(cloud) and LC(edge) +1. GC(downstream controller) syncs the job info to LC: + ```go + // POST <namespace>/incrementallearningjobs/<job-name> + // body same to the job crd of k8s api, omitted here. + ``` + +1. LC uploads the job status which reported by the worker to GC(upstream controller): + ```go + // POST <namespace>/incrementallearningjobs/<job-name>/status + + // WorkerMessage defines the message from that the training worker. It will send to GC. + type WorkerMessage struct { + Phase string `json:"phase"` + Status string `json:"status"` + Output *WorkerOutput `json:"output"` + } + // + type WorkerOutput struct { + Models []*Model `json:"models"` + TaskInfo *TaskInfo `json:"taskInfo"` + } + + // Model defines the model information + type Model struct { + Format string `json:"format"` + URL string `json:"url"` + // Including the metrics, e.g. precision/recall + Metrics map[string]float64 `json:"metrics"` + } + + // TaskInfo defines the task information + type TaskInfo struct { + // Current training round + CurrentRound int `json:"currentRound"` + UpdateTime string `json:"updateTime"` + } + ``` + +### The flows of incremental learning job +- Flow of the `train` stage + + +- Flow of the `eval` stage + + + +- Flow of the `deploy` stage + + + +## Workers Communication +No need to communicate between workers. \ No newline at end of file diff --git a/sig-ai/joint-inference.md b/sig-ai/joint-inference.md new file mode 100644 index 0000000000000000000000000000000000000000..983c0c56d9ed142915b05505df14699ca464672e --- /dev/null +++ b/sig-ai/joint-inference.md @@ -0,0 +1,383 @@ +* [Joint Inference](#joint-inference) + * [Motivation](#motivation) + * [Goals](#goals) + * [Non\-goals](#non-goals) + * [Proposal](#proposal) + * [Use Cases](#use-cases) + * [Design Details](#design-details) + * [CRD API Group and Version](#crd-api-group-and-version) + * [Joint inference CRD](#joint-inference-crd) + * [Joint inference type definition](#joint-inference-type-definition) + * [Joint inference sample](#joint-inference-sample) + * [Validation](#validation) + * [Controller Design](#controller-design) + * [Joint Inference Controller](#joint-inference-controller) + * [Downstream Controller](#downstream-controller) + * [Upstream Controller](#upstream-controller) + * [Details of api between GC(cloud) and LC(edge)](#details-of-api-between-gccloud-and-lcedge) + * [Workers Communication](#workers-communication) + +# Joint Inference +## Motivation + +Inference on the edge can get a shorter latency and a higher throughput, and inference on the cloud can get better inference precision. +The collaborative inference technology detects hard samples on the edge and sends them to the cloud for inference. +**In this way, simple samples inference on the edge ensures latency and throughput, while hard samples inference on the cloud improves the overall precision.** + + + +### Goals +* Joint inference improves the inference precision without significantly reducing the time and throughput. + + +## Proposal +We propose using Kubernetes Custom Resource Definitions (CRDs) to describe +the joint inference specification/status and a controller to synchronize these updates between edge and cloud. + + + +### Use Cases + +* User can create a joint inference service with providing a training script, + specifying the aggregation algorithm, configuring training hyper parameters, + configuring training datasets. + +* Users can get the joint inference status, including the counts of inference at the edge/cloud. + + + +## Design Details + +### CRD API Group and Version +The `JointInferenceService` CRD will be namespace-scoped. +The tables below summarize the group, kind and API version details for the CRD. + +* JointInferenceService + +| Field | Description | +|-----------------------|-------------------------| +|Group | edgeai.io | +|APIVersion | v1alpha1 | +|Kind | JointInferenceService | + +### Joint inference CRD + + +Below is the CustomResourceDefinition yaml for `JointInferenceService`: + +```yaml +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: jointinferenceservices.edgeai.io +spec: + group: edgeai.io + names: + kind: JointInferenceService + plural: jointinferenceservices + shortNames: + - jointinferenceservice + - jis + scope: Namespaced + versions: + - name: v1alpha1 + subresources: + # status enables the status subresource. + status: {} + served: true + storage: true + schema: + openAPIV3Schema: + type: object + properties: + spec: + type: object + properties: + edgeWorker: + type: object + properties: + name: + type: string + model: + type: object + properties: + name: + type: string + nodeName: + type: string + hardExampleAlgorithm: + type: object + properties: + name: + type: string + workerSpec: + type: object + properties: + scriptDir: + type: string + scriptBootFile: + type: string + frameworkType: + type: string + frameworkVersion: + type: string + parameters: + type: array + items: + type: object + properties: + key: + type: string + value: + type: string + cloudWorker: + type: object + properties: + name: + type: string + model: + type: object + properties: + name: + type: string + nodeName: + type: string + workerSpec: + type: object + properties: + scriptDir: + type: string + scriptBootFile: + type: string + frameworkType: + type: string + frameworkVersion: + type: string + parameters: + type: array + items: + type: object + properties: + key: + type: string + value: + type: string + status: + type: object + properties: + conditions: + type: array + items: + type: object + properties: + type: + type: string + status: + type: string + lastProbeTime: + type: string + format: date-time + lastTransitionTime: + type: string + format: date-time + reason: + type: string + message: + type: string + startTime: + type: string + format: date-time + completionTime: + type: string + format: date-time + uploadCount: + type: integer + + additionalPrinterColumns: + - name: Age + type: date + jsonPath: .metadata.creationTimestamp +``` + +### Joint inference type definition +```go +// +genclient +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object +// JointInferenceService defines the joint inference service crd +type JointInferenceService struct { + metav1.TypeMeta `json:",inline"` + + metav1.ObjectMeta `json:"metadata"` + + Spec JointInferenceServiceSpec `json:"spec"` +} + +// JointInferenceServiceSpec describes the details configuration +type JointInferenceServiceSpec struct { + EdgeWorker EdgeWorker `json:"edgeWorker"` + CloudWorker CloudWorker `json:"cloudWorker"` +} + +// EdgeWorker describes the edge worker +type EdgeWorker struct { + Name string `json:"name"` + Model modelRefer `json:"model"` + NodeName string `json:"nodeName"` + HardExampleAlgorithm HardExampleAlgorithm `json:"hardExampleAlgorithm"` + EdgeWorkerSpec CommonWorkerSpec `json:"workerSpec"` +} + +// EdgeWorker describes the cloud worker +type CloudWorker struct { + Name string `json:"name"` + Model modelRefer `json:"model"` + NodeName string `json:"nodeName"` + CloudWorkerSpec CommonWorkerSpec `json:"workerSpec"` +} + +type modelRefer struct { + Name string `json:"name"` +} + +type HardExampleAlgorithm struct { + Name string `json:"name"` +} + +type CommonWorkerSpec struct { + ScriptDir string `json:"scriptDir"` + ScriptBootFile string `json:"scriptBootFile"` + FrameworkType string `json:"frameworkType"` + FrameworkVersion string `json:"frameworkVersion"` + Parameters ParaSpec `json:"parameters"` +} + +// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object + +// JointInferenceServiceList is a list of joint inference services. +type JointInferenceServiceList struct { + metav1.TypeMeta `json:",inline"` + metav1.ListMeta `json:"metadata"` + Items []JointInferenceService `json:"items"` +} +``` + +#### Validation +[Open API v3 Schema based validation](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#validation) can be used to guard against bad requests. +Invalid values for fields ( example string value for a boolean field etc) can be validated using this. + +Here is a list of validations we need to support : +1. The `dataset` specified in the crd should exist in k8s. +1. The `model` specified in the crd should exist in k8s. +1. The edgenode name specified in the crd should exist in k8s. + +### joint inference sample +```yaml +apiVersion: edgeai.io/v1alpha1 +kind: JointInferenceService +metadata: + name: helmet-detection-demo + namespace: default +spec: + edgeWorker: + name: "edgeworker" + model: + name: "small-model" + nodeName: "edge0" + hardExampleAlgorithm: + name: "IBT" + workerSpec: + scriptDir: "/code" + scriptBootFile: "edge_inference.py" + frameworkType: "tensorflow" + frameworkVersion: "1.18" + parameters: + - key: "nms_threshold" + value: "0.6" + cloudWorker: + name: "work" + model: + name: "big-model" + nodeName: "solar-corona-cloud" + workerSpec: + scriptDir: "/code" + scriptBootFile: "cloud_inference.py" + frameworkType: "tensorflow" + frameworkVersion: "1.18" + parameters: + - key: "nms_threshold" + value: "0.6" +``` + +## Controller Design +The joint inference controller starts three separate goroutines called `upstream`, `downstream` and `joint-inference`controller. These are not separate controllers as such but named here for clarity. +- joint inference: watch the updates of joint-inference-task crds, and create the workers to complete the task. +- downstream: synchronize the joint-inference updates from the cloud to the edge node. +- upstream: synchronize the joint-inference updates from the edge to the cloud node. + +### Joint Inference Controller + + +The joint-inference controller watches for the updates of joint-inference tasks and the corresponding pods against the K8S API server. +Updates are categorized below along with the possible actions: + +| Update Type | Action | +|-------------------------------|---------------------------------------------- | +|New Joint-inference-service Created |Create the aggregation worker and these local-training workers| +|Joint-inference-service Deleted | NA. These workers will be deleted by gc.| +|The corresponding pod created/running/completed/failed | Update the status of joint-inference task.| + + +### Downstream Controller + + +The downstream controller watches for joint-inference updates against the K8S API server. +Updates are categorized below along with the possible actions that the downstream controller can take: + +| Update Type | Action | +|-------------------------------|---------------------------------------------- | +|New Joint-inference-service Created |Sends the task information to LCs.| +|Joint-inference-service Deleted | The controller sends the delete event to LCs.| + +### Upstream Controller + + +The upstream controller watches for joint-inference-task updates from the edge node and applies these updates against the API server in the cloud. +Updates are categorized below along with the possible actions that the upstream controller can take: + +| Update Type | Action | +|------------------------------- |---------------------------------------------- | +|Joint-inference-service Reported State Updated | The controller appends the reported status of the Joint-inference-service in the cloud. | + +### Details of api between GC(cloud) and LC(edge) +1. GC(downstream controller) syncs the task info to LC: + ```go + // POST <namespace>/jointinferenceservices/<name> + // body same to the task crd of k8s api, omitted here. + ``` + +1. LC uploads the task status which reported by the worker to GC(upstream controller): + ```go + // POST <namespace>/jointinferenceservices/<name>/status + + // WorkerMessage defines the message from that the training worker. It will send to GC. + type WorkerMessage struct { + Phase string `json:"phase"` + Status string `json:"status"` + Output *WorkerOutput `json:"output"` + } + // + type WorkerOutput struct { + TaskInfo *TaskInfo `json:"taskInfo"` + } + + // TaskInfo defines the task information + type TaskInfo struct { + // Current training round + Uploaded int `json:"uploaded"` + } + ``` + +## Workers Communication + + +Todo: complete the restful api. \ No newline at end of file