.. _spec:

================================
Specification
================================

Learnware specification is the central component of the learnware paradigm, linking all processes related to learnwares, including uploading, organizing, searching, deploying, and reusing.
In this section, we first introduce the concept and design of learnware specification within the ``learnware`` package.
We then explore ``regular specification``\ s covering data types including tables, images, and texts.
Lastly, we introduce the ``system specification``, generated specifically for tabular learnwares by the learnware doc system using its knowledge, which enhances learnware management and further characterizes their capabilities.
Concepts & Types
==================

The learnware specification describes the model's specialty and utility in a certain format, allowing the model to be identified and reused by future users who may have no prior knowledge of the learnware.
The ``learnware`` package employs a highly extensible specification design, which consists of two parts:

- **Semantic specification** describes the model's type and functionality through a set of descriptions and tags. Learnwares with similar semantic specifications reside in the same specification island.
- **Statistical specification** characterizes the statistical information contained in the model using various machine learning techniques. It plays a crucial role in locating the appropriate place for the model within the specification island.

When a user searches in the learnware doc system, the system first locates specification islands based on the semantic specification of the user's task,
then pinpoints potentially beneficial learnwares on these islands based on the statistical specification of the user's task.
Statistical Specification
---------------------------

We employ the ``Reduced Kernel Mean Embedding (RKME) Specification`` as the basis for implementing statistical specifications for diverse data types,
with adjustments made according to the characteristics of each data type.
The RKME specification is a recent development in learnware specification design, which captures the data distribution without disclosing the raw data.

There are two types of statistical specifications within the ``learnware`` package: ``regular specification`` and ``system specification``. The former is generated locally
by users to express their model's statistical information. In contrast, the latter is generated by the learnware doc system to enhance learnware management and further characterize the learnwares' capabilities.
Semantic Specification
-----------------------

The semantic specification is a "dict" structure that includes the keywords "Data", "Task", "Library", "Scenario", "License", "Description", and "Name".
For table learnwares, users should additionally provide descriptions for each feature dimension and output dimension through the "Input" and "Output" keywords:

- If "data_type" is "Table", you need to specify the semantics of each dimension of the model's input data for compatibility with tasks in heterogeneous feature spaces.
- If "task_type" is "Classification", you need to provide the semantics of the model's output labels (prediction labels start from 0) for use in classification tasks with heterogeneous output spaces.
- If "task_type" is "Regression", you need to specify the semantics of each dimension of the model's output, making the uploaded learnware suitable for regression tasks with heterogeneous output spaces.
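As an illustration, a semantic specification for a tabular classification learnware might look like the dict below. This is a sketch only: the concrete values (names, descriptions, feature semantics) are hypothetical, and the exact nested layout of each field is an assumption here, so consult the package documentation for the authoritative schema.

```python
# Hypothetical semantic specification for a tabular classification learnware.
# All concrete values below are illustrative, not prescribed by the package.
semantic_spec = {
    "Data": {"Values": ["Table"], "Type": "Class"},
    "Task": {"Values": ["Classification"], "Type": "Class"},
    "Library": {"Values": ["Scikit-learn"], "Type": "Class"},
    "Scenario": {"Values": ["Business"], "Type": "Tag"},
    "License": {"Values": ["MIT"], "Type": "Class"},
    "Description": {"Values": "Predicts customer churn from account features.", "Type": "String"},
    "Name": {"Values": "churn-classifier", "Type": "String"},
    # Required when "data_type" is "Table": semantics of each input dimension.
    "Input": {
        "Dimension": 2,
        "Description": {"0": "account age in months", "1": "monthly spend"},
    },
    # Required when "task_type" is "Classification": semantics of each output
    # label (prediction labels start from 0).
    "Output": {
        "Dimension": 2,
        "Description": {"0": "retained", "1": "churned"},
    },
}
```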
Regular Specification
======================================

The ``learnware`` package provides a unified interface, ``generate_stat_spec``, for generating ``regular specification``\ s across different data types.
Users can pass the training data ``train_x`` (supported types include ``numpy.ndarray``, ``pandas.DataFrame``, and ``torch.Tensor``) as input to generate the ``regular specification`` of the model,
as shown in the following code:

.. code:: python

    from learnware.specification import generate_stat_spec

    data_type = "table"  # supported data types: ["table", "image", "text"]
    regular_spec = generate_stat_spec(type=data_type, x=train_x)
    regular_spec.save("stat.json")
It is worth noting that the above code runs entirely on the user's local computer and does not interact with cloud servers or leak local raw data.

.. note::
    In cases where the model's training data is too large, causing the above code to fail, consider sampling the training data down to a suitable size before generating the specification.
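A minimal sketch of such subsampling using NumPy is shown below; the row budget ``max_rows`` and the data shape are arbitrary choices for illustration, not package defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(size=(200_000, 16))  # stand-in for a large training set

max_rows = 50_000  # arbitrary budget; tune to your memory and time constraints
if train_x.shape[0] > max_rows:
    # Uniform sampling without replacement preserves the row distribution.
    idx = rng.choice(train_x.shape[0], size=max_rows, replace=False)
    sampled_x = train_x[idx]
else:
    sampled_x = train_x

print(sampled_x.shape)  # prints (50000, 16)
```

The specification is then generated on the subsample, e.g. ``generate_stat_spec(type="table", x=sampled_x)``.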
Table Specification
--------------------------

``RKMETableSpecification`` implements the RKME specification and is the basis of tabular learnwares. It facilitates learnware identification and reuse for homogeneous tasks with identical input and output domains.

Image Specification
--------------------------

Image data lives in a higher-dimensional space than other data types. Unlike in lower-dimensional spaces,
metrics based on Euclidean distance (or similar distances) fail in high-dimensional spaces,
which makes measuring the similarity between image samples difficult.
The specification for image data, ``RKMEImageSpecification``, introduces a new kernel function that transforms images implicitly before the RKME calculation.
It employs the Neural Tangent Kernel (NTK) [1]_, a theoretical tool that characterizes the training dynamics of deep neural networks in the infinite-width limit, to enhance the measurement of image similarity in high-dimensional spaces.
Usage & Example
^^^^^^^^^^^^^^^^^^^^^^^^^^

In this part, we show how to generate the Image Specification for the training set of the CIFAR-10 dataset.
The Image Specification is generated on a subset of the CIFAR-10 dataset with ``generate_rkme_image_spec``,
and then saved to the file "cifar10.json" using ``spec.save``.
In many cases, it is difficult to construct the Image Specification on the full dataset.
By randomly sampling a subset of the dataset, we can efficiently construct an Image Specification that still provides a strong statistical description of the full dataset.

.. tip::
    Typically, sampling 3,000 to 10,000 images is sufficient to generate the Image Specification.
.. code-block:: python

    import torchvision
    from torch.utils.data import DataLoader

    from learnware.specification import generate_rkme_image_spec

    cifar10 = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=torchvision.transforms.ToTensor()
    )
    X, _ = next(iter(DataLoader(cifar10, batch_size=len(cifar10))))

    spec = generate_rkme_image_spec(X, sample_size=5000)
    spec.save("cifar10.json")
Raw Data Protection
^^^^^^^^^^^^^^^^^^^^^^^^^^

In the third row of the figure below, we show the eight pseudo-data points with the largest weights :math:`\beta` in the ``RKMEImageSpecification`` generated on the CIFAR-10 dataset.
Notice that the ``RKMEImageSpecification`` generated with the Neural Tangent Kernel (NTK) does not compromise raw data security.
In contrast, the first row shows the behavior of the RBF kernel on image data:
the RBF kernel not only exposes the original data (plotted in the corresponding positions in the second row) but also fails to fully utilize the weights :math:`\beta`.

.. image:: ../_static/img/image_spec.png
    :align: center
Text Specification
--------------------------

Unlike tabular data, each text input is a string of varying length, so we first transform the inputs into equal-length arrays. Sentence embedding is used for this transformation: we choose ``paraphrase-multilingual-MiniLM-L12-v2``, a lightweight multilingual embedding model. We then calculate the RKME specification on the embeddings, just as with tabular data. In addition, we use the ``langdetect`` package to detect and store the language of the text inputs for later search, so that the system can find learnwares that support the language of the user's task.
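To illustrate the variable-length-to-fixed-length step, the toy sketch below maps strings to fixed-size vectors with a hashed bag-of-words. This is only a stand-in for the actual sentence-embedding model named above; ``toy_embed`` and its dimension are invented for illustration, and in practice ``generate_stat_spec(type="text", x=texts)`` performs the embedding internally.

```python
import numpy as np

def toy_embed(texts, dim=16):
    """Map variable-length strings to fixed-length vectors via a hashed
    bag-of-words -- a toy stand-in for the sentence-embedding model."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            # Each token increments one hashed coordinate of the vector.
            vecs[i, hash(token) % dim] += 1.0
    return vecs

texts = [
    "the cat sat on the mat",
    "short",
    "a much longer sentence about learnware specifications",
]
embeddings = toy_embed(texts)
# All inputs now share one fixed shape, ready for RKME computation:
print(embeddings.shape)  # prints (3, 16)
```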
System Specification
======================================

In addition to ``regular specification``\ s, the learnware doc system leverages its knowledge to generate new ``system specification``\ s for learnwares.
The ``system specification`` is generated automatically by the doc system: for newly inserted learnwares, the ``organizer`` generates new system specifications based on the statistical specifications of existing learnwares, facilitating search operations and expanding the search scope.

Currently, the ``learnware`` package implements ``HeteroMapTableSpecification``, which enables learnwares organized by the ``Hetero Market`` to support tasks with varying feature and prediction spaces.
This specification is derived by mapping the ``RKMETableSpecification`` to a unified semantic embedding space, using the heterogeneous engine, a tabular network trained on the feature semantics of all tabular learnwares.
Please refer to `COMPONENTS: Hetero Market <../components/market.html#hetero-market>`_ for implementation details.
References
-----------

.. [1] Adrià Garriga-Alonso, Laurence Aitchison, and Carl Edward Rasmussen. Deep convolutional networks as shallow Gaussian processes. In: *International Conference on Learning Representations (ICLR'19)*, 2019.