# Dataset Development Guide

## Introduction

Sedna provides interfaces and public methods for data conversion and sampling in the Dataset class. A user-defined data processing class can inherit from the Dataset class and reuse these public capabilities.

### 1. Example

The following uses a `txt` file that contains a list of images as an example of how to use the Dataset. The procedure is as follows:

- 1.1. All dataset classes in Sedna inherit from the base class `sedna.datasources.BaseDataSource`. `BaseDataSource` defines the interfaces that a dataset requires, provides attributes such as `data_parse_func`, `save`, and `concat`, and supplies default implementations. A derived class can override these default implementations as required.
```python
from abc import ABC

# Import paths follow the Sedna library layout; adjust them to your version.
from sedna.common.class_factory import ClassFactory, ClassType
from sedna.common.file_ops import FileOps


class BaseDataSource:
    """
    An abstract class representing a :class:`BaseDataSource`.

    All datasets that represent a map from keys to data samples should
    subclass it. All subclasses should overwrite `parse`, supporting get
    train/eval/infer data by a function. Subclasses could also optionally
    overwrite `__len__`, which is expected to return the size of the
    dataset. Overwrite `x` for the feature-embedding, `y` for the target
    label.

    Parameters
    ----------
    data_type : str
        define the datasource is train/eval/test
    func: function
        function use to parse an iter object batch by batch
    """

    def __init__(self, data_type="train", func=None):
        self.data_type = data_type  # sample type: train/eval/test
        self.process_func = None
        if callable(func):
            self.process_func = func
        elif func:
            # a non-callable func is looked up as a registered callback name
            self.process_func = ClassFactory.get_cls(
                ClassType.CALLBACK, func)()
        self.x = None  # sample feature
        self.y = None  # sample label
        self.meta_attr = None  # special in lifelong learning

    def num_examples(self) -> int:
        return len(self.x)

    def __len__(self):
        return self.num_examples()

    def parse(self, *args, **kwargs):
        raise NotImplementedError

    @property
    def is_test_data(self):
        return self.data_type == "test"

    def save(self, output=""):
        return FileOps.dump(self, output)


class TxtDataParse(BaseDataSource, ABC):
    """
    txt file which contain image list parser
    """

    def __init__(self, data_type, func=None):
        super(TxtDataParse, self).__init__(data_type=data_type, func=func)

    def parse(self, *args, **kwargs):
        pass
```
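The subclassing contract shown above can be tried out without installing Sedna. The sketch below uses a hypothetical stand-in, `MiniDataSource`, that mirrors only some of the attributes shown in `BaseDataSource` (`data_type`, `process_func`, `x`, `y`). It is not Sedna's class, and `ListDataParse` is an invented subclass name used purely for illustration.

```python
class MiniDataSource:
    """Hypothetical stand-in mirroring the BaseDataSource attributes above."""

    def __init__(self, data_type="train", func=None):
        self.data_type = data_type
        # Only callables are supported here; Sedna additionally resolves
        # string names through its ClassFactory, which this sketch omits.
        self.process_func = func if callable(func) else None
        self.x = None
        self.y = None

    def num_examples(self):
        return len(self.x)

    def __len__(self):
        return self.num_examples()

    @property
    def is_test_data(self):
        return self.data_type == "test"

    def parse(self, *args, **kwargs):
        raise NotImplementedError


class ListDataParse(MiniDataSource):
    """Invented subclass: parses in-memory (feature, label) pairs."""

    def parse(self, *args, **kwargs):
        pairs = [pair for chunk in args for pair in chunk]
        self.x = [feature for feature, _ in pairs]
        # Test data carries no labels, so y stays empty in that case.
        self.y = [] if self.is_test_data else [label for _, label in pairs]


train = ListDataParse(data_type="train")
train.parse([("a.jpg", 0), ("b.jpg", 1)], [("c.jpg", 1)])
print(len(train), train.y)  # prints: 3 [0, 1, 1]
```

Because `__len__` delegates to `num_examples`, a subclass only has to populate `x` and `y` in `parse` for `len()` to work.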
- 1.2. Define the dataset `parse` function, which replaces the placeholder `parse` in `TxtDataParse`:
```python
import numpy as np


# Method of TxtDataParse; each positional argument is a path to a txt file.
def parse(self, *args, **kwargs):
    x_data = []
    y_data = []
    use_raw = kwargs.get("use_raw")
    for f in args:
        with open(f) as fin:
            if self.process_func:
                # apply the user callback to every stripped line
                res = list(map(self.process_func, [
                    line.strip() for line in fin.readlines()]))
            else:
                # default: split each line on whitespace
                res = [line.strip().split() for line in fin.readlines()]
        for tup in res:
            if not len(tup):
                continue
            if use_raw:
                # keep the whole parsed record as the feature
                x_data.append(tup)
            else:
                x_data.append(tup[0])
                if not self.is_test_data:
                    if len(tup) > 1:
                        y_data.append(tup[1])
                    else:
                        # no label column: fall back to 0
                        y_data.append(0)
    self.x = np.array(x_data)
    self.y = np.array(y_data)
```
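To make the branches of the parse function above concrete, here is a dependency-free sketch that applies the same logic to in-memory lines. The function name `parse_lines` and the sample data are hypothetical, and the sketch returns plain lists rather than NumPy arrays so that it runs without extra packages.

```python
def parse_lines(lines, process_func=None, use_raw=False, is_test_data=False):
    """Mirror the branching of the parse method above on in-memory lines."""
    x_data, y_data = [], []
    stripped = [line.strip() for line in lines]
    if process_func:
        res = [process_func(line) for line in stripped]
    else:
        res = [line.split() for line in stripped]
    for tup in res:
        if not len(tup):
            continue  # blank lines yield an empty record and are skipped
        if use_raw:
            x_data.append(tup)  # keep the whole parsed record as the feature
        else:
            x_data.append(tup[0])
            if not is_test_data:
                # A missing label column falls back to 0.
                y_data.append(tup[1] if len(tup) > 1 else 0)
    return x_data, y_data


sample = ["images/001.jpg cat\n", "\n", "images/002.jpg\n"]

x, y = parse_lines(sample)
print(x, y)  # prints: ['images/001.jpg', 'images/002.jpg'] ['cat', 0]

raw_x, raw_y = parse_lines(sample, use_raw=True)
print(raw_x, raw_y)  # prints: [['images/001.jpg', 'cat'], ['images/002.jpg']] []
```

Note that with `use_raw=True` no labels are collected, so `y` stays empty and any label handling is left to the caller.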
### 2. Commissioning

The preceding implementation can be used directly in a Sedna PipeStep or invoked independently. The code for invoking it independently is as follows:
```python
import os
import unittest

from sedna.datasources import TxtDataParse

# Placeholder: set this to the path of your own txt index file.
train_dataset_url = "path/to/train_data.txt"


def _load_txt_dataset(dataset_url):
    # use original dataset url,
    # see https://github.com/kubeedge/sedna/issues/35
    return os.path.abspath(dataset_url)


class TestDataset(unittest.TestCase):
    def test_txtdata(self):
        train_data = TxtDataParse(data_type="train", func=_load_txt_dataset)
        train_data.parse(train_dataset_url, use_raw=True)
        # passes only if the index file lists exactly one sample
        self.assertEqual(len(train_data), 1)


if __name__ == "__main__":
    unittest.main()
```
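The test above depends on an index file that lists exactly one sample. One way to build such a fixture is sketched below, assuming a temporary directory is acceptable for the test data. `read_index` is a hypothetical stand-in for `TxtDataParse.parse(path, use_raw=True)` so that the sketch runs without Sedna, and the file name `train.txt` and the image path are made up.

```python
import os
import tempfile


def _load_txt_dataset(dataset_url):
    # Same callback as in the test above: resolve to an absolute path.
    return os.path.abspath(dataset_url)


def read_index(path, func):
    """Hypothetical stand-in: apply func to each stripped line of the file."""
    with open(path) as fin:
        return [func(line.strip()) for line in fin]


with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "train.txt")
    # Write a one-line index so the expected sample count is 1.
    with open(index_path, "w") as fout:
        fout.write("images/001.jpg\n")
    samples = read_index(index_path, _load_txt_dataset)

print(len(samples), os.path.isabs(samples[0]))  # prints: 1 True
```

In a real test, the path of the generated file would be passed to `TxtDataParse.parse` in place of `train_dataset_url`.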