datasets.Dataset(
arrow_table: Table,
info: Optional[DatasetInfo] = None,
split: Optional[NamedSplit] = None,
indices_table: Optional[Table] = None,
fingerprint: Optional[str] = None,
)
Create a Dataset object from an Arrow Table.
datasets.Dataset.from_dict(
mapping: dict,
features: Optional[Features] = None,
info: Optional[Any] = None,
split: Optional[Any] = None,
)
Create a Dataset object from a dict (column name → list of values).
import datasets
d = {"text": [1, 2, 3, 4], "labels": [0, 0, 1, 1]}
dataset = datasets.Dataset.from_dict(d)
datasets.Dataset.from_pandas(
df: pd.DataFrame,
features: Optional[Features] = None,
info: Optional[DatasetInfo] = None,
split: Optional[NamedSplit] = None,
preserve_index: Optional[bool] = None,
)
Create a Dataset object from a pandas DataFrame.
import numpy as np
import pandas as pd
import datasets
a = np.reshape(np.linspace(1, 10, 10), (5, 2))
df = pd.DataFrame(a)
dataset = datasets.Dataset.from_pandas(df)
datasets.Dataset.from_csv(
path_or_paths: Union[PathLike, List[PathLike]],
split: Optional[NamedSplit] = None,
features: Optional[Features] = None,
cache_dir: str = None,
keep_in_memory: bool = False,
**kwargs,
)
Create a Dataset object from one or more CSV files.
datasets.Dataset.from_json(
path_or_paths: Union[PathLike, List[PathLike]],
split: Optional[NamedSplit] = None,
features: Optional[Features] = None,
cache_dir: str = None,
keep_in_memory: bool = False,
field: Optional[str] = None,
**kwargs,
)
Create a Dataset object from one or more JSON (or JSON Lines) files. Use the field argument to select a key inside a nested JSON document.
datasets.Dataset.from_text(
path_or_paths: Union[PathLike, List[PathLike]],
split: Optional[NamedSplit] = None,
features: Optional[Features] = None,
cache_dir: str = None,
keep_in_memory: bool = False,
**kwargs,
)
Create a Dataset object from one or more plain-text files; each line of the file becomes one row.
datasets.Dataset.from_parquet(
path_or_paths: Union[PathLike, List[PathLike]],
split: Optional[NamedSplit] = None,
features: Optional[Features] = None,
cache_dir: str = None,
keep_in_memory: bool = False,
columns: Optional[List[str]] = None,
**kwargs,
)
Create a Dataset object from one or more Parquet files. The columns argument restricts which columns are loaded.
datasets.Dataset.from_file(
filename: str,
info: Optional[DatasetInfo] = None,
split: Optional[NamedSplit] = None,
indices_filename: Optional[str] = None,
in_memory: bool = False,
)
Create a Dataset object from an Arrow file on disk. The file is memory-mapped by default (in_memory=False).
datasets.Dataset.from_buffer(
buffer: pa.Buffer,
info: Optional[DatasetInfo] = None,
split: Optional[NamedSplit] = None,
indices_buffer: Optional[pa.Buffer] = None,
)
Create a Dataset object from an Arrow buffer.
dataset.data
Get the data (Arrow table) backing the Dataset.
dataset.cache_files
Get the cache files containing the Dataset's data (empty for an in-memory dataset).
dataset.num_columns / dataset.num_rows
Get the number of columns and rows in the Dataset.
dataset.column_names
Get the names of the Dataset's columns.
dataset.shape
Get the shape of the Dataset as (num_rows, num_columns).
dataset.unique(column)
Return a list of the unique values in the given column.
from datasets import load_dataset
ds = load_dataset("rotten_tomatoes", split="validation")
ds.unique('label')
dataset.add_column(name, column, new_fingerprint)
Add a column to the Dataset.
from datasets import load_dataset
ds = load_dataset("rotten_tomatoes", split="validation")
more_text = ds["text"]
ds = ds.add_column(name="text_2", column=more_text)
dataset.add_item(item, new_fingerprint)
Add a row (item) to the Dataset.
from datasets import load_dataset
ds = load_dataset("rotten_tomatoes", split="validation")
new_review = {'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'}
ds = ds.add_item(new_review)
ds[-1]
dataset.cast(features)
Cast the Dataset's features to a new set of features.
from datasets import load_dataset, ClassLabel, Value
ds = load_dataset("rotten_tomatoes", split="validation")
ds.features
new_features = ds.features.copy()
new_features['label'] = ClassLabel(names=['bad', 'good'])
new_features['text'] = Value('large_string')
ds = ds.cast(new_features)
ds.features
dataset.cast_column(column, feature)
Cast a single column to a new feature type.
from datasets import load_dataset, ClassLabel
ds = load_dataset("rotten_tomatoes", split="validation")
ds.features
ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
ds.features