诸神缄默不语-个人CSDN博文目录
本文是作者在使用huggingface的datasets包时,出现无法加载数据集和指标的问题,故撰写此博文以记录并分享这一问题的解决方式。以下将依次介绍我的代码和环境、报错信息、错误原理和解决方案。首先介绍数据集的,后面介绍指标的。
系统环境:
操作系统:Linux
Python版本:3.8.12
代码编辑器:VSCode+Jupyter Notebook
datasets版本:2.0.0
数据集的:
代码:
import datasets
dataset=datasets.load_dataset("yelp_review_full")
报错信息:
ConnectionError Traceback (most recent call last)
/tmp/ipykernel_21708/3707219471.py in
----> 1 dataset=datasets.load_dataset("yelp_review_full")
myenv/lib/python3.8/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, **config_kwargs)
1658
1659 # Create a dataset builder
-> 1660 builder_instance = load_dataset_builder(
1661 path=path,
1662 name=name,
myenv/lib/python3.8/site-packages/datasets/load.py in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, use_auth_token, **config_kwargs)
1484 download_config = download_config.copy() if download_config else DownloadConfig()
1485 download_config.use_auth_token = use_auth_token
-> 1486 dataset_module = dataset_module_factory(
1487 path,
1488 revision=revision,
myenv/lib/python3.8/site-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
1236 f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
1237 ) from None
-> 1238 raise e1 from None
1239 else:
1240 raise FileNotFoundError(
myenv/lib/python3.8/site-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, data_dir, data_files, **download_kwargs)
1173 if path.count("/") == 0: # even though the dataset is on the Hub, we get it from GitHub for now
1174 # TODO(QL): use a Hub dataset module factory instead of GitHub
-> 1175 return GithubDatasetModuleFactory(
1176 path,
1177 revision=revision,
myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
531 revision = self.revision
532 try:
--> 533 local_path = self.download_loading_script(revision)
534 except FileNotFoundError:
535 if revision is not None or os.getenv("HF_SCRIPTS_VERSION", None) is not None:
myenv/lib/python3.8/site-packages/datasets/load.py in download_loading_script(self, revision)
511 if download_config.download_desc is None:
512 download_config.download_desc = "Downloading builder script"
--> 513 return cached_path(file_path, download_config=download_config)
514
515 def download_dataset_infos_file(self, revision: Optional[str]) -> str:
myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
232 if is_remote_url(url_or_filename):
233 # URL, so get it from the cache (downloading if necessary)
--> 234 output_path = get_from_cache(
235 url_or_filename,
236 cache_dir=cache_dir,
myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
580 _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
581 if head_error is not None:
--> 582 raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
583 elif response is not None:
584 raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/datasets/yelp_review_full/yelp_review_full.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))
很明显这是上不了raw.githubusercontent.com的问题。
如果你可以使用代理,最好的解决方式就是直接挂代理运行全程。
对于不方便直接使用代理的情况,以下介绍我使用的解决方案:在本机使用代理,然后将文件上传到运行环境的解决方案。(注意本机和服务器可以是不同操作系统的)
我试过直接把这个Python文件下载下来,然后上传到服务器上,但是操作了半天也不行,因为这个Python文件里面给出的数据下载链接在谷歌云,但是直接把那个数据下下来上传还是不行,修改数据下载链接到S3文件也不行。总之不行,如果有可行的方法请直接给我讲一下。
大略来说,我的成功做法就是现在本地加载数据集,然后储存到磁盘,然后将文件夹上传至服务器,并从磁盘直接加载数据集。
在本地加载数据集并储存到本地磁盘(注意这个路径是Windows系统的路径):
import datasets
dataset=datasets.load_dataset("yelp_review_full",cache_dir='mypath\data\huggingfacedatasetscache')
dataset.save_to_disk('mypath\\data\\yelp_review_full_disk')
将路径文件夹上传到服务器:
可以使用bypy和百度网盘来进行操作,参考我之前撰写的博文bypy:使用Linux命令行上传及下载百度云盘文件(远程服务器大文件传输必备)_诸神缄默不语的博客-CSDN博客_bypy 命令。
先上传到我的应用数据-bypy文件夹中,然后在服务器上下载文件夹(注意下载文件夹是将远程文件夹里的所有文件下载到本地文件夹,而不是直接下载整个文件夹):bypy downdir yelp_full_review_disk mypath/datasets/yelp_full_review_disk
然后在服务器上从磁盘加载数据集:
dataset=datasets.load_from_disk("mypath/datasets/yelp_full_review_disk")
就可以正常使用数据集了:
注意,根据datasets的文档,这个数据集也可以直接存储到S3FileSystem(https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.filesystems.S3FileSystem)上。我觉得这大概也是个类似谷歌云或者百度云那种可公开下载文件的API?感觉会比存储到本地然后转储到服务器更方便。
我没有研究过这个功能,所以没有使用这个。
指标的:
代码:
metric=datasets.load_metric('accuracy')
报错信息:
ConnectionError Traceback (most recent call last)
/tmp/ipykernel_24141/2186493793.py in
----> 1 metric=datasets.load_metric('accuracy')
myenv/lib/python3.8/site-packages/datasets/load.py in load_metric(path, config_name, process_id, num_process, cache_dir, experiment_id, keep_in_memory, download_config, download_mode, revision, **metric_init_kwargs)
1390 """
1391 download_mode = DownloadMode(download_mode or DownloadMode.REUSE_DATASET_IF_EXISTS)
-> 1392 metric_module = metric_module_factory(
1393 path, revision=revision, download_config=download_config, download_mode=download_mode
1394 ).module_path
myenv/lib/python3.8/site-packages/datasets/load.py in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
1322 except Exception as e2: # noqa: if it's not in the cache, then it doesn't exist.
1323 if not isinstance(e1, FileNotFoundError):
-> 1324 raise e1 from None
1325 raise FileNotFoundError(
1326 f"Couldn't find a metric script at {relative_to_absolute_path(combined_path)}. "
myenv/lib/python3.8/site-packages/datasets/load.py in metric_module_factory(path, revision, download_config, download_mode, force_local_path, dynamic_modules_path, **download_kwargs)
1310 elif is_relative_path(path) and path.count("/") == 0 and not force_local_path:
1311 try:
-> 1312 return GithubMetricModuleFactory(
1313 path,
1314 revision=revision,
myenv/lib/python3.8/site-packages/datasets/load.py in get_module(self)
598 revision = self.revision
599 try:
--> 600 local_path = self.download_loading_script(revision)
601 revision = self.revision
602 except FileNotFoundError:
myenv/lib/python3.8/site-packages/datasets/load.py in download_loading_script(self, revision)
592 if download_config.download_desc is None:
593 download_config.download_desc = "Downloading builder script"
--> 594 return cached_path(file_path, download_config=download_config)
595
596 def get_module(self) -> MetricModule:
myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
232 if is_remote_url(url_or_filename):
233 # URL, so get it from the cache (downloading if necessary)
--> 234 output_path = get_from_cache(
235 url_or_filename,
236 cache_dir=cache_dir,
myenv/lib/python3.8/site-packages/datasets/utils/file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, local_files_only, use_etag, max_retries, use_auth_token, ignore_url_params, download_desc)
580 _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
581 if head_error is not None:
--> 582 raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
583 elif response is not None:
584 raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/metrics/accuracy/accuracy.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))
指标的简单一点,只要把这个Python文件下载到本地(这个可以不用代理。免代理下载GitHub文件的方法我没有专门撰写博文,但是可以参考我之前写的类似主题的博文:PyG的Planetoid无法直接下载Cora等数据集的3个解决方式_诸神缄默不语的博客-CSDN博客_planetoid数据集),然后改为调用这个文件即可:
metric=datasets.load_metric('mypath/accuracy.py')
本文撰写过程中所使用的参考资料:
datasets.save_to_disk()
的文档:https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.Dataset.save_to_disk