I would like to read a file from a repository:
In [1]: import pandas as pd
In [2]: url = 'hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
In [3]: df = pd.read_json(url)
[ ... clipped ... ]
HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json
The above exception was the direct cause of the following exception:
[ ... clipped ... ]
GatedRepoError: 403 Client Error. (Request ID: Root=1-677e5a26-2b5127be18a8627d7ade2b28;1bb7097f-b2b1-4e2d-bb9f-fe47b4b0b984)
Cannot access gated repo for url https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json.
Access to dataset open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details is restricted and you are not in the authorized list. Visit https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details to ask for access.
Is there a programmatic way around this error? I can of course manually visit the website suggested and click the “accept” button, but it would be more convenient to do the same thing via an API – is that possible?
I’m logged in (huggingface-cli login) and my token is in my environment.
That’s right. What I’m after is a programmatic way to accept the agreement, rather than having to visit the HuggingFace website (or using Selenium to do it for me).
I’ve found datasets.load_dataset difficult to add to automated workflows. Accessing files directly has been much more straightforward for me.
I see, it is easier to write your own logic. Still, as far as I know, this file can only be accessed through the load_dataset method.
Additionally, you can select the files you are looking for with the data_files parameter (or a whole directory with data_dir); for example:
from datasets import load_dataset
subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
subset = load_dataset("allenai/c4", data_dir="en")
It looks like manually requesting access is by design. From the Gated datasets documentation:
Requesting access can only be done from your browser.
I’ve made a feature request for API access here.
The current suggestion from the Hugging Face team is to use the method outlined here. There’s an undocumented endpoint from which we can “ask access.”
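For anyone who wants to try the "ask access" route from a script, here is a minimal sketch using only the standard library. The exact path (`/datasets/<repo_id>/ask-access`), the empty POST body, and the helper name `build_ask_access_request` are my assumptions about the undocumented endpoint mentioned above, not a documented API; it may change or behave differently for repos that need manual approval or extra form fields.

```python
import urllib.request


def build_ask_access_request(repo_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for access to a gated dataset.

    ASSUMPTION: the '/ask-access' path and the empty body are guesses at the
    undocumented endpoint; they are not part of any documented Hugging Face API.
    """
    url = f"https://huggingface.co/datasets/{repo_id}/ask-access"
    return urllib.request.Request(
        url,
        data=b"",  # empty body; some gated repos may require extra form fields
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


if __name__ == "__main__":
    import os

    request = build_ask_access_request(
        "open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details",
        os.environ["HF_TOKEN"],
    )
    # Sending the request is what actually asks for access.
    with urllib.request.urlopen(request) as response:
        print(response.status)
```

Building the request separately from sending it makes it easy to inspect the URL and headers before anything goes over the network.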
Hi there!
If you encounter the error GatedRepoError while trying to access a gated dataset on Hugging Face, it indicates that you don’t yet have access to the dataset, or you haven’t accepted its terms and conditions.
Steps to Resolve:
- Manually Accept Access:
Visit the dataset page, e.g., open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details, and click the “Access Repository” button. Once you’ve done this, you should be able to access the dataset programmatically.
- Programmatic Access:
Hugging Face currently doesn’t provide a direct API to accept dataset terms programmatically. However, here’s a way you can simplify the process:
- Ensure Your Token is Set: Log in with huggingface-cli login, or manually set the HF_TOKEN environment variable.
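- Check Access:
You can use the requests library in Python to confirm your access programmatically. Request a file inside the repository rather than the dataset page, because the page itself stays visible even when the files are gated:
import requests
# A file inside the gated repo; the dataset page alone would not reveal a missing authorization
url = 'https://huggingface.co/datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/resolve/main/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
headers = {'Authorization': f'Bearer {your_huggingface_token}'}
# stream=True skips downloading the body; only the status code is needed
response = requests.get(url, headers=headers, stream=True)
if response.status_code == 200:
    print("Access granted")
else:
    print("Access denied. Visit the page to request access.")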
Replace your_huggingface_token with your actual Hugging Face token.
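If you want the check as a reusable helper, here is a rough sketch using only the standard library. It maps an hf:// path to the resolve URL that appears in the traceback in the question, then probes that URL without following redirects. The helper names are hypothetical, and the parsing assumes a `namespace/name` repo id on the default `main` revision.

```python
import urllib.error
import urllib.request

HF_PREFIX = "hf://datasets/"


def hf_path_to_resolve_url(hf_path: str, revision: str = "main") -> str:
    """Map 'hf://datasets/<namespace>/<name>/<file>' to its https resolve URL.

    ASSUMPTION: the repo id has the form 'namespace/name' and the path carries
    no '@revision' suffix.
    """
    if not hf_path.startswith(HF_PREFIX):
        raise ValueError(f"expected a path starting with {HF_PREFIX!r}")
    parts = hf_path[len(HF_PREFIX):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected '<namespace>/<name>/<file path>' after the prefix")
    namespace, name, file_path = parts
    return (
        f"https://huggingface.co/datasets/{namespace}/{name}"
        f"/resolve/{revision}/{file_path}"
    )


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop at the first response so the token is never forwarded to a redirect target."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx status


def can_read(hf_path: str, token: str) -> bool:
    """Return True if the file is reachable with this token, False on 401/403."""
    request = urllib.request.Request(
        hf_path_to_resolve_url(hf_path),
        headers={"Authorization": f"Bearer {token}"},
    )
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(request) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if 300 <= err.code < 400:
            return True  # a redirect to the file storage is treated as access granted
        if err.code in (401, 403):
            return False
        raise
```

Calling `can_read(url, token)` before `pd.read_json(url)` lets an automated workflow fail early with a clear message instead of a long traceback.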
- Access After Authorization:
Once you’ve been granted access (manually or after approval), you can proceed with:
import pandas as pd
url = 'hf://datasets/open-llm-leaderboard/Qwen__Qwen2-7B-Instruct-details/Qwen__Qwen2-7B-Instruct/samples_leaderboard_bbh_causal_judgement_2024-06-15T15-35-08.515878.json'
df = pd.read_json(url)
print(df.head())
Hope this helps!