Why did I get a list instead of `datasets.arrow_dataset.Column`?

I think it’s by design in the datasets library. If you want to convert explicitly, you can also use the .to_*** functions.

# deps: pip install datasets pyarrow pandas
# docs:
# - Column return on column-name indexing: https://huggingface.co/docs/datasets/en/access
# - New Column object in releases: https://github.com/huggingface/datasets/releases
# - Access underlying Arrow table: https://huggingface.co/proxy/discuss.huggingface.co/t/datasets-arrow-help/18880
# - pyarrow.Table.column API: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html

from datasets import Dataset
import pyarrow as pa

def to_conversations(batch):
    convs = []
    for p, s in zip(batch["problem"], batch["generated_solution"]):
        convs.append(
            [{"role": "user", "content": p},
             {"role": "assistant", "content": s}]
        )
    return {"conversations": convs}

# --- minimal toy data ---
base = Dataset.from_dict({
    "problem": ["P1", "P2", "P3"],
    "generated_solution": ["S1", "S2", "S3"],
})

ds = base.map(to_conversations, batched=True)

print("=== REPRO: column-name indexing returns Column ===")
col = ds["conversations"]
print("type(ds['conversations']) =", type(col))              # datasets.arrow_dataset.Column
print("col[0] =", col[0])                                    # first conversation
assert "datasets.arrow_dataset.Column" in str(type(col))     # expected on modern versions

print("\n=== FIX 1: get Python list when you need it ===")
as_list = list(ds["conversations"])                          # materialize as plain list
print("type(list(ds['conversations'])) =", type(as_list))
print("as_list[0] =", as_list[0])
assert isinstance(as_list, list)

print("\n=== FIX 2: get the Arrow column when you need it ===")
arrow_col = ds.data.column("conversations")                  # pyarrow.ChunkedArray
print("type(ds.data.column('conversations')) =", type(arrow_col))
assert isinstance(arrow_col, pa.ChunkedArray)

print("\n=== Reference: row-first vs column-first access ===")
print("row-first type:", type(ds[0]["conversations"]))       # Python object for a single row
print("column-first type:", type(ds["conversations"]))       # Column wrapper
"""
=== REPRO: column-name indexing returns Column ===
type(ds['conversations']) = <class 'datasets.arrow_dataset.Column'>
col[0] = [{'content': 'P1', 'role': 'user'}, {'content': 'S1', 'role': 'assistant'}]

=== FIX 1: get Python list when you need it ===
type(list(ds['conversations'])) = <class 'list'>
as_list[0] = [{'content': 'P1', 'role': 'user'}, {'content': 'S1', 'role': 'assistant'}]

=== FIX 2: get the Arrow column when you need it ===
type(ds.data.column('conversations')) = <class 'pyarrow.lib.ChunkedArray'>

=== Reference: row-first vs column-first access ===
row-first type: <class 'list'>
column-first type: <class 'datasets.arrow_dataset.Column'>
"""

I don’t quite understand. Does it actually get converted to a Column? And what determines the type of newly added columns?

what determines the type of newly added columns?

Basically, the type of the added data itself, or Features if specified.


ds["formatted_conversations"] returns a Column view. Nothing is converted; it exposes the Arrow-backed column. Hugging Face documents that column-name indexing returns a Column object you can index like a list. (Hugging Face)

Type of a newly added column is set as follows:

  • You specify it. Pass a Features schema when creating or mapping. That schema becomes the column’s Arrow type. You can later change it with cast or cast_column. (Hugging Face)
  • If you do not specify it, Datasets infers the type from the Python values your map returns. Inference is Arrow-based. (Hugging Face)
  • Complex returns like list[dict{...}] become nested features (a list of structs; the exact repr, such as List({...}) or [{...}], depends on the datasets version). Features define column names and types. (Hugging Face)
  • The dataset is backed by a PyArrow Table; low-level access is via ds.data.column("col") which yields a ChunkedArray. (Hugging Face)

Minimal patterns:

# control the new column type explicitly
# (`fn` stands for your own mapping function that returns "formatted_conversations")
from datasets import Features, Sequence, Value
features = Features({
    "formatted_conversations": Sequence({"text": Value("string"), "length": Value("int32")})
})
ds = ds.map(fn, batched=False, features=features)  # schema fixed by you

# if already created, change just one column's feature
ds = ds.cast_column("formatted_conversations",
                    Sequence({"text": Value("string"), "length": Value("int32")}))  # cast if compatible

# access choices
col_view = ds["formatted_conversations"]      # Column view
arrow_arr = ds.data.column("formatted_conversations")  # pyarrow.ChunkedArray
py_list  = list(ds["formatted_conversations"])         # plain list

Sources: column access and Column view, features and schema control, casting columns, Arrow backing and column() API. (Hugging Face)

def generate_conversation(examples):
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = []
    for problem, solution in zip(problems, solutions):
        conversations.append([
            {"role" : "user",      "content" : problem},
            {"role" : "assistant", "content" : solution},
        ])
    return { "conversations": conversations, }
print(type(reasoning_dataset.map(generate_conversation, batched=True)["conversations"]))

That’s clear, but the code above did not use the features parameter, so why did it still return a list instead of a Column?

Oh… It seems the behavior depends on the version of the datasets library…


It’s version behavior, not features=.

  • Why you saw a list: older datasets returned a Python list for ds["col"]. Newer versions return a datasets.arrow_dataset.Column view. The features= argument never controls this accessor; it only sets schema. (Hugging Face)

  • What it is now: ds["col"] returns a Column view backed by a PyArrow table. The dataset is Arrow-backed. (Hugging Face)

  • What sets the type of new columns:

    1. Explicit schema you pass via features= in map or later via cast_column/cast.
    2. Otherwise Arrow infers from the Python values your function returns. This becomes dataset.features. (Hugging Face)
  • Make results consistent regardless of version:

    import datasets, pyarrow as pa
    print(datasets.__version__)
    
    col_view = ds["conversations"]              # Column on new versions, list on old
    as_list  = list(ds["conversations"])        # always a Python list
    arrow_ca = ds.data.column("conversations")  # always a pyarrow.ChunkedArray
    

    The Arrow interop is stable because the dataset is a PyArrow table underneath. (Hugging Face)

  • Tip for nested data: Returning list[dict] from map yields a nested list-of-structs feature unless you override with features=. Check with ds.features. (Hugging Face)

If you need Column everywhere, upgrade datasets; if you need lists, wrap with list(...).

The datasets version is “4.0.0”; I’m using it in unsloth’s official notebook. Is that very old?
In my own environment the version is 4.2.0.
So you mean that if the same code were run in my own environment, I’d get a Column?
***
I tried it, and you’re right.

Wait, I tried it in unsloth’s notebook again and it became a Column!

But I clearly remember getting a list type before, and the screenshot at the top proves it.

God, maybe some Halloween ghost is playing tricks on me.

If you want to avoid ambiguity in data types, it’s probably better to cast them explicitly…
It’s too random. :sweat_smile:
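One way to take the randomness out is to normalize at the boundary, so the rest of the code never cares which type the accessor returned. A rough sketch, assuming only that the column object is iterable like a list; `column_as_list` and `FakeColumn` are hypothetical names for this example, and `FakeColumn` is a stand-in, not the real datasets class.

```python
from typing import Any, Iterable, List


def column_as_list(column: Iterable[Any]) -> List[Any]:
    """Return a plain list, whether `column` is already a list or a list-like view."""
    if isinstance(column, list):
        return column
    return list(column)


class FakeColumn:
    """Hypothetical stand-in for a list-like column view (not the real datasets class)."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        return self._values[index]

    def __iter__(self):
        return iter(self._values)


from_view = column_as_list(FakeColumn(["a", "b"]))
from_list = column_as_list(["a", "b"])
print(type(from_view).__name__, from_view)  # list ['a', 'b']
print(from_view == from_list)               # True
```

With a real dataset the call would look like column_as_list(ds["conversations"]), which should give a list on both old and new versions.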