
Regex objects not supported #783

@aulemahal

Description

The docstring of esm_datastore.search has an example using re.compile(...). However, this support seems to have broken in recent updates.

import re

import intake

cat = intake.open_esm_datastore('intake-esm/tutorial-catalogs/AWS-CMIP6.json')

# Get institutions that do not start with M
cat.search(institution_id=re.compile('^(?!M.*)'))

This fails with:

TypeError                                 Traceback (most recent call last)
Cell In[37], line 1
----> 1 cat2.search(institution_id=re.compile('^(?!M.*)')).df

File ~/miniforge3/envs/intesm-dev/lib/python3.14/site-packages/pydantic/_internal/_validate_call.py:40, in update_wrapper_attributes.<locals>.wrapper_function(*args, **kwargs)
     38 @functools.wraps(wrapped)
     39 def wrapper_function(*args, **kwargs):
---> 40     return wrapper(*args, **kwargs)

File ~/miniforge3/envs/intesm-dev/lib/python3.14/site-packages/pydantic/_internal/_validate_call.py:137, in ValidateCallWrapper.__call__(self, *args, **kwargs)
    134 if not self.__pydantic_complete__:
    135     self._create_validators()
--> 137 res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
    138 if self.__return_pydantic_validator__:
    139     return self.__return_pydantic_validator__(res)

File ~/Projets/intake-esm/intake_esm/core.py:462, in esm_datastore.search(self, require_all_on, **query)
    406 """Search for entries in the catalog.
    407 
    408 Parameters
   (...)    458 4    landCoverFrac
    459 """
    461 # step 1: Search in the base/main catalog
--> 462 esmcat_results = self.esmcat.search(require_all_on=require_all_on, query=query)
    464 # step 2: Search for entries required to derive variables in the derived catalogs
    465 # This requires a bit of a hack i.e. the user has to specify the variable in the query
    466 derivedcat_results = []

File ~/Projets/intake-esm/intake_esm/cat.py:443, in ESMCatalogModel.search(self, query, require_all_on)
    415 """
    416 Search for entries in the catalog.
    417 
   (...)    432 
    433 """
    435 _query = (
    436     query
    437     if isinstance(query, QueryModel)
   (...)    440     )
    441 )
--> 443 results = search(
    444     df=self.df, query=_query.query, columns_with_iterables=self.columns_with_iterables
    445 )
    446 if _query.require_all_on is not None and not results.empty:
    447     results = search_apply_require_all_on(
    448         df=results,
    449         query=_query.query,
    450         require_all_on=_query.require_all_on,
    451         columns_with_iterables=self.columns_with_iterables,
    452     )

File ~/Projets/intake-esm/intake_esm/_search.py:46, in search(df, query, columns_with_iterables)
     42 column_is_stringtype = isinstance(
     43     df[column].dtype, object | pd.core.arrays.string_.StringDtype
     44 )
     45 column_has_iterables = column in columns_with_iterables
---> 46 for value in values:
     47     if column_has_iterables:
     48         mask = df[column].str.contains(value, regex=False)

TypeError: 're.Pattern' object is not iterable

Case with PyArrow

However, I don't think this is fixable. When a catalog is opened from a CSV file, the resulting dataframe has string columns with a large_string[pyarrow] dtype. Pandas then delegates the pattern matching to PyArrow, which doesn't support re.Pattern objects.

cat = intake.open_esm_datastore('intake-esm/tests/sample-catalogs/cesm1-lens-netcdf.json')
cat.df.experiment.str.contains(re.compile('^C.*'))

This fails with TypeError: expected bytes, re.Pattern found
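A possible stopgap on the caller's side, sketched under the assumption that the search still accepts plain pattern strings, is to unwrap compiled patterns to their source string before passing them on. Note that `to_pattern_strings` is a hypothetical helper, not part of intake-esm, and that any flags set on the compiled pattern (e.g. re.IGNORECASE) are lost.

```python
import re


def to_pattern_strings(value):
    """Return query values as a list of plain values (hypothetical helper).

    Compiled patterns are replaced by their source string; flags are dropped.
    """
    # A lone string or compiled pattern is treated as a single query value.
    if isinstance(value, (str, re.Pattern)):
        value = [value]
    return [v.pattern if isinstance(v, re.Pattern) else v for v in value]


print(to_pattern_strings(re.compile("^C.*")))          # ['^C.*']
print(to_pattern_strings(["CESM", re.compile("^C")]))  # ['CESM', '^C']
```

This only helps when the unwrapped pattern is also valid for the engine that ends up evaluating it.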

Moreover, PyArrow uses a different regex engine than Python: Google RE2. A major difference (to me at least) is that RE2 has no negative lookahead. For example, ^(?!CCCma.*) (match strings not starting with "CCCma") is not a valid RE2 pattern.
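For the specific "does not start with" case, the lookahead can be avoided by matching the positive pattern and inverting the result. The sketch below uses only Python's re module on a made-up list of names to show that the two formulations select the same entries; on a dataframe the same idea would presumably be an inverted boolean mask, which is an assumption and not something cat.search exposes.

```python
import re

# Made-up sample values, for illustration only.
institutions = ["MIROC", "MPI-M", "NCAR", "CCCma"]

negative = re.compile(r"^(?!M)")  # needs lookahead, so not valid in RE2
positive = re.compile(r"^M")      # plain anchor, valid in both engines

with_lookahead = [name for name in institutions if negative.search(name)]
by_inversion = [name for name in institutions if not positive.search(name)]

print(with_lookahead)  # ['NCAR', 'CCCma']
print(by_inversion)    # ['NCAR', 'CCCma']
```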

How to fix

I think we could simply remove the example using re.compile from the documentation and note somewhere that, because of the PyArrow usage, intake-esm's search function only officially supports the intersection of Python's and Google RE2's regex syntaxes.
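To make "the intersection" a bit more concrete, here is a rough heuristic that flags the constructs I know RE2 lacks (lookaround assertions and backreferences). It is a sketch, not an exhaustive or escape-aware check, and `looks_re2_compatible` is a hypothetical name, not an intake-esm function.

```python
import re

# Constructs that Python's re accepts but RE2 rejects (not an exhaustive list).
_RE2_UNSUPPORTED = re.compile(
    r"\(\?<?[=!]"    # lookaround: (?=  (?!  (?<=  (?<!
    r"|\(\?P=\w+\)"  # named backreference: (?P=name)
    r"|\\[1-9]"      # numbered backreference: \1 .. \9
)


def looks_re2_compatible(pattern: str) -> bool:
    """Heuristic: False if the pattern uses a construct RE2 is known to lack."""
    return _RE2_UNSUPPORTED.search(pattern) is None


print(looks_re2_compatible("^C.*"))          # True
print(looks_re2_compatible("^(?!CCCma.*)"))  # False
```

Because it scans the raw pattern text, it can misjudge patterns with escaped parentheses or backslashes, so it is only suitable as a hint in documentation or a warning message.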

Version information: output of intake_esm.show_versions()


INSTALLED VERSIONS
------------------

cftime: 1.6.5
dask: 2026.3.0
fastprogress: 1.1.3
fsspec: 2026.2.0
gcsfs: 2026.2.0
intake: 2.0.9
intake_esm: 2025.12.12.post7+g414c4cfc1
netCDF4: 1.7.4
pandas: 3.0.2
requests: 2.33.1
s3fs: 2026.2.0
xarray: 2026.4.0
zarr: 3.1.6
