Description
The docstring of esm_datastore.search has an example using re.compile(...). However, this support seems to have broken in recent updates.
import re
import intake
cat = intake.open_esm_datastore('intake-esm/tutorial-catalogs/AWS-CMIP6.json')
# Get institutions that do not start with M
cat.search(institution_id=re.compile('^(?!M.*)'))
This fails with:
TypeError Traceback (most recent call last)
Cell In[37], line 1
----> 1 cat2.search(institution_id=re.compile('^(?!M.*)')).df
File ~/miniforge3/envs/intesm-dev/lib/python3.14/site-packages/pydantic/_internal/_validate_call.py:40, in update_wrapper_attributes.<locals>.wrapper_function(*args, **kwargs)
38 @functools.wraps(wrapped)
39 def wrapper_function(*args, **kwargs):
---> 40 return wrapper(*args, **kwargs)
File ~/miniforge3/envs/intesm-dev/lib/python3.14/site-packages/pydantic/_internal/_validate_call.py:137, in ValidateCallWrapper.__call__(self, *args, **kwargs)
134 if not self.__pydantic_complete__:
135 self._create_validators()
--> 137 res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
138 if self.__return_pydantic_validator__:
139 return self.__return_pydantic_validator__(res)
File ~/Projets/intake-esm/intake_esm/core.py:462, in esm_datastore.search(self, require_all_on, **query)
406 """Search for entries in the catalog.
407
408 Parameters
(...) 458 4 landCoverFrac
459 """
461 # step 1: Search in the base/main catalog
--> 462 esmcat_results = self.esmcat.search(require_all_on=require_all_on, query=query)
464 # step 2: Search for entries required to derive variables in the derived catalogs
465 # This requires a bit of a hack i.e. the user has to specify the variable in the query
466 derivedcat_results = []
File ~/Projets/intake-esm/intake_esm/cat.py:443, in ESMCatalogModel.search(self, query, require_all_on)
415 """
416 Search for entries in the catalog.
417
(...) 432
433 """
435 _query = (
436 query
437 if isinstance(query, QueryModel)
(...) 440 )
441 )
--> 443 results = search(
444 df=self.df, query=_query.query, columns_with_iterables=self.columns_with_iterables
445 )
446 if _query.require_all_on is not None and not results.empty:
447 results = search_apply_require_all_on(
448 df=results,
449 query=_query.query,
450 require_all_on=_query.require_all_on,
451 columns_with_iterables=self.columns_with_iterables,
452 )
File ~/Projets/intake-esm/intake_esm/_search.py:46, in search(df, query, columns_with_iterables)
42 column_is_stringtype = isinstance(
43 df[column].dtype, object | pd.core.arrays.string_.StringDtype
44 )
45 column_has_iterables = column in columns_with_iterables
---> 46 for value in values:
47 if column_has_iterables:
48 mask = df[column].str.contains(value, regex=False)
TypeError: 're.Pattern' object is not iterable
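The last frame iterates over the query values, so a bare compiled pattern reaches a `for` loop and raises. A minimal sketch of that failure and of one possible normalization step follows. The helper name `normalize_query_values` is hypothetical and is not part of intake-esm; this only addresses the "not iterable" error, not the PyArrow limitation described below.

```python
import re

pattern = re.compile("^(?!M.*)")

# A compiled pattern is not iterable, which is what the traceback reports.
try:
    iter(pattern)
except TypeError as err:
    print(err)


def normalize_query_values(value):
    """Hypothetical helper: wrap a scalar (str or re.Pattern) in a list."""
    if isinstance(value, (str, re.Pattern)):
        return [value]
    # Anything else is assumed to already be an iterable of values.
    return list(value)


# After normalization the pattern can be looped over like any other value.
for value in normalize_query_values(pattern):
    print(type(value).__name__)
```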
Case with PyArrow
However, I don't think this is fixable. When opening a catalog from a CSV file, the resulting dataframe has string columns with a large_string[pyarrow] dtype. Pandas then delegates the pattern matching to PyArrow, which does not accept compiled re.Pattern objects.
cat = intake.open_esm_datastore('intake-esm/tests/sample-catalogs/cesm1-lens-netcdf.json')
cat.df.experiment.str.contains(re.compile('^C.*'))
This fails with TypeError: expected bytes, re.Pattern found
Moreover, PyArrow uses a different regex engine than Python: Google RE2. A major difference (to me at least) is that RE2 has no lookaround assertions, so a negative match cannot be written as a negative lookahead. For example, ^(?!CCCma.*) (match strings not starting with "CCCma") is not a valid RE2 pattern.
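A negative lookahead can often be replaced by a plain positive pattern whose result is negated by the caller, which stays inside the syntax both engines accept. A minimal sketch using only Python's `re` and an illustrative list of names (the values are made up, not read from the catalog):

```python
import re

# Illustrative values only; not taken from the catalog.
institutions = ["MIROC", "NCAR", "MPI-M", "CCCma"]

# Python-only pattern: negative lookahead, rejected by RE2.
not_m_lookahead = re.compile(r"^(?!M.*)")

# RE2-compatible pattern: plain positive match, negated by the caller.
starts_with_m = re.compile(r"^M")

with_lookahead = [s for s in institutions if not_m_lookahead.search(s)]
by_negation = [s for s in institutions if not starts_with_m.search(s)]

# Both approaches select the same entries.
assert with_lookahead == by_negation
print(by_negation)  # ['NCAR', 'CCCma']
```

On a dataframe, the equivalent would be to invert the boolean mask of a positive match. Whether intake-esm's search exposes such a negation is not something this issue establishes.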
How to fix
I think we could simply remove the re.compile example from the documentation and note somewhere that, because of the PyArrow usage, intake-esm's search function only officially supports the intersection of Python's and Google RE2's regex syntaxes.
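To make "the intersection of the two syntaxes" more concrete, here is a rough textual heuristic that flags two families of constructs Python accepts and RE2 rejects: lookaround assertions and backreferences. The function name `looks_re2_compatible` is hypothetical, the list is not exhaustive, and a purely textual scan can misfire on escaped or bracketed characters, so treat it as a sketch rather than a validator.

```python
import re

# Constructs accepted by Python's `re` but rejected by RE2.
# Not exhaustive: only lookaround assertions and backreferences are covered.
_RE2_UNSUPPORTED = re.compile(
    r"""
      \(\?<?[=!]    # lookahead and lookbehind assertions
    | \\[1-9]       # numbered backreference
    | \(\?P=        # named backreference
    """,
    re.VERBOSE,
)


def looks_re2_compatible(pattern: str) -> bool:
    """Hypothetical check: True if no known RE2-unsupported construct is found."""
    return _RE2_UNSUPPORTED.search(pattern) is None


for candidate in ["^C.*", "^(?!CCCma.*)", r"(a)\1"]:
    print(candidate, looks_re2_compatible(candidate))
```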
Version information: output of intake_esm.show_versions()
INSTALLED VERSIONS
------------------
cftime: 1.6.5
dask: 2026.3.0
fastprogress: 1.1.3
fsspec: 2026.2.0
gcsfs: 2026.2.0
intake: 2.0.9
intake_esm: 2025.12.12.post7+g414c4cfc1
netCDF4: 1.7.4
pandas: 3.0.2
requests: 2.33.1
s3fs: 2026.2.0
xarray: 2026.4.0
zarr: 3.1.6