Indexing reference
In your helix-query.yaml, you can define one or more index definitions. A sample index definition looks as follows: https://gist.github.com/dominique-pfister/92cb67b6f95e1edee6a7d6508b124039
The include and exclude sections dictate which documents get indexed. Everything that is included but not excluded gets indexed. The double asterisk ** matches everything under a prefix, including the prefix itself, so in the example above, the path /en gets indexed as well. If you leave out both sections entirely, everything gets indexed.
The first index, called english, defines itself as the default using a YAML anchor (the ampersand). The second index, called french, reuses that default definition and overrides some attributes.
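Because the sample lives in an external gist, the following is a minimal sketch of what such a definition might look like, not a copy of it. The paths, the target location, and the single title property are illustrative assumptions, and the key names (version, indices, include, exclude, target, properties) are assumed to follow the helix-query.yaml format documented at the link at the end of this page. It shows an include/exclude pair, a YAML anchor on the first index, and a merge key on the second.

```yaml
# Sketch only: paths, target, and property names are illustrative assumptions.
version: 1

indices:
  english: &default          # "&default" is a YAML anchor naming this definition
    include:
      - /en/**               # everything under /en, and /en itself
    exclude:
      - /en/drafts/**        # included above but excluded here, so not indexed
    target: /en/query-index.json
    properties:
      title:
        select: head > meta[property="og:title"]
        value: attribute(el, "content")
  french:
    <<: *default             # reuse the english definition ...
    include:                 # ... and override selected attributes
      - /fr/**
    target: /fr/query-index.json
```

A YAML merge key replaces whole keys rather than merging them deeply: in this sketch french inherits exclude and properties unchanged, while its include and target replace the english values.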
The select property is a CSS selector that grabs all matching HTML elements out of your document. The indexer applies your selectors to the HTML markup, not to the rendered DOM, so you must write them accordingly. To see the exact HTML the indexer works on, right-click the page you want to extract information from and choose View Page Source.
If you just want the first matching element to be returned, use selectFirst instead of select:
first-img:
  selectFirst: img
  value: attribute(el, "src")
To verify that a CSS selector selects what you expect, run the AEM CLI with aem up --print-index, navigate to the page where the selector should extract a meaningful value, and check the console. The CLI uses the helix-query.yaml file from your local filesystem and prints the extracted values, or an empty string if it cannot find the information it is looking for.
aem up --print-index
...
info: Index information for /my/test/page
info: Index: mysite
info: author: "John Smith"
Note that not all CSS selectors are supported. Internally, we use a library called hast-util-select, and the list of supported selectors can be found here: https://github.com/syntax-tree/hast-util-select#support
The value or values property contains an expression to apply to all HTML elements selected. Use value when you need a single string; values, on the other hand, provides an array of all the matches found. The expression can contain a combination of functions and variables:
innerHTML(el)
Returns the HTML content of an element.
# Preserve links and formatting in a rich text snippet
bio:
  select: main > div:first-child p:first-of-type
  value: innerHTML(el)
textContent(el)
Returns the text content of the selected element and all its descendants.
# Extract the text of the first h1 on the page
headline:
  select: main h1
  value: textContent(el)
attribute(el, name)
Returns the value of the attribute with the specified name of an element.
title:
  select: head > meta[property="og:title"]
  value: attribute(el, "content")
match(el, re)
Matches a regular expression against the passed element or text, using parentheses to capture the part to return. In the author example above, the contents of the selected <p> element might be "by John Smith", so the expression would capture everything following "by ".
# Extract year from a date string like "02/15/2025"
year:
  select: head > meta[name="publication-date"]
  value: match(attribute(el, "content"), "\\d{2}\\/\\d{2}\\/(\\d{4})")
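The author case mentioned in the description of match could look like the following sketch. The selector is a hypothetical one chosen for illustration; it assumes the byline sits in the first paragraph of the third section of the page.

```yaml
# Hypothetical selector: adjust it to wherever the byline lives in your markup
author:
  select: main > div:nth-of-type(3) > p:nth-of-type(1)
  # For a paragraph containing "by John Smith", the capture group holds "John Smith"
  value: match(el, "by (.*)")
```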
characters(el, start, end)
Returns the substring from start to end of the given element or text. If start or end is negative, it addresses the position counted from the end of the text. end is optional and defaults to the length of the text.
# First 200 characters of a blog post for card previews
snippet:
  select: main > div p:first-of-type
  value: characters(textContent(el), 0, 200)
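Negative positions can be sketched the same way. The example below reuses the publication-date meta tag from the match example and assumes its content ends in a four-digit year, such as 02/15/2025. Under the semantics described for characters, a start of -4 with end omitted yields the last four characters of the text.

```yaml
# Last four characters of the attribute value, for example "2025" from "02/15/2025"
year:
  select: head > meta[name="publication-date"]
  value: characters(attribute(el, "content"), -4)
```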
words(el, start, end)
Useful for teasers, this selects a range of words out of an HTML element.
# First 50 words of page content as a teaser
description:
  select: main > div p
  value: words(textContent(el), 0, 50)
replace(el, substr, newSubstr)
Replaces the first occurrence of a substring in a text with a replacement.
# Remove a prefix from a title
title:
  select: head > meta[property="og:title"]
  value: replace(attribute(el, "content"), "ACME Corp | ", "")
replaceAll(el, substr, newSubstr)
Replaces all occurrences of a substring in a text with a replacement.
# Replace all underscores with spaces
category:
  select: head > meta[name="category"]
  value: replaceAll(attribute(el, "content"), "_", " ")
parseTimestamp(el, format)
Parses a timestamp given as a string in a custom format, and returns its value as the number of seconds since 1 Jan 1970.
# Parse an authored date in MM/DD/YYYY format
publicationDate:
  select: head > meta[name="publication-date"]
  value: parseTimestamp(attribute(el, "content"), "MM/DD/YYYY")
dateValue(el, format)
Parses a timestamp given as a string, and returns its value as a serial number, where January 1, 1900 is serial number 1. For more information, see the DATEVALUE function.
# Returns an Excel serial date number (useful for spreadsheet sorting/filtering)
lastUpdated:
  select: head > meta[name="last-updated"]
  value: dateValue(attribute(el, "content"), "YYYY-MM-DD")
el
Returns the HTML elements selected by the select property.
# el refers to whatever select matched; pass it to a function to extract a value
heading:
  select: main h1
  value: textContent(el)
path
Returns the path of the HTML document being indexed.
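As a sketch of how path might be used, the following hypothetical property derives a top-level section name from the document path. It assumes that match accepts plain text as well as elements (as in the year example, where it is applied to an attribute value) and that select: none is acceptable when no element is needed (as in the lastModified example for headers).

```yaml
# Hypothetical property: for a document at /en/blog/my-post, the capture group holds "en"
section:
  select: none
  value: match(path, "^/([^/]+)/")
```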
headers[name]
Returns the value of the HTTP response header with the specified name, at the time the HTML document was fetched.
lastModified:
  select: none
  value: parseTimestamp(headers["last-modified"], "ddd, DD MMM YYYY hh:mm:ss GMT")
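To illustrate the difference between value and values described earlier, the following sketch collects every matching element into an array. It assumes the page carries one meta tag with the property article:tag per tag; the property name tags is illustrative.

```yaml
# One array entry per matching meta tag, instead of a single string
tags:
  select: head > meta[property="article:tag"]
  values: attribute(el, "content")
```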
The full reference for the helix-query.yaml format is available here: https://github.com/adobe/helix-shared/blob/main/docs/indexconfig.md