Indexing
Adobe Experience Manager offers a way to keep an index of all the published pages in a particular section of your website. This is commonly used to build lists and feeds, and to enable search and filtering for your pages or content fragments.
AEM keeps this index in a spreadsheet when using Google Drive or SharePoint as the backend, and offers access to it as JSON. See the document Spreadsheets and JSON for more information.
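As a rough illustration of consuming the index as JSON, the following Python sketch parses a sample payload and sorts its rows. The payload shape (total, offset, limit, data), the sample rows, and the treatment of lastModified as a Unix timestamp are assumptions made for this example; the Spreadsheets and JSON document remains the reference for the actual format.

```python
import json

# Hypothetical sample payload. The field names follow the sheet-style JSON
# layout assumed for this example, and the rows are made up.
SAMPLE_INDEX_JSON = """
{
  "total": 2,
  "offset": 0,
  "limit": 2,
  "data": [
    {"path": "/blog/first-post", "title": "First post", "lastModified": "1700000000"},
    {"path": "/blog/second-post", "title": "Second post", "lastModified": "1700086400"}
  ]
}
"""


def load_index_rows(raw_json: str) -> list[dict]:
    """Return the list of index rows from a JSON payload."""
    payload = json.loads(raw_json)
    # Each entry in "data" is assumed to represent one indexed page.
    return payload.get("data", [])


def newest_first(rows: list[dict]) -> list[dict]:
    """Sort rows by lastModified, assumed here to be a Unix timestamp."""
    return sorted(rows, key=lambda row: int(row.get("lastModified", 0)), reverse=True)


rows = newest_first(load_index_rows(SAMPLE_INDEX_JSON))
for row in rows:
    print(row["path"], row["title"])
```

In a real project the payload would come from an HTTP request to your site's index JSON rather than from an inline string.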
Setting up an initial index with the Index Admin Tool
The easiest way to create and manage your query index is via the Index Admin tool.
- Open the Index Admin tool in your browser.
- Enter your organization and site to connect to your project.
- Click Add Index to create an initial index configuration.
- Enter a Name for your index.
- Locate the Properties section.
- Add each property you want extracted from the rendered HTML page, for example title, image, description, or lastModified.
- When you’re done, click Save.
The following table summarizes the properties that are available and from where in the HTML page they’re extracted.
| Name | Description |
|---|---|
| author | Returns the content of the meta tag named author in the head element. |
| title | Returns the content of the og:title meta property in the head element. |
| date | Returns the content of the meta tag named publication-date in the head element. |
| image | Returns the content of the og:image meta property in the head element. |
| category | Returns the content of the meta tag named category in the head element. |
| tags | Returns the tag values from the head element as an array. See the document Spreadsheets and JSON for more information on array-handling. |
| description | Returns the content of the meta tag named description in the head element. |
| robots | Returns the content of the meta tag named robots in the head element. |
| lastModified | Returns the value of the Last-Modified response header for the document. |
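To make the table above more concrete, here is a minimal Python sketch that pulls the same meta values out of an HTML string. It is an illustration only, not the indexer's implementation: the MetaCollector class and extract_record function are hypothetical names, the sketch does not check that a meta tag sits inside the head element, and it leaves out tags and lastModified (the latter comes from a response header, not from the HTML).

```python
from html.parser import HTMLParser

# Maps index property names to the (attribute, value) pair that identifies
# the meta tag, following the table above. Illustrative only.
META_SOURCES = {
    "author": ("name", "author"),
    "title": ("property", "og:title"),
    "date": ("name", "publication-date"),
    "image": ("property", "og:image"),
    "category": ("name", "category"),
    "description": ("name", "description"),
    "robots": ("name", "robots"),
}


class MetaCollector(HTMLParser):
    """Collects the content of the meta tags listed in META_SOURCES."""

    def __init__(self) -> None:
        super().__init__()
        self.record: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        for prop, (attr_name, attr_value) in META_SOURCES.items():
            if attributes.get(attr_name) == attr_value:
                # A meta tag without a content attribute yields an empty string.
                self.record[prop] = attributes.get("content") or ""


def extract_record(html: str) -> dict[str, str]:
    """Return the properties found in the given HTML string."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.record


SAMPLE_HTML = """
<html><head>
  <meta name="author" content="Jane Doe">
  <meta property="og:title" content="Hello World">
  <meta name="robots" content="noindex">
</head><body></body></html>
"""

record = extract_record(SAMPLE_HTML)
print(record)
```

Properties whose meta tag is missing from the page are simply absent from the resulting record in this sketch.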
Reindexing
After setting up or updating your properties, click Reindex in the Index Admin tool. This triggers a full reindex of your content against the new configuration.
Pages are indexed when they are published. To remove a page from the index, unpublish it.
Setting up an initial index via Admin API
It’s also possible to create an index using the Admin API. For more information, visit: Update Indexing Configuration.
Troubleshooting
Check your index
The Admin Service has an API endpoint where you can check the index representation of your page. Given your organization, site, branch, and the resource path of a page, the endpoint is:
https://admin.hlx.page/index/<org>/<site>/<branch>/<path>
You should get a JSON response where the data node contains the index representation of the page.
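If you check many pages, building the URL by hand gets tedious. The sketch below assembles the endpoint from its parts in Python. The function name index_check_url and the values acme, website, and main are placeholders, and stripping the leading slash from the resource path is an assumption about how the path segment joins the rest of the URL.

```python
from urllib.parse import quote

ADMIN_INDEX_BASE = "https://admin.hlx.page/index"


def index_check_url(org: str, site: str, branch: str, path: str) -> str:
    """Build the Admin Service URL for checking a page's index record."""
    # Strip the leading slash so the path joins cleanly, then percent-encode
    # each piece while keeping "/" separators inside the resource path.
    resource = quote(path.lstrip("/"), safe="/")
    parts = [quote(org, safe=""), quote(site, safe=""), quote(branch, safe=""), resource]
    return "/".join([ADMIN_INDEX_BASE, *parts])


# "acme", "website", and "main" are placeholder values.
url = index_check_url("acme", "website", "main", "/blog/first-post")
print(url)
```

The resulting URL can then be requested with any HTTP client to retrieve the JSON response described above.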
Debug your index configuration
The AEM CLI can print the index record whenever you change your query configuration, which helps in finding the correct CSS selectors:
$ aem up --print-index
Please see the AEM CLI GitHub documentation for more information and watch this video to learn more about this feature.
Inspect the audit log
Using the Log Viewer Tool, you can check whether the indexer reports any errors related to your configuration. If you filter the logs by Indexer and see lines with a red dot, expand those lines to inspect the reported error.
Custom index definitions
See the Indexing reference for the full syntax of index definitions, including extraction functions and examples.
Omitting published pages from the index
A common use case is to not index pages that have noindex in their robots metadata section. It is possible to filter those out for SharePoint and Google Drive content sources by defining a FILTER expression in a sheet called helix-default that is based on the raw_index sheet. With BYOM content sources, it is not possible to omit pages from the index. Instead, you can filter them client-side based on the value of their robots property.
Note that a sitemap will always ignore published pages with noindex, provided a column named robots exists in the index source.
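For BYOM content sources, filtering on the robots property could look roughly like the following Python sketch. The function names and sample rows are hypothetical, and the sketch assumes the robots value is a comma-separated list of directives; the same logic carries over to browser-side JavaScript.

```python
def is_indexable(row: dict) -> bool:
    """Return False when the row's robots value contains the noindex directive."""
    robots = str(row.get("robots", ""))
    # Assumes directives are comma-separated, e.g. "noindex, nofollow".
    directives = {part.strip().lower() for part in robots.split(",")}
    return "noindex" not in directives


def visible_rows(rows: list[dict]) -> list[dict]:
    """Keep only the rows that are not marked noindex."""
    return [row for row in rows if is_indexable(row)]


# Made-up index rows for illustration.
sample_rows = [
    {"path": "/products/a", "robots": ""},
    {"path": "/drafts/b", "robots": "noindex, nofollow"},
    {"path": "/products/c"},
]

visible_paths = [row["path"] for row in visible_rows(sample_rows)]
print(visible_paths)
```

Rows without a robots property are treated as indexable here, which matches the idea that only pages explicitly marked noindex should be hidden.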