Indexing

Adobe Experience Manager can keep an index of all the published pages in a particular section of your website. This index is commonly used to build lists and feeds, and to enable search and filtering for your pages or content fragments.

AEM keeps this index in a spreadsheet when using Google Drive or SharePoint as the backend, and makes it accessible as JSON. See the document Spreadsheets and JSON for more information.
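As a rough illustration of consuming the index as JSON, the sketch below parses a payload and lists pages, newest first. The payload layout (a top-level `data` array with one object per page, alongside `total`, `offset`, and `limit`), the sample paths, and the numeric `lastModified` values are assumptions made for this example; the document Spreadsheets and JSON remains the authoritative description of the format.

```python
import json

# Hypothetical sample payload; the field layout is an assumption for illustration.
SAMPLE = """
{
  "total": 2,
  "offset": 0,
  "limit": 2,
  "data": [
    {"path": "/blog/first-post", "title": "First post", "lastModified": "1700000000"},
    {"path": "/blog/second-post", "title": "Second post", "lastModified": "1700003600"}
  ]
}
"""


def list_pages(payload: str) -> list[tuple[str, str]]:
    """Return (path, title) pairs, newest first, from an index JSON payload."""
    rows = json.loads(payload).get("data", [])
    # Treat lastModified as a numeric timestamp; missing values sort last.
    rows = sorted(rows, key=lambda r: int(r.get("lastModified") or 0), reverse=True)
    return [(r["path"], r.get("title", "")) for r in rows]


pages = list_pages(SAMPLE)
```

In a real project the payload would come from the JSON resource your index is published to rather than from an inline string.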

Setting up an initial index with the Index Admin Tool

The easiest way to create and manage your query index is via the Index Admin tool.

  1. Open the Index Admin tool in your browser.
  2. Enter your organization and site to connect to your project.
  3. Click Add Index to create an initial index configuration.
  4. Enter a Name for your index.
  5. Locate the Properties section.
  6. Add each property you want extracted from the rendered HTML page, for example title, image, description, or lastModified.
  7. When you’re done, click Save.


The following table summarizes the available properties and where each value is extracted from.

Name          Description
author        Returns the content of the meta tag named author in the head element.
title         Returns the content of the og:title meta property in the head element.
date          Returns the content of the meta tag named publication-date in the head element.
image         Returns the content of the og:image meta property in the head element.
category      Returns the content of the meta tag named category in the head element.
tags          Returns the content of the meta tag named article:tag in the head element as an array. See the document Spreadsheets and JSON for more information on array handling.
description   Returns the content of the meta tag named description in the head element.
robots        Returns the content of the meta tag named robots in the head element.
lastModified  Returns the value of the Last-Modified response header for the document.
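To make the mapping in the table more concrete, here is a minimal sketch that pulls the same meta values out of a page using Python's standard `html.parser`. It is not the indexer's implementation: the class and function names are hypothetical, the sketch does not check that a meta tag sits inside the head element, and it leaves out `lastModified` because that value comes from a response header rather than from the HTML.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collects index-like properties from <meta> tags (illustration only)."""

    # Maps a meta name/property to the index property it feeds, per the table.
    SINGLE = {
        "author": "author",
        "og:title": "title",
        "publication-date": "date",
        "og:image": "image",
        "category": "category",
        "description": "description",
        "robots": "robots",
    }

    def __init__(self):
        super().__init__()
        self.record = {"tags": []}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key is None or content is None:
            return
        if key == "article:tag":
            # Repeated article:tag entries are gathered into an array.
            self.record["tags"].append(content)
        elif key in self.SINGLE:
            self.record[self.SINGLE[key]] = content


def extract_record(html: str) -> dict:
    parser = HeadMetaParser()
    parser.feed(html)
    return parser.record


SAMPLE_HTML = """<html><head>
<meta property="og:title" content="Hello">
<meta name="description" content="A page">
<meta property="article:tag" content="news">
<meta property="article:tag" content="aem">
<meta name="robots" content="noindex">
</head><body></body></html>"""

record = extract_record(SAMPLE_HTML)
```

Running a page's rendered HTML through a sketch like this can help confirm that the expected meta tags are present before you look at the index itself.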

Reindexing

After setting up or updating your properties, click Reindex in the Index Admin tool. This triggers a full reindex of your content against the new configuration.

Pages are indexed when they are published. To remove a page from the index, unpublish it.

Setting up an initial index via Admin API

It’s also possible to create an index using the Admin API. For more information, visit: Update Indexing Configuration.

Troubleshooting

Check your index

The Admin Service has an API endpoint where you can check the index representation of a page. Given your organization, site, branch, and the resource path of the page, the endpoint is:

https://admin.hlx.page/index/<org>/<site>/<branch>/<path>

You should get a JSON response where the data node contains the index representation of the page.
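The endpoint pattern above can be assembled programmatically. The following is a small sketch, assuming hypothetical organization, site, and page names; only the URL pattern itself comes from this document.

```python
from urllib.parse import quote

ADMIN_INDEX_BASE = "https://admin.hlx.page/index"


def index_check_url(org: str, site: str, branch: str, path: str) -> str:
    """Build the Admin Service URL for checking a page's index record."""
    # Drop the leading slash so the path joins cleanly; inner slashes are kept.
    clean_path = quote(path.lstrip("/"), safe="/")
    return f"{ADMIN_INDEX_BASE}/{quote(org)}/{quote(site)}/{quote(branch)}/{clean_path}"


# "acme", "website", and the page path are placeholder values.
url = index_check_url("acme", "website", "main", "/blog/first-post")
```

The resulting URL can then be requested with any HTTP client, and the data node of the JSON response inspected as described above.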

Debug your index configuration

The AEM CLI can print the index record whenever you change your query configuration, which helps in finding the correct CSS selectors:

$ aem up --print-index

Please see the AEM CLI GitHub documentation for more information and watch this video to learn more about this feature.

Inspect the audit log

Using the Log Viewer Tool, you can check whether the indexer reports any errors related to your configuration. If you filter the logs by Indexer and see lines with a red dot, expand those lines to inspect the reported error.

Custom index definitions

See the Indexing reference for the full syntax of index definitions, including extraction functions and examples.

Omitting published pages from the index

A common use case is to exclude pages that have noindex in their robots metadata from the index. For SharePoint and Google Drive content sources, you can filter those pages out by defining a FILTER expression in a sheet called helix-default that is based on the raw_index sheet. With BYOM content sources, it is not possible to omit pages from the index. Instead, you can filter them client-side based on the value of their robots property.
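The client-side filtering mentioned above could look roughly like the sketch below. It assumes each index record is a dictionary whose `robots` value is a comma-separated list of directives; the function names and sample records are hypothetical, and the same logic would normally live in your site's front-end code.

```python
def is_indexable(record: dict) -> bool:
    """Return False when the record's robots property contains noindex."""
    robots = str(record.get("robots") or "")
    directives = {d.strip().lower() for d in robots.split(",")}
    return "noindex" not in directives


def visible_records(records: list[dict]) -> list[dict]:
    """Keep only the records that should be shown in lists or search results."""
    return [r for r in records if is_indexable(r)]


# Hypothetical index records for illustration.
records = [
    {"path": "/blog/first-post", "robots": ""},
    {"path": "/drafts/internal", "robots": "noindex, nofollow"},
    {"path": "/blog/second-post"},
]

visible = visible_records(records)
```

Records without a `robots` value are treated as indexable here, which is a design choice of this sketch rather than documented behavior.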

Note that a sitemap will always ignore published pages with noindex, provided a column named robots exists in the index source.