Indexing
Adobe Experience Manager offers a way to keep an index of all the published pages in a particular section of your website. This is commonly used to build lists, feeds, and enable search and filtering use cases for your pages or content fragments.
AEM keeps this index in a spreadsheet and offers access to it using JSON. Please see the document Spreadsheets and JSON for more information.
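As a rough illustration of how a list or feed might consume the index, the sketch below parses a JSON payload and orders the rows by most recent modification. The payload layout (a data array of rows keyed by column name), the numeric lastModified values, and the sample paths are assumptions made for illustration; the Spreadsheets and JSON document describes the actual format.

```python
import json

# Hypothetical payload, assumed to resemble what a query-index JSON returns.
SAMPLE = """
{
  "total": 2,
  "offset": 0,
  "limit": 2,
  "data": [
    {"path": "/blog/first", "title": "First post", "lastModified": "1700000000"},
    {"path": "/blog/second", "title": "Second post", "lastModified": "1700000500"}
  ]
}
"""


def newest_first(payload_text):
    """Return (path, title) pairs ordered by lastModified, newest first."""
    rows = json.loads(payload_text).get("data", [])
    # Assumption: lastModified is a numeric timestamp stored as text.
    ordered = sorted(
        rows,
        key=lambda row: int(row.get("lastModified") or 0),
        reverse=True,
    )
    return [(row["path"], row.get("title", "")) for row in ordered]


if __name__ == "__main__":
    for path, title in newest_first(SAMPLE):
        print(path, title)
```

In a real project the payload would be fetched from the site rather than embedded as a string.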
Setting Up an Initial Query Index
In this section we’ll create a query index in the root folder that will index all documents in your backend.
- After setting up your fstab.yaml with a mountpoint that points into your SharePoint site or Google Drive, go to the root folder.
- Depending on your backend, create either a workbook named query-index.xlsx for SharePoint or a spreadsheet named query-index for Google Drive.
- In that spreadsheet or workbook, create a sheet named raw_index.
Setting Up Properties to be Added to the Index
- In your query-index document, add a header line and in the first column add path as the header name.
- In the following columns of the header line, add all other properties you need extracted from the rendered HTML page.
Pages are indexed when they are published; to remove a page from the index, unpublish it. If you haven’t defined a custom index definition in helix-query.yaml, pages that have robots set to noindex are omitted from indexing by default (see robots below).
In the following example in Google Drive, the extracted fields are title, image, description, and lastModified.
The following table summarizes the properties that are available and from where in the HTML page they’re extracted.
Name | Description |
---|---|
author | Returns the content of the meta tag named author in the head element. |
title | Returns the content of the og:title meta property in the head element. |
date | Returns the content of the meta tag named publication-date in the head element. |
image | Returns the content of the og:image meta property in the head element. |
category | Returns the content of the meta tag named category in the head element. |
tags | Returns the content of the meta tag named … See the document Spreadsheets and JSON for more information on array-handling. |
description | Returns the content of the meta tag named description in the head element. |
robots | Returns the content of the meta tag named … Note: It is recommended to not add a … |
lastModified | Returns the value of the Last-Modified response header for the document. |
For every other header added, the indexer will try to find a meta tag with a corresponding name.
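The mapping in the table can be sketched in a few lines of Python. This is not the indexer's implementation, only an approximation of the lookup rules described above: title, image, and date use the special names from the table, and any other column falls back to a meta tag of the same name. The helper names, the sample page, and the teaser column are hypothetical. For brevity the sketch does not restrict matches to the head element and does not cover tags, robots, or lastModified.

```python
from html.parser import HTMLParser

# Columns whose meta name differs from the column name (taken from the table).
SPECIAL_NAMES = {
    "title": "og:title",
    "image": "og:image",
    "date": "publication-date",
}


class MetaCollector(HTMLParser):
    """Collect the content of <meta> tags, keyed by their name or property."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        if key and "content" in attributes:
            # Keep the first occurrence if a tag is repeated.
            self.meta.setdefault(key, attributes["content"])


def index_record(path, html, columns):
    """Build one index row for the given header columns (hypothetical helper)."""
    collector = MetaCollector()
    collector.feed(html)
    record = {"path": path}
    for column in columns:
        meta_name = SPECIAL_NAMES.get(column, column)
        record[column] = collector.meta.get(meta_name, "")
    return record


PAGE = """<html><head>
<meta property="og:title" content="Hello">
<meta name="description" content="A sample page">
<meta name="publication-date" content="2024-01-15">
<meta name="teaser" content="Custom value">
</head><body></body></html>"""

if __name__ == "__main__":
    print(index_record("/hello", PAGE, ["title", "description", "date", "teaser", "image"]))
```

Columns with no matching meta tag, such as image in the sample page, come back as empty strings in this sketch.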
Activate Your Index
To activate your index, preview the spreadsheet using the sidekick. This will create an index configuration from the spreadsheet that can be viewed at the following location:
https://<branch>--<repo>--<owner>.hlx.page/helix-query.yaml
Checking Your Index
The Admin Service has an API endpoint where you can check the index representation of your page. Given your GitHub owner, repository, and branch, and a resource path to a page, the endpoint is:
https://admin.hlx.page/index/<owner>/<repo>/<branch>/<path>
You should get a JSON response where the data node contains the index representation of the page.
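A small helper can assemble this URL and pull out the data node. The sketch below follows the endpoint pattern shown above. The function names, the owner, repository, and path values, and the exact layout of the sample response are assumptions for illustration only.

```python
import json
from urllib.parse import quote


def admin_index_url(owner, repo, branch, path):
    """Build the Admin Service index URL for a page path."""
    # Drop the leading slash so the URL does not contain "//" before the path.
    clean_path = quote(path.lstrip("/"))
    return f"https://admin.hlx.page/index/{owner}/{repo}/{branch}/{clean_path}"


def extract_record(response_text):
    """Return the data node of the JSON response, or None if it is absent."""
    return json.loads(response_text).get("data")


# Hypothetical response body; the real response may carry additional fields.
SAMPLE_RESPONSE = '{"data": {"path": "/blog/first", "title": "First post"}}'

if __name__ == "__main__":
    print(admin_index_url("acme", "site", "main", "/blog/first"))
    print(extract_record(SAMPLE_RESPONSE))
```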
Debugging Your Index Configuration
The AEM CLI has a feature where it will print the index record whenever you change your query configuration, which assists in finding the correct CSS selectors:
$ aem up --print-index
Please see the AEM CLI GitHub documentation for more information and watch this video to learn more about this feature.
Setting Up More Index Configurations
You can define your own custom index configurations by creating your own helix-query.yaml.
This allows you to have more than one index configuration in the same helix-query.yaml, where parts of your site are indexed into different Excel workbooks or Google spreadsheets. See the document Indexing reference for more information.
Enforcing “noindex” configuration with custom index definitions
If you have defined your own custom index definitions in helix-query.yaml, setting the robots property to noindex does not prevent pages from being indexed. To enforce the noindex configuration in such situations, do the following:
- Create a sheet named “helix-default” in your query-index.xlsx. After this, your query-index.xlsx spreadsheet should have two sheets, “raw_index” and “helix-default”. The “raw_index” sheet holds all the raw indexed data.
- Modify your custom helix-query.yaml (it must be in your project’s GitHub repository) and add the robots property so that it gets indexed.
- Set up your “helix-default” sheet in the query-index.xlsx spreadsheet so that it is filled automatically by an Excel formula that keeps rows in raw_index whose robots property is set to noindex from being copied over to the helix-default sheet. This can be done with an Excel formula like this: =FILTER(Table1,NOT(Table1[robots]="noindex"))
- Your helix-default sheet now has only the rows from raw_index that do not have the robots property set to noindex.
- Ensure that you publish the pages that you want to get indexed.
- If you now fetch the index as usual, for example from https://<branch>--<repo>--<owner>.hlx.page/query-index.json, you only get data from the helix-default sheet, that is, entries that are not explicitly prevented from being indexed through the robots property set to noindex.
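The effect of the FILTER formula in the steps above can be mirrored in a short Python sketch, which may help when reasoning about which rows end up in helix-default. The row data and the function name are hypothetical. The comparison is lowercased because Excel's = operator compares text case-insensitively.

```python
def helix_default_rows(raw_index):
    """Keep every row whose robots value is not exactly 'noindex'."""
    # Mirrors =FILTER(Table1,NOT(Table1[robots]="noindex")): only an exact
    # (case-insensitive) match on "noindex" excludes a row.
    return [
        row for row in raw_index
        if str(row.get("robots", "")).lower() != "noindex"
    ]


# Hypothetical contents of the raw_index sheet.
RAW_INDEX = [
    {"path": "/products/a", "robots": ""},
    {"path": "/drafts/b", "robots": "noindex"},
    {"path": "/products/c", "robots": "index, follow"},
]

if __name__ == "__main__":
    for row in helix_default_rows(RAW_INDEX):
        print(row["path"])
```

As with the formula, a combined value such as noindex, nofollow is not an exact match and would still be copied over in this sketch.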