actions
Type: Action[]
Required
Parameter syntax
{
  actions: [
    {
      indexName: 'index_name',
      pathsToMatch: ['url_path', ...],
      fileTypesToMatch: ['file_type', ...],
      autoGenerateObjectIDs: true|false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
      }
    },
  ],
}

About this parameter #

Determines which web pages are translated into Algolia records and in what way.

A single action defines:

  1. the subset of your crawler’s websites it targets,
  2. the extraction process for those websites,
  3. and the indices to which the extracted records are pushed.

A single web page can match multiple actions. In this case, your crawler creates a record for each matched action.

Examples #

{
  actions: [
    {
      indexName: 'dev_blog_algolia',
      pathsToMatch: ['https://blog.algolia.com/**'],
      fileTypesToMatch: ['pdf'],
      autoGenerateObjectIDs: false,
      schedule: 'every 1 day',
      recordExtractor: ({ url, $, contentLength, fileType, dataSources }) => {
        // ...
      }
    },
  ],
}

Parameters #

Action #

name #
type: string
Optional

The unique identifier of this action (useful for debugging). Required if schedule is set.
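For example, an action that sets its own schedule must also be named (the values below are illustrative):

{
  name: 'crawl-blog-daily',      // illustrative identifier (useful for debugging)
  schedule: 'every 1 day',       // because schedule is set, name is required
  indexName: 'dev_blog_algolia',
  // ...
}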

indexName #
type: string
Required

The index name targeted by this action. This value is appended to the indexPrefix, when specified.
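For example, assuming a top-level indexPrefix of 'crawler_' (illustrative values), records from this action are pushed to the crawler_dev_blog_algolia index:

{
  indexPrefix: 'crawler_',
  actions: [
    {
      indexName: 'dev_blog_algolia',
      // records go to the "crawler_dev_blog_algolia" index
      // ...
    },
  ],
}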

schedule #
type: string
Optional

How often to perform a complete crawl for this action. See the top-level schedule property for more information.

pathsToMatch #
type: string[]
Required

Determines which web pages match this action. This list is checked against the URL of web pages using micromatch. You can use negation, wildcards, and more.
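For example (illustrative URLs), you can combine wildcards and negation:

pathsToMatch: [
  'https://www.example.com/docs/**',       // every page under /docs/
  '!https://www.example.com/docs/beta/**', // except the beta section
],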

selectorsToMatch #
type: string[]
Optional

Checks for the presence of DOM nodes matching the given selectors: if the page doesn't contain any node matching them, it's ignored. You can also check for the absence of a selector with negation: for example, to ignore pages that contain a .main class, add !.main to the list.
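For example (illustrative selectors), to only keep pages that contain a .product element and skip those with a .login-required element:

selectorsToMatch: ['.product', '!.login-required'],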

fileTypesToMatch #
type: string[]
default: ["html"]
Optional

Set this value if you want to index documents such as PDFs. The chosen file types are converted to HTML using Tika, then treated as normal HTML pages. See the documents guide for the list of available file types.
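For example, to index regular HTML pages as well as PDF documents:

fileTypesToMatch: ['html', 'pdf'],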

autoGenerateObjectIDs #
type: boolean
default: true
Optional

Generate an objectID for records that don't have one. If this parameter is set to false, the crawler raises an error when an extracted record has no objectID.
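If you set it to false, each record returned by your recordExtractor must carry its own objectID. A minimal sketch (the ID scheme is illustrative):

recordExtractor: ({ url, $ }) => {
  return [
    {
      objectID: url.href, // any stable, unique string works
      title: $('head title').text().trim(),
    },
  ];
}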

recordExtractor #
type: function
Required

A recordExtractor is a custom JavaScript function that lets you execute your own code and extract what you want from a page. Your record extractor should return either an array of JSON objects or an empty array. If the function returns an empty array, the page is skipped.

recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      text: $('p').html(),
      // ... anything you want
    }
  ];
  // return []; skips the page
}

action βž” recordExtractor #

$ #
type: object (Cheerio instance)
Optional

A Cheerio instance containing the HTML of the crawled page.

url #
type: Location object
Optional

A Location object containing the URL and metadata for the crawled page.
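For example, assuming standard Location fields such as href and pathname:

recordExtractor: ({ url, $ }) => {
  return [
    {
      url: url.href,      // full URL of the crawled page
      path: url.pathname, // for example "/blog/my-article"
      title: $('head title').text(),
    },
  ];
}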

fileType #
type: string
Optional

The fileType of the crawled page (e.g.: html, pdf, …).

contentLength #
type: number
Optional

The number of bytes in the crawled page.
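For example, you could skip pages you consider too large (the threshold is illustrative):

recordExtractor: ({ url, $, contentLength }) => {
  if (contentLength > 1000000) {
    return []; // returning an empty array skips the page
  }
  return [{ url: url.href, title: $('head title').text() }];
}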

dataSources #
type: object
Optional

Object containing the external data sources of the current URL. Each key of the object corresponds to the ID of a configured external data source.

{
  dataSources: {
    dataSourceId1: { data1: 'val1', data2: 'val2' },
    dataSourceId2: { data1: 'val1', data2: 'val2' },
  }
}
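For example, you can merge data from an external data source into your records (dataSourceId1 is an illustrative ID):

recordExtractor: ({ url, $, dataSources }) => {
  return [
    {
      url: url.href,
      title: $('head title').text(),
      // merge attributes coming from the external data source
      ...dataSources.dataSourceId1,
    },
  ];
}
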
helpers #
type: object
Optional

Collection of functions to help you extract content and generate records.

recordExtractor βž” helpers #

docsearch #
type: function
Optional

You can call the helpers.docsearch() function from your recordExtractor. It automatically extracts content and formats it to be compatible with DocSearch. It produces an optimized number of records for relevancy and hierarchy, and you can use it without DocSearch or to index non-documentation content.

recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    aggregateContent: true,
    indexHeadings: true,
    recordVersion: 'v3',
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}

You can find more examples in the DocSearch documentation.

splitContentIntoRecords #
type: function
Optional

The helpers.splitContentIntoRecords() function is callable from your recordExtractor. It extracts the textual content of the resource (HTML page or document) and splits it into one or more records. Use it to index the textual content exhaustively while preventing record_too_big errors.

recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // You can still alter produced records
  // afterwards, if needed.
  return records;
}

With the preceding recordExtractor() function, crawling a long HTML page returns an array of records that never exceed the limit of 1000 bytes per record. The records extracted by the splitContentIntoRecords() helper look similar to this:

[
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 0,
    text: 'Welcome on test.com, the best resource to',
  },
  {
    url: 'http://test.com/index.html',
    title: 'Home - Test.com',
    part: 1,
    text: 'find interesting content online.',
  }
]

If automatic generation of objectIDs is enabled in your configuration, the crawler generates an objectID for each of these records.

To prevent duplicate results when a search term appears in multiple records belonging to the same resource (page), we recommend that you enable distinct in your index settings, set attributeForDistinct and searchableAttributes, and add a custom ranking that orders records from the first part of the page to the last:

initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: [ 'title', 'text' ],
    customRanking: [ 'asc(part)' ],
  }
}

Be aware that using distinct comes with some specific implications.

helpers βž” splitContentIntoRecords #

$elements #
type: string
default: $("body")
Optional

A Cheerio selector that determines from which elements textual content will be extracted and turned into records.

baseRecord #
type: object
default: {}
Optional

Attributes (and their values) to add to all resulting records.

maxRecordBytes #
type: number
default: 10000
Optional

Maximum number of bytes allowed per record in the resulting Algolia index. Refer to the record size limits for your plan to prevent errors related to record size.

textAttributeName #
type: string
default: text
Optional

Name of the attribute in which to store the text of each record.

orderingAttributeName #
type: string
Optional

Name of the attribute in which to store the ordering number of each record (part in the preceding example).

helpers βž” docsearch #

recordProps #
type: object
Required

Main docsearch configuration.

aggregateContent #
type: boolean
default: true
Optional

Whether the helper automatically merges sibling elements and separates them with a line break.

For <p>Foo</p><p>Bar</p>:

{
  aggregateContent: false,
  // creates 2 records
}

{
  aggregateContent: true,
  // creates 1 record
}
indexHeadings #
type: boolean | object
default: true
Optional

Whether the helpers create records for headings.

When false, only records for the content level are created. When { from, to } is provided, only records for heading levels lvlX to lvlY are created.

{
  indexHeadings: false,
  // or, to only create records for lvl4 to lvl6 headings:
  // indexHeadings: { from: 4, to: 6 }
}
recordVersion #
type: string
default: v2
Optional

Change the version of the extracted records. It’s not correlated with the DocSearch version and can be incremented independently.

  • v2: compatible with DocSearch >= @2
  • v3: compatible with DocSearch >= @3

docsearch βž” recordProps #

lvl0 #
type: object
Required

Select the main category of the page. You should index the title and h1 of the page in lvl1.

{
  lvl0: {
    selectors: '.page-category',
    defaultValue: 'documentation'
  }
}
lvl1 #
type: string | string[]
Required

Select the main title of the page.

{
  lvl1: 'head > title'
}
content #
type: string | string[]
Required

Select the content elements of the page.

{
  content: 'body > p, main li'
}
pageRank #
type: string
Optional

Add an attribute pageRank to the extracted records that you can use to boost the relevance of associated records in the index settings. Note that you can pass any numeric value as a string, including negative values.

{
  pageRank: "30"
}
lvl2, lvl3, lvl4, lvl5, lvl6 #
type: string | string[]
Optional

Select other headings of the page.

{
  lvl2: "main h2",
  lvl3: "footer h3",
  lvl4: ["h4", "div.important"],
}
* #
type: string | string[] | object
Optional

All extra keys are added to the extracted records.

{
  myCustomAttribute: '.myCustomClass',
  ogDesc: {
    selectors: 'head meta[name="og:desc"]',
    defaultValue: 'Default description'
  }
}