Scrolling searches

The Data Lake Scroll API lets you retrieve very large result sets (hundreds of thousands or millions of documents) reliably and in batches.

A normal search is optimized for interactive queries and returns only the first page of results. The Scroll API, by contrast:

  • Creates a point-in-time snapshot of the matching results (as of the initial search).

  • Returns results in fixed-size pages (e.g., 1,000 docs per call).

  • Gives you a _scroll_id to request the next page repeatedly until no hits remain.

Use scrolling when you need to export, backfill, or reprocess many documents. For interactive UI queries, prefer normal search (or PIT + search_after).

Steps to query an index and iterate results

Step A — Run the initial search and start a scroll context

  1. Set a scroll keep-alive (e.g., 1m = one minute).

  2. Choose a page size (size) that fits your memory/network (1,000–5,000 is common).

  3. Sort by "_doc" for efficient scrolling.

POST /datalakeapi/index_name/_search?scroll=1m
{
  "size": 1000,
  "sort": ["_doc"],
  "query": { "match_all": {} }
}

Response:

  • Save the returned _scroll_id.

  • Process the first page’s hits.

TIP

If you only need specific fields, add "_source": ["fieldA","fieldB", ...] to the request body to reduce the response payload.

Step B — Request the next page using the last _scroll_id

  • Always send the latest _scroll_id you received.

  • Keep the same scroll keep-alive in each call (it refreshes the timeout).
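The continuation request is not shown above; a sketch of its common shape follows, assuming the API mirrors the standard scroll endpoint under the same /datalakeapi base path used in Step A (the placeholder scroll ID stands in for the real value you saved):

POST /datalakeapi/_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<the _scroll_id from the previous response>"
}

Note that the index name is not repeated here; the scroll ID already identifies the search context on the server.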

Response:

  • Process this page of events.

  • Replace your stored _scroll_id with the new one from this response.

  • Repeat Step B until events is an empty array [] (no more results).

When you finish (or on error), clear the server-side scroll context(s):
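A sketch of the cleanup call, again assuming the API follows the standard clear-scroll endpoint under the /datalakeapi base path:

DELETE /datalakeapi/_search/scroll
{
  "scroll_id": ["<your latest _scroll_id>"]
}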

You can also pass multiple IDs in the array if you tracked more than one.
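Putting Steps A and B together, the client-side loop can be sketched as follows in Python. The fetch_page callable is a hypothetical stand-in for whatever HTTP client you use to call the API; the part that matters is the loop logic: always carry the newest _scroll_id forward, and stop when events comes back empty.

```python
def scroll_all(fetch_page):
    """Yield every matching document from a scrolling search.

    fetch_page(scroll_id) is a caller-supplied function that returns the
    parsed response as a dict with "_scroll_id" and "events" keys.
    Calling it with scroll_id=None means: run the initial search (Step A);
    any other value means: continue the scroll (Step B).
    """
    scroll_id = None
    while True:
        resp = fetch_page(scroll_id)
        scroll_id = resp["_scroll_id"]  # always replace with the newest ID
        events = resp["events"]
        if not events:                  # empty page: no more results
            break
        yield from events
    # Remember to clear the scroll context (DELETE) with the final
    # scroll_id once the loop finishes or if an error interrupts it.
```

Wrapping the loop in a generator keeps memory flat: each page is handed to the caller before the next one is fetched, which is the point of scrolling large result sets.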
