A deep dive into cursor-based pagination in MongoDB

Exploring cursor-based pagination in MongoDB and the tricky parts.

A deep dive into cursor-based pagination in MongoDB

One of the improvements we had to make to Engage as we continue to scale is changing the way we paginate some result sets from offset-based pagination to cursor-based pagination. This Mixmax article is a good introduction but in this piece, I explore more details and talk about navigating some tricky parts of cursor-based pagination.

Outline

Let’s start by defining the two types of pagination and why cursor-based pagination.

What is offset-based pagination?

Offset-based pagination lets us paginate by skipping a number of records (the offset). If we have a table with 100 items and want to show 20 items per page (this is called the limit), our first result page will be 1 to 20. To show the next page, i.e 21 to 40, we skip the first 20 items. This is what that will look like in SQL

select * from items limit 20 offset 20;

And in MongoDB:

db.items.find({}, { limit: 20, skip: 20 })

What is cursor-based pagination?

Instead of skipping results to move across pages, cursor-based pagination lets us use a record as a pointer (this is called the cursor) so we can get items after or before that record. Using our 100 items example, our first result page is 1 to 20. To get the next page, we use item 20 (the last item of the page) as our pointer (i.e, the cursor) and get items after it.

select * from items where id > 20 limit 20;
db.items.find({ id: { $gt: 20 } }, { limit: 20 })

But this is putting it in a simple way. We are making the assumption that our items have an id field that is 1 to 100 so we can get items with ids greater than or less than the pointer id (20). This probably won’t be the case in a real-world scenario. But it doesn’t matter. As long as the field is immutable, sortable and unique, we are fine. IDs and date fields are good candidates for this. Even though MongoDB IDs are not auto-increment IDs, you can use them because they are [vaguely] based on timestamp.

// Where 507f191e810c19729de860ea is the _id
// of last item in previous page
db.items.find({
  _id: { $gt: new ObjectId('507f191e810c19729de860ea') }
}, {
  limit: 20
})

Why cursor-based pagination?

Scalability

The challenge with offset-based pagination is that the database needs to scan skipped records before getting to the requested data. This means that on page 5 of our example, the database will still scan items 1 to 80 before returning results for 80 to 100. Scanning 80 items may look small but as the offset increases, the number of records to scan increases.

With cursor-based pagination, the database can directly jump to the “position” of the needed result set. Note that to achieve this desired advantage though, the cursor field has to be indexed. IDs are indexed by default so that is fine. If you are using another field, a date field, for example, ensure it is indexed.

Guarantees accurate results

Another challenge with offset-based pagination is that if there are changes to the dataset (a data addition or removal) during pagination, it may cause wrong results to be shown.

Let’s look at another example with a smaller dataset of 1 to 10 items at 5 items per page.

If between when I was on page 1 and navigating to page 2, items 4 and 5, were removed, page 2 will only show me 8 to 10 instead of 6 to 10

This is because I specified an offset of 5 (as I should) and that tells the database to skip 5 items – 1, 2, 3, 6, 7.

The same thing happens if new items are added. This doesn’t happen in cursor-based pagination as the exact pointer to the last or previous item is specified.

Challenges of cursor-based pagination

Cursor-based pagination is not without its challenges. With offset-based pagination, you can list out the number of pages and allow people easily jump across the pages. In fact, the URL can also be directly modified to jump to a particular page.

offset-based pagination on Github

You can’t do this with cursor-based pagination. This makes it difficult to have an idea of the current position of a result set in the entire dataset. Are we close to the end of the dataset or just at the beginning? Or at the middle?

Implementing offset-based pagination

Getting data for offset-based pagination is simple. This is one of its pros.

const data = await db.collection('items').find({}, {
  limit: itemsPerPage,
  skip: offset
}).toArray()

Considering how the pagination on the client or frontend will be built, we will need to return more than just data. We would at least need to send:

  • The offset that returned the current data so we know our current page and if there is a next or previous page.
  • The total number of items so we know if there is a next page. If offset + count of current items is less than the total number of items, then there is a next page
  • The limit used. The current offset minus limit will be used to get data for the previous page. This is also useful if we need to show the total number of pages (i.e, total divided by limit).

Putting it together, we will have something like this:

getPaginatedResult = async (query) => {
  query = query || {}
  // If a limit is specified, use it. Else default to 20
  let limit = +query.limit || 20
  if (limit < 1) { limit = 20 }
  if (limit > 50) { limit = 50 }
  // If an offset is specified, use it. Else default to 0
  // i.e no offset -> first page
  let skip = +query.offset || 0
  if (skip < 1) { skip = 0 }
  const data = await db.collection('items').find({}, {
    limit, // short for limit: limit
    skip // short for skip: skip
  }).toArray()
  const total = await db.collection('items').countDocuments({})

  return {
    offset: skip,
    limit,
    total,
    data
  }
}

On the frontend or client, we can then do this:

<div v-if="result.offset > 0">
  <a :href="'?offset=' + (result.offset - result.limit) ">Previous</a>
</div>
<div v-if="(result.offset + result.data.length) < result.total">
  <a :href="'?offset=' + (result.offset + result.limit) ">Next</a>
</div>

A form of offset pagination uses page for navigation instead of offset . Behind the scene, it’s the same thing.

getPaginatedResult = async (query) => {
  // ...
  let page = +query.page || 1
  if (page < 1) { page = 1 }
  // Convert page to offset
  const offset = limit * page
  // ...
  return {
    page,
    limit,
    total,
    data
  }
}
<div v-if="result.page > 1">
  <a :href="'?page=' + (result.page - 1) ">Previous</a>
</div>
<div v-if="(result.page * result.limit) < result.total">
  <a :href="'?page=' + (result.page + 1) ">Next</a>
</div>

Getting total

Let’s digress a bit to take a look at getting total. Here is how to count data in a MongoDB collection.

const total = await db.collection('items').countDocuments({ query })

This statement counts all data that matches the query in the collection [by running an aggregation]. It looks simple, but on a collection with a lot of data, this will cause performance issues.

There is a count alternative that uses the collection's metadata for count but has some limitations. For one, you cannot use it with a query. It has to be for the whole collection. It is also not always accurate.

To avoid the performance issue with countDocument, one trick is to limit the count to a number that is not too large. (That number will depend on your database’s performance, so you will need to experiment to know what won’t choke your DB).

let total = await db.collection('items').countDocuments({}, { limit: 10000 })
if (total === 10000) {
  total = -1
}
<div>Showing {{ result.offset + 1 }} to {{ result.offset + result.data.length }} of {{ result.total == -1 ? 'many' : result.total }}</div>
<div class="flex justify-between">
  <div v-if="result.offset > 0">
    <a :href="'?offset=' + (result.offset - result.limit) ">Previous</a>
  </div>
  <div v-if="result.total === -1 || (result.offset + result.data.length) < result.total">
    <a :href="'?offset=' + (result.offset + limit) ">Next</a>
  </div>
</div>

Implementing cursor-based pagination

const data = await db.collection('items').find({}, {
  sort: { _id: -1 },
  limit
}).toArray()

It can be this simple. The frontend can figure out what next cursor to pass to get to the next page or previous page.

<a :href="'?prev_cursor=' + result.data[0]._id ">Previous</a>
<a :href="'?next_cursor=' + result.data[result.data.length - 1]._id ">Next</a>

But we can do better. We can help the frontend know if there is a previous page or if there is a next page. There are a couple of ways to do this. One common pattern is returning a previous_cursor parameter if there is a previous page and returning a next_cursor parameter if there is a next page.

Let’s do that.

getPaginatedResult = async (query) => {
  query = query || {}

  let limit = +query.limit || 20
  if (limit < 1) { limit = 20 }
  if (limit > 50) { limit = 50 }

  const data = await db.collection('items').find({}, {
    sort: { _id: -1 },
    limit
  }).toArray()

  let hasNext, hasPrev, lastItem, firstItem
  if (data.length) {
    lastItem = data[data.length - 1]._id
    firstItem = data[0]._id

    // If there is an item with id less than last item (remember, sort is in desc _id), there is a next page
    const q = { _id: { $lt: lastItem } }
    const r = await db.collection('items').findOne(q)
    if (r) {
      hasNext = true
    }
    // Short form:
    // hasNext = !!await db.collection('items').findOne(q)

    // If there is an item with id greater than first item (remember, sort is in desc _id), there is a previous page
    q._id = {
      $gt: firstItem
    }
    hasPrev = !!await db.collection('items').findOne(q)
  }
  const response = {
    data
  }
  if (hasNext) {
    response.next_cursor = `${lastItem}`
  }
  if (hasPrev) {
    response.previous_cursor = `${firstItem}`
  }
  return response
}

You can add total  to response if you want to. If you are dealing with a large data set, you probably don’t need to unless showing the total number of results is important.

On the frontend:

<div v-if="result.previous_cursor">
  <a :href="'?prev_cursor=' + result.previous_cursor ">Previous</a>
</div>
<div v-if="result.next_cursor">
  <a :href="'?next_cursor=' + result.next_cursor ">Next</a>
</div>

Shaping up well but getPaginatedResult has no support for the cursor parameters yet. Let’s add it.

getPaginatedResult = async (query) => {
  // ...
  const q = {}
  const sort = { _id: -1 }
  if (query.previous_cursor) {
    q._id = { $gt: new ObjectId(query.previous_cursor) }
    sort._id = 1
  } else if (query.next_cursor) {
    q._id = { $lt: new ObjectId(query.next_cursor) }
  }

  const data = await db.collection('items').find(q, {
    sort,
    limit
  }).toArray()
  if (query.previous_cursor) data.reverse()
  // ...
}

Note that if it is previous_cursor, we sort _id in ascending order and reverse the result.

Sorting and paginating with a non-unique field

Sometimes you want to allow users sort the results by a specific field that is not the id field. For example, “last used date” or “event count”.

Let’s take a look at an example where we sort by last_used.

getPaginatedResult = async (query) => {
  // ...
  const sort = {
    last_used: -1
  }
  // ...
  const data = await db.collection('items').find(q, {
    sort,
    limit
  }).toArray()
  if (query.previous_cursor) data.reverse()
  // ...
  if (hasNext) {
    response.next_cursor = new Date(data[data.length - 1].last_used).toISOString()
  }
  if (hasPrev) {
    response.previous_cursor = new Date(data[0].last_used).toISOString()
  }
  // ...
}

(I converted the date to ISO format so that there are no spaces. Makes it easier to work with at the frontend/client-side).

There is a small issue here. If last_used is not unique, our results will be inaccurate. If for example the value of our cursor is 22/6/6 and there are 2 records with that value, our pagination query will omit records with that value.

To solve this, we use two fields as our cursor – in this case _id and last_used.

getPaginatedResult = async (query) => {
  // ...
  if (hasNext) {
    response.next_cursor = data[data.length - 1]._id + '_' + new Date(data[data.length - 1].last_used).toISOString()
  }
  if (hasPrev) {
    response.previous_cursor = data[0]._id + '_' + new Date(data[0].last_used).toISOString()
  }
  // ...
}

This also means we need to sort and query by both fields.

getPaginatedResult = async (query) => {
  // ...
  if (query.previous_cursor) {
    const [cursorId, cursorDate] = query.previous_cursor.split('_')
    q.$or = [
      { last_used: { $lt: new Date(cursorDate) } },
      {
        last_used: new Date(cursorDate),
        _id: { $lt: new ObjectId(cursorId) }
      }
    ]
  } else if (query.next_cursor) {
    const [cursorId, cursorDate] = query.next_cursor.split('_')
    q.$or = [
      { last_used: { $gt: new Date(cursorDate) } },
      {
        last_used: new Date(cursorDate),
        _id: { $gt: new ObjectId(cursorId) }
      }
    ]
  }

  const sort = {
    last_used: -1,
    _id: -1
  }
  // ...
  const data = await db.collection('items').find(q, {
    sort,
    limit
  }).toArray()
  if (query.previous_cursor) data.reverse()
  // ...
}

There is an existing $or operator

We have seen that paginating on two fields requires adding an $or operator to our query. What happens if the query already has an $or operator? For example, imagine we want to paginate this:

const q = {
  $or: [
    { 'owner': user },
    { 'added_by': user }
  ]
}
const data = await db.collection('items').find(q).toArray()

What we do in this case is to use $and to combine the existing $or and the $or for our pagination.

const queryOr = [
  { 'owner': user },
  { 'added_by': user }
]
const paginationOr = [
  { last_used: { $gt: new Date(cursorDate) } },
  {
    last_used: new Date(cursorDate),
    _id: { $gt: new ObjectId(cursorId) }
  }
]
const q = {
  $and: [
    { $or: queryOr },
    { $or: paginationOr }
  ]
}
const data = await db.collection('items').find(q).toArray()

Ascending and descending order sorting

We have been sorting in one direction all along. Can we allow users sort the result by a custom field in both descending and ascending order? Why not?

getPaginatedResult = async (query) => {
  // ...
  const order = {
    query: 'desc',
    key: '',
    sort_order: -1
  }
  if (query.order === 'asc') {
    order.query = 'asc'
  }
  let cursor
  if (query.previous_cursor) {
    cursor = query.previous_cursor
    if (order.query === 'desc') {
      order.key = '$gt'
      order.sort_order = 1
    } else {
      order.key = '$lt'
      order.sort_order = -1
    }
  } else if (query.next_cursor) {
    cursor = query.next_cursor
    if (order.query === 'desc') {
      order.key = '$lt'
      order.sort_order = -1
    } else {
      order.key = '$gt'
      order.sort_order = 1
    }
  }

  if (cursor) {
    const [cursorId, cursorDate] = cursor.split('_')
    q.$or = [
      { last_used: { [order.key]: new Date(cursorDate) } },
      {
        last_used: new Date(cursorDate),
        _id: { [order.key]: new ObjectId(cursorId) }
      }
    ]
  }

  const sort = {
    last_used: order.sort_order,
    _id: order.sort_order
  }
  // ...
  const data = await db.collection('items').find(q, {
    sort,
    limit
  }).toArray()
  if (query.previous_cursor) data.reverse()

  const response = {
    data,
    order: order.query
  }
  // ...
}
<div v-if="result.previous_cursor">
  <a :href="'?order=' + result.order + '&prev_cursor=' + result.previous_cursor ">Previous</a>
</div>
<div v-if="result.next_cursor">
  <a :href="'?order=' + result.order + '&next_cursor=' + result.next_cursor ">Next</a>
</div>

All together now

Multiple field sorting in both directions? Let's do it.

getPaginatedResult = async (query) => {
  const q = {}
  query = query || {}
  let limit = +query.limit || 20
  if (limit < 1) { limit = 20 }
  if (limit > 50) { limit = 50 }

  // What field is our cursor?
  let cursorField = '_id'
  const allowedFields = ['_id', 'last_used', 'pageviews']
  if (query.field && allowedFields.includes(query.field)) {
    cursorField = query.field
  }

  const order = {
    query: 'desc',
    key: '',
    sort_order: -1
  }
  if (query.order === 'asc') {
    order.query = 'asc'
  }

  let cursor
  if (query.previous_cursor) {
    cursor = query.previous_cursor
    if (order.query === 'desc') {
      order.key = '$gt'
      order.sort_order = 1
    } else {
      order.key = '$lt'
      order.sort_order = -1
    }
  } else if (query.next_cursor) {
    cursor = query.next_cursor
    if (order.query === 'desc') {
      order.key = '$lt'
      order.sort_order = -1
    } else {
      order.key = '$gt'
      order.sort_order = 1
    }
  }
  const sort = {}
  if (cursor) {
    let [cursorId, cursorParam] = cursor.split('_')
    cursorId = new ObjectId(cursorId)
    if (cursorParam) {
      if (cursorField === 'last_used') {
        cursorParam = new Date(cursorParam)
      } else if (cursorField === 'pageviews') {
        cursorParam = +cursorParam
      }
      q.$or = [
        { cursorField: { [order.key]: cursorParam } },
        {
        cursorField: cursorParam,
        _id: { [order.key]: cursorId }
        }
      ]
      sort[cursorField] = order.sort_order
    } else {
      q._id = { [order.key]: cursorId }
    }
    sort._id = order.sort_order
  }

  const data = await db.collection('items').find(q, {
    sort,
    limit
  }).toArray()
  if (query.previous_cursor) data.reverse()

  let hasNext, hasPrev
  if (data.length) {
    // Has next?
    order.key = (order.query === 'desc') ? '$lt' : '$gt'
    cursorParam = data[data.length - 1][cursorField]
    if (cursorField === '_id') {
      q._id = { [order.key]: new ObjectId(cursorParam) }
    } else {
      if (cursorField === 'last_used') {
        cursorParam = new Date(cursorParam)
      } else if (cursorField === 'pageviews') {
        cursorParam = +cursorParam
      }
      q.$or = [
        { cursorField: { [order.key]: cursorParam } },
        {
          cursorField: cursorParam,
          _id: { [order.key]: new ObjectId(data[data.length - 1]._id) }
        }
      ]
    }
    hasNext = !!await dbo.db().collection('items').findOne(q)

    // Has previous page
    order.key = (order.query === 'desc') ? '$lt' : '$gt'
    cursorParam = data[0][cursorField]
    if (cursorField === '_id') {
      q._id = { [order.key]: new ObjectId(cursorParam) }
    } else {
      if (cursorField === 'last_used') {
        cursorParam = new Date(cursorParam)
      } else if (cursorField === 'pageviews') {
        cursorParam = +cursorParam
      }
      q.$or = [
        { cursorField: { [order.key]: cursorParam } },
        {
          cursorField: cursorParam,
          _id: { [order.key]: new ObjectId(data[0]._id) }
        }
      ]
    }
    hasPrev = !!await dbo.db().collection('items').findOne(q)
  }

  const response = {
    data,
    order: order.query
  }
  if (hasNext) {
    response.next_cursor = data[data.length - 1]._id
    if (cursorField === '_id') {
      response.next_cursor += '_' + data[data.length - 1][cursorField]
    }
  }
  if (hasPrev) {
    response.next_cursor = data[0]._id
    if (cursorField === '_id') {
      response.next_cursor += '_' + data[0][cursorField]
    }
  }

  return response
}

A final note on indexing

Always ensure your cursor fields are indexed. This is because you are using them for sorting and filtering. If the cursor is _id, no extra action is needed; ids are indexed by default.

db.collection('items').createIndex('last_used')

If your cursor is multiple fields like with our last_used example, you need to create a compound index based on both fields.

db.collection('items').createIndex({ last_login: -1, _id: -1 })

Both fields are indexed in descending order to support sorting both fields in that order. However, the index will also support sorting both fields in ascending order. In order words, it's the same as:

db.collection('items').createIndex({ last_login: 1, _id: 1 })