Using Data Structures More Efficiently: Towards Augmentation of Data Structures…A Journey Towards Optimization

Rushi Trivedi
10 min read · Aug 8, 2020

Hello readers! We all use various data structures in our day-to-day development life, but have you ever thought about how efficiently we are using them? We certainly try to pick an efficient data structure, yet we rarely stop to ask how an existing one can be pushed further. That is what this post is about: augmentation of data structures, or augmented data structures. An augmented data structure is nothing but an existing data structure extended with extra information and operations so that it can be used in a more optimal and efficient way.

There are practical engineering situations that require a dash of creativity: sometimes you need to design an entirely new type of data structure, and sometimes you need to program new operations on an existing one to support the desired application.

Augmenting a data structure is not always straightforward, because the information added to the base data structure must be updated and maintained by the ordinary operations on that structure.

In this blog on augmentation of data structures, we will focus on augmented data structures built on top of red-black trees, and on the general process of augmenting non-primitive data structures.

Dynamic order statistics

Specifically, the ith order statistic of a set of n elements, where i belongs to {1, 2, . . . , n}, is simply the element in the set with the ith smallest key. Any order statistic can be retrieved in O(n) time from an unordered set using a linear-time selection algorithm.

Basically, dynamic order statistics is the problem of determining the element at the ith position in sorted order, or the position (rank) of a given element, within an unordered and changing data set.

  • The naive first step is sorting: sort the unordered data set using any sorting algorithm.
  • If we use bubble, selection, or insertion sort, the sorting step alone costs O(n²), because all of these algorithms have O(n²) worst-case complexity; even an efficient comparison-based algorithm such as merge sort or heapsort still costs O(n log n) (a small sketch of this baseline follows this list).
  • So in this blog we shall see how red-black trees can be modified so that any order statistic can be determined in O(log n) time. We shall also see how the rank of an element (its position in the linear order of the set) can likewise be determined in O(log n) time.
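To make that baseline concrete, here is a minimal Python sketch of the sort-first approach; the function name is just an illustrative choice for this post, not code from any library:

# Hedged illustration: the naive O(n log n) baseline that the augmented
# red-black tree below improves to O(log n) per query on a dynamic set.
def ith_order_statistic_by_sorting(data, i):
    """Return the element with the i-th smallest key (1-indexed)."""
    if not 1 <= i <= len(data):
        raise IndexError("i must be between 1 and len(data)")
    return sorted(data)[i - 1]

print(ith_order_statistic_by_sorting([26, 17, 41, 30, 38], 2))  # -> 26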

An order-statistic tree T is simply a red-black tree with additional information stored in each node. Besides the usual red-black tree fields key[x], color[x], p[x], left[x], and right[x] in a node x, we have another field size[x].

This field contains the number of (internal) nodes in the subtree rooted at x (including x itself), that is, the size of the subtree. If we define size[NIL] to be 0, then we have the identity

size[x] = size[left[x]] + size[right[x]] + 1 .
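As a concrete illustration, here is a minimal Python sketch of such a node; the class and field names (Node, parent, size, and the helper subtree_size) are illustrative choices, and None plays the role of the NIL sentinel:

# Hedged sketch of an order-statistic tree node: the usual red-black fields
# plus a `size` field counting the nodes in the subtree rooted at this node.
class Node:
    def __init__(self, key, color="RED"):
        self.key = key
        self.color = color        # "RED" or "BLACK"
        self.parent = None        # p[x]
        self.left = None          # left[x]
        self.right = None         # right[x]
        self.size = 1             # size[x]: nodes in the subtree rooted here

def subtree_size(x):
    """size[NIL] is defined to be 0; None stands in for NIL here."""
    return x.size if x is not None else 0

# The identity size[x] = size[left[x]] + size[right[x]] + 1 then reads:
# x.size == subtree_size(x.left) + subtree_size(x.right) + 1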

Retrieving an element with a given rank

Before we show how to maintain this size information during insertion and deletion, let us examine the implementation of two order-statistic queries that use this additional information. We begin with an operation that retrieves an element with a given rank. The procedure OS-SELECT(x, i) returns a pointer to the node containing the ith smallest key in the subtree rooted at x. To find the ith smallest key in an order-statistic tree T, we call OS-SELECT(root[T], i).

OS-SELECT(x, i)
1  r <- size[left[x]] + 1
2  if i = r
3     then return x
4  elseif i < r
5     then return OS-SELECT(left[x], i)
6  else return OS-SELECT(right[x], i - r)

In line 1 of OS-SELECT, we compute r, the rank of node x within the subtree rooted at x. If i = r, then node x is the ith smallest element, so we return x in line 3. If i < r, then the ith smallest element is in x’s left subtree, so we recurse on left[x] in line 5. If i > r, then the ith smallest element is in x’s right subtree. Since there are r elements in the subtree rooted at x that come before x’s right subtree in an inorder tree walk, the ith smallest element in the subtree rooted at x is the (i - r)th smallest element in the subtree rooted at right[x]. This element is determined recursively in line 6.
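Here is a hedged Python rendering of the same procedure, assuming the Node sketch from above (with None standing in for NIL, so size[NIL] = 0):

def subtree_size(x):
    return x.size if x is not None else 0

def os_select(x, i):
    """Return the node holding the i-th smallest key in the subtree rooted at x."""
    r = subtree_size(x.left) + 1          # rank of x within its own subtree
    if i == r:
        return x
    elif i < r:
        return os_select(x.left, i)       # the i-th smallest lies in the left subtree
    else:
        return os_select(x.right, i - r)  # skip the r elements up to and including x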

To see how OS-SELECT operates, consider a search for the 17th smallest element in the order-statistic tree (figure). We begin with x as the root, whose key is 26, and with i = 17. Since the size of 26’s left subtree is 12, its rank is 13. Thus, we know that the node with rank 17 is the 17 - 13 = 4th smallest element in 26’s right subtree. After the recursive call, x is the node with key 41, and i = 4. Since the size of 41’s left subtree is 5, its rank within its subtree is 6. Thus, we know that the node with rank 4 is the 4th smallest element in 41’s left subtree. After the recursive call, x is the node with key 30, and its rank within its subtree is 2. Thus, we recurse once again to find the 4 - 2 = 2nd smallest element in the subtree rooted at the node with key 38. We now find that its left subtree has size 1, so its rank within that subtree is 2, which matches i; thus a pointer to the node with key 38 is returned by the procedure.

Because each recursive call goes down one level in the order-statistic tree, the total time for OS-SELECT is at worst proportional to the height of the tree. Since the tree is a red-black tree, its height is O(log n), where n is the number of nodes. Thus, the running time of OS-SELECT is O(log n) for a dynamic set of n elements.

Determining the rank of an element

Given a pointer to a node x in an order-statistic tree T, the procedure OS-RANK returns the position of x in the linear order determined by an inorder tree walk of T.

OS-RANK(T, x)
1  r <- size[left[x]] + 1
2  y <- x
3  while y ≠ root[T]
4      do if y = right[p[y]]
5            then r <- r + size[left[p[y]]] + 1
6         y <- p[y]
7  return r

The procedure works as follows. The rank of x can be viewed as the number of nodes preceding x in an inorder tree walk, plus 1 for x itself. The following invariant is maintained: at the top of the while loop of lines 3–6, r is the rank of key[x] in the subtree rooted at node y. We maintain this invariant as follows. In line 1, we set r to be the rank of key[x] within the subtree rooted at x. Setting y <- x in line 2 makes the invariant true the first time the test in line 3 executes. In each iteration of the while loop, we consider the subtree rooted at p[y]. We have already counted the number of nodes in the subtree rooted at node y that precede x in an inorder walk, so we must add the nodes in the subtree rooted at y’s sibling that precede x in an inorder walk, plus 1 for p[y] if it, too, precedes x. If y is a left child, then neither p[y] nor any node in p[y]’s right subtree precedes x, so we leave r alone. Otherwise, y is a right child and all the nodes in p[y]’s left subtree precede x, as does p[y] itself. Thus, in line 5, we add size[left[p[y]]] + 1 to the current value of r. Setting y <- p[y] makes the invariant true for the next iteration. When y = root[T], the procedure returns the value of r, which is now the rank of key[x].
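A Python rendering of OS-RANK along the same lines, again assuming the Node sketch above and a tree object whose root attribute plays the role of root[T]:

def subtree_size(x):
    return x.size if x is not None else 0

def os_rank(tree, x):
    """Return the position of node x in an inorder walk of the tree."""
    r = subtree_size(x.left) + 1                   # rank of x within its own subtree
    y = x
    while y is not tree.root:
        if y is y.parent.right:
            # everything in the left sibling subtree, plus the parent, precedes x
            r += subtree_size(y.parent.left) + 1
        y = y.parent
    return r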

As an example, when we run OS-RANK on the same order-statistic tree to find the rank of the node with key 38, we get the following sequence of values of key[y] and r at the top of the while loop:

iteration   key[y]    r
-----------------------
    1         38      2
    2         30      4
    3         41      4
    4         26     17

The rank 17 is returned.

Since each iteration of the while loop takes O(1) time, and y goes up one level in the tree with each iteration, the running time of OS-RANK is at worst proportional to the height of the tree: O(log n) on an n-node order-statistic tree.

Maintaining subtree sizes

Given the size field in each node, OS-SELECT and OS-RANK can quickly compute order-statistic information. But unless these fields can be efficiently maintained by the basic modifying operations on red-black trees, our work will have been for naught. We shall now show that subtree sizes can be maintained for both insertion and deletion without affecting the asymptotic running times of either operation.

We noted that insertion into a red-black tree consists of two phases. The first phase goes down the tree from the root, inserting the new node as a child of an existing node. The second phase goes up the tree, changing colors and ultimately performing rotations to maintain the red-black properties.

To maintain the subtree sizes in the first phase, we simply increment size[x] for each node x on the path traversed from the root down toward the leaves. The new node added gets a size of 1. Since there are O(log n) nodes on the traversed path, the additional cost of maintaining the size fields is O(log n).
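As an illustration, here is a minimal sketch of that downward phase written as an ordinary binary-search-tree insertion that maintains the size fields; the red-black recoloring and rotations of the second phase are deliberately left out, and the names (bst_insert_with_sizes, tree.root) are only illustrative:

def bst_insert_with_sizes(tree, z):
    """First phase only: walk down from the root, bumping sizes, then attach z."""
    y = None
    x = tree.root
    while x is not None:
        x.size += 1                 # z will end up somewhere in x's subtree
        y = x
        x = x.left if z.key < x.key else x.right
    z.parent = y
    z.size = 1                      # the newly added node roots a subtree of size 1
    if y is None:
        tree.root = z               # tree was empty
    elif z.key < y.key:
        y.left = z
    else:
        y.right = z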

In the second phase, the only structural changes to the underlying red-black tree are caused by rotations, of which there are at most two. Moreover, a rotation is a local operation: it invalidates only the two size fields in the nodes incident on the link around which the rotation is performed. Referring to the code for LEFT-ROTATE(T, x), we need to add the following lines:

13  size[y] <- size[x]
14  size[x] <- size[left[x]] + size[right[x]] + 1
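For concreteness, here is a hedged Python version of LEFT-ROTATE with those two size updates added at the end, using the same Node conventions as before (None in place of NIL):

def subtree_size(node):
    return node.size if node is not None else 0

def left_rotate(tree, x):
    """Rotate left around the link between x and its right child y."""
    y = x.right                     # y moves up, x becomes y's left child
    x.right = y.left
    if y.left is not None:
        y.left.parent = x
    y.parent = x.parent
    if x.parent is None:
        tree.root = y
    elif x is x.parent.left:
        x.parent.left = y
    else:
        x.parent.right = y
    y.left = x
    x.parent = y
    # size maintenance (lines 13-14): only x and y need fixing
    y.size = x.size
    x.size = subtree_size(x.left) + subtree_size(x.right) + 1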

Since at most two rotations are performed during insertion into a red-black tree, only O(1) additional time is spent updating size fields in the second phase. Thus, the total time for insertion into an n-node order-statistic tree is O(log n) — asymptotically the same as for an ordinary red-black tree.

Deletion from a red-black tree also consists of two phases: the first operates on the underlying search tree, and the second causes at most three rotations and otherwise performs no structural changes. The first phase splices out one node y. To update the subtree sizes, we simply traverse a path from node y up to the root, decrementing the size field of each node on the path. Since this path has length O(log n) in an n-node red-black tree, the additional time spent maintaining size fields in the first phase is O(log n). The O(1) rotations in the second phase of deletion can be handled in the same manner as for insertion. Thus, both insertion and deletion, including the maintenance of the size fields, take O(log n) time for an n-node order-statistic tree.
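A minimal sketch of that size bookkeeping for deletion, assuming the spliced-out node y still holds a valid parent pointer when we start the upward walk:

def decrement_sizes_upward(y):
    """Walk from the spliced-out node y up to the root, shrinking each subtree by 1."""
    node = y.parent
    while node is not None:
        node.size -= 1              # y's removal costs each ancestor one node
        node = node.parent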

How to augment a data structure

This section can be considered the heart of the whole blog. Here we discuss how to augment a data structure and what needs to be taken into consideration while augmenting one to meet a client or project requirement.

The process of augmenting a basic data structure to support additional functionality occurs quite frequently in algorithm design. The same process can be used, for example, to design a data structure that supports operations on intervals (interval trees, mentioned at the end of this blog). In this section, we shall examine the steps involved in such augmentation.

Augmenting a data structure can be broken into four steps:

1. choosing an underlying data structure,

2. determining additional information to be maintained in the underlying data structure,

3. verifying that the additional information can be maintained for the basic modifying operations on the underlying data structure, and

4. developing new operations.

As with any prescriptive design method, you should not blindly follow the steps in the order given. Most design work contains an element of trial and error, and progress on all steps usually proceeds in parallel. There is no point, for example, in determining additional information and developing new operations (steps 2 and 4) if we will not be able to maintain the additional information efficiently. Nevertheless, this four-step method provides a good focus for your efforts in augmenting a data structure, and it is also a good way to organize the documentation of an augmented data structure.

We followed these steps to design our order-statistic trees. For step 1, we chose red-black trees as the underlying data structure. A clue to the suitability of red-black trees comes from their efficient support of other dynamic-set operations on a total order, such as MINIMUM, MAXIMUM, SUCCESSOR, and PREDECESSOR.

For step 2, we provided the size field, which in each node x stores the size of the subtree rooted at x. Generally, the additional information makes operations more efficient. For example, we could have implemented OS-SELECT and OS-RANK using just the keys stored in the tree, but they would not have run in O(log n) time.

For step 3, we ensured that insertion and deletion could maintain the size fields while still running in O(log n) time. Ideally, a small number of changes to the data structure should suffice to maintain the additional information. For example, if we simply stored in each node its rank in the tree, the OS-SELECT and OS-RANK procedures would run quickly, but inserting a new minimum element would cause a change to this information in every node of the tree. When we store subtree sizes instead, inserting a new element causes information to change in only O(log n) nodes.

For step 4, we developed the operations OS-SELECT and OS-RANK. After all, the need for new operations is why we bother to augment a data structure in the first place.

Other examples of augmented data structures include interval trees; you can refer to the interval tree implementation at https://www.geeksforgeeks.org/interval-tree/

