Suppose I want to maintain a set of people where each has an age and a wealth. I want to be able to quickly insert people, delete people, and answer queries of the form “find the th richest person whose age is between and ”.

Here is a summary of solutions to different variants on this question:

Variation Solution details Space Update time Query time
No insertion or deletion, one-sided interval Persistent binary search trees $$ n \cdot \log(n)$$ N/A $$ \log(n)$$
No insertion or deletion, two-sided interval Persistent binary search trees $$ n \cdot \log(n)$$ N/A $$ \log(n)$$
Insertion, deletion, two-sided interval My answer, presented here $$ n \cdot \log(n)$$ $$ \log^2(n)$$ $$ \log^3(n)$$

I have been somewhat interested in this question for about a year, and I’ve asked about related questions a few times. But no-one’s ever managed to give me a complete answer.

The other day, I ran across this StackOverflow answer, which presents an answer which I modified to make a full solution, which I’ll present here.

My solution allows insertion and deletion in , and allows the query in .

Solution

Store an order statistic tree ordered on age. At every node, store a pointer to an auxiliary order statistic tree, of all of that node’s descendants ordered on income.

At every node, this requires an extra amount of memory which is linear in the number of descendants of that node. So this means that the tree will take memory overall.

Query

You might want to go read that second StackOverflow answer again, because this algorithm is similar to that one, and it’s easier to understand the algorithm on arrays.

The query is similar to how queries across ranges of an augmented BST usually work: we start out by finding the set of nodes whose descendants contain the whole subsection of the tree that you care about. There will be of these nodes, and finding all of them takes time.

We end up with OSTs of maximum size , and we want to find the th smallest item in their disjoint union.

As discussed here, we can solve that query in .

Alternatively, if your is small, you can directly traverse the trees to find the correct answer, which takes time.

Updates

When I insert a new value into my set, I need to update the auxiliary OSTs of all of the ancestor nodes of my new node.

Usually, it’s easy to argue that maintaining auxiliary data in your OST is fast, because usually your auxiliary data is something like “the sum of your descendants” or something which is . In this data structure, the efficiency argument is somewhat more complicated.

Inserting a single item into an OST takes time. But making the OST from scratch takes . This is concerning because it means that tree rotations are potentially extremely expensive. If I had to do tree rotations all the way up from my new node to the root of the tree every single insertion, then insertion would take linear time.

Luckily, we can decide that our OST is balanced using the red-black tree rules. Insertion in a red-black tree only involves amortized rotations. (See here for an explanation of this.)

The node at the lowest level will have to totally regenerate its auxiliary OST every insertion, of course. Its parent will have to do a tree rotation which requires it to totally regenerate its auxiliary OST every second insertion. Its grandparent will need to do that of the time. And then and so on.

At height in the tree, defining the leaves to be , the amortized cost of insertion is going to be for insertion plus for totally recreating the OST after a rebalance, for a total cost of .

The total time required for updating all the auxiliary OSTs after an insert is therefore:

$^$O\left(\sum_{h=0}^{log(n)} h \right)= O\left(\log(n)^2\right)$^$

Updating or deleting a node also takes , for the same reason.

Variations

limited

If is always going to fixed below a particular limit –say, you know ahead of time that you’re never going to need to know farther back than the 50th richest person between two ages–each node in your main OST can store a smaller auxiliary tree with only elements in it.

This reduces memory requirements to .

Queries: Using the algorithm for OSTs in this table, the cost is now , which looks like as grows.

Updates: Every ancestor needs to do work now, instead of , but you still have ancestors. So update takes overall time.