So far, we’ve discussed how one might go about specifying queries, or “defining interesting,” and have shown a couple of different ways to present results that show only “the interesting data.” Here, we turn our attention to the underlying machinery that makes this kind of approach feasible in high-performance implementations suitable for use with very large datasets.
All Computer Science undergraduates are introduced to the idea of binary trees and their use as an indexing data structure. Briefly, if you have a sorted array of N data items, you can construct a binary tree that will have N-1 interior nodes and N leaves, where each interior node partitions the data in deeper nodes and leaves into two groups – “greater than” and “less than or equal to” the value of a key. Once you have constructed this data structure, the search for the data record having a given key value takes about log₂ N steps, assuming an optimal, or balanced, tree. This basic idea – called tree-based indexing – is widely used in many types of relational and object-oriented database systems. One obvious limitation of this type of approach when considering very large data is that the size of the indexing structure – the tree – is linear with respect to the size of the dataset being indexed. As the dataset grows larger, we clearly don’t want to incur a commensurately larger storage cost for our search indices. Another problem, which may not be quite as obvious, is that these tree-based approaches require the original data to be sorted. For scientific data, where you typically write the data once and then examine it over and over again, this may not be a serious limitation. In some instances, however, it may simply be impractical to sort the data.
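The logarithmic search cost described above can be illustrated with a minimal sketch. It assumes the simplest case, a sorted Python list searched by repeated halving, which visits the same sequence of "less than / greater than" decisions as a balanced binary tree built over that list. The function name `binary_search` and the sample keys are illustrative only.

```python
import math


def binary_search(sorted_keys, target):
    """Return (index, steps); index is -1 if target is absent.

    Each loop iteration corresponds to visiting one node of a
    balanced binary tree built over sorted_keys.
    """
    lo, hi = 0, len(sorted_keys) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            return mid, steps
        if sorted_keys[mid] < target:
            lo = mid + 1   # target lies in the "greater than" half
        else:
            hi = mid - 1   # target lies in the "less than" half
    return -1, steps


# Illustrative data: N = 1,000,000 sorted even numbers.
keys = list(range(0, 2_000_000, 2))
index, steps = binary_search(keys, 1_234_566)

# Worst-case number of steps for N keys is floor(log2(N)) + 1.
max_steps = math.floor(math.log2(len(keys))) + 1
print(index, steps, max_steps)
```

For one million keys, no search takes more than 20 steps, which is the practical appeal of tree-based indexing for a single sorted key.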
Of greater concern is the so-called “Curse of Dimensionality.”[13] The previous paragraph notes that the storage complexity for a tree-based structure is O(N) when there are N data points. If these data points, or records, have two variables, and we want to create a two-dimensional tree that spans both variables, we end up with a storage complexity of O(N²). If there are three variables, the storage requirements are O(N³). The basic premise is that storage requirements for tree-based indices grow exponentially with respect to the number of variables being indexed. Many modern simulations routinely have on the order of 100 variables that are computed and saved at each time step. It should be obvious that tree-based indexing is simply not practical for large and complex scientific data.
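To get a feel for how quickly this grows, the following sketch simply evaluates the scaling stated above, N raised to the number of indexed variables, for a deliberately small N. The helper `index_entries` and the choice of N = 1,000 are assumptions made for illustration; the numbers are not measurements of any real index.

```python
def index_entries(n_records, n_variables):
    """Entries implied by the O(N^d) scaling described in the text."""
    return n_records ** n_variables


n = 1_000  # a very small dataset by simulation standards
growth = {d: index_entries(n, d) for d in (1, 2, 3, 100)}

for d, entries in growth.items():
    # Report the order of magnitude rather than the full integer.
    print(f"{d:>3} variables: about 10^{len(str(entries)) - 1} entries")
```

Even with only a thousand records, three variables already imply a billion entries under this scaling, and 100 variables imply a number with 301 digits.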
This well-known problem has received a great deal of attention from our colleagues in the field of scientific data management. They have developed a unique technology called “compressed bitmap indices” that has very favorable storage and search complexity.[14] This technology has been applied with great success to index/query problems on some of the world’s largest datasets.[15] In a series of collaborative research projects, members of VACET and DOE’s Scientific Data Management Center have demonstrated the practicality of combining fast bitmap indexing with high-performance visual data analysis to implement a novel approach to query-driven visualization, applied to problems in combustion modeling[8] and large-scale network traffic analysis.[16]
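The general idea behind a bitmap index can be sketched in a few lines. What follows is a toy, uncompressed version, not the compressed implementation cited above: each variable is divided into bins, each bin gets one bit vector with a bit per record, and a multi-variable range query becomes a handful of bitwise OR and AND operations. The function names, bin edges, and sample records are all hypothetical, and Python integers stand in for the bit vectors.

```python
def build_bitmap_index(values, bin_edges):
    """Build one bitmap (a Python int used as a bit vector) per bin.

    Bit i of bitmap b is set when bin_edges[b] <= values[i] < bin_edges[b + 1].
    """
    bitmaps = [0] * (len(bin_edges) - 1)
    for i, v in enumerate(values):
        for b in range(len(bitmaps)):
            if bin_edges[b] <= v < bin_edges[b + 1]:
                bitmaps[b] |= 1 << i
                break
    return bitmaps


def bins_at_or_above(bitmaps, first_bin):
    """OR the bitmaps of all bins >= first_bin ("value >= lower edge")."""
    result = 0
    for bitmap in bitmaps[first_bin:]:
        result |= bitmap
    return result


def matching_rows(bitmap):
    """List the record numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical records: temperature and pressure for six grid cells.
temperature = [310, 950, 1200, 400, 1500, 880]
pressure = [1.2, 3.5, 0.8, 2.9, 3.1, 0.5]

# Each variable is indexed independently; the data is never sorted.
temp_index = build_bitmap_index(temperature, [0, 500, 1000, 2000])
pres_index = build_bitmap_index(pressure, [0.0, 1.0, 2.0, 4.0])

# Query: temperature >= 1000 AND pressure >= 2.0
hits = bins_at_or_above(temp_index, 2) & bins_at_or_above(pres_index, 2)
rows = matching_rows(hits)
print(rows)
```

Two properties of this sketch are worth noting. Because each variable has its own set of bitmaps, adding a variable adds one more set rather than another tree dimension, and the records stay in their original order. The query boundaries here were chosen to coincide with bin edges; a query that cuts through a bin would also need to check the raw values of the records in that boundary bin. Compression of the bit vectors, which is what keeps the index small in practice, is omitted entirely.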






