Personal tools
You are here: Home Science Data to Knowledge Streaming Algorithms
Document Actions

Knowledge Extraction Using Streaming Algorithms

by Katharine Chartrand last modified 2007-04-10 20:12

We can expect that, while data processing time will keep pace with Moore’s law, data access time will not. Streaming algorithms that can process data when it is in transit (eg, as it is being collected or moved between memory locations) become increasingly efficient in this regime. One can take advantage of this efficiency, for example, by adaptively coupling data collection from scientific instruments and information extraction or by performing data reduction as data moves through a hierarchy of memory in a next-generation computer architecture.

Given this evolution in computing environments developing scalable algorithms for real-time feature/knowledge extraction from streaming data demands urgent attention. DDMA proposes work in two areas.

Insights into the nature of sets, measures and functions describing data

Geometric analysis permits the construction of precise, discriminating metrics that can be tuned to ignore the unimportant. Geometric measure theory (GMT) is just beginning to be exploited for the analysis of data in all its diversity. A recent breakthrough of ours 1 connects the L1TV image functional and the flat norm from GMT, introducing a flat norm with scale. The flat norm is a particularly appropriate metric for distinguishing between different shapes, so our work enables the identification of shape at any scale. This nonlinear multiscale decomposition and metric opens up a great number of practical applications, including automated identification of astronomical objects.

Optimizing dimensionality reduction

Information extraction from high-dimensional data relies on exploitation of the sparse, low-dimensional nature of the subspaces containing information. Mathematicians have recently discovered the remarkable result that random projections to a low-dimensional space are sufficient to generate a sparse representation of data, even without any prior knowledge of the sparsity. This has given rise to compressed sensing and the single-pixel camera, which directly measures a compressed version of an image rather than measuring the full image and then compressing it. Recent results from our team promise to redefine the state-of-the-art further. We have developed the means, using discrete, non-convex methods, to generate a representation of the data that is almost optimally sparse — close to the theoretical information content of the signal. 2

When prior knowledge of the sparsity is available, the best dimensionality reduction can be achieved through incorporation of this knowledge into an objective function that is then optimized. The resulting optimization problem is nonlinear and lacks explicit gradients. While the computational cost of such an approach is generally prohibitive, we have successfully applied it 3 to the challenge of time-sensitive quantitative interpretation of radiographic data in a computationally poor environment. This reflects a property that we have studied extensively: optimization problems with high worst-case computational complexity often have very modest typical-case complexity. Characterizing the statistical ensemble of such problems leads to an algorithmic phase structure 4 that can predict when such methods will be computationally tractable.

Specific problem areas of interest

  • Image analysis and tracking for large-scale detection. The challenge is to move from image processing techniques to truly automated image analysis, involving unsolved problems such as image search. Large-scale datasets of interest include streaming video, hyperspectral images and subsurface images.
  • Inference and control in complex networks. Problems of interest range from distributed sensor networks for effective automated detection systems, to exploiting structure of complex networks for epidemic modeling, infrastructure protection and community detection. Insights into social phenomena such as rumor propagation and social group formation may provide a key ingredient.
  • Cryptography and information security. Attacks on information systems call for more advanced cryptographic and privacy-preserving protocols than those that presently exist. Streaming data, which are becoming an increasingly large fraction of network traffic, provide a particular challenge owing to the size and heterogeneity of the datasets.
  • Cognitive science. Progress in brain mapping and neuroimaging promise substantial improvements in our understanding of cognition. These contributions have been overwhelming data-driven, and suggests new approaches to machine learning, information retrieval and behavioral prediction.

links