Relevant Publications
Sivaramakrishnan Narayanan, Tahsin Kurc, Umit Catalyurek, Joel Saltz,
"Database Support for Data-driven Scientific Applications in the Grid",
Parallel Processing Letters, Vol. 13, No. 2, 245-271, 2003
Abstract: In this paper we describe a services oriented software system to provide
basic database support for efficient execution of applications that make
use of scientific datasets in the Grid. This system supports two core
operations: efficient selection of the data of interest from distributed
databases and efficient transfer of data from storage nodes to compute
nodes for processing. We present its overall architecture and main
components and describe preliminary experimental results.
[get from WorldSciNet]
[download pre-publication version as Tech Report]
Sivaramakrishnan Narayanan, Umit Catalyurek, Tahsin Kurc, Xi Zhang, Joel Saltz,
"Applying Database Support for Large Scale Data Driven Science in Distributed Environments",
Proceedings of the Fourth International Workshop on Grid Computing (Grid 2003), 141-148, 2003
Abstract: There is a rapidly growing set of applications, referred to as data
driven applications, in which analysis of large amounts of data drives
the next steps taken by the scientist, e.g., running new simulations,
doing additional measurements, extending the analysis to larger data
collections. Critical steps in data analysis are to extract the data of
interest from large and potentially distributed datasets and to move
it from storage clusters to compute clusters for processing. We have
developed a middleware framework, called GridDB-Lite, that is designed
to efficiently support these two steps. In this paper, we describe
the application of GridDB-Lite in large scale oil reservoir simulation
studies and experimentally evaluate several optimizations that can be
employed in the GridDB-Lite runtime system.
[get from IEEE]
Li Weng, Gagan Agrawal, Umit Catalyurek, Tahsin Kurc, Sivaramakrishnan Narayanan, Joel Saltz,
"An Approach for Automatic Data Virtualization",
Proceedings of the 13th IEEE International Symposium on High-Performance Distributed Computing (HPDC-13), 24-33, June 2004
Abstract: Analysis of large and/or geographically distributed scientific
datasets is emerging as a key component of grid computing.
One challenge in this area is that scientific datasets are typically
stored as binary or character flat-files, which makes specification of
processing much harder. In view of this, there has been recent
interest in data virtualization and data services to support
such virtualization.
This paper presents an approach for automatically creating
data services to support data virtualization. Specifically, we show
how a relational table like data abstraction can be supported for
complex multi-dimensional scientific datasets that are resident on
a cluster. We have designed and implemented a tool that processes
SQL queries (with select and where statements) on multi-dimensional
datasets. We have designed a meta-data description language that
is used for specifying the data layout. From such description,
our tool automatically generates efficient data
subsetting and access functions.
We have extensively evaluated our system. The key observations
from our experiments are as follows. First, our tool can correctly
and efficiently handle a variety of different data layouts. Second,
our system scales well as the number of nodes or the amount of data is
scaled. Third, the performance of the automatically generated code for
indexing and extracting functions is quite comparable to the
performance of hand-written codes.
[download from IEEE]