STORM is services-based middleware, implemented using DataCutter. It allows for select-type queries against large, potentially distributed datasets and for transfer of that data to client applications in a Grid environment. It is designed to provide:
STORM provides for queries that can be thought of as taking the following SQL-like form:
SELECT <Data Elements>
FROM Dataset1, Dataset2, ... Datasetn
WHERE <Expression> AND <Filter(<Data Element>)>
GROUP-BY-PROCESSOR ComputeAttribute(<Data Element>)
In this example, the AND in the WHERE clause
specifies a combination of filtering using built-in operators and
filtering using user-defined filtering functions. The user-defined
operation might be very difficult or impossible to express
with a simple expression and thus require special filtering. The
GROUP-BY-PROCESSOR operation illustrates the
possibility of partitioning of data processing across distributed
processors to achieve parallelism. Here, the ComputeAttribute
is a user-defined function that generates the attribute value
on which the selected data elements are grouped together based
on the application-specific partitioning of data elements.
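As a rough illustration of the query form described above, the following Python sketch applies a built-in style expression, a user-defined filter, and a ComputeAttribute-style function that groups the selected elements by target processor. It is not STORM code: run_query, the sample records, and the three lambda functions are hypothetical stand-ins chosen only to show how the clauses combine.

```python
from collections import defaultdict

def run_query(records, expression, user_filter, compute_attribute):
    """Select records that pass both filters, grouped by a computed attribute."""
    groups = defaultdict(list)
    for rec in records:
        # WHERE <Expression> AND <Filter(<Data Element>)>
        if expression(rec) and user_filter(rec):
            # GROUP-BY-PROCESSOR ComputeAttribute(<Data Element>)
            groups[compute_attribute(rec)].append(rec)
    return dict(groups)

# Hypothetical data elements (not taken from any STORM dataset)
records = [{"id": i, "value": v} for i, v in enumerate([12, 7, 30, 45, 8, 21])]

result = run_query(
    records,
    expression=lambda r: r["value"] > 10,        # built-in style comparison
    user_filter=lambda r: r["value"] % 3 == 0,   # stand-in for a user-defined filter
    compute_attribute=lambda r: r["id"] % 2,     # assign each element to one of 2 processors
)

for processor, elements in sorted(result.items()):
    print(processor, [e["id"] for e in elements])
```

Each key of the returned dictionary plays the role of one processor's partition; in a real deployment the partitioning function would be application-specific.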
The STORM package includes a GTK-based GUI client to simplify the running of queries. It also includes a command-line client, for systems without the GTK library.


STORM has been tested on RedHat Linux and AIX. The packages can be installed from source code.
wget (ftp://ftp.gnu.org/pub/gnu/wget)
gunzip, tar, make (installed in most distributions or
can be downloaded from GNU)
libfl.a in /usr/lib or /usr/local/lib. Note that this library is usually installed
on Linux machines; on AIX machines it is typically in /usr/local/lib.
cmake CC=<path to c compiler> and CXX=<path to c compiler>
$ chmod 755 storm-install.sh
$ ./storm-install.sh \
    <WORK_AREA> \
    <DATACUTTER_INSTALL_PREFIX> \
    <STORM_INSTALL_PREFIX> \
    <FLEX_LIBRARY_PATH>
e.g.,
$ ./storm-install.sh /tmp/work /usr/local /usr/local /usr/lib/libfl.a
Add the DataCutter and STORM /bin directories to your shell PATH. Also,
if the /lib directories are in a nonstandard place in your installation,
add them to the LD_LIBRARY_PATH (in Linux) or the LIBPATH (in AIX).
$ dcshell -verbose -debug appd start localhost
$ xterm -e stormd &
$ stormtestbinaryds 1 | grep "results OK"
$ storm stormd stop
$ dcshell all stop
Since STORM depends on DataCutter, you need to start up the DataCutter environment before using STORM.
If you are running in an X-Window environment and have your DISPLAY set such that you can open X clients (e.g. xterms), run the command:
dcshell -verbose -debug appd start localhost
to start a new appd (which manages DataCutter processes) on the local machine. If the command succeeds, you will get two new xterms opened on your display, one running 'dird' (a directory daemon) and one running an 'appd' (an execution environment for DataCutter processes). If your appd has started correctly, you will see output like:
APPD ARGUMENTS: 13
0 - appd
1 - -host
2 - akron.bmi.ohio-state.edu
3 - -init-ack
4 - akron:44543,1
5 - -DATACUTTER_DIRDHOST
6 - akron
7 - -DATACUTTER_DIRDPORT
8 - 4001
9 - -DATACUTTER_DIRDTYPE
10 - Standalone
11 - -DATACUTTER_COMMPROTOCOL
12 - TCP
DataCutter appd v1.4 (02/14/2002)
Registered with dird: akron:4001
If you are not running in an X-Window session, you'll want to run:
$ dcshell -noxterm -verbose -debug appd start localhost
and you can monitor the output logs in /tmp/appd.*. The logfile to
watch is given to you in the output. (Note: once you are satisfied
that the appds are launching correctly, you can omit the -verbose
and -debug parts.)
One of the first things to understand about DataCutter is that
it initializes its daemons (local or remote) via ssh. If you have
different usernames between local and remote hosts, you will need
to set up username maps (with an OpenSSH server this means adding
to the ~/.ssh/config file). The username issue shouldn't affect
you if you're starting up all services on 'localhost'.
Secondly, some sshd servers will not execute your shell startup
files (.profile, etc.) during remote shell commands, which could
mean that the paths to the DataCutter and STORM binaries/libraries are
not found (normally resulting in 'dird: command not found'
type messages). To get around this problem, you can create a file,
~/.datacutter.source on the node where ssh is being executed. This
file must be in your native shell syntax and is read in as DataCutter
is initialized.
As an example, your .datacutter.source file might contain (for ksh):
PATH=$HOME/dcinstdir/bin:$HOME/storminstdir/bin
LD_LIBRARY_PATH=$HOME/dcinstdir/lib:$HOME/storminstdir/lib
export PATH
export LD_LIBRARY_PATH
Once DataCutter has been started, you can initialize the STORM
daemon. Do this by typing stormd in a new shell session. It
will choose a listen port based on your UID. If that port is not
adequate, you can either set STORMD_PORT or run stormd 1234
where 1234 is the port you want to listen on.
Once stormd has started, you can connect to it with a test client,
create some binary datasets, query them, index them, and query them
using the index. This all happens automatically with the command:
stormtestbinaryds 1. After running this test
command, if you see a pair of lines of output in this shell
that begin with 'results OK', then the install has been
validated. If you see ERROR: messages, then there is some
problem. In that case, you can email the output to Benjamin Rutt (rutt@bmi.osu.edu) and you will be
contacted for follow-up support. (Note: If you are using a port other
than what is chosen for you, be sure to set STORMD_PORT
in this execution environment.)
It may be easier to just type:
$ stormtestbinaryds 1 | grep "results OK"
to sift through the output. If you get two "results OK" lines, STORM is working properly.
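The validation rule described above (two lines beginning with 'results OK' and no ERROR: messages) can be sketched in Python for use in a scripted install check. This is a minimal sketch: install_validated is a hypothetical helper, and the sample text below is invented for illustration rather than captured from a real stormtestbinaryds run.

```python
def install_validated(output_text):
    """Apply the manual's rule: exactly two 'results OK' lines and no ERROR: lines."""
    lines = output_text.splitlines()
    ok_lines = [line for line in lines if line.startswith("results OK")]
    error_lines = [line for line in lines if "ERROR:" in line]
    return len(ok_lines) == 2 and not error_lines

# Invented sample text, for illustration only
good_run = "starting test\nresults OK\nbuilding index\nresults OK\n"
bad_run = "starting test\nresults OK\nERROR: something failed\n"

print(install_validated(good_run))
print(install_validated(bad_run))
```

To check a real run, the saved output of the test command would be passed to the function in place of the invented strings.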
$ storm stormd stop
$ dcshell all stop
Finally, if any processes are not shutting down cleanly, you may kill them manually. The processes to kill are named 'dird', 'appd', and 'stormd'.
In this example, we use STORM and DataCutter to run a query on a relatively simple binary file containing American League baseball statistics. The goal in this example is to demonstrate the primary features of STORM.
Download the sample datasets and schema to a convenient
location. In the examples below, we have installed STORM and
DataCutter in a sub-directory of a user's home directory, called
storm_env.
The base directory for the downloadable sample files is here:
http://bmi.osu.edu/resources/software/storm/datasets/
Get these six files:
AL-stats-part1.bin
AL-stats-part2.bin
AL-stats-part1-fixedsize.bin
AL-stats-part2-fixedsize.bin
alstats-fixed.dslist
alstats-fixed.schema
First, we will use the STORM metadata tool (storm) to create and add a
schema for our baseball statistics files. In this process, we'll
describe the format of the data to STORM, so that it can accurately
query the file. We'll then associate the schema with our sample
datasets, start DataCutter and STORM, and run a number of queries.
For this example, we'll be running the queries using the
command-line query tool (stormtextclient), but know that you could
perform the same queries with less typing and prettier output using the
GTK-client included with STORM. The GTK-client also supports the ability
to sort columns ascending and descending by clicking a column name.
We've tried to pick queries that fully demonstrate the syntax of the query language, including how to use a column index to optimize query response time. We'll run this example on a single host for the sake of simplifying the demonstration. It could as easily be run in a cluster environment (see the DataCutter manual's QuickStart example, which demonstrates DataCutter running in a multi-host, cluster environment). It could also be run on computers on the Grid, provided DataCutter and STORM are installed on all machines being used.
At any time, you can type 'storm help' to get a listing of commands for the metadata tool.
$ storm add schema
And follow the prompts. The first prompt will ask you to enter a name for the dataset schema. This is not the file name of the dataset; it is an arbitrary name, preferably a short one--easy to type and remember. You'll be asked to outline the structure of the records in the dataset, including (optionally) whether there are offsets between the data of interest and whether the file has a header that should be skipped. (NOTE: With this release of STORM, you must give the offsets in ascending order; for example, you must enter attribute X which starts at record offset 23 before you enter attribute J which starts at record offset 41). You'll also choose the type of extractor. Binary is the default, built-in, extractor type but others can be added (see the next section, 'Defining and Extending STORM Extractors and Indexes').
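The ascending-offset requirement noted above can be checked before the attributes are typed in at the prompts. The Python sketch below is only an illustration: offsets_ascending is a hypothetical helper, and treating equal offsets as invalid is an assumption, since the manual only says the offsets must ascend.

```python
def offsets_ascending(attributes):
    """Return True if (name, offset) pairs are listed in strictly ascending offset order."""
    offsets = [offset for _name, offset in attributes]
    return all(a < b for a, b in zip(offsets, offsets[1:]))

# Attribute X at record offset 23 must be entered before attribute J at offset 41
print(offsets_ascending([("X", 23), ("J", 41)]))
print(offsets_ascending([("J", 41), ("X", 23)]))
```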
Schemas are written to a file called DatasetSchema.storm in the
.storm directory of your home directory. The list of datasets, hosts,
extractors, and file structures is written to DatasetList.storm in
the same directory. Here is the output after entering the schema for
our example baseball statistics files. You can see that the file has a
four byte header, no offsets between the data, and a total record length
of 68 bytes. You can use this output to populate a schema on your own.
$ storm list schema
Schemas: [1]
ALSTATS [19]
<FixedSize> = 68
<HeaderSkip> = 4
YEAR = INT4 1 0
RUNSPERGAME = FLOAT 1 4
RUNS = INT4 1 8
GAMES = INT4 1 12
ATBATS = INT4 1 16
HITS = INT4 1 20
DOUBLES = INT4 1 24
TRIPLES = INT4 1 28
HR = INT4 1 32
BB = INT4 1 36
KO = INT4 1 40
BAVG = FLOAT 1 44
ONBASEPCT = FLOAT 1 48
SLG = FLOAT 1 52
SB = INT4 1 56
CS = INT4 1 60
AVGAGE = FLOAT 1 64
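To make the layout in the schema listing above concrete, here is a minimal Python sketch that reads records with a 4-byte header and 68-byte fixed-size records. It does not use STORM's binary extractor. It assumes that INT4 is a 32-bit signed integer, that FLOAT is a 32-bit IEEE float, and that the file is little-endian (the file structure used for this example's datasets). The record values are the ones shown in this manual's sample query output, packed into an in-memory stream so the sketch runs without the downloadable files.

```python
import io
import struct

HEADER_SKIP = 4                       # <HeaderSkip> = 4
RECORD_FORMAT = "<if9i3f2if"          # assumed encodings: INT4 -> "i", FLOAT -> "f"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)   # should match <FixedSize> = 68
FIELDS = ["YEAR", "RUNSPERGAME", "RUNS", "GAMES", "ATBATS", "HITS", "DOUBLES",
          "TRIPLES", "HR", "BB", "KO", "BAVG", "ONBASEPCT", "SLG", "SB", "CS",
          "AVGAGE"]

def read_records(stream):
    """Skip the header, then yield one dict per fixed-size record."""
    stream.seek(HEADER_SKIP)
    while True:
        raw = stream.read(RECORD_SIZE)
        if len(raw) < RECORD_SIZE:
            break
        yield dict(zip(FIELDS, struct.unpack(RECORD_FORMAT, raw)))

# Build a one-record stand-in file in memory (4 header bytes + 1 record)
sample = struct.pack(RECORD_FORMAT, 1924, 4.98, 6141, 1234, 42280, 12253, 2197,
                     551, 397, 4136, 3249, 0.29, 0.353, 0.396, 749, 581, 28.7)
stream = io.BytesIO(b"\x00" * HEADER_SKIP + sample)

parsed = list(read_records(stream))
print(RECORD_SIZE, parsed[0]["YEAR"], parsed[0]["HR"])
```

The float fields come back as 32-bit approximations, which is consistent with values such as BAVG=0.28999999 appearing in query output.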
Add the dataset(s), its host(s), the extractor you'll use to extract the data from the dataset (the built-in binary extractor, in this example), and the file structure (big- or little-endian). The dataset file name should be the full path on the host system. You must be able to ssh to that host and have the proper privileges to read the file. The host should be a resolvable, fully qualified host name, reachable from the machine on which the stormd process is invoked.
$ storm add dataset
And follow the prompts. To list the dataset names you have added, type 'storm list datasets'. Here is the output showing that our two baseball statistics binary files have been added:
$ storm list datasets
Datasets:
[1]
ALSTATS [2]
<DatasetSchema> = ALSTATS
Data-0 = /home/petri/storm_env/AL-stats-part1.bin
akron.bmi.ohio-state.edu BinaryExtractor "little_endian"
Data-1 = /home/petri/storm_env/AL-stats-part2.bin
akron.bmi.ohio-state.edu BinaryExtractor "little_endian"
See the Verification section, above, for further details and troubleshooting.
$ dcshell -verbose -debug appd start akron.bmi.ohio-state.edu
$ xterm -e stormd &
Your screen will look something like the graphic below, which shows two xterms spawned to display output from the DataCutter dird and appd and another xterm verifying that stormd is listening on a port. The above two commands start a minimal instance of DataCutter. In a real-world example, appds would be spawned on multiple hosts, with the dird managing the numerous appd instances. Wherever you want STORM services to be executed, you should start appds.

The following queries use the command-line querying client,
stormtextclient. If you're running on a system that has the GTK graphics
library installed, then you can use the GUI client, stormgtkclient. The
command-line utility takes three parameters (and an optional fourth,
which we'll discuss shortly): a data source name (set up when
you added your schema), a filter string that has a SQL-like syntax,
and an index string. The syntax is:
stormtextclient -d <data source name> -f <filter string> -i <index string>
Here is a query with an empty filter string, returning all results (truncated) in our datafile:
$ stormtextclient -d ALSTATS -f "" -i ""
...
<YEAR=1924 RUNSPERGAME=4.98 RUNS=6141 GAMES=1234 ATBATS=42280 HITS=12253 DOUBLES=2197 TRIPLES=551 HR=397 BB=4136 KO=3249 BAVG=0.28999999 ONBASEPCT=0.35299999 SLG=0.396 SB=749 CS=581 AVGAGE=28.700001>
...
# returned 84 records in 0.762919 seconds
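If you want to post-process the text client's output in a script, the record lines shown above can be parsed into name/value pairs. The Python sketch below is an assumption-laden illustration: parse_result_line is a hypothetical helper, and the line format is inferred only from the sample output in this manual (angle brackets around space-separated NAME=value pairs).

```python
import re

def _number(text):
    """Convert a value to int when possible, otherwise to float."""
    try:
        return int(text)
    except ValueError:
        return float(text)

def parse_result_line(line):
    """Parse '<NAME=value NAME=value ...>' into a dict of numbers."""
    body = line.strip()
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError("not a result record: %r" % line)
    return {name: _number(value)
            for name, value in re.findall(r"(\w+)=(\S+)", body[1:-1])}

record = parse_result_line("<YEAR=1924 RUNSPERGAME=4.98 RUNS=6141 HR=397>")
print(record["YEAR"], record["RUNSPERGAME"], record["HR"])
```

Lines that are not records, such as the "# returned ..." summary, raise ValueError and can be skipped by the caller.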
The following three queries look for the greatest number of home runs since 1991, the biggest hitting year in the 1980s, and the years with the highest batting averages, respectively. The queries demonstrate the use of some of the available operators:
$ stormtextclient -d ALSTATS -f "HR > 2500 && YEAR > 1990" -i ""
...
# returned 4 records in 0.725780 seconds
$ stormtextclient -d ALSTATS -f "HR + TRIPLES > 3000 && \
YEAR >=1980 && YEAR <= 1989" -i ""
<YEAR=1987 RUNSPERGAME=4.9000001 RUNS=11112 GAMES=2268 ATBATS=77819 HITS=20620 DOUBLES=3667 TRIPLES=461 HR=2634 BB=7812 KO=13442 BAVG=0.26499999 ONBASEPCT=0.33199999 SLG=0.42500001 SB=1734 CS=772 AVGAGE=28.4>
# returned 1 records in 0.672526 seconds
$ stormtextclient -d ALSTATS -f "BAVG > 0.280" -i ""
# returned 15 records in 0.666577 seconds
We see that there were four years in the 1990s in which home runs were
very high and that 1987 was the biggest hitting year of the 1980s. The
final query, looking at batting averages, shows how to enter floating
point numbers: the STORM parser requires that floating point numbers
between -1.0 and 1.0 have a 0 before the dot. Note that, if you want to
use integer constants greater than 2^63-1 in your range, put a
suffix of lower-case u on the number, for example
18446744073709551615u. The following arithmetic and comparison operators
are available for filter strings: plus, minus, times, divide, equals,
less-than, greater-than, greater-than-or-equal, less-than-or-equal, and
parentheses (to establish precedence for evaluation):
+ - * / == < > >= <= ( )
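The integer-constant rule above is easy to get wrong when filter strings are generated by a script. The following Python sketch shows one way to apply it; format_constant is a hypothetical helper, not part of STORM. It relies only on the rule stated in this manual (constants greater than 2^63-1 take a lower-case u suffix) and on the fact that the manual's example constant equals 2^64-1.

```python
INT64_MAX = 2**63 - 1   # largest constant that needs no suffix

def format_constant(value):
    """Render an integer for a filter string, adding 'u' above 2^63-1."""
    if value > INT64_MAX:
        return "%du" % value
    return str(value)

print(format_constant(5000))
print(format_constant(2**64 - 1))
```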
In this example, we'll be working with a slightly larger dataset to demonstrate the query speed gains that can be achieved through indexing. You can download this dataset here:
http://bmi.osu.edu/resources/software/storm/datasets/bigd.bin
(Note: the file is 16MB).
Use the storm metadata tool to add the schema: set X to UINT8 and Y
to INT4, and call the schema BIGDATA. Then add the dataset. You
should get the following output when you list the schema and datasets:
$ storm list schema
BIGDATA [3]
<FixedSize> = 12
X = UINT8 1 0
Y = INT4 1 8
$ storm list datasets
Datasets:
[3]
...
BIGDATA [2]
<DatasetSchema> = BIGDATA
Data-0 = /home/petri/storm_env/bigd.bin
akron.bmi.ohio-state.edu BinaryExtractor "little_endian"
Index-0 = <Y> RtreeIndex (akron.bmi.ohio-state.edu {0}
"/home/petri")
We are querying for records whose Y column has a value less than or equal to 5000. The first query uses no index:
$ stormtextclient -d BIGDATA -f "Y <= 5000" -i ""
<X=35545 Y=3722>
<X=504644 Y=1210>
<X=678000 Y=4686>
<X=1011602 Y=4125>
# returned 4 records in 5.045873 seconds
Our query returned in a little over five seconds. Next, let's add an index on the Y column to attempt to make the query quicker. By default, STORM will use an R-Tree index. Other indexing strategies can be added (see 'Defining and Extending STORM Extractors and Indexes'). An index allows for very quick lookup on the indexed column.
$ storm add index
Select which table to add index(es) to:
1) ALSTATS
2) BIGDATA
x) exit selection
Enter choice: 2
...
The index is calculated and written to a file in your home directory to be used when you specify an index string. To pass an index string to a query, you use the following syntax:
<column name> RANGE [<low value>,<high value>]
The string in our example uses indexed values from
-2^31 (negative infinity represented as a four byte integer)
to 5000. If you have more than one index associated with a dataset, it
is required that you tell the query which index you want to use. You
do this with the -u flag. It takes the integer value of
the index associated with your dataset. In this example, we are using
the zeroth (and only) index on the dataset:
$ stormtextclient -d BIGDATA -f "Y <= 5000" -i "Y RANGE[-INF,5000]" -u 0
By default, if an index string is specified, STORM uses Index-0, so the -u flag is usually only used when you have more than one index. Using the index, we get back the same data, but our query executes nearly three times as fast:
...
# returned 4 records in 1.739310 seconds
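To see why an index on Y helps, the Python sketch below compares a full scan with a range lookup over a sorted index. It is only an analogy: STORM's default index is an R-Tree, whereas this sketch swaps in a sorted list searched with bisect, and build_index, range_lookup, and full_scan are hypothetical names. The four matching records are taken from the query output above; the two records marked hypothetical are made-up filler so that the filter has something to reject.

```python
import bisect

def build_index(records, column):
    """Sort record positions by one column; return (keys, positions)."""
    positions = sorted(range(len(records)), key=lambda i: records[i][column])
    keys = [records[i][column] for i in positions]
    return keys, positions

def range_lookup(index, low, high):
    """Return positions of records whose key lies in [low, high]."""
    keys, positions = index
    start = bisect.bisect_left(keys, low)
    stop = bisect.bisect_right(keys, high)
    return sorted(positions[start:stop])

def full_scan(records, column, high):
    """Examine every record, as a query without an index must."""
    return [i for i, rec in enumerate(records) if rec[column] <= high]

records = [
    {"X": 35545, "Y": 3722},
    {"X": 40000, "Y": 91000},     # hypothetical non-matching record
    {"X": 504644, "Y": 1210},
    {"X": 600000, "Y": 75000},    # hypothetical non-matching record
    {"X": 678000, "Y": 4686},
    {"X": 1011602, "Y": 4125},
]

index = build_index(records, "Y")
hits = range_lookup(index, float("-inf"), 5000)   # analogous to Y RANGE[-INF,5000]
print(hits == full_scan(records, "Y", 5000))

# Ratio of the two timings reported in this manual
speedup = 5.045873 / 1.739310
print(round(speedup, 2))
```

The lookup touches only the slice of the index that falls inside the range, which is the general reason an indexed query can return the same records faster than a scan.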
STORM also has support in the metadata tool for adding schema and
datasets for files in which the data of interest are not perfectly
sequential. One can hand edit the dataset configuration and schema files
and import them into STORM via the metadata tool. In this case, the tool
is operating non-interactively (ni). The following
example adds the schema and dataset list file for two binary files
of baseball statistics in which there are gaps of uninteresting data
between two of the columns and between full records. The schema includes
byte offsets to specify the locations of the columns in a record.
Import the new schema:
$ storm add schema-ni alstats-fixed.schema
$ storm list schema
...
ALSTATSFIXED [19]
<FixedSize> = 91
<HeaderSkip> = 4
YEAR = INT4 1 0
RUNSPERGAME = FLOAT 1 11
RUNS = INT4 1 15
GAMES = INT4 1 19
ATBATS = INT4 1 23
HITS = INT4 1 27
DOUBLES = INT4 1 31
TRIPLES = INT4 1 35
HR = INT4 1 39
BB = INT4 1 43
KO = INT4 1 47
BAVG = FLOAT 1 51
ONBASEPCT = FLOAT 1 55
SLG = FLOAT 1 59
SB = INT4 1 63
CS = INT4 1 67
AVGAGE = FLOAT 1 71
Notice in the above schema that there is a 7-byte "gap" between the YEAR column and the RUNSPERGAME column. Were the data in the file perfectly sequential, RUNSPERGAME would begin at offset 4, but, above, it begins at offset 11 instead. Byte offsets can be added interactively. But to "pad" the rows (to ignore uninteresting data at the end of a row), you will need to use the non-interactive commands and hand edit your schema. Notice in the first example schema that the total byte length was 68. In the non-interactive example, the total byte length of each record is 91: the original 68 bytes, plus the 7-byte offset and 16 bytes that get ignored at the end of each record. Depending on how many columns you are interested in and their locations within each record, it may be most efficient to always add schema non-interactively.
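The byte arithmetic above (68 bytes of data, a 7-byte gap, and 16 ignored trailing bytes, for 91 in total) can be checked mechanically. The Python sketch below derives a struct format string from the (name, type, offset) entries of the ALSTATSFIXED schema, inserting pad bytes wherever there is a gap. build_format is a hypothetical helper, and the 4-byte encodings assumed for INT4 and FLOAT are assumptions rather than something this manual specifies.

```python
import struct

TYPE_CODES = {"INT4": "i", "FLOAT": "f"}   # assumed 4-byte encodings

SCHEMA = [
    ("YEAR", "INT4", 0), ("RUNSPERGAME", "FLOAT", 11), ("RUNS", "INT4", 15),
    ("GAMES", "INT4", 19), ("ATBATS", "INT4", 23), ("HITS", "INT4", 27),
    ("DOUBLES", "INT4", 31), ("TRIPLES", "INT4", 35), ("HR", "INT4", 39),
    ("BB", "INT4", 43), ("KO", "INT4", 47), ("BAVG", "FLOAT", 51),
    ("ONBASEPCT", "FLOAT", 55), ("SLG", "FLOAT", 59), ("SB", "INT4", 63),
    ("CS", "INT4", 67), ("AVGAGE", "FLOAT", 71),
]

def build_format(schema, fixed_size, byte_order="<"):
    """Build a struct format with 'x' pad bytes for gaps and trailing padding."""
    fmt, pos = byte_order, 0
    for _name, type_name, offset in schema:
        if offset < pos:
            raise ValueError("offsets must ascend and must not overlap")
        if offset > pos:
            fmt += "%dx" % (offset - pos)       # skip uninteresting bytes
        code = TYPE_CODES[type_name]
        fmt += code
        pos = offset + struct.calcsize(byte_order + code)
    if fixed_size < pos:
        raise ValueError("fixed size is smaller than the last attribute's end")
    if fixed_size > pos:
        fmt += "%dx" % (fixed_size - pos)       # pad to the end of the record
    return fmt

fmt = build_format(SCHEMA, 91)
print(fmt, struct.calcsize(fmt))
```

Passing the resulting format to struct.unpack returns only the 17 attribute values, since pad bytes produce no output.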
Append to the dataset list:
$ storm add dataset-ni alstats-fixed.dslist
$ storm list datasets
...
ALSTATSFIXED [4]
<DatasetSchema> = ALSTATSFIXED
Data-0 = /home/petri/storm_env/AL-stats-part1-fixedsize.bin
akron.bmi.ohio-state.edu BinaryExtractor "little_endian"
Data-1 = /home/petri/storm_env/AL-stats-part2-fixedsize.bin
akron.bmi.ohio-state.edu BinaryExtractor "little_endian"
You can now perform queries as normal.
In simplest terms, STORM extractors are entities that fetch data from storage media. They do the work that the index describes to them--that is, what records to fetch and where in the datafile to fetch them. They are responsible for retrieving data, transforming it into tuples (virtual tables, in Fig 2), and collecting tuples into "chunks" for further processing by filters and/or pipelining to the client.
Extractors access storage media as directed by an index and return in-memory images of the requested data. The index describes to an extractor the offsets and sizes of the data records to be extracted. In other words, extractors receive from the index instructions about where in the file the data of interest lies. Extractors have knowledge of the file formats they are designed to handle.
Additionally, extractors must decide how to break a file into chunks for purposes of index creation or normal querying where no index is present. Chunks are user-defined and application specific. Chunks are collections of tuples, grouped so that the tuple data may be efficiently pipelined to clients and/or filters.
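As a simple illustration of chunking, the Python sketch below groups a stream of tuples into fixed-size chunks. This is an assumed policy chosen for brevity: the text above says chunk boundaries are user-defined and application-specific, and chunk_tuples is a hypothetical helper rather than part of the STORM extractor interface.

```python
from itertools import islice

def chunk_tuples(tuples, chunk_size):
    """Yield lists of up to chunk_size tuples, preserving their order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(tuples)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Tuples taken from the BIGDATA query output shown earlier in this manual
tuples = [(35545, 3722), (504644, 1210), (678000, 4686), (1011602, 4125)]
chunks = list(chunk_tuples(tuples, 3))
print(len(chunks), [len(c) for c in chunks])
```

Because the function is a generator, each chunk can be handed to a filter or sent to a client as soon as it is filled, without waiting for the whole dataset.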
STORM defines a default binary extractor. The default extractor will handle the extraction of binary data from an uncompressed file in which data is arranged in record-style, sequential blocks. For many applications, the default extractor is adequate. However, at some point you will likely need to extend STORM's extractors and indexes to meet the needs of your environment and datafiles. The process of extending STORM Indexes and Extractors is covered in the PDF document here:
STORM Extractors and Indexes (PDF file).
A sample extractor developed outside the STORM source tree is available:
stormVariableLengthExtractor.tar.gz.
For technical support, send an email containing your program run output to Benjamin Rutt: rutt@bmi.osu.edu