STORM Documentation

Overview

STORM is services-based middleware, implemented using DataCutter. It allows for select-type queries against large, potentially distributed datasets and for transfer of that data to client applications in a Grid environment. It is designed to provide:

  1. Selection of the data of interest. The data of interest is selected based on either exact values of particular attributes or ranges of attribute values (i.e., range queries). The selection operation can also involve user-defined filtering operations.
  2. Efficient transfer of data from storage nodes to compute nodes for processing. If the data analysis program runs on a cluster, STORM supports application-specific partitioning and parallel transfer of data elements to the destination processors.

STORM provides for queries that can be thought of as taking the following SQL-like form:

  SELECT <Data Elements>
  FROM Dataset1, Dataset2, ... Datasetn
  WHERE <Expression> AND <Filter(<Data Element>)>
  GROUP-BY-PROCESSOR ComputeAttribute (<Data Element>)

In this example, the AND in the WHERE clause combines filtering with built-in operators and filtering with user-defined functions. A user-defined operation might be very difficult or impossible to express as a simple expression and thus requires special filtering. The GROUP-BY-PROCESSOR operation illustrates how data can be partitioned across distributed processors to achieve parallelism. Here, ComputeAttribute is a user-defined function that generates the attribute value on which the selected data elements are grouped, according to the application-specific partitioning of data elements.
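
The query form above can be read as a filter-then-partition pipeline. The following Python sketch only illustrates those semantics; it is not STORM's implementation. The record fields, `user_filter`, `compute_attribute`, and `run_query` are hypothetical names invented for the example.

```python
from collections import defaultdict

# Hypothetical data elements; in STORM these would come from the datasets
# named in the FROM clause.
records = [
    {"x": 1, "y": 10.0},
    {"x": 2, "y": 55.0},
    {"x": 3, "y": 70.0},
    {"x": 4, "y": 90.0},
]

def user_filter(rec):
    """Stand-in for a user-defined filter that a simple expression cannot express."""
    return rec["x"] % 2 == 0 or rec["y"] > 60.0

def compute_attribute(rec, num_processors):
    """Stand-in for ComputeAttribute: picks the destination processor for a record."""
    return rec["x"] % num_processors

def run_query(data, num_processors):
    groups = defaultdict(list)
    for rec in data:
        # WHERE <Expression> AND <Filter(<Data Element>)>
        if rec["y"] >= 50.0 and user_filter(rec):
            # GROUP-BY-PROCESSOR ComputeAttribute(<Data Element>)
            groups[compute_attribute(rec, num_processors)].append(rec)
    return dict(groups)

result = run_query(records, num_processors=2)
for proc, recs in sorted(result.items()):
    print(proc, [r["x"] for r in recs])
# prints:
# 0 [2, 4]
# 1 [3]
```

Each key of `result` corresponds to one destination processor, which is the grouping that a parallel transfer would act on.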

The STORM package includes a GTK-based GUI client to simplify the running of queries. It also includes a command-line client, for systems without the GTK library.

Fig 1: GTK-based STORM GUI client retrieving data from an oil reservoir management dataset

Fig 2: Diagram representing an abstraction of a STORM execution

STORM Execution (see diagram, above)
  1. The Query Service receives a client query and initiates additional services to solve the query.
  2. The Meta-Data Service is used throughout STORM and persistently stores internal information about current datasets, indexes, etc.
  3. The Indexing Service both creates new indexes when requested by the user and utilizes indexes to optimize query evaluation.
  4. During execution of a query that utilizes an index, the Indexing Service informs the Data Source Service which areas of disk contain tuples in the result set. This information is passed as "chunk meta-data" (represented by the ChunkInfo class in the implementation). Datasets consist of application-defined chunks.
  5. Using Extractors, the Data Source Service accesses the data files as specified by the Indexing Service and sends the data in the form of virtual tables (tuples) to the Filtering Service.
  6. The Filtering Service includes both built-in equality and range operators as well as custom filtering functions which can be defined by the user through plug-ins. Result tuples which survive the filtering step are then optionally partitioned before being sent to the Data Mover.
  7. During a typical execution, if the client is a standalone application, filtered query results are passed back through the Data Mover Service. If the client is a parallel program or consists of multiple DataCutter filters, a mapping of data elements is needed in order to partition/distribute the data elements across clients. In such a case, the Partition Generation Service generates a mapping using partition attributes. The Data Mover Service then uses this mapping to transfer the data elements to the appropriate clients. Some data partitioning algorithms may need to process "whole" data in order to produce a mapping. In this case an "Inspection phase" is executed first. Another case where an Inspection phase is required is if the client requests an estimate before receiving the result.
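
Step 7 mentions that some partitioning algorithms must see the whole data before they can produce a mapping. The sketch below shows one way such a two-pass scheme could look: an inspection pass picks boundaries so that each client receives a similar number of elements, and a second pass routes each element. The functions `inspect` and `route` and the sample values are assumptions made for illustration; they are not part of the STORM API.

```python
def inspect(values, num_clients):
    """Inspection phase: look at all partition-attribute values and choose
    upper boundaries that give each client a similar number of elements."""
    ordered = sorted(values)
    step = len(ordered) / num_clients
    # One upper boundary for every client except the last one.
    return [ordered[int(step * (i + 1)) - 1] for i in range(num_clients - 1)]

def route(value, boundaries):
    """Transfer phase: map one data element to a client index."""
    for client, upper in enumerate(boundaries):
        if value <= upper:
            return client
    return len(boundaries)

values = [7, 1, 9, 3, 8, 2]           # hypothetical partition attribute values
boundaries = inspect(values, 3)       # first pass over the whole result
mapping = {v: route(v, boundaries) for v in values}  # second pass
print(boundaries)   # [2, 7]
print(mapping)      # {7: 1, 1: 0, 9: 2, 3: 1, 8: 2, 2: 0}
```

The boundaries cannot be known until every value has been seen, which is why a separate first pass is needed in this kind of partitioning.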

Installation

STORM has been tested on RedHat Linux and AIX. The packages can be installed from source code.

Installation of DataCutter and STORM from Source Code
Requirements:
  1. STORM source code
  2. DataCutter source code
  3. wget (ftp://ftp.gnu.org/pub/gnu/wget)
  4. gunzip, tar, make (installed in most distributions or can be downloaded from GNU)
  5. a C++ compiler
  6. libfl.a
    Flex library. If it is not already installed, the GNU Flex library is available at:
    ftp://ftp.gnu.org/gnu/non-gnu/flex/flex-2.5.4a.tar.gz

    STORM requires this library to build and by default picks it up from /usr/lib or /usr/local/lib. The library is usually already installed on Linux machines; on AIX machines it is typically in /usr/local/lib.
  7. cmake
    A build tool.
    Note that you can set cmake to use the compiler you choose. By default, it will use the first C++ compiler it finds. To specify a particular compiler, set the environment variables CC=<path to C compiler> and CXX=<path to C++ compiler>.
    Download cmake binaries or build from source and put cmake on your path.
    1. Binaries:
         LINUX: http://www.cmake.org/files/v1.8/cmake-1.8.3-x86-linux.tar.gz
         AIX: http://www.cmake.org/files/v1.8/cmake-1.8.3-aix15.tar.gz
    2. Source:
          http://www.cmake.org/files/v1.8/cmake-1.8.3.tar.gz
Installation:
  1. Download the installation script: http://bmi.osu.edu/resources/software/storm/storm-install.sh
  2. Make it executable and run it:
    $ chmod 755 storm-install.sh
    $ ./storm-install.sh  \
      <WORK_AREA> \
      <DATACUTTER_INSTALL_PREFIX> \
      <STORM_INSTALL_PREFIX> \
      <FLEX_LIBRARY_PATH>
    
    e.g., ./storm-install.sh /tmp/work /usr/local /usr/local /usr/lib/libfl.a

Verification

Quickstart Verification (assuming an X-Window environment)

Add the DataCutter and STORM /bin directories to your shell PATH. Also, if the /lib directories are in a nonstandard place in your installation, add them to the LD_LIBRARY_PATH (in Linux) or the LIBPATH (in AIX).

$ dcshell -verbose -debug appd start localhost
$ xterm -e stormd &
$ stormtestbinaryds 1 | grep "results OK"
$ storm stormd stop
$ dcshell all stop

Full Explanation of Quickstart Verification Procedure

  1. Start DataCutter Daemons

    Since STORM depends on DataCutter, you need to start up the DataCutter environment before using STORM.

    If you are running in an X-Window environment and have your DISPLAY set such that you can open X clients (e.g. xterms), run the command:

    dcshell -verbose -debug appd start localhost

    to start a new appd (which manages DataCutter processes) on the local machine. If the command succeeds, you will get two new xterms opened on your display, one running 'dird' (a directory daemon) and one running an 'appd' (an execution environment for DataCutter processes). If your appd has started correctly, you will see output like:

    APPD ARGUMENTS: 13
    0 - appd
    1 - -host
    2 - akron.bmi.ohio-state.edu
    3 - -init-ack
    4 - akron:44543,1
    5 - -DATACUTTER_DIRDHOST
    6 - akron
    7 - -DATACUTTER_DIRDPORT
    8 - 4001
    9 - -DATACUTTER_DIRDTYPE
    10 - Standalone
    11 - -DATACUTTER_COMMPROTOCOL
    12 - TCP
    DataCutter appd v1.4 (02/14/2002)
    Registered with dird: akron:4001
    

    If you are not running in an X-Window session, you'll want to run:

    $ dcshell -noxterm -verbose -debug appd start localhost
    

    and you can monitor the output logs in /tmp/appd.*. The logfile to watch is given to you in the output. (Note: once you are satisfied that the appds are launching correctly, you can omit the -verbose and -debug parts.)

  2. Troubleshooting DataCutter startup

    One of the first things to understand about DataCutter is that it initializes its daemons (local or remote) via ssh. If you have different usernames between local and remote hosts, you will need to set up username maps (with an OpenSSH server this means adding to the ~/.ssh/config file). The username issue shouldn't affect you if you're starting up all services on 'localhost'.

    Secondly, some sshd servers will not execute your shell startup files (.profile, etc.) during remote shell commands, which could mean that the DataCutter and STORM binaries and libraries are not found (normally resulting in 'dird: command not found' type messages). To get around this problem, you can create a file, ~/.datacutter.source, on the node where ssh is being executed. This file must be in your native shell syntax and is read in as DataCutter is initialized.

    As an example, your .datacutter.source file might contain (for ksh):

    PATH=$HOME/dcinstdir/bin:$HOME/storminstdir/bin:$PATH
    LD_LIBRARY_PATH=$HOME/dcinstdir/lib:$HOME/storminstdir/lib
    export PATH
    export LD_LIBRARY_PATH
    
  3. Start a STORM Daemon and Run a Verification Query

    Once DataCutter has been started, you can initialize the STORM daemon. Do this by typing stormd in a new shell session. It will choose a listen port based on your UID. If that port is not adequate, you can either set STORMD_PORT or run stormd 1234, where 1234 is the port you want to listen on.

    Once stormd has started, you can connect to it with a test client, create some binary datasets, query them, index them, and query them using the index. This all happens automatically with the command: stormtestbinaryds 1. After running this test command, if you see a pair of lines of output in this shell that begin with 'results OK', then the install has validated. If you see ERROR: messages, then there is some problem. In that case, you can email the output to Benjamin Rutt (rutt@bmi.osu.edu) and you will be contacted for follow-up support. (Note: If you are using a port other than what is chosen for you, be sure to set STORMD_PORT in this execution environment.)

    It may be easier to just type:

    $ stormtestbinaryds 1 | grep "results OK"

    to sift through the output. If you get two "results OK" lines, STORM is working properly.

  4. Shutdown the STORM Daemon
    $ storm stormd stop
  5. Shutdown DataCutter Daemons
    $ dcshell all stop

Finally, if any processes do not shut down cleanly, you may kill them manually. The processes to kill are named 'dird', 'appd', and 'stormd'.

More Documentation

Full Source Code Documentation

Running STORM on Example Datasets

In this example, we use STORM and DataCutter to run a query on a relatively simple binary file containing American League baseball statistics. The goal in this example is to demonstrate the primary features of STORM.

Download the sample datasets and schema to a convenient location. In the examples below, we have installed STORM and DataCutter in a sub-directory of a user's home directory, called storm_env.

The base directory for the downloadable sample files is here:
http://bmi.osu.edu/resources/software/storm/datasets/
Get these six files:
    AL-stats-part1.bin
    AL-stats-part2.bin
    AL-stats-part1-fixedsize.bin
    AL-stats-part2-fixedsize.bin
    alstats-fixed.dslist
    alstats-fixed.schema

First, we will use the STORM metadata tool (storm) to create and add a schema for our baseball statistics files. In this process, we'll describe the format of the data to STORM, so that it can accurately query the file. We'll then associate the schema with our sample datasets, start DataCutter and STORM, and run a number of queries.

For this example, we'll be running the queries using the command-line query tool (stormtextclient), but know that you could perform the same queries with less typing and prettier output using the GTK-client included with STORM. The GTK-client also supports the ability to sort columns ascending and descending by clicking a column name.

We've tried to pick queries that fully demonstrate the syntax of the query language, including how to use a column index to optimize query response time. We'll run this example on a single host for the sake of simplifying the demonstration. It could as easily be run in a cluster environment (see the DataCutter manual's QuickStart example, which demonstrates DataCutter running in a multi-host, cluster environment). It could also be run on computers on the Grid, provided DataCutter and STORM are installed on all machines being used.

At any time, you can type 'storm help' to get a listing of commands for the metadata tool.

Create a Schema for the Dataset
$ storm add schema

And follow the prompts. The first prompt will ask you to enter a name for the dataset schema. This is not the file name of the dataset; it is an arbitrary name, preferably a short one--easy to type and remember. You'll be asked to outline the structure of the records in the dataset, including (optionally) whether there are offsets between the data of interest and whether the file has a header that should be skipped. (NOTE: With this release of STORM, you must give the offsets in ascending order; for example, you must enter attribute X which starts at record offset 23 before you enter attribute J which starts at record offset 41). You'll also choose the type of extractor. Binary is the default, built-in, extractor type but others can be added (see the next section, 'Defining and Extending STORM Extractors and Indexes').

Schemas are written to a file called DatasetSchema.storm in the .storm directory of your home directory. The list of datasets, hosts, extractors, and file structures is written to DatasetList.storm in the same directory. Here is the output after entering the schema for our example baseball statistics files. You can see that the file has a four-byte header, no offsets between the data, and a total record length of 68 bytes. You can use this output to populate a schema on your own.

$ storm list schema
Schemas:
[1]
 ALSTATS [19]
 <FixedSize>  =   68 
 <HeaderSkip> =   4 
 YEAR  = INT4           1    0 
 RUNSPERGAME  = FLOAT   1    4
 RUNS  = INT4           1    8 
 GAMES  = INT4          1   12
 ATBATS  = INT4         1   16
 HITS  = INT4           1   20
 DOUBLES  = INT4        1   24
 TRIPLES  = INT4        1   28
 HR  = INT4             1   32 
 BB  = INT4             1   36
 KO  = INT4             1   40
 BAVG  = FLOAT          1   44
 ONBASEPCT  = FLOAT     1   48
 SLG  = FLOAT           1   52
 SB  = INT4             1   56
 CS  = INT4             1   60
 AVGAGE  = FLOAT        1   64 
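
To make the layout in this listing concrete, here is a small Python sketch that decodes records with the same shape using the standard `struct` module. It assumes that INT4 is a 4-byte signed integer and FLOAT a 4-byte IEEE float (the offsets show 4 bytes per column, but the signedness is our assumption), and that the file is little-endian as in the dataset listing later in this example. The data here is synthetic: a placeholder 4-byte header followed by one record built from the 1924 values shown in the query output further below. It does not read the real sample files.

```python
import struct

HEADER_SKIP = 4                       # <HeaderSkip> from the schema listing
RECORD = struct.Struct("<if9i3f2if")  # little-endian; INT4 -> i, FLOAT -> f (assumed 32-bit)
FIELDS = ["YEAR", "RUNSPERGAME", "RUNS", "GAMES", "ATBATS", "HITS", "DOUBLES",
          "TRIPLES", "HR", "BB", "KO", "BAVG", "ONBASEPCT", "SLG", "SB", "CS",
          "AVGAGE"]

assert RECORD.size == 68              # matches <FixedSize>

def read_records(blob):
    """Yield one dict per fixed-size record, skipping the file header."""
    body = blob[HEADER_SKIP:]
    for offset in range(0, len(body) - RECORD.size + 1, RECORD.size):
        yield dict(zip(FIELDS, RECORD.unpack_from(body, offset)))

# Synthetic stand-in for a data file: a header of zero bytes plus one record.
sample = (1924, 4.98, 6141, 1234, 42280, 12253, 2197, 551, 397,
          4136, 3249, 0.29, 0.353, 0.396, 749, 581, 28.7)
blob = b"\x00" * HEADER_SKIP + RECORD.pack(*sample)

rows = list(read_records(blob))
print(rows[0]["YEAR"], rows[0]["HR"], round(rows[0]["BAVG"], 3))  # 1924 397 0.29
```

Storing values such as 0.29 in 32-bit floats is consistent with the client printing BAVG=0.28999999 in the query output, since 0.29 has no exact single-precision representation.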

Add the Datasets

Add the dataset(s), its host(s), the extractor you'll use to extract the data from the dataset (the built-in binary extractor, in this example), and the file structure (big- or little-endian). The dataset file name should be the full path on the host system. You must be able to ssh to that host and have the proper privileges to read the file. The host should be a resolvable, fully qualified host name, reachable from the machine on which the stormd process is invoked.

$ storm add dataset

And follow the prompts. To list the dataset names you have added, type 'storm list datasets'. Here is the output showing that our two baseball statistics binary files have been added:

$ storm list datasets
Datasets: 
[1]
   ALSTATS [2]
      <DatasetSchema>  = ALSTATS 
      Data-0  = /home/petri/storm_env/AL-stats-part1.bin
       akron.bmi.ohio-state.edu BinaryExtractor "little_endian"
      Data-1  = /home/petri/storm_env/AL-stats-part2.bin
       akron.bmi.ohio-state.edu BinaryExtractor "little_endian" 

Start DataCutter and STORM

See the Verification section, above, for further details and troubleshooting.

$ dcshell -verbose -debug appd start akron.bmi.ohio-state.edu
$ xterm -e stormd &

Your screen will look something like the graphic below, which shows two xterms spawned to display output from the DataCutter dird and appd and another xterm verifying that stormd is listening on a port. The above two commands start a minimal instance of DataCutter. In a real-world deployment, appds would be spawned on multiple hosts, with the dird managing the numerous appd instances. Wherever you want STORM services to be executed, you should start appds.

Fig 3. STORM and DataCutter daemons running

Perform Queries

The following queries use the command-line querying client, stormtextclient. If you're running on a system that has the GTK graphics library installed, you can use the GUI client, stormgtkclient, instead. The command-line utility takes three parameters (and an optional fourth, which we'll discuss shortly): a data source name (set up when you added your schema), a filter string that has a SQL-like syntax, and an index string. The syntax is:

stormtextclient -d <data source name> -f <filter string> -i  <index string>

Here is a query with an empty filter string, returning all results (truncated) in our datafile:

$ stormtextclient -d ALSTATS -f "" -i ""
   ...
<YEAR=1924 RUNSPERGAME=4.98 RUNS=6141
GAMES=1234 ATBATS=42280 HITS=12253
DOUBLES=2197 TRIPLES=551 HR=397 BB=4136
KO=3249 BAVG=0.28999999 ONBASEPCT=0.35299999
SLG=0.396 SB=749 CS=581 AVGAGE=28.700001>
   ...
# returned 84 records in 0.762919 seconds

The following three queries look for the greatest number of home runs since 1991, the biggest hitting year in the 1980s, and the years with the highest batting averages, respectively. The queries demonstrate the use of some of the available operators:

$ stormtextclient -d ALSTATS -f "HR > 2500 && YEAR > 1990" -i ""
   ...
# returned 4 records in 0.725780 seconds

$ stormtextclient -d ALSTATS -f "HR + TRIPLES > 3000 && \
  YEAR >=1980 && YEAR <= 1989" -i ""
<YEAR=1987 RUNSPERGAME=4.9000001
RUNS=11112 GAMES=2268 ATBATS=77819
HITS=20620 DOUBLES=3667 TRIPLES=461
HR=2634 BB=7812 KO=13442 BAVG=0.26499999
ONBASEPCT=0.33199999 SLG=0.42500001
SB=1734 CS=772 AVGAGE=28.4>
# returned 1 records in 0.672526 seconds

$ stormtextclient -d ALSTATS -f "BAVG > 0.280" -i ""
# returned 15 records in 0.666577 seconds

We see that there were four years after 1990 in which home runs exceeded 2500 and that 1987 was the biggest hitting year of the 1980s. The final query, looking at batting averages, shows how to enter floating point numbers: the STORM parser requires that floating point numbers between -1.0 and 1.0 have a 0 before the dot. Note that, if you want to use integer constants greater than 2^63-1 in your range, put a suffix of lower-case u on the number, for example 18446744073709551615u. The following arithmetic and comparison operators are available for filter strings: plus, minus, times, divide, equals, less-than, greater-than, greater-than-or-equal, less-than-or-equal, and parentheses (to establish precedence for evaluation). As the queries above show, conditions can be combined with &&.

+ - * / == < > >= <= ( ) 
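
As a worked check of the second filter string above, the following Python sketch evaluates the same condition against the 1987 row that the query returned. It assumes that STORM's parser gives `+` higher precedence than the comparisons and the comparisons higher precedence than `&&`, which matches Python's rules for `+`, `>`, and `and` and is consistent with the query result shown. The helper name `matches` is ours.

```python
# Values copied from the 1987 row returned by the second query above.
row = {"YEAR": 1987, "HR": 2634, "TRIPLES": 461, "BAVG": 0.265}

def matches(r):
    # Mirrors: "HR + TRIPLES > 3000 && YEAR >= 1980 && YEAR <= 1989"
    return r["HR"] + r["TRIPLES"] > 3000 and r["YEAR"] >= 1980 and r["YEAR"] <= 1989

print(row["HR"] + row["TRIPLES"], matches(row))   # 3095 True

# 2^63-1 is the largest signed 64-bit value; the example constant with the
# "u" suffix, 18446744073709551615, is 2^64-1.
print(2**63 - 1)                                   # 9223372036854775807
print(18446744073709551615 == 2**64 - 1)           # True
```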

Add an Index to Optimize Query Times

In this example, we'll be working with a slightly larger dataset to demonstrate the query speed gains that can be achieved through indexing. You can download this dataset here:

http://bmi.osu.edu/resources/software/storm/datasets/bigd.bin
(Note: the file is 16MB).

Use the storm metadata tool to add the schema: set X to UINT8 and Y to INT4, and call the schema BIGDATA. Then add the dataset. You should get the following output when you list the schema and datasets:

$ storm list schema
   BIGDATA [3]
      <FixedSize>  = 12 
      X  = UINT8       1    0 
      Y  = INT4        1    8 

$ storm list datasets
Datasets: 
[3]
 ...
 BIGDATA [2]
 <DatasetSchema>  = BIGDATA 
 Data-0  = /home/petri/storm_env/bigd.bin
  akron.bmi.ohio-state.edu BinaryExtractor "little_endian" 
 Index-0  = <Y> RtreeIndex (akron.bmi.ohio-state.edu {0}
  "/home/petri")

We are querying for records whose Y value is less than or equal to 5000. The first query uses no index:

$ stormtextclient -d BIGDATA -f "Y <= 5000" -i ""
<X=35545 Y=3722>
<X=504644 Y=1210>
<X=678000 Y=4686>
<X=1011602 Y=4125>
# returned 4 records in 5.045873 seconds

Our query returned in a little over five seconds. Next, let's add an index on the Y column to attempt to make the query quicker. By default, STORM will use an R-Tree index. Other indexing strategies can be added (see 'Defining and Extending STORM Extractors and Indexes'). An index allows for very quick lookup on the indexed column.

$ storm add index
 Select which table to add index(es) to:
 1) ALSTATS
 2) BIGDATA
 x) exit selection

 Enter choice: 2
  ...

The index is calculated and written to a file in your home directory to be used when you specify an index string. To pass an index string to a query, you use the following syntax:

  <column name> RANGE [<low value>,<high value>]

The string in our example uses indexed values from -2^31 (negative infinity represented as a four-byte integer) to 5000. If you have more than one index associated with a dataset, you must tell the query which index you want to use. You do this with the -u flag, which takes the integer number of the index associated with your dataset. In this example, we are using the zeroth (and only) index on the dataset:

$ stormtextclient -d BIGDATA -f "Y <= 5000" -i "Y RANGE[-INF,5000]" -u 0

By default, if an index string is specified, STORM uses Index-0, so the -u flag is usually only used when you have more than one index. Using the index, we get back the same data, but our query executes nearly three times as fast:

  ...
# returned 4 records in 1.739310 seconds
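
The numbers in this comparison can be checked with a few lines of Python. The timings are the ones reported above and will differ on other machines. The `index_string` helper is a hypothetical convenience for building the index string in the form used in the command above; it is not part of STORM.

```python
INT4_MIN = -2**31            # the value -INF stands for on a four-byte integer column

def index_string(column, high, low=None):
    """Build an index string; a missing lower bound is written as -INF."""
    low_text = "-INF" if low is None else str(low)
    return f"{column} RANGE[{low_text},{high}]"

no_index_s = 5.045873        # reported time without an index
indexed_s = 1.739310         # reported time with the R-Tree index on Y
speedup = no_index_s / indexed_s

print(INT4_MIN)                    # -2147483648
print(index_string("Y", 5000))     # Y RANGE[-INF,5000]
print(round(speedup, 1))           # 2.9
```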

Using the Metadata Tool's Non-Interactive, Advanced Mode

STORM also has support in the metadata tool for adding schema and datasets for files in which the data of interest are not perfectly sequential. One can hand edit the dataset configuration and schema files and import them into STORM via the metadata tool. In this case, the tool is operating non-interactively (ni). The following example adds the schema and dataset list file for two binary files of baseball statistics in which there are gaps of uninteresting data after one of the columns and between full records. The schema includes byte offsets to specify the locations of the columns in a record.

Import the new schema:

$ storm add schema-ni alstats-fixed.schema
$ storm list schema
    ...
   ALSTATSFIXED [19]
      <FixedSize>  = 91 
      <HeaderSkip>  = 4 
      YEAR  = INT4           1    0 
      RUNSPERGAME  = FLOAT   1   11 
      RUNS  = INT4           1   15 
      GAMES  = INT4          1   19 
      ATBATS  = INT4         1   23 
      HITS  = INT4           1   27 
      DOUBLES  = INT4        1   31 
      TRIPLES  = INT4        1   35 
      HR  = INT4             1   39 
      BB  = INT4             1   43 
      KO  = INT4             1   47 
      BAVG  = FLOAT          1   51 
      ONBASEPCT  = FLOAT     1   55 
      SLG  = FLOAT           1   59 
      SB  = INT4             1   63 
      CS  = INT4             1   67 
      AVGAGE  = FLOAT        1   71 

Notice in the above schema that there is a 7-byte "gap" between the YEAR column and the RUNSPERGAME column. Were the data in the file perfectly sequential, we would have jumped from offset 0 to offset 4, but, above, the RUNSPERGAME column begins at offset 11 instead. Byte offsets can be added interactively. But to "pad" the rows (to ignore uninteresting data at the end of a row), you will need to use the non-interactive commands and hand edit your schema. Notice in the first example schema that the total byte length was 68. In the non-interactive example, the total byte length of each record is 91: the same 68 bytes of column data, plus the 7-byte gap and 16 bytes that are ignored at the end of each record. Depending on how many columns you are interested in and their locations within each record, it may be most efficient to always add schema non-interactively.
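
The byte accounting for this schema can be verified mechanically. The sketch below copies the column offsets from the ALSTATSFIXED listing, assumes that every INT4 and FLOAT column occupies 4 bytes, and reports where ignored bytes occur. The function name `find_gaps` is ours.

```python
FIELD_SIZE = 4     # assumed width of every INT4 and FLOAT column
RECORD_SIZE = 91   # <FixedSize> from the ALSTATSFIXED listing

# (name, offset) pairs copied from the listing above
columns = [("YEAR", 0), ("RUNSPERGAME", 11), ("RUNS", 15), ("GAMES", 19),
           ("ATBATS", 23), ("HITS", 27), ("DOUBLES", 31), ("TRIPLES", 35),
           ("HR", 39), ("BB", 43), ("KO", 47), ("BAVG", 51), ("ONBASEPCT", 55),
           ("SLG", 59), ("SB", 63), ("CS", 67), ("AVGAGE", 71)]

def find_gaps(cols, record_size, field_size=FIELD_SIZE):
    """Return (after_column, gap_bytes) for every stretch of ignored bytes."""
    gaps = []
    following = cols[1:] + [("<end of record>", record_size)]
    for (name, off), (_, next_off) in zip(cols, following):
        gap = next_off - (off + field_size)
        if gap:
            gaps.append((name, gap))
    return gaps

gaps = find_gaps(columns, RECORD_SIZE)
print(gaps)                                                   # [('YEAR', 7), ('AVGAGE', 16)]
print(len(columns) * FIELD_SIZE + sum(g for _, g in gaps))    # 91
```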

Append to the dataset list:

$ storm add dataset-ni alstats-fixed.dslist
$ storm list datasets
     ...
   ALSTATSFIXED [4]
      <DatasetSchema>  = ALSTATSFIXED 
      Data-0  = /home/petri/storm_env/AL-stats-part1-fixedsize.bin
       akron.bmi.ohio-state.edu BinaryExtractor "little_endian" 
      Data-1  = /home/petri/storm_env/AL-stats-part2-fixedsize.bin
       akron.bmi.ohio-state.edu BinaryExtractor "little_endian" 

You can now perform queries as normal.

Defining and Extending STORM Extractors and Indexes

In simplest terms, STORM extractors are entities that fetch data from storage media. They do the work that the index describes to them--that is, what records to fetch and where in the datafile to fetch them. They are responsible for retrieving data, transforming it into tuples (virtual tables, in Fig 2), and collecting tuples into "chunks" for further processing by filters and/or pipelining to the client.

Extractors access storage media as directed by an index and return in-memory images of the requested data. The index describes to an extractor the offsets and sizes of the data records to be extracted. In other words, extractors receive from the index instructions about where in the file the data of interest lies. Extractors have knowledge of the file formats they are designed to handle.

Additionally, extractors must decide how to break a file into chunks for purposes of index creation or normal querying where no index is present. Chunks are user-defined and application-specific. Chunks are collections of tuples, grouped so that the tuple data may be efficiently pipelined to clients and/or filters.
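
The following Python sketch illustrates the chunking idea on synthetic data laid out like the BIGDATA records (an 8-byte unsigned X followed by a 4-byte signed Y, assumed little-endian). It is not STORM's extractor interface: the chunk size, the dictionary keys, and both function names are hypothetical, and a simple per-chunk minimum and maximum stands in for a real index such as the R-Tree.

```python
import struct

REC = struct.Struct("<Qi")   # 12-byte record: UINT8 X, INT4 Y (assumed layout)
RECORDS_PER_CHUNK = 4        # hypothetical, application-specific chunk size

def build_chunks(blob):
    """Split a file image into chunks and note each chunk's location and Y range."""
    chunks = []
    chunk_bytes = REC.size * RECORDS_PER_CHUNK
    for start in range(0, len(blob), chunk_bytes):
        piece = blob[start:start + chunk_bytes]
        ys = [y for _, y in REC.iter_unpack(piece)]
        chunks.append({"offset": start, "size": len(piece),
                       "min_y": min(ys), "max_y": max(ys)})
    return chunks

def chunks_for_range(chunks, low, high):
    """Keep only the chunks whose Y range can contain values in [low, high]."""
    return [c for c in chunks if c["max_y"] >= low and c["min_y"] <= high]

# Synthetic data: ten records with Y = 0, 1000, ..., 9000.
blob = b"".join(REC.pack(i, i * 1000) for i in range(10))
chunks = build_chunks(blob)
hits = chunks_for_range(chunks, -2**31, 2500)
print(len(chunks), [c["offset"] for c in hits])   # 3 [0]
```

Only the chunks returned by `chunks_for_range` would need to be read from disk, which is the kind of saving an index is meant to provide.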

STORM defines a default binary extractor. The default extractor will handle the extraction of binary data from an uncompressed file in which data is arranged in record-style, sequential blocks. For many applications, the default extractor is adequate. However, at some point you will likely need to extend STORM's extractors and indexes to meet the needs of your environment and datafiles. The process of extending STORM Indexes and Extractors is covered in the PDF document here:

STORM Extractors and Indexes (PDF document).

A sample extractor developed outside the STORM source tree is available:

stormVariableLengthExtractor.tar.gz

Support

For technical support, send an email containing your program run output to Benjamin Rutt: rutt@bmi.osu.edu