All of us might have run Hadoop batch jobs.
Now the next phase of revolution is here : Real-time data analysis
So Here comes the Twitter Storm : A distributed, fault-tolerant, real-time computation system
Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
(taken from http://storm-project.net/)
Storm Vs. Traditional Hadoop Batch jobs :
Hadoop is fundamentally a batch processing system. Data is introduced into HDFS , processed by the nodes and once process is complete, resulting data is back to HDFS. But the problem was how to perform the realtime data processing. Storm came into the picture to solve this problem.
Storm makes it easy to process unbound stream of data with real-time processing. It process data into topologies and continues the processing data as it arrives.
Storm Components :
- Topology : As on Hadoop, you run "Map-Reduce jobs", on Storm, you will run 'Topologies'. Key difference between both is : MapReduce job eventually finished, whereas a topology runs forever(until you kill it
- Nimbus : master node runs a daemon called "Nimbus" that is similar to Hadoop's "JobTracker". Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
- Supervisor : Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
- Stream : A stream is an unbounded sequence of tuples. Storm processes transforming a stream into a new stream in a distributed and reliable way. For example, transforming tweets stream into a stream of trending topics.
- Spout: It’s a source of streams. It reads tuple from external source and emits those as stream in the topology.
- Bolt : It consumes input streams, does some processing, and possibly emits new streams. Complex stream transformations, like computing a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts. Bolts can do anything from run functions, filter tuples, do streaming aggregations, do streaming joins, talk to databases etc.
Zookeeper cluster works as coordinator between Nimbus and supervisors. Nimbus and supervisor are stateless and fail-fast. All states are kept in Zookeeper or local disk.
Why should anyone use Storm:
Now we have the clear idea of storm in real-time processing. As Hadoop Map-reduce eases the batch processing, in the same way Storm eases the parallel real-time computation.
Following are some key point, that shows the importance of Storm.
- Scalable : It scales massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example, one of Storm's initial applications processed 1,000,000 messages per second on a 10 node cluster, including hundreds of database calls per second as part of the topology. Storm's usage of Zookeeper for cluster coordination makes it scale to much larger cluster sizes.
- Guarantees no loss of data : Storm guarantees that every message will be processed.
- Robust: Goal of Storm is to make user painless for Storm cluster management unlike hadoop clusters.
- Fault-tolerent : If any fault occurs during computation, Storm reassigns tasks.
- Broad set of use cases: Storm can be used for stream processing, database updation, doing continous query on data streams and streaming results into client(continous computation), Distributed RPC.