Sunday, 10 November 2013

Singular Value Decomposition (SVD) in R

When we have a large number of attributes in a dataset, it's hard to identify which of the variables are the most useful. The Singular Value Decomposition (SVD) algorithm helps us reduce the dimensionality of the data.
Here we'll learn how to implement SVD. But before that, we should understand what dimensionality reduction means.

Dimensionality reduction is the process of reducing the number of random variables under consideration, via feature selection and feature extraction. There are many ways to do it; here we will look at SVD and how it helps with dimensionality reduction.

Before looking at the implementation, let's have a brief overview of what SVD is.

Singular Value Decomposition (SVD) is a matrix factorization method used in data mining.
In data mining, this algorithm is used to reduce the number of attributes used in the mining process. The reduction removes data that are unnecessary because they are linearly dependent, from the point of view of linear algebra. For example, imagine a database that contains a field storing the water temperature of several samples and another storing its state (solid, liquid or gas). It's easy to see that the second field depends on the first and, therefore, SVD could easily show us that it is not important for the analysis.

Algorithm:

SVD is the factorization of an m x n real or complex matrix A, with m ≥ n:

A = U S V^T

where U is an m x n matrix with orthonormal columns, S is a diagonal matrix of singular values, and V is an orthogonal n x n matrix, with

U^T U = V^T V = I

where I is an identity matrix.

To compute the SVD:
  • We find the eigenvectors and eigenvalues of A^T A and A A^T.
  • The eigenvectors of A^T A are the columns of V and the eigenvectors of A A^T are the columns of U. The singular values of A, on the diagonal of S, are the square roots of the common positive eigenvalues of A A^T and A^T A (see the short R check after this list).
  • If A is square, A A^T and A^T A have the same eigenvalues; otherwise they share the same nonzero eigenvalues, and the larger of the two simply has extra zero eigenvalues. Either way, the singular values of A come from these common eigenvalues.
  • The number of nonzero singular values of A equals the rank of the matrix, i.e. the number of linearly independent rows or columns.
  • The rank can never exceed min(m, n).
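
As a quick, optional check of the eigenvalue relationship above, here is a small R sketch (it uses a random 4 x 3 matrix, which is not part of the original example):

set.seed(1)
A <- matrix(rnorm(12), nrow = 4, ncol = 3)   # an arbitrary 4 x 3 matrix

s <- svd(A)
s$d^2                        # squared singular values of A
eigen(t(A) %*% A)$values     # equal the eigenvalues of A'A ...
eigen(A %*% t(A))$values     # ... and the nonzero eigenvalues of AA' (plus one extra ~0)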

Now that we have a clear idea of SVD, let's learn how to run it in R.

How to run SVD in R:

Here is an example of how to run it in R.
Suppose we have a 4x3 matrix:
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }
X <- hilbert(4)[, 1:3]   # a 4 x 3 matrix
(s <- svd(X))
D <- diag(s$d)           # diagonal matrix of singular values
s$u %*% D %*% t(s$v)     #  X = U D V'
t(s$u) %*% X %*% s$v     #  D = U' X V

svd(X) returns a list with components:
d : a vector containing the singular values of x, of length min(n, p).
u : a matrix whose columns contain the left singular vectors of x.
v : a matrix whose columns contain the right singular vectors of x.

s$d is a vector of singular values as below:

[1] 1.451914187 0.143312317 0.004228883

Using SVD, we can remove noise and redundant (linearly dependent) components by keeping only the most important singular values. This is very useful in data mining.
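
As a rough sketch of how this enables dimensionality reduction, we can rebuild X from only its largest singular value and see that most of the structure survives (this reuses the X and s computed in the example above):

k <- 1                                        # keep only the largest singular value
X1 <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k, drop = FALSE])
max(abs(X - X1))                              # entry-wise error is bounded by the first
                                              # discarded singular value (about 0.14)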

Thursday, 3 October 2013

Logistic Regression in Mahout

Logistic Regression (LR) is a type of regression analysis used to predict the probability of occurrence of an event. It uses several predictors, which may be either numerical or categorical.
It refers specifically to problems in which the dependent variable is dichotomous, e.g.
predicting whether a patient has a given disease or not, whether a user will buy a product or not, etc.

It can be implemented in Mahout as well as in R. Here we'll talk about the Mahout implementation.
The Mahout implementation uses Stochastic Gradient Descent (SGD), which works well even on large training data sets.

Following are the steps to run LR:

# To train the model -
This step produces a model from the training data that can then be used to classify data in the same format. It takes the training dataset as input and produces the target model.


$MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 100 --rate 60 --input $MAHOUT_HOME/examples/src/main/resources/donut.csv  --features 100 --output output/donutmodel.model --target color  --categories 2 --predictors  x y xx xy yy a b c --types n


"input" :  training data
"output" : path to the file where model will be written.
"target" : dependent variable which is to be predicted
"categories" : number of unique possible values that target can be assigned
"predictors" : list of field names that are to be used to predict target variable
"types" : datatypes for the items in predictor list
"passes" : number of passes over the input data
"features" : size of internal feature vector
"lambda" : amount of co-efficient decay to use
"rate" : initial learning rate

It'll give output like the following, and a model file will be generated at the given location:


Running on hadoop, using /home/hadoop/hadoop-0.20.203.0/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /data/dataAnalytics/mahout-distribution-0.7-CUSTOM/mahout-examples-0.7-job.jar
13/10/03 11:02:48 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.TrainLogistic.props found on classpath, will use command-line arguments only
100
color ~ 6.214*Intercept Term + 0.894*a + -1.255*b + -26.279*c + 4.623*x + -5.436*xx + 3.050*xy + 6.001*y + -6.190*yy
      Intercept Term 6.21450
                   a 0.89445
                   b -1.25489
                   c -26.27914
                   x 4.62344
                  xx -5.43578
                  xy 3.04982
                   y 6.00145
                  yy -6.19029
    0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     4.623441607     0.000000000     0.000000000     6.214498855     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -5.435784604     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000   -26.279139691     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -1.254893124     0.000000000     0.000000000    -6.190291596     0.000000000     0.894450921     0.000000000     3.049819437     0.000000000     0.000000000     0.000000000     0.000000000     6.001446962     0.000000000     0.000000000
13/10/03 11:02:48 INFO driver.MahoutDriver: Program took 616 ms (Minutes: 0.010266666666666667)
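
The "color ~ ..." line above is the learned linear predictor. As a rough illustration of what it means (this is not Mahout's internal scoring, which also involves hashing the predictors into the 100-element feature vector), the predicted probability comes from pushing that linear combination through the logistic function. Here is a small R sketch with a made-up feature vector:

# coefficients as printed by TrainLogistic above
coefs  <- c(Intercept = 6.21450, a = 0.89445, b = -1.25489, c = -26.27914,
            x = 4.62344, xx = -5.43578, xy = 3.04982, y = 6.00145, yy = -6.19029)
# a hypothetical record (values invented purely for illustration)
record <- c(Intercept = 1, a = 0.4, b = 0.6, c = 0.2,
            x = 0.3, xx = 0.09, xy = 0.15, y = 0.5, yy = 0.25)

z <- sum(coefs * record)    # linear predictor
1 / (1 + exp(-z))           # logistic link: probability of the positive class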

# To test the model :
We generated the model in the first step. Now we'll use it on a test set to see how accurately it classifies.


$MAHOUT_HOME/bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input $MAHOUT_HOME/examples/src/main/resources/donut-test.csv --model output/donutmodel.model --auc --confusion


The output would look like this:

Running on hadoop, using /home/hadoop/hadoop-0.20.203.0/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /data/dataAnalytics/mahout-distribution-0.7-CUSTOM/mahout-examples-0.7-job.jar
13/10/03 11:03:13 WARN driver.MahoutDriver: No org.apache.mahout.classifier.sgd.RunLogistic.props found on classpath, will use command-line arguments only
AUC = 0.97
confusion: [[24.0, 2.0], [3.0, 11.0]]
entropy: [[-0.2, -3.4], [-4.8, -0.1]]
13/10/03 11:03:14 INFO driver.MahoutDriver: Program took 130 ms (Minutes: 0.0021666666666666666)


where AUC : area under the (ROC) curve. It ranges from 0 to 1; a value of 0.5 means the model classifies no better than random guessing, while a value of 1 means it classifies every record correctly. The closer it is to 1, the better the model is working.

confusion : the confusion matrix, which shows how the predicted classes compare with the actual ones.
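
As a quick sanity check in R, the overall accuracy can be read directly off the confusion matrix printed above (the exact row/column orientation does not matter for this number):

cm <- matrix(c(24, 2,
               3, 11), nrow = 2, byrow = TRUE)   # values from the RunLogistic output above
sum(diag(cm)) / sum(cm)                          # (24 + 11) / 40 = 0.875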

Now we can use the model generated above to classify our test data and answer the question.

So start using LR for solving your problems!!!!

Monday, 23 September 2013

Installing a Storm Cluster

From the previous post, you should have a clear idea of what Storm is meant for. The next step is to set up a Storm cluster.

Following are the prerequisites for setting up the cluster:
  •  Linux Operating system
  • Java 6 installed
  • Python installed

Installation Steps :
Following steps are needed to get a Storm Cluster up and running.

  • Set up a Zookeeper cluster: Zookeeper is used as the coordinator in a Storm cluster. You can refer here to see the installation steps for Zookeeper.
  • Install native dependencies on the Nimbus and worker machines: Storm requires some native dependencies, namely ZeroMQ and JZMQ. They are needed only for a Storm cluster; in local mode Storm uses a pure-Java messaging system, so you don't need to install them there. In cluster mode, however, they are required.
            ZeroMQ 2.1.7 installation: Storm has been tested with ZeroMQ 2.1.7, and this is the recommended ZeroMQ release to install. You can download it from here. Following are the steps for installation.


tar -xzf zeromq-2.1.7.tar.gz

cd zeromq-2.1.7

./configure

make

sudo make install

JZMQ Installation: JZMQ is the Java binding for ZeroMQ. Here are the steps:


cd jzmq

./autogen.sh

./configure

make

sudo make install

  • Download a Storm release from here and copy it to every machine in the cluster (Nimbus and worker machines).
  • Configure storm.yaml: The Storm release contains a file at conf/storm.yaml with the default configuration for the Storm daemons. Edit it along these lines:
storm.zookeeper.servers:
     - "IP_MACHINE_1"
     - "IP_MACHINE_2"
storm.local.dir: "/home/hadoop/STORM_DIR"
java.library.path: "/usr/local/lib"
nimbus.host: "IP_OF_NIMBUS_MACHINE"
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703

               Copy this storm.yaml into the "~/.storm/" location.
  • Now launch all the Storm Daemons.
hduser@ubuntu:/usr/local/storm-0.8.1$ bin/storm nimbus &
hduser@ubuntu:/usr/local/storm-0.8.1$ bin/storm supervisor &
hduser@ubuntu:/usr/local/storm-0.8.1$ bin/storm ui &

Here is an example to get started with a Storm topology: 
                 https://github.com/nathanmarz/storm-starter

Saturday, 21 September 2013

Classification and Regression Trees(CART)

Classification & Regression Tree is a classification method, technically known as Binary Recursive Partitioning. It uses historical data to construct Decision trees. Decision trees are further used for classifying new data.

Now the question is: where should we use CART?

Sometimes we have problems where we want the answer as "Yes/No",
e.g. "Is the salary greater than 30000?", "Is it going to rain today?", etc.

CART asks "Yes/No" questions. The CART algorithm searches all possible variables and all possible split values in order to find the best split, i.e. the question that divides the data into two parts with maximum homogeneity (a toy illustration of how a split is scored follows the list below).
The key elements of a CART analysis are:
  • Splitting each node in the tree.
  • Deciding whether the tree is complete.
  • Assigning each leaf node to a class outcome.
The result is a decision tree.
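
To make "maximum homogeneity" concrete, here is a toy sketch of how one candidate split can be scored with the Gini index, the default splitting criterion rpart uses for classification. The ages and labels below are made up purely for illustration and are not taken from any real dataset.

gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)                        # 0 = perfectly pure node
}

age    <- c(25, 32, 38, 45, 51, 60, 62, 70)
class  <- c("no", "no", "no", "no", "yes", "yes", "yes", "no")

left   <- class[age < 40]             # records answering "yes" to "age < 40?"
right  <- class[age >= 40]

before <- gini(class)
after  <- (length(left) / length(class)) * gini(left) +
          (length(right) / length(class)) * gini(right)
before - after                        # improvement; CART picks the split that maximizes this
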
CART Modeling via rpart
Classification and regression trees can be generated using the rpart package in R.
Following are the steps to get a CART model:
  • Grow a tree: use the following to grow a tree:
rpart(formula, data, weights, subset, na.action = na.rpart, method,
      model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)

  • Examine the results: several functions help examine the fitted model:
printcp(fit)      : display the cp table
plotcp(fit)       : plot cross-validation results
rsq.rpart(fit)    : plot approximate R-squared and relative error for different splits (2 plots; labels are only appropriate for the "anova" method)
print(fit)        : print results
summary(fit)      : detailed results, including surrogate splits
plot(fit)         : plot the decision tree
text(fit)         : label the decision tree plot
post(fit, file=)  : create a postscript plot of the decision tree
            Here fit is the model object returned by rpart.

  • Prune the tree: pruning helps avoid overfitting the data. Typically, you will want to select a tree size that minimizes the cross-validated error, the xerror column printed by printcp().
Prune the tree to the desired size using:
prune(fit, cp=)

Here is an example of a classification tree:
library(rpart)
dataset <- read.table("C:\\Users\\Nishu\\Downloads\\bank\\bank.csv",header=T,sep=";")
# grow tree
fit <- rpart(y ~ ., method="class", data=dataset )
printcp(fit) # display the results
plotcp(fit) # visualize cross-validation results
summary(fit) # detailed summary of splits
# plot tree
plot(fit, uniform=TRUE,main="Classification Tree for Bank")
text(fit, use.n=TRUE, all=TRUE, cex=.8)
# create attractive postscript plot of tree
post(fit, file = "c:/tree.ps", title = "Classification Tree")
# prune the tree
pfit<- prune(fit, cp=   fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
# plot the pruned tree
plot(pfit, uniform=TRUE,
   main="Pruned Classification Tree")
text(pfit, use.n=TRUE, all=TRUE, cex=.8)
post(pfit, file = "c:/ptree.ps",
   title = "Pruned Classification ")

Now we have the model. The next step is to predict new data based on the trained model.
First we'll split the dataset into training and test data with a fixed percentage.
library(rpart)
dataset <- read.table("C:\\Users\\Nishu\\Downloads\\bank\\bank.csv", header=T, sep=";")
sub <- sample(nrow(dataset), floor(nrow(dataset) * 0.9)) # take 90% of the rows as training data
training <- dataset[sub, ]
testing  <- dataset[-sub, ]
fit <- rpart(y ~ ., method="class", data=training)       # train on the training split only
predict(fit, testing, type="class")
# to get the confusion matrix
out <- table(predict(fit, testing, type="class"), testing$y)


Here the confusion matrix is:
              no      yes
  no      391       25
  yes     13        24
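
If you only want the headline accuracy, it can be computed straight from the table above without caret (since sample() is random, the exact numbers will differ from run to run):

sum(diag(out)) / sum(out)   # fraction of test cases on the diagonal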

# To get the accuracy and other details, use the confusionMatrix method from the caret package

library(caret)
confusionMatrix(out)

Output would be :

no  yes
  no      391  25
  yes      13  24

               Accuracy : 0.9101
                 95% CI : (0.8936, 0.9248)
    No Information Rate : 0.8968
    P-Value [Acc > NIR] : 0.05704
    Kappa : 0.4005
Mcnemar's Test P-Value : 9.213e-08

            Sensitivity : 0.9745
            Specificity : 0.3500
         Pos Pred Value : 0.9287
         Neg Pred Value : 0.6125
             Prevalence : 0.8968
         Detection Rate : 0.8740
   Detection Prevalence : 0.9410


       'Positive' Class : no

So here we have the desired output: the predictions and the accuracy of the model, on the basis of which we can classify future data.

Download this dataset or other datasets from here and test the algorithm.

Here you go..!!!!

Saturday, 14 September 2013

Twitter Storm : Real-time Hadoop

All of us might have run Hadoop batch jobs.
Now the next phase of the revolution is here: real-time data analysis.
So here comes Twitter Storm: a distributed, fault-tolerant, real-time computation system.

Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.


Storm vs. Traditional Hadoop Batch Jobs:

Hadoop is fundamentally a batch processing system. Data is loaded into HDFS, processed by the nodes, and once processing is complete the resulting data is written back to HDFS. The problem was how to perform real-time data processing, and Storm came into the picture to solve it.
Storm makes it easy to process unbounded streams of data in real time. It organizes processing into topologies and keeps processing data as it arrives.


Storm Components :

  • Topology : As on Hadoop you run "MapReduce jobs", on Storm you run "topologies". The key difference between the two is that a MapReduce job eventually finishes, whereas a topology runs forever (until you kill it).
  • Nimbus : The master node runs a daemon called "Nimbus" that is similar to Hadoop's "JobTracker". Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
  • Supervisor :  Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
  • Stream : A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way, for example transforming a stream of tweets into a stream of trending topics.
  • Spout : A spout is a source of streams. It reads tuples from an external source and emits them as a stream into the topology.
  • Bolt : A bolt consumes input streams, does some processing, and possibly emits new streams. Complex stream transformations, like computing a stream of trending topics from a stream of tweets, require multiple steps and thus multiple bolts. Bolts can run functions, filter tuples, do streaming aggregations, do streaming joins, talk to databases, and more.


A Zookeeper cluster acts as the coordinator between Nimbus and the supervisors. Nimbus and the supervisors are stateless and fail-fast; all state is kept in Zookeeper or on local disk.

Why should anyone use Storm:

Now we have a clear idea of Storm's role in real-time processing. Just as Hadoop MapReduce eases batch processing, Storm eases parallel real-time computation.
Following are some key points that show the importance of Storm:
  • Scalable : Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example, one of Storm's initial applications processed 1,000,000 messages per second on a 10-node cluster, including hundreds of database calls per second as part of the topology. Storm's use of Zookeeper for cluster coordination lets it scale to much larger cluster sizes.
  • Guarantees no loss of data : Storm guarantees that every message will be processed.
  • Robust : A goal of Storm is to make cluster management painless for the user, unlike Hadoop clusters.
  • Fault-tolerant : If any fault occurs during computation, Storm reassigns tasks as needed.
  • Broad set of use cases : Storm can be used for stream processing, updating databases, running continuous queries on data streams and streaming the results to clients (continuous computation), and distributed RPC.