Tuesday 20 May 2014

How To Work Out the Naive Bayes Algorithm

A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. The main advantage of naive Bayes is that, because the attributes are assumed to be independent, it requires only a small amount of training data to estimate the probabilities needed for classification.

In general, Machine Learning algorithms need to be trained for supervised learning tasks like classification and prediction.

Training means exposing an algorithm to particular inputs so that later it can be tested on unknown inputs (which it has never seen before) and can classify or predict them (in the case of supervised learning) based on what it has learned. Most Machine Learning techniques, such as Neural Networks, SVMs and Bayesian classifiers, are based on this idea.

How to Apply Naive Bayes to Predict an Outcome
Let's try it out using an example.



In the training data above we have two class labels for buys_computer: no and yes. Each tuple is described by four attributes.


1. Whether the age is youth, middle_aged or senior.
2. Whether the income is high, low or medium.
3. Whether the person is a student or not.
4. Whether the credit rating is excellent or fair.


Several probabilities need to be pre-computed from the training dataset before we can predict the class of a new tuple.


Prior Probabilities
-------------------

P(yes) = 9/14 = 0.643
  The universe is 14 training tuples = yes(9) + no(5); 9 of them are yes.
P(no) = 5/14 = 0.357
  The universe is 14 training tuples = yes(9) + no(5); 5 of them are no.

Probability of Likelihood
-------------------------

P(youth|yes) = 2/9 = 0.222
  Given that the class label is "yes", the universe is 9 tuples; 2 of them are youth.
P(youth|no) = 3/5 = 0.600
  Given that the class label is "no", the universe is 5 tuples; 3 of them are youth.
...
P(fair|yes) = 6/9 = 0.667
P(fair|no) = 2/5 = 0.400

How to classify an outcome



Suppose we are given a tuple X whose class label (buys_computer) is unknown. We are told that its attribute values are


X => age = youth, income = medium, student = yes, credit rating = fair

We need to 

 Maximize P(X|Ci)P(Ci), for i = 1, 2

P(Ci), the prior probability of each class, can be computed from the training tuples:



P(X|yes)P(yes)
      = P(youth|yes) * P(medium|yes) * P(student=yes|yes) * P(fair|yes) * P(yes)
      = (0.222 * 0.444 * 0.667 * 0.667) * 0.643
      = 0.028

P(X|no)P(no)
      = P(youth|no) * P(medium|no) * P(student=yes|no) * P(fair|no) * P(no)
      = (0.600 * 0.400 * 0.200 * 0.400) * 0.357
      = 0.007

Since 0.028 > 0.007, we classify this youth/medium/student/fair tuple as likely to be "yes".


Therefore, the naive Bayesian classifier predicts buys_computer = yes for tuple X.
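
To make the arithmetic concrete, here is a minimal Java sketch. The class name NaiveBayesDemo and the hard-coded fractions are purely illustrative (they are just the priors and likelihoods from the tables above, not part of any library); the sketch plugs them into P(X|Ci)P(Ci) and picks the class with the larger score.

// NaiveBayesDemo.java - illustrative only: hard-codes the probabilities
// worked out above for X = (age=youth, income=medium, student=yes, credit=fair).
public class NaiveBayesDemo {
    public static void main(String[] args) {
        // Prior probabilities
        double pYes = 9.0 / 14;   // P(yes) = 0.643
        double pNo  = 5.0 / 14;   // P(no)  = 0.357

        // Likelihoods for X: {youth, medium, student=yes, fair}
        double[] likelihoodYes = {2.0 / 9, 4.0 / 9, 6.0 / 9, 6.0 / 9};
        double[] likelihoodNo  = {3.0 / 5, 2.0 / 5, 1.0 / 5, 2.0 / 5};

        // Multiply the likelihoods into the priors: P(X|Ci)P(Ci)
        double scoreYes = pYes;
        double scoreNo  = pNo;
        for (int i = 0; i < likelihoodYes.length; i++) {
            scoreYes *= likelihoodYes[i];
            scoreNo  *= likelihoodNo[i];
        }

        System.out.printf("P(X|yes)P(yes) = %.3f%n", scoreYes);   // ~0.028
        System.out.printf("P(X|no)P(no)   = %.3f%n", scoreNo);    // ~0.007
        System.out.println("Prediction: buys_computer = "
                + (scoreYes > scoreNo ? "yes" : "no"));
    }
}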


Saturday 17 May 2014

Count Frequency Of Values In A Column Using Apache Pig


There may be situations where we need to count the occurrences of a value in a field.
Let this be the sample input:


user_id   course_name user_name
1           Social      Anju
2           Maths       Malu
1           English     Anju
1           Maths       Anju

Say we need to calculate the number of occurrences of each user_name:
Anju 3
Malu 1

To achieve this, the COUNT built-in function can be used.


COUNT Function in Apache Pig


The COUNT function computes the number of elements in a bag.
For group counts a preceding GROUP BY statement is required, and for global counts a GROUP ALL statement is required.

The basic idea for the above example is to group by user_name and count the tuples in each bag.


--count.pig

 userAlias = LOAD '/home/sreeveni/myfiles/pig/count.txt' as 
             (user_id:long,course_name:chararray,user_name:chararray);
 groupedByUser = group userAlias by user_name;
 counted = FOREACH groupedByUser GENERATE group as user_name,COUNT(userAlias) as cnt;
 result = FOREACH counted GENERATE user_name, cnt;
 store result into '/home/sreeveni/myfiles/pig/OUT/count';

The COUNT function ignores NULLs, that is, a tuple in the bag will not be counted if the first field in that tuple is NULL.
COUNT_STAR can be used instead to count all tuples, including those with NULL values.




Monday 12 May 2014

Configuring PasswordLess SSH for Apache Hadoop


In pseudo-distributed mode, we have to start daemons, and to do that, we need to have SSH installed. Hadoop doesn’t actually distinguish between pseudo-distributed and fully distributed modes: it merely starts daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process. Pseudo-distributed mode is just a special case of fully distributed mode in which the (single) host is localhost, so we need to make sure that we can SSH to localhost and log in without having to enter a password.

If you cannot ssh to localhost without a passphrase, execute the following commands:

unmesha@unmesha-hadoop-virtual-machine:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/unmesha/.ssh/id_rsa): [press enter]
Enter passphrase (empty for no passphrase): [press enter]
Enter same passphrase again: [press enter]
Your identification has been saved in /home/unmesha/.ssh/id_rsa.
Your public key has been saved in /home/unmesha/.ssh/id_rsa.pub.
The key fingerprint is:
61:c5:33:9f:53:1e:4a:5f:e9:4d:19:87:55:46:d3:6b unmesha@unmesha-virtual-machine
The key's randomart image is:
+--[ RSA 2048]----+
|         ..    *%|
|         .+ . ++*|
|        o  = *.+o|
|       . .  = oE.|
|        S    ..  |
|                 |
|                 |
|                 |
|                 |
+-----------------+

unmesha@unmesha-hadoop-virtual-machine:~$ ssh-copy-id localhost
unmesha@localhost's password: 
Now try logging into the machine, with "ssh 'localhost'", and check in:

  ~/.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.

Now you will be able to ssh without a password.

unmesha@unmesha-hadoop-virtual-machine:~$ ssh localhost
Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-23-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

Last login: Tue Apr 29 17:48:55 2014 from amma-hp-probook-4520s.local
unmesha@unmesha-virtual-machine:~$ 

Happy Hadooping ...

Sunday 4 May 2014

Map-Only Jobs In Hadoop


There may be cases where a map-only job is needed, that is, where there is no Reducer to execute. Each Mapper then does all of the work on its own InputSplit and nothing is left for a Reducer. This can be achieved by setting the number of reduce tasks to zero with job.setNumReduceTasks(0) in the driver.

Job job = new Job(getConf(), "Map-Only Job");
job.setJarByClass(MaponlyDriver.class);

job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
/*
 * Set no of reducers to 0
 */
job.setNumReduceTasks(0);

job.setMapperClass(Mapper.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

boolean success = job.waitForCompletion(true);
return(success ? 0 : 1);

This sets the number of reduce tasks to 0 and turns off the Reducer:

job.setNumReduceTasks(0);

So the number of output files will be equal to the number of mappers, and the output files will be named part-m-00000, part-m-00001 and so on.

And once the number of reduce tasks is set to zero, the output will be unsorted, because the shuffle and sort phase never runs.

If we do not set this property, a single identity Reducer runs by default, which simply emits each incoming key and value unchanged, and the output file will be named part-r-00000.
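
As an illustration, here is a sketch of a custom Mapper that could replace the identity Mapper.class in the driver above. The class name UpperCaseMapper is hypothetical, not something shipped with Hadoop; the point is simply that, with zero reduce tasks, whatever map() writes to the context goes straight into the part-m-* output files.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative Mapper for a map-only job: each input line is upper-cased
// and written out directly, since no shuffle, sort or reduce phase runs.
public class UpperCaseMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        outValue.set(value.toString().toUpperCase());
        context.write(key, outValue);   // lands directly in part-m-0000x
    }
}

Wire it in with job.setMapperClass(UpperCaseMapper.class) in the driver shown above.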



Happy Hadooping ...

Saturday 3 May 2014

Hadoop Installation Using Cloudera Package - Pseudo Distributed Mode (Single Node)

[Previous Post]

Hadoop can also be installed using Cloudera's packages, in fewer and easier steps. The difference is that Cloudera packs Apache Hadoop and several ecosystem projects into one package, and the pseudo-distributed configuration is already set to localhost, so we do not need to edit the configuration files ourselves.

Installation using Cloudera Package.

Prerequisite

1. Java


Installation Steps

Step 1: Set JAVA_HOME in ~/.bashrc

unmesha@unmesha-hadoop-virtual-machine:~$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

Check the current location of Java:

unmesha@unmesha-hadoop-virtual-machine:~$ sudo update-alternatives --config java
[sudo] password for unmesha: 
There is only one alternative in link group java: /usr/lib/jvm/java-7-oracle/jre/bin/java
Nothing to configure.

Set JAVA_HOME (add the following line to ~/.bashrc, then reload it):

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
unmesha@unmesha-hadoop-virtual-machine:~$ source ~/.bashrc 

Step 2: Download the CDH repository package for your system from Cloudera's download page, under the "On Ubuntu and other Debian systems, do the following:" heading.

Step 3: Install the repository package

unmesha@unmesha-hadoop-virtual-machine:~$sudo dpkg -i cdh4-repository_1.0_all.deb


Step 4: Install Hadoop

unmesha@unmesha-hadoop-virtual-machine:~$sudo apt-get update 
unmesha@unmesha-hadoop-virtual-machine:~$sudo apt-get install hadoop-0.20-conf-pseudo


Step 5: Format Namenode

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hdfs namenode -format


Step 6: Start HDFS

unmesha@unmesha-hadoop-virtual-machine:~$for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done


Step 7: Create the /tmp Directory

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -mkdir /tmp 
unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -chmod -R 1777 /tmp


Step 8: Create the MapReduce system directories

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred


Step 9: Verify the HDFS File Structure

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -ls -R /


Step 10: Start MapReduce

unmesha@unmesha-hadoop-virtual-machine:~$for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done


Step 11: Set up user directory

unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -mkdir /user/<your username>
unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -chown <user> /user/<your username>
unmesha@unmesha-hadoop-virtual-machine:~$sudo -u hdfs hadoop fs -mkdir /user/unmesha/new


Step 12: Run the grep example (you can also try out the wordcount example)

unmesha@unmesha-hadoop-virtual-machine:~$/usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'

Step 13: You can also stop the services

unmesha@unmesha-hadoop-virtual-machine:~$for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done

unmesha@unmesha-hadoop-virtual-machine:~$for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x stop ; done


Happy Hadooping ...