hive-pig_poc

Click stream analysis using hive and pig an POC

Click stream analysis suing hive and pig an POC

clickstream

All the e-commerce portals store the user activities on their site as clickstream activity and later they analyze it to identify what the user has browsed and show the appropriate recommendations when the user visits the site again.

Click Here for POC video about the same

Input 1:

1,John,20,India,M
2,Mike,5,India,M
3,Peter,15,India,F

Input 2

1,www.bbc.com
1,www.abc.com
1,www.gmail.com
2,www.cnn.com
2,www.eenadu.net
2,www.stackoverflow.com
2,www.businessweek.com
3,www.eenadu.net
3,www.stackoverflow.com
3,www.businessweek.com


 

Here is the scripts and queries 

1. Click stream analysis using hive.

CREATE TABLE user (user_id INT,name STRING,age INT,country STRING,gender STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ stored as textfile;


LOAD DATA LOCAL INPATH ‘/home/hadoop/Music/userdata’ OVERWRITE INTO TABLE user;


CREATE TABLE clickstream (userid INT,url STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’ stored as textfile;

 

LOAD DATA LOCAL INPATH ‘/home/hadoop/Music/clickstream’ OVERWRITE INTO TABLE clickstream;


 

SELECT url,count(url) c FROM user u JOIN clickstream c ON (u.user_id=c.userid) where u.age<16 group by url order by c DESC LIMIT 3;


 


 

2. Click stream analysis using pig.

Users = load ‘/userdata’ using PigStorage(‘,’) as (user_id, name, age:int, country, gender);


Fltr = filter Users by age <= 16;


PageUrl = load ‘/clickstream’ using PigStorage(‘,’) as (userid, url);


Jn = join Fltr by user_id, PageUrl by userid;


Grp = group Jn by url;


Cou = foreach Grp generate group, COUNT(Jn) as clicks;


 

Ord = order Cou by clicks desc;


Top3 = limit Ord 3;


dump Top3;