
SparkR installation


I am excited to announce that the upcoming Apache Spark 1.4 release will include SparkR, an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell.

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited, as the R runtime is single-threaded and can only process data sets that fit in a single machine's memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark; by using Spark's distributed computation engine, it lets us run large-scale data analysis from the R shell.



SparkR installation with Hive

What is the R language?
R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.



What is Spark?
Apache Spark™ is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.



What is a data frame in R?
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
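A small illustrative example in plain R (the column names and values here are arbitrary):

# each column holds one variable, each row holds one case
>df <- data.frame(sno = c(1, 2), name = c("alice", "bob"))
>head(df)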


Prerequisites

1. sudo apt-get update
2. sudo apt-get install openjdk-7-jdk
3. sudo apt-get install r-base
4. Download jdk1.7.0_45 and set JAVA_HOME in .bashrc.
5. Set HADOOP_HOME and HIVE_HOME in .bashrc (you have to install Hadoop and Hive before starting this); see the sketch after this list.
6. Reload the file by running source ~/.bashrc.
7. Configure Hive with a remote metastore backed by MySQL.
8. sudo apt-get install git
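A minimal sketch of the .bashrc entries for steps 4 and 5 (the paths below are assumptions; replace them with your actual install locations):

# assumed install locations - adjust to your machine
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_45
export HADOOP_HOME=/home/hadoop/hadoop
export HIVE_HOME=/home/hadoop/hive
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin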


Spark with R and Hive


1. Download spark-1.4.1.
2. Build Spark:
cd spark-1.4.1
Type the following command to build:
sbt/sbt assembly (it will take some time, so go and have a cup of tea)

cd R
./install-dev.sh


cd ..

3. Start Hadoop and check that all the daemons are up and running with jps.


4. Start the Hive server:
bin/hive --service hiveserver


5. Copy hive-site.xml from the Hive conf folder to the Spark conf folder.
6. Start sparkR with the MySQL connector jar:
Example: sudo bin/sparkR --jars <path of the jar>
sudo bin/sparkR --jars mysql-connector.jar


7. Now start running Hive queries from SparkR.

# sc is an existing SparkContext. HiveContext can access tables in the Hive MetaStore.
>hiveContext <- sparkRHive.init(sc)
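Assuming hive-site.xml was copied correctly in step 5, a quick sanity check is to list the tables already registered in the metastore:

>head(sql(hiveContext, "show tables"))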


# Create a Hive table

>sql(hiveContext, "CREATE TABLE src (sno INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE")



# Load data into the Hive table

>sql(hiveContext, "LOAD DATA LOCAL INPATH '/home/hadoop/data' INTO TABLE src")
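For reference, the file at /home/hadoop/data should match the table's comma-delimited layout; an illustrative example with made-up rows:

1,john
2,mary
3,alex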


# Queries can be expressed in HiveQL.

>results <- sql(hiveContext, "FROM src SELECT name")


# create SparkR DataFrames from Hive tables

# results is now a DataFrame
>head(results)
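Since results is an ordinary SparkR DataFrame, the usual DataFrame operations work on it too; a minimal sketch:

# number of rows returned by the query
>count(results)
# the schema should show the single name column
>printSchema(results)
# bring the full result back as a local R data.frame
>collect(results)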


Finally we got the output; this is also a small POC of SparkR with Hive. You can also access all the tables created here from Hive itself by starting it separately.
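For example, assuming Hive is started separately against the same MySQL metastore, the table created above should be queryable from the Hive shell as well:

hive> show tables;
hive> select * from src;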