I am excited to announce that the upcoming Apache Spark 1.4 release will include SparkR, an R package that allows data scientists to analyze large datasets and interactively run jobs on them from the R shell.
R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited, as the runtime is single-threaded and can only process data sets that fit in a single machine's memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark; by using Spark's distributed computation engine, it lets us run large-scale data analysis from the R shell.
SparkR installation with Hive
What is R language?
R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
What is Spark?
Apache Spark™ is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.
What is DataFrame in R?
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
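As a tiny illustration, here is a data frame built in plain R; the column names sno and name are placeholders chosen to mirror the Hive table created later in this post:

```r
# Each column holds one variable, each row one case
df <- data.frame(sno = c(1, 2, 3),
                 name = c("alice", "bob", "carol"))
nrow(df)  # number of cases: 3
ncol(df)  # number of variables: 2
```

SparkR's DataFrame follows this same tabular model, but the data can be distributed across a cluster instead of living in one machine's memory.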
1.sudo apt-get update.
2.sudo apt-get install openjdk-7-jdk.
3.sudo apt-get install r-base.
4.Download jdk1.7.0_45 and set JAVA_HOME in .bashrc.
5.Set your HADOOP_HOME and HIVE_HOME in .bashrc (you have to install hive and hadoop before starting this).
6.Apply the changes with source ~/.bashrc.
7.Configure your Hive with a remote metastore backed by MySQL.
8.sudo apt-get install git
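For step 7, a minimal hive-site.xml sketch for a MySQL-backed remote metastore is shown below; the host, port, database name, user, and password are placeholder values that you must replace with your own:

```xml
<configuration>
  <!-- JDBC connection to the MySQL database that stores the metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
  <!-- Address of the remote metastore service -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```

This same file is the one copied into Spark's conf folder later, which is how sparkR finds the metastore.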
Spark with R and hive
1.Clone the Spark source code: git clone https://github.com/apache/spark.git
2.Build Spark by typing the following command:
sbt/sbt assembly (it will take some time, so go and have a cup of tea).
3.Start your Hadoop and check that all the daemons are up and running by giving jps.
4.Start your Hive server:
bin/hive --service hiveserver
5.Copy the hive-site.xml from the Hive conf folder to the Spark conf folder.
6.Start your sparkR with the MySQL connector on the classpath.
Example: sudo bin/sparkR --jars <path of the jar>
>sudo bin/sparkR --jars mysql-connector.jar
7.Now run the Hive queries from sparkR:
# sc is an existing SparkContext; create a HiveContext, which can access tables in the Hive metastore.
>hiveContext <- sparkRHive.init(sc)
# Create a hive table
>sql(hiveContext, "CREATE TABLE src (sno INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE")
# Load data into the Hive table
>sql(hiveContext, "LOAD DATA LOCAL INPATH '/home/hadoop/data' INTO TABLE src")
# Queries can be expressed in HiveQL.
>results <- sql(hiveContext, "FROM src SELECT name")
# results is now a SparkR DataFrame created from the Hive table
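Assuming the load succeeded, the resulting DataFrame can be inspected right from the sparkR shell; this sketch uses the SparkR 1.4 DataFrame API and requires the live hiveContext session started above:

```r
printSchema(results)          # shows the single name column
head(results)                 # first few rows as a local R data.frame
count(results)                # number of rows in the Hive table
localDF <- collect(results)   # pull the full result into local R memory
```

Note that collect brings the whole result back to the driver, so use it only when the output is small enough to fit on one machine.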
Finally we got the output; this is a small POC of SparkR with Hive. You can also access all the tables you created here directly from Hive by starting it separately.