Category Archives: Tech Blog

Streaming with Spark Kafka


Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Apache Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.

Video: Click here

Code


import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils

object sparkkafka {

  def main(args: Array[String]) {
    // Application name and Spark master URL
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("spark://big-virtual-machine:7077")
    // Streaming context with a 2-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Receiver-based Kafka stream: ZooKeeper quorum, consumer group,
    // and a map of topic -> number of receiver threads ("customer" topic, 5 threads)
    val lines = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("customer" -> 5))
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>sparkkafka</groupId>
  <artifactId>sparkkafka</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
   <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.5.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.5.1</version>
    </dependency>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

HBase (NoSQL) single node installation


Prerequisites 

Hadoop version 2 (click here for Hadoop 2 installation)

Click here for the video

HBase installation steps

1. Extract the HBase tar file

2. Open the conf directory

3. Set JAVA_HOME in hbase-env.sh

hbase-env.sh

———————–

export JAVA_HOME=/home/big/jdk1.7.0_45

hbase-site.xml

————————-

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:50000/hbase</value>
</property>
<property>
  <name>hbase.master.port</name>
  <value>60001</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>localhost</value>
</property>
<property>
  <name>hbase.zookeeper.property.maxClientCnxns</name>
  <value>35</value>
</property>

—————————–

Start HBase

bin/start-hbase.sh

—————————–

Open your browser and go to

localhost:60010 (HBase web UI)

——————————-

Open the HBase shell

bin/hbase shell

Hadoop 2 single node Installation on Linux


This post gives you the steps to install Hadoop 2 on Linux (Ubuntu).

Click here for video

Steps

1. Set JAVA_HOME in the environment file (.bashrc)

2. Generate an SSH key (ssh-keygen) for passwordless communication

3. Start your Hadoop installation

4. Extract the Hadoop tar file and open the configuration files (etc/hadoop/)

 

core-site.xml

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:50000</value>
</property>
————————————
mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
————————————
hdfs-site.xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/big/hadoop2-dir/namenode-dir</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/big/hadoop2-dir/datanode-dir</value>
</property>
————————————
yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
————————————
hadoop-env.sh
export JAVA_HOME=/home/big/jdk1.7.0_45
————————————
yarn-env.sh
export JAVA_HOME=/home/big/jdk1.7.0_45
————————————
mapred-env.sh
export JAVA_HOME=/home/big/jdk1.7.0_45

————————————-

Format the Hadoop NameNode

bin/hadoop namenode -format

Start Hadoop

sbin/start-all.sh

—————————————

Open your browser and type

localhost:50070 (NameNode web UI)

Hadoop 2 Installation on Windows

Hadoop 2 Installation on Windows without "Cygwin"

Recent Hadoop 2 releases support installation on Windows and include all the .cmd files needed for configuration. Steps to follow:

Click here for the video

Step 1: You need the Hadoop winutils binaries. Click here to download them.

Step 2: Download Java 1.7 for Windows and set the path in the environment variables.

Continue reading Hadoop 2 Installation on Windows

Apache Hive installation with ACID

Follow the list of steps below to install Hive with transaction support.

Video reference: Click here

Set this in your command line:

export HADOOP_USER_CLASSPATH_FIRST=true

Set the environment path and variables in .bashrc

Set HADOOP_HOME in .bashrc

Download the MySQL connector JAR and move it into the apache-hive-1.2.1 lib folder (here we are using 1.2.1); this step is applicable for all versions of Hive.

1) Extract the tar file of apache-hive
2) hadoop dfs -chmod 700 /tmp
3) Set or add the below properties in hive-site.xml inside the conf directory (by default, the conf folder does not have this XML file); they are shown here in Hive set syntax.

set hive.support.concurrency = true;
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1;

4) bin/hive

5) If you are using MySQL as the metastore, run the transaction schema script.
Enter the MySQL shell and run:
source /home/hadoop/apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-txn-schema-0.14.0.mysql.sql

6) Create a table with transactional = true (the table must be bucketed and stored as ORC)

create table acid_table (id INT, name STRING, country STRING, salary INT) clustered by (id) into 4 buckets
stored as orc TBLPROPERTIES ('transactional'='true');

7) Insert

insert into table acid_table values(1,'john','IND',50000);

8) Update

UPDATE acid_table SET salary = 300 WHERE id = 1;

9) Delete

delete from acid_table where id=1;

Scala Partially Applied Functions

When you invoke a function, you are said to be applying the function to its arguments. If you pass all the expected arguments, you have fully applied it. If you pass only some of them, you get back a partially applied function. This gives you the convenience of binding some arguments now and leaving the rest to be filled in later. Following is a simple example to show the concept of partially applied functions in Scala.
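A minimal sketch of the idea; the log function and its arguments here are illustrative and not necessarily the example from the full post, which is behind the link below:

object PartialExample {

  // An illustrative three-argument function (not from the original post)
  def log(date: String, level: String, message: String): Unit =
    println(s"$date [$level] $message")

  def main(args: Array[String]): Unit = {
    // Fully applied: all three arguments are supplied at once
    log("2016-01-01", "INFO", "application started")

    // Partially applied: bind the first two arguments, leave the last one open
    val infoLog: String => Unit = log("2016-01-01", "INFO", _: String)

    // The remaining argument is filled in later
    infoLog("job submitted")
    infoLog("job finished")
  }
}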

Continue reading Scala Partially Applied Functions

Scala custom exception handling


If you create your own exception type, it is known as a custom (or user-defined) exception. Custom exceptions in Scala let you tailor the exception to your application's needs.

With a custom exception, you can define your own exception type and message.
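A minimal sketch of how this looks; the InvalidAgeException class and the age check are illustrative examples, not taken from the full post:

// A user-defined (custom) exception carrying its own message
class InvalidAgeException(message: String) extends Exception(message)

object CustomExceptionDemo {

  def validateAge(age: Int): Unit = {
    if (age < 18)
      throw new InvalidAgeException(s"age $age is below the minimum of 18")
    println(s"age $age accepted")
  }

  def main(args: Array[String]): Unit = {
    try {
      validateAge(25)
      validateAge(12)
    } catch {
      // Catch only our custom exception type and print its message
      case e: InvalidAgeException => println("caught: " + e.getMessage)
    }
  }
}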

Continue reading Scala custom exception handling

Scala: How to create a Scala project in Eclipse

Set up a Scala project in Eclipse with a Hello World program

Scala is a programming language for general software applications. It has full support for functional programming and a very strong static type system, which allows programs written in Scala to be very concise and thus smaller than equivalent programs in many other general-purpose languages. One way to work with Scala is via Eclipse, a modern IDE.
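The post walks through the Eclipse setup itself; the program it builds is the usual Hello World, along these lines:

// The classic first program for a new Scala project in Eclipse
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, World!")
  }
}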

Continue reading Scala: How to create a Scala project in Eclipse

Apache Flink with Hadoop 1


What is Apache Flink?

Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
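As a rough illustration of that model, here is a minimal Flink streaming word count in Scala; this is a sketch, assuming the Flink streaming Scala dependency is on the classpath, and the socket source and job name are illustrative rather than taken from the original post:

import org.apache.flink.streaming.api.scala._

object FlinkWordCount {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Illustrative source: read lines from a local socket,
    // split them into words and count occurrences per word
    val counts = env
      .socketTextStream("localhost", 9999)
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0)
      .sum(1)

    counts.print()
    env.execute("Streaming WordCount")
  }
}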

Click here for the official site

Continue reading Apache Flink with Hadoop 1