All posts by admin

Streaming with Spark Kafka

[Image: Structured Streaming with Kafka overview]

Apache Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Apache Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.

Click here for the video

Code


import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils

object sparkkafka {

  def main(args: Array[String]) {
    // Spark configuration: application name and the standalone master URL.
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("spark://big-virtual-machine:7077")

    // Streaming context with a 2-second batch interval.
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Receiver-based Kafka stream: ZooKeeper quorum, consumer group,
    // and a map of topic -> number of receiver threads (topic "customer", 5 threads).
    val lines = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("customer" -> 5))
    lines.print()

    // Start the streaming computation and block until it terminates.
    ssc.start()
    ssc.awaitTermination()
  }
}
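To try the job end to end (assuming ZooKeeper and a Kafka broker are running locally with default settings), create a topic named customer, publish a few messages to it with the Kafka console producer, and submit the packaged jar with spark-submit; the received (key, message) pairs should then be printed on the driver console every two seconds.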

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>sparkkafka</groupId>
  <artifactId>sparkkafka</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
   <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.5.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.5.1</version>
    </dependency>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
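A note on the dependencies: the _2.10 suffix on the Spark artifacts means they are built against Scala 2.10, so the Scala version used to compile the project should match, and it is the spark-streaming-kafka artifact that provides the KafkaUtils class used above.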

HBase (NoSQL) single node installation

Prerequisites 

Hadoop version 2 (Click here for Hadoop 2 installation)

Click here for the video

HBase installation steps

1. Extract the HBase tar file

2. Open the conf directory

3. Set the Java home in hbase-env.sh

hbase-env.sh

———————–

export JAVA_HOME=/home/big/jdk1.7.0_45

hbase-site.xml

————————-

<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:50000/hbase</value>
</property>
<property>
<name>hbase.master.port</name>
<value>60001</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.maxClientCnxns</name>
<value>35</value>
</property>

—————————–

Start hbase

bin/start-hbase.sh

—————————–

Open your browser and log on to

localhost:60010 (HBase UI)

——————————-

Open the hbase shell

bin/hbase shell
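As a quick sanity check from the shell (the table and column family names here are only examples), create 'test', 'cf' followed by put 'test', 'row1', 'cf:a', 'value1' and scan 'test' should confirm that HBase can write to and read back from HDFS.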

Hadoop 2 single node Installation on Linux

This blog gives you the steps to install Hadoop on Linux (Ubuntu)

Click here for video

Steps

1. Set the Java home in the environment file (.bashrc)

2. Create an SSH key pair for passwordless communication

3. Start your Hadoop installation

4. Extract the Hadoop tar file and open all the configuration files (etc/hadoop/)

core-site.xml

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:50000</value>
</property>
————————————
mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
————————————
hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/big/hadoop2-dir/namenode-dir</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/big/hadoop2-dir/datanode-dir</value>
</property>
————————————
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
————————————
hadoop-env.sh
export JAVA_HOME=/home/big/jdk1.7.0_45
————————————
yarn-env.sh
export JAVA_HOME=/home/big/jdk1.7.0_45
————————————
mapred-env.sh
export JAVA_HOME=/home/big/jdk1.7.0_45

————————————-

Format the hadoop name node

bin/hadoop namenode -format

Start hadoop

sbin/start-all.sh

—————————————

Open your browser and type

localhost:50070 (Name node web UI)
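Once the daemons are up, running jps should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager, and the YARN ResourceManager UI is available at localhost:8088.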

Hadoop 2 Installation on Windows without "cygwin"

Recent Hadoop 2 releases support installation on Windows and include all the .cmd files needed for configuration. Steps to follow:

Click here for VIDEO

Step 1: You need the Hadoop winutils binaries. Click here to download.

Step 2: Download Java 1.7 for Windows and set the path in the environment variables.

Continue reading Hadoop 2 Installation on Windows

Apache Hive Installation with ACID

Follow these steps to install Hive with transaction (ACID) support.

Video reference: Click here

Set this in your Command line

export HADOOP_USER_CLASSPATH_FIRST=true

Set the environment path and variables in .bashrc

Set HADOOP_HOME in .bashrc

Download the MySQL connector JAR and move it into the apache-hive-1.2.1 lib folder (we are using Hive 1.2.1 here); this step applies to all versions of Hive.

1) Extract the tar file of apache-hive
2) hadoop dfs -chmod 700 /tmp
3) Set or add the below properties in hive-site.xml inside the conf directory (by default the conf folder does not contain this XML file); they can also be run as set commands in the Hive session:

set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;

4) bin/hive

5) If you are using MySQL as the metastore, run the transaction schema script:
Enter into mysql and run
source /home/hadoop/apache-hive-1.2.1-bin/scripts/metastore/upgrade/mysql/hive-txn-schema-0.14.0.mysql.sql

6) Create a table with transactional = true (the table must be bucketed and stored as ORC)

create table acid_table (id INT, name STRING, country STRING, salary INT) clustered by (id) into 4 buckets
stored as orc TBLPROPERTIES ('transactional'='true');

7) Insert

insert into table acid_table values(1,'john','IND',50000);

8) Update

UPDATE acid_table SET salary = 300 WHERE id = 1;

9) Delete

delete from acid_table where id=1;
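After each of the statements above, select * from acid_table; should reflect the change. Behind the scenes, every transaction writes delta files for the bucketed ORC table, and the compactor (enabled by the properties above) periodically merges them.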

Scala Partially Applied Functions

When you invoke a function, you're said to be applying the function to the arguments. If you pass all the expected arguments, you have fully applied it. If you pass only some of the arguments, you get back a partially applied function. This gives you the convenience of binding some arguments and leaving the rest to be filled in later. Following is a simple example to show the concept of Scala partially applied functions.
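A minimal sketch of the idea (the sum function and the values below are only illustrative):

// A function of three arguments.
def sum(a: Int, b: Int, c: Int): Int = a + b + c

// Fully applied: all three arguments are supplied at once.
val total = sum(1, 2, 3)              // 6

// Partially applied: bind the first argument, leave the other two open.
val addToOne = sum(1, _: Int, _: Int)

// The remaining arguments are filled in later.
val result = addToOne(2, 3)           // 6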

Continue reading Scala Partially Applied Functions

Scala custom exception handling

An exception that you create yourself is known as a custom, or user-defined, exception. Scala custom exceptions are used to tailor exceptions to the needs of your application.

With the help of a custom exception, you can define your own exception type and message.
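A minimal sketch (the exception class name and the age check are only illustrative):

// A user-defined exception carrying its own message.
class InvalidAgeException(message: String) extends Exception(message)

object CustomExceptionDemo {

  def validateAge(age: Int): Unit = {
    if (age < 18) throw new InvalidAgeException(s"Age $age is below the limit")
    else println("Welcome!")
  }

  def main(args: Array[String]): Unit = {
    try {
      validateAge(12)
    } catch {
      // Catch the custom exception and print its message.
      case e: InvalidAgeException => println("Caught: " + e.getMessage)
    }
  }
}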

Continue reading Scala custom exception handling

Scala: How to Create Scala project in eclipse

Set up a Scala project in Eclipse with a Hello World program

Scala is a programming language for general software applications. Scala has full support for functional programming and a very strong static type system. This allows programs written in Scala to be very concise and thus smaller than programs written in other general-purpose programming languages. One way to use Scala is via Eclipse, a modern IDE.
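After creating a new Scala project (for example via the Scala IDE plugin for Eclipse), a minimal Hello World object is enough to verify that the project compiles and runs; the object name below is only an example:

// Minimal program to check that the Scala project builds and runs in Eclipse.
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, World!")
  }
}

Running it as a Scala application should print the greeting in the Eclipse console.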

Continue reading Scala: How to Create Scala project in eclipse

Apache Cassandra Certification

A mock test for the Cassandra certification. The same mock test is also provided on the DataStax site.

The Apache Cassandra certification was released on September 22, brought to you by DataStax and O'Reilly. They provide various levels of certification:
1. Developer (DS201 and DS220)
2. Administrator (DS201 and DS210)
3. Architect (DS201, DS210 and DS220)
For more details you can visit the DataStax website.

1. Which of these Cassandra technologies work together to keep track of the cluster data center and rack topology? Choose all that apply.

2. When defining a table in Apache Cassandra, a _________________ must be defined.

3. The CAP theorem states that only two out of three characteristics can be met by a distributed database system simultaneously. Which two characteristics does Apache Cassandra value most? Choose 2.

4. The main function of a keyspace is to control ___________________.

5. Tombstones are ___________________.

6. Given the following table from a physical data model, what are the most likely choices for the missing data types of avg_rating, category, and amount_ingredient, in that order?

recipe_name TEXT
avg_rating ?
prep_time INT
cook_time INT
directions TEXT
category ?
amount_ingredient ?

7. Durability is the property that ensures that all written data is the same on all nodes that it is written to.

8. Compaction does NOT do which of the following?

9. Apache Cassandra can be downloaded in which formats, including DataStax Community variants? Choose all that apply.

10. Apache Cassandra achieves high availability by ____________?

11. Which of the following is a valid Cassandra data type?

12. Which of the following is not true for Apache Cassandra?

13. What does the gossip protocol do?

14. Given the following table, which of the following statements is an example of Data Manipulation Language (DML) in CQL?

CREATE TABLE comics ( title text, issueNumber int, originalPriceDollars float, PRIMARY KEY ( title, issueNumber ) );

15. An application tracks a user's habits. Every time a user clicks a link on a page within your site, the time of the event is recorded as well as the link clicked. In order to write an efficient query, all data must be stored in a single partition. Which of the following tables best models the needs of the application?

16. Clients should have ________________ so that the column values will replace older values based on their timestamp.

17. In a Cassandra instance, a table called Orders holds order information. Each time an order is placed, an entry must also be placed in a denormalized table OrdersByCustomer to keep track of order history per customer. How does Cassandra handle this?

18. An SSTable is immutable, meaning that it cannot be modified.

19. You have designed a query to return all users by a designated state and designated age from a table holding all user information. Which approach is best for efficient queries like this?

20. Which of the following statements about writes is incorrect?