Apache Spark Word Count Program

Apache Spark is an open-source cluster-computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.


Wordcount in Scala

bin/spark-shell

scala> val textFile = sc.textFile("/home/username/word.txt")

scala> val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)


scala> counts.collect()

Wordcount in Python


bin/pyspark

>>> text_file = sc.textFile("/home/username/word.txt")

>>> counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)


>>> counts.collect()


Input file (word.txt)

I love bigdata

I like bigdata
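To check what the pipelines above should produce for this input, the same flatMap / map / reduceByKey logic can be sketched in plain Python without a Spark cluster. This is only an illustrative equivalent (the variable names and the use of `collections.Counter` are assumptions here, not part of the Spark API):

```python
from collections import Counter

# Sample input, matching word.txt above
lines = ["I love bigdata", "I like bigdata"]

# flatMap: split each line on spaces and flatten into one list of words
words = [word for line in lines for word in line.split(" ")]

# map + reduceByKey: pair each word with 1, then sum the counts per word;
# Counter does both steps in one pass
counts = Counter(words)

print(sorted(counts.items()))
# [('I', 2), ('bigdata', 2), ('like', 1), ('love', 1)]
```

Note that `counts.collect()` in Spark returns the same pairs but in no guaranteed order, since the results come back from distributed partitions.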


Spark web UI

(Screenshot: the Spark web UI, which is available in a browser while the shell is running.)