Apache Spark and Hadoop Integration with Example
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives can deliver performance up to 100 times faster for certain applications.
Step 1: Install Hadoop (1.x or 2.x) on your machine, and set the Java and Scala paths in .bashrc (for setting the paths, refer to the Spark installation post).
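The .bashrc entries might look like the following sketch; the install locations shown are assumptions, so adjust them to wherever Java, Scala, and Hadoop actually live on your machine.

```shell
# Example ~/.bashrc entries -- the paths below are placeholders;
# point them at your own Java, Scala, and Hadoop installations.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export SCALA_HOME=/usr/local/scala
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin
```

After editing, run `source ~/.bashrc` so the new variables take effect in the current shell.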
Step 2: Check that all Hadoop daemons are up and running.
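A quick way to check is `jps`, which lists the running JVM processes; the daemon names below are what a typical single-node setup shows.

```shell
# On a Hadoop 2.x single-node setup you should see NameNode, DataNode,
# SecondaryNameNode, ResourceManager, and NodeManager.
# On Hadoop 1.x, expect JobTracker and TaskTracker instead of the last two.
jps
```

If any daemon is missing, restart the cluster (e.g. with start-dfs.sh and, on 2.x, start-yarn.sh) before continuing.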
Step 3: Write some data into HDFS (here the file name in HDFS is word).
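One way to do this is sketched below; the sample text and the HDFS path /user/hadoop/word are assumptions, with only the file name word taken from this step.

```shell
# Create a small local file and copy it into HDFS.
echo "spark hadoop spark hdfs hadoop spark" > word
hadoop fs -put word /user/hadoop/word   # copy the local file into HDFS
hadoop fs -cat /user/hadoop/word        # verify the contents landed
```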
Step 4: Download Apache Spark prebuilt for Hadoop 1.x or 2.x, matching the Hadoop version you installed in Step 1.
Step 5: Untar the downloaded Spark archive.
Step 6: Start the Spark shell.
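Steps 4 through 6 might look like the following; Spark 1.6.3 prebuilt against Hadoop 2.6 is used purely as an example, so substitute the package that matches your Hadoop version.

```shell
# Download a Spark release prebuilt for your Hadoop version (1.6.3 / Hadoop 2.6
# shown here as an example) from the Apache archive.
wget https://archive.apache.org/dist/spark/spark-1.6.3/spark-1.6.3-bin-hadoop2.6.tgz
tar -xzf spark-1.6.3-bin-hadoop2.6.tgz   # untar the archive
cd spark-1.6.3-bin-hadoop2.6
./bin/spark-shell                        # start the Spark shell
```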
Step 7: Once the Spark shell has started, run a job against the file in HDFS.
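The original command is not shown in this post; a classic word-count job over the word file from Step 3 is a reasonable stand-in. The NameNode address hdfs://localhost:9000 and the path /user/hadoop/word are assumptions to adjust for your cluster.

```scala
// Run inside the Spark shell, where `sc` (the SparkContext) is predefined.
// The HDFS URI below is an example -- adjust host, port, and path.
val file = sc.textFile("hdfs://localhost:9000/user/hadoop/word")
val counts = file.flatMap(line => line.split(" "))  // split lines into words
                 .map(word => (word, 1))            // pair each word with 1
                 .reduceByKey(_ + _)                // sum counts per word
counts.collect().foreach(println)                   // print (word, count) pairs
```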
Step 8: See the output in the terminal.
Step 9: Check the NameNode UI (localhost:50070).
Step 10: Check the Spark UI (localhost:4040) to monitor the job.