Tag Archives: apache hadoop tutorial

Spark MLLib and Installation of R in Jupyter Notebook

Are you planning to learn Spark and looking for information about Spark MLLib, R, and Jupyter? This article covers all three. Let us begin with MLLib.

Introduction to MLLib

Spark's Machine Learning Library (MLLib) provides learning algorithms and utilities for clustering, classification, collaborative filtering, regression, and dimensionality reduction. It fits into Spark's APIs and interoperates with NumPy in Python and with R libraries. Because it plugs easily into Hadoop workflows, it can use any Hadoop data source, such as HDFS, HBase, or local files. On performance, MLLib is advertised as running up to a hundred times faster than MapReduce. Spark excels at iterative computation, which is what makes MLLib fast: its algorithms can make many passes over the data and so produce better results than the one-pass approximations sometimes used on Hadoop MapReduce.
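To make the point about iterative computation concrete, here is a minimal pure-Python sketch of k-means clustering. It is not MLLib's API; the function names (`kmeans`, `cost`, `closest`) and the sample data are hypothetical and chosen only for illustration. Each iteration is one full pass over the data, which is the kind of repeated pass that benefits from Spark keeping the dataset in memory.

```python
import random


def closest(point, centers):
    """Return the index of the center nearest to point (squared distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans(points, k, iterations, seed=0):
    """Run a fixed number of k-means iterations and return the centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k starting centers from the data
    for _ in range(iterations):
        # Assignment step: one full pass over the data.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(pt, centers[closest(pt, centers)]))
        for pt in points
    )


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
one_pass = kmeans(points, k=2, iterations=1)
ten_passes = kmeans(points, k=2, iterations=10)
print(cost(points, one_pass), cost(points, ten_passes))
```

With the same starting centers, extra iterations never increase the cost, which is why an engine that makes repeated passes cheap suits this family of algorithms.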

Why to use MLLib?

MLLib is built on Spark, a fast, general-purpose engine designed for large-scale data processing. Spark lets you write application code in several languages, including Scala, Java, and Python.

MLLib Installation

Installing MLLib only requires installing Spark, since MLLib is included in Spark.

Let us look at how to install Spark 1.1.0. First, download Apache Spark from the download link on the official website.

The download page generally offers Apache Spark packages pre-built for several popular Hadoop (HDFS) versions. If you want to build Apache Spark from source, follow the guide on building Apache Spark with Maven. On the download page, choose the Spark release, package type, and download type.

Apache Spark runs on both Windows and Unix-like systems such as Linux and Mac OS. Running Spark locally on one machine is straightforward. All you need is Java, either on the system PATH or with the JAVA_HOME environment variable pointing to a Java installation. Spark 1.1.0 needs Java 6+ and Python 2.6+, and uses Scala 2.10 for its Scala API.
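Before installing, it can help to confirm that Java is discoverable in one of the two ways described above. The following is a small sketch using only the Python standard library; the function name `find_java` is hypothetical, and the assumption that the Java launcher lives under `bin/java` inside JAVA_HOME reflects the usual JDK layout rather than anything Spark-specific.

```python
import os
import shutil


def find_java(environ=None):
    """Report where Java would be picked up from, or None if not found.

    Returns a (source, path) tuple. For JAVA_HOME the path is only the
    expected location; this sketch does not verify that the file exists.
    """
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    if java_home:
        return ("JAVA_HOME", os.path.join(java_home, "bin", "java"))
    # Fall back to searching the directories listed in PATH.
    on_path = shutil.which("java", path=environ.get("PATH"))
    if on_path:
        return ("PATH", on_path)
    return None


if __name__ == "__main__":
    print(find_java() or "Java not found: set JAVA_HOME or add Java to PATH")
```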

When building a machine-learning model, the input dataset may not fit in the computer's memory. Developers generally use distributed computing tools such as Apache Spark and Hadoop to run the computation on a cluster of several machines. However, Spark can also process the input data locally on a single machine in standalone mode, and it can build models even when the dataset exceeds the computer's memory.
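The general idea behind working with data that does not fit in memory is to process it in pieces and keep only a small running summary. The sketch below illustrates that idea in plain Python with Welford's method for a running mean and variance; it is not how Spark is implemented, and the names `read_in_chunks` and `streaming_mean_variance` are hypothetical.

```python
from itertools import islice


def read_in_chunks(values, chunk_size):
    """Yield lists of at most chunk_size items, never the whole dataset."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_mean_variance(chunks):
    """Compute count, mean, and population variance in a single pass.

    Only three numbers are kept in memory, however large the input is.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            count += 1
            delta = x - mean
            mean += delta / count
            m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


# A generator stands in for a file too large to load at once.
count, mean, variance = streaming_mean_variance(
    read_in_chunks(range(1, 101), chunk_size=7)
)
print(count, mean, variance)
```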

Introduction to Jupyter Notebook

The Jupyter Notebook is a web application that lets users create and share documents containing live code, equations, visualizations, and explanatory text. Its uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more. When working on a data science problem, users often need an interactive environment in which to write code and share it with others. A notebook meets this need and supports transparent, reproducible reporting. Notebooks are ideal when the user needs to combine plain text with rich elements such as calculations and graphics.

R Notebook

Jupyter has become a popular choice for R users. It compares favorably with other notebooks such as Beaker and Apache Zeppelin, although alternatives such as R Markdown, Sweave, and knitr have been more popular in the R community.

Installation of R in Jupyter Notebook with the R Kernel

  • One of the best ways to run R in a Jupyter notebook is to use the R kernel. To run R, you have to install IRkernel (the kernel for R, available on GitHub) into the notebook platform and activate it before you can start working with R.
  • First, install the required packages. Do this in a regular R terminal; doing it in the RStudio console produces an error.
  • Next, when prompted, enter a number to choose a CRAN mirror; the installation of the required packages then continues.
  • Then make the kernel visible to Jupyter.
  • Finally, start the Jupyter notebook application. R now appears in the kernel list whenever you create a new notebook.
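The steps above can be sketched as a short Python script that hands the R commands to `Rscript`, which runs them outside RStudio. Treat this as an illustration under assumptions: the package list follows the IRkernel installation notes as commonly published and may differ for your version, the CRAN mirror URL is one public mirror chosen so that no interactive mirror prompt appears, and the names `R_STEPS`, `build_commands`, and `install_r_kernel` are hypothetical.

```python
import shutil
import subprocess

# One R expression per step described above.
R_STEPS = [
    # Install the supporting packages; naming a mirror avoids the prompt.
    "install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', "
    "'pbdZMQ', 'devtools', 'uuid', 'digest'), "
    "repos='https://cloud.r-project.org')",
    # Install the R kernel itself from GitHub.
    "devtools::install_github('IRkernel/IRkernel')",
    # Register the kernel so that Jupyter can see it.
    "IRkernel::installspec()",
]


def build_commands(rscript="Rscript"):
    """Return one Rscript command line per installation step."""
    return [[rscript, "-e", expression] for expression in R_STEPS]


def install_r_kernel(dry_run=True):
    """Run the steps in order, or only return them when dry_run is True."""
    commands = build_commands()
    if dry_run:
        return commands
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript was not found on PATH; install R first")
    for command in commands:
        subprocess.run(command, check=True)  # stop at the first failure
    return commands


if __name__ == "__main__":
    for command in install_r_kernel(dry_run=True):
        print(" ".join(command))
```

The script defaults to a dry run that only prints the commands. Call `install_r_kernel(dry_run=False)` to execute them, then run `jupyter kernelspec list` to check that the R kernel is registered.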

Advantages of using Jupyter

Jupyter's main focus is making notebooks easy to share with other users. You can write some code, mix it with text, and publish the combination as a notebook. The idea is to let the reader see both the code and the result of executing it.

Jupyter is an ideal way to share a few experimental snippets or to publish detailed reports with the complete code and explanations. Its main advantage over other services is that it captures the code's output in addition to displaying the code snippets.
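Code and output travel together because a notebook file is a JSON document that stores each code cell alongside the output it produced. The sketch below builds a minimal notebook by hand with Python's standard library. The field layout follows the version 4 notebook format as commonly documented and should be read as an assumption; in practice Jupyter writes these files for you, and the helper name `make_notebook` is hypothetical.

```python
import json


def make_notebook(text, code, output):
    """Build a two-cell notebook: explanatory text, then code with output."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": text},
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "source": code,
                # The saved result of running the cell is stored here.
                "outputs": [
                    {"output_type": "stream", "name": "stdout", "text": output}
                ],
            },
        ],
    }


notebook = make_notebook(
    text="# Demo\nAdding two numbers.",
    code="print(1 + 2)",
    output="3\n",
)
serialized = json.dumps(notebook, indent=1)
print(serialized)
```

Saving `serialized` to a file with the `.ipynb` extension gives a document that a reader can open to see the text, the code, and the recorded output without re-running anything.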

Attend a Live WEBINAR about Hadoop Admin Training on May 11, 2016 @8:00 p.m. CST

Hadoop Admin Training

Time : Wednesday May 11, 2016 @ 8:00 p.m. CST

You are most welcome to join our upcoming batch. The details are as follows:

Attend a Live Demo Session : Sign up for Hadoop Admin Webinar
Batch Start Date : May 18, 2016 @ 08:00 p.m. CST
For More Information http://goo.gl/lBjw99
Class Schedule Mon-Thu 08:00 p.m. CST 3 hrs each session

Contact : Ilyas @ 515-978-9788 , Email : ilyas@zarantech.com

Demo Video by Trainer Ravi

Attend a Free Live WEBINAR about Hadoop Admin Training on May 11, 2016 @8:00 p.m. CST. Register Link  – http://goo.gl/BYufNJ

 

Attend a Live WEBINAR about Hadoop Admin Training on Apr 26, 2016 @7:30 p.m. CST

Hadoop Admin Training

Time : Tuesday Apr 26, 2016 @ 7:30 p.m. CST

You are most welcome to join our upcoming batch. The details are as follows:

Attend a Live Demo Session : Sign up for Hadoop Admin Webinar
Batch Start Date : Apr 29, 2016 @ 07:30 p.m. CST
For More Information http://goo.gl/lBjw99
Class Schedule Mon-Thu 10:00 am CST 3 hrs each session

Contact : Ilyas @ 515-978-9788 , Email : ilyas@zarantech.com

Demo Video by Trainer Ravi

Attend a Free Live WEBINAR about Hadoop Admin Training on April 26, 2016 @7:30 p.m. CST. Register Link  – http://goo.gl/BYufNJ

 

Attend a Live WEBINAR about Big Data Hadoop Admin Training on 9-Jun-2015 @8:00 PM CST #ZaranTech

Hadoop Admin Training

Time : Tuesday June 9th , 2015 @ 8:00 pm CST

You are most welcome to join our upcoming batch. The details are as follows:

Demo Date 9th June Tue @ 8:00 PM CST
Class Schedule 18th June Thu and Fri at 8:00 PM CST and Sat 10 am CST for 1-2 hrs each session
Attend an Introductory Session (Click here to Register)

Contact : Ilyas @ 515-978-9788 , Email : ilyas@zarantech.com

Demo Video by Trainer Arun

Attend a Free Live WEBINAR about Hadoop Admin Training on 9-Jun-15 @8:00 PM CST. Register Link  – http://goo.gl/BYufNJ