I wrote this article as a follow-up to my previous article about how to install WSL 2 on Windows 10. To sum it up, I need to learn how to use Apache Spark, and eventually I want to use it in a Docker container anyway. While installing Apache Spark I ran into a problem: the software wouldn’t run because there was a space in the name of the folder where I saved the installation files. But don’t worry, I will explain later how I solved that problem so it won’t happen to you.
Before we go further, I will explain a little bit about Apache Spark. What is Apache Spark? Apache Spark is open-source software used as a distributed processing system, especially for big data. Its features include fast analytic queries on big data, in-memory caching, and development APIs for Java, Scala, Python, and even R. Apache Spark is also popular in the machine learning field because it can handle very big datasets with efficient use of memory.
Right now I’m taking part in a Kaggle competition that comes with a very big dataset, so I’m trying to use Apache Spark to process it. Wish me luck 😊. Without further ado, let’s start on how to install Apache Spark on Windows.
You don’t need to install Apache Spark on the default Windows drive (C:). You can install the software in any location you like. Below are the requirements you need to prepare on your computer:
- Java version 8/11
- Apache Spark
There is a note about the Java version. According to the documentation, Apache Spark supports Java 11 starting with version 3. On my computer I use Java 11. You can actually have more than one version of Java installed on Windows, so make sure the environment variables in the Windows settings point to the one you want. You can check the installed Java version on your computer by running the command “java -version” (without quotes) from CMD or PowerShell.
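If you want to script that check, here is a minimal Python sketch. The function names parse_java_major and installed_java_major are my own, not part of Java or Spark. It assumes the usual output format of “java -version”, where the version appears in double quotes and Java 8 reports itself as “1.8”.

```python
import re
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output (hypothetical helper)."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Java 8 and older report themselves as 1.x (for example "1.8.0_291").
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> Optional[int]:
    """Run `java -version` and return the major version, or None if Java is not found."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    # `java -version` normally writes to stderr, so look at both streams.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("Java was not found on the PATH")
    elif major in (8, 11):
        print(f"Java {major} found, which matches the requirement above")
    else:
        print(f"Java {major} found, but this guide uses Java 8 or 11")
```

If it reports a version you did not expect, the Java entry in your Path is probably pointing at another installation.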
- Head to the Apache Spark download page and choose the Spark release version (I’m using 3.1.1). Select a package type that is pre-built for Hadoop. After that, click the download link.
- Create a folder to hold the installation files and extract the downloaded Spark package into it. You can put it on the default drive C or on another drive if you like. There is one note about naming the folder: the name cannot contain spaces, so use something like “ApacheSpark”. If there is a space in the folder name, the program won’t run. It took me a little while on Reddit and Stack Overflow to find out why Apache Spark wouldn’t run on my computer.
- The next step is to add winutils.exe, which is used by Hadoop. Go to the download page on GitHub and download “winutils.exe”. In the installation folder, create a folder named “hadoop”, and inside it create another folder named “bin”. After that, copy “winutils.exe” into the bin folder.
- Another step is to set the environment variables. From the Windows Start button, type “env” without quotes and click on Edit the system environment variables. Under User variables, create two new variables called “SPARK_HOME” and “HADOOP_HOME”. For “SPARK_HOME”, copy the path to the Spark folder in the installation directory (the folder that contains Spark’s own bin folder). For “HADOOP_HOME”, copy the path to the “hadoop” folder that we created before, not to its bin subfolder, because the next step adds the bin part itself.
- The last step is to add entries to the Path user variable. Select “Path”, click “Edit”, and add %SPARK_HOME%\bin and %HADOOP_HOME%\bin.
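To double-check the steps above, here is a small Python sketch. The function check_spark_setup is a hypothetical helper of mine, not something that ships with Spark. It assumes that both variables point to folders containing a bin subfolder (which is what the %SPARK_HOME%\bin and %HADOOP_HOME%\bin Path entries imply) and that winutils.exe sits in HADOOP_HOME\bin.

```python
import os
from pathlib import Path
from typing import List, Mapping


def check_spark_setup(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in the given environment (empty means OK)."""
    problems: List[str] = []
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        # A space anywhere in the path is the issue described in this article.
        if " " in value:
            problems.append(f"{name} contains a space: {value}")
        if not (Path(value) / "bin").is_dir():
            problems.append(f"{name} has no bin folder: {value}")

    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
        problems.append("winutils.exe not found in HADOOP_HOME\\bin")
    return problems


if __name__ == "__main__":
    found = check_spark_setup(os.environ)
    if found:
        for problem in found:
            print("Problem:", problem)
    else:
        print("SPARK_HOME and HADOOP_HOME look fine")
```

Run it from a newly opened terminal, because a terminal that was already open will not see variables you have just created.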
If you have followed all the instructions, you can now launch Apache Spark by opening a new Command Prompt or PowerShell window and typing spark-shell. While the shell is running, open your favorite browser and go to http://localhost:4040/. If the web UI opens successfully, then congratulations, you’ve just installed Apache Spark.
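If you prefer to check the web UI from a script instead of the browser, here is a minimal sketch using only the Python standard library. The function name spark_ui_is_up is my own, and it assumes the UI is on the default address http://localhost:4040/ and that spark-shell is still running in another window.

```python
import urllib.error
import urllib.request


def spark_ui_is_up(url: str = "http://localhost:4040/", timeout: float = 2.0) -> bool:
    """Return True if something answers successfully at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error response.
        return False


if __name__ == "__main__":
    if spark_ui_is_up():
        print("The Spark web UI is reachable")
    else:
        print("Nothing answered on port 4040; is spark-shell still running?")
```

A False result only means nothing answered at that address, so check that spark-shell is still open before suspecting the installation.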