ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

Getting Started with PySpark on Windows

I decided to teach myself how to work with big data and came across Apache Spark. I had heard of Apache Hadoop, but using Hadoop for big data meant writing Java code, which I was not looking forward to since I love writing Python. Spark supports a Python programming API called PySpark that is actively maintained, and that was enough to convince me to start learning PySpark for working with big data.

In this post, I describe how I got started with PySpark on Windows. My laptop is running Windows 10. So the screenshots are specific to Windows 10. I am also assuming that you are comfortable working with the Command Prompt on Windows. You do not have to be an expert, but you need to know how to start a Command Prompt and run commands such as those that help you move around your computer’s file system. In case you need a refresher, a quick introduction might be handy.

Oftentimes, open source projects do not have good Windows support, so I first had to figure out whether Spark and PySpark would work well on Windows. The official Spark documentation does mention support for Windows.

Installing Prerequisites

PySpark requires Java version 7 or later and Python version 2.6 or later. Let's first check whether they are already installed, install them if needed, and make sure that PySpark can work with these two components.

Java

Java is used by many other software packages, so it is quite possible that a required version (in our case version 7 or later) is already available on your computer. To check whether Java is available and find its version, open a Command Prompt and type the following command.
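java -version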

If Java is installed and configured to work from a Command Prompt, running the above command should print the information about the Java version to the console. For example, I got the following output on my laptop.

If instead you get a message like
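'java' is not recognized as an internal or external command, operable program or batch file.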

It means you need to install Java. To do so,

Go to the Java download page. In case the download link has changed, search for Java SE Runtime Environment on the internet and you should be able to find the download page.

Click the Download button beneath JRE

Accept the license agreement and download the latest version of the Java SE Runtime Environment installer. I suggest getting the Windows x64 exe (such as jre-8u92-windows-x64.exe) unless you are using a 32-bit version of Windows, in which case you need the Windows x86 Offline version.

Python

Python is used by many other software packages, so it is quite possible that a required version (in our case version 2.6 or later) is already available on your computer. To check whether Python is available and find its version, open a Command Prompt and type the following command.
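python --version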

If Python is installed and configured to work from a Command Prompt, running the above command should print the information about the Python version to the console. For example, I got the following output on my laptop.

If instead you get a message like
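'python' is not recognized as an internal or external command, operable program or batch file.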

It means you need to install Python. To do so,

Go to the Python download page.

Click the Latest Python 2 Release link.

Download the Windows x86-64 MSI installer file. If you are using a 32 bit version of Windows download the Windows x86 MSI installer file.

When you run the installer, on the Customize Python section, make sure that the option Add python.exe to Path is selected. If this option is not selected, some of the PySpark utilities such as pyspark and spark-submit might not work.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

Installing Apache Spark

Go to the Spark download page.

For Choose a Spark release, select the latest stable release of Spark.

For Choose a package type, select a version that is pre-built for the latest version of Hadoop such as Pre-built for Hadoop 2.6.

For Choose a download type, select Direct Download.

In order to install Apache Spark, there is no need to run any installer. You can extract the files from the downloaded tarball in any folder of your choice using the 7Zip tool.

Make sure that the folder path and the folder name containing Spark files do not contain any spaces.
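To verify the installation, open a Command Prompt, change into the folder where you extracted Spark, and launch the PySpark shell with bin\pyspark. For example, assuming Spark was extracted to D:\spark\spark-2.4.3-bin-hadoop2.7 (the folder used in the walkthrough later on this page):

cd D:\spark\spark-2.4.3-bin-hadoop2.7
bin\pyspark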

The PySpark shell outputs a few messages on exit. So you need to hit enter to get back to the Command Prompt.

Configuring the Spark Installation

By default, the Spark distribution does not include the winutils.exe utility that Spark needs on Windows, so the PySpark shell prints an error message about a missing winutils binary when it starts. This error message does not prevent the PySpark shell from starting. However, if you try to run a standalone Python script using the bin\spark-submit utility, you will get an error. For example, try running the wordcount.py script from the examples folder in the Command Prompt while you are in the SPARK_HOME directory:
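bin\spark-submit examples\src\main\python\wordcount.py README.md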

This produces the following error, which also points to the missing winutils.exe:
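ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.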

Installing winutils

Create a hadoop\bin folder inside the SPARK_HOME folder.

Download the winutils.exe for the version of Hadoop that your Spark installation was built against. In my case the Hadoop version was 2.6.0, so I downloaded the winutils.exe for Hadoop 2.6.0 and copied it to the hadoop\bin folder in the SPARK_HOME folder.

Create a system environment variable in Windows called SPARK_HOME that points to the SPARK_HOME folder path. Search the internet if you need a refresher on how to create environment variables in your version of Windows.

Create another system environment variable in Windows called HADOOP_HOME that points to the hadoop folder inside the SPARK_HOME folder.
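If you prefer the command line, here is a sketch using the built-in setx command, assuming Spark was extracted to D:\spark\spark-2.4.3-bin-hadoop2.7 (adjust the path to your own folder; setx writes user-level variables, and they only take effect in newly opened Command Prompt windows):

setx SPARK_HOME "D:\spark\spark-2.4.3-bin-hadoop2.7"
setx HADOOP_HOME "D:\spark\spark-2.4.3-bin-hadoop2.7\hadoop"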

If you now run the bin\pyspark script from a Windows Command Prompt, the error messages related to winutils.exe should be gone. For example, I got the following messages after running the bin\pyspark utility once winutils was configured.

The bin\spark-submit utility can now also be used to run the wordcount.py script successfully.

Configuring the log level for Spark

There are still a lot of extra INFO messages in the console every time you start or exit a PySpark shell or run the spark-submit utility. So let's make one more change to our Spark installation so that only warning and error messages are written to the console. To do this:

Copy the log4j.properties.template file in the SPARK_HOME\conf folder to a file named log4j.properties in the same folder.

Set the log4j.rootCategory property value to WARN, console

Save the log4j.properties file.
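After this change, the relevant line in SPARK_HOME\conf\log4j.properties should read:

log4j.rootCategory=WARN, console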

Summary

In order to work with PySpark, start a Windows Command Prompt and change into your SPARK_HOME directory.

To start a PySpark shell, run the bin\pyspark utility. Once you are in the PySpark shell, use the sc and sqlContext names, and type exit() to return to the Command Prompt.
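For example, a quick sanity check inside the shell might look like this (a sketch; README.md is the file that ships in the Spark folder):

>>> sc.version
>>> sc.textFile("README.md").count()
>>> exit()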

To run a standalone Python script, run the bin\spark-submit utility and specify the path of your Python script as well as any arguments your Python script needs in the Command Prompt. For example, to run the wordcount.py script from the examples directory in your SPARK_HOME folder, you can run the following command

bin\spark-submit examples\src\main\python\wordcount.py README.md

References

I used the following references to gather information for this post.

Downloading Spark and Getting Started (chapter 2) from O’Reilly’s Learning Spark book.

Any suggestions or feedback? Leave your comments below.

Π˜ΡΡ‚ΠΎΡ‡Π½ΠΈΠΊ

Installing Apache PySpark on Windows 10

Published Aug 30, 2019

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

For the last few months I have been working on a Data Science project that handles a huge dataset, and it became necessary to use the distributed environment provided by Apache PySpark.

I struggled a lot while installing PySpark on Windows 10. So I decided to write this post to help anyone easily install and use Apache PySpark on a Windows 10 machine.

Step 1

PySpark requires Java version 7 or later and Python version 2.6 or later. Let's first check whether they are installed, install them if needed, and make sure that PySpark can work with these two components.

Installing Java

Check whether Java version 7 or later is installed on your machine. To do so, run the following command in the Command Prompt.
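java -version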

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

If Java is installed and configured to work from the Command Prompt, running the above command should print the Java version information to the console. Otherwise, if you get a message like:

'java' is not recognized as an internal or external command, operable program or batch file.

then you need to install Java. To do so:

a) Go to the Java download page.

b) Get the Windows x64 installer (for example, jre-8u92-windows-x64.exe), unless you are using a 32-bit version of Windows, in which case you need the Windows x86 Offline version.

c) Run the installer.

Step 2

Python
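Check whether Python is available from the Command Prompt with the usual version query:

python --version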

If Python is installed and configured to work from the Command Prompt, running the above command should print the Python version information to the console. For example, I got the following output on my laptop:

If instead you get a message like

'python' is not recognized as an internal or external command, operable program or batch file.

this means you need to install Python. To do so:

a) Go to the Python download page.

b) Click the Latest Python 2 Release link.

c) Download the Windows x86-64 MSI installer file. If you are using a 32-bit version of Windows, download the Windows x86 MSI installer file.

d) When you run the installer, on the Customize Python section, make sure that the option Add python.exe to Path is selected. If this option is not selected, some of the PySpark utilities such as pyspark and spark-submit might not work.

Step 3

Installing Apache Spark

a) Go to the Spark download page.

b) Choose a Spark release: select the latest stable release of Spark.

c) Choose a package type: select a version that is pre-built for the latest version of Hadoop, such as Pre-built for Hadoop 2.6.

d) Choose a download type: select Direct Download.

e) To install Apache Spark you do not need to run any installer. Extract the files from the downloaded tar file into any folder of your choice, using 7Zip or another archive tool.

Make sure that the folder path and the folder name containing the Spark files do not contain any spaces.

I created a folder named spark on my D drive and extracted the zipped tar file into a folder named spark-2.4.3-bin-hadoop2.7. So all the Spark files are in a folder named D:\spark\spark-2.4.3-bin-hadoop2.7. Let's call this folder SPARK_HOME in this post.

To verify that the installation succeeded, open a Command Prompt, change into the SPARK_HOME directory, and type bin\pyspark. This should start the PySpark shell, which can be used to work with Spark interactively.
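For example, with the folder used above:

cd D:\spark\spark-2.4.3-bin-hadoop2.7
bin\pyspark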

The PySpark shell outputs a few messages on exit, so you need to press Enter to get back to the Command Prompt.

Step 4

Configuring the Spark Installation

Initially, when you start the PySpark shell, it produces a lot of messages of type INFO, ERROR and WARN. Let's see how to get rid of them.

The default Spark installation on Windows does not include the winutils.exe utility that Spark uses. If you do not tell your Spark installation where to look for winutils.exe, you will see error messages when starting the PySpark shell, such as

"ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries."

This error message does not prevent the PySpark shell from starting. However, if you try to run a standalone Python script using the bin\spark-submit utility, you will get an error. For example, try running the wordcount.py script from the examples folder in the Command Prompt while you are in the SPARK_HOME directory:

bin\spark-submit examples\src\main\python\wordcount.py README.md

Installing winutils

Let's download winutils.exe and configure our Spark installation to find it.

a) Create a hadoop\bin folder inside the SPARK_HOME folder.

b) Download winutils.exe for the version of Hadoop against which your Spark installation was built. In my case the Hadoop version was 2.6.0, so I downloaded winutils.exe for Hadoop 2.6.0 and copied it to the hadoop\bin folder in the SPARK_HOME folder.

c) Create a system environment variable in Windows called SPARK_HOME that points to the SPARK_HOME folder path.

d) Create another system environment variable in Windows called HADOOP_HOME that points to the hadoop folder inside the SPARK_HOME folder.

Since the hadoop folder is inside the SPARK_HOME folder, it is better to create the HADOOP_HOME environment variable with the value %SPARK_HOME%\hadoop. That way you do not need to change HADOOP_HOME whenever SPARK_HOME is updated.

If you now run the bin\pyspark script from a Windows Command Prompt, the error messages related to winutils.exe should be gone.

Step 5

Configuring the log level for Spark

Every time you start or exit the PySpark shell, or run the spark-submit utility, a lot of extra INFO messages are still printed. So let's make one more change to our Spark installation so that only warnings and error messages are written to the console. To do this:

a) Copy the log4j.properties.template file in the SPARK_HOME\conf folder to a file named log4j.properties in the same folder.

b) Set the log4j.rootCategory property value to WARN, console.

c) Save the log4j.properties file.

Now no INFO messages will be written to the console.

Summary

To work with PySpark, start a Command Prompt and change into your SPARK_HOME directory.

a) To start a PySpark shell, run the bin\pyspark utility. Once you are in the PySpark shell, use the sc and sqlContext names, and type exit() to return to the Command Prompt.

b) To run a standalone Python script, run the bin\spark-submit utility and specify the path of your Python script as well as any arguments your Python script needs on the command line. For example, to run the wordcount.py script from the examples directory in your SPARK_HOME folder, you can run the following command:

bin\spark-submit examples\src\main\python\wordcount.py README.md

Step 6

Important: an issue I ran into during installation

After completing the installation procedure on my Windows 10 machine, I kept getting the following error message.

Solution:

I just figured out how to fix it!

In my case, I did not know that I needed to add THREE Miniconda-related paths to the PATH environment variable:

C:\Users\uug20\Anaconda3

C:\Users\uug20\Anaconda3\Scripts

C:\Users\uug20\Anaconda3\Library\bin

After that, I stopped getting any error messages, and pyspark started working correctly, opening a Jupyter notebook after I typed pyspark at the command prompt.

Π˜ΡΡ‚ΠΎΡ‡Π½ΠΈΠΊ

A PySpark Guide for Beginners

PySpark is the Python API for Apache Spark, an open-source system used for distributed processing of big data. Spark was originally developed in the Scala programming language at the University of California, Berkeley.

Spark provides APIs for Scala, Java, Python and R. It supports reusing code across workloads: batch processing, interactive queries, real-time analytics, machine learning and graph computation. It uses in-memory caching and optimized query execution on data of any size.

It does not have a single file system of its own, such as the Hadoop Distributed File System (HDFS); instead, Spark works with many popular file systems and stores such as HDFS, HBase, Cassandra, Amazon S3, Amazon Redshift, Couchbase and so on.

ΠŸΡ€Π΅ΠΈΠΌΡƒΡ‰Π΅ΡΡ‚Π²Π° использования Apache Spark:

Настройка срСды Π² Google Colab

Π§Ρ‚ΠΎΠ±Ρ‹ Π·Π°ΠΏΡƒΡΡ‚ΠΈΡ‚ΡŒ pyspark Π½Π° локальной машинС, Π½Π°ΠΌ понадобится Java ΠΈ Π΅Ρ‰Π΅ Π½Π΅ΠΊΠΎΡ‚ΠΎΡ€ΠΎΠ΅ ΠΏΡ€ΠΎΠ³Ρ€Π°ΠΌΠΌΠ½ΠΎΠ΅ обСспСчСниС. ΠŸΠΎΡΡ‚ΠΎΠΌΡƒ вмСсто слоТной ΠΏΡ€ΠΎΡ†Π΅Π΄ΡƒΡ€Ρ‹ установки ΠΌΡ‹ ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΡƒΠ΅ΠΌ Google Colaboratory, ΠΊΠΎΡ‚ΠΎΡ€Ρ‹ΠΉ идСально удовлСтворяСт наши трСбования ΠΊ ΠΎΠ±ΠΎΡ€ΡƒΠ΄ΠΎΠ²Π°Π½ΠΈΡŽ, ΠΈ Ρ‚Π°ΠΊΠΆΠ΅ поставляСтся с ΡˆΠΈΡ€ΠΎΠΊΠΈΠΌ Π½Π°Π±ΠΎΡ€ΠΎΠΌ Π±ΠΈΠ±Π»ΠΈΠΎΡ‚Π΅ΠΊ для Π°Π½Π°Π»ΠΈΠ·Π° Π΄Π°Π½Π½Ρ‹Ρ… ΠΈ машинного обучСния. Π’Π°ΠΊΠΈΠΌ ΠΎΠ±Ρ€Π°Π·ΠΎΠΌ, Π½Π°ΠΌ остаСтся Ρ‚ΠΎΠ»ΡŒΠΊΠΎ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ ΠΏΠ°ΠΊΠ΅Ρ‚Ρ‹ pyspark ΠΈ Py4J. Py4J позволяСт ΠΏΡ€ΠΎΠ³Ρ€Π°ΠΌΠΌΠ°ΠΌ Python, Ρ€Π°Π±ΠΎΡ‚Π°ΡŽΡ‰ΠΈΠΌ Π² ΠΈΠ½Ρ‚Π΅Ρ€ΠΏΡ€Π΅Ρ‚Π°Ρ‚ΠΎΡ€Π΅ Python, динамичСски ΠΎΠ±Ρ€Π°Ρ‰Π°Ρ‚ΡŒΡΡ ΠΊ ΠΎΠ±ΡŠΠ΅ΠΊΡ‚Π°ΠΌ Java ΠΈΠ· Π²ΠΈΡ€Ρ‚ΡƒΠ°Π»ΡŒΠ½ΠΎΠΉ ΠΌΠ°ΡˆΠΈΠ½Ρ‹ Java.

Π˜Ρ‚ΠΎΠ³ΠΎΠ²Ρ‹ΠΉ Π½ΠΎΡƒΡ‚Π±ΡƒΠΊ ΠΌΠΎΠΆΠ½ΠΎ ΡΠΊΠ°Ρ‡Π°Ρ‚ΡŒ Π² Ρ€Π΅ΠΏΠΎΠ·ΠΈΡ‚ΠΎΡ€ΠΈΠΈ: https://gitlab.com/PythonRu/notebooks/-/blob/master/pyspark_beginner.ipynb

Команда для установки Π²Ρ‹ΡˆΠ΅ΡƒΠΊΠ°Π·Π°Π½Π½Ρ‹Ρ… ΠΏΠ°ΠΊΠ΅Ρ‚ΠΎΠ²:
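In a Colab cell this is presumably something along these lines (the exact pinned versions in the original notebook may differ):

!pip install pyspark py4j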

Spark Session

SparkSession has been the entry point to PySpark since version 2.0; before that, SparkContext was used for this purpose. SparkSession is the way to initialize the core functionality of PySpark in order to programmatically create PySpark RDDs, DataFrames and Datasets. It can be used in place of SQLContext, HiveContext and the other contexts defined before 2.0.

Creating a SparkSession
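The original code cell is not reproduced here; a minimal sketch of creating a session (the application name is just an example) looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark_beginner") \
    .getOrCreate()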

Π˜ΡΡ‚ΠΎΡ‡Π½ΠΈΠΊ

How to Install Apache Spark on Windows 10

Apache Spark is an open-source framework that processes large volumes of stream data from multiple sources. Spark is used in distributed computing with machine learning applications, data analytics, and graph-parallel processing.

This guide will show you how to install Apache Spark on Windows 10 and test the installation.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

Install Apache Spark on Windows

Installing Apache Spark on Windows 10 may seem complicated to novice users, but this simple tutorial will have you up and running. If you already have Java 8 and Python 3 installed, you can skip the first two steps.

Step 1: Install Java 8

Apache Spark requires Java 8. You can check to see if Java is installed using the command prompt.

Open the command line by clicking Start > type cmd > click Command Prompt.

Type the following command in the command prompt:
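java -version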

If Java is installed, it will respond with the following output:
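The first line of the output will look something like this (the exact update number will vary):

java version "1.8.0_251"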

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

Your version may be different. The second digit is the Java version – in this case, Java 8.

If you don’t have Java installed:

1. Open a browser window, and navigate to https://java.com/en/download/.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

2. Click the Java Download button and save the file to a location of your choice.

3. Once the download finishes double-click the file to install Java.

Note: At the time this article was written, the latest Java version is 1.8.0_251. Installing a later version will still work. This process only needs the Java Runtime Environment (JRE) – the full Development Kit (JDK) is not required. The download link to JDK is https://www.oracle.com/java/technologies/javase-downloads.html.

Step 2: Install Python

1. To install the Python package manager, navigate to https://www.python.org/ in your web browser.

2. Mouse over the Download menu option and click Python 3.8.3. 3.8.3 is the latest version at the time of writing the article.

3. Once the download finishes, run the file.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

4. Near the bottom of the first setup dialog box, check off Add Python 3.8 to PATH. Leave the other box checked.

5. Next, click Customize installation.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

6. You can leave all boxes checked at this step, or you can uncheck the options you do not want.

7. Click Next.

8. Select the box Install for all users and leave other boxes as they are.

9. Under Customize install location, click Browse and navigate to the C drive. Add a new folder and name it Python.

10. Select that folder and click OK.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

11. Click Install, and let the installation complete.

12. When the installation completes, click the Disable path length limit option at the bottom and then click Close.

13. If you have a command prompt open, restart it. Verify the installation by checking the version of Python:
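python --version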

Note: For detailed instructions on how to install Python 3 on Windows or how to troubleshoot potential issues, refer to our Install Python 3 on Windows guide.

Step 3: Download Apache Spark

1. Open a browser and navigate to the Spark download page at https://spark.apache.org/downloads.html.

2. Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview version.

3. Click the spark-2.4.5-bin-hadoop2.7.tgz link.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

4. A page with a list of mirrors loads where you can see different servers to download from. Pick any from the list and save the file to your Downloads folder.

Step 4: Verify Spark Software File

1. Verify the integrity of your download by checking the checksum of the file. This ensures you are working with unaltered, uncorrupted software.

2. Navigate back to the Spark Download page and open the Checksum link, preferably in a new tab.

3. Next, open a command line and enter the following command:
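On Windows you can use the built-in certutil tool; a sketch, assuming the file was saved to your Downloads folder under the name shown above:

certutil -hashfile %USERPROFILE%\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512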

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

5. Compare the code to the one you opened in a new browser tab. If they match, your download file is uncorrupted.

Step 5: Install Apache Spark

Installing Apache Spark involves extracting the downloaded file to the desired location.

1. Create a new folder named Spark in the root of your C: drive. From a command line, enter the following:
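cd \
mkdir Spark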

2. In Explorer, locate the Spark file you downloaded.

3. Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).

4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the necessary files inside.

Step 6: Add winutils.exe File

Download the winutils.exe file for the underlying Hadoop version of the Spark installation you downloaded.

1. Navigate to this URL https://github.com/cdarlint/winutils and inside the bin folder, locate winutils.exe, and click it.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

2. Find the Download button on the right side to download the file.

3. Now, create new folders Hadoop and bin on C: using Windows Explorer or the Command Prompt.
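From the Command Prompt, a single command creates both folders:

mkdir C:\hadoop\bin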

4. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.

Step 7: Configure Environment Variables

Configuring environment variables in Windows adds the Spark and Hadoop locations to your system PATH. It allows you to run the Spark shell directly from a command prompt window.

1. Click Start and type environment.

2. Select the result labeled Edit the system environment variables.

3. A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then click New in the next window.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

4. For Variable Name type SPARK_HOME.

5. For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you changed the folder path, use that one instead.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

6. In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid deleting any entries already on the list.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

7. You should see a box with entries on the left. On the right, click New.

8. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid possible issues with the path.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

9. Repeat this process for Hadoop and Java.
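That is, assuming the folder layout used in this guide (your Java installation folder will differ):

HADOOP_HOME = C:\hadoop                              (and add %HADOOP_HOME%\bin to Path)
JAVA_HOME   = your Java folder, e.g. C:\Program Files\Java\jre1.8.0_251   (and add %JAVA_HOME%\bin to Path)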

10. Click OK to close all open windows.

Note: Start by restarting the Command Prompt to apply the changes. If that doesn't work, you will need to reboot the system.

Step 8: Launch Spark

1. Open a new Command Prompt window by right-clicking and choosing Run as administrator.

2. To start Spark, enter:
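C:\Spark\spark-2.4.5-bin-hadoop2.7\bin\spark-shell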

If you set the environment path correctly, you can type spark-shell to launch Spark.

3. The system should display several lines indicating the status of the application. You may get a Java pop-up. Select Allow access to continue.

Finally, the Spark logo appears, and the prompt displays the Scala shell.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

4. Open a web browser and navigate to http://localhost:4040/.

5. You can replace localhost with the name of your system.

6. You should see an Apache Spark shell Web UI. The example below shows the Executors page.

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.

Note: If you installed Python, you can run Spark using Python with this command:
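pyspark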

Test Spark

In this example, we will launch the Spark shell and use Scala to read the contents of a file. You can use an existing file, such as the README file in the Spark directory, or you can create your own. We created pnaptest with some text.

1. Open a command-prompt window and navigate to the folder with the file you want to use and launch the Spark shell.

2. First, state a variable to use in the Spark context with the name of the file. Remember to add the file extension if there is any.
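In the Scala shell this looks something like the following, where pnaptest is the example file created above (add the extension if your file has one):

val x = sc.textFile("pnaptest")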

3. The output shows an RDD is created. Then, we can view the file contents by using this command to call an action:
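x.take(11).foreach(println)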

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

This command instructs Spark to print 11 lines from the file you specified. To perform an action on this file (value x), add another value y, and do a map transformation.

4. For example, you can print the characters in reverse with this command:
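A sketch of that step, creating the reversed value y:

val y = x.map(_.reverse)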

5. The system creates a child RDD in relation to the first one. Then, specify how many lines you want to print from the value y:
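y.take(11).foreach(println)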

ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ Ρ„ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π‘ΠΌΠΎΡ‚Ρ€Π΅Ρ‚ΡŒ ΠΊΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΡƒ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. ΠšΠ°Ρ€Ρ‚ΠΈΠ½ΠΊΠ° ΠΏΡ€ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10. Π€ΠΎΡ‚ΠΎ ΠΊΠ°ΠΊ ΡƒΡΡ‚Π°Π½ΠΎΠ²ΠΈΡ‚ΡŒ payspark Π½Π° windows 10

The output prints 11 lines of the pnaptest file in the reverse order.

You should now have a working installation of Apache Spark on Windows 10 with all dependencies installed. Get started running an instance of Spark in your Windows environment.

Our suggestion is to also learn more about what Spark DataFrame is, the features, and how to use Spark DataFrame when collecting data.

Π˜ΡΡ‚ΠΎΡ‡Π½ΠΈΠΊ
