Getting Started with PySpark on Windows
I decided to teach myself how to work with big data and came across Apache Spark. While I had heard of Apache Hadoop, using Hadoop for big data meant writing code in Java, which I was not really looking forward to, as I love to write code in Python. Spark supports a Python programming API called PySpark that is actively maintained, which was enough to convince me to start learning PySpark for working with big data.
In this post, I describe how I got started with PySpark on Windows. My laptop is running Windows 10, so the screenshots are specific to Windows 10. I am also assuming that you are comfortable working with the Command Prompt on Windows. You do not have to be an expert, but you need to know how to start a Command Prompt and run commands such as those that help you move around your computer's file system. In case you need a refresher, a quick introduction might be handy.
Oftentimes, open source projects do not have good Windows support, so I first had to figure out whether Spark and PySpark would work well on Windows. The official Spark documentation does mention support for Windows.
Installing Prerequisites
PySpark requires Java version 7 or later and Python version 2.6 or later. Let's first check whether they are already installed, install them if needed, and make sure that PySpark can work with these two components.
Java
Java is used by many other software products, so it is quite possible that a required version (in our case, version 7 or later) is already available on your computer. To check whether Java is available and find its version, open a Command Prompt and type the following command.
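java -version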
If Java is installed and configured to work from a Command Prompt, running the above command should print the information about the Java version to the console. For example, I got the following output on my laptop.
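java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)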
Instead, if you get a message like:
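'java' is not recognized as an internal or external command, operable program or batch file.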
It means you need to install Java. To do so:
Go to the Java download page. In case the download link has changed, search for Java SE Runtime Environment on the internet and you should be able to find the download page.
Click the Download button beneath JRE
Accept the license agreement and download the latest version of the Java SE Runtime Environment installer. I suggest getting the exe for Windows x64 (such as jre-8u92-windows-x64.exe) unless you are using a 32-bit version of Windows, in which case you need to get the Windows x86 Offline version.
Python
Python is used by many other software products, so it is quite possible that a required version (in our case, version 2.6 or later) is already available on your computer. To check whether Python is available and find its version, open a Command Prompt and type the following command.
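python --version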
If Python is installed and configured to work from a Command Prompt, running the above command should print the information about the Python version to the console. For example, I got the following output on my laptop.
Instead, if you get a message like:
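'python' is not recognized as an internal or external command, operable program or batch file.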
It means you need to install Python. To do so:
Go to the Python download page.
Click the Latest Python 2 Release link.
Download the Windows x86-64 MSI installer file. If you are using a 32 bit version of Windows download the Windows x86 MSI installer file.
When you run the installer, in the Customize Python section, make sure that the option Add python.exe to Path is selected. If this option is not selected, some of the PySpark utilities such as pyspark and spark-submit might not work.
Installing Apache Spark
Go to the Spark download page.
For Choose a Spark release, select the latest stable release of Spark.
For Choose a package type, select a version that is pre-built for the latest version of Hadoop such as Pre-built for Hadoop 2.6.
For Choose a download type, select Direct Download.
In order to install Apache Spark, there is no need to run any installer. You can extract the files from the downloaded tarball to any folder of your choice using the 7Zip tool.
Make sure that the folder path and the folder name containing Spark files do not contain any spaces.
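Let's refer to the folder containing the Spark files as SPARK_HOME in this post. To verify the installation, open a Command Prompt, change into the SPARK_HOME directory, and run:
bin\pyspark
This should start the PySpark shell, which can be used to work interactively with Spark.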
The PySpark shell outputs a few messages on exit. So you need to hit enter to get back to the Command Prompt.
Configuring the Spark Installation
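By default, the Spark installation for Windows does not include the winutils.exe utility that Spark uses. If you do not tell your Spark installation where to look for winutils.exe, you will see error messages when starting the PySpark shell, such as:
ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.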
This error message does not prevent the PySpark shell from starting. However, if you try to run a standalone Python script using the bin\spark-submit utility, you will get an error. For example, try running the wordcount.py script from the examples folder in the Command Prompt while you are in the SPARK_HOME directory:
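bin\spark-submit examples\src\main\python\wordcount.py README.md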
which produces an error that also points to the missing winutils.exe.
Installing winutils
Create a hadoop\bin folder inside the SPARK_HOME folder.
Download the winutils.exe for the version of Hadoop against which your Spark installation was built. In my case the Hadoop version was 2.6.0, so I downloaded the winutils.exe for Hadoop 2.6.0 and copied it to the hadoop\bin folder in the SPARK_HOME folder.
Create a system environment variable in Windows called SPARK_HOME that points to the SPARK_HOME folder path. Search the internet if you need a refresher on how to create environment variables in your version of Windows.
Create another system environment variable in Windows called HADOOP_HOME that points to the hadoop folder inside the SPARK_HOME folder.
If you now run the bin\pyspark script from a Windows Command Prompt, the error messages related to winutils.exe should be gone. For example, after configuring winutils, I could run the bin\pyspark utility without any winutils-related error messages.
The bin\spark-submit utility can now also be used successfully to run the wordcount.py script.
Configuring the log level for Spark
There are still a lot of extra INFO messages in the console every time you start or exit from a PySpark shell or run the spark-submit utility. So let's make one more change to our Spark installation so that only warning and error messages are written to the console. To do this:
Copy the log4j.properties.template file in the SPARK_HOME\conf folder to a file named log4j.properties in the same folder.
Set the log4j.rootCategory property value to WARN, console (see the resulting line below).
Save the log4j.properties file.
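After the change, the relevant line in log4j.properties reads:
log4j.rootCategory=WARN, console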
Summary
In order to work with PySpark, start a Windows Command Prompt and change into your SPARK_HOME directory.
To start a PySpark shell, run the bin\pyspark utility. Once you are in the PySpark shell, use the sc and sqlContext names, and type exit() to return to the Command Prompt.
To run a standalone Python script, run the bin\spark-submit utility and specify the path of your Python script as well as any arguments your Python script needs in the Command Prompt. For example, to run the wordcount.py script from the examples directory in your SPARK_HOME folder, you can run the following command:
bin\spark-submit examples\src\main\python\wordcount.py README.md
References
I used the following references while writing this post.
Downloading Spark and Getting Started (chapter 2) from O'Reilly's Learning Spark book.
Any suggestions or feedback? Leave your comments below.
Installing Apache PySpark on Windows 10
Published Aug 30, 2019
For the last few months I have been working on a data science project that handles a huge dataset, and it became necessary to use the distributed environment provided by Apache PySpark.
I struggled a lot while installing PySpark on Windows 10, so I decided to write this post to help anyone easily install and use Apache PySpark on a Windows 10 machine.
Step 1
PySpark requires Java version 7 or later and Python version 2.6 or later. Let's first check whether they are installed, or install them, and make sure that PySpark can work with these two components.
Installing Java
Check whether Java version 7 or later is installed on your machine. To do so, run the following command in the Command Prompt:
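java -version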
If Java is installed and configured to work from the Command Prompt, running the above command should print the information about the Java version to the console. Otherwise, if you get a message like:
'java' is not recognized as an internal or external command, operable program or batch file.
then you need to install Java. To do so:
a) Go to the Java download page.
b) Get the Windows x64 installer (for example, jre-8u92-windows-x64.exe), unless you are using a 32-bit version of Windows, in which case you need to get the Windows x86 Offline version.
c) Run the installer.
Step 2
Python
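To check whether Python version 2.6 or later is installed, run the following command in the Command Prompt:
python --version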
If Python is installed and configured to work from the Command Prompt, running the above command should print the information about the Python version to the console. For example, I got the following output on my laptop.
Instead, if you get a message like:
'python' is not recognized as an internal or external command, operable program or batch file.
It means you need to install Python. To do so:
a) Go to the Python download page.
b) Click the Latest Python 2 Release link.
c) Download the Windows x86-64 MSI installer file. If you are using a 32-bit version of Windows, download the Windows x86 MSI installer file.
d) When you run the installer, in the Customize Python section, make sure that the option Add python.exe to Path is selected. If this option is not selected, some of the PySpark utilities such as pyspark and spark-submit might not work.
Step 3
Installing Apache Spark
a) Go to the Spark download page.
b) For Choose a Spark release, select the latest stable release of Spark.
c) For Choose a package type, select a version that is pre-built for the latest version of Hadoop, such as Pre-built for Hadoop 2.6.
d) For Choose a download type, select Direct Download.
e) To install Apache Spark, you do not need to run any installer. Extract the files from the downloaded tarball to any folder of your choice using 7Zip or another extraction tool.
Make sure that the folder path and the folder name containing the Spark files do not contain any spaces.
I created a folder named spark on my D drive and extracted the zipped tarball into a folder named spark-2.4.3-bin-hadoop2.7. So all the Spark files are in a folder named D:\spark\spark-2.4.3-bin-hadoop2.7. Let's call this folder SPARK_HOME in this post.
To verify that the installation succeeded, open the Command Prompt, change into the SPARK_HOME directory, and type bin\pyspark. This should start the PySpark shell, which can be used to work interactively with Spark.
The PySpark shell outputs a few messages on exit, so you need to hit Enter to get back to the Command Prompt.
Step 4
Configuring the Spark Installation
Initially, when you start the PySpark shell, it produces a lot of messages of type INFO, ERROR, and WARN. Let's see how to get rid of most of these messages.
The Spark installation on Windows does not include the winutils.exe utility, which is used by Spark. If you do not tell your Spark installation where to look for winutils.exe, you will see error messages when starting the PySpark shell, such as:
ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
This error message does not prevent the PySpark shell from starting. However, if you try to run a standalone Python script using the bin\spark-submit utility, you will get an error. For example, try running the wordcount.py script from the examples folder in the Command Prompt while you are in the SPARK_HOME directory:
bin\spark-submit examples\src\main\python\wordcount.py README.md
Installing winutils
Let's download winutils.exe and configure our Spark installation to find it.
a) Create a hadoop\bin folder inside the SPARK_HOME folder.
b) Download the winutils.exe for the version of Hadoop against which your Spark installation was built. In my case the Hadoop version was 2.6.0, so I downloaded winutils.exe for Hadoop 2.6.0 and copied it to the hadoop\bin folder in the SPARK_HOME folder.
c) Create a system environment variable in Windows called SPARK_HOME that points to the SPARK_HOME folder path.
d) Create another system environment variable in Windows called HADOOP_HOME that points to the hadoop folder inside the SPARK_HOME folder.
Since the hadoop folder is inside the SPARK_HOME folder, it is better to create the HADOOP_HOME environment variable using the value %SPARK_HOME%\hadoop. That way you do not need to change HADOOP_HOME whenever SPARK_HOME is updated.
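For example, from a Command Prompt (a sketch assuming the example path above; adjust it to your own folder):
setx SPARK_HOME "D:\spark\spark-2.4.3-bin-hadoop2.7"
setx HADOOP_HOME "D:\spark\spark-2.4.3-bin-hadoop2.7\hadoop"
Note that setx writes the variables for new Command Prompt windows, so open a fresh window afterwards; to store HADOOP_HOME as the unexpanded reference %SPARK_HOME%\hadoop, set it through the System Properties dialog instead.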
If you now run the bin\pyspark script from a Windows Command Prompt, the error messages related to winutils.exe should be gone.
Step 5
Configuring the Log Level for Spark
There are still a lot of extra INFO messages in the console every time you start or exit the PySpark shell or run the spark-submit utility. So let's make one more change to our Spark installation so that only warning and error messages are written to the console. To do this:
a) Copy the log4j.properties.template file in the SPARK_HOME\conf folder to a file named log4j.properties in the same folder.
b) Set the log4j.rootCategory property value to WARN, console.
c) Save the log4j.properties file.
Now INFO messages will no longer be written to the console.
Summary
To work with PySpark, start a Command Prompt and change into your SPARK_HOME directory.
a) To start the PySpark shell, run the bin\pyspark utility. Once you are in the PySpark shell, use the sc and sqlContext names, and type exit() to return to the Command Prompt.
b) To run a standalone Python script, run the bin\spark-submit utility and specify the path of your Python script as well as any arguments your Python script needs on the command line. For example, to run the wordcount.py script from the examples directory in your SPARK_HOME folder, you can run the following command:
bin\spark-submit examples\src\main\python\wordcount.py README.md
Step 6
Important: I ran into an issue during installation
After completing the installation procedure on my Windows 10 machine, I kept getting the following error message.
Solution:
I finally figured out how to fix it!
In my case, I did not know that I needed to add THREE Anaconda-related paths to the PATH environment variable:
C:\Users\uug20\Anaconda3
C:\Users\uug20\Anaconda3\Scripts
C:\Users\uug20\Anaconda3\Library\bin
After that, I stopped getting error messages, and pyspark started working correctly: typing pyspark at the command prompt opened a Jupyter notebook.
A Beginner's Guide to PySpark
PySpark is an API for Apache Spark, an open-source system used for distributed processing of big data. Spark was originally developed in the Scala programming language at the University of California, Berkeley.
Spark provides APIs for Scala, Java, Python, and R. The system supports reusing code across workloads: batch processing, interactive queries, real-time analytics, machine learning, and graph computation. It uses in-memory caching and optimized query execution against data of any size.
It does not have a single file system of its own, such as the Hadoop Distributed File System (HDFS); instead, Spark supports many popular file systems and data stores, such as HDFS, HBase, Cassandra, Amazon S3, Amazon Redshift, Couchbase, and so on.
Advantages of using Apache Spark:
Setting Up the Environment in Google Colab
Running pyspark on a local machine requires Java and some other software. So instead of a complicated installation procedure, we will use Google Colaboratory, which perfectly satisfies our hardware requirements and also comes with a wide set of libraries for data analysis and machine learning. All that remains is to install the pyspark and Py4J packages. Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java virtual machine.
The final notebook can be downloaded from the repository: https://gitlab.com/PythonRu/notebooks/-/blob/master/pyspark_beginner.ipynb
The command to install the packages mentioned above:
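pip install pyspark py4j
(In a Colab notebook cell, prefix the command with an exclamation mark: !pip install pyspark py4j.)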
Spark Session
SparkSession has been the entry point to PySpark since version 2.0; earlier, SparkContext was used for this purpose. SparkSession is how you initialize core PySpark functionality to programmatically create PySpark RDDs, DataFrames, and Datasets. It can be used in place of SQLContext, HiveContext, and the other contexts defined before 2.0.
Creating a SparkSession
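A minimal sketch (the application name here is arbitrary; local[*] runs Spark locally on all available cores):

from pyspark.sql import SparkSession

# create a new session, or reuse the existing one
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('pyspark_beginner') \
    .getOrCreate()

print(spark.version)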
How to Install Apache Spark on Windows 10
Apache Spark is an open-source framework that processes large volumes of stream data from multiple sources. Spark is used in distributed computing with machine learning applications, data analytics, and graph-parallel processing.
This guide will show you how to install Apache Spark on Windows 10 and test the installation.
Install Apache Spark on Windows
Installing Apache Spark on Windows 10 may seem complicated to novice users, but this simple tutorial will have you up and running. If you already have Java 8 and Python 3 installed, you can skip the first two steps.
Step 1: Install Java 8
Apache Spark requires Java 8. You can check to see if Java is installed using the command prompt.
Open the command line by clicking Start > type cmd > click Command Prompt.
Type the following command in the command prompt:
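java -version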
If Java is installed, it will respond with the following output:
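java version "1.8.0_251"
Java(TM) SE Runtime Environment (build 1.8.0_251-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.251-b08, mixed mode)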
Your version may be different. The second digit is the Java version; in this case, Java 8.
If you donβt have Java installed:
1. Open a browser window, and navigate to https://java.com/en/download/.
2. Click the Java Download button and save the file to a location of your choice.
3. Once the download finishes, double-click the file to install Java.
Note: At the time this article was written, the latest Java version is 1.8.0_251. Installing a later version will still work. This process only needs the Java Runtime Environment (JRE); the full Development Kit (JDK) is not required. The download link to the JDK is https://www.oracle.com/java/technologies/javase-downloads.html.
Step 2: Install Python
1. To install the Python package manager, navigate to https://www.python.org/ in your web browser.
2. Mouse over the Download menu option and click Python 3.8.3, the latest version at the time of writing.
3. Once the download finishes, run the file.
4. Near the bottom of the first setup dialog box, check off Add Python 3.8 to PATH. Leave the other box checked.
5. Next, click Customize installation.
6. You can leave all boxes checked at this step, or you can uncheck the options you do not want.
7. Click Next.
8. Select the box Install for all users and leave other boxes as they are.
9. Under Customize install location, click Browse and navigate to the C drive. Add a new folder and name it Python.
10. Select that folder and click OK.
11. Click Install, and let the installation complete.
12. When the installation completes, click the Disable path length limit option at the bottom and then click Close.
13. If you have a command prompt open, restart it. Verify the installation by checking the version of Python:
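python --version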
Note: For detailed instructions on how to install Python 3 on Windows or how to troubleshoot potential issues, refer to our Install Python 3 on Windows guide.
Step 3: Download Apache Spark
1. Open a browser and navigate to the Spark download page at https://spark.apache.org/downloads.html.
2. Under the Download Apache Spark heading, there are two drop-down menus. Use the current non-preview version.
3. Click the spark-2.4.5-bin-hadoop2.7.tgz link.
4. A page with a list of mirrors loads where you can see different servers to download from. Pick any from the list and save the file to your Downloads folder.
Step 4: Verify Spark Software File
1. Verify the integrity of your download by checking the checksum of the file. This ensures you are working with unaltered, uncorrupted software.
2. Navigate back to the Spark Download page and open the Checksum link, preferably in a new tab.
3. Next, open a command line and enter the following command:
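certutil -hashfile C:\Users\<username>\Downloads\spark-2.4.5-bin-hadoop2.7.tgz SHA512
(Replace the path with the location where you saved the file; certutil prints the SHA-512 checksum of the download.)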
4. Compare the checksum to the one you opened in the new browser tab. If they match, your download file is uncorrupted.
Step 5: Install Apache Spark
Installing Apache Spark involves extracting the downloaded file to the desired location.
1. Create a new folder named Spark in the root of your C: drive. From a command line, enter the following:
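cd \
mkdir Spark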
2. In Explorer, locate the Spark file you downloaded.
3. Right-click the file and extract it to C:\Spark using the tool you have on your system (e.g., 7-Zip).
4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the necessary files inside.
Step 6: Add winutils.exe File
Download the winutils.exe file for the underlying Hadoop version of the Spark installation you downloaded (Hadoop 2.7 for spark-2.4.5-bin-hadoop2.7).
1. Navigate to this URL https://github.com/cdarlint/winutils and inside the bin folder, locate winutils.exe, and click it.
2. Find the Download button on the right side to download the file.
3. Now, create new folders Hadoop and bin on C: using Windows Explorer or the Command Prompt.
4. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
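For example, from a command prompt (assuming the file was saved to your Downloads folder):
mkdir C:\hadoop\bin
copy %USERPROFILE%\Downloads\winutils.exe C:\hadoop\bin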
Step 7: Configure Environment Variables
Configuring environment variables in Windows adds the Spark and Hadoop locations to your system PATH. It allows you to run the Spark shell directly from a command prompt window.
1. Click Start and type environment.
2. Select the result labeled Edit the system environment variables.
3. A System Properties dialog box appears. In the lower-right corner, click Environment Variables and then click New in the next window.
4. For Variable Name type SPARK_HOME.
5. For Variable Value type C:\Spark\spark-2.4.5-bin-hadoop2.7 and click OK. If you changed the folder path, use that one instead.
6. In the top box, click the Path entry, then click Edit. Be careful with editing the system path. Avoid deleting any entries already on the list.
7. You should see a box with entries on the left. On the right, click New.
8. The system highlights a new line. Enter the path to the Spark folder C:\Spark\spark-2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin to avoid possible issues with the path.
9. Repeat this process for Hadoop and Java: for example, set HADOOP_HOME to C:\hadoop (the folder from Step 6) and JAVA_HOME to your Java installation folder.
10. Click OK to close all open windows.
Note: Start by restarting the Command Prompt to apply changes. If that doesn't work, you will need to reboot the system.
Step 8: Launch Spark
1. Open a new command-prompt window using right-click and Run as administrator.
2. To start Spark, enter:
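C:\Spark\spark-2.4.5-bin-hadoop2.7\bin\spark-shell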
If you set the environment path correctly, you can type spark-shell to launch Spark.
3. The system should display several lines indicating the status of the application. You may get a Java pop-up. Select Allow access to continue.
Finally, the Spark logo appears, and the prompt displays the Scala shell.
4. Open a web browser and navigate to http://localhost:4040/.
5. You can replace localhost with the name of your system.
6. You should see an Apache Spark shell Web UI. The example below shows the Executors page.
7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.
Note: If you installed Python, you can run Spark using Python with this command:
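pyspark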
Test Spark
In this example, we will launch the Spark shell and use Scala to read the contents of a file. You can use an existing file, such as the README file in the Spark directory, or you can create your own. We created pnaptest with some text.
1. Open a command-prompt window and navigate to the folder with the file you want to use and launch the Spark shell.
2. First, declare a variable to use in the Spark context, holding the name of the file. Remember to add the file extension if there is one.
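For example, with the pnaptest file created earlier:
val x = sc.textFile("pnaptest")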
3. The output shows an RDD is created. Then, we can view the file contents by using this command to call an action:
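x.take(11).foreach(println)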
This command instructs Spark to print 11 lines from the file you specified. To perform an action on this file (value x), add another value y, and do a map transformation.
4. For example, you can print the characters in reverse with this command:
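val y = x.map(_.reverse)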
5. The system creates a child RDD in relation to the first one. Then, specify how many lines you want to print from the value y:
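y.take(11).foreach(println)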
The output prints 11 lines of the pnaptest file in the reverse order.
You should now have a working installation of Apache Spark on Windows 10 with all dependencies installed. Get started running an instance of Spark in your Windows environment.
We also suggest learning more about what a Spark DataFrame is, its features, and how to use Spark DataFrames when collecting data.