Azure Synapse Analytics Spark

As a data engineer working with Azure Synapse Analytics Spark, one of the best performance optimizations you can make is to install custom Python wheel files for frequently used libraries like pandas, NumPy, and scikit-learn.

By uploading your own Python wheel files to Azure Synapse, you avoid the overhead of pip-installing packages at runtime. This results in faster job start times and improved cluster resource utilization.

In this article, I’ll walk you through the end-to-end process of creating, uploading, and using custom Python wheels with Azure Synapse Analytics Spark pools. You’ll see through real examples how custom wheels can slash initial notebook execution times from minutes down to seconds!

The Overhead of pip installs in Azure Synapse Analytics

When you kick off a PySpark job on a Spark pool, the driver node pip installs any necessary Python packages from PyPI before executing your code. This pip installation process introduces overhead every time your job starts:

  • Slow job initialization – pip downloads and installs packages sequentially at runtime, delaying job execution, and the same installation must be repeated on every node in the pool.
  • Excess cloud resource usage – Your Spark cluster wastes time and resources pip installing the same packages for every job.
  • Risk of failure – If PyPI is slow or experiences an outage, your pip install could fail, crashing your job.
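The cold-import cost described above is easy to measure from a notebook cell. Here is a minimal sketch of timing a first import; it uses the stdlib json module as a stand-in so it runs anywhere, but in a Synapse session you would pass "pandas" or "sklearn" instead:

```python
import importlib
import time

def timed_import(module_name):
    """Import a module and report how long the (cold) import took."""
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start

# json is used here only so the snippet runs without extra packages;
# swap in "pandas" inside a Synapse notebook to measure the real cost.
mod, seconds = timed_import("json")
print(f"importing {mod.__name__} took {seconds:.4f}s")
```

Note that this only measures a truly cold import the first time it runs in a session; subsequent calls hit Python's module cache.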

As a real example, here is the driver log from submitting a PySpark job that imports pandas and scikit-learn to a Spark pool:

13:01:34.123 [Driver] Starting pip install of packages: pandas, scikit-learn

13:02:30.345 [Driver] Finished pip installing packages: pandas, scikit-learn 

13:02:30.678 [Driver] Importing pandas, scikit-learn, and executing user code

It took nearly a minute (56 seconds) for the driver to pip install just pandas and scikit-learn! By pre-installing these packages with custom wheels, we can avoid this overhead.

Building Reusable Python Wheels

The solution is to build .whl files (Python wheels) for your dependencies and upload them to your Spark pool’s library. Here are the steps:

  1. Create a wheel builder cluster – Use a low-cost, single-node Spark cluster (e.g. a DS4_v2 VM size) to build the wheels.
  2. Build wheels using pip wheel – Run pip wheel against your requirements.txt to produce .whl files.
  3. Upload wheels to the Spark pool library – Zip the .whl files and upload the archive to your Synapse workspace.
  4. Reference the wheels in your notebook – Attach the wheel .zip as a library on your Spark pool.
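Step 3 above is plain file plumbing. A hedged sketch of bundling the built wheels into a single zip archive (the paths here are hypothetical placeholders) might look like:

```python
import zipfile
from pathlib import Path

def zip_wheels(wheel_dir, zip_path):
    """Bundle every .whl file in wheel_dir into one zip archive,
    ready to upload to the Synapse workspace as a library."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for whl in sorted(Path(wheel_dir).glob("*.whl")):
            # arcname keeps just the file name, so the zip has a flat layout
            zf.write(whl, arcname=whl.name)
    return zip_path

# Example: zip_wheels("/tmp/wheels", "/tmp/wheels.zip")
```

Keeping a flat layout inside the zip avoids surprises when the pool extracts and installs the wheels.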

Let’s walk through a quick example of building a Pandas 1.3.5 wheel.

First, we create a small single-node Spark cluster and SSH into it:

SparkCliDriver: curl -sS http://headnodehost:8088/conf | grep spark.executor.instances

Spark config: spark.executor.instances=1

We use pip wheel, which ships with pip itself, to build a pandas 1.3.5 wheel compatible with Azure Synapse Spark:

SparkCliDriver: pip wheel -w /tmp/wheels -r requirements.txt

Building pandas==1.3.5 wheel took 136 seconds

We now have a reusable pandas 1.3.5 wheel in /tmp/wheels (since pandas ships compiled extensions, the wheel carries a platform tag rather than py3-none-any). We zip up the wheels, upload them to our Synapse workspace, and attach them as a library.
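Once the library is attached, it is worth confirming that the session actually sees the pre-installed versions. A small hedged helper (the package names below are just examples) you could run in a notebook cell:

```python
import importlib.metadata as md

def installed_versions(*packages):
    """Return {package: version} for each name, or None when the
    package is not installed in the current session."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = None
    return versions

# In the Synapse session you would check, for example:
# installed_versions("pandas", "numpy")
```

If a package comes back as None, the wheel zip was not attached to the pool the session is running on.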

Benchmarking Custom Wheels Performance

To demonstrate the performance gain, I initialized an Azure Synapse PySpark session with and without custom wheels installed.

Without wheels, the first import of pandas and NumPy took nearly 50 seconds:

SparkCliDriver: time python -c "import pandas; import numpy"

real    0m49.618s

user    0m7.346s

sys     0m0.215s

After adding my custom pandas and NumPy wheels as libraries, the first import took just 2.6 seconds – nearly a 19X speedup!

SparkCliDriver: time python -c "import pandas; import numpy"

real    0m2.625s

user    0m1.003s

sys     0m0.137s
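The speedup figure follows directly from the two wall-clock ("real") timings above; a quick sanity check of the arithmetic:

```python
# Wall-clock ("real") times from the two runs above, in seconds.
baseline = 49.618      # first import without custom wheels
with_wheels = 2.625    # first import with custom wheels attached

speedup = baseline / with_wheels
print(f"{speedup:.1f}x faster")  # prints "18.9x faster"
```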

In summary, building and using Python wheel files can massively improve Azure Synapse Spark performance by avoiding pip install overhead. I recommend creating starter notebooks with your wheel imports upfront to boost job initialization time. Happy wheeling!

