Polars - Do we need another Python data science library?

If you have been working with Python for data tasks for any length of time, your go-to toolset probably includes Pandas and NumPy, with a little Matplotlib thrown in. NumPy arrays are a foundational piece of data analysis in Python, and the other two libraries build much of their functionality on top of them. Pandas has long been the standard library for data manipulation and analysis, and Matplotlib is the library of choice for creating static, animated, and interactive visualizations.

New on the scene is Polars, initially released in 2020. It's touted as a lightning-fast data frame library designed for efficient parallelism. Available for Python, R, Node.js, and Rust, Polars uses a columnar memory layout and multi-threaded execution, enabling it to process data significantly faster and with less memory than Pandas, especially for larger or more complex workloads. Essentially, that means the data is organized in vertical columns, much like the columns of a spreadsheet, rather than the traditional row-based structure you find in most SQL databases. The columnar format has proven more efficient for compression and quicker for data querying and retrieval.
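For anyone who has not tried it yet, here is a minimal sketch of what Polars code looks like (this assumes polars has been installed with pip install polars; note that older releases spell the group_by method as groupby):

```python
import polars as pl

# Build a small DataFrame; under the hood each column is stored as a
# contiguous block of memory (Apache Arrow format).
df = pl.DataFrame(
    {
        "city": ["Austin", "Boston", "Austin", "Denver"],
        "sales": [120, 300, 150, 210],
    }
)

# Aggregate sales per city.
print(df.group_by("city").agg(pl.col("sales").sum()))
```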

As datasets continue to grow and performance demands increase with the explosion of real-time data, the people behind Polars present it as a modern alternative. Pandas has been around for over fifteen years and provides a large set of tools for data wrangling, but its performance can suffer on very large datasets. Much of that comes down to Pandas operating mostly on a single thread, so it does not scale well on multi-core machines. In contrast, Polars is designed from the ground up to be multi-threaded, parallelizing data tasks across all available cores.
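As a rough, unscientific sketch of the difference, the snippet below runs the same group-by aggregation through both libraries. The actual timings will depend entirely on your hardware, dataset, and library versions, so treat this as a template for your own testing rather than a benchmark:

```python
import time

import numpy as np
import pandas as pd
import polars as pl

# Identical 10-million-row dataset in both libraries.
n = 10_000_000
rng = np.random.default_rng(0)
data = {"key": rng.integers(0, 1_000, n), "val": rng.random(n)}

pdf = pd.DataFrame(data)
pldf = pl.DataFrame(data)

# Pandas: runs the group-by on a single thread.
t0 = time.perf_counter()
pdf.groupby("key")["val"].mean()
print(f"Pandas: {time.perf_counter() - t0:.3f}s")

# Polars: the same aggregation, spread across the available cores.
t0 = time.perf_counter()
pldf.group_by("key").agg(pl.col("val").mean())
print(f"Polars: {time.perf_counter() - t0:.3f}s")
```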

One current disadvantage of Polars is that it does not support many of the popular commercial database products, although it does have built-in functionality for working with cloud data sources. If you need to connect to SQL Server or Oracle, you will need to stick with Pandas, but if you are using PostgreSQL, MySQL, Amazon S3, Azure Blob Storage, or one of many other cloud providers, feel free to try out Polars.
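For example, reading from those supported sources might look like the sketch below. The connection details are hypothetical, and the exact function names have shifted between Polars releases (read_database_uri is the current spelling and relies on the connectorx package):

```python
import polars as pl

# Hypothetical connection details, for illustration only.
# PostgreSQL via a connection URI (Polars delegates to the connectorx engine).
df = pl.read_database_uri(
    query="SELECT * FROM sales",
    uri="postgresql://user:password@host:5432/mydb",
)

# Parquet files on Amazon S3; credentials come from the environment or
# can be supplied through storage_options.
lazy = pl.scan_parquet("s3://my-bucket/sales/*.parquet")
```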

An interesting innovation in Polars is the Lazy API. From the documentation:

With the lazy API, Polars doesn't run each query line-by-line but instead processes the full query end-to-end. To get the most out of Polars it is important that you use the lazy API because:

  1. the lazy API allows Polars to apply automatic query optimization with the query optimizer
  2. the lazy API allows you to work with larger than memory datasets using streaming
  3. the lazy API can catch schema errors before processing the data.
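To make that concrete, here is a small sketch of a lazy query; the file name is hypothetical:

```python
import polars as pl

# Nothing executes until .collect(); Polars optimizes the whole plan first,
# e.g. pushing the filter down into the scan so only matching rows and the
# referenced columns are ever read.
query = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
)

# Inspect the optimized plan without running it...
print(query.explain())

# ...then execute. Recent versions also accept collect(streaming=True)
# to process larger-than-memory data in batches.
result = query.collect()
```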

The ability to optimize the query plan automatically may be a game changer. Unfortunately, other than not using the lazy API at all, I don't see any way to override the optimizer to try different options.

One thing I want to experiment with is the ability to convert an existing data frame from Pandas to Polars and back. That may be a way around the lack of support for SQL Server: use Pandas and SQLAlchemy to read the data into a data frame, convert it to a Polars version for processing, and then convert back and write out to SQL Server. My motivation is that my primary data source tends to be SQL Server, so if there are ways to improve on that workflow, I am all for them. It may be possible, but is it worth the effort? I'll post the results once I have worked on it. Stay tuned!
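In sketch form, the round trip I have in mind would look something like this (the connection string and table names are hypothetical, and pl.from_pandas needs pyarrow installed):

```python
import pandas as pd
import polars as pl
from sqlalchemy import create_engine

# Hypothetical SQL Server connection string (requires the pyodbc driver).
engine = create_engine(
    "mssql+pyodbc://user:password@myserver/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# 1. Read from SQL Server with Pandas + SQLAlchemy.
pdf = pd.read_sql("SELECT * FROM sales", engine)

# 2. Hand off to Polars and do the heavy processing there.
pldf = pl.from_pandas(pdf)
summary = pldf.group_by("region").agg(pl.col("amount").sum().alias("total"))

# 3. Convert back and write the result out through Pandas.
summary.to_pandas().to_sql(
    "sales_summary", engine, if_exists="replace", index=False
)
```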

So back to the question I posed: do we need another Python data science library? As with most things, the answer is, it depends. Multi-threaded processing and automatic query optimization are both interesting features. They aren't for every use case, but in particular instances they may be beneficial.