7 newer data science tools you should be using with Python
Python’s rich ecosystem of data science tools is a big draw
for users. The only downside of such a broad and deep collection is that
sometimes the best tools can get overlooked.
Here’s a rundown of some of the best newer or lesser-known
data science projects available for Python. Some, like Polars, are getting
more attention but still deserve wider notice. Others, like ConnectorX, are
hidden gems.
Most data sits in a database somewhere, but computation
typically happens outside of it. Getting data into and out of the database for
actual work can be a real bottleneck. ConnectorX loads data from databases
into many common data-wrangling tools in Python, and it keeps things fast by
minimizing the work required. Most of the data loading can be done in just a
couple of lines of Python code and an SQL query.
Like Polars (which I’ll discuss shortly), ConnectorX uses
a Rust library at its core. This allows for optimizations like being
able to load from a data source in parallel with partitioning. Data in PostgreSQL,
for instance, can be loaded this way by specifying a partition column.
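Here’s a minimal sketch of what that looks like; the connection string, table, and partition column below are placeholders for illustration:

```python
import connectorx as cx

# Hypothetical PostgreSQL connection string and query; adjust for your setup.
conn = "postgresql://user:password@localhost:5432/mydb"
query = "SELECT * FROM orders"

# Load the result in parallel by splitting the query on a numeric column.
# The default return type is a Pandas DataFrame; Polars and Arrow also work.
df = cx.read_sql(
    conn,
    query,
    partition_on="order_id",   # column used to split the query into chunks
    partition_num=4,           # number of partitions loaded in parallel
    return_type="pandas",
)
```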
Data science folks who use Python ought to be aware of SQLite—a
small, but powerful and speedy relational database packaged with Python. Since
it runs as an in-process library, rather than a separate application, SQLite is
lightweight and responsive.
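Because the sqlite3 module ships in the standard library, using it is just an import away; the table and rows below are made up for illustration:

```python
import sqlite3

# SQLite runs in-process; ":memory:" keeps the database entirely in RAM.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, value REAL)")
con.executemany("INSERT INTO scores VALUES (?, ?)", [("a", 1.5), ("b", 2.5)])
for row in con.execute("SELECT name, value FROM scores ORDER BY value DESC"):
    print(row)
con.close()
```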
DuckDB is a little like someone answered the question,
“What if we made SQLite for OLAP?” Like other OLAP database engines,
it uses a columnar datastore and is optimized for long-running analytical query
workloads. But DuckDB gives you all the things you expect from a conventional
database, like ACID transactions. And there’s no separate software suite to
configure; you can get it running in a Python environment with a single pip
install duckdb command.
DuckDB can directly ingest data in CSV, JSON, or Parquet format, as well as a slew of other common data sources. The resulting databases can also be partitioned into multiple physical files for efficiency, based on keys (e.g., by year and month). Querying works like any other SQL-powered relational database, but with additional built-in features like the ability to take random samples of data or construct window functions.
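As a rough sketch (the file names here are placeholders), querying a Parquet or CSV file directly looks like this:

```python
import duckdb

con = duckdb.connect()  # in-memory database by default

# DuckDB can query a Parquet or CSV file in place, with no explicit import step.
result = con.sql("""
    SELECT category, AVG(amount) AS avg_amount
    FROM 'sales.parquet'
    GROUP BY category
""")
print(result)

# Built-in sampling: pull a 1% random sample of the rows in a CSV file.
sample = con.sql("SELECT * FROM 'events.csv' USING SAMPLE 1 PERCENT")
```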
Optimus is an all-in-one toolkit for loading, exploring, cleaning, and writing
out data. It can use Pandas, Dask, CUDF (and Dask + CUDF),
Vaex, or Spark as its underlying data engine. Data can be loaded in
from and saved back out to Arrow, Parquet, Excel, a variety of common database
sources, or flat-file formats like CSV and JSON.
The data manipulation API resembles Pandas, but adds .rows() and .cols() accessors
to make it easy to do things like sort a DataFrame, filter by column values,
alter data according to criteria, or narrow the range of operations based on
some criteria. Optimus also comes bundled with processors for handling common
real-world data types like email addresses and URLs.
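Here’s a hypothetical sketch of the Pandas-backed flavor of that API; the file names are placeholders, and the exact method names may vary between Optimus versions:

```python
from optimus import Optimus

# Use Pandas as the underlying engine; "dask", "cudf", or "spark" also work.
op = Optimus("pandas")

# File names are placeholders; method names follow Optimus's documentation
# and may differ in the version you install.
df = op.load.csv("customers.csv")
df = df.cols.upper("*")              # uppercase every string column
df.save.csv("customers_clean.csv")
```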
If you spend much time working with DataFrames and you’re
frustrated by the performance limits of Pandas, reach for Polars.
This DataFrame library for Python offers a convenient syntax similar to Pandas.
Unlike Pandas, though, Polars uses a library written
in Rust that takes maximum advantage of your hardware out of the box.
You don’t need to use special syntax to take advantage of performance-enhancing
features like parallel processing or SIMD; it’s all automatic. Even simple
operations like reading from a CSV file are faster. Rust developers can craft
their own Polars extensions using pyo3.
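A short sketch of the expression-based Polars API, assuming a recent version of the library (the file and column names are placeholders):

```python
import polars as pl

# Reading a CSV; Polars parallelizes this automatically under the hood.
df = pl.read_csv("trips.csv")

# Filter, group, and aggregate in one expression pipeline.
summary = (
    df.filter(pl.col("distance") > 0)
      .group_by("vendor")
      .agg(pl.col("fare").mean().alias("avg_fare"))
)
print(summary)
```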
A major and pervasive issue with data science experiments
is version control—not of the project’s code, but its data. DVC,
short for Data Version Control, lets you attach version descriptors to
datasets, check them into Git as you would the rest of your code, and keep
versions of data and code consistent together.
DVC can track almost any kind of dataset, as long as it can
be expressed as a file, whether kept in local storage or in a remote
storage service like an Amazon S3 bucket. You can describe how data models
are managed and used by way of a “pipeline,” which DVC’s documentation
describes as being like “a Makefile system for machine learning projects.”
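DVC is driven mostly from the command line, but it also exposes a small Python API; here’s a sketch of reading one version of a tracked file, with the repo URL, path, and tag as placeholders:

```python
import dvc.api

# Open a DVC-tracked file as it existed at a given Git revision or tag.
# The repo URL, path, and rev below are placeholders for illustration.
with dvc.api.open(
    "data/training.csv",
    repo="https://github.com/example/project",
    rev="v1.0",
) as f:
    header = f.readline()
    print(header)
```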
Good machine learning datasets are hard to come by, because
it’s expensive and time-consuming to create clean, properly labeled data.
Sometimes, though, you have no choice but to use data that’s raw and
inconsistent. Cleanlab (as in, “cleans labels”) was made for this
scenario.
Cleanlab uses the predictions of a machine learning model to
analyze a lower-quality, noisily or inconsistently labeled dataset. You train a
model on the original dataset, use Cleanlab to flag the labels that appear to
be wrong, then re-train on the automatically cleaned and adjusted dataset to
see the difference.
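Here’s a rough sketch of that loop using Cleanlab’s CleanLearning wrapper around a scikit-learn classifier, with a toy synthetic dataset standing in for real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

# Toy data with some deliberately flipped labels, just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
noisy = y.copy()
flip = rng.choice(len(y), size=50, replace=False)
noisy[flip] = 1 - noisy[flip]          # corrupt 10% of the labels

# Wrap any scikit-learn-compatible classifier; CleanLearning estimates which
# labels look wrong and fits the model on the cleaned-up data.
cl = CleanLearning(LogisticRegression())
cl.fit(X, noisy)

# Inspect which examples Cleanlab flagged as probable label errors.
issues = cl.get_label_issues()
print(issues.query("is_label_issue").head())
```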
Data science workflows are hard to set up, and even
harder to set up in a consistent, predictable way. Snakemake was created
to automate the process, setting up data analysis workflows in ways that ensure
everyone gets the same results. Many existing data science projects rely on
Snakemake. The more moving parts you have in your data science workflow, the
more likely you’ll benefit from automating that workflow with Snakemake.
Snakemake workflows resemble GNU Make workflows—you define
the steps of the workflow with rules, which specify what they take in, what
they put out, and what commands to execute to accomplish that. Workflow rules
can be multithreaded (assuming that gives them any benefit), and configuration
data can be piped in from JSON or YAML files. You can also
define functions in your workflows to transform data used in rules, and write
the actions taken at each step to logs.
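A minimal sketch of a Snakefile rule; the file names and the cleaning script it calls are placeholders:

```python
# Snakefile: one rule that turns a raw input into a processed output.
rule clean_data:
    input:
        "data/raw.csv"
    output:
        "results/clean.csv"
    threads: 4                      # Snakemake caps this at the cores available
    log:
        "logs/clean_data.log"
    shell:
        # scripts/clean.py is a placeholder for your own processing step.
        "python scripts/clean.py {input} {output} --threads {threads} > {log} 2>&1"
```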