7 newer data science tools you should be using with Python
Python’s rich ecosystem of data science tools is a big draw
for users. The only downside of such a broad and deep collection is that
sometimes the best tools can get overlooked.
Here’s a rundown of some of the best newer or lesser-known
data science projects available for Python. Some, like Polars, are getting
more attention but still deserve wider notice. Others, like ConnectorX, are
hidden gems.
Most data sits in a database somewhere, but computation
typically happens outside of it. Getting data into and out of the database for
actual work can be a real bottleneck. ConnectorX loads data from databases
into many common data-wrangling tools in Python, and it keeps things fast by
minimizing the work required. Most of the data loading can be done in just a
couple of lines of Python code and an SQL query.
Like Polars (which I’ll discuss shortly), ConnectorX uses
a Rust library at its core. This allows for optimizations like being
able to load from a data source in parallel with partitioning. Data in PostgreSQL,
for instance, can be loaded this way by specifying a partition column.
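Here’s a minimal sketch of what that looks like; the connection string, table, and partition column below are placeholders for illustration:

```python
import connectorx as cx

# Hypothetical PostgreSQL connection string and query; adjust for your setup.
conn = "postgresql://user:password@localhost:5432/mydb"
query = "SELECT * FROM orders"

# Load the result in parallel by splitting the query on a numeric column.
# The default return type is a Pandas DataFrame; Polars and Arrow also work.
df = cx.read_sql(
    conn,
    query,
    partition_on="order_id",   # column used to split the query into chunks
    partition_num=4,           # number of partitions loaded in parallel
    return_type="pandas",
)
```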
Data science folks who use Python ought to be aware of SQLite—a
small, but powerful and speedy relational database packaged with Python. Since
it runs as an in-process library, rather than a separate application, SQLite is
lightweight and responsive.
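Because the sqlite3 module ships in the standard library, using it is just an import away; the table and rows below are made up for illustration:

```python
import sqlite3

# SQLite runs in-process; ":memory:" keeps the database entirely in RAM.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, value REAL)")
con.executemany("INSERT INTO scores VALUES (?, ?)", [("a", 1.5), ("b", 2.5)])
for row in con.execute("SELECT name, value FROM scores ORDER BY value DESC"):
    print(row)
con.close()
```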
DuckDB is a little like someone answered the question,
“What if we made SQLite for OLAP?” Like other OLAP database engines,
it uses a columnar datastore and is optimized for long-running analytical query
workloads. But DuckDB gives you all the things you expect from a conventional
database, like ACID transactions. And there’s no separate software suite to
configure; you can get it running in a Python environment with a single pip
install duckdb command.
DuckDB can directly ingest data in CSV, JSON, or Parquet format, as well as a slew of other common data sources. The resulting databases can also be partitioned into multiple physical files for efficiency, based on keys (e.g., by year and month). Querying works like any other SQL-powered relational database, but with additional built-in features like the ability to take random samples of data or construct window functions.
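As a rough sketch (the file names here are placeholders), querying a Parquet or CSV file directly looks like this:

```python
import duckdb

con = duckdb.connect()  # in-memory database by default

# DuckDB can query a Parquet or CSV file in place, with no explicit import step.
result = con.sql("""
    SELECT category, AVG(amount) AS avg_amount
    FROM 'sales.parquet'
    GROUP BY category
""")
print(result)

# Built-in sampling: pull a 1% random sample of the rows in a CSV file.
sample = con.sql("SELECT * FROM 'events.csv' USING SAMPLE 1 PERCENT")
```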
Optimus is an all-in-one toolkit for loading, exploring, cleaning, and writing
out data. It can use Pandas, Dask, CUDF (and Dask + CUDF),
Vaex, or Spark as its underlying data engine. Data can be loaded in
from and saved back out to Arrow, Parquet, Excel, a variety of common database
sources, or flat-file formats like CSV and JSON.
The data manipulation API resembles Pandas, but adds .rows() and .cols() accessors
to make it easy to do things like sort a DataFrame, filter by column values,
alter data according to criteria, or narrow the range of operations based on
some criteria. Optimus also comes bundled with processors for handling common
real-world data types like email addresses and URLs.
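Here’s a hypothetical sketch of the Pandas-backed flavor of that API; the file names are placeholders, and the exact method names may vary between Optimus versions:

```python
from optimus import Optimus

# Use Pandas as the underlying engine; "dask", "cudf", or "spark" also work.
op = Optimus("pandas")

# File names are placeholders; method names follow Optimus's documentation
# and may differ in the version you install.
df = op.load.csv("customers.csv")
df = df.cols.upper("*")              # uppercase every string column
df.save.csv("customers_clean.csv")
```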
If you spend much time working with DataFrames and you’re
frustrated by the performance limits of Pandas, reach for Polars.
This DataFrame library for Python offers a convenient syntax similar to Pandas.
Unlike Pandas, though, Polars uses a library written
in Rust that takes maximum advantage of your hardware out of the box.
You don’t need to use special syntax to take advantage of performance-enhancing
features like parallel processing or SIMD; it’s all automatic. Even simple
operations like reading from a CSV file are faster. Rust developers can craft
their own Polars extensions using pyo3.
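A short sketch of the expression-based Polars API, assuming a recent version of the library (the file and column names are placeholders):

```python
import polars as pl

# Reading a CSV; Polars parallelizes this automatically under the hood.
df = pl.read_csv("trips.csv")

# Filter, group, and aggregate in one expression pipeline.
summary = (
    df.filter(pl.col("distance") > 0)
      .group_by("vendor")
      .agg(pl.col("fare").mean().alias("avg_fare"))
)
print(summary)
```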
A major and pervasive issue with data science experiments
is version control—not of the project’s code, but its data. DVC,
short for Data Version Control, lets you attach version descriptors to
datasets, check them into Git as you would the rest of your code, and keep
versions of data and code consistent together.
DVC can track almost any kind of dataset, as long as it can
be expressed as a file, whether kept in local storage or in a remote
storage service like an Amazon S3 bucket. You can describe how data models
are managed and used by way of a “pipeline,” which DVC’s documentation
describes as being like “a Makefile system for machine learning projects.”
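DVC is driven mostly from the command line, but it also exposes a small Python API; here’s a sketch of reading one version of a tracked file, with the repo URL, path, and tag as placeholders:

```python
import dvc.api

# Open a DVC-tracked file as it existed at a given Git revision or tag.
# The repo URL, path, and rev below are placeholders for illustration.
with dvc.api.open(
    "data/training.csv",
    repo="https://github.com/example/project",
    rev="v1.0",
) as f:
    header = f.readline()
    print(header)
```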
Good machine learning datasets are hard to come by, because
it’s expensive and time-consuming to create clean, properly labeled data.
Sometimes, though, you have no choice but to use data that’s raw and
inconsistent. Cleanlab (as in, “cleans labels”) was made for this
scenario.
Cleanlab uses the predictions of a machine learning model to
analyze a lower-quality, noisily or inconsistently labeled dataset. You train a
model on the original dataset, use Cleanlab to flag the labels that appear to
be wrong, then re-train on the automatically cleaned and adjusted dataset to
see the difference.
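Here’s a rough sketch of that loop using Cleanlab’s CleanLearning wrapper around a scikit-learn classifier, with a toy synthetic dataset standing in for real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

# Toy data with some deliberately flipped labels, just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
noisy = y.copy()
flip = rng.choice(len(y), size=50, replace=False)
noisy[flip] = 1 - noisy[flip]          # corrupt 10% of the labels

# Wrap any scikit-learn-compatible classifier; CleanLearning estimates which
# labels look wrong and fits the model on the cleaned-up data.
cl = CleanLearning(LogisticRegression())
cl.fit(X, noisy)

# Inspect which examples Cleanlab flagged as probable label errors.
issues = cl.get_label_issues()
print(issues.query("is_label_issue").head())
```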
Data science workflows are hard to set up, and even
harder to set up in a consistent, predictable way. Snakemake was created
to automate the process, setting up data analysis workflows in ways that ensure
everyone gets the same results. Many existing data science projects rely on
Snakemake. The more moving parts you have in your data science workflow, the
more likely you’ll benefit from automating that workflow with Snakemake.
Snakemake workflows resemble GNU Make workflows—you define
the steps of the workflow with rules, which specify what they take in, what
they put out, and what commands to execute to accomplish that. Workflow rules
can be multithreaded (assuming that gives them any benefit), and configuration
data can be piped in from JSON or YAML files. You can also
define functions in your workflows to transform data used in rules, and write
the actions taken at each step to logs.
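A minimal sketch of a Snakefile rule; the file names and the cleaning script it calls are placeholders:

```python
# Snakefile: one rule that turns a raw input into a processed output.
rule clean_data:
    input:
        "data/raw.csv"
    output:
        "results/clean.csv"
    threads: 4                      # Snakemake caps this at the cores available
    log:
        "logs/clean_data.log"
    shell:
        # scripts/clean.py is a placeholder for your own processing step.
        "python scripts/clean.py {input} {output} --threads {threads} > {log} 2>&1"
```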