A data manipulation library implemented in C++. It is heavily SQL-based and runs in-memory, which makes it a very fast replacement for SQLite.
Its functionality overlaps a lot with polars.
It is available as a standalone CLI application, with bindings to Python among other languages. Since it implements the Arrow protocol, it can interoperate easily with other packages like pandas and polars.
API Quickstart
The module-level .sql function uses lazy execution and only runs the query when .show() is called.
.show returns None.
The __repr__ of a “relation” (see “Variables”) produces the same output as .show. The difference is that __repr__ returns a string, which allows further formatting of the representation. To get the underlying data, see Conversions.
“Variables”
Unexecuted queries are known as “relations”. These store logic and can be referenced in subsequent queries. They can be thought of as common table expressions.
Interoperability
DuckDB code can be interleaved with other libraries.
This can be done (likely at no cost) thanks to the Arrow protocol. It works only on dataframe-like objects.[2][3]
If a variable is not accessible in the top-level scope, you can manually register it as a virtual table[4] in the namespace:
Conversions
The lazy query can be executed and converted to various objects:
Ingestion
Generally speaking, there are a few ways to ingest data for a given file type or source.
One cool thing is the glob-style reads:
Persistence
Similarly, there are a few ways to serialize data to disk.
Connections
Users can choose between a global in-memory DB and persistent storage simply by using the appropriate methods.
Global In-Memory DB
Persistent Storage
User-Defined Function
You can register a Python function:
If type hints are not provided, then you’ll have to specify additional optional parameters for .create_function.
This approach also assumes the function to be pure. Otherwise, you’ll have to mark it with .create_function(..., side_effects=True), though it is unclear how duckdb treats it differently, other than no longer assuming that repeated calls return the same result.
See docs for additional details.
Expression API
This is quite similar to that in polars, where transformation logic can be iteratively composed. However, from the docs it seems to be less feature-rich than polars.
Footnotes
1. Obtained from their documentation.
2. Not sure whether they must have implemented the __dataframe__ protocol.
3. This approach is very similar to pandas’s query method, e.g. df.query("@my_var + 1").
4. Like a SQL VIEW.