A data manipulation library, implemented in C++. It’s heavily SQL-based and runs in-memory by default, making it an ultra-performant replacement for SQLite.

Its functionality overlaps a lot with polars.

It is available as a standalone CLI application, with bindings for Python amongst other languages. Since it implements the Arrow protocol, it can easily interoperate with packages like pandas and polars.

API Quickstart

For the Python client:1

import duckdb
duckdb.sql("SELECT 10").show()

The module-level .sql function is lazy: the query only executes when .show() (or another materializing method) is called.

.show returns None

The __repr__ of a “relation” (see “Variables”) produces the same output as .show. The difference is that it allows further formatting of the representation.

To get the underlying data, see Conversions.

“Variables”

import duckdb
r1 = duckdb.sql("SELECT 10 AS i")
duckdb.sql("SELECT i * 2 AS k FROM r1").show()

Unexecuted queries are known as “relations”. They store query logic and can be referenced in subsequent queries, much like a common table expression.

Interoperability

DuckDB code can be interleaved with other libraries.

import duckdb
import pandas as pd
df = pd.DataFrame({"a": [42]})
duckdb.sql("SELECT * FROM df")

Thanks to the Arrow protocol, this likely incurs no copy cost. It works only on dataframe-like objects.23

If a variable is not accessible in the top-level scope, you can manually register it as a virtual table4 into the namespace:

duckdb.register("my_df", my_dict["my_df"])  # or `MyClass.my_df` etc.
duckdb.sql("... FROM my_df")

Conversions

The lazy query can be executed and converted to various objects:

import duckdb
duckdb.sql("SELECT 42").fetchall()   # Python objects (row tuples)
duckdb.sql("SELECT 42").df()         # Pandas DataFrame
duckdb.sql("SELECT 42").pl()         # Polars DataFrame
duckdb.sql("SELECT 42").arrow()      # Arrow Table
duckdb.sql("SELECT 42").fetchnumpy() # NumPy Arrays

Ingestion

Generally speaking, there are a few ways to ingest data for a given file type/source:

duckdb.read_csv("file.csv")
duckdb.sql("SELECT * FROM 'file.csv'")
duckdb.sql("SELECT * FROM read_csv('file.csv')")

One cool thing is the glob-style reads:

duckdb.read_csv("folder/*.csv")

Persistence

Similarly, there are a few ways to serialize data to disk:

duckdb.sql("SELECT 42").write_parquet("out.parquet") # Write to a Parquet file
duckdb.sql("SELECT 42").write_csv("out.csv") # Write to a CSV file
duckdb.sql("COPY (SELECT 42) TO 'out.parquet'") # Copy to a Parquet file

Connections

Users can choose between the global in-memory DB and persistent on-disk storage, simply by using the appropriate methods.

Global In-Memory DB

duckdb.sql("...")

Persistent Storage

with duckdb.connect("file.db") as con:
    con.sql("...")

User-Defined Functions

You can register a Python function:

import duckdb
 
def my_function(x: int) -> str:
    return str(x)
 
duckdb.create_function("my_func", my_function)
print(duckdb.sql("SELECT my_func(42)"))

If type hints are not provided, you’ll have to specify the parameter and return types via additional optional arguments to .create_function.

This approach also assumes the function is pure. Otherwise, you’ll have to mark .create_function(..., side_effects=True), though it’s unclear how DuckDB treats such functions differently, other than not re-running them as liberally.

See docs for additional details.

Expression API

This is quite similar to the one in polars, where transformation logic can be composed iteratively. However, judging from the docs, it seems less feature-rich than polars’.

Footnotes

  1. Obtained from their documentation. ↩

  2. Not sure if they must have implemented the __dataframe__ protocol or not. ↩

  3. This approach is very similar to pandas’s query method, df.query("@my_var + 1"). ↩

  4. Like a SQL VIEW ↩