A data manipulation library, implemented in C++. It is heavily SQL-based and runs in-memory, which makes it a highly performant replacement for SQLite. Its functionality overlaps a lot with polars.
It is available as a standalone CLI application with bindings to Python, amongst other languages. Since it implements the Arrow protocol, it can interoperate easily with other packages like pandas and polars.
API Quickstart
import duckdb
duckdb.sql("SELECT 10").show()
The module-level `.sql` function is lazily executed: the query only runs when `.show()` is called. Note that `.show()` returns `None`.
The `__repr__` of a "relation" (see "Variables") produces the same output as `.show()`; the difference is that `.show()` allows further formatting of the representation. To get the underlying data, see Conversions.
"Variables"
import duckdb
r1 = duckdb.sql("SELECT 10 AS i")
duckdb.sql("SELECT i * 2 AS k FROM r1").show()
Unexecuted queries are known as "relations". These store logic rather than results, and can be referenced in subsequent queries, much like a common table expression.
Interoperability
DuckDB code can be interleaved with other libraries.
import duckdb
import pandas as pd
df = pd.DataFrame({"a": [42]})
duckdb.sql("SELECT * FROM df")
This can likely be done at no cost, due to the Arrow protocol, though it works only on dataframe-like objects.2 3
If a variable is not accessible in the top-level scope, you can manually register it as a virtual table4 into the namespace:
duckdb.register("my_df", my_dict["my_df"]) # or `MyClass.my_df` etc.
duckdb.sql("... FROM my_df")
Conversions
The lazy query can be executed and converted to various objects:
import duckdb
duckdb.sql("SELECT 42").fetchall() # Python objects (row tuples)
duckdb.sql("SELECT 42").df() # Pandas DataFrame
duckdb.sql("SELECT 42").pl() # Polars DataFrame
duckdb.sql("SELECT 42").arrow() # Arrow Table
duckdb.sql("SELECT 42").fetchnumpy() # NumPy Arrays
Ingestion
Generally speaking, there are a few ways to ingest data for a given file type/source:
duckdb.read_csv("file.csv")
duckdb.sql("SELECT * FROM 'file.csv'")
duckdb.sql("SELECT * FROM read_csv('file.csv')")
One cool thing is the glob-style reads:
duckdb.read_csv("folder/*.csv")
Persistence
Similarly, there are a few ways to serialize data to disk
duckdb.sql("SELECT 42").write_parquet("out.parquet") # Write to a Parquet file
duckdb.sql("SELECT 42").write_csv("out.csv") # Write to a CSV file
duckdb.sql("COPY (SELECT 42) TO 'out.parquet'") # Copy to a Parquet file
Connections
Users can choose between a global in-memory DB and persistent storage, simply by connecting in the appropriate way.
Global In-Memory DB
duckdb.sql("...")
Persistent Storage
with duckdb.connect("file.db") as con:
con.sql("...")
User-Defined Functions
You can register a Python function:
import duckdb
def my_function(x: int) -> str:
    return str(x)
duckdb.create_function("my_func", my_function)
print(duckdb.sql("SELECT my_func(42)"))
If type hints are not provided, you'll have to specify additional optional parameters to .create_function.
This approach also assumes the function is pure. Otherwise, you'll have to mark it with .create_function(..., side_effects=True), though it's unclear how duckdb treats it differently, other than not re-running the function as liberally.
See docs for additional details.
Expression API
This is quite similar to the expression API in polars, where transformation logic can be composed iteratively. However, from the docs it seems to not be as feature-rich as polars.
Footnotes
1. Obtained from their documentation. ↩
2. Not sure if they must have implemented the __dataframe__ protocol or not. ↩
3. This approach is very similar to pandas's query method: df.query("@my_var + 1") ↩
4. Like a SQL VIEW. ↩