Optimization Techniques
Techniques | Description | API |
---|---|---|
Caching[^1] | Save an intermediate DataFrame for re-use when several downstream computations branch off it (see the caching sketch below) | df.cache() OR df.persist(StorageLevel.MEMORY_AND_DISK) for more fine-grained control |
Broadcast variables | Copy the smaller operand (e.g. a small lookup table) to every worker so joins avoid shuffling the larger side across the network (see the broadcast join sketch below) | from pyspark.sql.functions import broadcast |
Update to the latest Spark version | To receive more optimizations from the Tungsten engine and Catalyst optimizer | |
Use DataFrame APIs | They are higher-level than RDDs and benefit from Catalyst query optimization | |
Tune memory configuration | Match executor and driver memory to the workload; this usually takes some experimentation (see the config sketch below) | SparkConf().set("spark.executor.memory", "4g").set("spark.driver.memory", "2g") |
Avoid UDFs as much as possible | Python UDFs are opaque to the Catalyst optimizer and add serialization overhead; prefer native column functions (see the comparison below) | |
Repartition | Helpful when data is skewed and one node ends up with far more work than the others. A common rule of thumb is to aim for roughly 128 MB per partition (see the repartition sketch below). | df.repartition(numPartitions) or df.repartition(numPartitions, "col") |
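
A minimal caching sketch, assuming a hypothetical events DataFrame with status, country, and device columns: cache() keeps the filtered result around so the two aggregations that branch off it do not recompute the filter.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-example").getOrCreate()

# Hypothetical input; substitute your own source and column names.
events = spark.read.parquet("/data/events.parquet")

# An intermediate result that two downstream aggregations will branch from.
active = events.filter(events["status"] == "active")

# cache() stores the DataFrame with the default storage level (memory,
# spilling to disk as needed). For finer control, pick the level explicitly:
#   active.persist(StorageLevel.MEMORY_AND_DISK)
active.cache()

# Both actions reuse the cached rows instead of re-reading and re-filtering.
active.groupBy("country").count().show()
active.groupBy("device").count().show()

# Free the cached data once it is no longer needed.
active.unpersist()
```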
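
A broadcast-join sketch, assuming a large orders table and a small countries lookup table (both paths and column names are placeholders): the broadcast() hint copies the small table to every executor so the large table is not shuffled.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-example").getOrCreate()

orders = spark.read.parquet("/data/orders.parquet")        # large fact table
countries = spark.read.parquet("/data/countries.parquet")  # small lookup table

# Hint Spark to ship the small table to every executor; the join then runs
# locally on each node without shuffling the large table over the network.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.show()
```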
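
A memory-tuning sketch; the 4g/2g values are examples only, and the right numbers depend on the cluster and workload. Note that spark.driver.memory generally has to be set before the driver JVM starts (for example via spark-submit --driver-memory), so setting it in code only applies when building a fresh session.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Example values only; tune to your cluster and workload.
conf = (
    SparkConf()
    .set("spark.executor.memory", "4g")
    .set("spark.driver.memory", "2g")  # usually must be set before the driver JVM launches
)

spark = (
    SparkSession.builder
    .config(conf=conf)
    .appName("memory-tuning")
    .getOrCreate()
)
```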
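
A small comparison of a Python UDF versus the equivalent native column function, using a toy two-row DataFrame: both produce the same result, but the native version stays inside the JVM and can be optimized by Catalyst.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-native").getOrCreate()

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: a black box to the Catalyst optimizer, with Python<->JVM
# serialization overhead for every row.
upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
with_udf = df.withColumn("name_upper", upper_udf("name"))

# Native column function: same result, but optimized and executed in the JVM.
with_native = df.withColumn("name_upper", F.upper("name"))

with_native.show()
```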
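
A repartitioning sketch on a hypothetical clickstream table: repartition() triggers a full shuffle to spread rows evenly (optionally by key), while coalesce() only merges partitions and is cheaper when reducing the partition count. The 200/50 counts are placeholders; the ~128 MB-per-partition rule of thumb from the table is one way to pick them.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-example").getOrCreate()

clicks = spark.read.parquet("/data/clickstream.parquet")  # hypothetical skewed data

# Full shuffle: spread rows evenly across 200 partitions
# (aim for roughly 128 MB of data per partition).
evened = clicks.repartition(200)

# Shuffle by key so rows for the same user land in the same partition,
# which helps later joins and aggregations on that key.
by_user = clicks.repartition(200, "user_id")

# coalesce() merges existing partitions without a full shuffle; handy for
# shrinking the partition count after a selective filter.
fewer = evened.coalesce(50)
```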
[^1]: While Spark's lazy API shares some optimization techniques, such as predicate pushdown and projection pushdown, with Polars, caching is not one of them.