See Types of Databases.

Relational vs Non-relational databases

Deciding between the two falls mostly around:

  • Uniformity of data coming in
  • Volatility of data schema
  • Access pattern uniformity of downstream use cases
AspectsRelationalNon-relational
Schema definitionWell-definedNot well-defined
Schema assumptionAt write-timeAt read-time
Data source flexibilityLow; incoming data must abide by schemaHigh; Any data can come in without meeting schema requirements
Data consumption flexibilityLowHigh; Multiple consumers can each assume their own schema
Schema change implicationsUpstream and downstream processes need to be updatedOnly downstream processes that are affected need to update their schema
Storage enginesData warehousesData lakes
ScalableYes

Once downstream access patterns are well-defined
No

Shifting through lots of raw data is inefficient

Historical vs Latest Tables

  • Whether the fields in the tables themselves are evolving over time

So far, this applies to tabular data. Haven’t been tested against other forms of data modalities.

Data lakes + Data warehouse

Extracted data stays in data lakes.

Use ELT instead of ETL

This way, when incoming schema changes, raw data (from a data lake) can be re-transformed and loaded into the data warehouse.

Use dbt to manage transformation steps

This also serves as a business logic to machine learning translation layer.