See Types of Databases.
Relational vs Non-relational databases
Deciding between the two falls mostly around:
- Uniformity of data coming in
- Volatility of data schema
- Access pattern uniformity of downstream use cases
Aspects | Relational | Non-relational |
---|---|---|
Schema definition | Well-defined | Not well-defined |
Schema assumption | At write-time | At read-time |
Data source flexibility | Low; incoming data must abide by schema | High; Any data can come in without meeting schema requirements |
Data consumption flexibility | Low | High; Multiple consumers can each assume their own schema |
Schema change implications | Upstream and downstream processes need to be updated | Only downstream processes that are affected need to update their schema |
Storage engines | Data warehouses | Data lakes |
Scalable | Yes Once downstream access patterns are well-defined | No Shifting through lots of raw data is inefficient |
Historical vs Latest Tables
- Whether the fields in the tables themselves are evolving over time
Recommended Setup
So far, this applies to tabular data. Haven’t been tested against other forms of data modalities.
Data lakes + Data warehouse
Extracted data stays in data lakes.
Use ELT instead of ETL
This way, when incoming schema changes, raw data (from a data lake) can be re-transformed and loaded into the data warehouse.
Use dbt
to manage transformation steps
This also serves as a business logic to machine learning translation layer.