Star Profile Data Warehouse

Chaim Turkel
Published in Israeli Tech Radar
Nov 23, 2022


Facts vs Dimensions

Before we describe the design issue, we first need some clarification on what facts and dimensions are. A fact table stores the measurements of a business process (orders, page views, daily aggregates, and so on), together with foreign keys to the entities involved. A dimension table stores the descriptive attributes of those entities (customer, product, date, etc.) that reports use to slice and filter the facts.
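To make the distinction concrete, here is a minimal sketch of one dimension table and one daily aggregate fact table (the table and column names are hypothetical):

```sql
-- Hypothetical example: one dimension table and one daily aggregate fact table.
CREATE TABLE dim_customer (
    customer_id   BIGINT PRIMARY KEY,
    customer_name VARCHAR(255),
    country       VARCHAR(64)
);

CREATE TABLE fct_daily_orders (
    customer_id  BIGINT REFERENCES dim_customer (customer_id),
    order_date   DATE,
    order_count  INTEGER,
    order_amount DECIMAL(18, 2)
);
```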

Problem description

The issue at hand is that every fact table has some process that updates it. How often you refresh the data depends on the underlying source and its granularity. Once you have a fact table, generating a report means joining it to the dimension tables. With a single fact table there is no issue: you update it whenever needed and join it to all the relevant dimensions. Once you have more than one fact table, each updated at different intervals, you have a problem. Since the end report needs data points from multiple fact sources, how do you combine all the fact tables in one place based on time, and then join them to the dimension tables?
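For the single fact table case, the report query is just fact-to-dimension joins, roughly like this (continuing the hypothetical tables above):

```sql
-- With a single fact table, the report query is simply fact-to-dimension joins.
SELECT
    d.country,
    f.order_date,
    SUM(f.order_amount) AS total_amount
FROM fct_daily_orders AS f
JOIN dim_customer AS d
    ON d.customer_id = f.customer_id
GROUP BY
    d.country,
    f.order_date;
```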

Current Solutions

Currently, we have 2 basic solutions for this:

1. Create an outer join between all the fact tables based on time, and then join to all the dimension tables (sketched after this list). This solution does not scale: every new fact table adds another outer join.

2. Create one mega table, with each process sending an upsert command to update the relevant cells. This is not good for big data and does not scale either, and any new metric requires a schema change on the table.
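A rough sketch of solution 1, assuming a second hypothetical daily fact table (fct_daily_sessions) alongside the orders fact:

```sql
-- Solution 1 (sketch): full outer join the fact tables on key and time,
-- then join the result to the dimension tables. Every new fact table
-- adds another outer join, so this does not scale.
SELECT
    COALESCE(o.customer_id, s.customer_id) AS customer_id,
    COALESCE(o.order_date, s.event_date)   AS event_date,
    o.order_amount,
    s.session_count
FROM fct_daily_orders AS o
FULL OUTER JOIN fct_daily_sessions AS s
    ON  s.customer_id = o.customer_id
    AND s.event_date  = o.order_date;
```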

Data Mesh Solution to the Problem

In order to solve the issue we need a table per fact. This way each team can create new facts based on its needs without having to synchronize its work with other teams. Once we have a collection of fact tables, we need a way to efficiently combine them all into one big table without joining all the tables (there might be tens of them). The solution is to create a union of all the fact tables: include all the columns, and any table that does not have a given column gets a null in its place. The union generates a very big table where each row holds the values from one fact table and nulls in all the other columns. To collapse these rows we group by the foreign key and the time column. You can see the summary of this in the following diagram.
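A rough sketch of that union-and-group-by step, again using the two hypothetical daily fact tables keyed by customer and day:

```sql
-- Union all the fact tables into one shape (missing metrics become NULL),
-- then collapse to a single row per foreign key and day.
WITH unioned AS (
    SELECT
        customer_id,
        order_date AS event_date,
        order_amount,
        NULL       AS session_count
    FROM fct_daily_orders

    UNION ALL

    SELECT
        customer_id,
        event_date,
        NULL AS order_amount,
        session_count
    FROM fct_daily_sessions
)
SELECT
    customer_id,
    event_date,
    MAX(order_amount)  AS order_amount,
    MAX(session_count) AS session_count
FROM unioned
GROUP BY
    customer_id,
    event_date;
```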

Objective:

Supply the ability to build multiple daily aggregate fact tables, each with its own lifecycle, and to efficiently merge all the aggregations into a single wide table.

Dimensions will be added at the report level (Looker, etc.).

End model

The wide table will be exported to Looker as an Explore, with joins to the dimension tables.

The end model of the profile schema in Looker will be the Star Schema.
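The end result is a plain star-schema query: the wide table at the center, joined to the dimensions at report time (a rough sketch with hypothetical names):

```sql
-- The wide profile table sits at the center of the star;
-- dimensions are joined in at report time.
SELECT
    d.country,
    w.event_date,
    SUM(w.order_amount)  AS order_amount,
    SUM(w.session_count) AS session_count
FROM profile_wide AS w
JOIN dim_customer AS d
    ON d.customer_id = w.customer_id
GROUP BY
    d.country,
    w.event_date;
```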

In a future blog, I will show how this can easily be achieved using dbt.
