Navigating the Data Landscape: A Guide to Developing a Data Platform Roadmap

Chaim Turkel
Published in Israeli Tech Radar
Mar 26, 2023 · 7 min read


Data has become one of the most valuable assets for businesses in the digital age. With the increasing amount of data being generated and stored, it’s crucial for organizations to have a clear plan for managing this information. This is where a Data Management Roadmap comes into play. A Data Management Roadmap is a strategic plan that outlines how an organization will collect, store, use, and secure its data over a specified period of time. This guide is designed to help programmers navigate the complex landscape of data management and develop a roadmap that will help them achieve their data management goals.

What is Data Management and What Should I Keep in Mind When Building a Data Platform?

A data platform's main building blocks are data ingestion, data transformation, and data access. Each block comes with its own challenges and a lot of technologies to learn. Building a data platform involves tasks not only for developers but also for DevOps and finance. You need all parts of the company on board for it to work, since the platform will affect every layer of the company.

So let's break the platform down into its layers, look at the different aspects of each one, and see what technologies and tasks are involved.

Data Ingestion

It’s important to identify the sources of data that need to be ingested into your data lake.

Internal data

The first go-to source is usually data generated internally by the company's different processes.

A standard company data source is an internal micro-service. Since each micro-service owns its data (following the 12-factor methodology), and we usually need reports that span multiple micro-services, we have to bring all the data to one place for analysis.

Each micro-service's data may live in a database (of various types: structured or unstructured), or it may flow through a message queue inside the service. All of this data needs to be synced to your data lake (perhaps using a CDC tool like Debezium).
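As a rough illustration of the landing step, here is a minimal Python sketch (not a production pipeline) that reads Debezium change events from a Kafka topic and appends them as raw JSON objects to an S3 data lake. The topic, broker, bucket, and prefix names are hypothetical, and a real pipeline would add batching, schema handling, and error recovery.

```python
import json
import uuid
from datetime import datetime, timezone

import boto3
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical names: replace with your own topic, brokers, and bucket.
consumer = KafkaConsumer(
    "dbserver1.inventory.orders",        # Debezium topic: <server>.<schema>.<table>
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
s3 = boto3.client("s3")

for message in consumer:
    event = message.value  # Debezium change event: before/after images, op, ts_ms
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"raw/orders/ingest_date={day}/{uuid.uuid4()}.json"
    # Land the event untouched in the lake; transformation happens in later layers.
    s3.put_object(Bucket="my-data-lake", Key=key, Body=json.dumps(event))
```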

External data

It is common for companies to require data not only from their internal sources but also from external services the company uses. These can be CRMs, external applications, and more.

Handling Ingestion

We can of course create our own data pipeline to ingest all this data, but today there are a lot of tools that do this for us.

A short list of these includes Upsolver, Keboola, Fivetran, Stitch, and Rivery, among others. Open-source options include Airbyte, Benthos, and more.

Source of Truth

Today, in addition to the data lake, we also have a data warehouse (BigQuery, Databricks, Snowflake…). In my opinion, a data warehouse is not a data lake: you should save all your data into the data lake (as immutable files) as the source of truth, and use the data warehouse for all the other layers. So make sure your ingestion tools know how to save to S3 as well as sync to your data warehouse.

The reason for keeping the source of truth in files is that a data warehouse is mutable storage by nature. That leads to situations where data has been changed and we can no longer get back to the original.
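One practical way to enforce that immutability, sketched below under the assumption that the lake lives on S3, is to enable versioning and (optionally) Object Lock on the bucket so raw files can never be silently overwritten or deleted. The bucket name and retention period are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; Object Lock can only be enabled at bucket creation.
# (Outside us-east-1 you also need a CreateBucketConfiguration with a LocationConstraint.)
s3.create_bucket(Bucket="my-data-lake", ObjectLockEnabledForBucket=True)

# Versioning keeps every version of every object, so an accidental
# overwrite never destroys the original raw data.
s3.put_bucket_versioning(
    Bucket="my-data-lake",
    VersioningConfiguration={"Status": "Enabled"},
)

# A default retention rule blocks deletes and overwrites for a grace period.
s3.put_object_lock_configuration(
    Bucket="my-data-lake",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "GOVERNANCE", "Days": 365}},
    },
)
```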

Data Lake Issues

There are some issues that you need to consider when building your data lake:

Bucket vs Folder

  • Buckets and objects are the primary resources
  • Amazon S3 has a flat structure instead of a hierarchy like you would see in a file system (see the sketch after this list)
  • Up to 100 buckets per account
  • Each bucket name must be unique across all AWS accounts in all the AWS Regions within a partition
  • Permissions, lifecycle rules, and similar settings are configured per bucket
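The flat-structure point is worth internalizing: "folders" in S3 are just key prefixes. Here is a minimal sketch of listing a prefix with boto3 (bucket and prefix names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# "Folders" are only key prefixes; a delimiter simulates a hierarchy.
response = s3.list_objects_v2(
    Bucket="my-data-lake",
    Prefix="raw/orders/",
    Delimiter="/",
)

# Objects sitting directly under the prefix.
for obj in response.get("Contents", []):
    print(obj["Key"])

# "Sub-folders" come back as common prefixes, not as real directories.
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```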

Data Storage Structure Considerations

  • Security (per bucket)
  • Data Transfer Rates
  • Partitioning
  • Make multiple copies of the data (unstructured -> staging -> db)
  • Retention policy (Compliance, cost)
  • Readable file formats
  • Merge small files (see the compaction sketch after this list)
  • Data governance and access control
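To illustrate the partitioning and small-file points, here is a minimal compaction sketch using pyarrow: it reads the many small files of one date partition and rewrites them as a single larger Parquet file. The paths, partition values, and sizes are hypothetical, and a real job would loop over partitions and handle credentials.

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical staging partition full of small files written by the ingestion jobs.
dataset = ds.dataset(
    "s3://my-data-lake/staging/orders/ingest_date=2023-03-26/",
    format="parquet",
)
table = dataset.to_table()

# Rewrite the partition as one larger, query-friendly Parquet file.
pq.write_table(
    table,
    "s3://my-data-lake/curated/orders/ingest_date=2023-03-26/part-0000.parquet",
    row_group_size=1_000_000,  # bigger row groups, fewer files to scan
)
```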

Data Modeling

Once your data is in your data warehouse, you need to set up your bronze, silver, and gold layers. Your data needs to be curated by creating standard naming conventions for your tables and columns, and columns might need to be cast to the correct types.
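As a rough sketch of that curation step, assuming hypothetical table and column names, the silver layer renames columns to a standard convention and casts them to the correct types:

```python
import pandas as pd

# Hypothetical bronze table: raw names and string types straight from ingestion.
bronze_orders = pd.DataFrame(
    {
        "OrderID": ["1", "2"],
        "orderTS": ["2023-03-01T10:00:00", "2023-03-02T11:30:00"],
        "Amount": ["19.99", "5.00"],
    }
)

# Silver layer: standard snake_case names and correctly cast types.
silver_orders = bronze_orders.rename(
    columns={"OrderID": "order_id", "orderTS": "ordered_at", "Amount": "amount_usd"}
).astype({"order_id": "int64", "amount_usd": "float64"})
silver_orders["ordered_at"] = pd.to_datetime(silver_orders["ordered_at"])
```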

The best tool for data modeling is dbt. To set up a dbt project for your data see DBT — Project Structure, DBT — Database Schemas.

Data Modeling Patterns

We will not go into all the different modeling patterns, but will mention a few so that you can start reading about them.

Star schema

From wikipedia:

The star schema is the simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts. The star schema consists of one or more fact tables referencing any number of dimension tables. The star schema is an important special case of the snowflake schema, and is more effective for handling simpler queries.
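As a small illustration of the pattern, a fact table holds keys and measures while the dimensions hold descriptive attributes, and a typical query joins them and aggregates. The tables below are hypothetical.

```python
import pandas as pd

# Hypothetical dimension tables: descriptive attributes keyed by surrogate keys.
dim_customer = pd.DataFrame({"customer_key": [1, 2], "customer_name": ["Acme", "Globex"]})
dim_date = pd.DataFrame({"date_key": [20230301, 20230302], "month": ["2023-03", "2023-03"]})

# Fact table: one row per sale, holding only keys and measures.
fact_sales = pd.DataFrame(
    {
        "customer_key": [1, 2, 1],
        "date_key": [20230301, 20230301, 20230302],
        "amount": [100.0, 250.0, 75.0],
    }
)

# A typical star-schema query: join facts to dimensions, then aggregate.
report = (
    fact_sales.merge(dim_customer, on="customer_key")
    .merge(dim_date, on="date_key")
    .groupby(["customer_name", "month"], as_index=False)["amount"]
    .sum()
)
print(report)
```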

Slowly Changing Dimension

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered one of the most critical ETL tasks in tracking the history of dimension records.

There are three types of SCDs, and you should choose the one that best suits your needs.

Type 1 SCDs — Overwriting

In a Type 1 SCD the new data overwrites the existing data. Thus the existing data is lost as it is not stored anywhere else. This is the default type of dimension you create. You do not need to specify any additional information to create a Type 1 SCD.

Type 2 SCDs — Creating another dimension record

A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record. Each record contains the effective time and expiration time to identify the time period between which the record was active.
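A minimal sketch of the Type 2 mechanics in pandas follows; the column names and the open-ended expiration convention are assumptions, and a warehouse implementation would typically use a MERGE statement or a dbt snapshot instead.

```python
import pandas as pd

OPEN_ENDED = pd.Timestamp("2262-01-01")  # sentinel meaning "still current"

# Current dimension table: one open record per customer.
dim_customer = pd.DataFrame(
    {
        "customer_id": [42],
        "address": ["old street 1"],
        "effective_from": [pd.Timestamp("2022-01-01")],
        "effective_to": [OPEN_ENDED],
        "is_current": [True],
    }
)

def apply_scd2(dim, customer_id, new_address, change_ts):
    """Close the current record and append a new record that becomes current."""
    current = (dim["customer_id"] == customer_id) & dim["is_current"]
    dim.loc[current, ["effective_to", "is_current"]] = [change_ts, False]
    new_row = {
        "customer_id": customer_id,
        "address": new_address,
        "effective_from": change_ts,
        "effective_to": OPEN_ENDED,
        "is_current": True,
    }
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim_customer = apply_scd2(dim_customer, 42, "new street 9", pd.Timestamp("2023-03-26"))
```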

Type 3 SCDs — Creating a current value field

A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.

Data Quality

Once you have brought your data into the warehouse, you need to validate the data itself. Validation consists of making sure that your data arrives on time, is complete and accurate, and is consistent with all other data inputs.

There are multiple tools out there to help you with your data quality.

You can start with AWS Deequ, use Great Expectations, or even use machine learning to check for discrepancies with tools like Monte Carlo.
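As a taste of what these checks look like, here is a minimal sketch using the classic (legacy) Great Expectations pandas API; newer versions of the library restructure this API, and the table and column names are hypothetical.

```python
import great_expectations as ge
import pandas as pd

# Hypothetical silver table loaded for validation.
silver_orders = pd.DataFrame(
    {"order_id": [1, 2, 3], "amount_usd": [19.99, 5.00, 42.50]}
)

# Wrap the frame so expect_* assertions can run against it.
dataset = ge.from_pandas(silver_orders)
dataset.expect_column_values_to_not_be_null("order_id")
dataset.expect_column_values_to_be_unique("order_id")
dataset.expect_column_values_to_be_between("amount_usd", min_value=0)

results = dataset.validate()
print(results.success)
```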

Data Analysis

Once you have all your data and its quality is validated, you need a tool that will help you create dashboards and analyze the data. There are multiple vendors that can help you with this.

Data Governance

A concern that needs to be addressed as soon as you begin is who can see what data. You need tools to limit sensitive data to those who need access to it. Some data warehouses support ACL functionality (like Snowflake and BigQuery) and some don't.
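As one concrete example of the kind of control you want, here is a hedged sketch of granting read access to a single BigQuery dataset with the google-cloud-bigquery client; the project, dataset, and email are hypothetical, and Snowflake would express the same idea with GRANT statements.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset holding sensitive data.
dataset = client.get_dataset("my-project.finance_gold")

# Append a read-only grant for one analyst instead of opening the whole warehouse.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```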

Data Catalogs

A Data Catalog is a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate the fitness of data for intended uses.

Data catalogs are very important and vary greatly in the feature sets they offer. Take into consideration that you have multiple sources of data and pipelines, and all of them should make their way into the catalog.

Summary

Here is a mind map of technologies that will be needed as part of your data platform journey.

References

Slides for the presentation can be found in Data Management Presentation.
