Knowledge Base

Data & Platform

Pipelines, vector stores, data governance, and privacy-first design.

51 results
Article ★ Featured

Data Warehouse Modelling — Star Schema and Dimensional Design

Facts, dimensions, slowly changing dimensions, and why modelling choices matter for query performance.

data warehouse star schema dimensional modeling SCD fact table
1 view Mar 30, 2026
Article ★ Featured

Apache Iceberg — The Open Table Format Explained

Snapshots, schema evolution, partition evolution, time travel, and compaction.

Apache Iceberg open table format time travel schema evolution ACID
1 view Mar 30, 2026
Article ★ Featured

PostgreSQL Performance Tuning Fundamentals

Indexing strategy, EXPLAIN ANALYZE, vacuum, and configuration settings that matter most.

PostgreSQL performance indexing EXPLAIN vacuum
1 view Mar 30, 2026
Article ★ Featured

Graph Databases — When to Use Neo4j Over Relational

Nodes, edges, Cypher queries, and use cases where graph beats SQL.

Neo4j graph database Cypher knowledge graph fraud detection
1 view Mar 30, 2026
Article ★ Featured

Introduction to Data Pipelines

What a data pipeline is, the core stages, and when to build vs buy.

data pipeline ETL ELT ingest transform
1 view Mar 30, 2026
Article ★ Featured

Privacy-First Data Design — PII Handling Patterns

Tokenisation, pseudonymisation, encryption at rest, and right-to-deletion workflows.

PII privacy GDPR pseudonymisation tokenisation
1 view Mar 30, 2026
Article ★ Featured

Apache Kafka — Core Concepts and When to Use It

Topics, partitions, consumer groups, and the use cases where Kafka excels.

Kafka streaming event log topics partitions
1 view Mar 30, 2026
Article ★ Featured

Data Governance — Principles and Practical Implementation

Ownership, cataloguing, lineage tracking, and access control at scale.

data governance data catalog lineage DataHub access control
1 view Mar 30, 2026
Article ★ Featured

Choosing a vector database: pgvector vs Pinecone vs Weaviate

A practical comparison across dimensions that matter for production RAG systems.

vector database pgvector Pinecone Weaviate embeddings
3 views Mar 30, 2026
Article ★ Featured

Building a Data Quality Framework

Dimensions of data quality, validation layers, and monitoring in production pipelines.

data quality Great Expectations dbt tests validation completeness
1 view Mar 30, 2026
Article

Getting Started with dbt (data build tool)

Models, tests, documentation, and the dbt workflow for transforming warehouse data.

dbt data build tool ELT SQL transformation
1 view Mar 30, 2026
Article

Change Data Capture (CDC) — Debezium and Log-Based CDC

How CDC works, why it beats polling, and how to implement it with Debezium.

CDC change data capture Debezium Kafka WAL
1 view Mar 30, 2026
Article

Data Mesh — Principles and Practical Implementation

Domain ownership, data products, self-serve infrastructure, and federated governance.

data mesh domain ownership data product self-serve governance
1 view Mar 30, 2026
Article

DuckDB — Blazing Fast Local Analytics

When to reach for DuckDB instead of Spark, and how to use it effectively.

DuckDB analytics local Parquet S3
1 view Mar 30, 2026
Article

Redis Caching Patterns for Production Applications

Cache-aside, write-through, TTL strategy, and cache invalidation approaches.

Redis caching cache-aside TTL invalidation
1 view Mar 30, 2026
Article

Apache Spark — Core Concepts and When to Use It

RDDs, DataFrames, Spark SQL, and the use cases where Spark is the right tool.

Spark Apache Spark DataFrames distributed compute Spark SQL
1 view Mar 30, 2026
Article

Implementing Data Lineage Tracking

Column-level lineage, tools, and why it is critical for debugging and compliance.

data lineage OpenLineage DataHub dbt column lineage
1 view Mar 30, 2026
Article

Data Lake vs Data Warehouse vs Lakehouse

Practical comparison of the three architectures and how to choose.

data lake data warehouse lakehouse Delta Lake Iceberg
1 view Mar 30, 2026
Article

Snowflake Best Practices for Cost and Performance

Virtual warehouses, clustering, query optimization, and controlling spend.

Snowflake cost optimization virtual warehouse clustering query tuning
2 views Mar 30, 2026
Article

Data Observability — Detecting Silent Pipeline Failures

Freshness, volume, distribution, schema, and lineage monitoring for data reliability.

data observability freshness volume distribution Monte Carlo
1 view Mar 30, 2026
Article

Secrets Management for Data Platforms

HashiCorp Vault, AWS Secrets Manager, and patterns for rotating credentials safely.

secrets management Vault AWS Secrets Manager credentials rotation
1 view Mar 30, 2026
Article

Elasticsearch Indexing Strategy and Performance

Mapping, sharding, bulk indexing, and query optimization for Elasticsearch.

Elasticsearch indexing mapping shards bulk
1 view Mar 30, 2026
Article

Implementing Data Retention Policies

Legal requirements, technical implementation, and automated deletion workflows.

data retention GDPR CCPA deletion compliance
1 view Mar 30, 2026
Article

Trino (formerly PrestoSQL) — Federated SQL Across Data Sources

Architecture, connectors, query federation, and performance tuning.

Trino Presto federated query SQL Iceberg
2 views Mar 30, 2026
Article

Delta Lake — ACID Transactions for Your Data Lake

Transaction log, upserts, schema enforcement, and time travel on S3.

Delta Lake ACID upsert MERGE time travel
1 view Mar 30, 2026
Article

Airflow Best Practices for Production Pipelines

Idempotency, backfilling, SLA misses, and common pitfalls to avoid.

Airflow best practices idempotency backfill SLA
1 view Mar 30, 2026
Article

Designing a Data Lake on AWS S3

Folder structure, naming conventions, lifecycle policies, and access patterns.

S3 data lake AWS partitioning lifecycle
1 view Mar 30, 2026
Article

Real-Time Analytics Architecture Patterns

Lambda, Kappa, HTAP, and choosing the right pattern for sub-second analytics.

real-time analytics ClickHouse Druid Flink HTAP
1 view Mar 30, 2026
Article

ETL vs ELT — Which Pattern Should You Use?

Understand the difference between Extract-Transform-Load and Extract-Load-Transform and when each fits.

ETL ELT data warehouse dbt Snowflake
1 view Mar 30, 2026
Article

Schema Registry and Avro for Kafka Data Contracts

Why schema management matters for streaming pipelines and how to implement it.

Avro Schema Registry Kafka data contracts schema evolution
1 view Mar 30, 2026
Article

Infrastructure as Code for Data Platforms with Terraform

Managing cloud data infrastructure reproducibly with Terraform.

Terraform IaC infrastructure as code AWS S3
1 view Mar 30, 2026
Article

Feature Stores — Bridging Data Engineering and ML

What a feature store is, online vs offline stores, and when to build vs buy.

feature store ML platform Feast training-serving skew online store
1 view Mar 30, 2026
Article

Migrating from MySQL to PostgreSQL

Schema translation, data migration, and common incompatibilities to address.

MySQL PostgreSQL migration pgloader schema translation
1 view Mar 30, 2026
Article

Event-Driven Data Architecture Patterns

Event sourcing, CQRS, outbox pattern, and when event-driven beats request/response.

event sourcing CQRS outbox pattern event-driven Kafka
1 view Mar 30, 2026
Article

Batch vs Streaming Pipelines — Choosing the Right Pattern

Lambda architecture, Kappa architecture, and practical guidance for choosing.

batch streaming Lambda architecture Kappa architecture Flink
1 view Mar 30, 2026
Article

Time-Series Databases — InfluxDB vs TimescaleDB vs ClickHouse

Comparing purpose-built and general-purpose solutions for time-series data.

time-series InfluxDB TimescaleDB ClickHouse metrics
1 view Mar 30, 2026
Article

Running Data Workloads on Kubernetes

Spark on K8s, Airflow on K8s, resource requests, and storage patterns.

Kubernetes K8s Spark Airflow KubernetesExecutor
1 view Mar 30, 2026
Article

MongoDB Schema Design Patterns

Embedding vs referencing, the subset pattern, and indexing strategy.

MongoDB schema design embedding referencing bucket pattern
1 view Mar 30, 2026
Article

Amazon Redshift — Architecture and Query Optimization

Distribution styles, sort keys, VACUUM, ANALYZE, and WLM tuning.

Redshift AWS distribution key sort key VACUUM
1 view Mar 30, 2026
Article

Monitoring and Alerting for Data Pipelines

What to monitor, SLIs/SLOs for data, and building effective alerting.

monitoring alerting SLI SLO Prometheus
1 view Mar 30, 2026
Article

Orchestrating Pipelines with Apache Airflow

DAGs, operators, scheduling, and production best practices for Airflow.

Airflow orchestration DAG scheduling pipeline
1 view Mar 30, 2026
Article

Parquet vs CSV — Why Columnar Storage Matters

How Parquet's columnar format reduces storage costs and speeds up analytical queries.

Parquet CSV columnar storage compression PyArrow
1 view Mar 30, 2026
Article

Vector Embeddings — How They Work and Where They Live

From text to vectors, similarity search, and choosing the right embedding model.

embeddings vector search ANN HNSW MTEB
1 view Mar 30, 2026
Article

Data Platform Cost Optimization Strategies

Reducing Snowflake, S3, Spark, and Kafka spend without sacrificing performance.

cost optimization Snowflake S3 Spark Kafka
1 view Mar 30, 2026
Article

Materialised Views — When and How to Use Them

Incremental refresh, use cases, and implementation across Postgres, Snowflake, and dbt.

materialised views incremental refresh PostgreSQL Snowflake dbt
1 view Mar 30, 2026
Article

Testing Strategy for Data Pipelines

Unit tests, integration tests, data contract tests, and regression testing for pipelines.

testing data pipeline dbt unit tests integration tests
1 view Mar 30, 2026
Article

Stream Processing with Apache Flink

Event time vs processing time, windows, stateful operators, and production deployment.

Flink stream processing event time watermarks windows
2 views Mar 30, 2026
Article

Building a Data Catalog with DataHub

Ingestion, metadata, search, and making your catalog actually useful.

DataHub data catalog metadata lineage discoverability
1 view Mar 30, 2026
Article

PostgreSQL Replication — Streaming, Logical, and Read Replicas

Set up read replicas, understand WAL, and choose between streaming and logical replication.

PostgreSQL replication streaming replication logical replication Patroni
1 view Mar 30, 2026
Article

Data Contracts — Formalising Agreements Between Producers and Consumers

Schema, SLAs, semantics, and how to enforce data contracts in practice.

data contracts ODCS schema SLA producer
1 view Mar 30, 2026
Article

BigQuery Cost and Performance Optimization

Partitioned tables, clustered tables, slot usage, and avoiding full scans.

BigQuery GCP partitioning clustering cost optimization
1 view Mar 30, 2026