Data & Platform
51 results
Choosing a vector database: pgvector vs Pinecone vs Weaviate
A practical comparison across the dimensions that matter for production RAG systems.
Data Governance — Principles and Practical Implementation
Ownership, cataloguing, lineage tracking, and access control at scale.
Privacy-First Data Design — PII Handling Patterns
Tokenisation, pseudonymisation, encryption at rest, and right-to-deletion workflows.
Graph Databases — When to Use Neo4j Over Relational
Nodes, edges, Cypher queries, and use cases where graph beats SQL.
Apache Iceberg — The Open Table Format Explained
Snapshots, schema evolution, partition evolution, time travel, and compaction.
Data Warehouse Modelling — Star Schema and Dimensional Design
Facts, dimensions, slowly changing dimensions, and why modelling choices matter for query performance.
Building a Data Quality Framework
Dimensions of data quality, validation layers, and monitoring in production pipelines.
Apache Kafka — Core Concepts and When to Use It
Topics, partitions, consumer groups, and the use cases where Kafka excels.
Introduction to Data Pipelines
What a data pipeline is, the core stages, and when to build vs buy.
PostgreSQL Performance Tuning Fundamentals
Indexing strategy, EXPLAIN ANALYZE, vacuum, and configuration settings that matter most.
Building a Data Catalog with DataHub
Ingestion, metadata, search, and making your catalog actually useful.
Airflow Best Practices for Production Pipelines
Idempotency, backfilling, SLA misses, and common pitfalls to avoid.
Orchestrating Pipelines with Apache Airflow
DAGs, operators, scheduling, and production best practices for Airflow.
Amazon Redshift — Architecture and Query Optimization
Distribution styles, sort keys, VACUUM, ANALYZE, and WLM tuning.
Data Contracts — Formalising Agreements Between Producers and Consumers
Schema, SLAs, semantics, and how to enforce data contracts in practice.
Data Lake vs Data Warehouse vs Lakehouse
A practical comparison of the three architectures and how to choose between them.
Implementing Data Lineage Tracking
Column-level lineage, tools, and why it is critical for debugging and compliance.
Data Mesh — Principles and Practical Implementation
Domain ownership, data products, self-serve infrastructure, and federated governance.
Data Observability — Detecting Silent Pipeline Failures
Freshness, volume, distribution, schema, and lineage monitoring for data reliability.
Testing Strategy for Data Pipelines
Unit tests, integration tests, data contract tests, and regression testing for pipelines.
Implementing Data Retention Policies
Legal requirements, technical implementation, and automated deletion workflows.
PostgreSQL Replication — Streaming, Logical, and Read Replicas
Set up read replicas, understand WAL, and choose between streaming and logical replication.
Getting Started with dbt (data build tool)
Models, tests, documentation, and the dbt workflow for transforming warehouse data.
ETL vs ELT — Which Pattern Should You Use?
Understand the difference between Extract-Transform-Load and Extract-Load-Transform and when each fits.
Event-Driven Data Architecture Patterns
Event sourcing, CQRS, outbox pattern, and when event-driven beats request/response.
Apache Spark — Core Concepts and When to Use It
RDDs, DataFrames, Spark SQL, and the use cases where Spark is the right tool.
Batch vs Streaming Pipelines — Choosing the Right Pattern
Lambda architecture, Kappa architecture, and practical guidance for choosing.
Change Data Capture (CDC) — Debezium and Log-Based CDC
How CDC works, why it beats polling, and how to implement it with Debezium.
Data Platform Cost Optimization Strategies
Reducing Snowflake, S3, Spark, and Kafka spend without sacrificing performance.
Delta Lake — ACID Transactions for Your Data Lake
Transaction log, upserts, schema enforcement, and time travel on S3.
DuckDB — Blazing Fast Local Analytics
When to reach for DuckDB instead of Spark, and how to use it effectively.
Elasticsearch Indexing Strategy and Performance
Mapping, sharding, bulk indexing, and query optimization for Elasticsearch.
BigQuery Cost and Performance Optimization
Partitioned tables, clustered tables, slot usage, and avoiding full scans.
Feature Stores — Bridging Data Engineering and ML
What a feature store is, online vs offline stores, and when to build vs buy.
Monitoring and Alerting for Data Pipelines
What to monitor, SLIs/SLOs for data, and building effective alerting.
Parquet vs CSV — Why Columnar Storage Matters
How Parquet's columnar format reduces storage costs and speeds up analytical queries.
Real-Time Analytics Architecture Patterns
Lambda, Kappa, HTAP, and choosing the right pattern for sub-second analytics.
Infrastructure as Code for Data Platforms with Terraform
Managing cloud data infrastructure reproducibly with Terraform.
Running Data Workloads on Kubernetes
Spark on K8s, Airflow on K8s, resource requests, and storage patterns.
Materialised Views — When and How to Use Them
Incremental refresh, use cases, and implementation across Postgres, Snowflake, and dbt.
Migrating from MySQL to PostgreSQL
Schema translation, data migration, and common incompatibilities to address.
Redis Caching Patterns for Production Applications
Cache-aside, write-through, TTL strategy, and cache invalidation approaches.
Designing a Data Lake on AWS S3
Folder structure, naming conventions, lifecycle policies, and access patterns.
Schema Registry and Avro for Kafka Data Contracts
Why schema management matters for streaming pipelines and how to implement it.
Secrets Management for Data Platforms
HashiCorp Vault, AWS Secrets Manager, and patterns for rotating credentials safely.
Time-Series Databases — InfluxDB vs TimescaleDB vs ClickHouse
Comparing purpose-built and general-purpose solutions for time-series data.
Vector Embeddings — How They Work and Where They Live
From text to vectors, similarity search, and choosing the right embedding model.
Snowflake Best Practices for Cost and Performance
Virtual warehouses, clustering, query optimization, and controlling spend.
Trino (formerly PrestoSQL) — Federated SQL Across Data Sources
Architecture, connectors, query federation, and performance tuning.
MongoDB Schema Design Patterns
Embedding vs referencing, the subset pattern, and indexing strategy.
Stream Processing with Apache Flink
Event time vs processing time, windows, stateful operators, and production deployment.