What a data catalog provides

  • Discoverability — analysts find datasets without asking engineers.
  • Context — what does this column mean? Who owns this table?
  • Lineage — where does this data come from and who uses it?
  • Trust — freshness indicators and quality scores help consumers decide if a dataset is reliable.

DataHub setup

DataHub (open-source, LinkedIn) deploys via docker-compose or Helm. Ingestion is configured with YAML recipes:

source:
  type: snowflake
  config:
    account_id: myaccount
    username: datahub_reader
    password: ${SNOWFLAKE_PASS}
    database_pattern:
      allow: ["ANALYTICS"]
sink:
  type: datahub-rest
  config:
    server: http://datahub-gms:8080

Making it stick

A catalog only works if people use it. Mandate it: all new datasets require catalog registration before going to production. Add a "find in catalog" step to the analyst onboarding checklist. Metrics: catalog coverage % and search-to-find rate.