What a data catalog provides
- Discoverability — analysts find datasets without asking engineers.
- Context — what does this column mean? Who owns this table?
- Lineage — where does this data come from and who uses it?
- Trust — freshness indicators and quality scores help consumers decide if a dataset is reliable.
DataHub setup
DataHub (open-source, LinkedIn) deploys via docker-compose or Helm. Ingestion is configured with YAML recipes:
source:
type: snowflake
config:
account_id: myaccount
username: datahub_reader
password: ${SNOWFLAKE_PASS}
database_pattern:
allow: ["ANALYTICS"]
sink:
type: datahub-rest
config:
server: http://datahub-gms:8080
Making it stick
A catalog only works if people use it. Mandate it: all new datasets require catalog registration before going to production. Add a "find in catalog" step to the analyst onboarding checklist. Metrics: catalog coverage % and search-to-find rate.