Folder structure

s3://my-data-lake/
  raw/          # immutable source data, partitioned by source and date
    source=salesforce/year=2024/month=03/
  processed/    # cleaned, deduplicated
  curated/      # modelled, analytics-ready (Parquet, partitioned)
  sandbox/      # analyst experiments, TTL 30 days

Partitioning strategy

Partition by the columns most commonly used in filters: date (year/month/day for high-volume datasets, year/month for lower-volume ones), then source or entity type. Hive-style partitioning, which encodes each partition as a key=value path segment (year=2024/month=03), is understood natively by Athena, Glue, and Spark.
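A minimal sketch of how a writer might build Hive-style prefixes for the raw zone, following the layout shown in the folder structure above. The function name raw_partition_prefix and the daily flag are hypothetical, introduced only for illustration.

```python
from datetime import date


def raw_partition_prefix(source: str, day: date, daily: bool = False) -> str:
    """Build a Hive-style key prefix for the raw zone.

    The key=value segments mirror the example layout
    raw/source=salesforce/year=2024/month=03/.
    """
    parts = [
        "raw",
        f"source={source}",
        f"year={day.year:04d}",
        f"month={day.month:02d}",  # zero-padded so prefixes sort lexically
    ]
    if daily:
        # Assumption: high-volume sources add a day-level partition.
        parts.append(f"day={day.day:02d}")
    return "/".join(parts) + "/"


monthly = raw_partition_prefix("salesforce", date(2024, 3, 15))
daily = raw_partition_prefix("salesforce", date(2024, 3, 5), daily=True)
```

Here monthly reproduces the example prefix from the folder structure, and daily appends a day=05 segment for sources that need the finer grain.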

Lifecycle policies

  • Raw zone: S3 Standard → Standard-IA after 30 days → Glacier after 1 year.
  • Sandbox: delete after 30 days.
  • Curated: Standard only, because it is the hot query path.
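The policies above could be expressed as an S3 lifecycle configuration along these lines. This is a sketch, not a definitive setup: the rule IDs are made up, and it assumes the zones map to the raw/ and sandbox/ prefixes of a single bucket. The curated zone gets no rule, so its objects stay in Standard.

```python
def lifecycle_configuration() -> dict:
    """Return a lifecycle configuration in the shape boto3 expects."""
    return {
        "Rules": [
            {
                "ID": "raw-tiering",  # hypothetical rule name
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            },
            {
                "ID": "sandbox-ttl",  # hypothetical rule name
                "Filter": {"Prefix": "sandbox/"},
                "Status": "Enabled",
                # Objects expire 30 days after creation.
                "Expiration": {"Days": 30},
            },
        ]
    }


config = lifecycle_configuration()

# To apply it (requires boto3 and AWS credentials; bucket name taken
# from the folder structure above):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=config
#   )
```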

Access control

Use IAM roles per workload, not per user. Enable S3 Block Public Access at the account level so that no bucket or object can be made public. Use S3 Access Points for fine-grained prefix-level permissions without complex bucket policies.
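As a rough illustration of the access-control points above, the sketch below builds an account-level Block Public Access setting and an access point policy that gives one workload role read access to a single prefix. The region, account ID, access point name, and role name are placeholders, and the helper access_point_read_policy is hypothetical.

```python
import json

# Account-level S3 Block Public Access: all four settings enabled.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def access_point_read_policy(
    region: str, account_id: str, access_point: str, role_name: str, prefix: str
) -> dict:
    """Allow one workload role to read and list a single prefix."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point}"
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"{ap_arn}/object/{prefix}/*",
            },
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": ap_arn,
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
        ],
    }


# Placeholder values, for illustration only.
policy = access_point_read_policy(
    "us-east-1", "123456789012", "curated-ap", "analytics-reader", "curated/"
)
policy_json = json.dumps(policy, indent=2)

# To apply (requires boto3 and AWS credentials):
#   s3control = boto3.client("s3control")
#   s3control.put_public_access_block(
#       AccountId="123456789012",
#       PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK)
#   s3control.put_access_point_policy(
#       AccountId="123456789012", Name="curated-ap", Policy=policy_json)
```

Keeping the prefix scoping in the access point policy means each workload gets its own small policy, and the bucket policy itself can stay minimal.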