Folder structure
s3://my-data-lake/
  raw/          # immutable source data, partitioned by source and date
    source=salesforce/year=2024/month=03/
  processed/    # cleaned, deduplicated
  curated/      # modelled, analytics-ready (Parquet, partitioned)
  sandbox/      # analyst experiments, TTL 30 days
Partitioning strategy
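Partition by the columns most commonly used in query filters: date (year/month/day for high-volume sources, year/month for lower-volume ones), then source or entity type. Hive-style partitioning (key=value path segments such as year=2024/month=03) is understood natively by Athena, Glue, and Spark, so these engines can skip partitions that a query's filters exclude.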
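A minimal sketch of how a writer job might build Hive-style prefixes. The function name partition_prefix and the daily flag are hypothetical, and the ordering (source before date) is an assumption taken from the folder example above.

```python
from datetime import date


def partition_prefix(zone: str, source: str, day: date, daily: bool = False) -> str:
    """Build a Hive-style key prefix, e.g. raw/source=salesforce/year=2024/month=03/."""
    parts = [
        zone,
        f"source={source}",
        f"year={day.year:04d}",
        f"month={day.month:02d}",  # zero-padded so prefixes sort lexically
    ]
    if daily:
        # High-volume sources get an extra day-level partition.
        parts.append(f"day={day.day:02d}")
    return "/".join(parts) + "/"


print(partition_prefix("raw", "salesforce", date(2024, 3, 15)))
# raw/source=salesforce/year=2024/month=03/
print(partition_prefix("raw", "salesforce", date(2024, 3, 15), daily=True))
# raw/source=salesforce/year=2024/month=03/day=15/
```

Zero-padding the month and day keeps the prefixes in chronological order when listed alphabetically.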
Lifecycle policies
- Raw zone: S3 Standard → Standard-IA after 30 days → Glacier after 1 year.
- Sandbox: delete after 30 days.
- Curated: Standard only, because it sits on the hot query path.
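The rules above could be expressed as a lifecycle configuration along these lines. This is a sketch, not a definitive setup: the rule IDs are made up, the zone prefixes are assumed to match the folder layout, and the dictionary follows the shape that boto3's put_bucket_lifecycle_configuration accepts. Curated has no rule, so its objects stay in Standard.

```python
import json


def lifecycle_configuration() -> dict:
    """Return lifecycle rules for the raw and sandbox zones (IDs are illustrative)."""
    return {
        "Rules": [
            {
                "ID": "raw-tiering",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            },
            {
                "ID": "sandbox-expiry",
                "Filter": {"Prefix": "sandbox/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            },
        ]
    }


config = lifecycle_configuration()
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, this could be applied with:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=config)
```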
Access control
Grant access through IAM roles per workload, not per user. Enable S3 Block Public Access at the account level. Use S3 Access Points for fine-grained, prefix-level permissions without complex bucket policies.
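As an illustration of per-workload scoping, the sketch below builds a read-only IAM policy document limited to one zone prefix, which could be attached to a workload's role. The function name read_only_prefix_policy and the Sid values are hypothetical; the bucket name comes from the folder example.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy allowing list and read access under one prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so restrict it by prefix condition.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads are restricted through the resource ARN itself.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


print(json.dumps(read_only_prefix_policy("my-data-lake", "curated"), indent=2))
```

An analytics workload role with this policy can query curated/ but cannot read raw/ or write anywhere.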