Designing a Data Lake on AWS S3

Folder structure, naming conventions, lifecycle policies, and access patterns.

Updated May 24, 2026

Folder structure

s3://my-data-lake/
  raw/          # immutable source data, partitioned by source and date
    source=salesforce/year=2024/month=03/
  processed/    # cleaned, deduplicated
  curated/      # modelled, analytics-ready (Parquet, partitioned)
  sandbox/      # analyst experiments, TTL 30 days

Partitioning strategy

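Partition by the columns most commonly used in query filters: date (year/month/day for high-volume data, year/month for lower-volume data), plus source or entity type. Hive-style partitioning (year=2024/month=03) is understood natively by Athena, Glue, and Spark, although Athena and Glue still need the partitions registered in the catalog (for example with a Glue crawler, MSCK REPAIR TABLE, or partition projection) before queries can prune on them.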
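As a minimal sketch of the Hive-style layout described above, the helper below builds a key prefix from a zone, a source, and a date. The function name partition_prefix and its granularity parameter are hypothetical, chosen only for illustration; the zone and source values come from the folder structure shown earlier.

```python
from datetime import date

ZONES = {"raw", "processed", "curated", "sandbox"}


def partition_prefix(zone: str, source: str, day: date, granularity: str = "month") -> str:
    """Build a Hive-style key prefix such as raw/source=x/year=2024/month=03/.

    granularity="day" adds a day= level, intended for high-volume sources.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    if granularity not in ("month", "day"):
        raise ValueError(f"unsupported granularity: {granularity!r}")

    # Zero-pad month and day so prefixes sort lexicographically by date.
    parts = [f"source={source}", f"year={day.year:04d}", f"month={day.month:02d}"]
    if granularity == "day":
        parts.append(f"day={day.day:02d}")
    return f"{zone}/" + "/".join(parts) + "/"


print(partition_prefix("raw", "salesforce", date(2024, 3, 15)))
# raw/source=salesforce/year=2024/month=03/
```

Zero-padding matters here: without it, month=10 would sort before month=2 in a plain prefix listing.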

Lifecycle policies

Raw zone: S3 Standard → Standard-IA after 30 days → Glacier after 1 year.
Sandbox: delete after 30 days.
Curated: S3 Standard only, because this is the hot query path.
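The three rules above could be expressed as a lifecycle configuration along these lines. This is a sketch, not a tested deployment: the rule IDs are made up for illustration, 1 year is approximated as 365 days, and the dictionary follows the shape accepted by boto3's put_bucket_lifecycle_configuration. The actual call is left as a comment so the snippet runs without AWS credentials.

```python
# Lifecycle rules keyed by zone prefix. Rule IDs are hypothetical names.
LIFECYCLE = {
    "Rules": [
        {
            "ID": "raw-tiering",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        },
        {
            "ID": "sandbox-expiry",
            "Filter": {"Prefix": "sandbox/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        },
        # No rule for curated/: objects there stay in S3 Standard.
    ]
}

# Applying it would look roughly like this (requires boto3 and credentials):
#
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake",
#       LifecycleConfiguration=LIFECYCLE,
#   )

for rule in LIFECYCLE["Rules"]:
    print(rule["ID"], rule["Filter"]["Prefix"])
```

Note that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so all rules need to be sent together.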

Access control

Grant access through IAM roles per workload, not per user. Enable S3 Block Public Access at the account level so that no bucket policy or ACL can expose data publicly. Use S3 Access Points for fine-grained, prefix-level permissions instead of one large, complex bucket policy.
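To make the per-workload idea concrete, here is a sketch of a read-only IAM policy scoped to a single prefix, together with the four Block Public Access settings. The helper name prefix_read_policy is hypothetical, and the policy is one plausible shape rather than a complete production policy (it omits KMS permissions, for example).

```python
import json


def prefix_read_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document allowing read-only access to one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so restrict it by prefix condition.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# All four settings enabled; this is the payload for an account-level
# call such as s3control put_public_access_block (not executed here).
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

print(json.dumps(prefix_read_policy("my-data-lake", "curated"), indent=2))
```

A policy like this would be attached to the role of a workload that only queries the curated zone; a separate role with write access to raw/ would serve ingestion jobs.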
