dbt Data Catalog¶
- Medallion Architecture
RAW → Staging → Intermediate → Marts — every transformation fully documented and tested.
- 22 Models
9 staging views · 4 intermediate tables · 1 fact table · 5 dimensions · 3 seeds
- 282 Automated Tests
not_null · unique · relationships · accepted_values · singular business-rule assertions
- 8 Source Tables
1.55M raw rows ingested from Azure Blob Storage via Snowflake COPY INTO
Model Inventory¶
Light-touch cleaning only — 1:1 with RAW tables, no business logic applied.
| Model | Source Table | Key Transforms |
|---|---|---|
stg_olist__orders |
raw_orders |
UPPER status, cast timestamps |
stg_olist__order_items |
raw_order_items |
Rename columns, cast decimals |
stg_olist__customers |
raw_customers |
Standardise state codes |
stg_olist__products |
raw_products |
Cast weight/dimensions |
stg_olist__sellers |
raw_sellers |
Standardise state codes |
stg_olist__payments |
raw_payments |
Cast amounts, validate types |
stg_olist__reviews |
raw_reviews |
Deduplicate, cast timestamps |
stg_olist__geolocation |
raw_geolocation |
Cast lat/long coordinates |
stg_olist__category_translations |
raw_category_names |
EN/PT name mapping |
Materialization: view — zero storage cost · Schema: STAGING
Business logic, joins, and aggregations — reusable building blocks for marts.
| Model | Purpose |
|---|---|
int_sales__order_items_joined |
Joins orders + items; adds DQ flags (is_valid, is_delayed) |
int_products__enriched |
Adds is_verified flag, fills missing category translations |
int_customers__prep |
Computes is_repeat_customer (≥ 2 delivered orders) |
int_order_reviews |
Deduplicates reviews to one row per order via MIN(review_score) |
Materialization: table · Schema: INTERMEDIATE
Star schema (Kimball) — production-ready for Power BI and SQL consumers.
Fact Table
| Model | Grain | Rows | Materialization |
|---|---|---|---|
fct_order_items |
One row per delivered order line item | ~112K | incremental |
Dimension Tables
| Model | Description | Materialization |
|---|---|---|
dim_customers |
Customer attributes + is_repeat_customer flag |
table |
dim_products |
Product catalog + English category translations | table |
dim_sellers |
Seller attributes + state geography | table |
dim_date |
Full calendar spine (2016–2018) with fiscal periods | table |
dim_rls_bridge |
Row-level security bridge — state → role mapping | table |
Schema: MARTS · Database: OLIST_ANALYTICS_DB
Pre-joined, pre-filtered tables purpose-built for Python EDA and data science workflows — avoids complex Pandas merges downstream.
| Model | Grain | Columns | Tests | Purpose |
|---|---|---|---|---|
obt_logistics_diagnostics |
One row per delivered order item | 18 | 35 | Root-cause logistics EDA — delivery delays, freight costs, review correlations |
Phase 2 — Python EDA
This layer is consumed by the ecommerce-logistics-diagnostics Python notebook project, which investigates the Amazonas 66.7% delivery delay anomaly using Pandera data contracts and cohort analysis.
Materialization: table · Schema: MARTS · Database: OLIST_ANALYTICS_DB
Key Models to Explore¶
-
fct_order_items
Core fact table. Grain: one row per delivered order line item. Pre-calculated measures: order_total, freight_value, is_delayed, days_to_deliver. Joins to all 5 dimensions.
-
dim_customers
Customer dimension with customer_unique_id for cross-order identity resolution. Includes is_repeat_customer flag (≥ 2 delivered orders).
-
dim_products
Product catalog enriched with English category translations. Includes is_verified quality flag for products with complete dimensions and photos.
-
dim_date
Full calendar spine covering 2016–2018. Includes day-of-week, week, month, quarter, and year columns for Power BI time intelligence.
Navigating the Lineage DAG
In the catalog, click any model name → then click "See Lineage" to see its full upstream/downstream dependency graph. Use +model_name syntax in the search to select a model with all its parents highlighted.
Phase 2 — Analytics Layer¶
obt_logistics_diagnostics is a One Big Table purpose-built as the single data source for the Python EDA project. It pre-joins 7 upstream models so no Pandas merges are required in the notebook.
| Dimension | Detail |
|---|---|
| Grain | One row per delivered order item (order_id + order_item_id) |
| Columns | 18 — identifiers, timestamps, financials, geography, product specs, review score, DQ flags |
| Tests | 35 — generic + 3 singular business-rule assertions |
| Upstream deps | fct_order_items · stg_olist__orders · stg_olist__order_items · int_products__enriched · dim_customers · dim_sellers · int_order_reviews |
| Key rule | WHERE order_status = 'DELIVERED' — only logistically clean records |
| DQ flags | is_valid_logistics + is_valid_product — must filter = 1 before computing transit time stats |