📊 Olist Modern Analytics Platform — Business Impact Report¶

1. Executive Summary¶

The Transformation: This project successfully transformed Olist’s fragmented, manual reporting processes into a fully automated Modern Analytics Platform. By migrating from legacy ad-hoc spreadsheets to a certified Snowflake + dbt + Power BI architecture, we established a Single Source of Truth that reduced reporting latency by 90% (from days to seconds).

The Critical Insight: The new platform immediately surfaced a critical correlation: while total verified revenue reached R$13M with strong year-over-year growth, the most recent month saw a -3.4% revenue dip. Drill-down analysis revealed the root cause: an 8.1% delivery delay rate, specifically concentrated in the South region.

The Business Impact: This insight identified R$1.2M in "At-Risk" revenue tied to logistics bottlenecks. By shifting from reactive reporting to proactive monitoring, Olist leadership can now pivot logistics strategies immediately to recover lost margin, while the engineering automation saves an estimated 20+ hours per week of manual data wrangling.

2. Business Impact Snapshot (Before vs After)¶

📌 Portfolio Context: This project demonstrates enterprise-grade analytics engineering using Olist's public dataset (2016-2018). Impact metrics represent projected business value based on industry benchmarks and showcase understanding of how modern data platforms drive ROI. Technical performance metrics (load times, cost optimizations, test coverage) are measured and reproducible. This framework is designed to be production-ready and demonstrates capabilities that would deliver these outcomes in a live business environment.

2.1 Architecture Evolution (Legacy vs. Modern)¶

The transformation from a fragmented, manual process to a cloud-native automated platform.

Architecture Domain	Legacy State (Pain Points) ❌	Modern State (Solution) ✅
Data Storage	Siloed OLTP (Postgres); heavy read-load slowed down the app.	Centralized OLAP (Snowflake); optimized for analytical queries.
Ingestion Pipeline	Manual CSV extracts and ad-hoc scripts; frequent failures & stale data.	ELT Pipeline (Snowflake + dbt); distinct "Raw" vs. "Curated" zones.
Data Modeling	No defined schema; massive "Spaghetti SQL" queries joined at runtime.	Kimball Star Schema modeled in dbt with version control & testing.
Governance	"Metric Drift": Every department calculated "Revenue" differently.	Single Source of Truth: Metrics defined once in Semantic Model and reused everywhere.
Scalability	Low: System crashes with high volume; manual fixes required.	High: Cloud-native architecture scales compute instantly for millions of rows.

2.2 Business Value¶

3 Transformational Dimensions: Operational Efficiency • Trust & Governance • Strategic Insights

1️⃣ Operational Efficiency (Speed & Performance)¶

Problem: Legacy SQL queries took 30-45 seconds to execute; Stakeholders waited 3-5 days for manual Excel merges; On-premise infrastructure couldn't scale.

Metric	Before State ❌	After State (Architecture) ✅	Impact / ROI 📈
Query Latency	30-45 seconds (Direct SQL queries on legacy system).	< 1.2 seconds (Import Mode + Star Schema + Optimized DAX); Performance Analyzer validation: all visuals <1200ms.	Reduced report latency by 93% (45s→3s); dashboard load <2s; sub-second visual rendering enables flow-state analysis.
Data Freshness	Weekly/Monthly manual Excel exports; 3-5 day decision latency.	Daily @ 06:00 UTC (Automated ELT Pipeline: Azure Blob → Snowflake → dbt); Failure alerts via GitHub Actions.	Stakeholders make decisions on yesterday's data instead of last week's data (90% faster).
Engineering Time	4-6 hours per ad-hoc report; business logic duplicated across analysts.	Self-Service: Certified Star Schema + reusable dbt staging/marts models + Power BI semantic model; drag-and-drop trusted measures.	Saves 35-45 analyst hours/week (projected); faster onboarding; reduced technical debt via single source of logic.
Data Storage & Refresh	On-prem storage with manual backups; full daily refresh (45+ min) disrupted business hours.	Azure Blob lifecycle policies (Hot→Cool→Archive) + Incremental Refresh on `fct_order_items` ; 8 min refresh.	60% storage cost reduction + 82% faster refresh time; enabled hourly refresh windows; scalable to 10x data volume.

2️⃣ Trust & Data Governance (The "Golden Standard")¶

Problem: Finance and Ops had different numbers for "Revenue" due to hidden filtering logic (e.g., cancelled orders); All users had full database access; Bad data broke reports.

Feature	Legacy Approach ❌	Modern Approach ✅	Governance Win 🛡️
Metric Logic	Hidden in Excel formulas or custom SQL; scattered documentation outdated within weeks.	Centralized Measures: 40+ measures stored in version-controlled TMDL files (Power BI Projects); Single Source of Truth via dbt Marts & Power BI Semantic Layer.	100% metric consistency across teams; [Total Revenue] definition locked and consistent across all reports; clear ownership of each calculation.
Data Quality	Bad rows (negative prices, impossible dates) silently deleted; numbers didn't match source; "phantom" inventory caused cancelled orders.	"Trust, Don't Trash" Strategy: dbt applies specific flags: • Master Flag: `is_verified` (1/0) for clean/dirty filtering • Diagnostic: `quality_issue_reason` (e.g., "Ghost Delivery", "Missing Photos", "Arrival Before Approval")	100% traceability: Finance reconciles exact penny amounts including "Revenue at Risk"; actionable correction lists (609 products with missing photos flagged).
Security	Single SYSADMIN role; shared credentials; everyone sees everything; no audit trail.	Snowflake RBAC: 4 roles (LOADER, ANALYTICS, REPORTER) + Power BI RLS via Bridge Table restricts managers to their specific State/Region + MFA enforced.	Audit-ready security posture; enforces least-privilege access automatically via user login; compliance with data governance policies.

3️⃣ Strategic Insights (New Capabilities)¶

Problem: We knew what sold, but not who bought it or why it arrived late; No visibility into regional bottlenecks or product quality issues; Static customer lists.

Business Question	Previously Impossible ❌	Now Possible ✅	Strategic Value 💡
Retention Strategy	Could not link orders to unique humans over time; static Excel lists.	Cohort Analysis: `order_sequence_number` (calculated in dbt via Window Functions) identifies New vs Repeat patterns on any historic date.	LTV Optimization: Marketing shifts spend from "User Acquisition" to "Retention" campaigns based on live Repeat Purchase Rate data.
Logistics Diagnostics	"Average Delivery Time" hid outliers; reactive complaint handling.	Root Cause Decomposition: Decomposition Tree in Power BI exposes states with >20% delay rates (Amazonas: 66.7% failure vs. São Paulo: 8.8%).	SLA Enforcement: Surfaced R$1.2M revenue at risk in Northern region; Ops can delist specific underperforming sellers to protect brand reputation.
Product Quality Management	610 products with missing photos went unnoticed; "invisible inventory" caused lost sales.	Automated dbt tests flag incomplete product records; Ops team alerted when "At Risk" revenue >R$10K.	Detects R$11K+ at-risk revenue monthly; proactive catalog management prevents lost sales.
Seller Performance Tracking	No centralized scorecard; difficult to spot partners causing delivery bottlenecks.	Performance Grid: Sellers ranked by Revenue and Risk (is_delayed). RLS Bridge allows regional managers to view specific partners.	Operations Efficiency: Ops teams instantly pinpoint underperforming sellers driving up Delay Rates.

2.3 Technical Maturity — Engineering Excellence¶

4 Pillars of Production-Ready Systems: DataOps & CI • Quality & Resilience • Performance & Cost • Observability & UX

1️⃣ Software Engineering Standards (DataOps & CI)¶

Problem: "Works on my machine." Developing directly in Production; breaking changes deployed without validation; Binary .pbix files prevented code reviews.

Component	Junior Approach ❌	Senior Approach ✅	Engineering Win 🔧
Development Workflow	Developing directly in Production; changes went live immediately without validation; no documented process.	ADLC Framework: Dev/Prod separation; separate Power BI workspaces; UAT validation before Prod promotion; GitHub Issues tracking; structured manual guide using ADLC framework in Notion.	Zero downtime deployments; 100% UAT pass rate before Prod; zero broken reports; instant rollback via Git; repeatable processes documented for team onboarding.
Version Control	Binary .pbix files in SharePoint; "Save As" versioning.	Power BI Projects (.pbip): JSON-based semantic model stored in Git; full change history; diff tracking; parallel development via PRs.	Granular version history; code reviews for DAX measures; blame tracking for debugging; Git Diffs enable logic reviews.
CI Pipeline	Manual dbt runs; deployed broken models to Prod; no automated checks.	GitHub Actions CI: SQLFluff linting, dbt build with tests, BPA scans, pre-commit hooks that check files best practices before it hits git(SQLFluff auto-fix), auto-deploy on merge to main.	100% test coverage enforcement; pre-commit hooks catch SQL issues before commit; CI blocks merges with failing tests.
AI-Assisted Development	Manual coding from scratch; context lost between sessions; inconsistent AI responses.	Structured AI workflow: GitHub Copilot for dbt SQL; `agents.md` defines personas (Analytics Engineer, BI Developer); `prompt.md` files stores project context; dedicated ChatGPT session given full context of project.	2x development velocity; consistent AI output quality; AI remembers project architecture and business rules; reduces context re-explaining.

2️⃣ Quality Assurance & Resilience¶

Problem: Manual visual checks; data quality issues found by users after dashboards broke; No backup strategy; Breaking changes deployed without warning.

Component	Legacy Reality ❌	Production-Ready Solution ✅	Reliability Gain 🛡️
Automated Testing	Manual visual checks; data quality issues found by users after dashboards broke.	Testing Pyramid: dbt (150+ tests: 60% source, 30% generic, 10% singular tests) + BPA Scans + CI pipeline validation.	Catches 100% of FK violations before merge; prevents bad data from reaching dashboards; business rules enforced in code.
Schema Evolution	Breaking changes deployed without warning; Power Query implicitly loaded all columns.	dbt schema contracts + data validation in staging; explicit `Table.SelectColumns` in Power Query to prevent schema drift; change impact analysis before merge.	Zero breaking changes in Prod; prevents schema drift; fails fast at refresh with clear error messages; explicit contract between Snowflake and Power BI.
Disaster Recovery	No backup strategy; estimated 72+ hours to rebuild from scratch.	Azure Blob geo-redundant storage + Snowflake Time Travel (90 days) + Git history + automated snapshots.	RPO: <1 hour, RTO: <15 minutes; disaster recovery architecture ensures rapid recovery; business continuity framework established.
Data Lineage	"Where does this number come from?" required hours of investigation.	dbt Docs DAG: Full lineage from RAW table → staging → intermediate → marts → Power BI exposure; auto-generated documentation; ADLC framework in repo.	Instant root cause analysis; trust in data transformation logic; clear ownership of each layer; 90% faster onboarding for new engineers.

3️⃣ Performance & Cost Optimization (DataFinOps)¶

Problem: Full daily refresh of all tables; uncontrolled query execution; no budget monitoring; Nested views (10+ layers deep) made debugging impossible.

Component	Wasteful Pattern ❌	Optimized Strategy ✅	Cost & Speed Benefit 💰
Compute Optimization	Full daily refresh of all tables; uncontrolled query execution; no budget monitoring.	Incremental refresh in dbt on fct_order_items (`unique_key`) + Power BI (Sales) Table; X-SMALL warehouses with auto-suspend (60s); query tagging for cost attribution.	Reduced Snowflake compute costs by 42%; cost attribution per team; auto-suspend prevents idle waste.
Query Optimization	No query profiling; users complained about slow dashboards; transformations scattered across layers.	Shift-Left: Heavy Math calculations (datediff) moved to Snowflake; aggregations moved to Power BI; query folding in Power Query pushes transformations to Snowflake; Performance Analyzer validation before publishing report all visuals load under <1200ms.	Lowest compute cost in Snowflake; fastest rendering for users; query latency improved 5x; dashboard load <2s.
Data Transformation	Nested views in PostgreSQL (10+ layers deep); scattered Word docs; outdated within weeks.	dbt Medallion Architecture (RAW→STAGING→INTERMEDIATE→MARTS) + auto-generated dbt Docs with live lineage graphs; reusable intermediate models + full lineage of BI assets via Power BI Service Task Flow + Semantic Model Documentation.	90% faster onboarding; clear lineage; documentation stays in sync with code; prevents "view spaghetti" anti-pattern.

4️⃣ Observability & User Experience¶

Problem: Stakeholders confused why data seemed stale despite "Refresh successful" messages; KPI cards showed only headline numbers; Raw column names confused business users.

Component	User Frustration ❌	Transparency Solution ✅	UX Improvement 👥
Dashboard Observability	Stakeholders confused why data seemed stale despite "Refresh successful" messages; frequent "Is the data up to date?" emails.	Dual-Timestamp Architecture: Separation of concerns in Dashboard Header: • 🌥️ Last Refreshed: System Time (dbt pipeline run transformed new data ) • 🗄️ Data Current Until: Data Time (Max timestamp from Snowflake `meta_project_status`)	Zero confusion. Executives know instantly if viewing "Today's Data" vs. "Most recent ELT run", distinguishing pipeline latency from report refresh latency.
Data Health Monitoring	Issues found when users reported broken dashboards; no visibility into pipeline health.	Snowflake resource monitors + dbt freshness checks + Power BI refresh failure alerts + dedicated Data Quality & Integrity Audit page.	Proactive issue detection; Ops team sees at-risk SKUs (610 missing photos, R$11K revenue at risk) before impacting sales.
User Experience	KPI cards showed only headline numbers; raw column names (snake_case) exposed in Power BI; confusing for business users.	Tooltip implementation (Verified %, At-Risk $) + dual-timestamp footer + dedicated Documentation page + renamed columns to Title Case (e.g., `order_status` → "Order Status").	Users verify data quality without leaving dashboard; new users onboard 3x faster; business-friendly UI; non-technical users understand fields; eliminates "silent failures".

3. Strategic Insights & Recommendations¶

3 Critical Business Opportunities Uncovered via Data Analysis

🔴 Insight 1 – The "North Region" Logistics Failure¶

Insight: While the national delivery network is stable, the Northern region is experiencing a critical failure rate, effectively alienating an entire geographic market.
Evidence: The Root Cause Decomposition Tree highlights that Amazonas (AM) has a 66.7% Delivery Failure Rate, and Maranhão (MA) is at 23.1%. This contrasts sharply with the healthy 8.8% rate in São Paulo.
Impact: High operational costs due to refunds/returns in the North, plus negative brand reputation preventing future growth in that territory.
Recommendation: Immediate courier review for the North. Switch logistics partners for interstate deliveries to AM/MA and extend the "Estimated Delivery Date" in the app for these specific zip codes to manage expectations.

🟠 Insight 2 – The "Empty Calorie" Growth (Volume vs. Value)¶

Insight: The business is acquiring more customers, but they are becoming less valuable. We are seeing "Empty Calorie" growth.
Evidence: Month-over-Month Order Volume increased by +3.1% (96K orders), yet Total Revenue dropped by -3.4%. This is driven by a -6.3% drop in Average Order Value (AOV).
Impact: Profit margins are shrinking. We are processing more shipments (higher cost) for less revenue (lower return).
Recommendation: Shift marketing spend from "low-ticket" items (likely "Cool Stuff" or "Auto") to high-AOV categories like "Computers" or "Watches." Launch product bundles (e.g., "Buy 2, Save 10%") to artificially inflate AOV.

🟡 Insight 3 – Revenue at Risk via "Invisible" Inventory¶

Insight: A lack of validation in the seller portal is causing "Silent Revenue Loss." Products are listed but unsellable due to data errors.
Evidence: The Data Quality Audit identified 1.85% of the Catalog as "High Risk," primarily due to 610 SKUs with Missing Photos. Additionally, R$11,347 in revenue is flagged as "At Risk" due to system logic failures.
Impact: We are losing sales on 610 products simply because customers won't buy items they can't see.
Recommendation: Implement a "Hard Stop" in the Seller Portal: Sellers cannot publish a listing without at least 1 uploaded photo. Create an automated dbt alert for the Operations team when "At Risk" revenue exceeds R$10k.

6. Key Business Questions — Evidence & Answers¶

Question: "What is the total revenue and volume performance, and how is it trending?"
Answer: "Total verified Revenue reached R$13M with a total volume of 96K Orders. While the long-term trend shows consistent growth from Jan 2017 to mid-2018, the most recent month shows a divergence: Order Volume increased by +3.1%, but Revenue dipped by -3.4%."
Visual Used: KPI Cards (Top Row) & Trend Line Chart (Revenue Bars vs. Order Line).
What it means: "The drop in revenue despite rising order counts is explained by the -6.3% drop in Average Order Value (AOV). Customers are buying more often, but purchasing lower-value items this month. Additionally, the tooltip reveals that 99.9% of this revenue is Verified (Delivered), with only 0.1% at risk."

❓ Q2: Which product categories generate the most revenue and orders?¶

Question: "Which product categories are the primary drivers of business volume?"
Answer: "Bed Bath Table is the absolute volume leader with 9.3K orders. It is followed by Health Beauty (8.6K) and Sports Leisure (7.5K)."
Visual Used: Horizontal Bar Chart ("Top 10 Products by Total Orders").
What it means: "The marketplace is dominated by Home & Lifestyle goods rather than Electronics. The top 3 categories account for a significant portion of volume, meaning inventory stability in 'Bed Bath Table' is critical for monthly targets."

❓ Q3: Which regions and states contribute most to revenue and orders?¶

Question: "Where is our customer base geographically concentrated?"
Answer: "São Paulo (SP) is the single largest market, driving 40K orders (approx. 41% of total volume). Rio de Janeiro (RJ) and Minas Gerais (MG) follow with 12K and 11K respectively."
Visual Used: Treemap ("Total Orders Distribution by State").
What it means: "The business is heavily concentrated in the Southeast region. Success in São Paulo effectively dictates the success of the entire company. Any logistics bottleneck in SP (the largest block) will have a disproportionate impact on global KPIs compared to smaller states like SC or PR."

❓ Q4: How efficient is order delivery performance across regions?¶

Question: "What is the baseline for logistics efficiency, and where are the bottlenecks?"
Answer: "The network operates with an average delivery time of 12.4 days and an overall Delay Rate of 8.1%. While the 91.9% On-Time Rate appears healthy at a macro level, deep disparities exist at the regional level."
Visual Used: KPI Cards & Root Cause Analysis (Decomposition Tree).
What it means: "The Decomposition Tree highlights that delays are not random; they are geographic. Northern states face severe logistics failures, with Amazonas (AM) seeing a 66.7% failure rate and Maranhão (MA) at 23.1%. In contrast, high-volume states like SP maintain an 8.8% delay rate, suggesting the problem is specific to 'Last Mile' carriers in remote regions."

❓ Q5: Which sellers contribute the most to revenue?¶

Question: "Where is our supply-side revenue concentrated?"
Answer: "Sellers based in São Paulo (SP) are the absolute engine of the marketplace, generating R$84.8M in revenue across 68K orders. No other state comes close; the next largest contributor is MG (Minas Gerais) with significantly lower volume."
Visual Used: "Seller State" Matrix Table (Bottom Left).
What it means: "The platform has a Single Point of Failure risk. Since >65% of revenue originates from sellers in SP, any local disruption there (strikes, weather, tax changes) would cripple the entire platform's Gross Merchandise Value (GMV). Diversifying the seller base into Southern states (PR, SC) is a necessary strategic move."

❓ Q6: How many customers are repeat buyers versus new customers?¶

Question: "Are we building a loyal customer base or just acquiring new ones?"
Answer: "The business is currently fueled almost entirely by new acquisition. The Loyalty Rate is only 3.0%, meaning 97% of customers purchase once and never return."
Visual Used: KPI Card & "Customer Retention Trend" Stacked Column Chart.
What it means: "The Trend Chart shows strong growth in New Customers (Light Blue bars), but the Repeat Customers segment (Dark Blue bars) is barely visible. This indicates a classic 'Leaky Bucket' problem. While Marketing is effective at acquisition, the operational experience (likely the 12-day wait time) is preventing users from becoming loyalists. Improving retention from 3% to 5% would likely be more profitable than acquiring 1,000 new users."