Home/ Part XII — Building Real Products (End-to-End Projects)/36. Project 5: Data-to-Insights Analyst Tool

36. Project 5: Data-to-Insights Analyst Tool

Overview and links for this section of the guide.

The Vision

Business users have data (Excel, CSV, databases) but don't know Python or SQL. They want to ask questions in plain English:

"Show me sales by region for Q3"

"Which product had the highest growth last month?"

"Create a chart comparing revenue vs. costs over time"

We build a tool where:

  1. User uploads a CSV (or connects to a database)
  2. User asks a question in plain English
  3. AI writes a Python script (using Pandas/Matplotlib) to answer it
  4. System executes the script in a sandbox
  5. User sees the chart, table, and optionally downloads the analysis

How It Works

┌─────────────────────────────────────────────────────────────────┐
│                    DATA ANALYST PIPELINE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐     │
│  │  UPLOAD  │──▶│ INFER    │──▶│ GENERATE │──▶│ EXECUTE  │     │
│  │  DATA    │   │ SCHEMA   │   │   CODE   │   │ SANDBOX  │     │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘     │
│       │               │               │               │         │
│       ▼               ▼               ▼               ▼         │
│   CSV/Excel       Column         Python code      Charts,       │
│   files           types, names   for analysis     tables,       │
│                   samples                          insights     │
│                                                                  │
│                                                                  │
│                      ┌──────────┐                               │
│                      │ NOTEBOOK │                               │
│                      │  EXPORT  │                               │
│                      └──────────┘                               │
│                           │                                      │
│                           ▼                                      │
│                    Reproducible                                  │
│                    .ipynb file                                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Architecture

The system has three critical components:

Component Purpose Implementation
Schema Analyzer Understand the data structure Pandas dtype inference + LLM enrichment
Code Generator Translate questions to Python Gemini with structured prompting
Sandbox Executor Run untrusted code safely Docker container or WebAssembly

Security: The Main Challenge

This is the most dangerous project because the AI writes code that we execute. Without proper safeguards:

  • Code could read/write arbitrary files
  • Code could make network requests (exfiltrate data)
  • Code could consume infinite resources (denial of service)
  • Code could contain malicious payloads

Security layers:

┌─────────────────────────────────────────────────────────────────┐
│                     SECURITY LAYERS                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. CODE VALIDATION (before execution)                          │
│     ├─ Allowlist of imports (pandas, numpy, matplotlib)         │
│     ├─ Block dangerous functions (exec, eval, open, os.*)       │
│     └─ AST analysis for suspicious patterns                     │
│                                                                  │
│  2. SANDBOX ISOLATION (during execution)                        │
│     ├─ No network access                                        │
│     ├─ No filesystem access (except /tmp/work)                  │
│     ├─ Memory limit (e.g., 512MB)                               │
│     └─ CPU time limit (e.g., 30 seconds)                        │
│                                                                  │
│  3. OUTPUT VALIDATION (after execution)                         │
│     ├─ Maximum output size                                       │
│     ├─ File type validation (only PNG/CSV/JSON)                 │
│     └─ Scan for sensitive data in output                        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
Never Skip the Sandbox

Even with code validation, sophisticated attacks can bypass static analysis. The sandbox is your last line of defense. Use Docker with network isolation, or better yet, a WebAssembly runtime like Pyodide.

Technology Stack

Component Technology Why
LLM Gemini 1.5 Pro Strong code generation, large context for data samples
Data Processing Pandas + NumPy Industry standard, LLM knows them well
Visualization Matplotlib + Seaborn Flexible, well-documented, LLM-friendly
Sandbox (Option A) Docker + gVisor Strong isolation, easy to configure
Sandbox (Option B) Pyodide (WebAssembly) Browser-based, no server needed
Frontend React + AG Grid Good table rendering, chart integration

What You'll Learn

By the end of this project, you'll have hands-on experience with:

  • Code generation: Prompting LLMs to write correct, safe Python code
  • Schema inference: Understanding data structures from samples
  • Sandboxed execution: Running untrusted code safely
  • Reproducibility: Generating shareable notebooks from AI output
  • Guard rails: Preventing confident math errors and hallucinations
Start with Pyodide

For prototyping, use Pyodide (Python in WebAssembly). It runs entirely in the browser with built-in isolation. No server, no Docker, no security headaches. Upgrade to Docker when you need more power.

Where to go next