36. Project 5: Data-to-Insights Analyst Tool

On this page

The Vision
How It Works
Architecture
Security: The Main Challenge
Technology Stack
What You'll Learn
Where to go next

The Vision

Business users have data (Excel, CSV, databases) but don't know Python or SQL. They want to ask questions in plain English:

"Show me sales by region for Q3"

"Which product had the highest growth last month?"

"Create a chart comparing revenue vs. costs over time"

We build a tool where:

User uploads a CSV (or connects to a database)
User asks a question in plain English
AI writes a Python script (using Pandas/Matplotlib) to answer it
System executes the script in a sandbox
User sees the chart, table, and optionally downloads the analysis

How It Works

┌─────────────────────────────────────────────────────────────────┐
│                    DATA ANALYST PIPELINE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐     │
│  │  UPLOAD  │──▶│ INFER    │──▶│ GENERATE │──▶│ EXECUTE  │     │
│  │  DATA    │   │ SCHEMA   │   │   CODE   │   │ SANDBOX  │     │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘     │
│       │               │               │               │         │
│       ▼               ▼               ▼               ▼         │
│   CSV/Excel       Column         Python code      Charts,       │
│   files           types, names   for analysis     tables,       │
│                   samples                          insights     │
│                                                                  │
│                                                                  │
│                      ┌──────────┐                               │
│                      │ NOTEBOOK │                               │
│                      │  EXPORT  │                               │
│                      └──────────┘                               │
│                           │                                      │
│                           ▼                                      │
│                    Reproducible                                  │
│                    .ipynb file                                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Architecture

The system has three critical components:

Component	Purpose	Implementation
Schema Analyzer	Understand the data structure	Pandas dtype inference + LLM enrichment
Code Generator	Translate questions to Python	Gemini with structured prompting
Sandbox Executor	Run untrusted code safely	Docker container or WebAssembly

Security: The Main Challenge

This is the most dangerous project because the AI writes code that we execute. Without proper safeguards:

Code could read/write arbitrary files
Code could make network requests (exfiltrate data)
Code could consume infinite resources (denial of service)
Code could contain malicious payloads

Security layers:

┌─────────────────────────────────────────────────────────────────┐
│                     SECURITY LAYERS                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. CODE VALIDATION (before execution)                          │
│     ├─ Allowlist of imports (pandas, numpy, matplotlib)         │
│     ├─ Block dangerous functions (exec, eval, open, os.*)       │
│     └─ AST analysis for suspicious patterns                     │
│                                                                  │
│  2. SANDBOX ISOLATION (during execution)                        │
│     ├─ No network access                                        │
│     ├─ No filesystem access (except /tmp/work)                  │
│     ├─ Memory limit (e.g., 512MB)                               │
│     └─ CPU time limit (e.g., 30 seconds)                        │
│                                                                  │
│  3. OUTPUT VALIDATION (after execution)                         │
│     ├─ Maximum output size                                       │
│     ├─ File type validation (only PNG/CSV/JSON)                 │
│     └─ Scan for sensitive data in output                        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Never Skip the Sandbox

Even with code validation, sophisticated attacks can bypass static analysis. The sandbox is your last line of defense. Use Docker with network isolation, or better yet, a WebAssembly runtime like Pyodide.

Technology Stack

Component	Technology	Why
LLM	Gemini 1.5 Pro	Strong code generation, large context for data samples
Data Processing	Pandas + NumPy	Industry standard, LLM knows them well
Visualization	Matplotlib + Seaborn	Flexible, well-documented, LLM-friendly
Sandbox (Option A)	Docker + gVisor	Strong isolation, easy to configure
Sandbox (Option B)	Pyodide (WebAssembly)	Browser-based, no server needed
Frontend	React + AG Grid	Good table rendering, chart integration

What You'll Learn

By the end of this project, you'll have hands-on experience with:

Code generation: Prompting LLMs to write correct, safe Python code
Schema inference: Understanding data structures from samples
Sandboxed execution: Running untrusted code safely
Reproducibility: Generating shareable notebooks from AI output
Guard rails: Preventing confident math errors and hallucinations

Start with Pyodide

For prototyping, use Pyodide (Python in WebAssembly). It runs entirely in the browser with built-in isolation. No server, no Docker, no security headaches. Upgrade to Docker when you need more power.

36. Project 5: Data-to-Insights Analyst Tool

The Vision

How It Works

Architecture

Security: The Main Challenge

Technology Stack

What You'll Learn

Where to go next

36. Project 5: Data-to-Insights Analyst Tool sub-sections