36. Project 5: Data-to-Insights Analyst Tool
Overview and links for this section of the guide.
On this page
The Vision
Business users have data (Excel, CSV, databases) but don't know Python or SQL. They want to ask questions in plain English:
"Show me sales by region for Q3"
"Which product had the highest growth last month?"
"Create a chart comparing revenue vs. costs over time"
We build a tool where:
- User uploads a CSV (or connects to a database)
- User asks a question in plain English
- AI writes a Python script (using Pandas/Matplotlib) to answer it
- System executes the script in a sandbox
- User sees the chart, table, and optionally downloads the analysis
How It Works
┌─────────────────────────────────────────────────────────────────┐
│ DATA ANALYST PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ UPLOAD │──▶│ INFER │──▶│ GENERATE │──▶│ EXECUTE │ │
│ │ DATA │ │ SCHEMA │ │ CODE │ │ SANDBOX │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ CSV/Excel Column Python code Charts, │
│ files types, names for analysis tables, │
│ samples insights │
│ │
│ │
│ ┌──────────┐ │
│ │ NOTEBOOK │ │
│ │ EXPORT │ │
│ └──────────┘ │
│ │ │
│ ▼ │
│ Reproducible │
│ .ipynb file │
│ │
└─────────────────────────────────────────────────────────────────┘
Architecture
The system has three critical components:
| Component | Purpose | Implementation |
|---|---|---|
| Schema Analyzer | Understand the data structure | Pandas dtype inference + LLM enrichment |
| Code Generator | Translate questions to Python | Gemini with structured prompting |
| Sandbox Executor | Run untrusted code safely | Docker container or WebAssembly |
Security: The Main Challenge
This is the most dangerous project because the AI writes code that we execute. Without proper safeguards:
- Code could read/write arbitrary files
- Code could make network requests (exfiltrate data)
- Code could consume infinite resources (denial of service)
- Code could contain malicious payloads
Security layers:
┌─────────────────────────────────────────────────────────────────┐
│ SECURITY LAYERS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. CODE VALIDATION (before execution) │
│ ├─ Allowlist of imports (pandas, numpy, matplotlib) │
│ ├─ Block dangerous functions (exec, eval, open, os.*) │
│ └─ AST analysis for suspicious patterns │
│ │
│ 2. SANDBOX ISOLATION (during execution) │
│ ├─ No network access │
│ ├─ No filesystem access (except /tmp/work) │
│ ├─ Memory limit (e.g., 512MB) │
│ └─ CPU time limit (e.g., 30 seconds) │
│ │
│ 3. OUTPUT VALIDATION (after execution) │
│ ├─ Maximum output size │
│ ├─ File type validation (only PNG/CSV/JSON) │
│ └─ Scan for sensitive data in output │
│ │
└─────────────────────────────────────────────────────────────────┘
Even with code validation, sophisticated attacks can bypass static analysis. The sandbox is your last line of defense. Use Docker with network isolation, or better yet, a WebAssembly runtime like Pyodide.
Technology Stack
| Component | Technology | Why |
|---|---|---|
| LLM | Gemini 1.5 Pro | Strong code generation, large context for data samples |
| Data Processing | Pandas + NumPy | Industry standard, LLM knows them well |
| Visualization | Matplotlib + Seaborn | Flexible, well-documented, LLM-friendly |
| Sandbox (Option A) | Docker + gVisor | Strong isolation, easy to configure |
| Sandbox (Option B) | Pyodide (WebAssembly) | Browser-based, no server needed |
| Frontend | React + AG Grid | Good table rendering, chart integration |
What You'll Learn
By the end of this project, you'll have hands-on experience with:
- Code generation: Prompting LLMs to write correct, safe Python code
- Schema inference: Understanding data structures from samples
- Sandboxed execution: Running untrusted code safely
- Reproducibility: Generating shareable notebooks from AI output
- Guard rails: Preventing confident math errors and hallucinations
For prototyping, use Pyodide (Python in WebAssembly). It runs entirely in the browser with built-in isolation. No server, no Docker, no security headaches. Upgrade to Docker when you need more power.
Where to go next
Explore next
36. Project 5: Data-to-Insights Analyst Tool sub-sections
36.1 Upload/ingest flow for CSV-like data
Open page
36.2 Ask questions in natural language; answer in charts/tables
Open page
36.3 Schema inference and validation
Open page
36.4 Reproducible "analysis notebooks" from AI output
Open page
36.5 Guardrails: avoid confident math errors
Open page