34.2 Repo indexing strategy
The Indexing Problem
The fundamental challenge: How do we give the model enough context about our codebase without exceeding the context window or spending too much on tokens?
There are three strategies, each suited to different repo sizes:
| Strategy | Repo Size | Complexity | Accuracy |
|---|---|---|---|
| Brute Force | <50 files | Trivial | Perfect |
| File Map | 50-500 files | Low | Very Good |
| Embeddings | 500+ files | High | Good |
Level 1: Brute Force (Small Repos)
If your project is <50 files, don't overthink it. Just traverse the directory, concatenate all non-ignored files into a structured format, and stuff it into the context window.
Gemini 1.5 Pro has a 2M-token context window, which fits the vast majority of side projects in their entirety.
// brute-force-indexer.js
const fs = require('fs');
const path = require('path');
const ignore = require('ignore');

// Extensions treated as readable text; extend as needed for your stack
const TEXT_EXTENSIONS = new Set([
  '.js', '.ts', '.jsx', '.tsx', '.json', '.md', '.txt',
  '.html', '.css', '.yml', '.yaml', '.py', '.sh'
]);

function isTextFile(name) {
  return TEXT_EXTENSIONS.has(path.extname(name).toLowerCase());
}

function indexRepo(rootDir) {
  const ig = ignore();
  // Load .gitignore rules if the file exists
  const gitignorePath = path.join(rootDir, '.gitignore');
  if (fs.existsSync(gitignorePath)) {
    ig.add(fs.readFileSync(gitignorePath, 'utf8'));
  }
  // Always ignore these
  ig.add(['node_modules', '.git', '.env', '*.lock']);

  const files = [];
  function walk(dir) {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const fullPath = path.join(dir, entry.name);
      const relativePath = path.relative(rootDir, fullPath);
      if (ig.ignores(relativePath)) continue;
      if (entry.isDirectory()) {
        walk(fullPath);
      } else if (isTextFile(entry.name)) {
        files.push({
          path: relativePath,
          content: fs.readFileSync(fullPath, 'utf8')
        });
      }
    }
  }
  walk(rootDir);
  return files;
}

function formatForPrompt(files) {
  // Wrap each file in XML-style tags so the model can tell files apart
  return files.map(f => `<file path="${f.path}">
${f.content}
</file>`).join('\n');
}
Wrapping each file in XML tags (<file path="...">) helps the model understand structure: it can reference files by path and knows where one file ends and another begins.
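As a quick usage sketch, assuming indexRepo and formatForPrompt are exported from the module above: estimate the token footprint with the rough 4-characters-per-token heuristic, then send the whole repo as context in a single prompt (the model name and prompt wording are illustrative):

// ask-repo.ts (sketch)
import { GoogleGenerativeAI } from '@google/generative-ai';
import { indexRepo, formatForPrompt } from './brute-force-indexer';

export async function askAboutRepo(rootDir: string, question: string): Promise<string> {
  const context = formatForPrompt(indexRepo(rootDir));

  // Rough sanity check: ~4 characters per token for English text and source code
  console.log(`~${Math.ceil(context.length / 4)} tokens of context`);

  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });
  const result = await model.generateContent(
    `You are a code assistant. Here is the entire repository:\n\n${context}\n\nQuestion: ${question}`
  );
  return result.response.text();
}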
Level 2: The File Map (Medium Repos)
If you have a 100MB monorepo, you can't send everything. Instead, we use a two-step approach:
- Index phase: Generate a "File Map"—a tree structure with paths and optional descriptions.
- Query phase: Send the map first, let the model request specific files.
// file-map-example.json
{
  "files": [
    { "path": "src/index.ts", "description": "Entry point, sets up Express server" },
    { "path": "src/routes/auth.ts", "description": "Login, logout, register endpoints" },
    { "path": "src/routes/users.ts", "description": "CRUD operations for users" },
    { "path": "src/models/User.ts", "description": "User model with Prisma schema" },
    { "path": "src/middleware/auth.ts", "description": "JWT verification middleware" },
    { "path": "src/utils/validation.ts", "description": "Email, password validators" }
  ]
}
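Those descriptions don't have to be written by hand. A minimal sketch of one approach: during the index phase, send the first ~50 lines of each file to a cheap model and ask for a one-line summary (the model name and prompt are illustrative; in practice you'd cache results and only re-describe files that changed):

// describe-files.ts (sketch)
import * as fs from 'fs';
import * as path from 'path';
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });

// Produce a one-line description of a single file from its first ~50 lines
export async function describeFile(rootDir: string, relativePath: string): Promise<string> {
  const head = fs
    .readFileSync(path.join(rootDir, relativePath), 'utf8')
    .split('\n')
    .slice(0, 50)
    .join('\n');
  const result = await model.generateContent(
    `In one short sentence, describe what this file does.\n\nPath: ${relativePath}\n\n${head}`
  );
  return result.response.text().trim();
}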
How the Conversation Works
SYSTEM: You are a code assistant. Here is the file map for this project:
[file map JSON]
When you need to see file contents, use the read_file tool.
USER: Update the login page to add a "Remember Me" checkbox.
AI: Looking at the file map, I need to examine:
1. src/routes/auth.ts - to see the login endpoint
2. src/components/LoginForm.tsx - to update the UI
[TOOL CALL: read_file("src/routes/auth.ts")]
[TOOL CALL: read_file("src/components/LoginForm.tsx")]
AI: I've reviewed the files. Here's my plan:
1. Add a `rememberMe` parameter to the login endpoint
2. If true, set token expiry to 30 days instead of 1 day
3. Add checkbox to LoginForm that passes this parameter
Here's the diff for src/routes/auth.ts:
[diff]
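Here is a minimal sketch of that loop using Gemini function calling (the read_file tool schema, model name, and loop shape are illustrative; real code would validate requested paths against the ignore list and cap how many files the model can pull in):

// file-map-chat.ts (sketch)
import * as fs from 'fs';
import * as path from 'path';
import { GoogleGenerativeAI, SchemaType } from '@google/generative-ai';

export async function chatWithFileMap(rootDir: string, fileMapJson: string, userMessage: string) {
  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
  const model = genAI.getGenerativeModel({
    model: 'gemini-1.5-pro',
    systemInstruction:
      `You are a code assistant. Here is the file map for this project:\n${fileMapJson}\n` +
      `When you need to see file contents, use the read_file tool.`,
    tools: [{
      functionDeclarations: [{
        name: 'read_file',
        description: 'Read the contents of a file by its relative path',
        parameters: {
          type: SchemaType.OBJECT,
          properties: { path: { type: SchemaType.STRING } },
          required: ['path'],
        },
      }],
    }],
  });

  const chat = model.startChat();
  let response = (await chat.sendMessage(userMessage)).response;

  // Keep serving read_file calls until the model answers in plain text
  while (response.functionCalls()?.length) {
    const parts = response.functionCalls()!.map(call => ({
      functionResponse: {
        name: call.name,
        response: {
          content: fs.readFileSync(
            path.join(rootDir, (call.args as { path: string }).path), 'utf8'
          ),
        },
      },
    }));
    response = (await chat.sendMessage(parts)).response;
  }
  return response.text();
}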
Level 3: Embeddings (Large Repos)
For codebases beyond what a file map can handle, anywhere from a few hundred files up to Google-scale monorepos with millions of them, you need vector search:
- Chunk every file into smaller pieces (functions, classes, or fixed-size chunks).
- Generate embeddings for each chunk using a model like text-embedding-004.
- Store the embeddings in a vector database (Pinecone, Weaviate, or even a local FAISS index).
- At query time, embed the user's question, find the K nearest chunks, and include them.
// embeddings-example.js (simplified)
const { GoogleGenerativeAI } = require("@google/generative-ai");

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const embedder = genAI.getGenerativeModel({ model: "text-embedding-004" });

// Turn a chunk of code or text into an embedding vector
async function embedChunk(text) {
  const result = await embedder.embedContent(text);
  return result.embedding.values;
}

// vectorDB is whatever store you chose; assumed to expose search(vector, k)
async function findRelevantChunks(query, vectorDB, k = 5) {
  const queryEmbedding = await embedChunk(query);
  return vectorDB.search(queryEmbedding, k);
}
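The vectorDB above is a stand-in for whatever store you choose. For small experiments you can skip the database entirely and do a brute-force cosine-similarity scan in memory; a sketch, assuming each chunk has already been embedded with embedChunk:

// in-memory-search.ts (sketch)
interface EmbeddedChunk {
  path: string;       // file the chunk came from
  text: string;       // the chunk itself
  vector: number[];   // embedding produced by embedChunk()
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query vector and return the top k
export function searchChunks(queryVector: number[], chunks: EmbeddedChunk[], k = 5): EmbeddedChunk[] {
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ chunk }) => chunk);
}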
Vector search adds significant complexity: re-embedding and cache invalidation when files change, chunking-strategy decisions, and re-ranking for quality. Only use it for truly massive repos. For this project, we'll stick with Level 1 or 2.
Implementation: Building the Indexer
Here's a more complete indexer that supports both Level 1 and Level 2:
// indexer.ts
import * as fs from 'fs';
import * as path from 'path';
import ignore from 'ignore';

interface FileInfo {
  path: string;
  size: number;
  content?: string; // Only included in brute-force mode
}

interface IndexResult {
  totalFiles: number;
  totalSize: number;
  files: FileInfo[];
}

const BINARY_EXTENSIONS = new Set([
  '.png', '.jpg', '.jpeg', '.gif', '.ico', '.pdf',
  '.zip', '.tar', '.gz', '.exe', '.dll', '.so',
  '.woff', '.woff2', '.ttf', '.eot'
]);

const ALWAYS_IGNORE = [
  'node_modules', '.git', '.env', '.env.*',
  '*.lock', 'package-lock.json', 'yarn.lock',
  'dist', 'build', '.next', '__pycache__'
];

export function createIndexer(rootDir: string) {
  const ig = ignore();

  // Load .gitignore if it exists
  const gitignorePath = path.join(rootDir, '.gitignore');
  if (fs.existsSync(gitignorePath)) {
    ig.add(fs.readFileSync(gitignorePath, 'utf8'));
  }
  ig.add(ALWAYS_IGNORE);

  function shouldInclude(relativePath: string): boolean {
    if (ig.ignores(relativePath)) return false;
    const ext = path.extname(relativePath).toLowerCase();
    if (BINARY_EXTENSIONS.has(ext)) return false;
    return true;
  }

  function walk(dir: string, files: FileInfo[] = []): FileInfo[] {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const fullPath = path.join(dir, entry.name);
      const relativePath = path.relative(rootDir, fullPath);
      if (!shouldInclude(relativePath)) continue;
      if (entry.isDirectory()) {
        walk(fullPath, files);
      } else {
        const stats = fs.statSync(fullPath);
        files.push({
          path: relativePath,
          size: stats.size
        });
      }
    }
    return files;
  }

  return {
    // Level 1: Get everything
    getFullContext(): IndexResult {
      const files = walk(rootDir).map(f => ({
        ...f,
        content: fs.readFileSync(path.join(rootDir, f.path), 'utf8')
      }));
      return {
        totalFiles: files.length,
        totalSize: files.reduce((sum, f) => sum + f.size, 0),
        files
      };
    },

    // Level 2: Get just the map
    getFileMap(): IndexResult {
      const files = walk(rootDir);
      return {
        totalFiles: files.length,
        totalSize: files.reduce((sum, f) => sum + f.size, 0),
        files
      };
    },

    // Read specific files on demand
    readFiles(paths: string[]): FileInfo[] {
      return paths.map(p => ({
        path: p,
        size: fs.statSync(path.join(rootDir, p)).size,
        content: fs.readFileSync(path.join(rootDir, p), 'utf8')
      }));
    }
  };
}
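Usage looks roughly like this (the 500 KB cutoff and the example path are arbitrary, for illustration only):

// usage.ts (sketch)
import { createIndexer } from './indexer';

const indexer = createIndexer(process.cwd());

// Decide between Level 1 and Level 2 based on total repo size
const map = indexer.getFileMap();
if (map.totalSize < 500_000) {
  // Small repo: send everything (Level 1)
  const full = indexer.getFullContext();
  console.log(`Sending ${full.totalFiles} files (${full.totalSize} bytes) as full context`);
} else {
  // Larger repo: send the map, read files on demand (Level 2)
  console.log(`Sending file map with ${map.totalFiles} entries`);
  const requested = indexer.readFiles(['src/index.ts']); // paths would come from the model's tool calls
  console.log(requested[0].content?.slice(0, 200));
}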