34.2 Repo indexing strategy
The Indexing Problem
The fundamental challenge: How do we give the model enough context about our codebase without exceeding the context window or spending too much on tokens?
There are three strategies, each suited to different repo sizes:
| Strategy | Repo Size | Complexity | Accuracy |
|---|---|---|---|
| Brute Force | <50 files | Trivial | Perfect |
| File Map | 50-500 files | Low | Very Good |
| Embeddings | 500+ files | High | Good |
Level 1: Brute Force (Small Repos)
If your project is <50 files, don't overthink it. Just traverse the directory, concatenate all non-ignored files into a structured format, and stuff it into the context window.
Gemini 1.5 Pro has a 2M-token context window, which fits the vast majority of side projects in their entirety.
// brute-force-indexer.js
const fs = require('fs');
const path = require('path');
const ignore = require('ignore');

// Extensions treated as readable text; extend as needed for your stack
const TEXT_EXTENSIONS = new Set([
  '.js', '.ts', '.jsx', '.tsx', '.json', '.md', '.txt',
  '.html', '.css', '.yml', '.yaml', '.py', '.sh'
]);

function isTextFile(name) {
  return TEXT_EXTENSIONS.has(path.extname(name).toLowerCase());
}

function indexRepo(rootDir) {
  const ig = ignore();
  // Load .gitignore rules if the file exists
  const gitignorePath = path.join(rootDir, '.gitignore');
  if (fs.existsSync(gitignorePath)) {
    ig.add(fs.readFileSync(gitignorePath, 'utf8'));
  }
  // Always ignore these
  ig.add(['node_modules', '.git', '.env', '*.lock']);

  const files = [];
  function walk(dir) {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const fullPath = path.join(dir, entry.name);
      const relativePath = path.relative(rootDir, fullPath);
      if (ig.ignores(relativePath)) continue;
      if (entry.isDirectory()) {
        walk(fullPath);
      } else if (isTextFile(entry.name)) {
        files.push({
          path: relativePath,
          content: fs.readFileSync(fullPath, 'utf8')
        });
      }
    }
  }
  walk(rootDir);
  return files;
}

function formatForPrompt(files) {
  // Wrap each file in XML-style tags so the model can tell files apart
  return files.map(f => `<file path="${f.path}">
${f.content}
</file>`).join('\n');
}
Wrapping each file in XML tags (<file path="...">) helps the model understand structure: it can reference files by path and knows where one file ends and another begins.
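As a quick usage sketch, assuming indexRepo and formatForPrompt are exported from the module above: estimate the token footprint with the rough 4-characters-per-token heuristic, then send the whole repo as context in a single prompt (the model name and prompt wording are illustrative):

// ask-repo.ts (sketch)
import { GoogleGenerativeAI } from '@google/generative-ai';
import { indexRepo, formatForPrompt } from './brute-force-indexer';

export async function askAboutRepo(rootDir: string, question: string): Promise<string> {
  const context = formatForPrompt(indexRepo(rootDir));

  // Rough sanity check: ~4 characters per token for English text and source code
  console.log(`~${Math.ceil(context.length / 4)} tokens of context`);

  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });
  const result = await model.generateContent(
    `You are a code assistant. Here is the entire repository:\n\n${context}\n\nQuestion: ${question}`
  );
  return result.response.text();
}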
Level 2: The File Map (Medium Repos)
If you have a 100MB monorepo, you can't send everything. Instead, we use a two-step approach:
- Index phase: Generate a "File Map"—a tree structure with paths and optional descriptions.
- Query phase: Send the map first, let the model request specific files.
// file-map-example.json
{
  "files": [
    { "path": "src/index.ts", "description": "Entry point, sets up Express server" },
    { "path": "src/routes/auth.ts", "description": "Login, logout, register endpoints" },
    { "path": "src/routes/users.ts", "description": "CRUD operations for users" },
    { "path": "src/models/User.ts", "description": "User model with Prisma schema" },
    { "path": "src/middleware/auth.ts", "description": "JWT verification middleware" },
    { "path": "src/utils/validation.ts", "description": "Email, password validators" }
  ]
}
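Those descriptions don't have to be written by hand. A minimal sketch of one approach: during the index phase, send the first ~50 lines of each file to a cheap model and ask for a one-line summary (the model name and prompt are illustrative; in practice you'd cache results and only re-describe files that changed):

// describe-files.ts (sketch)
import * as fs from 'fs';
import * as path from 'path';
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });

// Produce a one-line description of a single file from its first ~50 lines
export async function describeFile(rootDir: string, relativePath: string): Promise<string> {
  const head = fs
    .readFileSync(path.join(rootDir, relativePath), 'utf8')
    .split('\n')
    .slice(0, 50)
    .join('\n');
  const result = await model.generateContent(
    `In one short sentence, describe what this file does.\n\nPath: ${relativePath}\n\n${head}`
  );
  return result.response.text().trim();
}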
How the Conversation Works
SYSTEM: You are a code assistant. Here is the file map for this project:
[file map JSON]
When you need to see file contents, use the read_file tool.
USER: Update the login page to add a "Remember Me" checkbox.
AI: Looking at the file map, I need to examine:
1. src/routes/auth.ts - to see the login endpoint
2. src/components/LoginForm.tsx - to update the UI
[TOOL CALL: read_file("src/routes/auth.ts")]
[TOOL CALL: read_file("src/components/LoginForm.tsx")]
AI: I've reviewed the files. Here's my plan:
1. Add a `rememberMe` parameter to the login endpoint
2. If true, set token expiry to 30 days instead of 1 day
3. Add checkbox to LoginForm that passes this parameter
Here's the diff for src/routes/auth.ts:
[diff]
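Here is a minimal sketch of that loop using Gemini function calling (the read_file tool schema, model name, and loop shape are illustrative; real code would validate requested paths against the ignore list and cap how many files the model can pull in):

// file-map-chat.ts (sketch)
import * as fs from 'fs';
import * as path from 'path';
import { GoogleGenerativeAI, SchemaType } from '@google/generative-ai';

export async function chatWithFileMap(rootDir: string, fileMapJson: string, userMessage: string) {
  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
  const model = genAI.getGenerativeModel({
    model: 'gemini-1.5-pro',
    systemInstruction:
      `You are a code assistant. Here is the file map for this project:\n${fileMapJson}\n` +
      `When you need to see file contents, use the read_file tool.`,
    tools: [{
      functionDeclarations: [{
        name: 'read_file',
        description: 'Read the contents of a file by its relative path',
        parameters: {
          type: SchemaType.OBJECT,
          properties: { path: { type: SchemaType.STRING } },
          required: ['path'],
        },
      }],
    }],
  });

  const chat = model.startChat();
  let response = (await chat.sendMessage(userMessage)).response;

  // Keep serving read_file calls until the model answers in plain text
  while (response.functionCalls()?.length) {
    const parts = response.functionCalls()!.map(call => ({
      functionResponse: {
        name: call.name,
        response: {
          content: fs.readFileSync(
            path.join(rootDir, (call.args as { path: string }).path), 'utf8'
          ),
        },
      },
    }));
    response = (await chat.sendMessage(parts)).response;
  }
  return response.text();
}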
Level 3: Embeddings (Large Repos)
For codebases beyond what a file map can handle, anywhere from a few hundred files up to Google-scale monorepos with millions of them, you need vector search:
- Chunk every file into smaller pieces (functions, classes, or fixed-size chunks).
- Generate embeddings for each chunk using a model like text-embedding-004.
- Store the embeddings in a vector database (Pinecone, Weaviate, or even a local FAISS index).
- At query time, embed the user's question, find the K nearest chunks, and include them.
// embeddings-example.js (simplified)
const { GoogleGenerativeAI } = require("@google/generative-ai");

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const embedder = genAI.getGenerativeModel({ model: "text-embedding-004" });

// Turn a chunk of code or text into an embedding vector
async function embedChunk(text) {
  const result = await embedder.embedContent(text);
  return result.embedding.values;
}

// vectorDB is whatever store you chose; assumed to expose search(vector, k)
async function findRelevantChunks(query, vectorDB, k = 5) {
  const queryEmbedding = await embedChunk(query);
  return vectorDB.search(queryEmbedding, k);
}
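The vectorDB above is a stand-in for whatever store you choose. For small experiments you can skip the database entirely and do a brute-force cosine-similarity scan in memory; a sketch, assuming each chunk has already been embedded with embedChunk:

// in-memory-search.ts (sketch)
interface EmbeddedChunk {
  path: string;       // file the chunk came from
  text: string;       // the chunk itself
  vector: number[];   // embedding produced by embedChunk()
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query vector and return the top k
export function searchChunks(queryVector: number[], chunks: EmbeddedChunk[], k = 5): EmbeddedChunk[] {
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ chunk }) => chunk);
}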
Vector search adds significant complexity: re-embedding and cache invalidation when files change, chunking-strategy decisions, and re-ranking for quality. Only use it for truly massive repos. For this project, we'll stick with Level 1 or 2.
Implementation: Building the Indexer
Here's a more complete indexer that supports both Level 1 and Level 2:
// indexer.ts
import * as fs from 'fs';
import * as path from 'path';
import ignore from 'ignore';

interface FileInfo {
  path: string;
  size: number;
  content?: string; // Only included in brute-force mode
}

interface IndexResult {
  totalFiles: number;
  totalSize: number;
  files: FileInfo[];
}

const BINARY_EXTENSIONS = new Set([
  '.png', '.jpg', '.jpeg', '.gif', '.ico', '.pdf',
  '.zip', '.tar', '.gz', '.exe', '.dll', '.so',
  '.woff', '.woff2', '.ttf', '.eot'
]);

const ALWAYS_IGNORE = [
  'node_modules', '.git', '.env', '.env.*',
  '*.lock', 'package-lock.json', 'yarn.lock',
  'dist', 'build', '.next', '__pycache__'
];

export function createIndexer(rootDir: string) {
  const ig = ignore();

  // Load .gitignore if it exists
  const gitignorePath = path.join(rootDir, '.gitignore');
  if (fs.existsSync(gitignorePath)) {
    ig.add(fs.readFileSync(gitignorePath, 'utf8'));
  }
  ig.add(ALWAYS_IGNORE);

  function shouldInclude(relativePath: string): boolean {
    if (ig.ignores(relativePath)) return false;
    const ext = path.extname(relativePath).toLowerCase();
    if (BINARY_EXTENSIONS.has(ext)) return false;
    return true;
  }

  function walk(dir: string, files: FileInfo[] = []): FileInfo[] {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const fullPath = path.join(dir, entry.name);
      const relativePath = path.relative(rootDir, fullPath);
      if (!shouldInclude(relativePath)) continue;
      if (entry.isDirectory()) {
        walk(fullPath, files);
      } else {
        const stats = fs.statSync(fullPath);
        files.push({
          path: relativePath,
          size: stats.size
        });
      }
    }
    return files;
  }

  return {
    // Level 1: Get everything
    getFullContext(): IndexResult {
      const files = walk(rootDir).map(f => ({
        ...f,
        content: fs.readFileSync(path.join(rootDir, f.path), 'utf8')
      }));
      return {
        totalFiles: files.length,
        totalSize: files.reduce((sum, f) => sum + f.size, 0),
        files
      };
    },

    // Level 2: Get just the map
    getFileMap(): IndexResult {
      const files = walk(rootDir);
      return {
        totalFiles: files.length,
        totalSize: files.reduce((sum, f) => sum + f.size, 0),
        files
      };
    },

    // Read specific files on demand
    readFiles(paths: string[]): FileInfo[] {
      return paths.map(p => ({
        path: p,
        size: fs.statSync(path.join(rootDir, p)).size,
        content: fs.readFileSync(path.join(rootDir, p), 'utf8')
      }));
    }
  };
}
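Usage looks roughly like this (the 500 KB cutoff and the example path are arbitrary, for illustration only):

// usage.ts (sketch)
import { createIndexer } from './indexer';

const indexer = createIndexer(process.cwd());

// Decide between Level 1 and Level 2 based on total repo size
const map = indexer.getFileMap();
if (map.totalSize < 500_000) {
  // Small repo: send everything (Level 1)
  const full = indexer.getFullContext();
  console.log(`Sending ${full.totalFiles} files (${full.totalSize} bytes) as full context`);
} else {
  // Larger repo: send the map, read files on demand (Level 2)
  console.log(`Sending file map with ${map.totalFiles} entries`);
  const requested = indexer.readFiles(['src/index.ts']); // paths would come from the model's tool calls
  console.log(requested[0].content?.slice(0, 200));
}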