Case Study: Building an AI-Powered Lead Discovery System
The Problem with Traditional Lead Generation
Most B2B companies rely on the same tired playbook: cold email lists, LinkedIn outreach, paid ads. The problem isn't that these don't work—it's that they're reactive and increasingly saturated.
By the time you reach a prospect through these channels, they've already been contacted by dozens of competitors. You're fighting for attention in an overcrowded space.
The real opportunity is finding prospects when they're actively looking for a solution—before they've been bombarded by sales teams. That's where Reddit becomes interesting.
Why Reddit? Signal in the Noise
Reddit hosts thousands of niche communities where potential customers discuss real problems in real time. Someone posts "Our team is drowning in manual data entry" in r/SaaS—that's a buying signal.
But Reddit is massive: millions of posts daily across hundreds of thousands of subreddits. Manually monitoring conversations doesn't scale. You need automation.
The challenge: build a system that can scan relevant discussions, identify genuine buying intent, qualify leads against your ICP, and surface them before the thread fills up with competing responses.
System Architecture Overview
The system consists of five core components:
1. Data ingestion pipeline - Real-time Reddit monitoring
2. Intent detection - Filtering signal from noise
3. LLM-based scoring - Qualifying leads against ICP
4. Automation layer - Scheduled scans and alerting
5. Personalization engine - Context-aware response generation
Let's break down each layer.
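One note before diving in: the snippets below share a common post shape. The interface here is a sketch of that internal type — the field names mirror what the scanner pulls from Reddit's API, but RedditPost itself is our own type, not something the API provides:
interface RedditPost {
  id: string;
  subreddit: string;
  title: string;
  content: string;     // the post body (selftext in Reddit's API)
  author: string;
  createdAt: number;   // created_utc, Unix seconds
  url: string;
  score: number;       // upvotes
  numComments: number;
}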
1. Data Ingestion & Reddit Scanning
The Constraint
Reddit's API has rate limits. You can't brute-force scan every post. You need to be strategic about what you monitor and how often.
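To make "strategic" concrete, every outbound request can pass through a throttle. Below is a minimal token-bucket sketch (our own illustration — snoowrap also ships a requestDelay config option that throttles requests for you):
class RateLimiter {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private maxTokens: number,       // burst capacity
    private refillPerSecond: number  // sustained request rate
  ) {
    this.tokens = maxTokens;
  }

  async acquire(): Promise<void> {
    for (;;) {
      const now = Date.now();
      // Refill proportionally to elapsed time, capped at burst capacity
      this.tokens = Math.min(
        this.maxTokens,
        this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSecond
      );
      this.lastRefill = now;
      if (this.tokens >= 1) {
        this.tokens -= 1;
        return;
      }
      // Wait roughly one token's worth of refill time before retrying
      await new Promise(resolve => setTimeout(resolve, 1000 / this.refillPerSecond));
    }
  }
}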
Approach
import Snoowrap, { Submission } from 'snoowrap';
import { Queue } from 'bullmq';

interface SubredditConfig {
  name: string;
  keywords: string[];
  scanInterval: number; // minutes
  priority: 'high' | 'medium' | 'low';
}

class RedditScanner {
  private reddit: Snoowrap;
  private db: Database; // app-level persistence, wiring elided
  private queue: Queue;

  async scanSubreddits(configs: SubredditConfig[]): Promise<RedditPost[]> {
    // Prioritize high-value subreddits so rate-limit budget goes where it matters
    const sortedConfigs = [...configs].sort((a, b) =>
      this.getPriorityWeight(b.priority) - this.getPriorityWeight(a.priority)
    );

    const scanned: RedditPost[] = [];
    for (const config of sortedConfigs) {
      const posts = await this.fetchRecentPosts(config);
      const filtered = this.filterByKeywords(posts, config.keywords);

      // Map to our internal shape, then queue for processing
      const mapped: RedditPost[] = filtered.map(post => ({
        id: post.id,
        subreddit: config.name,
        content: post.selftext,
        title: post.title,
        author: post.author.name,
        createdAt: post.created_utc,
        url: post.url,
        score: post.score,
        numComments: post.num_comments
      }));

      await Promise.all(mapped.map(post => this.queue.add('process-post', post)));
      scanned.push(...mapped);
    }
    return scanned;
  }

  private async fetchRecentPosts(config: SubredditConfig): Promise<Submission[]> {
    const lastScan = await this.getLastScanTime(config.name);
    // getNew() returns a promise of a listing; await it before filtering
    const posts = await this.reddit.getSubreddit(config.name).getNew({ limit: 100 });
    // Keep only posts created since the last scan
    return posts.filter(post => post.created_utc > lastScan);
  }

  private filterByKeywords(posts: Submission[], keywords: string[]): Submission[] {
    return posts.filter(post => {
      const text = `${post.title} ${post.selftext}`.toLowerCase();
      return keywords.some(keyword => text.includes(keyword.toLowerCase()));
    });
  }

  private getPriorityWeight(priority: SubredditConfig['priority']): number {
    return { high: 3, medium: 2, low: 1 }[priority];
  }

  // getLastScanTime(): per-subreddit scan watermark, persistence elided
}
Key Design Decisions
Priority-ordered scanning: high-priority subreddits are scanned first, so the rate-limit budget is spent where leads are most likely.
Incremental fetches: each scan only pulls posts created since the last scan's watermark, rather than re-reading the whole subreddit.
Cheap filtering before expensive processing: keyword matching runs before anything is queued, so the LLM stages only ever see plausible candidates.
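A client's scanner configuration then reduces to a small list. Illustrative values only — the subreddits and keywords here are examples, not a production config:
const configs: SubredditConfig[] = [
  { name: 'SaaS', keywords: ['manual data entry', 'automate', 'which tool'], scanInterval: 15, priority: 'high' },
  { name: 'smallbusiness', keywords: ['crm', 'spreadsheet'], scanInterval: 60, priority: 'medium' },
  { name: 'projectmanagement', keywords: ['recommend', 'alternative to'], scanInterval: 60, priority: 'low' }
];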
2. Intent Detection & Filtering
Not every post mentioning your keywords is a lead. Someone asking "What's the best project management tool?" is a lead. Someone saying "I hate project management tools" is not.
LLM-Powered Intent Classification
interface IntentAnalysis {
hasBuyingIntent: boolean;
confidence: number;
intentType: 'problem' | 'comparison' | 'recommendation' | 'complaint' | 'other';
urgency: 'high' | 'medium' | 'low';
reasoning: string;
}
class IntentDetector {
  private llm: LLMClient;

  async analyzeIntent(post: RedditPost): Promise<IntentAnalysis> {
    const prompt = this.buildIntentPrompt(post);
    const response = await this.llm.complete({
      model: 'gpt-4-turbo', // json_object output requires a turbo-era GPT-4 model
      messages: [
        {
          role: 'system',
          content: `You are an expert at detecting buying intent in social media posts.

Analyze posts for genuine buying signals:
- Problem statements indicating pain points
- Requests for tool/solution recommendations
- Comparisons of different solutions
- Mentions of budget or timeline

Distinguish from:
- Complaints without solution-seeking
- General discussions
- Casual mentions
- Off-topic content

Return analysis as JSON.`
        },
        {
          role: 'user',
          content: prompt
        }
      ],
      response_format: { type: 'json_object' }
    });
    return JSON.parse(response.choices[0].message.content) as IntentAnalysis;
  }

  private buildIntentPrompt(post: RedditPost): string {
    return `Analyze this Reddit post for buying intent:

Title: ${post.title}
Content: ${post.content}
Subreddit: ${post.subreddit}
Engagement: ${post.score} upvotes, ${post.numComments} comments

Provide:
1. hasBuyingIntent (boolean)
2. confidence (0-1)
3. intentType (problem/comparison/recommendation/complaint/other)
4. urgency (high/medium/low)
5. reasoning (brief explanation)`;
  }
}
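Since JSON.parse accepts any shape, it's worth validating the model's output before trusting it downstream. A minimal hand-rolled type guard (our addition — a schema-validation library would work equally well):
function isIntentAnalysis(value: unknown): value is IntentAnalysis {
  const v = value as IntentAnalysis;
  return (
    typeof v === 'object' && v !== null &&
    typeof v.hasBuyingIntent === 'boolean' &&
    typeof v.confidence === 'number' && v.confidence >= 0 && v.confidence <= 1 &&
    ['problem', 'comparison', 'recommendation', 'complaint', 'other'].includes(v.intentType) &&
    ['high', 'medium', 'low'].includes(v.urgency) &&
    typeof v.reasoning === 'string'
  );
}

// In analyzeIntent, parse to unknown first and reject malformed responses
// instead of letting them flow into scoring.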
Why GPT-4 Over Fine-Tuned Models
Early on, we tested smaller fine-tuned models for intent detection. GPT-4 was more expensive but significantly better at nuance—distinguishing between "I need a solution" and "I'm just venting."
For a lead generation system, false positives are expensive (wasted time). False negatives are lost opportunities. GPT-4's accuracy justified the cost.
3. LLM-Based Lead Scoring
Once we know a post has buying intent, we need to qualify it against your Ideal Customer Profile (ICP).
ICP Matching with Structured Prompts
interface ICPCriteria {
companySize?: {
min?: number;
max?: number;
};
industry?: string[];
roles?: string[];
painPoints?: string[];
budget?: {
min?: number;
max?: number;
};
techStack?: string[];
}
interface LeadScore {
overallScore: number; // 0-100
icpMatch: boolean;
matchedCriteria: string[];
missingCriteria: string[];
reasoning: string;
recommendedAction: 'high-priority' | 'medium-priority' | 'low-priority' | 'disqualify';
}
class LeadScorer {
  private llm: LLMClient;

  async scoreAgainstICP(
    post: RedditPost,
    intent: IntentAnalysis,
    icp: ICPCriteria
  ): Promise<LeadScore> {
    const prompt = this.buildScoringPrompt(post, intent, icp);
    const response = await this.llm.complete({
      model: 'gpt-4-turbo', // json_object output requires a turbo-era GPT-4 model
      messages: [
        {
          role: 'system',
          content: `You are an expert at qualifying B2B leads.

Analyze posts against ICP criteria and score leads based on:
- Fit with target company size, industry, role
- Alignment with pain points your product solves
- Indication of budget/decision-making authority
- Urgency and timeline
- Technical environment compatibility

Scoring guidelines:
- 90-100: Perfect ICP match, high urgency
- 70-89: Strong match, good timing
- 50-69: Partial match, worth monitoring
- Below 50: Poor fit, disqualify

Return analysis as JSON.`
        },
        {
          role: 'user',
          content: prompt
        }
      ],
      response_format: { type: 'json_object' }
    });
    return JSON.parse(response.choices[0].message.content) as LeadScore;
  }

  private buildScoringPrompt(
    post: RedditPost,
    intent: IntentAnalysis,
    icp: ICPCriteria
  ): string {
    return `Score this lead against our ICP:

POST:
Title: ${post.title}
Content: ${post.content}
Context: Posted in r/${post.subreddit}
Intent: ${intent.intentType} (confidence: ${intent.confidence})
Urgency: ${intent.urgency}

ICP CRITERIA:
Company Size: ${this.formatCompanySize(icp.companySize)}
Industries: ${icp.industry?.join(', ') || 'Any'}
Target Roles: ${icp.roles?.join(', ') || 'Any'}
Pain Points: ${icp.painPoints?.join(', ') || 'Any'}
Budget Range: ${this.formatBudget(icp.budget)}
Tech Stack: ${icp.techStack?.join(', ') || 'Any'}

Provide:
1. overallScore (0-100)
2. icpMatch (boolean)
3. matchedCriteria (array of criteria that match)
4. missingCriteria (array of criteria we can't confirm)
5. reasoning (detailed explanation)
6. recommendedAction (high-priority/medium-priority/low-priority/disqualify)`;
  }

  private formatCompanySize(size?: ICPCriteria['companySize']): string {
    if (!size) return 'Any';
    return `${size.min ?? '?'}-${size.max ?? '?'} employees`;
  }

  private formatBudget(budget?: ICPCriteria['budget']): string {
    if (!budget) return 'Any';
    return `$${budget.min ?? '?'}-$${budget.max ?? '?'}`;
  }
}
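To ground the criteria above, an ICP definition for a hypothetical workflow-automation product might look like this (illustrative values, not a real client's profile):
const icp: ICPCriteria = {
  companySize: { min: 10, max: 200 },
  industry: ['SaaS', 'E-commerce'],
  roles: ['founder', 'operations manager'],
  painPoints: ['manual data entry', 'disconnected tools'],
  budget: { min: 100, max: 1000 },   // monthly, USD
  techStack: ['Airtable', 'Zapier']
};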
Prompt Engineering for Consistency
The quality of lead scoring depends heavily on prompt structure, so we version, test, and iterate on prompts the same way we treat any other critical component (see Key Takeaways). A lightweight way to test is sketched below.
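This harness runs a prompt revision over previously human-reviewed posts and reports agreement; LabeledExample and its fields are assumptions for illustration, not part of the production system:
interface LabeledExample {
  post: RedditPost;
  intent: IntentAnalysis;
  expectedMatch: boolean; // human-reviewed ground truth
}

// Returns the fraction of examples where the scorer agrees with human review.
async function evaluatePrompt(
  scorer: LeadScorer,
  icp: ICPCriteria,
  examples: LabeledExample[]
): Promise<number> {
  let agreements = 0;
  for (const example of examples) {
    const score = await scorer.scoreAgainstICP(example.post, example.intent, icp);
    if (score.icpMatch === example.expectedMatch) agreements++;
  }
  return agreements / examples.length;
}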
4. Automation & Scheduling
Cron-Based Scanning
interface ScanJobResult {
  postsScanned: number;
  intentDetected: number;
  highPriorityLeads: number;
  timestamp: Date;
}

class ScanScheduler {
  private scanner: RedditScanner;
  private intentDetector: IntentDetector;
  private leadScorer: LeadScorer;
  private notifier: NotificationService;

  async runScanJob(clientId: string): Promise<ScanJobResult> {
    const client = await this.getClientConfig(clientId);

    // 1. Scan configured subreddits
    const posts = await this.scanner.scanSubreddits(client.subreddits);

    // 2. Detect intent in parallel
    const intents = await Promise.all(
      posts.map(post => this.intentDetector.analyzeIntent(post))
    );

    // 3. Keep posts paired with their intents so indices stay aligned after filtering
    const qualified = posts
      .map((post, i) => ({ post, intent: intents[i] }))
      .filter(({ intent }) => intent.hasBuyingIntent && intent.confidence > 0.7);

    // 4. Score against ICP
    const scores = await Promise.all(
      qualified.map(({ post, intent }) =>
        this.leadScorer.scoreAgainstICP(post, intent, client.icp)
      )
    );

    // 5. Filter high-value leads
    const highPriorityLeads = qualified.filter(
      (_, i) => scores[i].overallScore >= 70
    );

    // 6. Notify client
    if (highPriorityLeads.length > 0) {
      await this.notifier.sendLeadAlert(client, highPriorityLeads, scores);
    }

    // 7. Store results (qualified entries carry their intents)
    await this.storeLeads(clientId, qualified, scores);

    return {
      postsScanned: posts.length,
      intentDetected: qualified.length,
      highPriorityLeads: highPriorityLeads.length,
      timestamp: new Date()
    };
  }
}
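In production the scans run as Lambda cron jobs (see the stack below), but the same schedule can be expressed with BullMQ's repeatable jobs, which the pipeline already uses for post processing. A sketch — the queue name, Redis connection, and client id are assumptions:
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // assumed Redis connection
const scanQueue = new Queue('scan-jobs', { connection });

// Enqueue a repeatable job: run this client's scan at the top of every hour.
await scanQueue.add(
  'run-scan',
  { clientId: 'client-123' }, // hypothetical client id
  { repeat: { pattern: '0 * * * *' } } // cron syntax ('cron' key in older BullMQ)
);

// Worker that executes queued scans.
new Worker(
  'scan-jobs',
  async job => {
    const scheduler = new ScanScheduler(/* dependencies injected here */);
    return scheduler.runScanJob(job.data.clientId);
  },
  { connection }
);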
Cost Optimization
LLM calls are expensive. At scale, you need to optimize:
Caching: If we've analyzed a post before, don't re-analyze
Batch processing: Queue posts and process in batches during off-peak API hours
Tiered analysis: Use cheaper models (GPT-3.5) for intent detection, reserve GPT-4 for lead scoring
Smart filtering: Aggressive keyword filtering before LLM calls
class CostOptimizer {
  private cache: Cache;

  async analyzeBatch(posts: RedditPost[]): Promise<IntentAnalysis[]> {
    // Check cache first
    const cached = await this.checkCache(posts);
    const uncached = posts.filter(p => !cached.has(p.id));
    if (uncached.length === 0) {
      return this.getCachedResults(posts);
    }

    // Batch uncached posts
    const batches = this.createBatches(uncached, 10);
    const results: IntentAnalysis[] = [];
    for (const batch of batches) {
      const batchResults = await this.processBatch(batch);
      results.push(...batchResults);
      // Space out batches to stay under rate limits
      await this.sleep(1000);
    }

    // Cache new results
    await this.cacheResults(uncached, results);

    // Combine cached + new, preserving the original post order
    return this.mergeResults(posts, cached, results);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  // checkCache / getCachedResults / createBatches / processBatch /
  // cacheResults / mergeResults elided for brevity
}
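The cache itself can be as simple as Redis keys derived from the post id plus a prompt version, so bumping the version automatically invalidates stale analyses. A sketch using ioredis (the key scheme and TTL are our assumptions):
import Redis from 'ioredis';

const redis = new Redis(); // assumed Redis connection

const PROMPT_VERSION = 'v3'; // bump to invalidate previously cached analyses

function intentCacheKey(postId: string): string {
  return `intent:${PROMPT_VERSION}:${postId}`;
}

async function getCachedIntent(postId: string): Promise<IntentAnalysis | null> {
  const raw = await redis.get(intentCacheKey(postId));
  return raw ? (JSON.parse(raw) as IntentAnalysis) : null;
}

async function cacheIntent(postId: string, analysis: IntentAnalysis): Promise<void> {
  // Week-old posts are rarely worth responding to, so expire accordingly
  await redis.set(intentCacheKey(postId), JSON.stringify(analysis), 'EX', 7 * 24 * 3600);
}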
5. Personalization Engine
Once you have a qualified lead, you need a relevant response. Generic copy-paste doesn't work on Reddit.
Context-Aware Response Generation
class ResponseGenerator {
  private llm: LLMClient;

  async generateResponse(
    post: RedditPost,
    leadScore: LeadScore,
    product: ProductInfo
  ): Promise<string> {
    const prompt = `Generate a helpful, non-salesy Reddit comment for this post:

POST:
Title: ${post.title}
Content: ${post.content}
Context: User is looking for ${leadScore.matchedCriteria.join(', ')}

OUR PRODUCT:
${product.description}
Key features: ${product.features.join(', ')}
Best for: ${product.idealFor}

GUIDELINES:
- Be genuinely helpful, not promotional
- Address their specific pain point
- Mention the product naturally if relevant
- Sound like a real Reddit user
- Keep it concise (2-3 sentences)
- Don't use marketing speak

Write the response:`;

    const response = await this.llm.complete({
      model: 'gpt-4',
      messages: [
        { role: 'system', content: 'You write helpful, authentic Reddit comments.' },
        { role: 'user', content: prompt }
      ],
      temperature: 0.7
    });
    return response.choices[0].message.content;
  }
}
Human-in-the-Loop
We don't auto-post. The system generates suggested responses, but a human reviews and approves. This maintains authenticity and avoids spammy behavior.
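Concretely, generated drafts land in a review queue instead of going straight to Reddit. A minimal shape for that record (a sketch; the field names are assumptions):
type ReviewStatus = 'pending_review' | 'approved' | 'rejected';

interface SuggestedResponse {
  leadId: string;
  postUrl: string;
  draft: string;         // the LLM-generated comment
  status: ReviewStatus;  // every draft starts as 'pending_review'
  reviewedBy?: string;   // set when a human approves or rejects
  editedDraft?: string;  // reviewers can rewrite before approving
}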
Infrastructure & Deployment
const techStack = {
runtime: 'Node.js 20 + TypeScript',
framework: 'Next.js 14 (API routes)',
database: 'PostgreSQL 15',
cache: 'Redis 7',
queue: 'BullMQ',
llm: 'OpenAI API (GPT-4 + GPT-3.5-turbo)',
deployment: 'Vercel (web) + AWS Lambda (cron jobs)',
monitoring: 'Axiom (logs) + Sentry (errors)'
};
Why This Stack
TypeScript end to end keeps the scanner, scoring, and scheduling code in one language. BullMQ (backed by Redis) drives the post-processing queue the scanner feeds, PostgreSQL stores leads and scan state, and splitting deployment between Vercel and Lambda keeps the web app and the scheduled scans independently deployable. As the tradeoffs below cover, the one piece that outgrew this setup was long-running batch processing.
Results & Learnings
What Worked
LLM-based intent detection was far more accurate than keyword rules. The ability to understand context (sarcasm, complaints vs. genuine problems) was critical.
Aggressive pre-filtering reduced API costs significantly. Simple keyword matching before LLM calls cut costs by ~80%.
Structured prompts with JSON output made parsing reliable. Early experiments with free-form LLM responses were inconsistent.
What Didn't Work
Fine-tuned models underperformed GPT-4 for our use case. The marginal cost savings weren't worth the accuracy loss.
Auto-posting felt spammy. Human review of generated responses maintained quality and brand safety.
Scanning too many subreddits diluted focus. Better to deeply monitor 20 high-value communities than shallowly scan 200.
Architecture Tradeoffs
Serverless vs. Long-Running Workers
We started with Lambda for everything. Cron jobs work well serverless, but long-running LLM batch processing hit timeout limits. Moved batch processing to ECS Fargate tasks.
Real-Time vs. Batch Processing
Real-time scanning sounds appealing but isn't necessary. Most Reddit posts don't require immediate response. Hourly batch scans are sufficient and much cheaper.
GPT-4 vs. GPT-3.5
For intent detection: GPT-3.5 is good enough (60% cost savings).
For lead scoring: GPT-4 is worth it (much better at nuanced ICP matching).
For response generation: GPT-4 for quality.
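That tiering is easy to centralize so no call site hardcodes a model. A sketch (the task names are just this system's three LLM uses):
type LLMTask = 'intent-detection' | 'lead-scoring' | 'response-generation';

// Route each task to the cheapest model that is accurate enough for it.
function modelFor(task: LLMTask): string {
  switch (task) {
    case 'intent-detection':
      return 'gpt-3.5-turbo'; // good enough here, and far cheaper
    case 'lead-scoring':
      return 'gpt-4-turbo';   // nuanced ICP matching needs the stronger model
    case 'response-generation':
      return 'gpt-4-turbo';   // response quality matters most
  }
}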
Key Takeaways
AI automation doesn't mean "set it and forget it." The system finds leads, but humans qualify and engage. It's augmentation, not replacement.
Prompt engineering is a first-class concern. As important as your application code. Version, test, and iterate on prompts like you would any other critical component.
Cost optimization matters at scale. One careless prompt can cost hundreds in API fees. Cache aggressively, batch intelligently, use cheaper models where possible.
Context is everything. Reddit users can smell marketing from a mile away. Personalized, helpful responses work. Generic sales pitches get downvoted.
Conclusion
Building an AI-powered lead discovery system required combining multiple disciplines: data engineering (Reddit API), ML engineering (LLM integration), and product thinking (what makes a good lead?).
The result is a system that continuously monitors thousands of conversations, identifies genuine buying intent, qualifies leads against your ICP, and surfaces high-value opportunities—all automated.
No more manual prospecting. No more cold outreach to people who aren't ready. Just qualified leads, delivered when they're actively looking for solutions.
---
Building AI-powered automation for your business? This kind of system design—combining LLMs, data pipelines, and intelligent filtering—is exactly the work I do.