ai-powered goal setting with smart caching
29 Jul 2025
built a goal-setting app that transforms problems into actionable smart goals, then provides conversational ai guidance for each task. the challenge? making ai interactions fast, cost-effective, and genuinely helpful. the solution? a multi-layer caching approach with semantic similarity matching.
the problem
traditional goal-setting apps are static lists. when users get stuck on “start a 4-day upper lower split” or “learn spanish,” they’re on their own. adding ai help seems obvious, but creates new problems:
- cost escalation: every “how do I do this?” question hits expensive openai apis
- response inconsistency: same question gets different answers due to ai randomness
- context loss: follow-up questions lack memory of the original task
- user frustration: waiting 2+ seconds for common questions like “what about sets and reps?”
how it works
implemented a simple flow: users describe problems, get structured goals, then ask follow-up questions about how to complete them.
layer 1: problem to smart goals
transforms user problems into specific, measurable, achievable, relevant, and time-bound goals:
// user input: "How do I quit smoking?"
const response = await fetch('/api/generate-goals', {
method: 'POST',
headers: { 'Authorization': `Bearer ${token}`, 'Content-Type': 'application/json' },
body: JSON.stringify({ problem: userInput })
});
// generates structured output:
{
"today": ["throw away all cigarettes and smoking accessories"],
"month": ["complete 30-day nicotine replacement therapy program"],
"year": ["maintain smoke-free lifestyle for 365 consecutive days"]
}
uses embedding-based caching - similar problems get similar goals without redundant api calls.
layer 2: contextual how-to guides
each goal gets an interactive ❓ button for ai-powered guidance:
async function showHowTo(goalSetId, listType, taskId) {
// look up the goal set once so both the task and its name are available
const goalSet = goalSets[goalSetId];
const task = goalSet[listType].find(t => t.id === taskId);
const response = await fetch('/api/how-to', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
taskText: task.text,
goalContext: {
goalSetName: goalSet.name,
timeframe: listType // today, month, year
}
})
});
return response.json(); // structured guide with steps, tips, timing
}
returns structured guidance:
- overview: brief explanation of the task
- steps: actionable numbered instructions
- proTip: expert advice or common pitfalls
- timeNeeded: realistic estimates
- difficulty: complexity assessment
layer 3: conversational follow-ups
users can ask clarifying questions that maintain full context:
// conversation state preserved across questions
currentConversation = {
taskId: "start_4_day_split_123",
goalSetId: "fitness_goals",
listType: "today",
messages: [
{ type: 'assistant', content: originalGuide },
{ type: 'user', content: "What about sets and reps?" },
{ type: 'assistant_followup', content: "For upper body days..." }
]
};
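one way the stored messages could be turned into a chat-style request is sketched below. the `buildChatMessages` helper and the mapping from `type` to `role` are assumptions for illustration, not the app's actual code.

```javascript
// hypothetical mapping from the conversation's message types to chat roles
const ROLE_BY_TYPE = {
  assistant: 'assistant',
  assistant_followup: 'assistant',
  user: 'user'
};

// flatten the saved conversation plus the new question into role/content pairs
function buildChatMessages(conversation, newQuestion) {
  const history = conversation.messages.map(m => ({
    role: ROLE_BY_TYPE[m.type] ?? 'user',
    // the original guide is structured, so serialize non-string content
    content: typeof m.content === 'string' ? m.content : JSON.stringify(m.content)
  }));
  return [...history, { role: 'user', content: newQuestion }];
}

// sample data shaped like currentConversation above (guide text is made up)
const demoConversation = {
  taskId: 'start_4_day_split_123',
  messages: [
    { type: 'assistant', content: { overview: 'alternate upper and lower days' } },
    { type: 'user', content: 'What about sets and reps?' },
    { type: 'assistant_followup', content: 'For upper body days...' }
  ]
};

const chatMessages = buildChatMessages(demoConversation, 'what about rest periods?');
console.log(chatMessages.length); // 4
```

keeping the original guide as the first message is what lets a short follow-up like "what about rest periods?" stay tied to the task.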
understanding openai embeddings
before diving into caching, it’s crucial to understand what embeddings are and why they’re revolutionary for text similarity.
what are embeddings?
embeddings convert text into high-dimensional vectors that capture semantic meaning. think of them as “fingerprints” for concepts - similar ideas get similar fingerprints.
openai’s text-embedding-3-small model transforms any text into a 1536-dimensional array of floating-point numbers:
async function getEmbedding(text, apiKey) {
const response = await fetch('https://api.openai.com/v1/embeddings', {
method: 'POST',
headers: {
'Authorization': `Bearer ${apiKey}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'text-embedding-3-small',
input: text.toLowerCase().trim(),
encoding_format: 'float'
})
});
const data = await response.json();
return data.data[0].embedding; // array of 1536 numbers
}
// "start a 4 day upper lower split" becomes:
// [0.12, -0.05, 0.73, 0.41, -0.19, 0.84, ...] // 1536 floating-point numbers
the magic of semantic similarity
the breakthrough insight: semantically similar text produces similar vectors, even with completely different words.
// these all get embedded into nearby points in 1536-dimensional space:
"start a 4 day upper lower split" // [0.12, -0.05, 0.73, 0.41, ...]
"begin 4-day upper/lower routine" // [0.11, -0.04, 0.74, 0.42, ...]
"initiate four day upper-lower workout" // [0.13, -0.06, 0.72, 0.40, ...]
// while completely different concepts are far apart:
"bake chocolate chip cookies" // [0.89, 0.34, -0.12, -0.67, ...]
the ai model has learned through massive training that these phrases represent the same underlying concept, despite using different words.
measuring similarity with cosine similarity
cosine similarity measures the angle between two vectors in high-dimensional space. the score ranges from -1.0 to 1.0, though the scores for text embeddings shown in this post all fall between 0.0 and 1.0:
function cosineSimilarity(vecA, vecB) {
// calculate dot product (how aligned the vectors are)
const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
// calculate magnitudes (lengths of the vectors)
const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
// cosine similarity = dot product / (magnitude_a * magnitude_b)
return dotProduct / (magnitudeA * magnitudeB);
}
// similarity score meanings:
// 0.95-1.0: nearly identical meaning ("start workout" vs "begin workout")
// 0.8-0.95: very similar concepts ("sets and reps" vs "repetitions")
// 0.6-0.8: related but different ("workout" vs "exercise routine")
// 0.0-0.6: unrelated concepts ("workout" vs "baking cookies")
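openai documents its embeddings as normalized to unit length, so for those vectors the dot product alone gives the same value as cosine similarity. the sketch below checks that on toy 3-dimensional vectors; `dot`, `normalize`, and `cosine` are illustrative helpers, not the app's code.

```javascript
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// scale a vector to length 1
function normalize(v) {
  const len = Math.sqrt(dot(v, v));
  return v.map(x => x / len);
}

function cosine(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const vecA = [3, 4, 0];
const vecB = [4, 3, 0];

console.log(cosine(vecA, vecB).toFixed(2));                    // "0.96"
console.log(dot(normalize(vecA), normalize(vecB)).toFixed(2)); // "0.96"

// opposite directions give the bottom of the range
console.log(cosine(vecA, vecA.map(x => -x)).toFixed(2));       // "-1.00"
```

if the stored vectors are already unit length, skipping the two magnitude calculations saves work on every comparison.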
real similarity examples from the app
here are actual similarity scores measured during development:
// high task similarity (0.92) - correctly matched
cosineSimilarity(
embedding("start a 4 day upper lower split"),
embedding("begin 4-day upper/lower routine")
) // returns 0.92
// high question similarity (0.87) - correctly matched
cosineSimilarity(
embedding("what about sets and reps?"),
embedding("how many sets and repetitions?")
) // returns 0.87
// low task similarity (0.23) - correctly rejected
cosineSimilarity(
embedding("start a 4 day upper lower split"),
embedding("bake chocolate chip cookies")
) // returns 0.23
// medium question similarity (0.64) - correctly rejected
cosineSimilarity(
embedding("what about sets and reps?"),
embedding("what about rest periods?")
) // returns 0.64 (related but different question)
why embeddings beat traditional approaches
traditional string matching would completely fail on natural language variations:
// string matching approach (brittle and inflexible)
function isSetRepsQuestion(question) {
const q = question.toLowerCase();
return q.includes("sets") && q.includes("reps");
}
isSetRepsQuestion("what about sets and reps?"); // ✅ true
isSetRepsQuestion("how many repetitions per set?"); // ❌ false (missed!)
isSetRepsQuestion("what's the rep and set scheme?"); // ❌ false (missed!)
// embedding approach (semantic understanding)
// getEmbedding is async, so the check must await both embeddings
async function isSetRepsQuestion(question, apiKey) {
const questionEmbedding = await getEmbedding(question, apiKey);
const referenceEmbedding = await getEmbedding("sets and reps", apiKey);
return cosineSimilarity(questionEmbedding, referenceEmbedding) >= 0.8;
}
await isSetRepsQuestion("what about sets and reps?", apiKey); // ✅ true (0.95)
await isSetRepsQuestion("how many repetitions per set?", apiKey); // ✅ true (0.84)
await isSetRepsQuestion("what's the rep and set scheme?", apiKey); // ✅ true (0.81)
semantic caching approach
armed with an understanding of embeddings, we can now implement intelligent caching using dual similarity matching:
dual similarity matching
the breakthrough insight: we need both the task context and question intent to match for a valid cache hit.
asking “what about sets and reps?” means completely different things for:
- “start a 4 day upper lower split” → workout programming advice
- “bake chocolate chip cookies” → nonsensical question
async function findSimilarFollowUp(taskText, question, env) {
// convert both text inputs to semantic vectors
const taskEmbedding = await getEmbedding(taskText, env.OPENAI_API_KEY);
const questionEmbedding = await getEmbedding(question, env.OPENAI_API_KEY);
// search all cached responses
const list = await env.GOAL_CACHE.list({ prefix: 'followup_' });
for (const item of list.keys) {
const cached = await env.GOAL_CACHE.get(item.name, 'json');
// calculate semantic similarity for both dimensions
const taskSimilarity = cosineSimilarity(taskEmbedding, cached.taskEmbedding);
const questionSimilarity = cosineSimilarity(questionEmbedding, cached.questionEmbedding);
// both thresholds must be exceeded for cache hit
if (taskSimilarity >= 0.85 && questionSimilarity >= 0.8) {
console.log(`Cache hit: task=${taskSimilarity.toFixed(3)}, question=${questionSimilarity.toFixed(3)}`);
return cached.answer;
}
}
return null; // no semantic match found
}
threshold tuning through experimentation
the similarity thresholds were determined through empirical testing:
- task similarity: 0.85 - tasks must be very similar
- fitness tasks vs cooking tasks score ~0.2 (correctly rejected)
- “upper lower split” vs “upper/lower routine” score ~0.92 (correctly matched)
- question similarity: 0.8 - questions can vary more linguistically
- “sets and reps” vs “repetitions” score ~0.87 (correctly matched)
- “sets and reps” vs “rest periods” score ~0.64 (correctly rejected)
too low and you get wrong answers for the wrong context, too high and you miss valid linguistic variations.
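the lookup shown earlier returns the first entry that clears both thresholds. a possible variation is to score every candidate and keep the strongest one. the sketch below assumes the similarities are already computed; `pickBestMatch` and the choice to rank by the weaker of the two scores are assumptions, and the sample answers are made up.

```javascript
const THRESHOLDS = { task: 0.85, question: 0.8 };

// candidates: [{ answer, taskSimilarity, questionSimilarity }]
function pickBestMatch(candidates, thresholds = THRESHOLDS) {
  let best = null;
  for (const c of candidates) {
    const passes =
      c.taskSimilarity >= thresholds.task &&
      c.questionSimilarity >= thresholds.question;
    if (!passes) continue;
    // rank by the weaker dimension so one strong score cannot hide a weak one
    const score = Math.min(c.taskSimilarity, c.questionSimilarity);
    if (best === null || score > best.score) best = { ...c, score };
  }
  return best;
}

const scored = [
  { answer: 'rest 60-90 seconds', taskSimilarity: 0.92, questionSimilarity: 0.64 },
  { answer: '3-4 sets of 8-12 reps', taskSimilarity: 0.92, questionSimilarity: 0.87 },
  { answer: 'cream the butter first', taskSimilarity: 0.23, questionSimilarity: 0.95 }
];

console.log(pickBestMatch(scored)?.answer); // "3-4 sets of 8-12 reps"
```

keeping the thresholds in one object also makes them easy to tune without touching the matching logic.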
cache structure evolution: from duplication to referential integrity
the initial cache structure stored complete task embeddings with every follow-up question:
// inefficient: task embeddings duplicated across follow-ups
"followup_1703512345_abc123" → {
taskText: "start a 4 day Upper Lower Split",
taskEmbedding: [0.1, 0.2, 0.3, ...], // 1536 dimensions - DUPLICATED
question: "What about sets and reps?",
questionEmbedding: [0.7, 0.8, 0.9, ...],
answer: "For upper body days, aim for 3-4 sets of 8-12 reps...",
hitCount: 23,
timestamp: "2024-01-15T10:30:00Z"
}
the optimized structure uses a relational approach with task hash references:
// task embeddings stored once and referenced by hash
"task_embedding_abc123" → {
taskText: "start a 4 day Upper Lower Split",
taskEmbedding: [0.1, 0.2, 0.3, ...], // 1536 dimensions - STORED ONCE
taskHash: "abc123",
usageCount: 15,
timestamp: "2024-01-15T10:30:00Z"
}
// follow-up entries reference the task by hash
"followup_v2_abc123_def456_1703512345" → {
taskHash: "abc123", // FOREIGN KEY REFERENCE
question: "What about sets and reps?",
questionEmbedding: [0.7, 0.8, 0.9, ...],
answer: "For upper body days, aim for 3-4 sets of 8-12 reps...",
hitCount: 23,
timestamp: "2024-01-15T10:30:00Z"
}
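the post does not show how the hashes in these keys are produced. a minimal sketch, assuming a hash of the normalized text: `shortHash` below uses 64-bit fnv-1a as a stand-in (the app may well use something else), and `taskKey` and `followupKey` are hypothetical helpers that follow the key formats above.

```javascript
// stand-in hash: 64-bit FNV-1a over the UTF-8 bytes, rendered as 16 hex chars
function shortHash(text) {
  const normalized = text.toLowerCase().trim().replace(/\s+/g, ' ');
  let hash = 0xcbf29ce484222325n; // FNV offset basis
  const prime = 0x100000001b3n;   // FNV prime
  for (const byte of new TextEncoder().encode(normalized)) {
    hash ^= BigInt(byte);
    hash = (hash * prime) & 0xffffffffffffffffn; // keep the low 64 bits
  }
  return hash.toString(16).padStart(16, '0');
}

function taskKey(taskText) {
  return `task_embedding_${shortHash(taskText)}`;
}

function followupKey(taskText, question, timestamp = Date.now()) {
  return `followup_v2_${shortHash(taskText)}_${shortHash(question)}_${timestamp}`;
}

console.log(taskKey('start a 4 day Upper Lower Split'));
console.log(followupKey('start a 4 day Upper Lower Split', 'What about sets and reps?', 1703512345));
```

a text hash only deduplicates tasks whose normalized wording is identical; differently worded tasks still get separate embeddings and rely on the similarity check.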
task embedding lookup mechanism
when a cached follow-up is found, the system performs a two-step lookup to verify task similarity:
// step 1: find potential follow-up cache matches
const followupList = await env.GOAL_CACHE.list({ prefix: 'followup_v2_' });
for (const item of followupList.keys) {
const cached = await env.GOAL_CACHE.get(item.name, 'json');
if (cached && cached.taskHash && cached.questionEmbedding) {
// step 2: lookup task embedding by hash reference
const cachedTaskEmbedding = await getTaskEmbedding(cached.taskHash, env);
if (cachedTaskEmbedding) {
// step 3: verify both task and question similarity
const taskSimilarity = cosineSimilarity(currentTaskEmbedding, cachedTaskEmbedding);
const questionSimilarity = cosineSimilarity(currentQuestionEmbedding, cached.questionEmbedding);
if (taskSimilarity >= 0.85 && questionSimilarity >= 0.8) {
return cached.answer; // cache hit!
}
}
}
}
the getTaskEmbedding() function resolves the hash reference:
async function getTaskEmbedding(taskHash, env) {
const taskCacheKey = `task_embedding_${taskHash}`;
const cachedTask = await env.GOAL_CACHE.get(taskCacheKey, 'json');
if (cachedTask && cachedTask.taskEmbedding) {
return cachedTask.taskEmbedding; // return the 1536-dimensional vector
}
return null; // task embedding not found (shouldn't happen in normal operation)
}
the reality: javascript does all the work
important distinction: cloudflare kv is a simple key-value store with no query capabilities. it only supports:
- get(key) - retrieve value by exact key match
- put(key, value) - store value at key
- list({ prefix }) - list keys with a given prefix
- delete(key) - remove key
kv does not have:
- sql queries
- indexing beyond key prefixes
- semantic search capabilities
- similarity functions
- relational joins
this means all semantic matching happens in javascript:
// what actually happens during cache lookup:
// 1. javascript fetches ALL follow-up cache entries (brute force)
const followupList = await env.GOAL_CACHE.list({ prefix: 'followup_v2_', limit: 50 });
// 2. javascript loops through each entry one by one
for (const item of followupList.keys) {
const cached = await env.GOAL_CACHE.get(item.name, 'json'); // 🔥 KV API CALL
// 3. javascript makes ANOTHER kv call to get the task embedding
const cachedTaskEmbedding = await getTaskEmbedding(cached.taskHash, env); // 🔥 ANOTHER KV API CALL
// 4. javascript calculates cosine similarity in memory
const taskSimilarity = cosineSimilarity(currentTaskEmbedding, cachedTaskEmbedding);
const questionSimilarity = cosineSimilarity(currentQuestionEmbedding, cached.questionEmbedding);
// 5. javascript evaluates thresholds
if (taskSimilarity >= 0.85 && questionSimilarity >= 0.8) {
return cached.answer; // found match!
}
}
performance implications:
- searching 50 follow-up entries = 50 get() calls to kv
- each entry requires another get() call for task embedding = 50 more calls
- total: 100 kv api calls for a single cache lookup
- each call has ~5-15ms latency from edge to kv storage
- semantic similarity calculations happen in cloudflare’s v8 javascript runtime
this is why we limit searches (limit: 50) and use early return on first match - the “database” is actually just a distributed hashtable with javascript doing all the intelligent work.
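two cheap mitigations for the call count are to issue the get() calls in parallel and to fetch each task embedding only once per hash. the sketch below is an illustration, not the app's code: `loadCandidates` is a hypothetical helper and `fakeKv` is an in-memory stand-in for a kv namespace that counts calls.

```javascript
// fetch follow-ups in parallel and look up each task embedding once per hash
async function loadCandidates(kv, keys) {
  const followups = await Promise.all(keys.map(k => kv.get(k, 'json')));
  const taskCache = new Map(); // taskHash -> promise of the task record
  const getTask = hash => {
    if (!taskCache.has(hash)) {
      taskCache.set(hash, kv.get(`task_embedding_${hash}`, 'json'));
    }
    return taskCache.get(hash);
  };
  return Promise.all(
    followups
      .filter(f => f && f.taskHash)
      .map(async f => {
        const task = await getTask(f.taskHash);
        return { ...f, taskEmbedding: task ? task.taskEmbedding : null };
      })
  );
}

// in-memory stand-in for a KV namespace that counts get() calls
function fakeKv(data) {
  return {
    calls: 0,
    async get(key) {
      this.calls += 1;
      return data[key] ?? null;
    }
  };
}

const demoKv = fakeKv({
  followup_v2_abc123_q1_1: { taskHash: 'abc123', question: 'sets and reps?' },
  followup_v2_abc123_q2_2: { taskHash: 'abc123', question: 'how often?' },
  task_embedding_abc123: { taskEmbedding: [0.1, 0.2, 0.3] }
});

loadCandidates(demoKv, ['followup_v2_abc123_q1_1', 'followup_v2_abc123_q2_2'])
  .then(list => console.log(list.length, demoKv.calls)); // 2 candidates, 3 get() calls instead of 4
```

this only trims the constant factor. the search is still linear in the number of cached entries, and platform limits on concurrent requests may cap how much the parallelism helps, so it is worth measuring.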
why O(1) lookup is impossible with embeddings
the fundamental problem: you cannot create deterministic cache keys from semantic similarity.
// two questions that mean the same thing to humans
const question1 = "what about sets and reps?";
const question2 = "how many repetitions?";
// their embeddings are close together, but they are still not identical number arrays
embedding(question1) → [0.123, 0.456, 0.789, ...] // 1536 numbers
embedding(question2) → [0.187, 0.423, 0.801, ...] // 1536 different numbers
// to create cache keys, we hash the text (not the embeddings)
hash(question1) → "abc123" // deterministic based on exact text
hash(question2) → "xyz789" // different text = different hash
// KV can only find exact key matches
await env.GOAL_CACHE.get("followup_task1_abc123"); // ✅ finds cached answer for "sets and reps"
await env.GOAL_CACHE.get("followup_task1_xyz789"); // ❌ cache miss for "repetitions"
// even though humans know these questions are asking the same thing!
what if we hashed the embeddings instead?
// you could hash the embedding arrays...
embedding(question1) → [0.123, 0.456, 0.789, ...]
hash([0.123, 0.456, 0.789, ...]) → "def456"
embedding(question2) → [0.187, 0.423, 0.801, ...]
hash([0.187, 0.423, 0.801, ...]) → "ghi789"
// but you still get different hashes for similar meanings!
// hashing doesn't make semantically similar vectors produce similar hashes
the problem: hash functions are designed to produce completely different outputs for even tiny input changes. this is the opposite of what we want for semantic similarity.
// tiny difference in embeddings = completely different hash
embedding("sets and reps") → hash → "abc123"
embedding("sets and reps!") → hash → "xyz789" // just added "!"
embedding("reps and sets") → hash → "def456" // just swapped order
// semantic similarity is about finding vectors that are *close* in high-dimensional space
// but hash functions are designed to make similar inputs produce *distant* outputs
to find semantic matches, you must compare embeddings:
// the only way to know if two questions are similar:
const similarity = cosineSimilarity(
embedding("how many repetitions?"), // user's question
embedding("what about sets and reps?") // cached question
); // returns 0.87 - they ARE similar!
// but you can't know this without calculating similarity for every cached question
attempted workarounds and why they fail:
// ❌ canonical mapping: requires manual maintenance, misses variations
const canonicalMap = {
"sets_and_reps": ["sets and reps", "repetitions", "how many reps"],
// what about "rep count"? "set/rep scheme"? "lifting numbers"?
};
// ❌ embedding bucketing: complex, approximate, still requires similarity search
function bucketEmbedding(embedding) {
return embedding.slice(0, 10).map(x => Math.round(x * 100)).join('_');
}
const bucket = bucketEmbedding(questionEmbedding); // still need O(k) search in bucket
// ❌ locality-sensitive hashing: difficult to implement correctly, approximate results
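for reference, the locality-sensitive hashing idea mentioned above can be sketched with random hyperplanes: each plane contributes one bit depending on which side of it the vector falls, so nearby vectors tend to share most bits. this is a toy illustration with assumed helper names (`mulberry32`, `makeHyperplanes`, `lshSignature`), and it draws plane components uniformly for brevity where gaussian components are the usual choice.

```javascript
// small seeded PRNG so the hyperplanes are reproducible across runs
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// `count` random hyperplanes through the origin, each with `dims` components in [-1, 1)
function makeHyperplanes(count, dims, seed = 42) {
  const rand = mulberry32(seed);
  return Array.from({ length: count }, () =>
    Array.from({ length: dims }, () => rand() * 2 - 1)
  );
}

// one bit per hyperplane: the sign of the dot product with that plane's normal
function lshSignature(vector, planes) {
  return planes
    .map(p => (p.reduce((s, w, i) => s + w * vector[i], 0) >= 0 ? '1' : '0'))
    .join('');
}

const planes = makeHyperplanes(8, 4);
const sample = [0.12, -0.05, 0.73, 0.41];

// direction decides the signature, so scaling a vector leaves it unchanged
console.log(lshSignature(sample, planes) === lshSignature(sample.map(x => x * 2), planes)); // true
```

the signature could serve as a bucket key, but vectors near a plane can land in different buckets, so results stay approximate and a similarity check within the bucket is still needed.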
the harsh reality: if you need semantic similarity, you need either:
- O(n) search through all candidates (what we built)
- specialized vector database with optimized similarity algorithms
key-value stores excel at exact lookups, but semantic similarity requires mathematical comparison of high-dimensional vectors - a fundamentally different operation.
performance reality check
this approach has significant scalability issues:
| cache size | kv api calls | lookup latency | bottleneck |
|---|---|---|---|
| 10 entries | ~20 calls | ~100-300ms | acceptable |
| 50 entries | ~100 calls | ~500-1500ms | slower than openai |
| 100 entries | ~200 calls | ~1000-3000ms | unusable |
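the figures in the table follow from simple arithmetic. a minimal sketch, assuming two sequential kv get() calls per cached entry and the 5-15ms per-call latency quoted earlier (`estimateLookup` is an illustrative helper):

```javascript
// assumptions from the text: 2 get() calls per entry, 5-15ms per call, run sequentially
function estimateLookup(entries, msPerCall = { min: 5, max: 15 }) {
  const calls = entries * 2;
  return {
    calls,
    minMs: calls * msPerCall.min,
    maxMs: calls * msPerCall.max
  };
}

for (const n of [10, 50, 100]) {
  const { calls, minMs, maxMs } = estimateLookup(n);
  console.log(`${n} entries: ~${calls} calls, ~${minMs}-${maxMs}ms`);
}
// 10 entries: ~20 calls, ~100-300ms
// 50 entries: ~100 calls, ~500-1500ms
// 100 entries: ~200 calls, ~1000-3000ms
```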
why this can be slower than calling openai directly:
- openai api: 1 call, ~2000ms response time
- our cache lookup: 100+ kv calls, potentially 1500ms+ just for network overhead
- plus javascript cpu time for 50+ cosine similarity calculations
architectural limitations:
- o(n) search complexity - performance degrades linearly with cache size
- api call explosion - each cache lookup requires dozens of network requests
- no indexing - cloudflare kv provides no query optimization beyond key prefixes
- cpu intensive - 1536-dimensional vector math in javascript runtime
better approaches for production scale
1. cloudflare vectorize (purpose-built for this):
// cloudflare's native vector database - perfect fit
const results = await env.VECTORIZE_INDEX.query(questionEmbedding, {
filter: { taskHash: { $eq: "abc123" } }, // filter by task first
topK: 3,
returnMetadata: true
}); // single api call, sub-100ms response, same infrastructure
// would store vectors like:
await env.VECTORIZE_INDEX.upsert([{
id: "followup_abc123_def456",
values: questionEmbedding, // 1536 dimensions
metadata: {
taskHash: "abc123",
question: "what about sets and reps?",
answer: "For upper body days, aim for 3-4 sets...",
taskSimilarity: 0.92 // pre-computed for filtering
}
}]);
2. third-party vector databases:
// pinecone, weaviate, or similar if you need features vectorize lacks
const results = await vectorDB.query({
vector: questionEmbedding,
filter: { taskSimilarity: { $gte: 0.85 } },
topK: 1
}); // single api call, sub-100ms response
3. pre-computed similarity indices:
// compute similarities at write-time, not read-time
"task_questions_abc123" → {
taskHash: "abc123",
questions: [
{ text: "sets and reps?", embedding: [...], answers: ["key1", "key2"] },
{ text: "how often?", embedding: [...], answers: ["key3"] }
]
} // one kv call gets all questions for a task
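with a per-task bundle like this, the read path becomes one kv get followed by an in-memory scan. the sketch below covers only the scan; `findBestQuestion` is a hypothetical helper and the 3-dimensional embeddings are toy values standing in for real 1536-dimensional vectors.

```javascript
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const magB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (magA * magB);
}

// scan the questions stored for one task and keep the closest one above the threshold
function findBestQuestion(bundle, questionEmbedding, threshold = 0.8) {
  let best = null;
  for (const q of bundle.questions) {
    const similarity = cosineSimilarity(questionEmbedding, q.embedding);
    if (similarity >= threshold && (!best || similarity > best.similarity)) {
      best = { text: q.text, answers: q.answers, similarity };
    }
  }
  return best;
}

const bundle = {
  taskHash: 'abc123',
  questions: [
    { text: 'sets and reps?', embedding: [1, 0, 0], answers: ['key1', 'key2'] },
    { text: 'how often?', embedding: [0, 1, 0], answers: ['key3'] }
  ]
};

console.log(findBestQuestion(bundle, [0.9, 0.1, 0])?.text); // "sets and reps?"
console.log(findBestQuestion(bundle, [0, 0, 1]));           // null
```

the scan is still linear, but only over the questions for one task, and it costs no extra network calls.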
4. hybrid approach with similarity caching:
// cache the similarity calculations themselves
"question_matches_def456" → {
questionHash: "def456",
matches: [
{ followupKey: "followup_v2_abc123_ghi789", similarity: 0.87 },
{ followupKey: "followup_v2_xyz789_mno123", similarity: 0.82 }
],
timestamp: "2024-01-15T10:30:00Z"
} // amortize expensive similarity calculations
the vectorize migration
after running the kv-based system in production and validating user demand, we migrated to cloudflare vectorize for true semantic caching.
migration results
performance improvements:
- cache lookup: 500-1500ms → <100ms
- api calls: 100+ kv operations → 1 vectorize query
- cache matching: brute-force similarity in javascript → native vector search
real-world impact:
// before: brute force kv iteration
const followupList = await env.GOAL_CACHE.list({ prefix: 'followup_v2_', limit: 50 });
for (const item of followupList.keys) {
const cached = await env.GOAL_CACHE.get(item.name, 'json'); // 50+ api calls
// ... similarity calculations in javascript
}
// after: native vectorize query
const matches = await env.SEMANTIC_CACHE.query(questionEmbedding, {
filter: {
type: { $eq: "followup" },
taskHash: { $eq: taskHash }
},
topK: 3,
returnMetadata: "all"
}); // single api call with hardware-accelerated similarity search
unified semantic cache setup
single vectorize index handles all cache types:
// goal generation cache
{
id: "goal_abc123_timestamp",
values: problemEmbedding,
metadata: {
type: "goal_generation",
problem: "How do I learn Spanish?",
goals: "{...}"
}
}
// how-to guide cache
{
id: "guide_def456_timestamp",
values: taskEmbedding,
metadata: {
type: "how_to_guide",
taskText: "start 4 day upper lower split",
guide: "{...}"
}
}
// follow-up question cache
{
id: "followup_ghi789_timestamp",
values: questionEmbedding,
metadata: {
type: "followup",
taskHash: "def456",
question: "what about sets and reps?",
answer: "..."
}
}
production migration approach
phase 1: dual system with fallbacks
async function findSimilarFollowUpVectorize(taskText, question, env) {
if (!env.SEMANTIC_CACHE) {
// fallback to legacy kv caching if vectorize unavailable
return await findSimilarFollowUp(taskText, question, env);
}
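try {
// embed the question, then run the vectorize query
const questionEmbedding = await getEmbedding(question, env.OPENAI_API_KEY);
const result = await env.SEMANTIC_CACHE.query(questionEmbedding, {...});
// vectorize returns { matches: [...] }, so read the nested array
return result.matches?.length > 0 ? result.matches[0].metadata.answer : null;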
} catch (error) {
console.error('vectorize error:', error);
// graceful fallback to kv on errors
return await findSimilarFollowUp(taskText, question, env);
}
}
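a vector query returns the nearest neighbours even when none of them is actually close, so a score check is a sensible guard before serving a cached answer. the sketch below assumes the `{ matches: [{ id, score, metadata }] }` result shape used in this post and an index created with the cosine metric, so that `score` is a cosine similarity; `pickAnswer` and the 0.8 cutoff are illustrative.

```javascript
// return the best cached answer whose score clears the cutoff, or null
function pickAnswer(result, minScore = 0.8) {
  const candidates = (result?.matches ?? []).filter(
    m => m.score >= minScore && m.metadata?.answer
  );
  if (candidates.length === 0) return null;
  candidates.sort((x, y) => y.score - x.score); // highest score first
  return candidates[0].metadata.answer;
}

// made-up query result for illustration
const sampleResult = {
  matches: [
    { id: 'followup_a', score: 0.64, metadata: { answer: 'rest 60-90 seconds' } },
    { id: 'followup_b', score: 0.87, metadata: { answer: '3-4 sets of 8-12 reps' } }
  ]
};

console.log(pickAnswer(sampleResult));      // "3-4 sets of 8-12 reps"
console.log(pickAnswer(sampleResult, 0.9)); // null
console.log(pickAnswer({ matches: [] }));   // null
```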
phase 2: monitoring and validation
- cache headers distinguish systems: X-Cache-Status: HIT-VECTORIZE vs HIT
- performance monitoring shows dramatic improvements
- error rates remain low with reliable fallbacks
why the migration made sense:
- architectural mismatch resolved - we were forcing kv (key-value store) to do similarity search (vector database job)
- performance unpredictability - 100+ api calls per cache lookup could be slower than openai depending on network conditions
- complexity reduction - eliminated 200+ lines of similarity calculation code
- timing alignment - vectorize became available when we needed it
engineering reality: we built a working but inefficient system, then migrated when better infrastructure became available. the kv approach taught us the requirements and validated user demand before committing to specialized tooling.
why this matters: the taskHash acts as a foreign key that enables semantic deduplication while maintaining referential integrity. multiple follow-up questions about the same task concept share a single task embedding, but each has its own question embedding and cached answer.
this creates a many-to-one relationship where:
- one task embedding (e.g., “start 4 day upper lower split”)
- supports multiple follow-up caches (e.g., “sets and reps?”, “how often?”, “what weight?”)
- without duplicating the expensive 1536-dimensional task vector
cross-user cache benefits
the magic happens when multiple users ask similar questions:
user a: “start 4 day upper lower split” + “what about sets and reps?” → cache miss → ai generates → cached
user b: “begin 4-day upper/lower routine” + “sets and repetitions?” → cache hit → instant response
the semantic matching catches variations:
- “What about sets and reps?” ≈ “How many sets and repetitions?”
- “start 4 day upper lower split” ≈ “begin upper/lower 4-day routine”
- but rejects wrong contexts: “sets and reps” + “bake cookies” → no match
performance characteristics
two distinct response patterns:
| scenario | network | openai cost | latency | hit rate |
|---|---|---|---|---|
| cache miss + openai | openai api | ~$0.005 | ~6.5s | 0% |
| vectorize cache hit | vectorize query | $0 | ~2.1s | 40-60% |
vectorize caching implementation
the system now implements unified vectorize caching with semantic similarity search:
single index, three cache types
all caching flows through one vectorize index with metadata-based filtering:
// goal generation vectors
{
id: "goal_776b24e8b76a40ad_1753846452714",
values: problemEmbedding, // 1536 dimensions
metadata: {
type: "goal_generation",
problemHash: "776b24e8b76a40ad",
problem: "get jacked",
goals: "{\"today\":[...], \"month\":[...], \"year\":[...]}"
}
}
// how-to guide vectors
{
id: "guide_3933f9b7ab9b0f6f_1753847607690",
values: taskEmbedding, // 1536 dimensions
metadata: {
type: "how_to_guide",
taskHash: "3933f9b7ab9b0f6f",
taskText: "complete a 30-minute strength training workout",
guide: "{\"overview\":\"...\", \"steps\":[...], \"proTip\":\"...\"}"
}
}
// follow-up question vectors
{
id: "followup_3933f9b7ab9b0f6f_5d67953f891f8d41_1753847633760",
values: questionEmbedding, // 1536 dimensions
metadata: {
type: "followup",
taskHash: "3933f9b7ab9b0f6f", // links to how-to guide
question: "how to do a lunge",
answer: "To perform a lunge: 1. Stand with your feet hip-width apart..."
}
}
vectorize query patterns
goal generation lookup:
const matches = await env.SEMANTIC_CACHE.query(problemEmbedding, {
filter: { type: { $eq: "goal_generation" } },
topK: 3,
returnMetadata: "all"
});
how-to guide lookup:
const matches = await env.SEMANTIC_CACHE.query(taskEmbedding, {
filter: { type: { $eq: "how_to_guide" } },
topK: 3,
returnMetadata: "all"
});
follow-up question lookup:
const matches = await env.SEMANTIC_CACHE.query(questionEmbedding, {
filter: {
type: { $eq: "followup" },
taskHash: { $eq: taskHash }
},
topK: 3,
returnMetadata: "all"
});
performance: native vector similarity search with hardware acceleration, ~2.1s cache hits vs ~6.5s cache misses in production
cost analysis
cost breakdown for 1000 users on similar fitness tasks:
without any caching:
- initial guides: 1000 × $0.005 = $5.00
- follow-up questions: 1000 × $0.005 = $5.00
- task embeddings: 2000 × $0.0001 = $0.20
- total: $10.20
with complete semantic caching:
- initial guides: 1 × $0.005 = $0.005
- follow-up questions: varies by uniqueness ≈ $0.05
- task embeddings: 10 × $0.0001 = $0.001
- total: $0.056 (99.5% savings)
the first user asking about any topic creates a “knowledge seed” that benefits all future users with similar needs.
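the savings figure can be reproduced with a few lines of arithmetic. the sketch below uses the per-call prices quoted above; treating the "≈ $0.05" follow-up line as 10 unique questions is an assumption, and `totalCost` is an illustrative helper.

```javascript
const PRICES = { completion: 0.005, embedding: 0.0001 }; // per-call figures from the breakdown

// items: [count, unitPrice] pairs
function totalCost(items) {
  return items.reduce((sum, [count, unitPrice]) => sum + count * unitPrice, 0);
}

const uncached = totalCost([
  [1000, PRICES.completion], // initial guides
  [1000, PRICES.completion], // follow-up questions
  [2000, PRICES.embedding]   // embeddings
]);

const cached = totalCost([
  [1, PRICES.completion],    // one guide generated, then served from cache
  [10, PRICES.completion],   // assumed 10 unique follow-ups (≈ $0.05)
  [10, PRICES.embedding]     // embeddings
]);

const savingsPct = (1 - cached / uncached) * 100;
console.log(uncached.toFixed(2));   // "10.20"
console.log(savingsPct.toFixed(1)); // "99.5"
```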
implementation details
rate limiting with graceful degradation
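if (response.status === 429) {
// read the error payload; fall back to an empty object if it is not json
const errorData = await response.json().catch(() => ({}));
const retryAfter = errorData.retryAfter || 60;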
const minutes = Math.ceil(retryAfter / 60);
throw new Error(
`Too many questions! Please wait ${minutes} minute${minutes !== 1 ? 's' : ''} before asking another.`
);
}
cache warming effects
the vectorize system creates organic cache warming through semantic similarity:
goal generation caching:
- “get jacked” → instant responses for “build muscle”, “gain strength”, “get buff”
- “learn spanish” → instant responses for “study spanish”, “spanish fluency”
- each problem type builds a knowledge base that benefits similar requests
how-to guide caching:
- “start 4 day upper lower split” → instant for “begin upper/lower routine”, “4-day workout plan”
- “bake chocolate chip cookies” → instant for “make chocolate cookies”, “cookie baking”
- semantic matching catches variations without exact string matches
follow-up question caching:
- “what about sets and reps?” cached once, serves “how many repetitions?”, “sets and repetitions?”
- “what temperature?” cached once, serves “baking temperature?”, “oven temp?”
- context-aware caching ensures answers match the original task domain
monitoring and debugging
cache monitoring tracks vectorize performance:
// cache status headers distinguish between systems
return new Response(JSON.stringify(response), {
headers: {
'X-Cache-Status': cached ? 'HIT-VECTORIZE' : 'MISS-VECTORIZE',
// header values must be strings, so provide fallbacks for missing data
'X-Cache-Similarity-Score': bestMatch?.score?.toFixed(3) ?? 'n/a',
'X-Cache-Vector-Count': String(matches.matches?.length ?? 0)
}
});
// console logging shows vectorize query results
console.log('=== GOAL GENERATION REQUEST START ===');
console.log('SEMANTIC_CACHE available:', !!env.SEMANTIC_CACHE);
console.log('Goal generation raw Vectorize response:', JSON.stringify(matches, null, 2));
console.log(`Goal generation search found ${matches.matches?.length || 0} potential matches`);
vectorize vector id structure enables easy debugging:
- goal_{problemHash}_{timestamp} - goal generation vectors
- guide_{taskHash}_{timestamp} - how-to guide vectors
- followup_{taskHash}_{questionHash}_{timestamp} - follow-up q&a vectors
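because the ids are structured, a small parser can turn any id seen in logs or cli output back into its parts. `parseVectorId` below is a hypothetical debugging helper that assumes the three id formats listed above and hashes that contain no underscores.

```javascript
// split a vector id into its kind, hashes, and timestamp; returns null if unrecognized
function parseVectorId(id) {
  const parts = id.split('_');
  const timestamp = Number(parts[parts.length - 1]);
  if (parts[0] === 'goal' && parts.length === 3) {
    return { kind: 'goal', problemHash: parts[1], timestamp };
  }
  if (parts[0] === 'guide' && parts.length === 3) {
    return { kind: 'guide', taskHash: parts[1], timestamp };
  }
  if (parts[0] === 'followup' && parts.length === 4) {
    return { kind: 'followup', taskHash: parts[1], questionHash: parts[2], timestamp };
  }
  return null;
}

const parsed = parseVectorId('followup_3933f9b7ab9b0f6f_5d67953f891f8d41_1753847633760');
console.log(parsed.kind, parsed.taskHash, parsed.questionHash);
// followup 3933f9b7ab9b0f6f 5d67953f891f8d41

// the trailing number is a millisecond timestamp, so it converts directly to a date
console.log(new Date(parsed.timestamp).toISOString());
```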
debug cli queries for troubleshooting:
# check metadata indexes
npx wrangler vectorize list-metadata-index semantic-cache
# query by type filter
npx wrangler vectorize query semantic-cache --vector [...] --filter '{"type": "goal_generation"}'
# check vector count
npx wrangler vectorize info semantic-cache
can monitor hit rates in browser devtools and server logs, plus use cli tools to debug vectorize filtering issues.
alternatives considered
- simple string matching: too brittle, misses semantic variations
- single embedding per question: loses task context, wrong answers
- llm-based similarity: too expensive for cache lookup
- hash-based caching: can’t handle natural language variations
- redis/database: unnecessary infrastructure complexity
lessons learned
migration insights
- start simple, upgrade strategically - kv validation → vectorize optimization
- reliable fallbacks enable confidence - dual systems during migration prevent downtime
- semantic caching compounds with usage - each user benefits from all previous interactions
- infrastructure timing matters - vectorize ga made the migration viable
- monitoring distinguishes systems - cache headers enable performance comparison
technical learnings
- metadata filtering is crucial - type and taskHash filters prevent wrong context matches
- similarity thresholds matter - 0.8+ cosine similarity for reliable semantic matching
- cache warming is organic - no need to pre-populate, users do it naturally
- conversation state is fragile - clear on modal close to prevent bugs
- graceful degradation - if caching fails, still call openai
- vectorize beats kv - native similarity search vs javascript iterations
production results
performance improvements measured:
- response latency: 6.5s (cache miss) → 2.1s (cache hit), ~3x faster
- api call reduction: 100+ kv operations → 1 vectorize query
- cost reduction: 99.5% for popular interactions
- cache accuracy: improved semantic matching vs string similarity
user experience impact:
- instant responses for cached questions
- better cross-user knowledge sharing
- more consistent ai guidance
- reduced rate limiting due to cache hits
metadata indexing gotcha: vectors vs indexes timing
after deploying the vectorize migration, we discovered a critical timing issue that broke filtering for goal generation and follow-up caching.
the problem: retroactive indexing doesn’t work
what happened:
- vectors were inserted into vectorize with metadata like type: "goal_generation"
- metadata indexes were created later via wrangler vectorize create-metadata-index
- filtered queries returned 0 results despite vectors having the correct metadata
// vectors inserted BEFORE metadata index creation
// ❌ cannot be found by filtered queries
const matches = await env.SEMANTIC_CACHE.query(embedding, {
filter: { type: { $eq: "goal_generation" } }
}); // returns 0 results
// ✅ but can be found by unfiltered queries
const allMatches = await env.SEMANTIC_CACHE.query(embedding, {
topK: 10
}); // returns vectors with metadata intact
the discovery process
debugging revealed the issue:
# unfiltered query found 10 vectors including goals
npx wrangler vectorize query semantic-cache --vector [...] --top-k 10
# filtered query found only 2 newer vectors
npx wrangler vectorize query semantic-cache --vector [...] --filter '{"type": "goal_generation"}'
key insight: cloudflare vectorize metadata indexes only apply to vectors inserted after the index creation. existing vectors become invisible to filtered queries.
the solution: nuclear option
since the site had few users, we chose the clean slate approach:
# delete entire index
npx wrangler vectorize delete semantic-cache
# recreate with metadata indexes first
npx wrangler vectorize create semantic-cache --dimensions 1536 --metric cosine
npx wrangler vectorize create-metadata-index semantic-cache --propertyName=type --type=string
npx wrangler vectorize create-metadata-index semantic-cache --propertyName=taskHash --type=string
result: all new vectors are properly indexed and findable by filtered queries.
lessons for production systems
1. create metadata indexes before inserting vectors
# ✅ correct order for new vectorize setup
# step 1: create vectorize index
npx wrangler vectorize create semantic-cache --dimensions 1536 --metric cosine
# step 2: create metadata indexes BEFORE inserting any vectors
npx wrangler vectorize create-metadata-index semantic-cache --propertyName=type --type=string
npx wrangler vectorize create-metadata-index semantic-cache --propertyName=taskHash --type=string
# step 3: verify indexes are ready
npx wrangler vectorize list-metadata-index semantic-cache
# step 4: NOW safe to insert vectors with metadata
# vectors inserted after this point will be findable by filtered queries
in your application code:
// this will work correctly because metadata indexes exist
await env.SEMANTIC_CACHE.upsert([{
id: "goal_abc123_timestamp",
values: problemEmbedding,
metadata: {
type: "goal_generation", // ✅ indexed
problemHash: "abc123" // ⚠️ filterable only if a metadata index exists for problemHash (the index created above is on taskHash)
}
}]);
// filtered queries will find the vector
const matches = await env.SEMANTIC_CACHE.query(embedding, {
filter: { type: { $eq: "goal_generation" } } // ✅ works
});
2. vectorize lacks reindexing capabilities
unlike elasticsearch or other databases, vectorize doesn't offer:
- a reindex command to rebuild metadata indexes
- bulk export/import for data migration
- retroactive index application
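one workaround that may be worth trying on a test index before deleting anything: writing the existing vectors back after the metadata index exists. as far as we can tell from cloudflare's metadata filtering docs, re-upserted vectors get picked up by the index, but treat that as an assumption to verify. the sketch below assumes `index` is a vectorize binding exposing `getByIds` and `upsert`, that you already track vector ids somewhere (vectorize won't enumerate them for you from a worker), and that `batchSize` stays within current platform limits. `reupsertVectors` is a hypothetical name.

```javascript
// hypothetical helper: reads known vectors and writes them back unchanged,
// so metadata indexes created after the original insert can see them
async function reupsertVectors(index, ids, batchSize = 20) {
  let rewritten = 0;
  for (let i = 0; i < ids.length; i += batchSize) {
    const batch = ids.slice(i, i + batchSize);
    // getByIds returns the stored vectors (id, values, metadata) for ids that exist
    const vectors = await index.getByIds(batch);
    if (vectors.length === 0) continue;
    // upsert overwrites by id, so values and metadata are preserved as-is
    await index.upsert(vectors);
    rewritten += vectors.length;
  }
  return rewritten;
}
```

usage would look like `await reupsertVectors(env.SEMANTIC_CACHE, knownIds)`, followed by a filtered query to confirm the vectors are now findable.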
3. migration strategies for production data
- small datasets: nuclear option (delete/recreate)
- large datasets: build custom export/import pipeline
- critical systems: implement dual-write during transition
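the dual-write option could look something like the sketch below. it's an illustration under assumptions, not something we ran in production: `createDualWriteCache` is a hypothetical wrapper, and `oldIndex` / `newIndex` are assumed to be two vectorize bindings (the old one without working filters, the new one created with metadata indexes first).

```javascript
// hypothetical wrapper for the transition period:
// writes go to both indexes, reads prefer the new index and fall back to the old one
function createDualWriteCache(oldIndex, newIndex) {
  return {
    async upsert(vectors) {
      // allSettled so a failure on one side doesn't drop the write on the other
      const results = await Promise.allSettled([
        oldIndex.upsert(vectors),
        newIndex.upsert(vectors),
      ]);
      const failed = results.filter((r) => r.status === 'rejected');
      // only surface an error when both writes failed
      if (failed.length === results.length) throw failed[0].reason;
      return { written: results.length - failed.length };
    },
    async query(embedding, options) {
      const fresh = await newIndex.query(embedding, options);
      if (fresh.matches.length > 0) return fresh;
      // nothing in the new index yet: try the old one with the same options
      return oldIndex.query(embedding, options);
    },
  };
}
```

note that the fallback read passes the same options to the old index, so filtered reads there will still miss vectors inserted before the metadata index existed; during the transition you may need an unfiltered query plus a client-side check on `metadata.type`. once the new index has warmed up, the wrapper can be replaced by the new binding alone.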
4. monitoring metadata filtering
add debug logging to detect filtering issues:
const filteredMatches = await env.SEMANTIC_CACHE.query(embedding, {
filter: { type: { $eq: "goal_generation" } }
});
if (filteredMatches.matches.length === 0) {
console.warn('Filtered query returned 0 results - check metadata indexes');
}
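a zero-result warning also fires on a cold cache, so it can't tell "nothing cached yet" apart from "index is missing". a slightly stronger check, mirroring the filtered vs unfiltered comparison from the debugging session above, is sketched here. `checkMetadataFilter` is a hypothetical name, and the `returnMetadata: 'all'` option is assumed to be available on your vectorize version.

```javascript
// hypothetical diagnostic: if the unfiltered neighbours contain vectors of the
// requested type that the filtered query did not return, the metadata index
// probably doesn't cover them
async function checkMetadataFilter(index, embedding, type, topK = 10) {
  const [filtered, unfiltered] = await Promise.all([
    index.query(embedding, { topK, filter: { type: { $eq: type } } }),
    index.query(embedding, { topK, returnMetadata: 'all' }),
  ]);
  // how many of the nearest vectors overall carry the requested type
  const expected = unfiltered.matches
    .filter((m) => m.metadata && m.metadata.type === type).length;
  const found = filtered.matches.length;
  // a working filter should return at least those same vectors
  const indexLikelyMissing = expected > found;
  if (indexLikelyMissing) {
    console.warn(`filter on type="${type}" found ${found}, expected at least ${expected} - check metadata indexes`);
  }
  return { expected, found, indexLikelyMissing };
}
```

this doubles the query count, so it's better suited to a periodic health check or a debug flag than to every request.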
this metadata indexing gotcha cost us a few hours of debugging but taught valuable lessons about vectorize operational characteristics that aren’t well documented.
conclusion
semantic caching with vector databases transforms ai applications from expensive, slow interactions into fast, cost-effective systems that improve with every user.
key insights for ai application developers:
- use the right tool for the job - vector databases excel at similarity search, traditional databases at exact lookups
- semantic caching compounds with usage - each user interaction creates value for all future similar requests
- ai responses don’t need to be unique - they need to be contextually appropriate and fast
- metadata filtering is crucial - combine semantic similarity with structured filters for precise results
- create metadata indexes before inserting vectors - cloudflare vectorize doesn't apply them retroactively, and other vector databases may behave the same way
when to use vector databases for caching:
- ✅ user queries have natural language variations (“sets and reps” vs “repetitions”)
- ✅ exact string matching misses too many valid cache hits
- ✅ content generation is expensive (time or cost)
- ✅ semantic similarity matters more than exact matches
- ❌ simple key-value lookups work fine
- ❌ transactional consistency is required
bottom line: if you’re building ai applications with expensive generation costs, semantic caching with vector databases can deliver 3x performance improvements and 99%+ cost reductions for popular content. the infrastructure investment pays for itself quickly through improved user experience and reduced ai api costs.
try the live system: actuallydostuff.com - click the ❓ on any goal to experience ~2s cached responses vs ~6s generated responses, powered by cloudflare vectorize.