ai-powered goal setting with smart caching

29 Jul 2025

built a goal-setting app that transforms problems into actionable smart goals, then provides conversational ai guidance for each task. the challenge? making ai interactions fast, cost-effective, and genuinely helpful. the solution? a multi-layer caching approach with semantic similarity matching.

the problem

traditional goal-setting apps are static lists. when users get stuck on “start a 4-day upper lower split” or “learn spanish,” they’re on their own. adding ai help seems obvious, but creates new problems:

cost escalation: every “how do I do this?” question hits expensive openai apis
response inconsistency: same question gets different answers due to ai randomness
context loss: follow-up questions lack memory of the original task
user frustration: waiting 2+ seconds for common questions like “what about sets and reps?”

how it works

implemented a simple flow: users describe problems, get structured goals, then ask follow-up questions about how to complete them.

layer 1: problem to smart goals

transforms user problems into specific, measurable, achievable, relevant, and time-bound goals:

// user input: "How do I quit smoking?"
const response = await fetch('/api/generate-goals', {
  method: 'POST',
  headers: { 'Authorization': `Bearer ${token}` },
  body: JSON.stringify({ problem: userInput })
});

// generates structured output:
{
  "today": ["throw away all cigarettes and smoking accessories"],
  "month": ["complete 30-day nicotine replacement therapy program"],
  "year": ["maintain smoke-free lifestyle for 365 consecutive days"]
}

uses embedding-based caching - similar problems get similar goals without redundant api calls.

layer 2: contextual how-to guides

each goal gets an interactive ❓ button for ai-powered guidance:

async function showHowTo(goalSetId, listType, taskId) {
  const task = goalSets[goalSetId][listType].find(t => t.id === taskId);
  
  const response = await fetch('/api/how-to', {
    method: 'POST',
    body: JSON.stringify({
      taskText: task.text,
      goalContext: {
        goalSetName: goalSet.name,
        timeframe: listType // today, month, year
      }
    })
  });
  
  return response.json(); // structured guide with steps, tips, timing
}

returns structured guidance:

overview: brief explanation of the task
steps: actionable numbered instructions
proTip: expert advice or common pitfalls
timeNeeded: realistic estimates
difficulty: complexity assessment

layer 3: conversational follow-ups

users can ask clarifying questions that maintain full context:

// conversation state preserved across questions
currentConversation = {
  taskId: "start_4_day_split_123",
  goalSetId: "fitness_goals", 
  listType: "today",
  messages: [
    { type: 'assistant', content: originalGuide },
    { type: 'user', content: "What about sets and reps?" },
    { type: 'assistant_followup', content: "For upper body days..." }
  ]
};

understanding openai embeddings

before diving into caching, it’s crucial to understand what embeddings are and why they’re revolutionary for text similarity.

what are embeddings?

embeddings convert text into high-dimensional vectors that capture semantic meaning. think of them as “fingerprints” for concepts - similar ideas get similar fingerprints.

openai’s text-embedding-3-small model transforms any text into a 1536-dimensional array of floating-point numbers:

async function getEmbedding(text, apiKey) {
  const response = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'text-embedding-3-small',
      input: text.toLowerCase().trim(),
      encoding_format: 'float'
    })
  });
  
  const data = await response.json();
  return data.data[0].embedding; // array of 1536 numbers
}

// "start a 4 day upper lower split" becomes:
// [0.12, -0.05, 0.73, 0.41, -0.19, 0.84, ...]  // 1536 floating-point numbers

the magic of semantic similarity

the breakthrough insight: semantically similar text produces similar vectors, even with completely different words.

// these all get embedded into nearby points in 1536-dimensional space:
"start a 4 day upper lower split"     // [0.12, -0.05, 0.73, 0.41, ...]
"begin 4-day upper/lower routine"     // [0.11, -0.04, 0.74, 0.42, ...]  
"initiate four day upper-lower workout" // [0.13, -0.06, 0.72, 0.40, ...]

// while completely different concepts are far apart:
"bake chocolate chip cookies"         // [0.89, 0.34, -0.12, -0.67, ...]

the ai model has learned through massive training that these phrases represent the same underlying concept, despite using different words.

measuring similarity with cosine similarity

cosine similarity measures the angle between two vectors in high-dimensional space, returning a score from 0.0 to 1.0:

function cosineSimilarity(vecA, vecB) {
  // calculate dot product (how aligned the vectors are)
  const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
  
  // calculate magnitudes (lengths of the vectors)
  const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
  
  // cosine similarity = dot product / (magnitude_a * magnitude_b)
  return dotProduct / (magnitudeA * magnitudeB);
}

// similarity score meanings:
// 0.95-1.0: nearly identical meaning ("start workout" vs "begin workout")
// 0.8-0.95:  very similar concepts ("sets and reps" vs "repetitions")  
// 0.6-0.8:   related but different ("workout" vs "exercise routine")
// 0.0-0.6:   unrelated concepts ("workout" vs "baking cookies")

real similarity examples from the app

here are actual similarity scores measured during development:

// high task similarity (0.92) - correctly matched
cosineSimilarity(
  embedding("start a 4 day upper lower split"),
  embedding("begin 4-day upper/lower routine")
) // returns 0.92

// high question similarity (0.87) - correctly matched
cosineSimilarity(
  embedding("what about sets and reps?"),
  embedding("how many sets and repetitions?")
) // returns 0.87

// low task similarity (0.23) - correctly rejected  
cosineSimilarity(
  embedding("start a 4 day upper lower split"),
  embedding("bake chocolate chip cookies")
) // returns 0.23

// medium question similarity (0.64) - correctly rejected
cosineSimilarity(
  embedding("what about sets and reps?"),
  embedding("what about rest periods?")
) // returns 0.64 (related but different question)

why embeddings beat traditional approaches

traditional string matching would completely fail on natural language variations:

// string matching approach (brittle and inflexible)
function isSetRepsQuestion(question) {
  const q = question.toLowerCase();
  return q.includes("sets") && q.includes("reps");
}

isSetRepsQuestion("what about sets and reps?");     // ✅ true
isSetRepsQuestion("how many repetitions per set?"); // ❌ false (missed!)
isSetRepsQuestion("what's the rep and set scheme?"); // ❌ false (missed!)

// embedding approach (semantic understanding)
function isSetRepsQuestion(question) {
  const questionEmbedding = getEmbedding(question);
  const referenceEmbedding = getEmbedding("sets and reps");
  return cosineSimilarity(questionEmbedding, referenceEmbedding) >= 0.8;
}

isSetRepsQuestion("what about sets and reps?");     // ✅ true (0.95)
isSetRepsQuestion("how many repetitions per set?"); // ✅ true (0.84)  
isSetRepsQuestion("what's the rep and set scheme?"); // ✅ true (0.81)

semantic caching approach

armed with understanding embeddings, we can now implement intelligent caching using dual similarity matching:

dual similarity matching

the breakthrough insight: we need both the task context and question intent to match for a valid cache hit.

asking “what about sets and reps?” means completely different things for:

“start a 4 day upper lower split” → workout programming advice
“bake chocolate chip cookies” → nonsensical question

async function findSimilarFollowUp(taskText, question, env) {
  // convert both text inputs to semantic vectors  
  const taskEmbedding = await getEmbedding(taskText, env.OPENAI_API_KEY);
  const questionEmbedding = await getEmbedding(question, env.OPENAI_API_KEY);
  
  // search all cached responses
  const list = await env.GOAL_CACHE.list({ prefix: 'followup_' });
  
  for (const item of list.keys) {
    const cached = await env.GOAL_CACHE.get(item.name, 'json');
    
    // calculate semantic similarity for both dimensions
    const taskSimilarity = cosineSimilarity(taskEmbedding, cached.taskEmbedding);
    const questionSimilarity = cosineSimilarity(questionEmbedding, cached.questionEmbedding);
    
    // both thresholds must be exceeded for cache hit
    if (taskSimilarity >= 0.85 && questionSimilarity >= 0.8) {
      console.log(`Cache hit: task=${taskSimilarity.toFixed(3)}, question=${questionSimilarity.toFixed(3)}`);
      return cached.answer;
    }
  }
  
  return null; // no semantic match found
}

threshold tuning through experimentation

the similarity thresholds were determined through empirical testing:

task similarity: 0.85 - tasks must be very similar
- fitness tasks vs cooking tasks score ~0.2 (correctly rejected)
- “upper lower split” vs “upper/lower routine” score ~0.92 (correctly matched)
question similarity: 0.8 - questions can vary more linguistically
- “sets and reps” vs “repetitions” score ~0.87 (correctly matched)
- “sets and reps” vs “rest periods” score ~0.64 (correctly rejected)

too low and you get wrong answers for the wrong context, too high and you miss valid linguistic variations.

cache structure evolution: from duplication to referential integrity

the initial cache structure stored complete task embeddings with every follow-up question:

// inefficient: task embeddings duplicated across follow-ups
"followup_1703512345_abc123" → {
  taskText: "start a 4 day Upper Lower Split",
  taskEmbedding: [0.1, 0.2, 0.3, ...], // 1536 dimensions - DUPLICATED
  question: "What about sets and reps?",
  questionEmbedding: [0.7, 0.8, 0.9, ...],
  answer: "For upper body days, aim for 3-4 sets of 8-12 reps...",
  hitCount: 23,
  timestamp: "2024-01-15T10:30:00Z"
}

the optimized structure uses a relational approach with task hash references:

// task embeddings stored once and referenced by hash
"task_embedding_abc123" → {
  taskText: "start a 4 day Upper Lower Split", 
  taskEmbedding: [0.1, 0.2, 0.3, ...], // 1536 dimensions - STORED ONCE
  taskHash: "abc123",
  usageCount: 15,
  timestamp: "2024-01-15T10:30:00Z"
}

// follow-up entries reference the task by hash
"followup_v2_abc123_def456_1703512345" → {
  taskHash: "abc123", // FOREIGN KEY REFERENCE
  question: "What about sets and reps?",
  questionEmbedding: [0.7, 0.8, 0.9, ...],
  answer: "For upper body days, aim for 3-4 sets of 8-12 reps...",
  hitCount: 23,
  timestamp: "2024-01-15T10:30:00Z"
}

task embedding lookup mechanism

when a cached follow-up is found, the system performs a two-step lookup to verify task similarity:

// step 1: find potential follow-up cache matches
const followupList = await env.GOAL_CACHE.list({ prefix: 'followup_v2_' });

for (const item of followupList.keys) {
  const cached = await env.GOAL_CACHE.get(item.name, 'json');
  
  if (cached && cached.taskHash && cached.questionEmbedding) {
    // step 2: lookup task embedding by hash reference
    const cachedTaskEmbedding = await getTaskEmbedding(cached.taskHash, env);
    
    if (cachedTaskEmbedding) {
      // step 3: verify both task and question similarity
      const taskSimilarity = cosineSimilarity(currentTaskEmbedding, cachedTaskEmbedding);
      const questionSimilarity = cosineSimilarity(currentQuestionEmbedding, cached.questionEmbedding);
      
      if (taskSimilarity >= 0.85 && questionSimilarity >= 0.8) {
        return cached.answer; // cache hit!
      }
    }
  }
}

the getTaskEmbedding() function resolves the hash reference:

async function getTaskEmbedding(taskHash, env) {
  const taskCacheKey = `task_embedding_${taskHash}`;
  const cachedTask = await env.GOAL_CACHE.get(taskCacheKey, 'json');
  
  if (cachedTask && cachedTask.taskEmbedding) {
    return cachedTask.taskEmbedding; // return the 1536-dimensional vector
  }
  
  return null; // task embedding not found (shouldn't happen in normal operation)
}

the reality: javascript does all the work

important distinction: cloudflare kv is a simple key-value store with no query capabilities. it only supports:

get(key) - retrieve value by exact key match
put(key, value) - store value at key
list({ prefix }) - list keys with a given prefix
delete(key) - remove key

kv does not have:

sql queries
indexing beyond key prefixes
semantic search capabilities
similarity functions
relational joins

this means all semantic matching happens in javascript:

// what actually happens during cache lookup:

// 1. javascript fetches ALL follow-up cache entries (brute force)
const followupList = await env.GOAL_CACHE.list({ prefix: 'followup_v2_', limit: 50 });

// 2. javascript loops through each entry one by one
for (const item of followupList.keys) {
  const cached = await env.GOAL_CACHE.get(item.name, 'json'); // 🔥 KV API CALL
  
  // 3. javascript makes ANOTHER kv call to get the task embedding
  const cachedTaskEmbedding = await getTaskEmbedding(cached.taskHash, env); // 🔥 ANOTHER KV API CALL
  
  // 4. javascript calculates cosine similarity in memory
  const taskSimilarity = cosineSimilarity(currentTaskEmbedding, cachedTaskEmbedding);
  const questionSimilarity = cosineSimilarity(currentQuestionEmbedding, cached.questionEmbedding);
  
  // 5. javascript evaluates thresholds
  if (taskSimilarity >= 0.85 && questionSimilarity >= 0.8) {
    return cached.answer; // found match!
  }
}

performance implications:

searching 50 follow-up entries = 50 get() calls to kv
each entry requires another get() call for task embedding = 50 more calls
total: 100 kv api calls for a single cache lookup
each call has ~5-15ms latency from edge to kv storage
semantic similarity calculations happen in cloudflare’s v8 javascript runtime

this is why we limit searches (limit: 50) and use early return on first match - the “database” is actually just a distributed hashtable with javascript doing all the intelligent work.

why O(1) lookup is impossible with embeddings

the fundamental problem: you cannot create deterministic cache keys from semantic similarity.

// two questions that mean the same thing to humans
const question1 = "what about sets and reps?";
const question2 = "how many repetitions?";

// but AI embeddings convert them to completely different number arrays
embedding(question1) → [0.123, 0.456, 0.789, ...]  // 1536 numbers
embedding(question2) → [0.187, 0.423, 0.801, ...]  // 1536 different numbers

// to create cache keys, we hash the text (not the embeddings)
hash(question1) → "abc123"  // deterministic based on exact text
hash(question2) → "xyz789"  // different text = different hash

// KV can only find exact key matches
await env.GOAL_CACHE.get("followup_task1_abc123"); // ✅ finds cached answer for "sets and reps"
await env.GOAL_CACHE.get("followup_task1_xyz789"); // ❌ cache miss for "repetitions"

// even though humans know these questions are asking the same thing!

what if we hashed the embeddings instead?

// you could hash the embedding arrays...
embedding(question1) → [0.123, 0.456, 0.789, ...]
hash([0.123, 0.456, 0.789, ...]) → "def456"

embedding(question2) → [0.187, 0.423, 0.801, ...]  
hash([0.187, 0.423, 0.801, ...]) → "ghi789"

// but you still get different hashes for similar meanings!
// hashing doesn't make semantically similar vectors produce similar hashes

the problem: hash functions are designed to produce completely different outputs for even tiny input changes. this is the opposite of what we want for semantic similarity.

// tiny difference in embeddings = completely different hash
embedding("sets and reps")      → hash → "abc123"  
embedding("sets and reps!")     → hash → "xyz789"  // just added "!" 
embedding("reps and sets")      → hash → "def456"  // just swapped order

// semantic similarity is about finding vectors that are *close* in high-dimensional space
// but hash functions are designed to make similar inputs produce *distant* outputs

to find semantic matches, you must compare embeddings:

// the only way to know if two questions are similar:
const similarity = cosineSimilarity(
  embedding("how many repetitions?"),     // user's question
  embedding("what about sets and reps?") // cached question
); // returns 0.87 - they ARE similar!

// but you can't know this without calculating similarity for every cached question

attempted workarounds and why they fail:

// ❌ canonical mapping: requires manual maintenance, misses variations
const canonicalMap = {
  "sets_and_reps": ["sets and reps", "repetitions", "how many reps"],
  // what about "rep count"? "set/rep scheme"? "lifting numbers"?
};

// ❌ embedding bucketing: complex, approximate, still requires similarity search
function bucketEmbedding(embedding) {
  return embedding.slice(0, 10).map(x => Math.round(x * 100)).join('_');
}
const bucket = bucketEmbedding(questionEmbedding); // still need O(k) search in bucket

// ❌ locality-sensitive hashing: difficult to implement correctly, approximate results

the harsh reality: if you need semantic similarity, you need either:

O(n) search through all candidates (what we built)
specialized vector database with optimized similarity algorithms

key-value stores excel at exact lookups, but semantic similarity requires mathematical comparison of high-dimensional vectors - a fundamentally different operation.

performance reality check

this approach has significant scalability issues:

cache size	kv api calls	lookup latency	bottleneck
10 entries	~20 calls	~100-300ms	acceptable
50 entries	~100 calls	~500-1500ms	slower than openai
100 entries	~200 calls	~1000-3000ms	unusable

why this can be slower than calling openai directly:

openai api: 1 call, ~2000ms response time
our cache lookup: 100+ kv calls, potentially 1500ms+ just for network overhead
plus javascript cpu time for 50+ cosine similarity calculations

architectural limitations:

o(n) search complexity - performance degrades linearly with cache size
api call explosion - each cache lookup requires dozens of network requests
no indexing - cloudflare kv provides no query optimization beyond key prefixes
cpu intensive - 1536-dimensional vector math in javascript runtime

better approaches for production scale

1. cloudflare vectorize (purpose-built for this):

// cloudflare's native vector database - perfect fit
const results = await env.VECTORIZE_INDEX.query(questionEmbedding, {
  filter: { taskHash: { $eq: "abc123" } }, // filter by task first
  topK: 3,
  returnMetadata: true
}); // single api call, sub-100ms response, same infrastructure

// would store vectors like:
await env.VECTORIZE_INDEX.upsert([{
  id: "followup_abc123_def456",
  values: questionEmbedding, // 1536 dimensions
  metadata: {
    taskHash: "abc123",
    question: "what about sets and reps?",
    answer: "For upper body days, aim for 3-4 sets...",
    taskSimilarity: 0.92 // pre-computed for filtering
  }
}]);

2. third-party vector databases:

// pinecone, weaviate, or similar if you need features vectorize lacks
const results = await vectorDB.query({
  vector: questionEmbedding,
  filter: { taskSimilarity: { $gte: 0.85 } },
  topK: 1
}); // single api call, sub-100ms response

3. pre-computed similarity indices:

// compute similarities at write-time, not read-time
"task_questions_abc123" → {
  taskHash: "abc123",
  questions: [
    { text: "sets and reps?", embedding: [...], answers: ["key1", "key2"] },
    { text: "how often?", embedding: [...], answers: ["key3"] }
  ]
} // one kv call gets all questions for a task

4. hybrid approach with similarity caching:

// cache the similarity calculations themselves
"question_matches_def456" → {
  questionHash: "def456",
  matches: [
    { followupKey: "followup_v2_abc123_ghi789", similarity: 0.87 },
    { followupKey: "followup_v2_xyz789_mno123", similarity: 0.82 }
  ],
  timestamp: "2024-01-15T10:30:00Z"
} // amortize expensive similarity calculations

the vectorize migration

after running the kv-based system in production and validating user demand, we migrated to cloudflare vectorize for true semantic caching.

migration results

performance improvements:

cache lookup: 500-1500ms → <100ms
api calls: 100+ kv operations → 1 vectorize query
cache accuracy: string matching → semantic similarity

real-world impact:

// before: brute force kv iteration
const followupList = await env.GOAL_CACHE.list({ prefix: 'followup_v2_', limit: 50 });
for (const item of followupList.keys) {
  const cached = await env.GOAL_CACHE.get(item.name, 'json'); // 50+ api calls
  // ... similarity calculations in javascript
}

// after: native vectorize query  
const matches = await env.SEMANTIC_CACHE.query(questionEmbedding, {
  filter: { 
    type: { $eq: "followup" },
    taskHash: { $eq: taskHash }
  },
  topK: 3,
  returnMetadata: "all"
}); // single api call with hardware-accelerated similarity search

unified semantic cache setup

single vectorize index handles all cache types:

// goal generation cache
{ 
  id: "goal_abc123_timestamp",
  values: problemEmbedding,
  metadata: { 
    type: "goal_generation", 
    problem: "How do I learn Spanish?", 
    goals: "{...}" 
  }
}

// how-to guide cache
{
  id: "guide_def456_timestamp", 
  values: taskEmbedding,
  metadata: { 
    type: "how_to_guide", 
    taskText: "start 4 day upper lower split", 
    guide: "{...}" 
  }
}

// follow-up question cache
{
  id: "followup_ghi789_timestamp",
  values: questionEmbedding, 
  metadata: { 
    type: "followup", 
    taskHash: "def456", 
    question: "what about sets and reps?", 
    answer: "..."
  }
}

production migration approach

phase 1: dual system with fallbacks

async function findSimilarFollowUpVectorize(taskText, question, env) {
  if (!env.SEMANTIC_CACHE) {
    // fallback to legacy kv caching if vectorize unavailable
    return await findSimilarFollowUp(taskText, question, env);
  }
  
  try {
    // vectorize query
    const matches = await env.SEMANTIC_CACHE.query(questionEmbedding, {...});
    return matches.length > 0 ? matches[0].metadata.answer : null;
  } catch (error) {
    console.error('vectorize error:', error);
    // graceful fallback to kv on errors
    return await findSimilarFollowUp(taskText, question, env);
  }
}

phase 2: monitoring and validation

cache headers distinguish systems: X-Cache-Status: HIT-VECTORIZE vs HIT
performance monitoring shows dramatic improvements
error rates remain low with reliable fallbacks

why the migration made sense:

architectural mismatch resolved - we were forcing kv (key-value store) to do similarity search (vector database job)
performance unpredictability - 100+ api calls per cache lookup could be slower than openai depending on network conditions
complexity reduction - eliminated 200+ lines of similarity calculation code
timing alignment - vectorize became available when we needed it

engineering reality: we built a working but inefficient system, then migrated when better infrastructure became available. the kv approach taught us the requirements and validated user demand before committing to specialized tooling.

why this matters: the taskHash acts as a foreign key that enables semantic deduplication while maintaining referential integrity. multiple follow-up questions about the same task concept share a single task embedding, but each has its own question embedding and cached answer.

this creates a many-to-one relationship where:

one task embedding (e.g., “start 4 day upper lower split”)
supports multiple follow-up caches (e.g., “sets and reps?”, “how often?”, “what weight?”)
without duplicating the expensive 1536-dimensional task vector

cross-user cache benefits

the magic happens when multiple users ask similar questions:

user a: “start 4 day upper lower split” + “what about sets and reps?” → cache miss → ai generates → cached

user b: “begin 4-day upper/lower routine” + “sets and repetitions?” → cache hit → instant response

the semantic matching catches variations:

“What about sets and reps?” ≈ “How many sets and repetitions?”
“start 4 day upper lower split” ≈ “begin upper/lower 4-day routine”
but rejects wrong contexts: “sets and reps” + “bake cookies” → no match

performance characteristics

two distinct response patterns:

scenario	network	openai cost	latency	hit rate
cache miss + openai	openai api	~$0.005	~6.5s	0%
vectorize cache hit	vectorize query	$0	~2.1s	40-60%

vectorize caching implementation

the system now implements unified vectorize caching with semantic similarity search:

single index, three cache types

all caching flows through one vectorize index with metadata-based filtering:

// goal generation vectors
{
  id: "goal_776b24e8b76a40ad_1753846452714",
  values: problemEmbedding, // 1536 dimensions
  metadata: {
    type: "goal_generation",
    problemHash: "776b24e8b76a40ad", 
    problem: "get jacked",
    goals: "{\"today\":[...], \"month\":[...], \"year\":[...]}"
  }
}

// how-to guide vectors  
{
  id: "guide_3933f9b7ab9b0f6f_1753847607690",
  values: taskEmbedding, // 1536 dimensions
  metadata: {
    type: "how_to_guide",
    taskHash: "3933f9b7ab9b0f6f",
    taskText: "complete a 30-minute strength training workout", 
    guide: "{\"overview\":\"...\", \"steps\":[...], \"proTip\":\"...\"}"
  }
}

// follow-up question vectors
{
  id: "followup_3933f9b7ab9b0f6f_5d67953f891f8d41_1753847633760", 
  values: questionEmbedding, // 1536 dimensions
  metadata: {
    type: "followup",
    taskHash: "3933f9b7ab9b0f6f", // links to how-to guide
    question: "how to do a lunge",
    answer: "To perform a lunge: 1. Stand with your feet hip-width apart..."
  }
}

vectorize query patterns

goal generation lookup:

const matches = await env.SEMANTIC_CACHE.query(problemEmbedding, {
  filter: { type: { $eq: "goal_generation" } },
  topK: 3,
  returnMetadata: "all"
});

how-to guide lookup:

const matches = await env.SEMANTIC_CACHE.query(taskEmbedding, {
  filter: { type: { $eq: "how_to_guide" } },
  topK: 3, 
  returnMetadata: "all"
});

follow-up question lookup:

const matches = await env.SEMANTIC_CACHE.query(questionEmbedding, {
  filter: { 
    type: { $eq: "followup" },
    taskHash: { $eq: taskHash }
  },
  topK: 3,
  returnMetadata: "all"
});

performance: native vector similarity search with hardware acceleration, ~2.1s cache hits vs ~6.5s cache misses in production

cost analysis

cost breakdown for 1000 users on similar fitness tasks:

without any caching:

initial guides: 1000 × $0.005 = $5.00
follow-up questions: 1000 × $0.005 = $5.00
task embeddings: 2000 × $0.0001 = $0.20
total: $10.20

with complete semantic caching:

initial guides: 1 × $0.005 = $0.005
follow-up questions: varies by uniqueness ≈ $0.05
task embeddings: 10 × $0.0001 = $0.0001
total: $0.055 (99.5% savings)

the first user asking about any topic creates a “knowledge seed” that benefits all future users with similar needs.

implementation details

rate limiting with graceful degradation

if (response.status === 429) {
  const retryAfter = errorData.retryAfter || 60;
  const minutes = Math.ceil(retryAfter / 60);
  throw new Error(
    `Too many questions! Please wait ${minutes} minute${minutes !== 1 ? 's' : ''} before asking another.`
  );
}

cache warming effects

the vectorize system creates organic cache warming through semantic similarity:

goal generation caching:

“get jacked” → instant responses for “build muscle”, “gain strength”, “get buff”
“learn spanish” → instant responses for “study spanish”, “spanish fluency”
each problem type builds a knowledge base that benefits similar requests

how-to guide caching:

“start 4 day upper lower split” → instant for “begin upper/lower routine”, “4-day workout plan”
“bake chocolate chip cookies” → instant for “make chocolate cookies”, “cookie baking”
semantic matching catches variations without exact string matches

follow-up question caching:

“what about sets and reps?” cached once, serves “how many repetitions?”, “sets and repetitions?”
“what temperature?” cached once, serves “baking temperature?”, “oven temp?”
context-aware caching ensures answers match the original task domain

monitoring and debugging

cache monitoring tracks vectorize performance:

// cache status headers distinguish between systems
return new Response(JSON.stringify(response), {
  headers: {
    'X-Cache-Status': cached ? 'HIT-VECTORIZE' : 'MISS-VECTORIZE',
    'X-Cache-Similarity-Score': bestMatch?.score?.toFixed(3),
    'X-Cache-Vector-Count': matches.matches?.length
  }
});

// console logging shows vectorize query results
console.log('=== GOAL GENERATION REQUEST START ===');
console.log('SEMANTIC_CACHE available:', !!env.SEMANTIC_CACHE);
console.log('Goal generation raw Vectorize response:', JSON.stringify(matches, null, 2));
console.log(`Goal generation search found ${matches.matches?.length || 0} potential matches`);

vectorize vector id structure enables easy debugging:

goal_{problemHash}_{timestamp} - goal generation vectors
guide_{taskHash}_{timestamp} - how-to guide vectors
followup_{taskHash}_{questionHash}_{timestamp} - follow-up q&a vectors

debug cli queries for troubleshooting:

# check metadata indexes
npx wrangler vectorize list-metadata-index semantic-cache

# query by type filter  
npx wrangler vectorize query semantic-cache --vector [...] --filter '{"type": "goal_generation"}'

# check vector count
npx wrangler vectorize info semantic-cache

can monitor hit rates in browser devtools and server logs, plus use cli tools to debug vectorize filtering issues.

alternatives considered

simple string matching: too brittle, misses semantic variations single embedding per question: loses task context, wrong answers llm-based similarity: too expensive for cache lookup hash-based caching: can’t handle natural language variations redis/database: unnecessary infrastructure complexity

lessons learned

migration insights

start simple, upgrade strategically - kv validation → vectorize optimization
reliable fallbacks enable confidence - dual systems during migration prevent downtime
semantic caching scales exponentially - each user benefits from all previous interactions
infrastructure timing matters - vectorize ga made the migration viable
monitoring distinguishes systems - cache headers enable performance comparison

technical learnings

metadata filtering is crucial - type and taskHash filters prevent wrong context matches
similarity thresholds matter - 0.8+ cosine similarity for reliable semantic matching
cache warming is organic - no need to pre-populate, users do it naturally
conversation state is fragile - clear on modal close to prevent bugs
graceful degradation - if caching fails, still call openai
vectorize beats kv - native similarity search vs javascript iterations

production results

performance improvements measured:

cache hit latency: 6.5s → 2.1s (~3x faster)
api call reduction: 100+ kv operations → 1 vectorize query
cost reduction: 99.5% for popular interactions
cache accuracy: improved semantic matching vs string similarity

user experience impact:

instant responses for cached questions
better cross-user knowledge sharing
more consistent ai guidance
reduced rate limiting due to cache hits

metadata indexing gotcha: vectors vs indexes timing

after deploying the vectorize migration, we discovered a critical timing issue that broke filtering for goal generation and follow-up caching.

the problem: retroactive indexing doesn’t work

what happened:

vectors were inserted into vectorize with metadata like type: "goal_generation"
metadata indexes were created later via wrangler vectorize create-metadata-index
filtered queries returned 0 results despite vectors having the correct metadata

// vectors inserted BEFORE metadata index creation
// ❌ cannot be found by filtered queries
const matches = await env.SEMANTIC_CACHE.query(embedding, {
  filter: { type: { $eq: "goal_generation" } }
}); // returns 0 results

// ✅ but can be found by unfiltered queries  
const allMatches = await env.SEMANTIC_CACHE.query(embedding, {
  topK: 10
}); // returns vectors with metadata intact

the discovery process

debugging revealed the issue:

# unfiltered query found 10 vectors including goals
npx wrangler vectorize query semantic-cache --vector [...] --top-k 10

# filtered query found only 2 newer vectors
npx wrangler vectorize query semantic-cache --vector [...] --filter '{"type": "goal_generation"}'

key insight: cloudflare vectorize metadata indexes only apply to vectors inserted after the index creation. existing vectors become invisible to filtered queries.

the solution: nuclear option

since the site had few users, we chose the clean slate approach:

# delete entire index
npx wrangler vectorize delete semantic-cache

# recreate with metadata indexes first
npx wrangler vectorize create semantic-cache --dimensions 1536 --metric cosine
npx wrangler vectorize create-metadata-index semantic-cache --propertyName=type --type=string
npx wrangler vectorize create-metadata-index semantic-cache --propertyName=taskHash --type=string

result: all new vectors are properly indexed and findable by filtered queries.

lessons for production systems

1. create metadata indexes before inserting vectors

# ✅ correct order for new vectorize setup

# step 1: create vectorize index
npx wrangler vectorize create semantic-cache --dimensions 1536 --metric cosine

# step 2: create metadata indexes BEFORE inserting any vectors
npx wrangler vectorize create-metadata-index semantic-cache --propertyName=type --type=string
npx wrangler vectorize create-metadata-index semantic-cache --propertyName=taskHash --type=string

# step 3: verify indexes are ready
npx wrangler vectorize list-metadata-index semantic-cache

# step 4: NOW safe to insert vectors with metadata
# vectors inserted after this point will be findable by filtered queries

in your application code:

// this will work correctly because metadata indexes exist
await env.SEMANTIC_CACHE.upsert([{
  id: "goal_abc123_timestamp",
  values: problemEmbedding,
  metadata: {
    type: "goal_generation",  // ✅ indexed
    problemHash: "abc123"     // ✅ indexed via taskHash
  }
}]);

// filtered queries will find the vector
const matches = await env.SEMANTIC_CACHE.query(embedding, {
  filter: { type: { $eq: "goal_generation" } }  // ✅ works
});

2. vectorize lacks reindexing capabilities unlike elasticsearch or other databases, vectorize doesn’t offer:

reindex command to rebuild metadata indexes
bulk export/import for data migration
retroactive index application

3. migration strategies for production data

small datasets: nuclear option (delete/recreate)
large datasets: build custom export/import pipeline
critical systems: implement dual-write during transition

4. monitoring metadata filtering add debug logging to detect filtering issues:

const filteredMatches = await env.SEMANTIC_CACHE.query(embedding, {
  filter: { type: { $eq: "goal_generation" } }
});

if (filteredMatches.matches.length === 0) {
  console.warn('Filtered query returned 0 results - check metadata indexes');
}

this metadata indexing gotcha cost us a few hours of debugging but taught valuable lessons about vectorize operational characteristics that aren’t well documented.

conclusion

semantic caching with vector databases transforms ai applications from expensive, slow interactions into fast, cost-effective systems that improve with every user.

key insights for ai application developers:

use the right tool for the job - vector databases excel at similarity search, traditional databases at exact lookups
semantic caching scales exponentially - each user interaction creates value for all future similar requests
ai responses don’t need to be unique - they need to be contextually appropriate and fast
metadata filtering is crucial - combine semantic similarity with structured filters for precise results
create metadata indexes before inserting vectors - retroactive indexing doesn’t work in most vector databases

when to use vector databases for caching:

✅ user queries have natural language variations (“sets and reps” vs “repetitions”)
✅ exact string matching misses too many valid cache hits
✅ content generation is expensive (time or cost)
✅ semantic similarity matters more than exact matches
❌ simple key-value lookups work fine
❌ transactional consistency is required

bottom line: if you’re building ai applications with expensive generation costs, semantic caching with vector databases can deliver 3x performance improvements and 99%+ cost reductions for popular content. the infrastructure investment pays for itself quickly through improved user experience and reduced ai api costs.

try the live system: actuallydostuff.com - click the ❓ on any goal to experience ~2s cached responses vs ~6s generated responses, powered by cloudflare vectorize.

Sid Ngeth's Blog A blog about anything (but mostly development)