Building a Distributed Search Index for Jekyll with Cloudflare Workers and R2


As Jekyll sites scale to thousands of pages, client-side search solutions like Lunr.js hit performance limits: the whole index has to be downloaded and held in browser memory. A distributed search architecture built on Cloudflare Workers and R2 storage moves the index to the edge and targets sub-100ms queries across large content collections, while the site itself stays static. This technical guide details the implementation of a sharded search index that partitions content across multiple R2 buckets and uses Worker-based query processing to search it.

In This Guide

Distributed Search Architecture and Sharding Strategy
Jekyll Index Generation and Content Processing Pipeline
R2 Storage Optimization for Search Index Files
Worker-Based Query Processing and Result Aggregation
Relevance Ranking and Result Scoring Implementation
Query Performance Optimization and Caching

Distributed Search Architecture and Sharding Strategy

The distributed search architecture partitions the search index across multiple R2 buckets based on content characteristics, enabling parallel query execution and efficient memory usage. The system comprises three main components: the index generation pipeline (Jekyll plugin), the storage layer (R2 buckets), and the query processor (Cloudflare Workers).

Index sharding follows a multi-dimensional strategy: primary sharding by content type (posts, pages, documentation) and secondary sharding by alphabetical ranges or date ranges within each type. This approach ensures balanced distribution while maintaining logical grouping of related content. Each shard contains a complete inverted index for its content subset, along with metadata for relevance scoring and result aggregation.


// Sharding Strategy:
// posts/a-f.json    [65MB]  → R2 Bucket 1
// posts/g-m.json    [58MB]  → R2 Bucket 1  
// posts/n-t.json    [62MB]  → R2 Bucket 2
// posts/u-z.json    [55MB]  → R2 Bucket 2
// pages/*.json      [45MB]  → R2 Bucket 3
// docs/*.json       [120MB] → R2 Bucket 4 (further sharded)

// Query Flow:
// 1. Query → Cloudflare Worker
// 2. Worker identifies relevant shards
// 3. Parallel fetch from multiple R2 buckets
// 4. Result aggregation and scoring
// 5. Response with ranked results
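
The two-level sharding rule can be sketched as a small routing function. This is only an illustration: the letter ranges follow the layout above, while the `{ type, title }` document shape, the `all` shard name for non-post content, and the `other` catch-all shard are assumptions.

```javascript
// Hypothetical shard router mirroring the layout listed above.
// The ranges and the { type, title } document shape are assumptions.
const LETTER_RANGES = [
  { name: 'a-f', from: 'a', to: 'f' },
  { name: 'g-m', from: 'g', to: 'm' },
  { name: 'n-t', from: 'n', to: 't' },
  { name: 'u-z', from: 'u', to: 'z' },
];

function determineShard(doc) {
  // Primary dimension: content type. Only posts are split further here.
  if (doc.type !== 'posts') return `${doc.type}/all`;

  // Secondary dimension: first letter of the normalized title.
  const first = (doc.title || '').trim().toLowerCase().charAt(0);
  const range = LETTER_RANGES.find((r) => first >= r.from && first <= r.to);

  // Titles starting with digits or symbols go to a catch-all shard.
  return `posts/${range ? range.name : 'other'}`;
}
```

In the pipeline described here the same rule would run inside the Jekyll plugin at build time; keeping it a pure function of the document makes it easy to port between Ruby and JavaScript.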

Jekyll Index Generation and Content Processing Pipeline

The index generation occurs during Jekyll build through a custom plugin that processes content, builds inverted indices, and generates sharded index files. The pipeline includes text extraction, tokenization, stemming, and index optimization.

Here's the core Jekyll plugin for distributed index generation:


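# _plugins/search_index_generator.rb
require 'json'
require 'nokogiri'
require 'zlib'

class SearchIndexGenerator < Jekyll::Generator
  # Minimal example list; replace with a full stop-word list for your language.
  STOP_WORDS = %w[a an and the of to in for with].freeze

  # Helpers such as should_index?, extract_searchable_content, stem,
  # determine_shard, extract_metadata, calculate_boost_factor and the
  # SearchIndexPage class are not shown here and must be defined separately.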
  def generate(site)
    @shards = Hash.new { |h,k| h[k] = {} }
    
    site.documents.each do |doc|
      next unless should_index?(doc)
      
      content = extract_searchable_content(doc)
      tokens = process_content(content)
      add_to_shards(doc, tokens)
    end
    
    generate_shard_files(site)
  end
  
  private
  
  def process_content(content)
    # HTML stripping and text extraction
    text = Nokogiri::HTML(content).text
    # Tokenization and normalization (drop the empty strings split can emit)
    tokens = text.downcase.split(/[^\w]+/).reject(&:empty?)
    # Stop word removal and stemming
    tokens.reject! { |t| STOP_WORDS.include?(t) }
    tokens.map! { |t| stem(t) }
    # Frequency analysis: token => number of occurrences
    tokens.tally
  end
  
  def add_to_shards(document, token_freq)
    shard_key = determine_shard(document)
    doc_id = document.url
    
    @shards[shard_key][doc_id] = {
      title: document.data['title'],
      url: document.url,
      content: token_freq,
      metadata: extract_metadata(document),
      boost: calculate_boost_factor(document)
    }
  end
  
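  def generate_shard_files(site)
    @shards.each do |shard_name, shard_data|
      # Zlib produces DEFLATE output, not Brotli. If the Worker expects
      # Brotli-compressed shards, recompress in the upload step or swap in
      # a Brotli encoder here so both sides agree on the format.
      compressed_data = Zlib::Deflate.deflate(JSON.generate(shard_data))
      site.pages << SearchIndexPage.new(site, shard_name, compressed_data)
    end
  end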
end
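
Queries only match if the Worker normalizes them the same way the plugin normalized the content. A minimal sketch of a query parser that mirrors the pipeline above follows. The stop-word list is a placeholder, and the stemmer is passed in as a parameter (defaulting to no stemming) because it has to be the same algorithm the build step uses; both are assumptions, not part of the original plugin.

```javascript
// Placeholder list; in practice it would be shared with the Jekyll plugin.
const STOP_WORDS = new Set(['a', 'an', 'and', 'the', 'of', 'to', 'in', 'for', 'with']);

// Normalizes a raw query string into unique search terms.
// `stem` is a hypothetical hook for the stemmer used at index time.
function parseQuery(query, stem = (t) => t) {
  const tokens = query
    .toLowerCase()
    .split(/[^\w]+/) // same split pattern as the Ruby plugin
    .filter((t) => t.length > 0 && !STOP_WORDS.has(t))
    .map(stem);
  return [...new Set(tokens)]; // drop duplicate terms, keep first-seen order
}
```

Any difference between the two sides, such as a stop word present in one list only, shows up as silently missing results rather than as an error, so sharing the lists through a generated file is worth considering.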

R2 Storage Optimization for Search Index Files

R2 storage configuration optimizes for both storage efficiency and query performance. The implementation uses compression, intelligent partitioning, and cache headers to minimize latency and costs.

Index files are compressed using brotli compression with custom dictionaries tailored to the site's content. Each shard includes a header with metadata for quick query planning and shard selection. The R2 bucket structure organizes shards by content type and update frequency, enabling different caching strategies for static vs. frequently updated content.


// R2 Bucket Structure:
// search-indices/
//   ├── posts/
//   │   ├── shard-001.br.json
//   │   ├── shard-002.br.json
//   │   └── manifest.json
//   ├── pages/
//   │   ├── shard-001.br.json  
//   │   └── manifest.json
//   └── global/
//       ├── stopwords.json
//       ├── stemmer-rules.json
//       └── analytics.log

// Upload script with optimization
// `env` is passed in explicitly; compressWithBrotli is a helper assumed to
// return a Uint8Array (for example from a WASM Brotli encoder).
async function uploadShard(env, shardName, shardData) {
  const compressed = compressWithBrotli(shardData);
  const key = `search-indices/posts/${shardName}.br.json`;
  
  await env.SEARCH_BUCKET.put(key, compressed, {
    httpMetadata: {
      contentType: 'application/json',
      contentEncoding: 'br'
    },
    // R2 custom metadata values must be strings
    customMetadata: {
      'shard-size': String(compressed.length),
      'document-count': String(shardData.documentCount),
      'avg-doc-length': String(shardData.avgLength)
    }
  });
}
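
The bucket layout includes a manifest.json per content type, but its format is not specified. One way it could drive shard selection is sketched below, assuming a hypothetical manifest shape in which each shard entry lists its R2 `key` and the `terms` it contains.

```javascript
// Returns the keys of shards that contain at least one search term.
// The manifest shape { shards: [{ key, terms }] } is an assumption.
function selectShards(manifest, searchTerms) {
  return manifest.shards
    .filter((shard) => {
      const terms = new Set(shard.terms);
      return searchTerms.some((t) => terms.has(t));
    })
    .map((shard) => shard.key);
}
```

Listing every term per shard makes the manifest grow with the vocabulary. For large sites a more compact summary, such as a Bloom filter per shard, would serve the same purpose at the cost of occasional unnecessary shard reads.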

Worker-Based Query Processing and Result Aggregation

The query processor handles search requests by identifying relevant shards, searching them in parallel, and aggregating the results. Shard reads are issued concurrently with Promise.allSettled, so a single failed shard does not fail the whole query.

Here's the core query processing implementation:


export default {
  async fetch(request, env, ctx) {
    const { query, page = 1, limit = 10 } = await getSearchParams(request);
    
    if (!query || query.length < 2) {
      return jsonResponse({ error: 'Query too short' }, 400);
    }
    
    const startTime = Date.now();
    const searchTerms = parseQuery(query);
    const relevantShards = await identifyRelevantShards(searchTerms, env);
    
    // Execute parallel searches across shards
    const shardResults = await Promise.allSettled(
      relevantShards.map(shard => searchShard(shard, searchTerms, env))
    );
    
    // Aggregate and rank results
    const allResults = aggregateResults(shardResults);
    const rankedResults = rankResults(allResults, searchTerms);
    const paginatedResults = paginateResults(rankedResults, page, limit);
    
    const responseTime = Date.now() - startTime;
    
    return jsonResponse({
      query,
      results: paginatedResults,
      total: rankedResults.length,
      page,
      limit,
      responseTime,
      shardsQueried: relevantShards.length
    });
  }
}

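async function searchShard(shardKey, searchTerms, env) {
  const shardObject = await env.SEARCH_BUCKET.get(shardKey);
  if (!shardObject) return [];
  
  // R2 returns an object wrapper; read the body before decompressing.
  // decompressBrotli is assumed to take an ArrayBuffer and return a string.
  const compressed = await shardObject.arrayBuffer();
  const decompressed = await decompressBrotli(compressed);
  const index = JSON.parse(decompressed);
  
  // Emits one hit per (term, document) pair; aggregateResults is expected
  // to merge hits that refer to the same document.
  return searchTerms.flatMap(term => 
    Object.entries(index)
      // hasOwn avoids false matches on inherited keys such as "constructor"
      .filter(([docId, doc]) => Object.hasOwn(doc.content, term))
      .map(([docId, doc]) => ({
        docId,
        score: calculateTermScore(doc.content[term], doc.boost, term),
        document: doc
      }))
  );
}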
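
The handler passes the raw output of Promise.allSettled to aggregateResults, which is not shown. A minimal sketch of what that step could look like follows, assuming hits have the `{ docId, score, document }` shape returned by searchShard and that per-term scores for the same document are simply summed.

```javascript
// Merges per-shard hit lists into one entry per document.
// `settled` is the array returned by Promise.allSettled.
function aggregateResults(settled) {
  const byDoc = new Map();
  for (const outcome of settled) {
    if (outcome.status !== 'fulfilled') continue; // skip failed shards
    for (const hit of outcome.value) {
      const existing = byDoc.get(hit.docId);
      if (existing) {
        existing.score += hit.score; // same document matched another term
      } else {
        byDoc.set(hit.docId, { ...hit }); // copy so inputs are not mutated
      }
    }
  }
  return [...byDoc.values()];
}
```

Skipping rejected shards trades completeness for availability: a query still returns results when one shard read fails, but the response could also report how many shards were skipped.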

Relevance Ranking and Result Scoring Implementation

The ranking algorithm combines TF-IDF scoring with content-based boosting. The relevance score starts from term frequency normalized by document length and weighted by inverse document frequency, and is then multiplied by boosts for content authority, title matches, URL matches, and freshness.

Here's the ranking implementation:


function rankResults(results, searchTerms) {
  return results
    .map(result => {
      const score = calculateRelevanceScore(result, searchTerms);
      return { ...result, finalScore: score };
    })
    .sort((a, b) => b.finalScore - a.finalScore);
}

// globalStats ({ totalDocuments, termFrequency }) is assumed to be loaded
// once per isolate, for example from a manifest, before ranking runs.
function calculateRelevanceScore(result, searchTerms) {
  let score = 0;
  // Guard against a missing or zero word count
  const wordCount = result.document.metadata.wordCount || 1;
  
  // TF-IDF base scoring
  searchTerms.forEach(term => {
    const tf = result.document.content[term] || 0;
    const idf = calculateIDF(term, globalStats);
    score += (tf / wordCount) * idf;
  });
  
  // Content-based boosting
  score *= result.document.boost;
  
  // Title match boosting
  const title = (result.document.title || '').toLowerCase();
  const titleMatches = searchTerms.filter(term => title.includes(term)).length;
  score *= (1 + (titleMatches * 0.3));
  
  // URL structure boosting
  if (result.document.url.includes(searchTerms.join('-'))) {
    score *= 1.2;
  }
  
  // Freshness boosting for recent content (skipped when no valid date exists)
  const daysOld = (Date.now() - new Date(result.document.metadata.date)) / (1000 * 3600 * 24);
  const freshnessBoost = Number.isFinite(daysOld) ? Math.max(0.5, 1 - (daysOld / 365)) : 1;
  score *= freshnessBoost;
  
  return score;
}

function calculateIDF(term, globalStats) {
  const docFrequency = globalStats.termFrequency[term] || 1;
  return Math.log(globalStats.totalDocuments / docFrequency);
}
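
Written as a formula, the score computed by the code above is roughly the following. The symbols are introduced here for illustration: |d| is the document's word count, N the total number of documents, df(t) the number of documents containing term t, b_d the document boost, m_d the number of query terms found in the title, u_d the URL boost (1.2 if the URL contains the hyphen-joined query, otherwise 1), and age_d the document age in days.

```latex
\[
\mathrm{score}(d, Q) =
\left( \sum_{t \in Q} \frac{\mathrm{tf}(t, d)}{|d|}
       \cdot \ln\frac{N}{\mathrm{df}(t)} \right)
\cdot b_d \cdot (1 + 0.3\, m_d) \cdot u_d \cdot f_d,
\qquad
f_d = \max\!\left(0.5,\; 1 - \frac{\mathrm{age}_d}{365}\right)
\]
```

As a worked example with made-up numbers: a single-term query where the term occurs 3 times in a 300-word document and appears in 10 of 1,000 documents gives a base score of 0.01 × ln(100) ≈ 0.0461. With a boost of 1, one title match (×1.3), no URL match, and an age of 73 days (f = 0.8), the final score is about 0.0479.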

Query Performance Optimization and Caching

Query performance optimization involves multiple caching layers, query planning, and result prefetching. Each cache layer has its own TTL, so the hottest queries are served from memory while less frequent ones fall through to slower but longer-lived layers.

The caching architecture includes:


// Multi-layer caching strategy
// L1: In-memory cache for hot queries, per isolate (1 minute TTL)
const MEMORY_TTL_MS = 60 * 1000;
const memoryCache = new Map(); // cacheKey → { value, expiresAt }

// Bindings live on `env`, which only exists inside a request handler,
// so the remaining layers are resolved per request.
function getCacheLayers(env) {
  return {
    memory: memoryCache,
    // L2: Workers KV cache for frequent queries (1 hour TTL)
    kv: env.QUERY_CACHE,
    // L3: R2-based shard cache with compression
    shard: env.SEARCH_BUCKET,
    // L4: Edge cache for popular result sets
    edge: caches.default
  };
}

async function executeQueryWithCaching(query, env, ctx) {
  const cache = getCacheLayers(env);
  const cacheKey = generateCacheKey(query);
  
  // Check L1 memory cache, ignoring expired entries
  const hit = cache.memory.get(cacheKey);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value;
  }
  
  // Check L2 KV cache (parsed as JSON by KV)
  const cachedResult = await cache.kv.get(cacheKey, 'json');
  if (cachedResult) {
    // Refresh in-memory cache
    cache.memory.set(cacheKey, { value: cachedResult, expiresAt: Date.now() + MEMORY_TTL_MS });
    return cachedResult;
  }
  
  // Execute fresh query
  const results = await executeFreshQuery(query, env);
  
  // Cache results at multiple levels without delaying the response
  ctx.waitUntil(cacheQueryResults(cacheKey, results, env));
  
  return results;
}

// Query planning optimization
function optimizeQueryPlan(searchTerms, shardMetadata) {
  const plan = {
    shards: [],
    estimatedCost: 0,
    executionStrategy: 'parallel'
  };
  
  searchTerms.forEach(term => {
    const termShards = shardMetadata.getShardsForTerm(term);
    plan.shards = [...new Set([...plan.shards, ...termShards])];
    plan.estimatedCost += termShards.length * shardMetadata.getShardCost(term);
  });
  
  // For high-cost queries, use sequential execution with early termination
  if (plan.estimatedCost > 1000) {
    plan.executionStrategy = 'sequential';
    plan.shards.sort((a, b) => a.cost - b.cost);
  }
  
  return plan;
}
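
The L1 layer is described as having a one-minute TTL, and a Worker isolate has limited memory, so the in-memory cache needs both expiry and a size bound. A minimal sketch of such a cache follows. The `TtlCache` name, the `maxEntries` limit, and the injectable clock are assumptions added for illustration and testing.

```javascript
// Small in-memory cache with per-entry expiry and a size bound.
class TtlCache {
  constructor(ttlMs, maxEntries = 500, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.now = now; // injectable clock, useful in tests
    this.entries = new Map(); // insertion order doubles as eviction order
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily drop expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key); // re-inserting moves the key to the end
    if (this.entries.size >= this.maxEntries) {
      // Evict the oldest entry to keep isolate memory bounded.
      this.entries.delete(this.entries.keys().next().value);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

An instance such as `new TtlCache(60 * 1000)` could stand in for the plain Map in the L1 layer. Because each isolate keeps its own copy, this layer only helps for queries that repeat on the same isolate within the TTL.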

This distributed search architecture lets Jekyll sites search collections far larger than a client-side index can handle, with sub-100ms query response times as the design target. The system scales horizontally by adding more shards and R2 buckets, while caching and query planning keep expensive queries in check by limiting how many shards each one has to read. The result is server-grade search that retains the cost efficiency and simplicity of static site generation.