How to Build a Portfolio Chatbot With RAG on the Free Tier
Gemini Flash + Supabase pgvector + Langfuse = a fully functional RAG chatbot with observability that costs exactly zero dollars.
The Thesis
I wanted a chatbot on my portfolio that could answer questions about my projects, skills, and experience - without paying for infrastructure. No OpenAI bills. No Pinecone credits. No Vercel Pro.
The constraints were simple:
| Requirement | Why |
|---|---|
| Zero monthly cost | Portfolio traffic is unpredictable. I'm not paying a fixed monthly fee for a chatbot that might never get a question. |
| Observability built in | If the chatbot hallucinates, I want to know why. If the RAG pipeline returns zero results, I want to see it in a trace. |
| Streaming responses | Nobody wants to stare at a spinner waiting for the full response. Character-by-character streaming or bust. |
| Security | It's my personal brand on the line. Prompt injection, jailbreaks, and data leaks need real defenses - not "we'll handle it later." |
Here's the exact stack that met all of them:
- LLM: Gemini 1.5 Flash (free tier: 1,500 req/day, 1M tokens/day)
- Vector store: Supabase pgvector (free tier: 500MB database)
- Embedding: gemini-embedding-2 (3072-dimensional vectors)
- Observability: Langfuse Cloud (free tier)
- Hosting: Vercel Hobby (free tier)
Everything else - the hybrid search, the SSE streaming, the chunking strategy, the security layer - is glue code I wrote. And that's the point. The free tier gives you the primitives; you bring the architecture.
Architecture Overview
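Here's the flow end-to-end before we dive into each piece: a request to /api/chat passes the rate limiter and the input checks, the query is embedded and run through hybrid search over content_chunks, the retrieved chunks are added to the prompt, and Gemini's response streams back to the browser as SSE, with Langfuse tracing the RAG and generation steps.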
The frontend (FloatingChat.tsx) is a React client component that reads the SSE stream via response.body.getReader() and renders text with a typewriter effect. Conversation history lives in React state - no persistence. Refresh loses the context. This is intentional for a portfolio chatbot where the user starts fresh every visit anyway.
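The reading loop described above can be sketched roughly as follows. This is a minimal sketch, not the actual FloatingChat.tsx code: the wire format (one `data:` line carrying JSON with a `text` field) and the names drainSseBuffer, readSseText, and ChunkReader are assumptions made for illustration.

```typescript
// Assumed wire format (not confirmed by the source): each SSE event is one
// `data: <json>` line followed by a blank line, with JSON like {"text":"..."}.

/** Extracts complete events from the buffer and returns any trailing partial event. */
function drainSseBuffer(buffer: string): { texts: string[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? ""; // last piece has no terminating blank line yet
  const texts: string[] = [];
  for (const frame of frames) {
    for (const line of frame.split("\n")) {
      if (!line.startsWith("data:")) continue;
      try {
        const parsed = JSON.parse(line.slice(5).trim()) as { text?: unknown };
        if (typeof parsed.text === "string") texts.push(parsed.text);
      } catch {
        // Skip malformed payloads instead of breaking the whole stream.
      }
    }
  }
  return { texts, rest };
}

/** Minimal structural type; the reader from response.body.getReader() fits it. */
interface ChunkReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
}

/** Reads the stream to the end, calling onText for every decoded text fragment. */
async function readSseText(
  reader: ChunkReader,
  onText: (text: string) => void,
): Promise<void> {
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    if (value) buffer += decoder.decode(value, { stream: true });
    const drained = drainSseBuffer(buffer);
    buffer = drained.rest;
    drained.texts.forEach(onText);
  }
}
```

In the browser the call would look something like `readSseText(response.body.getReader(), appendToMessage)`, where appendToMessage is a hypothetical callback that feeds the typewriter effect. Buffering the partial event matters because network chunks do not line up with SSE event boundaries.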
1. Why Gemini Flash (and Not OpenAI)
The cost calculation was embarrassingly simple:
| Model | Free Tier | Credit Card Required | Quality for Q&A |
|---|---|---|---|
| Gemini 1.5 Flash | 1,500 req/day, 1M tokens/day | No | Good |
| GPT-3.5 | Zero free tier | Yes | Comparable |
| GPT-4o mini | Pay-as-you-go | Yes | Slightly better |
| Claude Haiku | Pay-as-you-go | Yes | Comparable |
For a portfolio chatbot that answers questions about a developer's background - "What projects have you worked on?", "What's your tech stack?" - the quality difference between Gemini Flash and GPT-3.5 is negligible. Both answer correctly 95% of the time. Both hallucinate in the same ways when the context is thin.
The model is initialized in the API route like this:
The sendMessageStream() method resolves to a result whose stream property is an async iterable. Each chunk is serialized as an SSE event:
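As a rough illustration of that serialization step, here is a minimal sketch that turns streamed chunks into SSE frames. It is not the project's route code: the `{ text }` payload shape and the names TextChunk, toSseEvent, and toSseFrames are assumptions. TextChunk mirrors only the `text()` method that chunks from the @google/generative-ai stream expose, so the sketch runs without the SDK installed.

```typescript
/** Assumed minimal shape of a streamed chunk: anything with a text() method. */
interface TextChunk {
  text(): string;
}

/** Formats one SSE event. JSON-encoding keeps newlines in the text from ending the event early. */
function toSseEvent(text: string): string {
  return `data: ${JSON.stringify({ text })}\n\n`;
}

/** Maps an async iterable of model chunks to SSE frames, skipping empty chunks. */
async function* toSseFrames(stream: AsyncIterable<TextChunk>): AsyncGenerator<string> {
  for await (const chunk of stream) {
    const text = chunk.text();
    if (text.length > 0) yield toSseEvent(text);
  }
}
```

In a route handler, these frames would typically be encoded with TextEncoder and enqueued into a ReadableStream returned with a `Content-Type: text/event-stream` header.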
The hidden benefit: Gemini's free tier doesn't require a credit card to sign up. That's huge. If an attacker decided to spam my chatbot 50,000 times, the API would return 429 errors - not a bill. OpenAI's API has no equivalent safety net, since it has no free tier to fall back on.
2. Hybrid Search With pgvector - Not Just Vector Search
Pure vector search on a knowledge base of ~15 chunks is fine - exact nearest neighbor on 15 vectors takes microseconds even without an index. But it misses things.
Consider the question: "What projects use React?"
A vector search returns chunks about Skillence (which uses React) and Hisaab Pro (which doesn't - it's vanilla JS). The similarity is driven by the word "project" appearing in both contexts. It works, but it's fuzzy.
Full-text search returns exactly the chunks containing "React" - no more, no less. It's precise but brittle (misses synonyms, paraphrasing, "React.js" vs "React").
Hybrid search combines both. Here's the implementation:
Why RRF (Reciprocal Rank Fusion)?
RRF is the simplest fusion strategy that works. You take each item's rank in each result list, compute 1 / (k + rank), and sum across lists. Items that rank highly in both searches get a higher combined score than items that rank highly in only one.
The k parameter controls how much the raw rank matters. A smaller k amplifies top-ranked results. A larger k flattens the curve and gives lower-ranked results more of a chance. I chose k = 60 after testing - it's high enough that a chunk ranked 15th in both searches (score = 2/75 ≈ 0.027) beats a chunk ranked 1st in only one of them (score = 1/61 ≈ 0.016).
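The fusion step itself is small enough to sketch in full. This is a minimal version under stated assumptions, not the code from lib/rag.ts: it assumes each search returns an ordered list of chunk ids, and the function name reciprocalRankFusion is made up for illustration.

```typescript
/**
 * Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per item,
 * where rank starts at 1. Scores are summed across lists and sorted descending.
 */
function reciprocalRankFusion(
  rankedLists: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

Calling `reciprocalRankFusion([vectorIds, fullTextIds])` returns one merged ranking. Because only ranks are used, the cosine distances and full-text scores never have to be normalized against each other.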
The SQL Schema
The content_chunks table uses a generated tsvector column, so the full-text index is always in sync with the content:
Note the generated always as - no trigger needed, no application-level sync. Postgres maintains it automatically when content changes.
Also note: no vector index. The comment in schema.sql says it plainly:
-- No vector index needed - ~15 chunks, exact search is instant.
At this scale, an IVFFlat or HNSW index would add complexity without benefit. The entire knowledge base is ~15 chunks. A full scan takes microseconds.
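For readers who want to see the shape of such a table, here is a hedged sketch of the DDL, written as a TypeScript constant so it could be run from a migration script. Apart from content_chunks, content, and the 3072-dimensional embedding, every name here (id, source, fts, the index name) is an assumption, not a copy of the project's schema.sql.

```typescript
// Sketch only: column and index names are assumed for illustration.
const CONTENT_CHUNKS_SCHEMA = `
create extension if not exists vector;

create table if not exists content_chunks (
  id        text primary key,
  source    text not null,
  content   text not null,
  embedding vector(3072),
  -- Generated column: Postgres recomputes it whenever content changes.
  fts       tsvector generated always as (to_tsvector('english', content)) stored
);

-- GIN index for full-text search; deliberately no index on embedding.
create index if not exists content_chunks_fts_idx
  on content_chunks using gin (fts);
`;
```

The explicit 'english' configuration is what allows the expression to be used in a generated column, since Postgres requires generation expressions to be immutable.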
3. Chunking Strategy - Four Patterns, One Manifest
The knowledge base covers my background, projects, skills, decisions, and FAQ. Each document type needs a different chunking strategy. Rather than hardcoding it, I defined a manifest:
Four strategies, each chosen for the content shape:
| Strategy | Used For | How It Splits |
|---|---|---|
| single | Bio, Education (short docs) | Whole file = one chunk |
| section | Projects, Skills, Decisions | Split on ## or ### headings |
| qa-pair | FAQ | Split on **Q: pattern boundaries |
| paragraph | Process docs | Same as section (aliased) |
The ingestion script reads the manifest, chunks each file by its strategy, embeds with Gemini Embedding-2, and upserts into Supabase:
A notable detail: the manifest declares chunk_size: 500 tokens and chunk_overlap: 50, but the actual chunking is semantic, not token-based. The chunkBySection() function splits on markdown headings, not on token windows. This means chunks can be 100 tokens or 1,000 tokens depending on the section length.
For a 15-chunk knowledge base, this is fine. For a larger corpus, I'd fix this discrepancy - token-aware chunking with overlap is essential for long documents where a semantic boundary falls mid-paragraph.
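To make the four strategies concrete, here is a minimal sketch of a dispatcher that splits on the boundaries named in the table. It is an illustration under assumptions, not the code in scripts/ingest.ts: the names ChunkStrategy, splitAtLineStarts, and chunkDocument are invented, and the embedding and upsert steps are left out.

```typescript
type ChunkStrategy = "single" | "section" | "qa-pair" | "paragraph";

/** Splits text in front of every line that matches the lookahead, dropping empty pieces. */
function splitAtLineStarts(text: string, boundary: RegExp): string[] {
  return text
    .split(boundary)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

function chunkDocument(text: string, strategy: ChunkStrategy): string[] {
  switch (strategy) {
    case "single": {
      // Whole file becomes one chunk.
      const whole = text.trim();
      return whole.length > 0 ? [whole] : [];
    }
    case "section":
    case "paragraph":
      // Split before each "## " or "### " heading; "paragraph" is an alias.
      return splitAtLineStarts(text, /^(?=#{2,3} )/m);
    case "qa-pair":
      // Split before each line starting with "**Q:".
      return splitAtLineStarts(text, /^(?=\*\*Q:)/m);
  }
}
```

Because the lookahead does not consume the boundary, each chunk keeps its own heading or question line, which helps both the embedding and the full-text match.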
4. Langfuse Tracing - Observability Without the Cost
Observability tools are usually the first thing cut from a free-tier project. But I needed traces to debug hallucinations, empty RAG results, and unexpected Gemini responses.
Langfuse's free tier gives you:
- 50,000 observations/month
- Traces with nested spans
- Token usage tracking
- 7-day data retention
The wrapper is deliberately defensive - it never blocks the user response:
Every chat request creates one trace with two spans:
- RAG span - input query, number of results returned, source list
- Generation span - augmented prompt, streaming output, token counts, errors
This structure tells me at a glance: was the RAG pipeline empty? Did it return irrelevant results? Which sources were used? Did Gemini stream successfully or error out?
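The "never blocks the user response" idea can be sketched as a wrapper that swallows tracing errors. This is a simplified illustration, not the project's wrapper: SpanLike, TraceLike, and TracerLike are hypothetical structural types loosely modeled on a trace/span style client, retrieval is synchronous for brevity, and the payload fields should be checked against the real Langfuse SDK.

```typescript
// Hypothetical structural types; verify names and payload shapes against the real SDK.
interface SpanLike {
  end(data?: Record<string, unknown>): void;
}
interface TraceLike {
  span(args: { name: string; input?: unknown }): SpanLike;
}
interface TracerLike {
  trace(args: { name: string; input?: unknown }): TraceLike;
}

/** Runs a tracing call and swallows any error so tracing can never break a chat request. */
function safely<T>(label: string, fn: () => T): T | undefined {
  try {
    return fn();
  } catch (err) {
    console.warn(`[tracing] ${label} failed:`, err);
    return undefined;
  }
}

interface RetrievedChunk {
  source: string;
  content: string;
}

/** Wraps retrieval in a RAG span. Retrieval errors still propagate; only tracing is best-effort. */
function retrieveWithRagSpan(
  tracer: TracerLike | undefined,
  query: string,
  retrieve: (q: string) => RetrievedChunk[],
): RetrievedChunk[] {
  const trace = safely("trace", () => tracer?.trace({ name: "chat", input: query }));
  const span = safely("rag span", () => trace?.span({ name: "rag", input: query }));
  const results = retrieve(query);
  safely("rag span end", () =>
    span?.end({
      output: { resultCount: results.length, sources: results.map((r) => r.source) },
    }),
  );
  return results;
}
```

Recording resultCount on the span is what makes an empty retrieval visible at a glance, which is the first thing to check when an answer looks hallucinated.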
5. Security on the Free Tier
A portfolio chatbot doesn't have sensitive data, but it's still an exposed API endpoint that anyone on the internet can hit. I built three defenses:
5a. Input Length Cap
Messages over 500 characters are rejected. This blunts attacks that rely on a long payload to flood the context window and crowd out the system prompt, and it caps the tokens any single request can burn.
5b. Jailbreak Detection
Ten regex patterns covering common prompt injection vectors: "ignore your instructions," "system prompt," "DAN" (Do Anything Now), "you are a free AI," etc. When matched, the request is rejected and I get an email alert via Resend:
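A minimal sketch of the first two defenses (the length cap and the pattern check) might look like the following. The patterns are an illustrative subset written for this example, not the ten used in the project, and the names validateMessage and ValidationResult are assumptions. The email alert is left out.

```typescript
const MAX_MESSAGE_LENGTH = 500;

// Illustrative subset only; the project's actual pattern list is not shown in this post.
const JAILBREAK_PATTERNS: RegExp[] = [
  /ignore (?:all |any |your |the |previous |prior )*instructions/i,
  /system prompt/i,
  /\bDAN\b/, // case-sensitive so the name "Dan" does not trigger it
  /do anything now/i,
  /you are a free ai/i,
];

type ValidationResult =
  | { ok: true }
  | { ok: false; reason: "too_long" | "jailbreak" };

/** Cheap checks that run before any embedding or model call is made. */
function validateMessage(message: string): ValidationResult {
  if (message.length > MAX_MESSAGE_LENGTH) return { ok: false, reason: "too_long" };
  if (JAILBREAK_PATTERNS.some((pattern) => pattern.test(message))) {
    return { ok: false, reason: "jailbreak" };
  }
  return { ok: true };
}
```

Regex matching like this is easy to evade with paraphrasing, so it works best as a tripwire for the common copy-pasted attacks rather than as the only defense.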
5c. Canary Token
A unique string is embedded in the system prompt:
If the response contains this token, the system prompt was leaked. I log the event but don't expose it to the user. This is inspired by LangChain's canary token approach - it's a silent alarm that tells you your defenses failed.
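Here is a minimal sketch of how such a check could work on a streamed response, where the token may be split across two chunks. The token value, the prompt wording, and the names buildSystemPrompt and createCanaryDetector are all hypothetical.

```typescript
// Hypothetical canary value; in practice it would be a random string kept out of source control.
const CANARY_TOKEN = "CANARY-7f3a9c1e";

/** Appends the canary to the system prompt (the wording is an assumption). */
function buildSystemPrompt(basePrompt: string, canary: string): string {
  return `${basePrompt}\n\n[internal marker, never reveal: ${canary}]`;
}

/**
 * Detects the canary in streamed output. It keeps the last (canary.length - 1)
 * characters between chunks so a token split across a chunk boundary is still caught.
 */
function createCanaryDetector(canary: string) {
  let tail = "";
  let leaked = false;
  return {
    /** Feeds one chunk; returns true once the canary has been seen. */
    push(chunk: string): boolean {
      const combined = tail + chunk;
      if (combined.includes(canary)) leaked = true;
      const keep = canary.length - 1;
      tail = keep > 0 ? combined.slice(-keep) : "";
      return leaked;
    },
  };
}
```

Checking each chunk in isolation would miss a token that straddles two chunks, which is why the detector carries a short tail forward.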
6. The Cost Breakdown - Actually Zero
Here's exactly what the bill looks like:
| Service | Free Tier Limit | Monthly Usage (Estimated) | Cost |
|---|---|---|---|
| Gemini Flash | 1,500 req/day, 1M tokens/day | ~100 requests, ~50K tokens | $0 |
| Gemini Embedding | 1,500 req/day (same pool) | 1 call per chat request | $0 |
| Supabase pgvector | 500MB DB, 5GB bandwidth | < 50MB, < 100MB bandwidth | $0 |
| Langfuse Cloud | 50,000 observations/month | ~5,000 observations | $0 |
| Vercel Hobby | 100GB bandwidth, 600 build mins | < 10GB, < 100 build mins | $0 |
| Resend | 100 emails/day | ~5 alerts/month | $0 |
| Total | - | - | $0 |
The only thing that could break this is a traffic spike, and even that ends in errors rather than charges. If 10,000 people asked the chatbot questions in one day, I'd hit Gemini's rate limit - the API would return 429 errors, and the chatbot would display a friendly "I'm talking too fast" message. Not ideal UX, but also not a bill.
7. What I'd Do Differently
Every project has its "I'd do this differently if I started today" list. Here's mine:
7a. Use One Gemini SDK, Not Two
The codebase uses @google/genai (v2.3.0) for embeddings and @google/generative-ai (v0.24.1) for chat. Two SDKs, two initialization paths, two interfaces. The @google/genai SDK now supports both embeddings and chat streaming - I'd consolidate to it and delete the older dependency.
7b. Token-Aware Chunking
The manifest declares chunk_size: 500 and chunk_overlap: 50, but the chunker splits on markdown boundaries. I'd implement recursive character text splitting (like LangChain's RecursiveCharacterTextSplitter) that respects token budgets. This matters for the FAQ file, where some Q&A pairs are 50 tokens and others are 800 - the long ones inflate every prompt they are retrieved into.
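As a starting point, here is a sketch of budget-aware splitting with overlap. It is a plain sliding window, a simpler technique than the recursive splitter mentioned above, and it approximates tokens with whitespace-separated words, which is an assumption: a real implementation would count tokens with the model's tokenizer. The function name chunkByTokenBudget is invented.

```typescript
/**
 * Sliding-window chunker. "Tokens" are approximated by whitespace-separated
 * words, so the budgets are rough stand-ins for real token counts.
 */
function chunkByTokenBudget(text: string, chunkSize = 500, overlap = 50): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be non-negative and smaller than chunkSize");
  }
  const tokens = text.split(/\s+/).filter((token) => token.length > 0);
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    // Stop once the window has reached the end, to avoid a tail that is pure overlap.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}
```

One way to combine the two approaches would be to keep the semantic split and apply this window only to sections that exceed the budget, so short Q&A pairs stay intact.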
7c. DB-Backed Rate Limiting
The current rate limiter lives in proxy.ts - an edge middleware with an in-memory sliding window (Map<string, { count: number; windowStart: number }>). It limits /api/chat to 10 requests per IP per 60 seconds. On Vercel Hobby (single-region), it's good enough - but on cold start, the entire map resets, and across regions you'd have independent counters. I'd move rate limiting to Supabase with a simple rate_limits table:
Atomic UPSERT with a window check. Works across cold starts, scales to zero, costs nothing.
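The window check behind that UPSERT can be sketched in two forms: an in-memory function and the SQL statement it would translate to. Both are assumptions made for illustration, not the code in proxy.ts. The column names (ip, count, window_start) are invented, and the logic is a fixed-window counter, which is what a count plus windowStart pair most naturally implements.

```typescript
interface WindowState {
  count: number;
  windowStart: number; // epoch milliseconds
}

/** Fixed-window limiter: returns true if the request is allowed. The clock is injected for testing. */
function checkRateLimit(
  store: Map<string, WindowState>,
  key: string,
  now: number,
  limit = 10,
  windowMs = 60_000,
): boolean {
  const state = store.get(key);
  if (!state || now - state.windowStart >= windowMs) {
    // First request, or the previous window expired: start a new window.
    store.set(key, { count: 1, windowStart: now });
    return true;
  }
  state.count += 1;
  return state.count <= limit;
}

// The same logic as a single atomic Postgres statement (table and column names are assumed).
// The caller would allow the request when the returned count is <= 10.
const RATE_LIMIT_UPSERT = `
insert into rate_limits (ip, count, window_start)
values ($1, 1, now())
on conflict (ip) do update set
  count = case
    when rate_limits.window_start < now() - interval '60 seconds' then 1
    else rate_limits.count + 1
  end,
  window_start = case
    when rate_limits.window_start < now() - interval '60 seconds' then now()
    else rate_limits.window_start
  end
returning count;
`;
```

Both CASE expressions read the row's old values, so the counter reset and the window reset always agree. The statement needs a unique constraint on ip for ON CONFLICT to apply.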
7d. Add a Vector Index (When It Matters)
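At 15 chunks, exact search is fine. If I expand the knowledge base to 10,000 chunks (blog posts, code snippets, reading notes), I'll add an IVFFlat index on the embedding column. One catch: pgvector's IVFFlat and HNSW indexes support at most 2,000 dimensions on the vector type, so the 3072-dimensional embeddings would have to be indexed as halfvec (up to 4,000 dimensions) or generated at a lower dimensionality first. The schema already has the column defined, so most of the remaining work is the CREATE INDEX.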
7e. Conversation Persistence via URL State
Portfolio visitors don't need accounts, but they might want to share a chat response. I'd serialize the last N messages into URL search params (?chat=... base64), so sharing a link preserves the conversation. No database writes, no auth, no cost.
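A rough sketch of that encoding, assuming a simple `{ role, content }` message shape and URL-safe base64. The names ChatMessage, encodeChat, and decodeChat are invented for illustration. Since a shared link is untrusted input, the decoder validates what it reads and falls back to an empty conversation.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

/** Serializes the last N messages into a URL-safe base64 string. */
function encodeChat(messages: ChatMessage[], maxMessages = 6): string {
  const json = JSON.stringify(messages.slice(-maxMessages));
  const bytes = new TextEncoder().encode(json); // UTF-8, so non-ASCII text survives btoa
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

/** Decodes a shared conversation; returns [] for anything malformed. */
function decodeChat(param: string): ChatMessage[] {
  try {
    const base64 = param.replace(/-/g, "+").replace(/_/g, "/");
    const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
    const binary = atob(padded);
    const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
    const parsed: unknown = JSON.parse(new TextDecoder().decode(bytes));
    if (!Array.isArray(parsed)) return [];
    return parsed.filter(
      (m): m is ChatMessage =>
        typeof m === "object" &&
        m !== null &&
        (m.role === "user" || m.role === "assistant") &&
        typeof m.content === "string",
    );
  } catch {
    return [];
  }
}
```

URL length limits vary by browser and platform, so capping N and perhaps truncating long answers would be sensible. Decoded messages should also be rendered as plain text, since anyone can craft a link.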
7f. Evaluate the Model Choice
This project started with Gemini 1.5 Flash and has since moved to gemini-3.1-flash-lite as the model lineup changed. The free tier terms have stayed the same. But if I were starting today, I'd also evaluate:
- Gemini 2.0 Flash (better reasoning, still free tier)
- Claude 3 Haiku (cheap per-token, but no free tier - a hard no)
- Local model via Ollama (no API cost, needs a server - incompatible with Vercel serverless)
For now, Gemini Flash remains the rational default: free, good enough, no credit card.
Key Takeaways
If you're building a free-tier RAG chatbot, here's the playbook:
- Gemini Flash + Supabase pgvector is a killer combo. Both have generous free tiers, no credit card required, and integrate directly via their SDKs. No intermediaries, no proxy services, no markups.
- Hybrid search beats pure vector search at this scale. RRF is trivial to implement (10 lines of TypeScript) and catches cases that pure semantic search misses. Your full-text index is already there - use it.
- Observability is not optional, even on free tier. Langfuse costs nothing for this volume and turns a black-box chatbot into a debuggable system. The trace structure (one RAG span + one generation span per request) is the minimum viable observability pattern.
- Security doesn't need a budget. Canary tokens, regex jailbreak detection, and input caps cost zero dollars and prevent the most common attacks on exposed LLM endpoints. The email alert layer (Resend free tier) means you know when someone tries - silent failures are the real risk.
- Semantic chunking is fine at small scale; token-aware chunking is necessary at large scale. Know which regime you're in and don't over-engineer for the wrong one.
The full source code is at github.com/SolarisXD/portfolio. The chatbot lives at app/api/chat/route.ts, the RAG pipeline at lib/rag.ts, and the ingestion script at scripts/ingest.ts. ~500 lines of TypeScript total for the entire RAG system. No vector database bills. No observability bills. No surprises.