Sid Ngeth's Blog A blog about anything (but mostly development)

materialized views made my dashboard 9000x faster

i built a rails dashboard to analyze millions of records. first pass: painfully slow. adding materialized views with the scenic gem: absurdly fast.

here’s what i learned benchmarking against 100k users, 1M orders, and 5M user activities.

the problem

dashboard queries were hitting multiple tables with joins and aggregations. every page load meant scanning millions of rows, grouping, sorting… you know the drill.

user engagement query: 7.1 seconds per request. category revenue: 3.4 seconds. this is fine for batch reports but unusable for a dashboard people actually look at.

materialized views in 30 seconds

regular views are just saved queries. postgres re-runs them every time.

materialized views are snapshots. postgres runs the query once, stores the results as a real table. subsequent reads? just SELECT from that cached table.

trade-off: data gets stale until you refresh. for dashboards where 5-60 minute staleness is fine, this works great.
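a toy model of that difference in plain ruby, with no postgres involved. RegularView and MaterializedView are made-up class names for illustration only, and the "query" is just a block:

```ruby
# hypothetical stand-ins, not scenic or postgres apis
class RegularView
  def initialize(&query)
    @query = query
  end

  # re-runs the underlying query on every read
  def rows
    @query.call
  end
end

class MaterializedView
  def initialize(&query)
    @query = query
    refresh
  end

  # reads return the stored snapshot, however stale it is
  def rows
    @snapshot
  end

  # re-runs the query and replaces the snapshot
  def refresh
    @snapshot = @query.call
  end
end

orders = [120, 80]
regular = RegularView.new { orders.sum }
materialized = MaterializedView.new { orders.sum }

orders << 50          # a new order arrives
regular.rows          # 250, recomputed on read
materialized.rows     # still 200 until the next refresh
materialized.refresh
materialized.rows     # 250
```

the stale read in the middle is the whole trade-off: cheap reads, paid for with a refresh you have to schedule.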

scenic gem setup

scenic wraps postgres materialized views in rails migrations. feels native.

rails generate scenic:view daily_sales --materialized

creates two files:

  • migration file to create the view
  • sql file for your query

here’s the daily sales view:

SELECT
  DATE(orders.order_date) AS sale_date,
  COUNT(DISTINCT orders.id) AS total_orders,
  COUNT(DISTINCT orders.user_id) AS unique_customers,
  SUM(orders.total_amount) AS total_revenue,
  AVG(orders.total_amount) AS average_order_value,
  SUM(CASE WHEN orders.status = 'completed' THEN 1 ELSE 0 END) AS completed_orders,
  SUM(CASE WHEN orders.status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_orders
FROM orders
GROUP BY DATE(orders.order_date)
ORDER BY sale_date DESC

normal complex aggregation. but instead of running this every time, we materialize it:

class CreateDailySales < ActiveRecord::Migration[8.0]
  def change
    create_view :daily_sales, materialized: true
    add_index :daily_sales, :sale_date, unique: true
  end
end

now you can query it like any rails model:

class DailySale < ApplicationRecord
  def readonly?
    true
  end

  def self.refresh
    Scenic.database.refresh_materialized_view(table_name, concurrently: false, cascade: false)
  end
end

# in your controller
@daily_sales = DailySale.order(sale_date: :desc).limit(30)

postgres reads from a pre-computed table instead of scanning orders every time.

the benchmarks

i built 4 materialized views:

  1. daily_sales - revenue metrics by day
  2. top_products - product performance
  3. user_engagements - customer lifetime value
  4. category_revenues - category breakdowns

then benchmarked raw queries vs materialized views using benchmark-ips.

results

daily sales summary

  • raw query: 6.25 iterations/sec (160ms per query)
  • materialized view: 2,191 iterations/sec (456 microseconds per query)
  • 350x faster

top products by revenue

  • raw query: 0.69 iterations/sec (1.44 seconds per query)
  • materialized view: 438 iterations/sec (2.28ms per query)
  • 633x faster

user engagement metrics

  • raw query: 0.14 iterations/sec (7.12 seconds per query)
  • materialized view: 135 iterations/sec (7.39ms per query)
  • 963x faster

category revenue analysis

  • raw query: 0.29 iterations/sec (3.41 seconds per query)
  • materialized view: 2,715 iterations/sec (368 microseconds per query)
  • 9,252x faster

the user engagement query went from 7 seconds to 7 milliseconds. category revenue from 3.4 seconds to 368 microseconds.
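benchmark-ips reports throughput, so the per-query times and speedups above are just arithmetic on iterations/sec. a small sketch of that conversion (the helper names are mine, and the inputs are the rounded figures from this post, so ratios can differ slightly from the published ones):

```ruby
# convert benchmark-ips throughput (iterations/sec) into per-query latency
def latency_ms(iterations_per_sec)
  1000.0 / iterations_per_sec
end

# how many times more iterations/sec the view manages than the raw query
def speedup(raw_ips, view_ips)
  view_ips / raw_ips.to_f
end

# rounded numbers copied from the results above
results = {
  daily_sales:      { raw: 6.25, view: 2191 },
  user_engagements: { raw: 0.14, view: 135 }
}

results.each do |name, r|
  puts format("%-17s %9.2f ms -> %6.3f ms (%.0fx)",
              name, latency_ms(r[:raw]), latency_ms(r[:view]),
              speedup(r[:raw], r[:view]))
end
```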

how the queries work

let’s look at the user engagement view since it had the biggest pain:

SELECT
  users.id AS user_id,
  users.email,
  users.name,
  COUNT(DISTINCT orders.id) AS total_orders,
  SUM(orders.total_amount) AS lifetime_value,
  AVG(orders.total_amount) AS avg_order_value,
  COUNT(DISTINCT user_activities.id) AS total_activities,
  COUNT(DISTINCT CASE WHEN user_activities.activity_type = 'page_view' THEN user_activities.id END) AS page_views,
  MAX(orders.order_date) AS last_order_date,
  MAX(user_activities.occurred_at) AS last_activity_date,
  DATE_PART('day', NOW() - MAX(user_activities.occurred_at)) AS days_since_last_activity
FROM users
LEFT JOIN orders ON users.id = orders.user_id
LEFT JOIN user_activities ON users.id = user_activities.user_id
GROUP BY users.id, users.email, users.name
ORDER BY lifetime_value DESC NULLS LAST

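two left joins across 100k users, 1M orders, and 5M activities. because both joins key on users.id, every order row gets paired with every activity row for the same user: roughly 10 orders × 50 activities × 100k users, which is where the ~50 million intermediate rows in the plan further down come from. (it also means SUM(orders.total_amount) counts each order once per activity, so aggregate each table in its own subquery before joining if you need an exact lifetime_value.) grouping, aggregating, sorting. every single time someone loads the dashboard.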

materialized it? 100k rows pre-computed. SELECT with a simple ORDER BY and LIMIT.
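a side note on the raw query: joining orders and user_activities to the same user multiplies rows, which matters for plain SUM. a tiny ruby model of that fan-out, using made-up data for one user:

```ruby
# one user with 2 orders and 3 activities (invented numbers)
orders     = [{ id: 1, total: 100 }, { id: 2, total: 50 }]
activities = [{ id: 10 }, { id: 11 }, { id: 12 }]

# what the joins do for a single user: every order paired with every activity
joined = orders.product(activities)

row_count      = joined.size                                      # 6 rows, not 2 + 3
naive_sum      = joined.sum { |order, _activity| order[:total] }  # 450, each order counted 3 times
distinct_count = joined.map { |order, _activity| order[:id] }.uniq.size  # 2, COUNT(DISTINCT) is safe
true_sum       = orders.sum { |order| order[:total] }             # 150, aggregate before joining
```

COUNT(DISTINCT ...) and MAX survive the duplication. SUM does not, which is the usual argument for pre-aggregating each side in a subquery.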

the indexes matter too:

add_index :user_engagements, :user_id, unique: true

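postgres can use the index for per-user lookups, and a unique index is also what a CONCURRENTLY refresh needs. filtering by high-value customers is still quick on 100k pre-computed rows; an index on lifetime_value would help there if the view grows.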

why materialized views are faster: database internals

ran EXPLAIN ANALYZE on both approaches to see what postgres is actually doing. the difference is wild.

raw query execution (7.1 seconds)

Limit  (cost=1666565.79..1666566.04 rows=100)
  Buffers: shared hit=383450 read=135233 written=1559
  ->  Sort  (top-N heapsort)
        ->  GroupAggregate  (rows=100000)
              ->  Merge Left Join  (rows=50455739)  ← 50 MILLION intermediate rows
                    ->  Gather Merge (parallel workers: 2)
                          ->  Incremental Sort
                                ->  Merge Left Join (users + orders)
                    ->  Materialize (user_activities, 5M rows)

what’s happening:

  • joins 100k users + 1M orders + 5M activities
  • creates 50 million intermediate rows
  • groups all 100k users
  • sorts by lifetime value
  • reads 135,233 disk blocks from storage
  • takes top 100

the query is scanning millions of rows, doing complex joins, aggregating, then sorting. postgres is working hard.

materialized view execution (7.4ms)

Limit  (cost=0.29..8.87 rows=100)
  Buffers: shared hit=103
  ->  Index Scan using index_user_engagements_on_user_id
        Order By: lifetime_value DESC

what’s happening:

  • uses index to read rows sorted by lifetime_value
  • reads 103 blocks (all from cache)
  • stops after 100 rows

no joins. no aggregation. no sorting. just reading pre-computed results.

buffer analysis: cache hits matter

postgres tracks how often data is read from RAM (cache hits) vs disk:

base tables getting hammered by raw queries:

order_items:      5.2M disk reads, 76% cache hit ❌
user_activities:  1.3M disk reads, 91% cache hit ❌
orders:           817K disk reads, 95% cache hit ⚠️

materialized views:

daily_sales:        37 disk reads, 99.88% cache hit ✅
user_engagements: 9,612 disk reads, 99.71% cache hit ✅
top_products:     1,168 disk reads, 99.87% cache hit ✅

disk reads are ~1000x slower than RAM. materialized views stay in cache because they’re small and accessed frequently.
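the hit percentages are hits divided by total block requests. a minimal sketch of that calculation, assuming you have a hits counter and a reads counter per table (for example heap_blks_hit and heap_blks_read from pg_statio_user_tables); the sample numbers are invented:

```ruby
# cache hit ratio as a percentage: hits / (hits + disk reads)
def cache_hit_ratio(hits:, reads:)
  total = hits + reads
  return nil if total.zero? # table was never read

  (100.0 * hits / total).round(2)
end

cache_hit_ratio(hits: 9_000, reads: 1_000) # 90.0
cache_hit_ratio(hits: 0, reads: 0)         # nil
```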

query cost comparison

postgres estimates query cost before execution:

query             raw cost   view cost   ratio
daily sales       101,503    0.96        105,628x
user engagement   763,318    2.86        266,860x
top products      101,996    2.54        40,156x

these aren’t execution times. they’re the planner’s estimates in arbitrary cost units, built from expected page reads and per-row cpu work. lower is better.

raw query for user engagement costs 763,318 units. materialized view: 2.86 units.

the memory problem: external sorts

raw daily sales query execution plan shows this:

Sort Method: external merge  Disk: 14208kB
  Worker 0: Disk: 12200kB
  Worker 1: Disk: 13736kB

sorting 1M rows doesn’t fit in work_mem, so postgres spills to disk. writes ~40MB of temporary files across 3 processes (the leader plus 2 parallel workers).

disk I/O during sorting kills performance.

materialized views? no sorting needed. data is already sorted via indexes.

sequential scans vs index scans

checked how often postgres uses indexes vs scanning entire tables:

base tables:

orders:      5.5M index scans (99.99% index usage) ✅
users:       6.2M index scans (100% index usage)   ✅
products:    7.0M index scans (100% index usage)   ✅

every raw query hits these tables with index lookups. millions of operations putting load on the database.

materialized views:

category_revenues: 38,642 sequential scans (0 index scans) ✅
top_products:       5,988 sequential scans (0 index scans) ✅
daily_sales:            5 seq scans, 29,578 index scans   ✅

materialized views are small. sequential scans are actually faster than indexes for small tables (no index overhead).

real I/O impact

ran rails sql:analysis to get detailed buffer statistics:

raw user engagement query:

  • 135,233 disk blocks read
  • 383,450 cache blocks read
  • 1,559 blocks written (temp data)
  • 9.26 seconds execution

materialized view:

  • 0 disk blocks read
  • 103 cache blocks read
  • 0 blocks written
  • 0.0074 seconds execution

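the raw query reads 135,233 blocks from disk, about 1,300x the 103 blocks the view touches in total (closer to 5,000x if you count cache hits too). that’s why it’s slow.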
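those counters come straight from the "Buffers:" lines in the plans shown earlier. a rough sketch for pulling them out of the text output; parse_buffers is a helper name i made up, and it assumes the line only carries shared counters (a line with temp or local counters would need a smarter parser):

```ruby
# parse a "Buffers:" line from EXPLAIN (ANALYZE, BUFFERS) text output
def parse_buffers(line)
  line.scan(/(hit|read|written)=(\d+)/).to_h { |key, count| [key.to_sym, count.to_i] }
end

# total blocks requested, whether served from cache or disk
def blocks_touched(buffers)
  buffers.fetch(:hit, 0) + buffers.fetch(:read, 0)
end

raw  = parse_buffers("Buffers: shared hit=383450 read=135233 written=1559")
view = parse_buffers("Buffers: shared hit=103")

disk_ratio  = raw.fetch(:read, 0) / blocks_touched(view)  # raw disk reads vs all view blocks
total_ratio = blocks_touched(raw) / blocks_touched(view)  # all raw blocks vs all view blocks
```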

tools for analysis

added comprehensive SQL analysis tools to the repo:

# full analysis report
rails sql:analysis

# shows: execution plans, buffer usage, cache hit ratios,
# index usage, query costs, table statistics

# analyze specific query
rails sql:analyze_query QUERY="SELECT * FROM orders WHERE status = 'completed'"

# compare raw vs materialized views
rails benchmark:compare

the EXPLAIN ANALYZE output shows exactly what postgres is doing: parallel workers, sort methods, join types, buffer usage, actual row counts.

check out PERFORMANCE_ANALYSIS.md in the repo for the complete breakdown with execution plans and statistics.

refreshing the views

views get stale. you need to refresh them.

i use a background job with solid queue:

class RefreshMaterializedViewsJob < ApplicationJob
  queue_as :default

  def perform
    DailySale.refresh
    TopProduct.refresh
    UserEngagement.refresh
    CategoryRevenue.refresh
  end
end

scheduled hourly in production:

# config/recurring.yml
production:
  refresh_materialized_views:
    class: RefreshMaterializedViewsJob
    queue: default
    schedule: every hour

refreshing all 4 views takes about 27 seconds with my dataset. once an hour is negligible overhead for 350-9000x query speedups.

for larger views or high-traffic sites, use CONCURRENTLY:

def self.refresh
  Scenic.database.refresh_materialized_view(table_name, concurrently: true)
end

requires a unique index on the view (and the view must already be populated), but the refresh no longer blocks reads. users can keep querying during refresh.

when this makes sense

materialized views work when:

  • you have complex aggregations that run often
  • data staleness of 5-60 minutes is acceptable
  • reads massively outnumber writes
  • the underlying query is expensive (>500ms)

don’t use them for:

  • real-time data requirements
  • simple queries already fast with indexes
  • write-heavy tables that change constantly

my dashboard checks all the boxes. analytics data where hour-old numbers are fine. users hitting the same queries hundreds of times per day.

the full setup

i open sourced the complete case study. includes:

  • production-ready schema (users, products, orders, activities)
  • 4 materialized views with sql
  • seed script that generates millions of records
  • benchmark rake tasks
  • dashboard ui
  • automated refresh jobs

you can clone it and run benchmarks yourself:

git clone https://github.com/sngeth/scenic-materialized-views-demo
cd scenic-materialized-views-demo
bundle install
rails db:create db:migrate
rails db:seed
rails benchmark:refresh
rails benchmark:compare

customize data volume with env vars:

USERS_COUNT=50000 PRODUCTS_COUNT=5000 rails db:seed

some specifics on scenic

scenic handles view versioning like migrations. updating a view:

rails generate scenic:view daily_sales --materialized

running the generator again for an existing view creates daily_sales_v02.sql plus an update migration. modify the query, run migrations, scenic handles the swap.

you can also drop down to raw sql when needed:

ActiveRecord::Base.connection.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales")

scenic mostly stays out of your way. it’s a thin wrapper that makes postgres materialized views feel like rails.

monitoring refresh performance

track how long refreshes take:

def perform
  Rails.logger.info "Starting materialized views refresh..."
  start_time = Time.now

  DailySale.refresh
  Rails.logger.info "  ✓ DailySale refreshed"

  # ... other views

  elapsed_time = Time.now - start_time
  Rails.logger.info "Completed in #{elapsed_time.round(2)}s"
end

watch for degradation as data grows. if refreshes start taking too long, consider:

  • refreshing views separately with different schedules
  • using incremental refresh patterns
  • partitioning underlying tables
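to decide which views deserve their own schedule, it helps to time each refresh separately. a sketch under assumptions: timed_refresh and FakeView are names i invented, and in the real job the list would be the view models (DailySale, TopProduct, and so on) with the timings sent to Rails.logger:

```ruby
# time each refresh on its own, using a monotonic clock so wall-clock
# adjustments can't skew the measurement. returns { view name => seconds }.
def timed_refresh(views)
  views.to_h do |view|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    view.refresh
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    [view.name, elapsed]
  end
end

# stand-in for a scenic-backed model; only needs .name and .refresh
FakeView = Struct.new(:name, :cost) do
  def refresh
    sleep cost # pretend to run REFRESH MATERIALIZED VIEW
  end
end

timings = timed_refresh([FakeView.new("DailySale", 0.01),
                         FakeView.new("UserEngagement", 0.05)])
slowest = timings.max_by { |_name, seconds| seconds }.first
puts "slowest view: #{slowest}"
```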

practical example: the dashboard controller

here’s how simple the controller gets:

class DashboardController < ApplicationController
  def index
    @daily_sales = DailySale.order(sale_date: :desc).limit(30)
    @top_products = TopProduct.order(total_revenue: :desc).limit(10)
    @category_revenues = CategoryRevenue.order(total_revenue: :desc)
    @top_users = UserEngagement.order(lifetime_value: :desc).limit(10)
  end
end

four simple queries. no joins, no aggregations, no complexity. just reading pre-computed data.

response time? 50-100ms total including rendering. used to be 10+ seconds with raw queries.

the views handle all the heavy lifting in the background refresh job.

cost analysis

refreshing 4 views takes 27 seconds every hour = 648 seconds per day.

without materialized views, if the dashboard gets hit 1000 times per day (conservative):

  • 1000 requests × 4 queries × 3 seconds average = 12,000 seconds of query time
  • plus database load, connection pool pressure, etc.

the math checks out. background refresh overhead is tiny compared to saved query time.
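the same arithmetic as a few lines of ruby, plus the break-even point. the constants are the figures from this post; the sketch ignores the few milliseconds each view read still costs:

```ruby
REFRESH_SECONDS   = 27   # all 4 views, per refresh
REFRESHES_PER_DAY = 24   # hourly schedule
QUERIES_PER_LOAD  = 4
AVG_RAW_SECONDS   = 3.0  # rough average across the 4 raw queries

# database seconds spent refreshing views each day
def refresh_cost_per_day
  REFRESH_SECONDS * REFRESHES_PER_DAY
end

# database seconds spent on raw queries for a given number of page loads
def raw_cost_per_day(page_loads)
  page_loads * QUERIES_PER_LOAD * AVG_RAW_SECONDS
end

# daily page loads at which refreshing becomes cheaper than raw queries
def break_even_page_loads
  (refresh_cost_per_day / (QUERIES_PER_LOAD * AVG_RAW_SECONDS)).ceil
end

refresh_cost_per_day    # 648
raw_cost_per_day(1000)  # 12000.0
break_even_page_loads   # 54
```

so with these numbers the views pay for themselves past roughly 54 dashboard loads a day.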

edge cases

partial data during refresh: use CONCURRENTLY to avoid downtime, but it requires unique indexes and takes longer.

view dependencies: if views reference other views, refresh order matters. scenic handles this with cascade options.

schema changes: changing underlying tables requires updating and versioning the views. scenic makes this manageable with version files.

storage: materialized views duplicate data. monitor disk usage. my 4 views add maybe 50mb on top of 2gb of base tables. negligible.

wrapping up

350x to 9000x faster queries. 27 seconds of refresh time per hour. hour-old data that’s perfectly acceptable for analytics.

materialized views aren’t magic. they’re cached query results. but for dashboards on millions of rows, they transform unusable into instant.

the scenic gem makes them feel native to rails. write sql, run migrations, query like models.

check out the full repo if you want to try it. includes all the benchmarks, views, and a working dashboard you can load with test data.

tracking your wins with git

another year down, and i’m trying to remember what i actually built this year. honestly? it’s all a blur.

you know the feeling. you’ve been shipping code consistently, fixing bugs, building features, but when someone asks “what did you accomplish this year?” your brain just goes blank. was that auth refactor in march or july? did i ship the analytics dashboard before or after the mobile redesign?

the problem with developer memory

we’re constantly context switching. one day you’re debugging a race condition in the payment flow, the next you’re building a new onboarding experience, then suddenly you’re optimizing database queries because the dashboard is slow. each task feels important in the moment, but they all blend together over months.

whether it’s performance reviews, job interviews, or just internal reflection, people want concrete examples of your impact. “tell us about a complex technical challenge you solved” or “describe how you improved system performance.” but when everything feels like just another tuesday, it’s hard to remember which wins were actually significant.

your git history is your accomplishment log

every commit you make is a timestamp of progress. your git history contains:

  • exact dates of when you shipped features
  • the complexity and scope of changes
  • how many bugs you fixed vs features you built
  • patterns in your work (are you always fixing the same types of issues?)
  • collaboration evidence (co-authored commits, code reviews)

the trick is turning that raw commit data into a coherent story of growth and impact.

the magic command

here’s what i fed into an LLM to generate my yearly summary:

# get up to 50 of the year's commits as hash|date|subject
git log --author="you@example.com" \
        --since="2024-01-01" \
        --until="2024-12-31" \
        --pretty=format:"%h|%ad|%s" \
        --date=short \
        --all | head -50

then prompt your favorite LLM with:

“analyze these git commits and create a technical accomplishments summary. group by major themes like features, bug fixes, performance improvements, and security. highlight the business impact and technical complexity. include specific metrics where possible.”
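if you want some structure before handing things to an LLM, the hash|date|subject lines are easy to parse. a small sketch: Commit, parse_commits and commits_by_month are names i made up, and the sample log is invented data, not real commits:

```ruby
require "date"

Commit = Struct.new(:sha, :date, :subject)

# parse lines in the "%h|%ad|%s" format (with --date=short)
def parse_commits(log_output)
  log_output.each_line(chomp: true).reject(&:empty?).map do |line|
    sha, date, subject = line.split("|", 3) # limit 3: subjects may contain "|"
    Commit.new(sha, Date.iso8601(date), subject)
  end
end

# number of commits per "YYYY-MM" bucket
def commits_by_month(commits)
  commits.group_by { |commit| commit.date.strftime("%Y-%m") }
         .transform_values(&:size)
end

sample = <<~LOG
  a1b2c3d|2024-03-04|fix race condition in payment flow
  d4e5f6a|2024-03-18|add onboarding | step 2
  0f9e8d7|2024-07-02|optimize dashboard queries
LOG

monthly = commits_by_month(parse_commits(sample))
# two commits land in 2024-03 and one in 2024-07
```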

automating with a script

i’ve been using this technique for weekly standups too. here’s a script that automates the whole process:

#!/bin/bash
# git-standup.sh - Generate AI-powered standup reports from git commits

set -e

DAYS=${1:-7}  # Default to last 7 days
AUTHOR=${2:-$(git config user.email)}
ENV_FILE=${3:-~/.env}

# Source environment variables
source "$ENV_FILE"

# Get git commits
COMMITS=$(git log --author="$AUTHOR" \
    --since="$DAYS days ago" \
    --pretty=format:"%h|%ad|%s" \
    --date=short \
    --all \
    --no-merges | head -50)

# Prepare the prompt
PROMPT="Analyze these git commits and create a concise standup update. Focus on:
- What was accomplished (group similar work)
- Any blockers or challenges implied by the commits
- Key technical wins or improvements
- Format as: Completed, In Progress, Blockers, Notes

Commits (format: hash|date|message):
$COMMITS"

# Use OpenAI API
if [[ -n "$OPENAI_API_KEY" ]]; then
    ESCAPED_PROMPT=$(echo "$PROMPT" | jq -Rs .)

    curl -s https://api.openai.com/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $OPENAI_API_KEY" \
        -d "{
            \"model\": \"gpt-4o-mini\",
            \"messages\": [{
                \"role\": \"user\",
                \"content\": $ESCAPED_PROMPT
            }],
            \"max_tokens\": 1000
        }" | jq -r '.choices[0].message.content'
else
    echo "OPENAI_API_KEY is not set (expected in $ENV_FILE)" >&2
    exit 1
fi

just add your OPENAI_API_KEY to ~/.env and run:

./git-standup.sh           # last 7 days
./git-standup.sh 3         # last 3 days
./git-standup.sh 14 you@example.com  # custom timeframe/author
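the jq -Rs step in the script is what keeps the request body valid json when commit subjects contain quotes or newlines. if you'd rather do this part in ruby, the same idea looks roughly like this. standup_request_body is a hypothetical helper, the prompt is shortened, and only the body is built here (no http call):

```ruby
require "json"

# build the chat completions request body with a real json encoder,
# so quotes and newlines in commit subjects can't break the payload
def standup_request_body(commits, model: "gpt-4o-mini", max_tokens: 1000)
  prompt = "Analyze these git commits and create a concise standup update.\n\n" \
           "Commits (format: hash|date|message):\n#{commits}"

  JSON.generate({
    model: model,
    messages: [{ role: "user", content: prompt }],
    max_tokens: max_tokens
  })
end

body = standup_request_body(%(abc1234|2025-06-02|fix "quoted" title))
parsed = JSON.parse(body) # round-trips cleanly despite the embedded quotes
```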

what the analysis revealed

looking at my own year through this lens was… honestly pretty shocking. here’s what git actually tracked:

major feature developments:

  1. webinar platform (2025)
    • enhanced VCF platform to support webinars including:
      • post-registration functionality and user workflows
      • custom marketing capabilities for webinar events
      • early start/late end time configuration system
      • private webinar filtering for staff interfaces
    • technical impact: enabled virtual employment workshops, expanding platform capabilities beyond traditional job fairs
  2. virtual career fair (VCF) enhancements
    • developed VCF featured jobs system - complete job highlighting and promotion feature
    • built pre-event search functionality - allowing candidates to discover opportunities before events
    • enhanced chat welcome message formatting with WYSIWYG input
    • improved message template positioning and dropdown functionality
    • created mobile-responsive interfaces for exhibitor lists and candidate interactions
  3. event management & analytics
    • built sold-out event handling system with automatic waitlist functionality
    • created comprehensive messaging analytics with campaign details and performance tracking
    • enhanced control center real-time statistics display

security & performance contributions:

security hardening

  • strengthened password requirements and implemented secure reset flows
  • prevented user enumeration attacks in authentication systems
  • replaced insecure staff password generation with secure reset links
  • added CSRF protection and input sanitization improvements

performance optimization

  • resolved N+1 query issues in candidate searches and exhibitor displays
  • optimized database queries and added proper indexing
  • implemented efficient search filtering with elasticsearch integration
  • added query optimization for large dataset operations

technical problem solving:

mobile & responsive design

  • fixed critical mobile responsiveness issues across VCF interfaces
  • resolved viewport and layout problems for exhibitor schedules
  • implemented x-teleport solutions for dropdown menu clipping issues
  • enhanced mobile chat functionality with proper input visibility

data export & reporting

  • built comprehensive CSV export systems for:
    • staff organization users
    • client job postings with missing columns
    • candidate applications with enhanced filtering
    • event folder candidate downloads with rep attendance data

UI/UX improvements

  • implemented advanced filtering systems with sidebar interfaces
  • created sortable lists for completed one-on-one meetings

business impact contributions:

event operations

  • enhanced attendee tracking with view confirmation systems
  • implemented booth management with presence and broadcasting fixes
  • created candidate folder filtering for improved exhibitor experience
  • built time zone handling for multi-region events

client tools

  • developed job deletion workflows with automatic credit refunds
  • enhanced job application search with advanced filtering options
  • created messaging template systems with positioning improvements
  • implemented draft job management capabilities

quality assurance & testing

  • fixed flaky test issues with proper mocking and stubbing
  • implemented integration specs for complex workflows
  • added test helpers for consistent testing patterns

recent high-impact work (2024-2025):

  • september 2025: sold-out event handling and waitlist system
  • august 2025: VCF enhancements and messaging analytics
  • july 2025: staff candidate tracking and view confirmation
  • june 2025: security hardening and password requirements

technical skills demonstrated:

  • full-stack ruby on rails development
  • javascript/stimulus frontend frameworks
  • elasticsearch implementation and optimization
  • database design and query optimization
  • real-time features with turbo streams
  • mobile-responsive design patterns
  • security best practices implementation

beyond the basic stats

the real value isn’t just counting commits. it’s seeing patterns:

what types of problems do you gravitate toward? my commits showed i spend a lot of time on integration challenges, mobile responsiveness, and real-time features.

when are you most productive? my commit timestamps revealed patterns i never noticed. heavy feature work in morning sprints, bug fixes and optimization in the afternoon.

what’s your technical growth path? the progression from simple bug fixes early in the year to building complete subsystems later shows clear skill development. commits touching multiple systems prove comfort with complex, cross-cutting changes.

staying on top of it

keep a running note of the big wins as they happen. git gives you the data, but you need to capture the context: why was this hard? what would’ve happened if you didn’t fix it? how many users did this impact?

your commits are proof you’ve been busy. turning them into a story of impact? that’s the difference between “i wrote code” and “i moved the business forward.”

how that cloudflare outage happened (and how to avoid it)

so cloudflare had this massive outage recently. their tenant service api went down, taking the dashboard and a bunch of other apis with it. the root cause? a react useEffect dependency array bug that made their dashboard hammer the api with unnecessary requests.

here’s what went wrong…

the setup

they had a react component that needed to fetch data from their tenant service api. pretty standard stuff - throw it in a useEffect, call it a day:

useEffect(() => {
  fetchTenantData(config);
}, [config]);

looks fine, right? except config was an object that got recreated on every render.

why objects break dependency arrays

react’s dependency array uses Object.is() to check if dependencies changed (verified in react’s source - see packages/shared/objectIs.js). for primitives like strings and numbers, this works great:

Object.is('hello', 'hello') // true
Object.is(42, 42) // true

but for objects and arrays? different story:

Object.is({a: 1}, {a: 1}) // false!
Object.is([1, 2], [1, 2]) // false!

even if the contents are identical, they’re different object references. so when you do this:

function Dashboard() {
  const config = { endpoint: '/api/tenant' }; // new object every render!

  useEffect(() => {
    fetchData(config);
  }, [config]); // this runs every single render
}

that effect runs on every render. every state update. every prop change. everything.

the cascade failure

here’s where it gets interesting. the dashboard wasn’t just making one extra call - it was making dozens. why? because the api call itself was probably updating state:

  1. component renders → creates new config object
  2. useEffect sees “new” dependency → calls api
  3. api response updates state → triggers re-render
  4. go to step 1

add multiple components doing this, users refreshing the page, and a recent service update that made the tenant service less stable… boom. you’ve got an outage.
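to see how fast that loop adds up, here's a tiny model of the dependency check. this is ruby, not react: count_effect_runs is a made-up function that compares the dependency by identity (ruby's equal?, standing in for how Object.is treats objects) and counts how many times the "effect" would fire:

```ruby
# not react: a minimal model of "re-run the effect when the dependency changed"
def count_effect_runs(renders:, &make_config)
  previous = nil
  runs = 0
  renders.times do
    config = make_config.call
    next if config.equal?(previous) # identity check, like Object.is on objects

    runs += 1 # effect fires, which means another api call
    previous = config
  end
  runs
end

stable = { endpoint: "/api/tenant" }.freeze

fresh_each_render = count_effect_runs(renders: 10) { { endpoint: "/api/tenant" } } # 10
same_reference    = count_effect_runs(renders: 10) { stable }                      # 1
```

identical contents, ten calls versus one. the only difference is whether each render sees the same object.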

how to fix it

few options here:

option 1: useMemo

memoize the object so it keeps the same reference:

const config = useMemo(() => ({
  endpoint: '/api/tenant'
}), []); // only create once

useEffect(() => {
  fetchData(config);
}, [config]); // now this only runs once

option 2: primitive dependencies

instead of passing the whole object, use primitive values:

const endpoint = '/api/tenant';

useEffect(() => {
  fetchData({ endpoint });
}, [endpoint]); // strings compare by value

option 3: move it outside

if the config never changes, define it outside the component:

const CONFIG = { endpoint: '/api/tenant' };

function Dashboard() {
  useEffect(() => {
    fetchData(CONFIG);
  }, []); // no dependency needed
}

how eslint might have made it worse

here’s the ironic part: the exhaustive-deps rule might have actually caused this bug!

{
  "rules": {
    "react-hooks/exhaustive-deps": "error"
  }
}

imagine you start with this:

function Dashboard() {
  const config = { endpoint: '/api/tenant' };

  useEffect(() => {
    fetchData(config);
  }, []); // eslint error: missing dependency 'config'
}

the linter complains that config is used but not in the deps array. so you “fix” it:

useEffect(() => {
  fetchData(config);
}, [config]); // linter happy, performance dead

now your effect runs on every render because config is a new object each time. the linter pushed you into the bug!

the real fix is understanding why the warning exists and addressing the root cause (memoizing the object, using primitives, or moving it outside the component) rather than just making the linter happy.

deploying rails to aws ecs fargate with application load balancer health checks

a complete guide to containerizing a rails application and deploying it to aws ecs fargate with proper alb health check configuration.

overview

this guide walks through deploying a rails 8 application to aws using:

  • ecs fargate for serverless container orchestration
  • application load balancer (alb) for traffic routing and health checks
  • ecr for container image storage
  • secrets manager for secure configuration management
  • cloudwatch for logging

important security note: replace all placeholder values like [APP-NAME] and [ACCOUNT-ID] with your actual values. never commit these actual values to version control.

why ecs fargate over traditional deployment?

benefits of fargate:

  • no server management - aws handles os patches, scaling, security
  • pay-per-use pricing model
  • built-in integration with alb and other aws services
  • automatic scaling and load balancing
  • perfect for microservices and containerized applications

vs. traditional ec2:

  • no ssh access needed
  • no ami management
  • a service can be scaled down to zero tasks for cost savings
  • simpler operations and ci/cd

prerequisites

  • aws cli configured with appropriate permissions
  • docker installed locally
  • rails application with health check endpoint

step 1: containerizing the rails application

1.1 create dockerfile

rails 8 generates an excellent production-ready dockerfile. key components:

# multi-stage build for smaller final image
ARG RUBY_VERSION=3.2.9
FROM ruby:$RUBY_VERSION-slim AS base

# production environment configuration
ENV RAILS_ENV="production" \
    BUNDLE_DEPLOYMENT="1" \
    BUNDLE_PATH="/usr/local/bundle"

# thruster configuration for http proxy (recommended)
ENV TARGET_PORT=3000
ENV HTTP_PORT=80
EXPOSE 80

# use thruster to proxy port 80 → rails on port 3000
CMD ["./bin/thrust", "./bin/rails", "server", "-b", "0.0.0.0", "-p", "3000"]

1.2 health check endpoint

create a robust health check endpoint:

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def check
    render json: {
      status: "ok",
      timestamp: Time.current.iso8601,
      rails_version: Rails.version,
      environment: Rails.env
    }, status: :ok
  end
end

# config/routes.rb
Rails.application.routes.draw do
  get "health/check"
  # other routes...
end

1.3 docker entrypoint

simplify the entrypoint for containerized deployment:

#!/bin/bash -e
# bin/docker-entrypoint

# enable jemalloc for reduced memory usage
if [ -z "${LD_PRELOAD+x}" ]; then
    LD_PRELOAD=$(find /usr/lib -name libjemalloc.so.2 -print -quit)
    export LD_PRELOAD
fi

echo "starting rails server without database setup..."
exec "${@}"

1.4 platform compatibility

add linux platforms to gemfile.lock for cross-platform builds:

bundle lock --add-platform x86_64-linux aarch64-linux

step 2: aws infrastructure setup

2.1 create ecr repository

security note: use unique repository names to avoid conflicts with existing resources.

aws ecr create-repository --repository-name [APP-NAME] --region us-east-1

2.2 build and push docker image

# build for production architecture
docker buildx build --platform linux/amd64 -t [APP-NAME]:latest .

# tag and push to ecr
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin [ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com

docker tag [APP-NAME]:latest [ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com/[APP-NAME]:latest
docker push [ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com/[APP-NAME]:latest

2.3 vpc and networking setup

security consideration: this creates a new vpc. if you have existing infrastructure, consider using existing vpcs and subnets instead.

# create vpc
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 --region us-east-1 --query Vpc.VpcId --output text)

# create public subnets in different azs
SUBNET1=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.1.0/24 --availability-zone us-east-1a --query Subnet.SubnetId --output text)
SUBNET2=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.2.0/24 --availability-zone us-east-1b --query Subnet.SubnetId --output text)

# internet gateway and routing
IGW_ID=$(aws ec2 create-internet-gateway --query InternetGateway.InternetGatewayId --output text)
aws ec2 attach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW_ID

# route table configuration
RT_ID=$(aws ec2 create-route-table --vpc-id $VPC_ID --query RouteTable.RouteTableId --output text)
aws ec2 create-route --route-table-id $RT_ID --destination-cidr-block 0.0.0.0/0 --gateway-id $IGW_ID
aws ec2 associate-route-table --subnet-id $SUBNET1 --route-table-id $RT_ID
aws ec2 associate-route-table --subnet-id $SUBNET2 --route-table-id $RT_ID

# enable auto-assign public ips
aws ec2 modify-subnet-attribute --subnet-id $SUBNET1 --map-public-ip-on-launch
aws ec2 modify-subnet-attribute --subnet-id $SUBNET2 --map-public-ip-on-launch

step 3: application load balancer configuration

3.1 security groups

security note: the alb security group allows traffic from the entire internet (0.0.0.0/0). this is appropriate for public web applications but consider restricting if needed.

# alb security group
ALB_SG=$(aws ec2 create-security-group \
  --group-name [APP-NAME]-alb-sg \
  --description "security group for alb" \
  --vpc-id $VPC_ID \
  --query GroupId --output text)

aws ec2 authorize-security-group-ingress \
  --group-id $ALB_SG \
  --protocol tcp --port 80 --cidr 0.0.0.0/0

# ecs security group
ECS_SG=$(aws ec2 create-security-group \
  --group-name [APP-NAME]-ecs-sg \
  --description "security group for ecs tasks" \
  --vpc-id $VPC_ID \
  --query GroupId --output text)

aws ec2 authorize-security-group-ingress \
  --group-id $ECS_SG \
  --protocol tcp --port 80 --source-group $ALB_SG

3.2 create application load balancer

# create alb
ALB_ARN=$(aws elbv2 create-load-balancer \
  --name [APP-NAME]-alb \
  --subnets $SUBNET1 $SUBNET2 \
  --security-groups $ALB_SG \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)

# create target group with health check configuration
TG_ARN=$(aws elbv2 create-target-group \
  --name [APP-NAME]-targets \
  --protocol HTTP --port 80 \
  --vpc-id $VPC_ID \
  --target-type ip \
  --health-check-path /health/check \
  --health-check-protocol HTTP \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200 \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

# create listener
aws elbv2 create-listener \
  --load-balancer-arn $ALB_ARN \
  --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=$TG_ARN

3.3 health check configuration details

the alb health check configuration is critical for proper operation:

  • path: /health/check - your rails endpoint
  • success codes: 200 - http ok status
  • interval: 30 seconds - check frequency
  • timeout: 5 seconds - request timeout
  • healthy threshold: 2 - consecutive successful checks to mark healthy
  • unhealthy threshold: 3 - consecutive failed checks to mark unhealthy

step 4: ecs configuration

4.1 create ecs cluster

aws ecs create-cluster --cluster-name [APP-NAME]-cluster

4.2 iam role for task execution

security note: check if ecsTaskExecutionRole already exists in your account before creating it to avoid conflicts.

# create execution role (skip if it already exists)
aws iam create-role \
  --role-name ecsTaskExecutionRole \
  --assume-role-policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Principal":{"Service":"ecs-tasks.amazonaws.com"},
      "Action":"sts:AssumeRole"
    }]
  }'

# attach required policies
aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

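# allow reading only this app's secret (read/write access to every secret is
# broader than the task needs); the policy name is arbitrary
aws iam put-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-name [APP-NAME]-read-master-key \
  --policy-document '{
    "Version":"2012-10-17",
    "Statement":[{
      "Effect":"Allow",
      "Action":"secretsmanager:GetSecretValue",
      "Resource":"arn:aws:secretsmanager:us-east-1:[ACCOUNT-ID]:secret:[APP-NAME]/rails_master_key-*"
    }]
  }'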

4.3 secrets management

store sensitive configuration in aws secrets manager:

# store rails master key
aws secretsmanager create-secret \
  --name [APP-NAME]/rails_master_key \
  --secret-string "$(cat config/master.key)"

4.4 ecs task definition

{
  "family": "[APP-NAME]-task",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::[ACCOUNT-ID]:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "[APP-NAME]",
      "image": "[ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com/[APP-NAME]:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 80,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "RAILS_ENV",
          "value": "production"
        },
        {
          "name": "RAILS_LOG_TO_STDOUT",
          "value": "true"
        }
      ],
      "secrets": [
        {
          "name": "RAILS_MASTER_KEY",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:[ACCOUNT-ID]:secret:[APP-NAME]/rails_master_key"
        }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost/health/check || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/[APP-NAME]",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

4.5 register task definition and create service

# create cloudwatch log group
aws logs create-log-group --log-group-name /ecs/[APP-NAME]

# register task definition
TASK_DEF_ARN=$(aws ecs register-task-definition \
  --cli-input-json file://ecs-task-definition.json \
  --query 'taskDefinition.taskDefinitionArn' --output text)

# create ecs service
aws ecs create-service \
  --cluster [APP-NAME]-cluster \
  --service-name [APP-NAME]-service \
  --task-definition $TASK_DEF_ARN \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
    subnets=[$SUBNET1,$SUBNET2],
    securityGroups=[$ECS_SG],
    assignPublicIp=ENABLED
  }" \
  --load-balancers "targetGroupArn=$TG_ARN,containerName=[APP-NAME],containerPort=80"

step 5: deployment and testing

5.1 monitor deployment

# check service status
aws ecs describe-services --cluster [APP-NAME]-cluster --services [APP-NAME]-service --region us-east-1

# check target health
aws elbv2 describe-target-health --target-group-arn $TG_ARN --region us-east-1

# view logs
aws logs get-log-events --log-group-name /ecs/[APP-NAME] --log-stream-name [LOG-STREAM] --region us-east-1

5.2 test health checks

# get alb dns name
ALB_DNS=$(aws elbv2 describe-load-balancers \
  --load-balancer-arns $ALB_ARN \
  --query 'LoadBalancers[0].DNSName' --output text)

# test health check endpoint
curl http://$ALB_DNS/health/check

expected response:

{
  "status": "ok",
  "timestamp": "2025-09-13T21:28:14Z",
  "rails_version": "8.0.2.1",
  "environment": "production"
}

common issues and solutions

container permission errors

issue: permission denied - bind(2) for "0.0.0.0" port 80

solution options:

option a: use non-privileged port (recommended for security)

# run as non-root user on port 3000
USER rails:rails
EXPOSE 3000
CMD ["./bin/rails", "server", "-b", "0.0.0.0", "-p", "3000"]

# update alb target group to port 3000
# update ecs security group to allow port 3000 from alb

option b: use thruster proxy (better performance)

# run as root to bind privileged port, but thruster drops privileges
ENV TARGET_PORT=3000
ENV HTTP_PORT=80
EXPOSE 80
CMD ["./bin/thrust", "./bin/rails", "server", "-b", "0.0.0.0", "-p", "3000"]

# benefits: http/2, compression, static file serving, caching
# security: thruster runs as root but rails process runs as rails user

thruster benefits you lose with option a:

  • http/2 support
  • automatic compression (gzip/brotli)
  • static file serving optimizations
  • built-in caching
  • x-sendfile support for efficient file downloads

health check failures

issue: alb showing 502/503 errors

solutions:

  1. verify health check path matches your rails route
  2. ensure container is listening on the correct port
  3. check security group allows alb → ecs communication
  4. review container logs for startup errors

platform compatibility

issue: exec format error in container logs

solution: build for correct architecture:

docker buildx build --platform linux/amd64 -t [APP-NAME]:latest .

security considerations

best practices implemented

  1. secrets management: sensitive data stored in aws secrets manager
  2. network security: security groups restrict access between components
  3. least privilege: iam roles with minimal required permissions
  4. container security: multi-stage builds reduce attack surface

security group rules

with option a (port 3000):

  • alb sg: allow http (80) from internet
  • ecs sg: allow http (3000) only from alb sg
  • alb handles port 80 → 3000 mapping

with option b (thruster):

  • alb sg: allow http (80) from internet
  • ecs sg: allow http (80) only from alb sg
  • thruster handles http optimizations

security trade-offs

option a (non-privileged port):

  • ✅ better: no root processes
  • ✅ better: principle of least privilege
  • ❌ worse: no http/2, compression, caching
  • ❌ worse: higher resource usage for static files

option b (thruster):

  • ✅ better: http/2, compression, optimizations
  • ✅ better: rails process still runs as non-root
  • ⚠️ acceptable: thruster proxy runs as root (industry standard)
  • ⚠️ acceptable: container isolation provides security boundary

recommendation: use thruster (option b) unless you have strict security requirements that prohibit any root processes.

cost optimization

fargate pricing factors

  • cpu allocation: 256 cpu units (0.25 vcpu)
  • memory allocation: 512 mb ram
  • running time: pay per second, minimum 1 minute

cost-saving tips

  1. right-size resources: start small, monitor, and adjust
  2. use spot pricing: for non-critical workloads
  3. scale to zero: during low-traffic periods
  4. monitor usage: cloudwatch metrics for optimization

monitoring and logging

cloudwatch integration

  • container logs: automatically streamed to cloudwatch
  • metrics: cpu, memory, network utilization
  • alarms: set up alerts for health check failures

health check monitoring

# create cloudwatch alarm for unhealthy targets
# cloudwatch dimensions take the arn suffixes, not the full arns
TG_DIM=${TG_ARN##*:}                # targetgroup/[name]/[id]
ALB_DIM=${ALB_ARN#*:loadbalancer/}  # app/[name]/[id]

aws cloudwatch put-metric-alarm \
  --alarm-name "[APP-NAME]-unhealthy-targets" \
  --alarm-description "alb has unhealthy targets" \
  --metric-name UnHealthyHostCount \
  --namespace AWS/ApplicationELB \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=TargetGroup,Value=$TG_DIM Name=LoadBalancer,Value=$ALB_DIM

deployment commands summary

here’s the complete sequence of commands to deploy your rails app:

# 1. build and push image
docker buildx build --platform linux/amd64 -t [APP-NAME]:latest .
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin [ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com
docker tag [APP-NAME]:latest [ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com/[APP-NAME]:latest
docker push [ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com/[APP-NAME]:latest

# 2. create infrastructure
aws ecs create-cluster --cluster-name [APP-NAME]-cluster --region us-east-1
aws logs create-log-group --log-group-name /ecs/[APP-NAME] --region us-east-1

# 3. register task definition and deploy
aws ecs register-task-definition --cli-input-json file://ecs-task-definition.json --region us-east-1
aws ecs create-service --cluster [APP-NAME]-cluster --service-name [APP-NAME]-service --task-definition [APP-NAME]-task:1 --desired-count 2 --launch-type FARGATE --network-configuration "awsvpcConfiguration={subnets=[subnet-ids],securityGroups=[ecs-sg-id],assignPublicIp=ENABLED}" --load-balancers "targetGroupArn=[tg-arn],containerName=[APP-NAME],containerPort=80" --region us-east-1

# 4. test deployment
curl http://[alb-dns]/health/check

high availability and reliability patterns

current availability with 2 containers

our basic deployment with desired-count: 2 provides:

  • basic redundancy: if one container fails, traffic routes to the healthy container
  • rolling updates: ecs can update one container at a time without downtime
  • automatic recovery: failed containers are automatically restarted
  • estimated availability: ~99.5% (basic level)

achieving higher availability (99.9%+)

for production applications requiring maximum uptime, implement these patterns:

1. multi-az deployment with increased capacity

{
  "serviceName": "[APP-NAME]-service-ha",
  "desiredCount": 4,
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 50,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  },
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": ["subnet-1a", "subnet-1b", "subnet-1c"],
      "securityGroups": ["sg-ecs"],
      "assignPublicIp": "ENABLED"
    }
  }
}

benefits:

  • 4 containers across 3 availability zones
  • can lose entire az and maintain service
  • circuit breaker automatically rolls back failed deployments
  • deployment flexibility allows 100% capacity increase during deployments

2. auto scaling configuration

# create auto scaling target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/[APP-NAME]-cluster/[APP-NAME]-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 4 \
  --max-capacity 20

# cpu-based scaling policy
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/[APP-NAME]-cluster/[APP-NAME]-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-scaling \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 300,
    "ScaleInCooldown": 300
  }'

3. enhanced health checks

extend your health controller for comprehensive monitoring:

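# app/controllers/health_controller.rb
class HealthController < ApplicationController
  # monotonic timestamp taken when the class loads, used for process uptime
  BOOTED_AT = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  def check
    health_data = {
      status: "ok",
      timestamp: Time.current.iso8601,
      rails_version: Rails.version,
      environment: Rails.env,
      uptime: uptime_seconds,
      memory: memory_usage,
      checks: {
        database: database_check,
        redis: redis_check,
        storage: storage_check
      }
    }

    if health_data[:checks].values.all? { |check| check[:status] == "ok" }
      render json: health_data, status: :ok
    else
      health_data[:status] = "error"
      render json: health_data, status: :service_unavailable
    end
  end

  private

  # runs the block; returns its duration on success or the error message on failure
  def timed_check
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    { status: "ok", response_time_ms: (elapsed * 1000).round(1) }
  rescue StandardError => e
    { status: "error", message: e.message }
  end

  def database_check
    timed_check { ActiveRecord::Base.connection.execute("SELECT 1") }
  end

  # assumption: the redis gem is installed and REDIS_URL is set; remove this check otherwise
  def redis_check
    timed_check do
      redis = Redis.new(url: ENV.fetch("REDIS_URL"))
      redis.ping
      redis.close
    end
  end

  # assumption: "storage" here means the app's tmp directory is writable
  def storage_check
    timed_check { raise "tmp directory is not writable" unless File.writable?(Rails.root.join("tmp")) }
  end

  def memory_usage
    {
      rss_mb: `ps -o rss= -p #{Process.pid}`.strip.to_i / 1024,
      gc_count: GC.count,
      heap_slots: GC.stat[:heap_live_slots]
    }
  end

  # seconds since this process loaded the controller
  # (Process::CLOCK_UPTIME is not defined on linux, so use the monotonic clock)
  def uptime_seconds
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - BOOTED_AT).to_i
  end
end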

4. monitoring and alerting setup

# create comprehensive alarms
aws cloudwatch put-metric-alarm \
  --alarm-name "[APP-NAME]-high-cpu" \
  --alarm-description "high cpu utilization" \
  --metric-name CPUUtilization \
  --namespace AWS/ECS \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=ServiceName,Value=[APP-NAME]-service Name=ClusterName,Value=[APP-NAME]-cluster

aws cloudwatch put-metric-alarm \
  --alarm-name "[APP-NAME]-response-time" \
  --alarm-description "high response time" \
  --metric-name TargetResponseTime \
  --namespace AWS/ApplicationELB \
  --statistic Average \
  --period 300 \
  --evaluation-periods 3 \
  --threshold 2.0 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=LoadBalancer,Value=[ALB-FULL-NAME]

5. graceful shutdown handling

rails applications handle sigterm gracefully by default with puma. configure ecs task definition for proper shutdown timing:

{
  "containerDefinitions": [{
    "stopTimeout": 30,
    "healthCheck": {
      "command": ["CMD-SHELL", "curl -f http://localhost/health/check || exit 1"],
      "interval": 15,
      "timeout": 5,
      "retries": 3,
      "startPeriod": 45
    }
  }]
}

availability comparison

pattern      containers   azs   estimated availability   recovery time
basic        2            2     99.5%                    2-3 minutes
enhanced     4            3     99.9%                    30 seconds
enterprise   6+           3+    99.95%+                  10 seconds

cost vs availability trade-offs

basic deployment (2 containers):

  • cost: ~$30/month for small workloads
  • availability: sufficient for internal tools, staging
  • recovery: manual intervention may be needed

high availability (4+ containers):

  • cost: ~$60-120/month depending on scale
  • availability: production-ready for business applications
  • recovery: automatic with circuit breakers

enterprise (6+ containers + auto-scaling):

  • cost: variable, $100-500+/month based on traffic
  • availability: mission-critical applications
  • recovery: instant failover across multiple zones

deployment pipeline for ha

# 1. build and test
docker buildx build --platform linux/amd64 -t [APP-NAME]:latest .
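docker run --rm -d --name [APP-NAME]-smoke -p 3000:80 \
  -e RAILS_MASTER_KEY="$(cat config/master.key)" [APP-NAME]:latest
sleep 10
curl -f http://localhost:3000/health/check || { docker stop [APP-NAME]-smoke; exit 1; }
docker stop [APP-NAME]-smoke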

# 2. push to ecr
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin [ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com
docker tag [APP-NAME]:latest [ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com/[APP-NAME]:$(git rev-parse --short HEAD)
docker push [ACCOUNT-ID].dkr.ecr.us-east-1.amazonaws.com/[APP-NAME]:$(git rev-parse --short HEAD)

# 3. register a task definition that points at the new image tag
SHA=$(git rev-parse --short HEAD)
sed "s/:latest/:$SHA/g" ecs-task-definition.json > ecs-task-definition-$SHA.json
TASK_DEF_ARN=$(aws ecs register-task-definition \
  --cli-input-json file://ecs-task-definition-$SHA.json \
  --query 'taskDefinition.taskDefinitionArn' --output text)

# 4. update service (ecs handles rolling deployment)
aws ecs update-service \
  --cluster [APP-NAME]-cluster \
  --service [APP-NAME]-service \
  --task-definition $TASK_DEF_ARN

# 5. wait for deployment to complete
aws ecs wait services-stable --cluster [APP-NAME]-cluster --services [APP-NAME]-service

for most production applications, this configuration provides excellent availability:

{
  "desiredCount": 4,
  "deploymentConfiguration": {
    "maximumPercent": 150,
    "minimumHealthyPercent": 75,
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    }
  },
  "healthCheckGracePeriodSeconds": 60
}

key benefits:

  • 4 containers provide redundancy across az failures
  • 75% minimum ensures 3 containers always running during deployments
  • circuit breaker prevents bad deployments from taking down service
  • reasonable costs while maintaining high availability
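the 75% figure is easy to check by hand, and the same arithmetic applies to any other combination of settings. a minimal go sketch of it follows; `deploymentBounds` is a made-up helper, not an aws api, and it assumes ecs rounds the lower bound up and the upper bound down, which is worth confirming against the ecs docs for your setup.

```go
package main

import "fmt"

// deploymentBounds returns how many tasks must stay running (lower) and the
// most tasks that may run at once (upper) during a rolling deployment.
// assumption: the lower bound rounds up and the upper bound rounds down.
func deploymentBounds(desired, minHealthyPct, maxPct int) (lower, upper int) {
	lower = (desired*minHealthyPct + 99) / 100 // integer ceiling
	upper = desired * maxPct / 100             // integer division rounds down
	return lower, upper
}

func main() {
	// the recommended configuration: 4 tasks, 75% minimum, 150% maximum
	lower, upper := deploymentBounds(4, 75, 150)
	fmt.Printf("desired=4 min=75%% max=150%% -> keep %d running, allow up to %d\n", lower, upper)

	// the earlier multi-az example: 4 tasks, 50% minimum, 200% maximum
	lower, upper = deploymentBounds(4, 50, 200)
	fmt.Printf("desired=4 min=50%% max=200%% -> keep %d running, allow up to %d\n", lower, upper)
}
```

with the 75/150 settings a deployment can start at most two extra tasks at a time and never drops below three, so rollouts are slower than with 50/200 but keep more capacity in service.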

conclusion

this deployment approach provides:

  • scalable architecture that grows with your application
  • high availability across multiple azs with configurable redundancy levels
  • proper health monitoring with comprehensive alb and container health checks
  • security best practices with secrets management
  • cost-effective operations with serverless containers
  • reliability patterns including auto-scaling, circuit breakers, and graceful shutdowns

the combination of alb health checks, ecs service management, and proper application health endpoints creates a robust production deployment that can achieve 99.9%+ availability for business-critical applications.

for production environments, consider adding:

  • database integration (rds with multi-az)
  • ssl/tls termination at alb
  • cdn (cloudfront) for global performance
  • comprehensive monitoring and alerting
  • backup and disaster recovery strategies
  • blue/green or canary deployments

repository structure

├── Dockerfile                 # container definition
├── docker-compose.yml        # local development
├── ecs-task-definition.json  # ecs configuration
├── app/
│   └── controllers/
│       └── health_controller.rb
├── config/
│   └── routes.rb
└── bin/
    └── docker-entrypoint

this guide demonstrates a complete production-ready rails deployment on aws using modern containerization and infrastructure practices.

Building a Terminal IRC Client with Bubble Tea: A Deep Dive into Go's TUI Framework

When I decided to build a modern IRC client for the terminal, I wanted something more sophisticated than the typical ncurses-based applications. Enter Bubble Tea, Charm’s powerful framework for building terminal user interfaces in Go. In this post, I’ll walk through how Bubble Tea works and how I used it to create a feature-rich IRC client.

What is Bubble Tea?

Bubble Tea is based on The Elm Architecture, bringing functional programming concepts to terminal UIs. It follows a simple pattern:

  • Model: Your application state
  • Update: A function that modifies state based on messages
  • View: A function that renders the current state

This architecture makes applications predictable, testable, and easy to reason about.

The Elm Architecture in Bubble Tea

According to the Bubble Tea repository, it’s “based on the functional design paradigms of The Elm Architecture”. Here’s how it works:

The Four Pillars

Every Bubble Tea program consists of:

  1. Model: A struct that holds your entire application state
  2. Init(): Returns an optional command to run at startup (the initial model itself is passed to tea.NewProgram)
  3. Update(msg tea.Msg): Receives messages and returns an updated model plus an optional command
  4. View(): Takes the model and returns a string representation

Here’s the minimal interface every Bubble Tea program must implement:

type Model interface {
    Init() Cmd
    Update(Msg) (Model, Cmd)
    View() string
}

How It Works

A key concept: You implement these methods, but you never call them. The framework calls your code:

// What you write:
func (m IRCModel) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
    switch msg := msg.(type) {
    case tea.KeyMsg:
        // Handle user input
    case msgConnected:
        // Handle IRC connection
    }
    return m, nil
}

func main() {
    p := tea.NewProgram(InitialModel())
    p.Run()  // You call this once, then Bubble Tea takes over
}

Inside p.Run(), Bubble Tea’s event loop calls your methods:

// What Bubble Tea does (you never write this):
for {
    select {
    case msg := <-p.msgs:
        model, cmd = model.Update(msg)  // Framework calls YOUR Update
        handleCommand(cmd)              // Framework handles returned command
        render(model.View())            // Framework calls YOUR View
    }
}
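To make that loop concrete, here is a stdlib-only sketch of the same shape. It does not use Bubble Tea at all: `model`, `update`, `view`, and `run` are hypothetical stand-ins, and the real event loop also deals with input parsing, signals, and rendering.

```go
package main

import "fmt"

// msg stands in for tea.Msg: any value can be a message.
type msg any

// Hypothetical message types for this sketch.
type keyMsg rune
type connectedMsg struct{}

// model is the whole application state.
type model struct {
	typed     string
	connected bool
}

// update returns the next model for a message; it touches no global state.
func update(m model, in msg) model {
	switch in := in.(type) {
	case keyMsg:
		m.typed += string(rune(in))
	case connectedMsg:
		m.connected = true
	}
	return m
}

// view renders the model as a string, the way View() does.
func view(m model) string {
	status := "Disconnected"
	if m.connected {
		status = "Connected"
	}
	return fmt.Sprintf("[%s] > %s", status, m.typed)
}

// run plays the role of the framework: it owns the loop and calls update
// and view; application code never calls them directly.
func run(m model, msgs <-chan msg, render func(string)) model {
	for in := range msgs {
		m = update(m, in)
		render(view(m))
	}
	return m
}

func main() {
	msgs := make(chan msg, 3)
	msgs <- keyMsg('h')
	msgs <- connectedMsg{}
	msgs <- keyMsg('i')
	close(msgs)

	run(model{}, msgs, func(frame string) { fmt.Println(frame) })
}
```

Because `update` and `view` are plain functions of their inputs, they can be tested without a terminal, which is a large part of why this architecture is easy to reason about.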

The Message Flow

The genius of this architecture is its unidirectional data flow:

    ┌─────────────────┐
    │                 │
    │     Model       │◄─────────────┐
    │                 │              │
    └────────┬────────┘              │
             │                       │
             ▼                       │
    ┌─────────────────┐              │
    │                 │              │
    │      View       │              │
    │                 │              │
    └────────┬────────┘              │
             │                       │
             ▼                       │
        Terminal                     │
         Display                     │
             │                       │
         User Input                  │
             │                       │
             ▼                       │
    ┌─────────────────┐              │
    │                 │              │
    │     Update      │──────────────┘
    │                 │
    └─────────────────┘

Messages flow in one direction: User input → Update → Model → View → Display.

Why this matters: In traditional UI programming, different parts of your app can modify state directly, leading to chaos:

// Traditional approach - multiple places changing state
func onKeyPress() {
    sidebar.addChannel("#golang")
    chatArea.updateUserCount(42)
    statusBar.setConnected(true)
    // Who changed what? When? In what order?
}

func onNetworkEvent() {
    sidebar.removeUser("bob")
    chatArea.addMessage("bob left")
    // Now sidebar and chat area might be out of sync!
}

With Bubble Tea’s unidirectional flow, only one place can change state:

// Bubble Tea approach - all changes go through Update
func (m IRCModel) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
    switch msg := msg.(type) {
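    case UserLeftMsg:
        // Remove from users list. channelUsers holds slices, so filter the
        // slice (slices package, Go 1.21+) rather than calling delete.
        m.channelUsers[msg.channel] = slices.DeleteFunc(
            m.channelUsers[msg.channel],
            func(u string) bool { return u == msg.user },
        )
        // Add to message history
        m.addMessage(msg.channel, fmt.Sprintf("%s left", msg.user))
        // State is always consistent!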
    }
    return m, nil
}

This guarantees your UI state is always consistent because there’s only one path for changes.

Why This Matters for Terminal UIs

Traditional terminal UI libraries like ncurses use imperative updates:

// ncurses - imperative, stateful
mvprintw(10, 20, "Status: ");
if (connected) {
    attron(COLOR_PAIR(GREEN));
    printw("Connected");
} else {
    attron(COLOR_PAIR(RED));
    printw("Disconnected");
}
refresh();

With Bubble Tea’s Elm Architecture:

// Bubble Tea - declarative, functional
func (m Model) View() string {
    status := "Disconnected"
    if m.connected {
        status = "Connected"
    }
    return fmt.Sprintf("Status: %s", status)
}

The framework handles all the diffing, rendering, and optimization. You just describe what you want to see.

The Core Architecture

Here’s how I structured the IRC client using Bubble Tea:

type IRCModel struct {
    // UI components
    viewport        viewport.Model
    sidebarViewport viewport.Model
    textarea        textarea.Model
    
    // Application state
    allMessages     map[string][]string
    channels        map[string]bool
    channelUsers    map[string][]string
    activeChannel   string
    sidebarFocused  bool
    connected       bool
    
    // Layout
    width           int
    height          int
    sidebarWidth    int
}

The model contains both UI components (viewports, textarea) and application state (channels, messages, users). This separation allows for clean state management while leveraging Bubble Tea’s built-in components.

The Update Loop

The heart of any Bubble Tea application is the Update function, which handles all events:

func (m IRCModel) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
    // Commands returned by the child components
    var tiCmd, vpCmd, svpCmd tea.Cmd

    // Route input based on focus
    if !m.sidebarFocused {
        m.textarea, tiCmd = m.textarea.Update(msg)
        m.viewport, vpCmd = m.viewport.Update(msg)
    } else {
        m.sidebarViewport, svpCmd = m.sidebarViewport.Update(msg)
    }

    switch msg := msg.(type) {
    case tea.WindowSizeMsg:
        m.handleResize(msg)
        
    case tea.KeyMsg:
        return m.handleKeypress(msg)
        
    case msgConnected:
        return m.handleConnection(msg)
        
    case msgReceived:
        return m.handleIRCMessage(msg)
    }
    
    return m, tea.Batch(tiCmd, vpCmd, svpCmd)
}

Notice how different message types are handled separately. This pattern makes it easy to add new features without breaking existing functionality.

Custom Message Types

One powerful feature of Bubble Tea is custom message types. For IRC, I created specific messages for different network events:

type msgConnected struct {
    conn net.Conn
}

type msgReceived struct {
    text string
}

type errMsg error

These messages are sent through commands, which are functions that return messages:

func connectToIRC(server, nickname string) tea.Cmd {
    return func() tea.Msg {
        conn, err := net.Dial("tcp", server)
        if err != nil {
            return errMsg(err)
        }
        
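        // Send IRC registration. bufio.Writer remembers the first write
        // error, so checking Flush covers both lines.
        writer := bufio.NewWriter(conn)
        fmt.Fprintf(writer, "NICK %s\r\n", nickname)
        fmt.Fprintf(writer, "USER %s 0 * :%s\r\n", nickname, nickname)
        if err := writer.Flush(); err != nil {
            conn.Close()
            return errMsg(err)
        }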
        
        return msgConnected{conn: conn}
    }
}

This approach keeps the UI responsive while handling network operations in the background.

Layout with Golden Ratio

For the visual design, I implemented a golden ratio layout to create pleasing proportions:

goldenRatio := 1.618
m.sidebarWidth = int(float64(msg.Width) / (goldenRatio + 1.0))

// Ensure reasonable bounds
if m.sidebarWidth < 15 {
    m.sidebarWidth = 15
}
if m.sidebarWidth > 25 {
    m.sidebarWidth = 25
}

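This targets a sidebar of about 38% of the terminal width (1 / (1 + 1.618) ≈ 0.382), following the golden ratio principle for visual harmony. The bounds take over outside a fairly narrow range: the ratio only decides the width on terminals roughly 40 to 68 columns wide, and anything wider gets the 25-column cap.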
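To see how the ratio and the bounds interact, the calculation can be pulled into a small function and run over a few terminal widths. This is a sketch: `sidebarWidthFor` is a hypothetical name, and the constants are copied from the snippet above.

```go
package main

import "fmt"

const goldenRatio = 1.618

// sidebarWidthFor mirrors the layout code above: width / (goldenRatio + 1),
// clamped to the range [15, 25].
func sidebarWidthFor(termWidth int) int {
	w := int(float64(termWidth) / (goldenRatio + 1.0))
	if w < 15 {
		w = 15
	}
	if w > 25 {
		w = 25
	}
	return w
}

func main() {
	for _, width := range []int{30, 50, 60, 80, 120} {
		sidebar := sidebarWidthFor(width)
		share := 100 * float64(sidebar) / float64(width)
		fmt.Printf("terminal %3d cols -> sidebar %2d cols (%.0f%%)\n", width, sidebar, share)
	}
}
```

On an 80-column terminal the unclamped value would be 30, so the cap brings it down to 25 columns, or about 31% of the width.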

Independent Scrolling with Focus Management

One challenge was implementing independent scrolling for the sidebar and main chat area. I solved this with a focus system:

case tea.KeyTab:
    m.sidebarFocused = !m.sidebarFocused
    if m.sidebarFocused {
        m.textarea.Blur()
    } else {
        m.textarea.Focus()
    }

When the sidebar is focused, arrow keys scroll through channels and users. When the chat is focused, they scroll through message history. This gives users full control over both areas independently.

Real-time Updates

IRC requires real-time message handling. I set up a continuous message loop:

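// The scanner should be created once per connection (bufio.NewScanner(conn))
// and reused: a scanner reads ahead, so creating a fresh one on every call
// can silently drop lines that were already buffered.
func waitForMessage(scanner *bufio.Scanner) tea.Cmd {
    return func() tea.Msg {
        if scanner.Scan() {
            return msgReceived{text: scanner.Text()}
        }
        if err := scanner.Err(); err != nil {
            return errMsg(err)
        }
        // Scan returned false without an error: the server closed the connection
        return errMsg(io.EOF)
    }
}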

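Each time a message is received, it triggers an update that parses the IRC line and updates the appropriate channel or user list. Because a command runs only once, Update has to return waitForMessage again after each msgReceived to keep the read loop going.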
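The parsing step can be sketched with the standard library alone. This follows the basic IRC message layout from RFC 1459 (optional prefix, command, parameters, optional trailing parameter); `ircLine` and `parseIRCLine` are hypothetical names, and the client's actual parser may be organized differently.

```go
package main

import (
	"fmt"
	"strings"
)

// ircLine is a hypothetical parsed form of one raw IRC line.
type ircLine struct {
	Prefix  string   // sender, e.g. "nick!user@host" (empty if absent)
	Command string   // e.g. "PRIVMSG", "JOIN", "PING", or a numeric like "353"
	Params  []string // middle parameters, with the trailing one (if any) last
}

// parseIRCLine splits a raw line of the form
// [":" prefix SPACE] command [params] [" :" trailing].
// The boolean result is false when the line has no command.
func parseIRCLine(raw string) (ircLine, bool) {
	var out ircLine
	rest := strings.TrimRight(raw, "\r\n")

	// Optional prefix, introduced by a leading colon.
	if strings.HasPrefix(rest, ":") {
		prefix, after, found := strings.Cut(rest[1:], " ")
		if !found {
			return out, false
		}
		out.Prefix = prefix
		rest = after
	}

	// Optional trailing parameter, which may itself contain spaces.
	head, trailing, hasTrailing := strings.Cut(rest, " :")

	fields := strings.Fields(head)
	if len(fields) == 0 {
		return out, false
	}
	out.Command = strings.ToUpper(fields[0])
	out.Params = append(out.Params, fields[1:]...)
	if hasTrailing {
		out.Params = append(out.Params, trailing)
	}
	return out, true
}

func main() {
	for _, raw := range []string{
		":bob!bob@example.org PRIVMSG #golang :hello world\r\n",
		"PING :irc.example.org\r\n",
	} {
		line, ok := parseIRCLine(raw)
		fmt.Printf("ok=%v prefix=%q command=%q params=%q\n", ok, line.Prefix, line.Command, line.Params)
	}
}
```

A parsed line like this maps naturally onto custom message types: the command decides which message to emit, and the parameters carry the channel and text.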

Styling with Lipgloss

Bubble Tea integrates beautifully with Lipgloss for styling. I created adaptive styles that work in both light and dark terminals:

var (
    titleStyle = lipgloss.NewStyle().
        Foreground(lipgloss.AdaptiveColor{Light: "#FFFFFF", Dark: "#FFFDF5"}).
        Background(lipgloss.AdaptiveColor{Light: "#0969DA", Dark: "#25A065"}).
        Padding(0, 1)

    userStyle = lipgloss.NewStyle().
        Foreground(lipgloss.AdaptiveColor{Light: "#1A7F37", Dark: "#7EE787"})
)

This ensures the client looks great regardless of the terminal’s color scheme.

Under the Hood: How Bubble Tea Prevents UI Blocking

Looking at the Bubble Tea source code reveals elegant concurrency patterns that keep the UI responsive. Here’s how it actually works:

The Message Channel Architecture

Bubble Tea uses a central message channel (p.msgs) as the communication hub:

func (p *Program) Send(msg Msg) {
    select {
    case <-p.ctx.Done():
    case p.msgs <- msg:
    }
}

This channel allows background goroutines to safely send messages back to the main event loop without blocking.

Command Execution in Goroutines

When you return a tea.Cmd, Bubble Tea spawns a goroutine to execute it:

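// Simplified from Bubble Tea's source
func (p *Program) handleCommands(cmds chan Cmd) chan struct{} {
    ch := make(chan struct{})

    go func() {
        defer close(ch) // signals that the command loop has stopped

        for {
            select {
            case <-p.ctx.Done():
                return // program is shutting down

            case cmd := <-cmds:
                if cmd == nil {
                    continue // Update returned no command
                }
                go func() {
                    // Each command runs in its own goroutine
                    msg := cmd()
                    p.Send(msg)  // Send result back to main loop
                }()
            }
        }
    }()

    return ch
}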

Key benefits:

  1. Non-blocking execution - Long-running operations don’t freeze the UI
  2. Panic recovery - In recent versions, a panic inside a command is caught so the terminal can be restored instead of being left in raw mode
  3. Graceful cleanup - Context cancellation stops all goroutines on exit

The Event Loop

The main event loop processes messages sequentially, ensuring thread safety:

// Simplified from Bubble Tea's source
func (p *Program) eventLoop(model Model, cmds chan Cmd) (Model, error) {
    for {
        select {
        case <-p.ctx.Done():
            return model, nil

        case msg := <-p.msgs:
            // Update model (always on the event loop goroutine)
            var cmd Cmd
            model, cmd = model.Update(msg)

            // Send new commands for background execution
            select {
            case cmds <- cmd:
            case <-p.ctx.Done():
                return model, nil
            }

            // Hand the new view to the renderer, which repaints on its next frame
            p.renderer.write(model.View())
        }
    }
}

Why This Design Matters

In the IRC client, when connectToIRC() makes a network call:

  1. Network operation runs in background goroutine (doesn’t block UI)
  2. User can still type, scroll, resize (UI remains responsive)
  3. When connection completes, sends msgConnected (thread-safe communication)
  4. Main loop processes message and updates model (sequential, no race conditions)
  5. UI re-renders with new state (immediate visual feedback)

This is why you can have dozens of ongoing network operations (IRC reads, user lookups, etc.) without any UI lag or complex synchronization code.

Source Code

You can check out the complete IRC client source code at github.com/sngeth/chat.