
What re:Invent Changed About How I Build Software

Practical AWS and AI engineering patterns for faster delivery, better reliability, and lower costs.

By Garrett John Law — Software Engineer II @ Re:Build Manufacturing

December 5, 2025

AWS re:Invent 2025

I just got back from AWS re:Invent in Las Vegas. This post covers the patterns I'm bringing back to my team—not the keynote hype, but session-level details that translate into real improvements within 90 days.

TL;DR

  • Faster security reviews for AI agents using Bedrock's new policy controls
  • Ship AI features only when they beat baseline metrics through automated checks
  • Treat model training, fine-tuning, and inference as separate infrastructure problems
  • Turn chaos experiments into real disaster recovery playbooks with measurable targets

Shipping AI Agents Safely

Ready to deploy now:

  • Bedrock AgentCore Identity and Policy makes AI agents enterprise-ready. You define exactly what tools an agent can access upfront, with strict boundaries. Security reviews go from "let's see what this thing can do" to "here's the schema, here are the limits" (a conceptual sketch of that kind of tool contract follows this list). I've seen these reviews take weeks; this cuts that significantly.
  • Service tiers for inference separate real-time requests from batch workloads, giving you predictable response times for user-facing features while keeping costs down for background processing.
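To make the "schema and limits" idea concrete, here's a minimal sketch of the kind of upfront tool contract I mean. This is not the AgentCore Policy API; the tool names, parameters, and caps are hypothetical. The point is only that the boundaries exist in a reviewable form before the agent ships.

```python
# Conceptual sketch only -- not the AgentCore Policy API.
# Tool names, parameters, and limits below are hypothetical.
ALLOWED_TOOLS = {
    "lookup_order": {
        "access": "read-only",
        "parameters": {"order_id": {"type": "string"}},
    },
    "issue_refund": {
        "access": "write",
        "parameters": {
            "order_id": {"type": "string"},
            "amount_usd": {"type": "number", "maximum": 200},
        },
        "requires_human_approval": True,
    },
}

def authorize_tool_call(tool_name: str, arguments: dict) -> bool:
    """Deny anything the agent was not explicitly granted upfront."""
    spec = ALLOWED_TOOLS.get(tool_name)
    if spec is None:
        return False  # undeclared tool -> denied
    if set(arguments) - set(spec["parameters"]):
        return False  # unexpected argument -> denied
    cap = spec["parameters"].get("amount_usd", {}).get("maximum")
    if cap is not None and arguments.get("amount_usd", 0) > cap:
        return False  # over the hard limit -> denied
    return True
```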

Worth piloting next:

  • Agent evaluations with observability look promising for tracing multi-step agent tasks, but they need investment to establish baselines. My plan is to pick one high-volume workflow and check whether the metrics correlate with user satisfaction before rolling out broadly (a quick sketch of that check follows).
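The pilot check itself can be small. Here's a sketch, assuming you can export per-task evaluation scores alongside a matching user-satisfaction signal (thumbs up/down, CSAT) for the same workflow; the numbers below are made up.

```python
# Sketch: does the agent-eval metric track what users actually report?
from scipy.stats import spearmanr

# Hypothetical paired samples from one high-volume workflow.
eval_scores  = [0.92, 0.71, 0.88, 0.55, 0.80, 0.97, 0.63]
user_ratings = [5,    3,    4,    2,    4,    5,    3]

rho, p_value = spearmanr(eval_scores, user_ratings)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
# Only roll the metric out broadly if the correlation is meaningfully
# positive on a sample large enough to trust.
```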

The main risk: these policies require defining your tool interfaces upfront. If your agent's capabilities are still evolving, you'll feel the friction.

Evaluation-Driven Deployment

One session stood out: "AI Evaluation: From Model Testing to Production Monitoring" at MGM, presented by Jessie Manders (Sr. Product Manager) and Sandeep Singh (Sr. Gen AI Data Scientist). They walked through Amazon Bedrock Evaluations, a managed platform for the evaluation work that usually takes teams weeks or months to build themselves.

The core insight: evaluation is a trade-off between quality, cost, and latency. Most teams optimize for one and ignore the others, then wonder why production doesn't match their demos.

Bedrock Evaluations offers three approaches:

  • Automated metrics: Traditional scoring for accuracy and robustness
  • Human evaluation: Subject matter experts reviewing answers against a rubric
  • LLM-as-a-judge: Using AI to score helpfulness, completeness, readability, and more

The practical advice that stuck with me: start with 50-100 of your most common questions. Don't try to evaluate everything. Get a subject matter expert to manually review answers—"Is this right? What's missing?"—before you automate. And critically: bigger models aren't always better judges.
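Here's a minimal sketch of that starting point: a small, SME-reviewed golden set plus an LLM-as-a-judge pass using the Bedrock Converse API through boto3. The model ID, questions, rubric, and the `my_rag_app` stand-in are all placeholders, and Bedrock Evaluations can run this kind of job as a managed service; the code just shows the shape of the loop.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # placeholder judge model

# A few SME-reviewed question/reference pairs -- in practice, 50-100 of your
# most common questions, kept under version control.
golden_set = [
    {"question": "What is the standard lead time for a quoted part?",
     "reference": "Ten business days unless the quote states otherwise."},
]

def my_rag_app(question: str) -> str:
    """Stand-in for the system under test; replace with your real call."""
    return "Lead time is two weeks."

def judge(question: str, reference: str, candidate: str) -> dict:
    """Ask the judge model for a structured score; assumes it returns clean JSON."""
    prompt = (
        "Score the candidate answer from 1-5 for correctness and completeness "
        'against the reference. Reply only with JSON: {"score": n, "missing": "..."}.\n'
        f"Question: {question}\nReference: {reference}\nCandidate: {candidate}"
    )
    resp = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return json.loads(resp["output"]["message"]["content"][0]["text"])

for item in golden_set:
    candidate = my_rag_app(item["question"])
    print(item["question"], judge(item["question"], item["reference"], candidate))
```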

The deployment loop I'm implementing:

  • Evaluation dataset under version control
  • Automated checks before merging code
  • Gradual rollout in production
  • Dashboards tracking success rates and response times

The rule is strict: only ship when the new version beats the current one by agreed-upon margins. No more "looks good to me" approvals. This creates auditable decisions and catches regressions before users do.
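Here's a sketch of what that gate can look like in CI, assuming the evaluation harness writes per-version metrics to JSON files. The file paths, metric names, and margins are team-specific placeholders.

```python
# CI gate sketch: block the merge unless the candidate beats the baseline
# by the agreed margin and stays inside the latency budget.
import json
import sys

MIN_SUCCESS_DELTA = 0.02   # candidate must beat baseline by 2 points
MAX_P95_LATENCY_MS = 2500  # and stay under the agreed latency budget

with open("eval/baseline_metrics.json") as f:
    baseline = json.load(f)
with open("eval/candidate_metrics.json") as f:
    candidate = json.load(f)

delta = candidate["success_rate"] - baseline["success_rate"]
if delta < MIN_SUCCESS_DELTA:
    sys.exit(f"Blocked: success rate delta {delta:+.3f} is below the required margin.")
if candidate["p95_latency_ms"] > MAX_P95_LATENCY_MS:
    sys.exit(f"Blocked: p95 latency {candidate['p95_latency_ms']}ms is over budget.")
print("Gate passed: candidate beats baseline within the latency budget.")
```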

Further reading: Bedrock Evaluations announcement | Custom metrics guide

Training, Tuning, and Inference Are Different Problems

One mental model shift from re:Invent: stop treating ML infrastructure as one thing. Training, fine-tuning, and inference have fundamentally different cost drivers and scaling needs:

  • Training: Time and data transfer speed dominate. Plan capacity in advance.
  • Fine-tuning: Iteration speed matters most. Optimize for fast experiments.
  • Inference: Spiky, user-facing, latency-sensitive. Scale based on traffic patterns, not averages (see the autoscaling sketch after this list).
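For the inference side, "scale on traffic patterns" can be as simple as target tracking on invocations per instance. Here's a sketch for a SageMaker endpoint using Application Auto Scaling; the endpoint name, capacities, and target value are placeholders.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")
resource_id = "endpoint/my-inference-endpoint/variant/AllTraffic"  # placeholder

# Let the endpoint grow and shrink with traffic instead of sizing for the average.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=6,
)

autoscaling.put_scaling_policy(
    PolicyName="track-invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # invocations per instance per minute to hold
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # scale in slowly
        "ScaleOutCooldown": 60,  # scale out quickly on spikes
    },
)
```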

Resilience That Actually Gets Tested

Here's how to turn chaos experiments into real disaster recovery playbooks using SSM Inventory, AWS Fault Injection Service, and Resilience Hub:

  • Map your systems first. Understand what breaks when something fails, before you inject failures.
  • Generate response docs. Deterministic incident handling, not ad-hoc heroics.
  • Rehearse recovery. Measure how long it takes to recover, track improvement over time.

The prerequisite most teams skip: accurate resource tagging and inventory. Start there, in non-production environments, before you get fancy with fault injection.
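Once tagging is in place, a first experiment can stay deliberately small. Here's a sketch using the Fault Injection Service API that targets one tagged non-production instance and halts if a CloudWatch alarm fires; the role ARN, alarm ARN, tags, and names are placeholders.

```python
import uuid
import boto3

fis = boto3.client("fis", region_name="us-east-1")

fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Stop one tagged non-prod instance and measure recovery time",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    stopConditions=[{
        # Halt the experiment if this alarm fires.
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:service-errors",  # placeholder
    }],
    targets={
        "nonprod-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"environment": "nonprod", "service": "order-api"},  # relies on accurate tags
            "selectionMode": "COUNT(1)",  # pick exactly one instance
        }
    },
    actions={
        "stop-one-instance": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "nonprod-instances"},
        }
    },
    tags={"purpose": "dr-rehearsal"},
)
```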


What I'm Taking Away

Re:Invent is overwhelming by design—hundreds of sessions, announcements every hour. But the signal is clear:

The teams shipping reliable AI features fastest are treating evaluation as a first-class engineering problem, not an afterthought. They're defining boundaries for agents upfront. They're segmenting infrastructure by workload type. And they're rehearsing failure instead of hoping it doesn't happen.

The ideas aren't new, but the tooling to execute them well finally is. The next 90 days will be about putting it into practice.

Let's Connect

Looking to hire an engineer who ships fast and solves real problems? Send a message below.