This is Part 1 of a 6-part series on building production generative AI applications on GCP. This foundational article covers architecture design patterns and service selection. Subsequent parts will dive into hands-on implementation.
| Characteristic | Detail |
|---|---|
| Estimated Reading Time | 30-45 minutes |
| Technical Level | Intermediate (AI Scientists transitioning to Cloud/Full-Stack) |
| Prerequisites | Basic understanding of ML concepts and cloud computing |
1. Architecture Cheat Sheet
The 5-Layer Stack
┌─────────────────────────────────────────────────────────┐
│ 5. OBSERVABILITY & GOVERNANCE │
│ Cloud Monitoring | Cloud Logging | Vertex AI Monitor │
│ Cloud Trace | Cloud DLP API │
└─────────────────────────────────────────────────────────┘
▲
┌─────────────────────────────────────────────────────────┐
│ 4. ORCHESTRATION & PROCESSING │
│ Vertex AI Pipelines | Cloud Functions/Run | Pub/Sub │
└─────────────────────────────────────────────────────────┘
▲
┌─────────────────────────────────────────────────────────┐
│ 3. STORAGE & DATA │
│ GCS | Vertex AI Vector Search | Memorystore Redis │
│ BigQuery | Firestore | Vertex AI Feature Store │
└─────────────────────────────────────────────────────────┘
▲
┌─────────────────────────────────────────────────────────┐
│ 2. API & GATEWAY │
│ Cloud Endpoints/Apigee | Load Balancing | Cloud Armor │
└─────────────────────────────────────────────────────────┘
▲
┌─────────────────────────────────────────────────────────┐
│ 1. MODEL SERVING │
│ Vertex AI Prediction | Cloud Run + vLLM | GKE │
└─────────────────────────────────────────────────────────┘
Decision Matrix: Key Service Choices
| Decision | Option A | Option B | When to Use A vs B |
|---|---|---|---|
| Model Serving | Vertex AI Prediction | Cloud Run + vLLM | A: Managed infrastructure, Google models, built-in MLOps B: Custom control, open-source models, cost optimization |
| Scaling Strategy | Pre-warm (min-instances > 0) | Scale-from-zero | A: Predictable traffic, low latency requirements B: Sporadic traffic, cost-sensitive workloads |
| Logging | Cloud Logging | BigQuery | A: Real-time monitoring and debugging B: Analytics, compliance, long-term retention |
| Caching | Memorystore Redis | Direct queries | A: High query repetition, latency optimization B: Always-fresh results required |
Sample RAG Chatbot Flow
User Request
↓
[Cloud Endpoints] ← Auth, rate limiting
↓
[Cloud Run] ← Orchestration service
↓
├─→ [Memorystore Redis] ← Check cache
│ ↓ (cache miss)
├─→ [Vertex AI Vector Search] ← Retrieve relevant docs
│ (min_replica_count=2)
↓
├─→ [Cloud Run + vLLM] ← Generate response
│ OR [Vertex AI Prediction] ← For Gemini/PaLM
↓
├─→ [Cloud DLP API] ← Detect PII
│ └─→ [Cloud Run] ← Redaction logic
↓
Response + Logging
├─→ [Cloud Logging] ← Real-time ops
└─→ [BigQuery] ← Analytics via Log Sink
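The flow above can be sketched as a single orchestration handler. Everything below is an illustrative skeleton: the helper functions are hypothetical stand-ins for the real service calls (Memorystore, Vector Search, vLLM or Vertex AI Prediction, Cloud DLP), and an in-process dict stands in for Redis.

```python
import hashlib

CACHE = {}  # stand-in for Memorystore Redis

def cache_key(query: str) -> str:
    """Normalize the query and hash it into a stable cache key."""
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()

def retrieve_docs(query: str) -> list:
    """Stand-in for a Vertex AI Vector Search lookup."""
    return [f"doc relevant to: {query}"]

def generate(query: str, docs: list) -> str:
    """Stand-in for the LLM call (Cloud Run + vLLM or Vertex AI Prediction)."""
    return f"answer({query}) grounded in {len(docs)} docs"

def redact_pii(text: str) -> str:
    """Stand-in for Cloud DLP inspection and redaction."""
    return text  # real code would call the DLP API here

def handle_request(query: str) -> str:
    key = cache_key(query)
    if key in CACHE:                            # 1. check cache
        return CACHE[key]
    docs = retrieve_docs(query)                 # 2. cache miss: retrieve context
    answer = redact_pii(generate(query, docs))  # 3. generate, then scrub PII
    CACHE[key] = answer                         # 4. populate cache for next time
    return answer
```

Note that normalizing the query before hashing lets trivially different phrasings ("What is X?" vs "what is x? ") share a cache entry.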
Autoscaling Quick Reference
Predictable Traffic Spikes:
Cloud Scheduler → Cloud Function → Scale Up
(30 min before peak)
↓
Cloud Run: min-instances 0→5
Vector Search: auto-scales
↓
Cloud Scheduler → Cloud Function → Scale Down
(after peak hours)
Unpredictable Traffic:
- Cloud Run: min-instances=2, max-instances=50
- Vector Search: min_replica_count=2
- Redis: allkeys-lru eviction, TTL=3600s
2. Introduction & Problem Statement
What Are We Building?
Modern generative AI applications—chatbots, document analysis tools, code assistants—require more than just a language model. They need:
- Fast, reliable model serving that scales with demand
- Knowledge retrieval systems for context-aware responses (RAG)
- Data pipelines for processing and embedding documents
- Security measures for PII detection and compliance
- Observability for monitoring model performance and costs
Building these systems on Google Cloud Platform requires understanding not just individual services, but how they work together as a cohesive architecture.
Who Is This For?
This guide is designed for:
- AI scientists transitioning to full-stack data science roles
- ML engineers architecting production systems
- Cloud architects adding GenAI capabilities to existing infrastructure
- Technical interviewers preparing for cloud architecture discussions
Why GCP for Generative AI?
GCP offers several advantages for GenAI applications:
- Vertex AI ecosystem - Unified platform for ML lifecycle management
- Native LLM access - Direct integration with Google’s models (Gemini, PaLM)
- Purpose-built services - Vector Search, Feature Store, Model Monitoring
- Flexible compute options - From fully managed to custom containers
- Enterprise-grade security - Built-in DLP, IAM, and compliance tools
3. Architectural Thinking: The Framework
The 5-Layer Approach
Instead of thinking about individual services, organize your architecture into five functional layers:
| Component Layer | Primary Function | Key Responsibilities/Features |
|---|---|---|
| 1. Model Serving Layer | Where inference happens | Hosts models (proprietary or open-source), Handles prediction requests, Manages model versions and traffic splitting |
| 2. API & Gateway Layer | How users interact with your system | Authentication and authorization, Rate limiting and quota management, Load balancing across backend services |
| 3. Storage & Data Layer | Where knowledge lives | Vector databases for semantic search, Feature stores for real-time features, Caches for performance optimization, Long-term storage for training data |
| 4. Orchestration & Processing Layer | How components communicate | Event-driven workflows, Data preprocessing pipelines, Asynchronous task management |
| 5. Observability & Governance Layer | How you monitor and control | Metrics and alerting, Distributed tracing, Security and compliance |
Decomposing Requirements
When designing a GenAI system, ask these questions for each layer:
- Serving: What models do we need? How will they scale?
- Gateway: Who accesses the system? What are the SLAs?
- Storage: What data do we retrieve? How fast must it be?
- Orchestration: What workflows are required? Sync or async?
- Observability: What metrics matter? What are the compliance needs?
Request Flow vs Training Flow
Your architecture serves two distinct workflows:
Inference (Request) Flow:
User → Gateway → Orchestration → [Vector Search + Model] → Response
- Optimized for latency (milliseconds to seconds)
- Stateless and horizontally scalable
- Focuses on reliability and availability
Training (Development) Flow:
Data → Preprocessing → Training → Evaluation → Registry → Deployment
- Optimized for throughput (hours to days)
- Resource-intensive (GPUs/TPUs)
- Focuses on reproducibility and versioning
Most production architectures prioritize inference flow since it directly impacts user experience.
4. Core Components Deep Dive
Layer 1: Model Serving
| Platform | Best For | Key Features | When to Use |
|---|---|---|---|
| Vertex AI Prediction | Google’s foundation models, managed infrastructure, integrated MLOps | Auto-scaling based on traffic, Built-in A/B testing and traffic splitting, Native integration with Vertex AI training jobs, Managed endpoints with SLA guarantees | Deploying Gemini, PaLM, or other Google models, Need for enterprise support and SLAs, Teams without DevOps expertise, Models trained within Vertex AI ecosystem |
| Cloud Run + Custom Frameworks (vLLM, TGI) | Open-source models, custom serving logic, cost optimization | Full control over serving environment, Support for any containerized framework, Scales to zero for cost savings, Custom pre/post-processing pipelines | Deploying Llama, Mistral, or other OSS models, Need custom inference optimizations (vLLM), Complex preprocessing requirements, Cost-sensitive workloads with variable traffic |
| GKE (Google Kubernetes Engine) | Multi-model serving, complex orchestration, fine-grained control | | Serving multiple models with shared resources, Need for advanced networking configurations, Existing Kubernetes expertise in team, Complex multi-stage inference pipelines |
Layer 2: API & Gateway
| Platform/Service | Best For/Purpose | Key Features/Types |
|---|---|---|
| Cloud Endpoints | RESTful APIs, OpenAPI specs, Google Cloud-native apps | API key and JWT authentication, Request validation against OpenAPI specs, Built-in monitoring and logging |
| Apigee | Enterprise API management, multi-cloud, complex policies | Advanced rate limiting and quotas, API monetization capabilities, Developer portal and analytics, Multi-cloud and hybrid support |
| Cloud Load Balancing | Distribute traffic, health checking, SSL termination | Global HTTPS LB (For global applications), Regional LB (For region-specific workloads), Internal LB (For service-to-service communication) |
Layer 3: Storage & Data
| Platform/Service | Purpose | Key Features/Use Cases/Configuration |
|---|---|---|
| Vertex AI Vector Search | Semantic search for RAG applications | Managed approximate nearest neighbor search, Supports multiple distance metrics (cosine, dot product, L2), Streaming updates for real-time indexing, Auto-scaling with configurable replicas. Configuration: min_replica_count (keep index “warm”, recommend ≥2), machine_type (balance cost vs QPS capacity), distance_measure_type (match your embedding model) |
| Memorystore (Redis) | Caching layer for query results and session data | Key Use Cases: Cache frequent vector search results, Store user conversation history, Session management. Configuration Strategy: Set allkeys-lru eviction policy for caching, Define TTLs based on data freshness needs (e.g., 3600s), Vertical scaling before horizontal (simpler operations) |
| Cloud Storage (GCS) | Object storage for model artifacts, datasets, documents | Best Practices: Use lifecycle policies for cost management, Enable versioning for model artifacts, Organize by project/environment (dev/staging/prod) |
| BigQuery | Data warehouse for analytics, training data, logs | Key Use Cases: Store and query large-scale training datasets, Long-term log retention and analysis, Feature engineering for ML models |
| Firestore / Cloud SQL | Operational databases for application state | When to use: Firestore: Document-based data, real-time sync; Cloud SQL: Relational data, complex queries |
Layer 4: Orchestration & Processing
| Platform/Service | Purpose | Key Features/Common Patterns/Use Cases |
|---|---|---|
| Vertex AI Pipelines | MLOps workflows (training, evaluation, deployment) | Kubeflow Pipelines or TFX under the hood, Component reusability, Experiment tracking and lineage |
| Cloud Functions / Cloud Run | Event-driven processing, lightweight orchestration | Common Patterns: Document preprocessing on upload, Webhook handlers, Scheduled tasks (with Cloud Scheduler), FastAPI for building APIs |
| Pub/Sub | Asynchronous messaging between services | Use Cases: Decouple preprocessing from inference, Fan-out patterns for parallel processing, Event streaming for analytics |
Layer 5: Observability & Governance
| Platform/Service | Purpose/Metrics | Key Features/Best Practices/Value |
|---|---|---|
| Cloud Monitoring | Metrics to track: Model latency (p50, p95, p99), Request rate and error rate, Token usage and costs, Vector search QPS | |
| Cloud Logging | What to log: Prediction requests and responses, Model versions used, Error traces, User interactions (for compliance) | Best Practice: Export logs to BigQuery for long-term analysis |
| Cloud Trace | Distributed tracing across services | Value: Identify bottlenecks in multi-service request paths |
| Vertex AI Model Monitoring | Prediction drift detection, Training-serving skew monitoring, Feature attribution analysis | |
| Cloud DLP API | Detect and redact PII | 150+ built-in info type detectors (SSN, credit cards, emails), Custom info type definitions, Automatic redaction or masking |
5. Handling Scale: Autoscaling Strategies
Understanding the Challenge
GenAI applications have unique scaling characteristics:
- Model inference is GPU-bound - Can’t scale infinitely like stateless web apps
- Cold starts are expensive - Loading models into memory takes 10-60 seconds
- Traffic is often bursty - Launch events, viral content, business hours
- Costs scale linearly with compute - Unlike traditional apps with economies of scale
Component-Specific Strategies
Cloud Run (Orchestration & Custom Serving)
Configuration parameters:
- min-instances: Minimum always-on containers
- max-instances: Maximum concurrent containers
- concurrency: Requests per container
Three strategies:
1. Cost-optimized (Unpredictable, low traffic):
min-instances: 0
max-instances: 50
concurrency: 80
- Pros: Pay only for actual usage
- Cons: First requests after idle have cold starts (5-15s)
2. Performance-optimized (Consistent traffic):
min-instances: 5
max-instances: 50
concurrency: 80
- Pros: No cold starts, predictable latency
- Cons: Pay for idle capacity 24/7
3. Balanced (Variable traffic with peaks):
min-instances: 2
max-instances: 50
concurrency: 80
- Pros: Minimal cold starts, reasonable cost
- Cons: Slight delay during rapid scale-up
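The three strategies map directly onto Cloud Run flags. Below is a minimal sketch that only builds the `gcloud run services update` command rather than executing it; the service name and region are placeholders.

```python
# Map each scaling strategy above to its Cloud Run configuration.
STRATEGIES = {
    "cost-optimized":        {"min-instances": 0, "max-instances": 50, "concurrency": 80},
    "performance-optimized": {"min-instances": 5, "max-instances": 50, "concurrency": 80},
    "balanced":              {"min-instances": 2, "max-instances": 50, "concurrency": 80},
}

def scaling_command(service: str, strategy: str, region: str = "us-central1") -> list:
    """Build the gcloud command for a given strategy (not executed here)."""
    cfg = STRATEGIES[strategy]
    return (["gcloud", "run", "services", "update", service, f"--region={region}"]
            + [f"--{flag}={value}" for flag, value in cfg.items()])
```

You could pass the result to `subprocess.run` from a Cloud Function, or simply run the equivalent command by hand during deployment.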
Vertex AI Vector Search
Key parameter: min_replica_count
Strategy:
- Set min_replica_count ≥ 2 to keep the index “warm”
- Vector Search auto-scales based on QPS
- No cold start issues with minimum replicas
Additional optimization:
- Send periodic health check queries to maintain warmth
- Use streaming updates if constantly adding vectors
Vertex AI Prediction
Configuration:
- min_replica_count: Minimum serving replicas
- max_replica_count: Maximum replicas
- machine_type: GPU/CPU type per replica
Strategy:
- Always keep min_replica_count ≥ 1 (no scale-to-zero)
- Pay for capacity, not requests
- More predictable than Cloud Run, but less cost-flexible
Memorystore Redis
Scaling approach:
First: Vertical scaling
- Increase memory of existing instance (5GB → 300GB)
- Simpler operations, no distributed systems complexity
Then: Eviction policies
- Configure allkeys-lru eviction
- Set appropriate TTLs on cached data
- Let Redis self-manage memory automatically
Last resort: Redis Cluster
- Only for very high scale (TBs of data)
- Adds operational complexity
- Required when single instance insufficient
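The get-or-expire semantics behind this strategy can be sketched without a Redis server. The class below mirrors the SETEX/GET pattern in-process; in production Memorystore replaces it, with allkeys-lru handling memory pressure automatically.

```python
import time

class TTLCache:
    """In-process sketch of the Redis caching pattern described above:
    every entry gets a TTL (as with Redis SETEX) so stale results age out
    on their own, leaving less for allkeys-lru eviction to clean up."""

    def __init__(self, default_ttl=3600.0):
        self.default_ttl = default_ttl
        self._store = {}  # key -> (expiry timestamp, value)

    def set(self, key, value, ttl=None):
        """Store a value with a TTL, like Redis SETEX."""
        self._store[key] = (time.monotonic() + (ttl or self.default_ttl), value)

    def get(self, key):
        """Return the value, or None if missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if time.monotonic() >= expiry:  # expired: behave like a Redis TTL lapse
            del self._store[key]
            return None
        return value
```

Only the get/set semantics carry over to Memorystore; the vertical-scaling and cluster decisions above are about where this state lives, not how it is accessed.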
Pre-warming for Predictable Spikes
For known traffic patterns (product launches, scheduled events):
Timeline:
T-30 min: Cloud Scheduler triggers scale-up
↓
Cloud Function updates configurations:
- Cloud Run: min-instances 0 → 5
- Vertex AI: min_replica_count 1 → 5
↓
T+0: Event starts, infrastructure ready
↓
T+3 hours: Event ends
↓
Cloud Scheduler triggers scale-down
↓
Cloud Function restores configurations:
- Cloud Run: min-instances 5 → 0
- Vertex AI: min_replica_count 5 → 1
Implementation: Cloud Scheduler → Cloud Function → GCP API calls
Reactive Auto-scaling
For unpredictable spikes, monitor key metrics:
Cloud Monitoring alerts:
- Request queue depth > threshold → Scale up
- CPU utilization > 70% → Scale up
- Error rate > 5% → Investigate, possibly scale
Response actions:
- Trigger Cloud Functions to adjust configurations
- Send notifications to on-call engineers
- Log incidents for post-mortem analysis
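The alert-to-action mapping above can be encoded as a small policy function. This is a sketch using the thresholds from the text; the queue-depth threshold and the action strings are placeholders for your own values and responders.

```python
def autoscale_actions(queue_depth: int, cpu_util: float, error_rate: float,
                      queue_threshold: int = 100) -> list:
    """Map observed metrics to actions per the alert rules above.
    queue_threshold is an assumed example value, not a recommendation."""
    actions = []
    if queue_depth > queue_threshold:
        actions.append("scale-up: request queue depth exceeded")
    if cpu_util > 0.70:
        actions.append("scale-up: CPU utilization above 70%")
    if error_rate > 0.05:
        actions.append("investigate: error rate above 5%")
    return actions
```

In practice this logic lives in a Cloud Function triggered by Cloud Monitoring alert notifications, which then adjusts configurations or pages the on-call engineer.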
6. Cost Optimization Playbook
Cost Drivers (Summary Table)
| Cost Driver | Typical Percentage | Key Scaling Factors |
|---|---|---|
| Model serving compute | 40-60% | Traffic and model size |
| Vector search | 15-25% | Index size and QPS |
| Data storage | 10-15% | Model/dataset volume, BigQuery usage |
| Networking | 5-10% | Cross-region egress, inter-service API calls |
| Logging and monitoring | 5-10% | Log ingestion and retention, custom metrics |
Optimization Strategies by Component (Detailed Table)
| Component | Strategy | Details/Goal |
|---|---|---|
| Model Serving | Right-size instances | Profile GPU utilization; use smallest instance meeting latency SLAs; consider CPU-only for smaller models. |
| | Batch prediction | Group multiple requests for higher throughput; trade latency for cost (ideal for offline use cases). |
| | Model optimization | Apply Quantization (FP16, INT8); use Distillation for smaller models; leverage efficient architectures (e.g., vLLM). |
| | Scale-to-zero | Use Cloud Run for automatic scaling to zero in non-production (dev/staging) environments for significant savings. |
| Vector Search | Index optimization | Scale machine type based on actual QPS needs; adjust min_replica_count; use smaller machine types for development. |
| | Approximate vs exact search | Use Approximate Nearest Neighbor (ANN) for most queries; reserve exact search for critical use cases. |
| | Index segmentation | Separate indices by use case to allow independent scaling based on traffic. |
| Storage | Lifecycle policies | Move old data to Nearline/Coldline storage; delete temporary files; archive logs after retention period. |
| | Compression | Compress datasets/models in GCS; use Parquet/Avro instead of CSV for large datasets. |
| | Query optimization (BigQuery) | Partition tables by date; cluster tables on frequently queried columns; use BigQuery slots efficiently. |
| Caching | Aggressive caching with Redis | Cache vector search results (can save 50-80% of searches). Set appropriate TTLs; monitor cache hit rates. |
| | | Example Savings: Vector search query: $0.001 vs. Redis cache read: $0.00001. 80% cache hit rate = 80% cost reduction on searches. |
| Logging | Log sampling | Sample prediction logs (e.g., 10% in production); log all errors/anomalies; full logging only in development. |
| | Structured logging | Use JSON for efficient querying; avoid duplicate info; export only necessary fields to BigQuery. |
| | Retention policies | Keep detailed logs for ~30 days; aggregate metrics for longer retention; archive compliance logs to Coldline storage. |
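The cache-savings arithmetic can be checked directly. This sketch uses the illustrative per-query prices from the table (not real list prices); the exact reduction at an 80% hit rate works out to ~79%, which the table rounds to 80%.

```python
def blended_query_cost(hit_rate: float, cache_cost: float = 0.00001,
                       search_cost: float = 0.001) -> float:
    """Expected cost per query with a cache in front of vector search.
    Prices are the illustrative figures from the table above."""
    return hit_rate * cache_cost + (1 - hit_rate) * search_cost

def savings(hit_rate: float) -> float:
    """Fractional cost reduction versus hitting vector search every time."""
    return 1 - blended_query_cost(hit_rate) / 0.001
```

Because the cache read is two orders of magnitude cheaper than the search, the savings fraction tracks the hit rate almost one-to-one, which is why raising the hit rate is usually the highest-leverage optimization.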
Cost vs Performance Trade-offs
| Component | Cost-Optimized | Balanced | Performance-Optimized |
|---|---|---|---|
| Cloud Run | min=0, scale-to-zero | min=2, moderate always-on | min=10, pre-warmed |
| Vector Search | min_replicas=1, small machine | min_replicas=2, medium machine | min_replicas=5, large machine |
| Redis | 5GB, aggressive eviction | 20GB, moderate TTLs | 100GB, long TTLs |
| Logging | 10% sampling, 7-day retention | 50% sampling, 30-day retention | 100% logging, 90-day retention |
| Typical Monthly Cost | $500-2,000 | $2,000-8,000 | $8,000-25,000+ |
| Suitable For | MVP, prototypes, low traffic | Production, moderate traffic | Enterprise, high traffic |
Monitoring Cost Efficiency
Key metrics to track:
- Cost per 1,000 predictions
- Cost per user session
- GPU utilization percentage
- Cache hit rate
- Average response latency vs cost
Action triggers:
- GPU utilization <50% → Downsize instance
- Cache hit rate <60% → Increase Redis capacity
- Cost per prediction increasing → Investigate inefficiencies
7. Production Best Practices
Security & Compliance
PII Detection & Redaction
Always use Cloud DLP API for:
- Detecting sensitive information in user inputs
- Redacting PII before logging
- Compliance with GDPR, HIPAA, etc.
Pattern:
User Input → Cloud DLP (detect) → Redaction Logic → Model
Model Output → Cloud DLP (detect) → Redaction Logic → User
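The detect-and-redact step can be expressed as a Cloud DLP de-identify request. The sketch below builds the request as plain dicts (which the python-dlp client accepts); the listed info types are real built-in detectors, while the project ID is a placeholder and the client call is left commented so the builder itself runs anywhere.

```python
def build_deidentify_request(project: str, text: str) -> dict:
    """Build a Cloud DLP deidentify_content request that replaces detected
    PII with its info type name, e.g. an email becomes [EMAIL_ADDRESS]."""
    info_types = [{"name": t} for t in
                  ("EMAIL_ADDRESS", "PHONE_NUMBER", "US_SOCIAL_SECURITY_NUMBER")]
    return {
        "parent": f"projects/{project}",
        "inspect_config": {"info_types": info_types},
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [{
                    "primitive_transformation": {
                        "replace_with_info_type_config": {}
                    }
                }]
            }
        },
        "item": {"value": text},
    }

# To execute (requires google-cloud-dlp and credentials):
# from google.cloud import dlp_v2
# client = dlp_v2.DlpServiceClient()
# response = client.deidentify_content(
#     request=build_deidentify_request("my-project", "reach me at a@b.com"))
# redacted = response.item.value
```

Applying this both on user input (before logging) and on model output (before returning to the user) implements the bidirectional pattern shown above.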
Authentication & Authorization
Best practices:
- Use Cloud IAM for service-to-service authentication
- Implement API keys or OAuth for user authentication
- Apply principle of least privilege
- Rotate credentials regularly
Data Encryption
At rest:
- Use customer-managed encryption keys (CMEK) for sensitive data
- Enable encryption by default on all GCS buckets
In transit:
- Enforce HTTPS for all API endpoints
- Use Private Google Access for internal traffic
Monitoring & Observability
What to Monitor
| Category | Key Metrics to Track |
|---|---|
| Infrastructure metrics | Instance count and utilization, Request latency (p50, p95, p99), Error rates by type, Network throughput |
| Model metrics | Prediction latency, Token usage (for LLMs), Model version distribution, Drift detection alerts |
| Business metrics | Cost per prediction, User satisfaction scores, Feature usage patterns, Conversion rates |
Alert Strategy
| Alert Level | Trigger Condition | Required Response |
|---|---|---|
| Critical alerts | Error rate >5% for >5 minutes, p99 latency >10 seconds, Service unavailability | Immediate response |
| Warning alerts | Cost spike >50% vs baseline, Cache hit rate drop >20%, GPU utilization <30% (waste) or >90% (saturation) | Investigate within hours |
| Info alerts | Model drift detection, Unusual traffic patterns, Capacity planning thresholds | Review daily/weekly |
Measuring Architecture Health
Once your GenAI application is running in production, how do you know if your architecture is truly well-designed? GCP provides automated tools to continuously assess your cloud stack across five key dimensions: security posture, cost efficiency, performance optimization, access management, and resource compliance. These tools generate actionable recommendations and health scores, helping you identify gaps before they become problems. Think of them as your architecture’s “continuous health monitoring system.”
| Tool/Service | What It Measures | Use Case | Automation Level |
|---|---|---|---|
| Security Command Center | Security posture, vulnerabilities, misconfigurations | Identifies security risks across all GCP resources | ✅ Fully automated scanning |
| Recommender | Cost optimization, performance, security improvements | Suggests rightsizing, idle resources, best practices | ✅ Automated recommendations |
| Policy Intelligence | IAM policies, access patterns, least privilege violations | Ensures proper access controls and permissions | ✅ Automated policy analysis |
| Cloud Asset Inventory | Resource compliance, organizational policies | Tracks all resources and checks policy compliance | ✅ Automated inventory + compliance |
| Architecture Framework Assessment | Operational excellence, reliability, performance, cost, security | Comprehensive well-architected review (manual questionnaire) | ⚠️ Manual but structured |
Common Pitfalls to Avoid
Architecture Anti-patterns
| Anti-pattern | Description/Impact | Mitigation/Best Practice |
|---|---|---|
| 1. Over-engineering for scale | Introduces unnecessary complexity (e.g., starting with GKE when Cloud Run is enough). | Start simple (e.g., Cloud Run), scale when necessary, avoid distributed systems complexity until required. |
| 2. Under-investing in monitoring | Leads to inability to optimize or troubleshoot scaling issues. | Set up monitoring before scaling issues arise; include cost monitoring from day one. |
| 3. Ignoring cold starts | Destroys user experience due to slow initial responses. | Configure min-instances for production; pre-warm services before known traffic spikes. |
| 4. Insufficient caching | Results in high costs from expensive, repetitive vector searches. | Cache aggressively (e.g., with Redis) using appropriate TTLs. |
| 5. Logging everything | Full prediction logging is expensive and slows down analysis. | Use sampling in production; focus logging on errors and anomalies. |
Operational Pitfalls
| Pitfall | Description/Impact | Mitigation/Best Practice |
|---|---|---|
| 1. No rollback strategy | Makes recovery from bad deployments slow or impossible. | Always deploy with traffic splitting; keep previous model versions available; test rollback procedures regularly. |
| 2. Lack of reproducibility | Prevents auditing or recreating specific model results. | Version all model artifacts; track training data and hyperparameters; use Vertex AI Model Registry. |
| 3. Manual configuration management | Prone to human error, slow, and lacks audit trail. | Use Infrastructure as Code (Terraform); version control all configurations; automate deployments. |
| 4. Ignoring model drift | Leads to stale models and degrading performance over time. | Set up Vertex AI Model Monitoring; define acceptable performance ranges; establish retraining triggers. |
| 5. Poor error handling | Causes hard failures and a poor user experience. | Implement graceful degradation; return meaningful error messages; log failures for analysis. |
Testing Strategy
| Test Type | Primary Goal | Key Activities/Scope |
|---|---|---|
| Unit tests | Validate individual components in isolation. | Test input/output contracts; Mock external dependencies. |
| Integration tests | Validate service interactions and end-to-end flows. | Test service interactions; Validate end-to-end request flows; Use staging environment. |
| Load tests | Verify performance and autoscaling under stress. | Simulate expected peak traffic; Test autoscaling behavior; Identify bottlenecks. |
| Chaos engineering | Test system resilience against failures. | Test failure scenarios (service outages); Validate fallback mechanisms; Ensure graceful degradation. |
8. Conclusion & Resources
Key Takeaways
Architecture principles:
- Think in layers - Organize services into functional layers for clarity
- Start simple - Use managed services before custom solutions
- Design for scale - Plan autoscaling strategies from the beginning
- Monitor everything - Observability is not optional in production
- Optimize iteratively - Start with working system, then optimize costs
Service selection guidelines:
- Vertex AI Prediction for Google models and managed infrastructure
- Cloud Run for custom serving and cost optimization
- Vector Search for semantic search in RAG applications
- Cloud DLP for PII detection and compliance
- Both Cloud Logging and BigQuery for comprehensive observability
Scaling wisdom:
- Pre-warm for predictable spikes
- Keep minimum replicas for unpredictable traffic
- Let Redis manage memory with eviction policies
- Scale vertically before horizontally (simpler operations)
Next in This Series
Part 2: Model Serving & Inference (Coming soon)
- Hands-on: Deploying models to Vertex AI Prediction
- Hands-on: Setting up Cloud Run with vLLM
- Implementing RAG with Vector Search
- Performance optimization techniques
Part 3: Building the Data Pipeline
- Document processing and embedding generation
- Vector database setup and management
- Feature stores and caching implementation
- ETL pipeline orchestration
Part 4: API Layer & Orchestration
- Building FastAPI services on Cloud Run
- API Gateway configuration
- Authentication and rate limiting
- Request routing and load balancing
Part 5: Observability & Production Hardening
- Setting up monitoring dashboards
- Implementing logging strategies
- PII detection with Cloud DLP
- Autoscaling configuration automation
Part 6: MLOps & CI/CD
- Model versioning and registry
- A/B testing strategies
- Automated retraining pipelines
- Deployment automation with Cloud Build
Essential GCP Documentation
Core services:
Architecture guides:
Pricing:
Training resources:
9. Quizzes!
Test your understanding of GCP GenAI architecture:
Scenario 1: Service Selection
Question: You’re building a customer service chatbot that needs to:
- Use a fine-tuned Llama 3 model
- Retrieve from 10,000 support documents
- Handle 1,000 requests/day (sporadic traffic)
- Detect and redact PII before logging
Which services would you choose and why?
Click to reveal answer
Recommended architecture:
- Cloud Run + vLLM for serving Llama 3 (open-source model, cost-optimized with scale-to-zero)
- Vertex AI Vector Search for document retrieval (managed, scales automatically)
- Memorystore Redis (small instance, 5GB) for caching frequent queries
- Cloud DLP API for PII detection (purpose-built, 150+ detectors)
- Cloud Logging + BigQuery for logging (real-time + analytics)
Configuration:
- Cloud Run: min-instances=0 (low traffic, cost-sensitive)
- Vector Search: min_replica_count=2 (keep warm, but small)
- Redis: allkeys-lru eviction, TTL=3600s
Why not Vertex AI Prediction? The model is open-source (Llama 3), and traffic is too low to justify the always-on cost of managed endpoints.
Scenario 2: Autoscaling Strategy
Question: Your GenAI application experiences:
- Normal traffic: 100 requests/hour
- Daily spike: 2,000 requests/hour (9-11 AM)
- Monthly product launches: 10,000 requests/hour (date known in advance)
How would you configure autoscaling?
Click to reveal answer
Hybrid strategy:
For daily predictable spikes (9-11 AM):
- Cloud Scheduler at 8:45 AM: Scale up
  - Cloud Run: min-instances=2 (from 0 or 1)
  - Keep Vector Search at min_replica_count=2 (already handles this)
- Cloud Scheduler at 11:15 AM: Scale down
  - Cloud Run: min-instances=1 (maintain some warmth for baseline)
For monthly product launches:
- Manual or scheduled pre-warming 30 minutes before
  - Cloud Run: min-instances=10
  - Vector Search: Consider temporarily increasing to min_replica_count=5
- Monitor in real time during the launch
- Scale down manually after the launch concludes
Baseline configuration:
- Cloud Run: min-instances=1, max-instances=50
- Vector Search: min_replica_count=2
- Redis: 20GB instance (handle peak without evicting hot data)
Key insight: Use automation for daily patterns, manual intervention for rare high-stakes events where you want full control.
Scenario 3: Cost Optimization
Question: Your architecture costs $15,000/month:
- Cloud Run serving: $8,000 (GPU instances)
- Vector Search: $4,000
- Redis: $1,500
- Logging: $1,500
How would you reduce costs by 40% without significantly impacting performance?
Click to reveal answer
Target: $9,000/month ($6,000 savings)
Optimization plan:
1. Cloud Run serving ($8,000 → $4,500, save $3,500):
- Implement aggressive caching (reduce queries by 60%)
- Model quantization (FP16 → INT8, smaller GPU needed)
- Reduce min-instances during off-peak hours (nights/weekends)
- Right-size GPU instances based on actual utilization metrics
2. Vector Search ($4,000 → $2,500, save $1,500):
- Reduce min_replica_count from 3 to 2
- Downsize machine type (analyze actual QPS requirements)
- Cache top 20% of queries (covers 80% of traffic)
3. Redis ($1,500 → $1,200, save $300):
- Analyze cache hit rates and optimize eviction policy
- Slightly more aggressive TTLs (reduce from 1 hour to 45 min)
4. Logging ($1,500 → $800, save $700):
- Implement 20% sampling for prediction logs (from 100%)
- Reduce BigQuery retention from 90 to 30 days
- Export less-used logs to GCS Coldline
Total savings: $6,000/month (40% reduction)
Performance impact:
- Latency increase: ~5-10% (due to caching misses)
- Availability: No change (still redundant)
- Cold starts: Slightly more during scale-up
Key insight: Most cost savings come from caching (reduces actual model invocations) and right-sizing compute (many deployments are over-provisioned).
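The plan's arithmetic can be confirmed in a few lines, using the monthly figures from the answer above (all in USD).

```python
# Sanity-check of the optimization plan: totals before and after,
# using the per-component monthly figures from the answer.
before = {"serving": 8000, "vector_search": 4000, "redis": 1500, "logging": 1500}
after  = {"serving": 4500, "vector_search": 2500, "redis": 1200, "logging": 800}

total_before = sum(before.values())   # 15,000
total_after = sum(after.values())     # 9,000
monthly_savings = total_before - total_after  # 6,000
reduction = monthly_savings / total_before    # 0.40, hitting the 40% target
```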
Scenario 4: Architecture Decision
Question: When should you use Cloud Run instead of Vertex AI Prediction for serving a generative AI model?
Click to reveal answer
Use Cloud Run when:
- **Open-source models (Llama, Mistral, Falcon)**
  - Vertex AI Prediction is primarily optimized for Google/proprietary models
  - vLLM on Cloud Run offers better performance for OSS LLMs
- **Custom preprocessing/postprocessing**
  - Need complex business logic before/after inference
  - Multi-stage pipelines in a single container
- **Cost optimization required**
  - Traffic is sporadic (scale-to-zero capability)
  - Can’t justify always-on endpoint costs
- **Custom serving frameworks**
  - Want to use specific inference servers (vLLM, TGI, TensorRT)
  - Need cutting-edge optimizations not available in the managed service
- **Full infrastructure control**
  - Custom networking requirements
  - Specific container configurations
  - Advanced logging/monitoring integration

Use Vertex AI Prediction when:
- **Google’s foundation models (Gemini, PaLM)**
  - Native integration and optimization
  - Simpler API access
- **Managed infrastructure preferred**
  - Team lacks DevOps expertise
  - Want built-in MLOps features
- **Enterprise SLAs required**
  - Need guaranteed uptime
  - Vendor support essential
- **A/B testing and traffic splitting**
  - Built-in canary deployments
  - Easy model version management
- **Integration with Vertex AI training**
  - Seamless deployment from training jobs
  - Model registry integration
Hybrid approach (common in production):
- Cloud Run for orchestration
- Vertex AI Prediction for Google models
- Cloud Run + vLLM for custom models
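The hybrid approach boils down to a small dispatcher in the orchestration layer. Everything in this sketch is illustrative: the model names and backend labels are assumptions, not real endpoints or SDK calls.

```python
# Hypothetical routing helper for the hybrid pattern: the Cloud Run
# orchestrator picks a serving backend per model. Model names and
# backend labels are examples, not actual service identifiers.
GOOGLE_MODELS = {"gemini-1.5-pro", "gemini-1.5-flash"}

def pick_backend(model: str) -> str:
    """Return which serving layer should handle this model."""
    if model in GOOGLE_MODELS:
        return "vertex-ai-prediction"   # managed endpoint for Google models
    return "cloud-run-vllm"             # self-hosted vLLM for OSS models

print(pick_backend("gemini-1.5-pro"))
print(pick_backend("llama-3-8b"))
```

Centralizing this decision in one function keeps the routing policy auditable and easy to change as models are added.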
Scenario 5: Debugging Performance
Question: Your RAG application has high latency (p95 = 8 seconds). Users complain it’s too slow. The flow is:
Cloud Run → Vector Search → Cloud Run + vLLM → Cloud DLP → Response
How would you diagnose and fix the bottleneck?
Click to reveal answer
Diagnosis approach using Cloud Trace:
Step 1: Add distributed tracing
- Enable Cloud Trace across all services
- Instrument each service call
- Identify which component takes longest
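The shape of step 1 can be shown with a minimal hand-rolled timer. In production you would emit real Cloud Trace spans (e.g. via OpenTelemetry) rather than this dict, but the structure is the same: wrap each stage in a span, then rank stages by duration. Stage names and sleeps below are placeholders.

```python
# Minimal stand-in for distributed tracing: time each stage of the
# request flow, then report the slowest one as the bottleneck.
import time
from contextlib import contextmanager

SPANS: dict[str, float] = {}

@contextmanager
def span(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS[name] = (time.perf_counter() - start) * 1000  # milliseconds

with span("vector_search"):
    time.sleep(0.02)   # placeholder for the real Vector Search call
with span("vllm_inference"):
    time.sleep(0.05)   # placeholder for the real vLLM call

bottleneck = max(SPANS, key=SPANS.get)
print(bottleneck)  # the stage to optimize first
```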
Likely findings (typical latency breakdown):
Total: 8000ms
├─ Cloud Run orchestration: 50ms
├─ Vector Search: 800ms ⚠️
├─ Cloud Run (vLLM): 6500ms ⚠️⚠️
├─ Cloud DLP: 500ms
└─ Network overhead: 150ms
Optimization strategies:
For Vector Search (800ms):
- Add Redis caching (reduce to ~50ms for cache hits)
- Increase min_replica_count (index might be cold)
- Optimize the query (reduce the number of results retrieved)
- Check if index needs rebuilding (fragmentation)
For vLLM inference (6500ms) - BIGGEST BOTTLENECK:
- Check GPU utilization (might be under-powered)
  - If <50%: issue is software, not hardware
  - If >90%: need a larger GPU or batching
- Enable KV caching in vLLM (reduces repeat token computation)
- Reduce max_tokens if generating too much text
- Implement streaming responses (perceived latency improvement)
- Check for cold starts (increase min-instances)
- Consider model quantization (INT8 is 2-3x faster)
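The GPU-utilization triage above fits in a tiny helper; the 50%/90% thresholds are the rough heuristics from this list, not hard rules.

```python
# Encode the utilization triage rule: low utilization points at a
# software bottleneck, very high utilization at a hardware/batching
# limit, and anything in between needs more profiling.
def diagnose_gpu(utilization_pct: float) -> str:
    if utilization_pct < 50:
        return "software"   # e.g. input pipeline, batching config, KV cache off
    if utilization_pct > 90:
        return "hardware"   # need a larger GPU or server-side batching
    return "inconclusive"   # profile further before changing anything
```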
For Cloud DLP (500ms):
- Only scan response text, not entire context
- Use custom detectors (faster than scanning for all 150+ built-in info types)
- Consider async processing for non-critical checks
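The async suggestion can be sketched with asyncio: kick the scan off as a background task, return the answer to the user immediately, and collect the scan result afterwards for audit logging. `run_dlp_scan` here is a placeholder for the real Cloud DLP call, and the sleep stands in for its ~500 ms latency.

```python
# Run a non-critical compliance check concurrently with responding,
# so the scan's latency does not sit on the user-facing critical path.
import asyncio

async def run_dlp_scan(text: str) -> str:
    await asyncio.sleep(0.05)        # stands in for the ~500 ms DLP call
    return f"scanned:{text[:20]}"

async def main() -> tuple[str, str]:
    answer = "The short answer is yes, with caveats."
    scan_task = asyncio.create_task(run_dlp_scan(answer))
    response = answer                # returned to the user right away
    audit_result = await scan_task   # awaited only after responding
    return response, audit_result

response, audit = asyncio.run(main())
```

Note this only suits checks that can tolerate acting after the fact (alerting, audit trails); blocking redaction of the response must stay synchronous.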
Expected results after optimization:
Total: 2500ms (69% improvement)
├─ Cloud Run orchestration: 50ms
├─ Vector Search (cached): 50ms ✓
├─ Cloud Run (vLLM optimized): 2000ms ✓
├─ Cloud DLP (scoped): 300ms ✓
└─ Network overhead: 100ms
Key insight: Always measure before optimizing. Use Cloud Trace to find the actual bottleneck—don’t assume!
Scenario 6: Security & Compliance
Question: Your GenAI chatbot will handle healthcare data (HIPAA compliance required). What architectural considerations must you address?
Click to reveal answer
HIPAA compliance requirements for GCP architecture:
1. Data Encryption
- At rest: Use Customer-Managed Encryption Keys (CMEK) for:
- GCS buckets (model artifacts, documents)
- BigQuery datasets (logs, analytics)
- Memorystore Redis (conversation cache)
- In transit: Enforce HTTPS/TLS 1.2+ for all endpoints
2. PII/PHI Protection
- Always use Cloud DLP API to:
- Detect PHI in user inputs before processing
- Redact PHI before logging
- Mask PHI in model responses if necessary
- Configure for healthcare-specific info types:
- Medical record numbers
- Medication names
- ICD codes
- Provider identifiers
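A healthcare-focused inspect configuration for the DLP API might look like the following. It is built as a plain dict (the google-cloud-dlp client accepts dict-shaped requests), so it can be reviewed without calling the API; verify the info type names against the current DLP info-type list before relying on them.

```python
# Sketch of a Cloud DLP inspect config scoped to healthcare info types.
# Info type names are believed-correct examples and should be checked
# against Google's published DLP info-type reference.
inspect_config = {
    "info_types": [
        {"name": "PERSON_NAME"},
        {"name": "US_HEALTHCARE_NPI"},   # provider identifiers
        {"name": "ICD10_CODE"},          # diagnosis codes
        {"name": "MEDICAL_TERM"},        # medications, conditions
    ],
    "min_likelihood": "LIKELY",
    "include_quote": False,              # avoid echoing PHI into findings
}

# With a DlpServiceClient, this would be passed roughly as:
# dlp.inspect_content(request={"parent": parent,
#                              "inspect_config": inspect_config,
#                              "item": {"value": text}})
```

Scoping to a handful of info types both speeds up scanning and reduces false positives compared with running every built-in detector.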
3. Access Controls
- Implement principle of least privilege with IAM
- Use VPC Service Controls to create security perimeter
- Enable Private Google Access (no public internet)
- Implement Cloud Armor for DDoS protection
4. Audit Logging
- Enable Admin Activity logs (who did what)
- Enable Data Access logs (who accessed what data)
- Export logs to immutable storage (GCS with retention policy)
- Set up log-based metrics for compliance monitoring
5. Network Isolation
- Deploy services in VPC (not public internet)
- Use Private Service Connect for Google APIs
- Implement firewall rules restricting traffic
- Consider shared VPC for multi-project setup
6. Data Residency
- Choose specific regions for data storage (e.g., us-central1)
- Ensure Vector Search index in same region
- Configure BigQuery with specific location
7. Business Associate Agreement (BAA)
- Sign Google Cloud BAA (required for HIPAA)
- Document architecture in compliance documentation
- Regular security assessments and audits
Architecture modifications:
User (HTTPS only)
↓
[Cloud Armor] ← DDoS protection
↓
[HTTPS Load Balancer] ← TLS termination
↓
[VPC Network] ← Private communication
↓
[Cloud Run] ← Orchestration in VPC
↓
├─→ [Cloud DLP] ← Detect/redact PHI
├─→ [Vector Search] ← CMEK encrypted
├─→ [Cloud Run + vLLM] ← In VPC
└─→ [Audit Logs] ← Immutable trail
↓
[GCS with CMEK] ← Encrypted storage
[BigQuery with CMEK] ← Compliance logs
Key insight: HIPAA compliance is not just about encryption—it’s about comprehensive data governance, access controls, and audit trails throughout the entire architecture.
10. Final Thoughts
You now have a comprehensive understanding of GCP architecture for generative AI applications. The key to success is:
- Start with clear requirements - Understand your use case before choosing services
- Design in layers - Organize complexity into manageable components
- Measure everything - You can’t optimize what you don’t monitor
- Iterate and improve - Start simple, scale as needed, optimize continuously
In Part 2 of this series, we’ll get hands-on with actual implementations, starting with deploying models to production. Stay tuned!
The GCP documentation and community forums are excellent resources for diving deeper into any of the topics presented in this post: