Purple8-platform

AI Guardrails Agent - Documentation

Overview

The AI Guardrails Agent is a pre-execution safety and compliance layer that validates all user prompts BEFORE any AI agents begin execution. This prevents misuse, ensures ethical AI development, and maintains legal compliance.

Key Goals

  1. Safety: Prevent physical or psychological harm
  2. Ethics: Uphold fairness, privacy, and human dignity
  3. Compliance: Meet legal and regulatory requirements (GDPR, HIPAA, etc.)
  4. Reliability: Ensure accurate, consistent, and trustworthy results
  5. Brand Alignment: Maintain consistent brand voice and values

Architecture

User Submits Prompt
        ↓
🛡️ AI Guardrails Agent (Pre-check)
        ↓
   Validation Rules
   (Regex patterns + keywords)
        ↓
   ├─ CRITICAL violation → ❌ REJECT (Stop pipeline)
   ├─ HIGH violation → ⚠️ WARN (Allow with notice)
   └─ No violations → ✅ APPROVE (Continue to pipeline)
        ↓
   Pipeline Execution
   (Ideation → Architecture → Development...)
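The severity routing above can be sketched as a minimal, self-contained validator. The pattern table and function name below are illustrative only; the real rules live in services/agents/guardrails_agent.py:

```python
import re

# Hypothetical pattern table keyed by violation type; each entry carries a
# severity that drives the reject/warn/approve decision shown above.
PATTERNS = {
    "physical_harm": {"regex": re.compile(r"\b(weapon|explosive)\b", re.I),
                      "severity": "critical", "category": "Safety"},
    "regulated_content": {"regex": re.compile(r"\b(gambling|tobacco)\b", re.I),
                          "severity": "high", "category": "Compliance"},
}

def check_prompt(prompt: str) -> str:
    """Return 'reject', 'warn', or 'approve' for a prompt."""
    severities = {entry["severity"]
                  for entry in PATTERNS.values()
                  if entry["regex"].search(prompt)}
    if "critical" in severities:
        return "reject"   # CRITICAL violation -> stop pipeline
    if "high" in severities:
        return "warn"     # HIGH violation -> allow with notice
    return "approve"      # no violations -> continue to pipeline
```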

Violation Categories

1. Safety Violations (CRITICAL)

Physical Harm:

Psychological Harm:

2. Ethical Violations (CRITICAL)

Privacy:

Discrimination:

3. Compliance Violations

Financial Fraud (CRITICAL):

Regulated Content (HIGH):

Copyright (HIGH):

4. Misinformation (CRITICAL)

5. Child Safety (CRITICAL)

Usage

Python (Backend)

from agents.guardrails_agent import validate_prompt

# Validate a prompt
is_valid, result = validate_prompt(
    prompt="Build a fitness tracking app",
    goal="production",
    deployment_target="mobile"
)

if not is_valid:
    print(f"❌ REJECTED: {result['message']}")
    print(f"Violations: {result['violations']}")
    # Inside an API handler, surface the error to the caller, e.g.:
    # return {"error": result['message']}
else:
    # Continue with pipeline execution
    print("✅ Prompt approved")

Response Format

Approved Prompt:

{
    "status": "approved",
    "message": "Prompt passed all AI guardrails",
    "warnings": null,
    "stats": {
        "total_checked": 42,
        "blocked": 3,
        "allowed": 39
    }
}

Rejected Prompt:

{
    "status": "rejected",
    "reason": "guardrails_violation",
    "message": "⚠️ Your request cannot be processed due to AI safety and compliance guardrails...",
    "violations": [
        {
            "type": "physical_harm",
            "category": "Safety",
            "severity": "critical",
            "description": "Prompt requests content that could cause physical harm"
        }
    ],
    "support_message": "If you believe this is an error, please contact support with details."
}
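A caller can branch on the two response shapes above. A minimal handler sketch follows; field names are taken from the JSON examples, but the helper name is hypothetical:

```python
def handle_guardrails_response(result: dict) -> bool:
    """Return True if the prompt was approved; print violation details otherwise."""
    if result.get("status") == "approved":
        return True
    # Rejected: summarize each violation for the user.
    for v in result.get("violations", []):
        print(f"[{v['severity'].upper()}] {v['category']}: {v['description']}")
    print(result.get("support_message", ""))
    return False
```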

Integration Points

1. Pipeline Router (services/gateway/routers/pipeline.py)

The guardrails agent runs as the first step in pipeline execution:

@router.post("/execute")
async def execute_pipeline(request: PipelineRequest):
    # πŸ›‘οΈ AI GUARDRAILS: Validate prompt BEFORE pipeline execution
    from agents.guardrails_agent import validate_prompt
    
    is_valid, guardrails_result = validate_prompt(
        prompt=request.prompt,
        goal=request.goal,
        deployment_target=request.deploymentTarget
    )
    
    if not is_valid:
        # Reject execution
        return {
            'status': 'rejected',
            'error': guardrails_result['message'],
            'violations': guardrails_result['violations']
        }
    
    # Continue with pipeline...

2. Frontend (frontend-vue/src/App.vue)

The frontend displays user-friendly error messages for guardrails violations:

const response = await axios.post(API_ENDPOINTS.runPipeline, requestPayload);

// Check if guardrails blocked the request
if (response.data.status === 'rejected' && response.data.reason === 'guardrails_violation') {
    const violations = response.data.violations || [];
    const categories = violations.map(v => v.category).join(', ');
    
    alert(`πŸ›‘οΈ AI GUARDRAILS VIOLATION\n\n${response.data.error}\n\nCategories: ${categories}`);
    return;
}

Testing

Run the test suite to verify guardrails are working:

cd ai_solutionbuilder
pytest tests/test_guardrails_agent.py -v

Test Coverage:

Monitoring

Track guardrails statistics to monitor misuse attempts:

from agents.guardrails_agent import guardrails_agent

stats = guardrails_agent.get_stats()
print(f"Total validations: {stats['total_validations']}")
print(f"Blocked: {stats['blocked']}")
print(f"Allowed: {stats['allowed']}")
print(f"Block rate: {stats['block_rate']:.2%}")

Customization

Adding New Violation Patterns

Edit services/agents/guardrails_agent.py:

def _load_violation_patterns(self) -> Dict[str, Dict]:
    return {
        # ... existing patterns ...
        
        'new_violation_type': {
            'keywords': [
                r'\b(bad|harmful|dangerous)\s+(pattern)\b',
            ],
            'description': 'Description of what this violation prevents',
            'severity': 'critical',  # 'critical', 'high', or 'medium'
            'category': 'Safety'  # 'Safety', 'Ethics', 'Compliance', 'Reliability'
        }
    }
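Before shipping a new pattern, it's worth sanity-checking the regex in isolation, since word boundaries and whitespace classes are easy to get subtly wrong. A standalone check, not part of the agent:

```python
import re

# Keyword entries are regex strings; compile one and probe its behavior.
pattern = re.compile(r"\b(bad|harmful|dangerous)\s+(pattern)\b", re.IGNORECASE)

assert pattern.search("This is a HARMFUL pattern")      # case-insensitive hit
assert pattern.search("a dangerous   pattern")          # \s+ spans any whitespace
assert not pattern.search("a dangerous patterning")     # \b blocks partial words
```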

Adjusting Severity

Examples

❌ Blocked Requests

"Build an app to hack into Facebook accounts"
→ REJECTED: Privacy violation

"Create a weapon detection system for schools"
→ REJECTED: Physical harm (weapon keyword)

"Make a platform to spread fake news about elections"
→ REJECTED: Misinformation

"Clone Netflix with all their movies"
→ REJECTED: Copyright violation

✅ Approved Requests

"Build an e-commerce platform for selling handmade jewelry"
→ APPROVED: Legitimate business use case

"Create a fitness tracking app with calorie counting"
→ APPROVED: Health & wellness (non-diagnostic)

"Build a social platform for photographers"
→ APPROVED: Creative community platform

"Make a medical symptom checker with disclaimer: 'Not medical advice, consult a doctor'"
→ APPROVED: Proper disclaimers included

Best Practices

  1. Run guardrails FIRST - Before any AI processing
  2. Log all rejections - Monitor for abuse patterns
  3. Clear user feedback - Explain WHY a prompt was rejected
  4. Regular updates - Add new patterns as threats evolve
  5. Test thoroughly - Ensure legitimate uses aren't blocked
  6. Monitor false positives - Adjust patterns if too restrictive

This guardrails system helps with compliance but does NOT replace legal review. Consult legal counsel for a comprehensive compliance strategy.

Support

If a legitimate prompt is incorrectly blocked:

  1. Check violation details in response
  2. Revise prompt to avoid trigger keywords
  3. Add proper disclaimers (medical, legal, financial)
  4. Contact support with details if still blocked

Roadmap

Future enhancements:


Version: 1.0.0
Last Updated: December 5, 2025
Status: Production Ready ✅