# AI Knowledge Collection Sync - Documentation
**Version**: 1.0
**Date**: March 11, 2026
**Status**: ✅ Implemented
---
## Overview
Synchronizes EspoCRM `CAIKnowledge` entities with XAI Collections for semantic document search. Supports the full collection lifecycle, BLAKE3-based integrity verification, and robust hash-based change detection.
## Features
**Collection Lifecycle Management**
- NEW → Create the collection in XAI
- ACTIVE → Automatically sync the documents
- PAUSED → Sync is paused, the collection is kept
- DEACTIVATED → Delete the collection from XAI
**Dual-Hash Change Detection**
- EspoCRM hash (MD5/SHA256) for local change detection
- XAI BLAKE3 hash for remote integrity verification
- Metadata hash for description changes
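
The hash comparison above can be sketched as a small decision function. This is a minimal sketch, not the actual handler code: `SyncAction`, `JunctionState`, and `decide_action` are hypothetical names, and the rule "content change wins over metadata change" is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SyncAction(Enum):
    SKIP = "skip"              # nothing changed since the last sync
    PATCH_METADATA = "patch"   # only the metadata (e.g. description) changed
    UPLOAD = "upload"          # content is new or has changed


@dataclass
class JunctionState:
    """Values stored on the junction row after the last successful sync."""
    ai_document_id: Optional[str]
    synced_hash: Optional[str]
    synced_metadata_hash: Optional[str]


def decide_action(state: JunctionState, current_hash: str,
                  current_metadata_hash: str) -> SyncAction:
    # Never uploaded, or the file content hash differs: full upload.
    if state.ai_document_id is None or state.synced_hash != current_hash:
        return SyncAction.UPLOAD
    # Same content, different metadata hash: a metadata-only update suffices.
    if state.synced_metadata_hash != current_metadata_hash:
        return SyncAction.PATCH_METADATA
    return SyncAction.SKIP
```

Keeping the decision in a pure function makes it easy to unit-test without touching EspoCRM or XAI.
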
**Robustness**
- BLAKE3 verification after every upload
- Metadata-only updates via PATCH
- Orphan detection & cleanup
- Distributed locking (Redis)
- Daily full sync (at 02:00)
**Error Handling**
- Unsupported MIME types → status "unsupported"
- Transient errors → retry with exponential backoff
- Partial failures are tolerated
---
## Architektur
```
┌─────────────────────────────────────────────────────────────────┐
│ EspoCRM CAIKnowledge │
│ ├─ activationStatus: new/active/paused/deactivated │
│ ├─ syncStatus: unclean/pending_sync/synced/failed │
│ └─ datenbankId: XAI Collection ID │
└─────────────────────────────────────────────────────────────────┘
↓ Webhook
┌─────────────────────────────────────────────────────────────────┐
│ Motia Webhook Handler │
│ → POST /vmh/webhook/aiknowledge/update │
└─────────────────────────────────────────────────────────────────┘
↓ Emit Event
┌─────────────────────────────────────────────────────────────────┐
│ Queue: aiknowledge.sync │
└─────────────────────────────────────────────────────────────────┘
↓ Lock: aiknowledge:{id}
┌─────────────────────────────────────────────────────────────────┐
│ Sync Handler │
│ ├─ Check activationStatus │
│ ├─ Manage Collection Lifecycle │
│ ├─ Sync Documents (with BLAKE3 verification) │
│ └─ Update Statuses │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ XAI Collections API │
│ └─ Collections with embedded documents │
└─────────────────────────────────────────────────────────────────┘
```
---
## EspoCRM Configuration
### 1. Entity: CAIKnowledge
**Fields:**
| Field | Type | Description | Values |
|-------|------|-------------|--------|
| `name` | varchar(255) | Name of the knowledge base | - |
| `datenbankId` | varchar(255) | XAI Collection ID | Filled automatically |
| `activationStatus` | enum | Lifecycle status | new, active, paused, deactivated |
| `syncStatus` | enum | Sync status | unclean, pending_sync, synced, failed |
| `lastSync` | datetime | Last successful sync | ISO 8601 |
| `syncError` | text | Error message on failure | Max 2000 characters |
**Enum definitions:**
```json
{
  "activationStatus": {
    "type": "enum",
    "options": ["new", "active", "paused", "deactivated"],
    "default": "new"
  },
  "syncStatus": {
    "type": "enum",
    "options": ["unclean", "pending_sync", "synced", "failed"],
    "default": "unclean"
  }
}
```
### 2. Junction: CAIKnowledgeCDokumente
**additionalColumns:**
| Field | Type | Description |
|-------|------|-------------|
| `aiDocumentId` | varchar(255) | XAI file_id |
| `syncstatus` | enum | Per-document sync status |
| `syncedHash` | varchar(64) | MD5/SHA256 from EspoCRM |
| `xaiBlake3Hash` | varchar(128) | BLAKE3 hash from XAI |
| `syncedMetadataHash` | varchar(64) | Hash of the metadata |
| `lastSync` | datetime | Last sync of this document |
**Enum definition:**
```json
{
  "syncstatus": {
    "type": "enum",
    "options": ["new", "unclean", "synced", "failed", "unsupported"]
  }
}
```
### 3. Webhooks
**Webhook 1: CREATE**
```json
{
  "event": "CAIKnowledge.afterSave",
  "url": "https://your-motia-domain.com/vmh/webhook/aiknowledge/update",
  "method": "POST",
  "payload": "{\"entity_id\": \"{$id}\", \"entity_type\": \"CAIKnowledge\", \"action\": \"create\"}",
  "condition": "entity.isNew()"
}
```
**Webhook 2: UPDATE**
```json
{
  "event": "CAIKnowledge.afterSave",
  "url": "https://your-motia-domain.com/vmh/webhook/aiknowledge/update",
  "method": "POST",
  "payload": "{\"entity_id\": \"{$id}\", \"entity_type\": \"CAIKnowledge\", \"action\": \"update\"}",
  "condition": "!entity.isNew()"
}
```
**Webhook 3: DELETE (Optional)**
```json
{
  "event": "CAIKnowledge.afterRemove",
  "url": "https://your-motia-domain.com/vmh/webhook/aiknowledge/delete",
  "method": "POST",
  "payload": "{\"entity_id\": \"{$id}\", \"entity_type\": \"CAIKnowledge\", \"action\": \"delete\"}"
}
```
**Recommendation**: Use only CREATE + UPDATE. Handle deletion via `activationStatus="deactivated"`.
### 4. Hooks (EspoCRM Backend)
**Hook 1: Document Link → set syncStatus to "unclean"**
```php
// Hooks/Custom/CAIKnowledge/AfterRelateLinkMultiple.php
namespace Espo\Custom\Hooks\CAIKnowledge;

class AfterRelateLinkMultiple extends \Espo\Core\Hooks\Base
{
    public function afterRelateLinkMultiple($entity, $options, $data)
    {
        if ($data['link'] === 'dokumentes') {
            // Mark as unclean when documents are linked
            $entity->set('syncStatus', 'unclean');
            $this->getEntityManager()->saveEntity($entity);
        }
    }
}
```
**Hook 2: Document Change → set junction to "unclean"**
```php
// Hooks/Custom/CDokumente/AfterSave.php
namespace Espo\Custom\Hooks\CDokumente;

class AfterSave extends \Espo\Core\Hooks\Base
{
    public function afterSave($entity, $options)
    {
        if ($entity->isAttributeChanged('description') ||
            $entity->isAttributeChanged('md5') ||
            $entity->isAttributeChanged('sha256')) {
            // updateJunctionStatuses() and markRelatedKnowledgeUnclean() are
            // helper methods of this hook class that are not shown here.
            // Mark all junction entries as unclean
            $this->updateJunctionStatuses($entity->id, 'unclean');
            // Mark all related CAIKnowledge as unclean
            $this->markRelatedKnowledgeUnclean($entity->id);
        }
    }
}
```
---
## Environment Variables
```bash
# XAI API keys (required)
XAI_API_KEY=your_xai_api_key_here
XAI_MANAGEMENT_KEY=your_xai_management_key_here
# Redis (for locking)
REDIS_HOST=localhost
REDIS_PORT=6379
# EspoCRM
ESPOCRM_API_BASE_URL=https://crm.bitbylaw.com/api/v1
ESPOCRM_API_KEY=your_espocrm_api_key
```
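
A minimal sketch of how these variables might be read and validated at startup. `SyncConfig` and `load_config` are hypothetical names. Treating the EspoCRM variables as required and defaulting Redis to `localhost:6379` are assumptions taken from the example values above, not confirmed behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = ("XAI_API_KEY", "XAI_MANAGEMENT_KEY",
            "ESPOCRM_API_BASE_URL", "ESPOCRM_API_KEY")


@dataclass(frozen=True)
class SyncConfig:
    xai_api_key: str
    xai_management_key: str
    espocrm_api_base_url: str
    espocrm_api_key: str
    redis_host: str
    redis_port: int


def load_config(env: Mapping[str, str] = os.environ) -> SyncConfig:
    # Fail fast with one message that lists every missing variable.
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return SyncConfig(
        xai_api_key=env["XAI_API_KEY"],
        xai_management_key=env["XAI_MANAGEMENT_KEY"],
        espocrm_api_base_url=env["ESPOCRM_API_BASE_URL"].rstrip("/"),
        espocrm_api_key=env["ESPOCRM_API_KEY"],
        redis_host=env.get("REDIS_HOST", "localhost"),   # assumed default
        redis_port=int(env.get("REDIS_PORT", "6379")),   # assumed default
    )
```

Passing the environment as a mapping keeps the function testable with a plain dict.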
---
## Workflows
### Workflow 1: Creating a new knowledge base
```
1. User creates a CAIKnowledge in EspoCRM
   └─ activationStatus: "new" (default)
2. CREATE webhook fires
   └─ Event: aiknowledge.sync
3. Sync handler:
   └─ activationStatus="new" → create the collection in XAI
   └─ Update EspoCRM:
      ├─ datenbankId = collection_id
      ├─ activationStatus = "active"
      └─ syncStatus = "unclean"
4. Next webhook (UPDATE):
   └─ activationStatus="active" → sync documents
```
### Workflow 2: Adding documents
```
1. User links documents to a CAIKnowledge
   └─ EspoCRM hook sets syncStatus = "unclean"
2. UPDATE webhook fires
   └─ Event: aiknowledge.sync
3. Sync handler:
   └─ For each junction entry:
      ├─ Check: MIME type supported?
      ├─ Check: hash changed?
      ├─ Download from EspoCRM
      ├─ Upload to XAI with metadata
      ├─ Verify upload (BLAKE3)
      └─ Update junction: syncstatus="synced"
4. Update CAIKnowledge:
   └─ syncStatus = "synced"
   └─ lastSync = now()
```
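
The "verify upload" step compares a locally computed digest with the one reported by XAI. The sketch below is illustrative only: `verify_upload` is a hypothetical helper, and it assumes XAI reports the hash as a hex string. BLAKE3 is not part of Python's standard `hashlib`, so the hasher is passed in as a parameter; with the third-party `blake3` package installed, `lambda data: blake3.blake3(data).hexdigest()` would be the intended choice. SHA-256 is used here purely as a stand-in so the example runs without extra dependencies.

```python
import hashlib
from typing import Callable, Optional


def sha256_hex(data: bytes) -> str:
    """Stand-in hasher for the demo; swap in a BLAKE3 hasher in practice."""
    return hashlib.sha256(data).hexdigest()


def verify_upload(data: bytes, remote_hash: Optional[str],
                  hasher: Callable[[bytes], str]) -> bool:
    # A missing remote hash counts as a failed verification.
    if not remote_hash:
        return False
    # Compare case-insensitively and ignore surrounding whitespace.
    return hasher(data).lower() == remote_hash.strip().lower()


content = b"example document bytes"
print(verify_upload(content, sha256_hex(content), sha256_hex))
```

A failed verification would then surface as the "Upload verification failed" error described under troubleshooting.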
### Workflow 3: Metadata change
```
1. User changes Document.description in EspoCRM
   └─ EspoCRM hook sets junction syncstatus = "unclean"
   └─ EspoCRM hook sets CAIKnowledge syncStatus = "unclean"
2. UPDATE webhook fires
3. Sync handler:
   └─ Compute metadata hash
   └─ Hash differs? → PATCH to XAI
   └─ If PATCH fails → fallback: re-upload
   └─ Update junction: syncedMetadataHash
```
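
One way to compute a stable metadata hash and apply the PATCH-with-fallback rule is sketched below. The choice of SHA-256 over canonical JSON is an assumption (it matches the 64-character `syncedMetadataHash` column, but the real algorithm is not stated here), and `metadata_hash` and `push_metadata` are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Dict


def metadata_hash(fields: Dict[str, str]) -> str:
    # Sorted keys and fixed separators make the hash independent of dict order.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def push_metadata(patch: Callable[[], None], reupload: Callable[[], None]) -> str:
    """Try the cheap PATCH first and fall back to a full re-upload on failure."""
    try:
        patch()
        return "patched"
    except Exception:
        reupload()
        return "reuploaded"


old = metadata_hash({"document_name": "Vertrag.pdf", "description": "old"})
new = metadata_hash({"description": "new", "document_name": "Vertrag.pdf"})
print(old != new)  # a changed description yields a different hash
```

The callables passed to `push_metadata` stand in for the real XAI client calls, which are not shown in this document.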
### Workflow 4: Deactivating a knowledge base
```
1. User sets activationStatus = "deactivated"
2. UPDATE webhook fires
3. Sync handler:
   └─ Delete the collection from XAI
   └─ Reset all junction entries:
      ├─ syncstatus = "new"
      └─ aiDocumentId = NULL
   └─ CAIKnowledge stays in EspoCRM (with datenbankId)
```
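
Taken together, the four `activationStatus` values map to one handler action each. A minimal sketch of that dispatch, with hypothetical action names and a hypothetical `junction_reset_payload` helper for the reset step of the deactivation workflow:

```python
from typing import Dict, Optional

# activationStatus -> action the sync handler takes (action names are illustrative)
LIFECYCLE_ACTIONS: Dict[str, str] = {
    "new": "create_collection",
    "active": "sync_documents",
    "paused": "skip",
    "deactivated": "delete_collection",
}


def lifecycle_action(activation_status: str) -> str:
    try:
        return LIFECYCLE_ACTIONS[activation_status]
    except KeyError:
        raise ValueError(f"Unknown activationStatus: {activation_status!r}") from None


def junction_reset_payload() -> Dict[str, Optional[str]]:
    """Values written to every junction entry when a knowledge base is deactivated."""
    return {"syncstatus": "new", "aiDocumentId": None}


for status in ("new", "active", "paused", "deactivated"):
    print(status, "->", lifecycle_action(status))
```

Rejecting unknown values explicitly avoids silently skipping a record if the enum is extended in EspoCRM.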
### Workflow 5: Daily Full Sync
```
Cron: daily at 02:00
1. Load all CAIKnowledge with:
   └─ activationStatus = "active"
   └─ syncStatus IN ("unclean", "failed")
2. For each:
   └─ Emit: aiknowledge.sync event
3. Queue processes all of them sequentially
   └─ Catches missed webhooks
```
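
The selection step of the daily full sync can be sketched as a pure function over already-loaded entities. The event shape (`topic` and `data` keys) is an assumption made for illustration, not the actual Motia event format, and `build_full_sync_events` is a hypothetical name.

```python
from typing import Any, Dict, Iterable, List

SYNC_TOPIC = "aiknowledge.sync"
RESYNC_STATUSES = {"unclean", "failed"}


def build_full_sync_events(entities: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one sync event per active knowledge base that still needs syncing."""
    events = []
    for entity in entities:
        if (entity.get("activationStatus") == "active"
                and entity.get("syncStatus") in RESYNC_STATUSES):
            events.append({
                "topic": SYNC_TOPIC,
                "data": {
                    "entity_id": entity["id"],
                    "entity_type": "CAIKnowledge",
                    "action": "update",
                },
            })
    return events


sample = [
    {"id": "kb-1", "activationStatus": "active", "syncStatus": "unclean"},
    {"id": "kb-2", "activationStatus": "active", "syncStatus": "synced"},
    {"id": "kb-3", "activationStatus": "paused", "syncStatus": "failed"},
]
print([e["data"]["entity_id"] for e in build_full_sync_events(sample)])
```

In practice the filter would more likely be pushed into the EspoCRM query; the function above only makes the selection rule explicit.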
---
## Monitoring & Troubleshooting
### Checking logs
```bash
# Motia service logs
sudo journalctl -u motia-iii -f | grep -i "ai knowledge"
# Last 100 sync events
sudo journalctl -u motia-iii -n 100 | grep "AI KNOWLEDGE SYNC"
# Errors from the last 24 hours
sudo journalctl -u motia-iii --since "24 hours ago" | grep "❌"
```
### Checking EspoCRM status
```sql
-- All active knowledge bases with their status
SELECT
    id,
    name,
    activation_status,
    sync_status,
    last_sync,
    sync_error
FROM c_ai_knowledge
WHERE activation_status = 'active';

-- Junction entries with sync problems
SELECT
    j.id,
    k.name AS knowledge_name,
    d.name AS document_name,
    j.syncstatus,
    j.last_sync
FROM c_ai_knowledge_c_dokumente j
JOIN c_ai_knowledge k ON j.c_ai_knowledge_id = k.id
JOIN c_dokumente d ON j.c_dokumente_id = d.id
WHERE j.syncstatus IN ('failed', 'unsupported');
```
### Common Problems
#### Problem: "Lock busy for aiknowledge:xyz"
**Cause**: A previous sync is still running or has crashed
**Solution**:
```bash
# Manually release the Redis lock (only if the previous sync has crashed)
redis-cli
> DEL sync_lock:aiknowledge:xyz
```
#### Problem: "Unsupported MIME type"
**Cause**: The document has a MIME type that XAI does not support
**Solution**:
- Convert the document (e.g. RTF → PDF)
- Or: accept it (the document keeps the status "unsupported")
#### Problem: "Upload verification failed"
**Cause**: XAI returns no BLAKE3 hash, or the hashes do not match
**Solution**:
1. Check the XAI API documentation (has the hash format changed?)
2. If temporary: the retry runs automatically
3. If persistent: contact XAI support
#### Problem: "Collection not found"
**Cause**: The collection was deleted manually in XAI
**Solution**: Resolved automatically - the sync creates a new collection
---
## API Endpoints
### Webhook Endpoint
```http
POST /vmh/webhook/aiknowledge/update
Content-Type: application/json

{
  "entity_id": "kb-123",
  "entity_type": "CAIKnowledge",
  "action": "update"
}
```
**Response:**
```json
{
  "success": true,
  "knowledge_id": "kb-123"
}
```
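
A framework-independent sketch of the validation such an endpoint might perform before emitting the `aiknowledge.sync` event. `handle_webhook` is a hypothetical function; the 400 status code and the error response shape are assumptions, only the success response matches the example above.

```python
from typing import Any, Dict, Tuple


def handle_webhook(body: Dict[str, Any]) -> Tuple[int, Dict[str, Any]]:
    """Validate the webhook payload and build the HTTP status and response body."""
    entity_id = body.get("entity_id")
    if body.get("entity_type") != "CAIKnowledge":
        return 400, {"success": False, "error": "unexpected entity_type"}
    if not isinstance(entity_id, str) or not entity_id:
        return 400, {"success": False, "error": "missing entity_id"}
    # At this point the real handler would emit the aiknowledge.sync event.
    return 200, {"success": True, "knowledge_id": entity_id}


status, response = handle_webhook(
    {"entity_id": "kb-123", "entity_type": "CAIKnowledge", "action": "update"}
)
print(status, response)
```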
---
## Performance
### Typical Sync Times
| Scenario | Time | Notes |
|----------|------|-------|
| Create collection | < 1s | API call only |
| 1 document (1 MB) | 2-4s | Upload + verify |
| 10 documents (10 MB) | 20-40s | Sequential |
| 100 documents (100 MB) | 3-6 min | Lock TTL: 30 min |
| Metadata-only update | < 1s | PATCH only |
| Orphan cleanup | 1-3s | Per 10 documents |
### Lock TTLs
- **AIKnowledge Sync**: 30 minutes (1800 seconds)
- **Redis Lock**: Same as above
- **Auto-Release**: On timeout (TTL expired)
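
The locking pattern can be sketched with `SET key value NX EX ttl`, which is how a TTL-based lock is commonly taken in Redis. The functions below only rely on `set(..., nx=True, ex=...)`, `get`, and `delete`, which redis-py clients provide; `InMemoryStore` is a stand-in so the example runs without a Redis server. `acquire_lock` and `release_lock` are hypothetical names, and the check-then-delete in `release_lock` is not atomic (a production version would typically use a Lua script).

```python
import time
import uuid
from typing import Dict, Optional, Tuple

LOCK_TTL_SECONDS = 1800  # 30 minutes, as listed above


class InMemoryStore:
    """Stand-in offering the small subset of the Redis client API used below."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def _purge(self, key: str) -> None:
        item = self._data.get(key)
        if item is not None and item[1] <= time.monotonic():
            del self._data[key]

    def set(self, key, value, nx=False, ex=None):
        self._purge(key)
        if nx and key in self._data:
            return None  # key already held, mirrors a failed SET NX
        expiry = time.monotonic() + ex if ex is not None else float("inf")
        self._data[key] = (value, expiry)
        return True

    def get(self, key):
        self._purge(key)
        item = self._data.get(key)
        return item[0] if item else None

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


def lock_key(knowledge_id: str) -> str:
    return f"sync_lock:aiknowledge:{knowledge_id}"


def acquire_lock(store, knowledge_id: str, ttl: int = LOCK_TTL_SECONDS) -> Optional[str]:
    """Return a token if the lock was acquired, otherwise None ("lock busy")."""
    token = uuid.uuid4().hex
    return token if store.set(lock_key(knowledge_id), token, nx=True, ex=ttl) else None


def release_lock(store, knowledge_id: str, token: str) -> bool:
    """Release the lock only if it is still held with our token."""
    current = store.get(lock_key(knowledge_id))
    if isinstance(current, bytes):  # redis-py returns bytes unless decode_responses=True
        current = current.decode()
    if current != token:
        return False
    store.delete(lock_key(knowledge_id))
    return True
```

Storing a random token as the lock value prevents a slow worker from releasing a lock that has already expired and been taken by another worker.
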
### Rate Limits
**XAI API:**
- Files Upload: ~100 requests/minute
- Management API: ~1000 requests/minute
**Strategy on rate limit (429)**:
- Exponential Backoff: 2s, 4s, 8s, 16s, 32s
- Respect `Retry-After` Header
- Max 5 Retries
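
The backoff schedule above can be expressed as a small helper. This is a sketch under assumptions: `backoff_delay` is a hypothetical name, `Retry-After` is only handled in its delta-seconds form, and taking the larger of the two delays is one possible reading of "respect the header".

```python
from typing import Optional

MAX_RETRIES = 5
BASE_DELAY_SECONDS = 2.0


def backoff_delay(attempt: int, retry_after: Optional[str] = None) -> float:
    """Delay before retry number `attempt` (1-based): 2s, 4s, 8s, 16s, 32s."""
    if not 1 <= attempt <= MAX_RETRIES:
        raise ValueError(f"attempt must be between 1 and {MAX_RETRIES}")
    delay = BASE_DELAY_SECONDS * 2 ** (attempt - 1)
    if retry_after is not None:
        try:
            # Never retry earlier than the server asked for.
            delay = max(delay, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return delay


print([backoff_delay(n) for n in range(1, MAX_RETRIES + 1)])
```
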
---
## XAI Collections Metadata
### Document Metadata Fields
Stored in XAI for each document:
```json
{
  "fields": {
    "document_name": "Vertrag.pdf",
    "description": "Mietvertrag Mustermann",
    "created_at": "2024-01-01T00:00:00Z",
    "modified_at": "2026-03-10T15:30:00Z",
    "espocrm_id": "dok-123"
  }
}
```
**inject_into_chunk**: `true` for `document_name` and `description`
→ Improves semantic search
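
A sketch of how the `fields` object above might be assembled from an EspoCRM document record. The attribute names (`name`, `description`, `createdAt`, `modifiedAt`, `id`) and the `YYYY-MM-DD HH:MM:SS` UTC timestamp format are assumptions about the EspoCRM API response, and `build_document_fields` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def to_iso_utc(espo_datetime: str) -> str:
    """Convert 'YYYY-MM-DD HH:MM:SS' (assumed to be UTC) to ISO 8601 with a Z suffix."""
    parsed = datetime.strptime(espo_datetime, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_document_fields(doc: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
    return {
        "fields": {
            "document_name": doc["name"],
            "description": doc.get("description") or "",  # description may be empty
            "created_at": to_iso_utc(doc["createdAt"]),
            "modified_at": to_iso_utc(doc["modifiedAt"]),
            "espocrm_id": doc["id"],
        }
    }


print(build_document_fields({
    "id": "dok-123",
    "name": "Vertrag.pdf",
    "description": "Mietvertrag Mustermann",
    "createdAt": "2024-01-01 00:00:00",
    "modifiedAt": "2026-03-10 15:30:00",
}))
```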
### Collection Metadata
```json
{
  "metadata": {
    "espocrm_entity_type": "CAIKnowledge",
    "espocrm_entity_id": "kb-123",
    "created_at": "2026-03-11T10:00:00Z"
  }
}
```
---
## Testing
### Manual Test
```bash
# 1. Create a CAIKnowledge in EspoCRM
# 2. Check the logs
sudo journalctl -u motia-iii -f
# 3. Check the Redis lock
redis-cli
> KEYS sync_lock:aiknowledge:*
# 4. Check the XAI collection
curl -H "Authorization: Bearer $XAI_MANAGEMENT_KEY" \
  https://management-api.x.ai/v1/collections
```
### Integration Test
```python
# tests/test_aiknowledge_sync.py
import asyncio

import pytest


# `espocrm`, `trigger_webhook`, and `doc_id` are assumed to be fixtures from the
# project's conftest.py (not shown here); the marker requires pytest-asyncio.
@pytest.mark.asyncio
async def test_full_sync_workflow(espocrm, trigger_webhook, doc_id):
    """Test complete sync workflow"""
    # 1. Create CAIKnowledge with status "new"
    knowledge = await espocrm.create_entity('CAIKnowledge', {
        'name': 'Test KB',
        'activationStatus': 'new'
    })
    # 2. Trigger webhook
    await trigger_webhook(knowledge['id'])
    # 3. Wait for sync
    await asyncio.sleep(5)
    # 4. Check collection created
    knowledge = await espocrm.get_entity('CAIKnowledge', knowledge['id'])
    assert knowledge['datenbankId'] is not None
    assert knowledge['activationStatus'] == 'active'
    # 5. Link document
    await espocrm.link_entities('CAIKnowledge', knowledge['id'], 'CDokumente', doc_id)
    # 6. Trigger webhook again
    await trigger_webhook(knowledge['id'])
    await asyncio.sleep(10)
    # 7. Check junction synced
    junction = await espocrm.get_junction_entries(
        'CAIKnowledgeCDokumente',
        'cAIKnowledgeId',
        knowledge['id']
    )
    assert junction[0]['syncstatus'] == 'synced'
    assert junction[0]['xaiBlake3Hash'] is not None
```
---
## Maintenance
### Weekly Checks
- [ ] Check failed syncs in EspoCRM
- [ ] Check Redis memory usage
- [ ] Check XAI storage usage
- [ ] Review logs for patterns
### Monthly Tasks
- [ ] Clean up old syncError messages
- [ ] Verify XAI collection integrity
- [ ] Review performance metrics
- [ ] Update MIME type support list
---
## Support
**If problems occur:**
1. **Check logs**: `journalctl -u motia-iii -f`
2. **Check EspoCRM status**: SQL queries (see above)
3. **Check Redis locks**: `redis-cli KEYS sync_lock:*`
4. **XAI API Status**: https://status.x.ai
**Contact:**
- Team: BitByLaw Development
- Motia Docs: `/opt/motia-iii/bitbylaw/docs/INDEX.md`
---
**Version History:**
- **1.0** (March 11, 2026) - Initial Release
  - Collection Lifecycle Management
  - BLAKE3 Hash Verification
  - Daily Full Sync
  - Metadata Change Detection