feat(document-sync): enhance DocumentSync with file status checks and hash-based change detection; add thumbnail generation and metadata update methods
229
docs/DOCUMENT_SYNC_XAI_STATUS.md
Normal file
@@ -0,0 +1,229 @@
# Document Sync with xAI Collections - Implementation Status

## ✅ Implemented

### 1. Webhook Endpoints
- **POST** `/vmh/webhook/document/create`
- **POST** `/vmh/webhook/document/update`
- **POST** `/vmh/webhook/document/delete`

### 2. Event Handler (`document_sync_event_step.py`)
- Queue topics: `vmh.document.{create|update|delete}`
- Redis distributed locking
- Full document loading from EspoCRM

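The handler's Redis lock is not shown in this document. A minimal sketch of one common shape for it follows, assuming the lock is a single key written with `SET ... NX EX` and released only by the worker that holds the matching token. The key pattern `lock:document:<id>`, the TTL, the function names, and the `InMemoryRedis` stand-in (included only so the example runs without a server) are all hypothetical.

```python
import time
import uuid
from typing import Dict, Optional, Tuple


class InMemoryRedis:
    """Hypothetical stand-in mimicking the redis-py calls used below."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}

    def _live_value(self, name: str) -> Optional[str]:
        entry = self._data.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and expires_at <= time.monotonic():
            del self._data[name]  # lazily drop expired keys
            return None
        return value

    def set(self, name: str, value: str, ex: Optional[int] = None, nx: bool = False):
        # Like redis-py, return None when NX is requested and the key exists
        if nx and self._live_value(name) is not None:
            return None
        expires_at = time.monotonic() + ex if ex is not None else None
        self._data[name] = (value, expires_at)
        return True

    def get(self, name: str) -> Optional[str]:
        return self._live_value(name)

    def delete(self, name: str) -> int:
        return 1 if self._data.pop(name, None) is not None else 0


def acquire_lock(client, document_id: str, ttl_seconds: int = 60) -> Optional[str]:
    """Try to take the lock; return an owner token, or None if it is held."""
    token = uuid.uuid4().hex
    if client.set(f"lock:document:{document_id}", token, nx=True, ex=ttl_seconds):
        return token
    return None


def release_lock(client, document_id: str, token: str) -> bool:
    """Release the lock only if this token still owns it."""
    key = f"lock:document:{document_id}"
    # Not atomic: a production version would run this check-and-delete
    # as a single Lua script on the Redis server.
    if client.get(key) == token:
        client.delete(key)
        return True
    return False
```

With a real redis-py client, `get` returns bytes unless the client was created with `decode_responses=True`, so the token comparison would need to account for that.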
### 3. Sync Utilities (`document_sync_utils.py`)
- **✅ File status check**: status "Neu" (new) or "Geändert" (changed) → xAI sync required
- **✅ Hash-based change detection**: MD5/SHA comparison for updates
- **✅ Related entities discovery**: searches many-to-many attachments
- **✅ Collection requirements**: automatically determines which collections are needed

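The hash-based change detection can be illustrated with the standard library alone. This is a sketch rather than the code in `document_sync_utils.py`: the function names are hypothetical, and it assumes the stored `xaiSyncedHash` was produced with the same algorithm as the current hash. Comparing an MD5 value against a stored SHA value would always report a change.

```python
import hashlib
from typing import Optional


def compute_file_hash(content: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest of the file content for the given algorithm."""
    return hashlib.new(algorithm, content).hexdigest()


def content_changed(current_hash: Optional[str], synced_hash: Optional[str]) -> Optional[bool]:
    """Compare two hex digests.

    Returns None when either hash is missing, because no decision is possible.
    """
    if not current_hash or not synced_hash:
        return None
    # Hex digests are case-insensitive, so normalize before comparing
    return current_hash.lower() != synced_hash.lower()
```

Returning `None` for a missing hash keeps "unknown" separate from "unchanged", which matches the distinct log message the sync code emits when no hash values are available.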
## ⏳ In Progress

### 4. Thumbnail Generation (`generate_thumbnail()`)

**Requirements:**
- First page of a PDF as the preview image
- DOCX/DOC → PDF → image conversion
- Image files: resize to thumbnail size
- Fallback: generic file icons based on MIME type

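The four requirements amount to a dispatch on the MIME type. A minimal sketch of that selection step follows; the function name and the returned strategy labels are hypothetical, and the MIME type sets cover only the formats named above.

```python
from typing import Optional

PDF_TYPES = {"application/pdf"}
WORD_TYPES = {
    "application/msword",  # DOC
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # DOCX
}


def select_thumbnail_strategy(mime_type: Optional[str]) -> str:
    """Map a MIME type to one of: 'pdf', 'word', 'image', 'icon'."""
    # Drop parameters such as "; charset=..." and normalize the case
    normalized = (mime_type or "").split(";")[0].strip().lower()
    if normalized in PDF_TYPES:
        return "pdf"
    if normalized in WORD_TYPES:
        return "word"
    if normalized.startswith("image/"):
        return "image"
    # Everything else falls back to a generic file icon
    return "icon"
```

Keeping the selection separate from the rendering code means the fallback path can be tested without Poppler or LibreOffice installed.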
**Required dependencies:**
```bash
# Python packages
# Note: docx2pdf drives Microsoft Word and runs on Windows/macOS only;
# on Linux the DOCX → PDF step has to go through LibreOffice (installed below).
pip install pdf2image python-docx Pillow docx2pdf

# System dependencies (Ubuntu/Debian)
apt-get install poppler-utils libreoffice
```

**Implementation steps:**

1. **PDF handling** (priority 1):
```python
import io

from pdf2image import convert_from_path
from PIL import Image


def generate_pdf_thumbnail(pdf_path: str) -> bytes:
    # Render only the first page to an image
    images = convert_from_path(pdf_path, first_page=1, last_page=1, dpi=150)
    if not images:
        raise ValueError(f"No pages rendered from {pdf_path}")
    thumbnail = images[0]

    # Shrink in place to fit within the thumbnail box (e.g. 200x280);
    # thumbnail() preserves the aspect ratio
    thumbnail.thumbnail((200, 280), Image.Resampling.LANCZOS)

    # Serialize to PNG bytes
    buffer = io.BytesIO()
    thumbnail.save(buffer, format='PNG')
    return buffer.getvalue()
```

2. **DOCX handling** (priority 2):
```python
import subprocess
import tempfile
from pathlib import Path


def generate_docx_thumbnail(docx_path: str) -> bytes:
    # docx2pdf requires Microsoft Word (Windows/macOS only), so this uses
    # the LibreOffice command line installed above instead.
    with tempfile.TemporaryDirectory() as tmp_dir:
        # DOCX → PDF conversion via headless LibreOffice
        subprocess.run(
            ['soffice', '--headless', '--convert-to', 'pdf',
             '--outdir', tmp_dir, docx_path],
            check=True,
            timeout=120,
        )
        pdf_path = Path(tmp_dir) / (Path(docx_path).stem + '.pdf')

        # Generate the thumbnail from the PDF; the temporary directory
        # (and the PDF inside it) is removed on exit, even on errors
        return generate_pdf_thumbnail(str(pdf_path))
```

3. **Image handling** (priority 3):
```python
import io

from PIL import Image


def generate_image_thumbnail(image_path: str) -> bytes:
    # Open the image and close the file handle when done
    with Image.open(image_path) as img:
        # Shrink in place, preserving the aspect ratio
        img.thumbnail((200, 280), Image.Resampling.LANCZOS)

        # PNG cannot store CMYK data, so convert such images to RGB first
        if img.mode == 'CMYK':
            img = img.convert('RGB')

        # Serialize to PNG bytes
        buffer = io.BytesIO()
        img.save(buffer, format='PNG')
    return buffer.getvalue()
```

4. **Thumbnail upload to EspoCRM**:
```python
# EspoCRM supports preview images via the Attachment API
async def upload_thumbnail_to_espocrm(
    document_id: str,
    thumbnail_bytes: bytes,
    espocrm_api
):
    # Build the Attachment record
    attachment_data = {
        'name': 'preview.png',
        'type': 'image/png',
        'role': 'Inline Attachment',
        'parentType': 'Document',
        'parentId': document_id,
        'field': 'previewImage'  # Custom field?
    }

    # Upload via the EspoCRM Attachment API
    # POST /api/v1/Attachment with multipart/form-data
    # TODO: espocrm.py needs an upload_attachment() method
    raise NotImplementedError("upload_attachment() is not available in espocrm.py yet")
```

**Open questions:**
- Which field on the EspoCRM Document entity holds the preview? `previewImage`? `thumbnail`?
- Thumbnail size? (recommended: 200x280 or 300x400)
- Format: PNG or JPEG?

## ❌ Not Yet Implemented

### 5. xAI Service (`xai_service.py`)

**Requirements:**
- File upload to xAI (based on `test_xai_collections_api.py`)
- Add file to collections
- Remove file from collections
- File download from EspoCRM

**Existing reference code:**
- `/opt/motia-iii/bitbylaw/test_xai_collections_api.py` (630 lines, all xAI operations tested)

**Implementation plan:**

```python
import os


class XAIService:
    def __init__(self, context=None):
        # Credentials are read from the environment
        self.management_key = os.getenv('XAI_MANAGEMENT_KEY')
        self.api_key = os.getenv('XAI_API_KEY')
        self.context = context

    async def upload_file(self, file_content: bytes, filename: str) -> str:
        """Upload a file to xAI → returns file_id"""
        # Multipart/form-data upload
        # POST https://api.x.ai/v1/files
        pass

    async def add_to_collection(self, collection_id: str, file_id: str):
        """Add a file to a collection"""
        # POST https://management-api.x.ai/v1/collections/{collection_id}/documents/{file_id}
        pass

    async def remove_from_collection(self, collection_id: str, file_id: str):
        """Remove a file from a collection"""
        # DELETE https://management-api.x.ai/v1/collections/{collection_id}/documents/{file_id}
        pass

    async def download_from_espocrm(self, attachment_id: str) -> bytes:
        """Download a file from an EspoCRM attachment"""
        # GET https://crm.bitbylaw.com/api/v1/Attachment/file/{attachment_id}
        pass
```

## 📋 Integration Checklist

### Full Upload Flow:

1. ✅ Receive webhook → emit event
2. ✅ Event handler: acquire lock
3. ✅ Load document from EspoCRM
4. ✅ Decision: sync needed? (file status, hash check, collections)
5. ⏳ Download file from EspoCRM
6. ⏳ Compute hash (MD5/SHA)
7. ⏳ Generate thumbnail
8. ❌ Upload to xAI (if new or hash changed)
9. ❌ Add to collections
10. ⏳ Update EspoCRM metadata (xaiFileId, xaiCollections, xaiSyncedHash, thumbnail)
11. ✅ Release lock

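One property of this flow worth making explicit is that the lock release in step 11 has to run whether or not the steps in between succeed. The sketch below shows that shape with `try`/`finally`. It is not the actual event handler: `run_sync_flow` and its three injected callables are hypothetical placeholders for the lock and sync steps.

```python
import asyncio
from typing import Awaitable, Callable


async def run_sync_flow(
    document_id: str,
    acquire: Callable[[str], Awaitable[bool]],
    release: Callable[[str], Awaitable[None]],
    process: Callable[[str], Awaitable[None]],
) -> str:
    """Run the sync steps under a lock and report 'skipped', 'synced' or 'failed'."""
    # Step 2: give up early if another worker already holds the lock
    if not await acquire(document_id):
        return "skipped"
    try:
        # Steps 3-10: load, decide, download, hash, thumbnail, upload, update
        await process(document_id)
        return "synced"
    except Exception:
        # A real handler would also log the error and record it on the Document
        return "failed"
    finally:
        # Step 11: always release, on success and on failure
        await release(document_id)
```

Passing the steps in as callables keeps the control flow testable without Redis, EspoCRM, or xAI being reachable.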
### File Statuses in EspoCRM:

- **"Neu"** (new): completely new file → xAI upload + collection add
- **"Geändert"** (changed): file content changed → xAI re-upload + collection update
- **"Gesynct"** (synced): successfully synced, no changes
- **"Fehler"** (error): sync failed (with error message)

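The four status values can be modeled as an enum so that the "does this status trigger a sync?" question lives in one place. This is an illustrative sketch with hypothetical names. It covers only the four values listed above, whereas the sync code in this commit also accepts lowercase and English variants. Treating "Fehler" as not triggering a sync is an assumption, since the document does not say whether failed documents are retried.

```python
from enum import Enum
from typing import Optional


class DateiStatus(str, Enum):
    NEU = "Neu"
    GEAENDERT = "Geändert"
    GESYNCT = "Gesynct"
    FEHLER = "Fehler"


# Statuses that require an xAI upload or re-upload
SYNC_TRIGGERS = {DateiStatus.NEU, DateiStatus.GEAENDERT}


def requires_sync(raw_status: Optional[str]) -> bool:
    """Return True if the raw field value is a status that triggers a sync."""
    try:
        status = DateiStatus(raw_status)
    except ValueError:
        # Unknown or missing values never trigger a sync in this sketch
        return False
    return status in SYNC_TRIGGERS
```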
### EspoCRM Custom Fields:

**Required on the Document entity:**
- `dateiStatus` (Enum): "Neu", "Geändert", "Gesynct", "Fehler"
- `md5` (String): MD5 hash of the file
- `sha` (String): SHA hash of the file
- `xaiFileId` (String): xAI file ID
- `xaiCollections` (Array): JSON array of collection IDs
- `xaiSyncedHash` (String): hash at the last successful sync
- `xaiSyncStatus` (Enum): "syncing", "synced", "failed"
- `xaiSyncError` (Text): error message when a sync fails
- `previewImage` (Attachment?): preview image

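The field list includes an error status and an error message, but this commit only writes the success case. A sketch of the matching failure payload follows, assuming the fields are used as their names suggest. The helper name and the length cap are hypothetical, and the real size limit of `xaiSyncError` is not specified in this document.

```python
from typing import Any, Dict

MAX_ERROR_LENGTH = 500  # hypothetical cap; the real field size is not specified


def build_failure_update(error: Exception) -> Dict[str, Any]:
    """Build the EspoCRM update payload for a failed sync (assumed field usage)."""
    message = f"{type(error).__name__}: {error}"
    return {
        "dateiStatus": "Fehler",
        "xaiSyncStatus": "failed",
        # Truncate so an oversized traceback cannot make the update itself fail
        "xaiSyncError": message[:MAX_ERROR_LENGTH],
    }
```

The resulting dictionary could be passed to the same `update_entity('Document', ...)` call that the success path uses.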
## 🚀 Next Steps

**Priority 1: xAI service**
- Extract code from `test_xai_collections_api.py`
- Move it into `services/xai_service.py`
- Implement the EspoCRM download function

**Priority 2: Thumbnail generator**
- Install dependencies
- Implement the PDF thumbnail
- Extend the EspoCRM upload method

**Priority 3: Test the integration**
- Create a Document in EspoCRM
- Set the file status to "Neu"
- Trigger the webhook
- Analyze the logs

## 📚 References

- **xAI API tests**: `/opt/motia-iii/bitbylaw/test_xai_collections_api.py`
- **EspoCRM API**: `services/espocrm.py`
- **Beteiligte Sync** (reference implementation): `steps/vmh/beteiligte_sync_event_step.py`

@@ -162,6 +162,11 @@ class DocumentSync:
         """
         Decides whether a Document must be synchronized to xAI

+        Checks:
+        1. File status field ("Neu", "Geändert")
+        2. Hash values for change detection
+        3. Related entities with xAI Collections
+
         Args:
             document: Complete Document entity from EspoCRM

@@ -178,9 +183,38 @@ class DocumentSync:
         xai_file_id = document.get('xaiFileId')
         xai_collections = document.get('xaiCollections') or []

+        # File status and hash fields
+        datei_status = document.get('dateiStatus') or document.get('fileStatus')
+        file_md5 = document.get('md5') or document.get('fileMd5')
+        file_sha = document.get('sha') or document.get('fileSha')
+        xai_synced_hash = document.get('xaiSyncedHash')  # Hash at the last xAI sync
+
         self._log(f"📋 Document Analysis: {doc_name} (ID: {doc_id})")
         self._log(f" xaiFileId: {xai_file_id or 'N/A'}")
         self._log(f" xaiCollections: {xai_collections}")
+        self._log(f" File status: {datei_status or 'N/A'}")
+        self._log(f" MD5: {file_md5[:16] if file_md5 else 'N/A'}...")
+        self._log(f" SHA: {file_sha[:16] if file_sha else 'N/A'}...")
+        self._log(f" xaiSyncedHash: {xai_synced_hash[:16] if xai_synced_hash else 'N/A'}...")

+        # ═══════════════════════════════════════════════════════════════
+        # PRIORITY CHECK: file status "Neu" or "Geändert"
+        # ═══════════════════════════════════════════════════════════════
+        if datei_status in ['Neu', 'Geändert', 'neu', 'geändert', 'New', 'Changed']:
+            self._log(f"🆕 File status: '{datei_status}' → xAI sync REQUIRED")
+
+            # Get collections (either existing ones or from related entities)
+            if xai_collections:
+                target_collections = xai_collections
+            else:
+                target_collections = await self._get_required_collections_from_relations(doc_id)
+
+            if target_collections:
+                return (True, target_collections, f"File status: {datei_status}")
+            else:
+                # File is new/changed but no collections were found
+                self._log(f"⚠️ File status '{datei_status}' but no collections found - skipping sync")
+                return (False, [], f"File status: {datei_status}, but no collections")
+
         # ═══════════════════════════════════════════════════════════════
         # CASE 1: Document is already in xAI AND collections are set
@@ -188,8 +222,19 @@ class DocumentSync:
         if xai_file_id and xai_collections:
             self._log(f"✅ Document already synced to xAI with {len(xai_collections)} collection(s)")

-            # Check whether an update is needed (e.g. if the file itself changed)
-            # TODO: Implement file hash comparison for update detection
+            # Check whether the file content changed (hash comparison)
+            current_hash = file_md5 or file_sha
+
+            if current_hash and xai_synced_hash:
+                if current_hash != xai_synced_hash:
+                    self._log("🔄 Hash change detected! RESYNC required")
+                    self._log(f" Old: {xai_synced_hash[:16]}...")
+                    self._log(f" New: {current_hash[:16]}...")
+                    return (True, xai_collections, "File content changed (hash mismatch)")
+                else:
+                    self._log("✅ Hash identical - no change")
+            else:
+                self._log("⚠️ No hash values available for comparison")

             return (False, xai_collections, "Already synced, no change detected")

@@ -316,3 +361,97 @@ class DocumentSync:
         except Exception as e:
             self._log(f"❌ Error loading download info: {e}", level='error')
             return None
+
+    async def generate_thumbnail(self, file_path: str, mime_type: str) -> Optional[bytes]:
+        """
+        Generates a preview image (thumbnail) for a Document
+
+        Supports:
+        - PDF: first page as an image
+        - DOCX/DOC: conversion to PDF, then first page
+        - Images: resize to thumbnail size
+        - Other: placeholder icon based on MIME type
+
+        Args:
+            file_path: Path to the file (local or download URL)
+            mime_type: MIME type of the Document
+
+        Returns:
+            Thumbnail as bytes (PNG/JPEG) or None on error
+        """
+        self._log(f"🖼️ Thumbnail generation for {mime_type}")
+
+        # TODO: Implementation
+        #
+        # Required libraries:
+        # - pdf2image (for PDF → image)
+        # - python-docx + docx2pdf (for DOCX → PDF → image)
+        # - Pillow (PIL) for image processing
+        # - poppler-utils (system dependency for pdf2image)
+        #
+        # Implementation steps:
+        #
+        # 1. PDF handling:
+        #    from pdf2image import convert_from_path
+        #    images = convert_from_path(file_path, first_page=1, last_page=1)
+        #    thumbnail = images[0].resize((200, 280))
+        #    return thumbnail_to_bytes(thumbnail)
+        #
+        # 2. DOCX handling:
+        #    - Convert to a temporary PDF
+        #    - Then handle it like a PDF
+        #
+        # 3. Image handling:
+        #    from PIL import Image
+        #    img = Image.open(file_path)
+        #    img.thumbnail((200, 280))
+        #    return image_to_bytes(img)
+        #
+        # 4. Fallback:
+        #    - Generic file-type icon based on MIME type
+
+        self._log("⚠️ Thumbnail generation not implemented yet", level='warn')
+        return None
+
+    async def update_sync_metadata(
+        self,
+        document_id: str,
+        xai_file_id: str,
+        collection_ids: List[str],
+        file_hash: Optional[str] = None,
+        thumbnail_data: Optional[bytes] = None
+    ) -> None:
+        """
+        Updates Document metadata after a successful xAI sync
+
+        Args:
+            document_id: EspoCRM Document ID
+            xai_file_id: xAI file ID
+            collection_ids: List of xAI collection IDs
+            file_hash: MD5/SHA hash of the synced file
+            thumbnail_data: Preview image as bytes
+        """
+        try:
+            update_data = {
+                'xaiFileId': xai_file_id,
+                'xaiCollections': collection_ids,
+                'dateiStatus': 'Gesynct',  # Reset status
+            }
+
+            # Store the hash for future change detection
+            if file_hash:
+                update_data['xaiSyncedHash'] = file_hash
+
+            # Upload the thumbnail as an attachment (if present)
+            if thumbnail_data:
+                # TODO: Implement thumbnail upload to EspoCRM
+                # EspoCRM supports preview images for Documents
+                # Must be uploaded as a separate attachment
+                self._log("⚠️ Thumbnail upload not implemented yet", level='warn')
+
+            await self.espocrm.update_entity('Document', document_id, update_data)
+            self._log(f"✅ Sync metadata updated for Document {document_id}")
+
+        except Exception as e:
+            self._log(f"❌ Error updating sync metadata: {e}", level='error')
+            raise