feat(document-sync): enhance DocumentSync with file status checks and hash-based change detection; add thumbnail generation and metadata update methods

2026-03-03 09:15:02 +00:00
parent ee9aab049f
commit 70265c9adf
2 changed files with 370 additions and 2 deletions
--- a/docs/DOCUMENT_SYNC_XAI_STATUS.md
+++ b/docs/DOCUMENT_SYNC_XAI_STATUS.md
@@ -0,0 +1,229 @@
 # Document Sync with xAI Collections - Implementation Status
 ## ✅ Implemented
 ### 1. Webhook Endpoints
 - **POST** `/vmh/webhook/document/create`
 - **POST** `/vmh/webhook/document/update`  
 - **POST** `/vmh/webhook/document/delete`
 ### 2. Event Handler (`document_sync_event_step.py`)
 - Queue Topics: `vmh.document.{create|update|delete}`
 - Redis Distributed Locking
 - Full document loading from EspoCRM
 ### 3. Sync Utilities (`document_sync_utils.py`)
 - **✅ File status check**: "Neu" (new) or "Geändert" (changed) → xAI sync required
 - **✅ Hash-based change detection**: MD5/SHA comparison for updates
 - **✅ Related entities discovery**: searches many-to-many attachments
 - **✅ Collection requirements**: automatically determines which collections are needed
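The hash-based change detection listed above can be sketched as follows. This is a minimal illustration, not the code in `document_sync_utils.py`: `compute_file_hashes` and `needs_resync` are hypothetical helpers, and SHA-256 is an assumption, since this document only says "SHA".

```python
import hashlib
from typing import Dict, Optional


def compute_file_hashes(path: str, chunk_size: int = 65536) -> Dict[str, str]:
    """Stream a file once and return its MD5 and SHA-256 hex digests."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()  # assumption: "SHA" means SHA-256
    with open(path, "rb") as handle:
        # Read in chunks so large documents are never fully held in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha": sha256.hexdigest()}


def needs_resync(current_hash: Optional[str], synced_hash: Optional[str]) -> bool:
    """True only when both hashes are known and differ."""
    if not current_hash or not synced_hash:
        return False  # nothing to compare, so no change can be proven
    return current_hash != synced_hash
```

Both sides of the comparison should come from the same algorithm. Comparing an MD5 digest against a stored SHA digest would always report a change.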
 ## ⏳ In Progress
 ### 4. Thumbnail Generation (`generate_thumbnail()`)
 **Requirements:**
 - First page of a PDF as the preview image
 - DOCX/DOC → PDF → image conversion
 - Image files: resize to thumbnail size
 - Fallback: generic file icons based on MIME type
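The four cases above amount to a dispatch on the MIME type. The sketch below shows one way to express that. `pick_thumbnail_strategy` and `WORD_MIME_TYPES` are hypothetical names, and the strategy labels are placeholders for whichever generator functions end up being implemented.

```python
WORD_MIME_TYPES = {
    "application/msword",  # .doc
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # .docx
}


def pick_thumbnail_strategy(mime_type: str) -> str:
    """Map a MIME type to one of: 'pdf', 'word', 'image', 'fallback'."""
    # Drop parameters such as "; charset=utf-8" and normalize case
    normalized = (mime_type or "").split(";")[0].strip().lower()
    if normalized == "application/pdf":
        return "pdf"
    if normalized in WORD_MIME_TYPES:
        return "word"
    if normalized.startswith("image/"):
        return "image"
    return "fallback"  # generic icon chosen by MIME type
```

Image types that the imaging library cannot open (SVG, for example) would still need to drop to the fallback path at render time.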
 **Required dependencies:**
 ```bash
 # Python Packages
 pip install pdf2image python-docx Pillow docx2pdf
 # System Dependencies (Ubuntu/Debian)
 apt-get install poppler-utils libreoffice
 ```
 **Implementation steps:**
 1. **PDF Handling** (priority 1):
 ```python
 from pdf2image import convert_from_path
 from PIL import Image
 import io
 def generate_pdf_thumbnail(pdf_path: str) -> bytes:
    # Render only the first page to an image
    images = convert_from_path(pdf_path, first_page=1, last_page=1, dpi=150)
    thumbnail = images[0]
    # Shrink in place to thumbnail size (e.g. 200x280), keeping the aspect ratio
    thumbnail.thumbnail((200, 280), Image.Resampling.LANCZOS)
    # Encode as PNG bytes
    buffer = io.BytesIO()
    thumbnail.save(buffer, format='PNG')
    return buffer.getvalue()
 ```
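 2. **DOCX Handling** (priority 2):
 ```python
 import os
 import subprocess
 import tempfile
 def generate_docx_thumbnail(docx_path: str) -> bytes:
    # docx2pdf automates Microsoft Word and does not work on Linux,
    # so this sketch calls headless LibreOffice (`soffice`) directly.
    with tempfile.TemporaryDirectory() as tmp_dir:
        # DOCX → PDF conversion; the PDF is written into tmp_dir
        subprocess.run(
            ['soffice', '--headless', '--convert-to', 'pdf',
             '--outdir', tmp_dir, docx_path],
            check=True,
        )
        # LibreOffice names the output after the input file's base name
        base_name = os.path.splitext(os.path.basename(docx_path))[0]
        # The temporary directory is removed automatically on exit
        return generate_pdf_thumbnail(os.path.join(tmp_dir, base_name + '.pdf'))
 ```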
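 3. **Image Handling** (priority 3):
 ```python
 from PIL import Image
 import io
 def generate_image_thumbnail(image_path: str) -> bytes:
    # The context manager closes the underlying file handle
    with Image.open(image_path) as img:
        # PNG cannot store every mode (e.g. CMYK), so normalize first
        if img.mode not in ('RGB', 'RGBA', 'L'):
            img = img.convert('RGBA')
        img.thumbnail((200, 280), Image.Resampling.LANCZOS)
        buffer = io.BytesIO()
        img.save(buffer, format='PNG')
    return buffer.getvalue()
 ```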
 4. **Thumbnail upload to EspoCRM**:
 ```python
 # EspoCRM supports preview images via the Attachment API
 async def upload_thumbnail_to_espocrm(
    document_id: str, 
    thumbnail_bytes: bytes,
    espocrm_api
 ):
    # Create Attachment
    attachment_data = {
        'name': 'preview.png',
        'type': 'image/png',
        'role': 'Inline Attachment',
        'parentType': 'Document',
        'parentId': document_id,
        'field': 'previewImage'  # Custom field?
    }
    # Upload via EspoCRM Attachment API
    # POST /api/v1/Attachment mit multipart/form-data
    # TODO: espocrm.py needs an upload_attachment() method
 ```
 **Open questions:**
 - Which field on the EspoCRM Document entity holds the preview? `previewImage`? `thumbnail`?
 - Thumbnail size? (recommended: 200x280 or 300x400)
 - Format: PNG or JPEG?
 ## ❌ Not Yet Implemented
 ### 5. xAI Service (`xai_service.py`)
 **Requirements:**
 - File upload to xAI (based on `test_xai_collections_api.py`)
 - Add file to collections
 - Remove file from collections
 - File download from EspoCRM
 **Reference code available:**
 - `/opt/motia-iii/bitbylaw/test_xai_collections_api.py` (630 lines, all xAI operations tested)
 **Implementation plan:**
 ```python
 import os
 class XAIService:
    def __init__(self, context=None):
        self.management_key = os.getenv('XAI_MANAGEMENT_KEY')
        self.api_key = os.getenv('XAI_API_KEY')
        self.context = context
    async def upload_file(self, file_content: bytes, filename: str) -> str:
        """Upload a file to xAI and return its file_id"""
        # Multipart/form-data upload
        # POST https://api.x.ai/v1/files
        pass
    async def add_to_collection(self, collection_id: str, file_id: str):
        """Add a file to a collection"""
        # POST https://management-api.x.ai/v1/collections/{collection_id}/documents/{file_id}
        pass
    async def remove_from_collection(self, collection_id: str, file_id: str):
        """Remove a file from a collection"""
        # DELETE https://management-api.x.ai/v1/collections/{collection_id}/documents/{file_id}
        pass
    async def download_from_espocrm(self, attachment_id: str) -> bytes:
        """Download a file from an EspoCRM attachment"""
        # GET https://crm.bitbylaw.com/api/v1/Attachment/file/{attachment_id}
        pass
 ```
 ## 📋 Integration Checklist
 ### Complete upload flow:
 1. ✅ Receive webhook → emit event
 2. ✅ Event handler: acquire lock
 3. ✅ Load document from EspoCRM
 4. ✅ Decide whether a sync is needed (file status, hash check, collections)
 5. ⏳ Download file from EspoCRM
 6. ⏳ Compute hash (MD5/SHA)
 7. ⏳ Generate thumbnail
 8. ❌ Upload to xAI (if new or hash changed)
 9. ❌ Add to collections
 10. ⏳ Update EspoCRM metadata (xaiFileId, xaiCollections, xaiSyncedHash, thumbnail)
 11. ✅ Release lock
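Because the lock is acquired in step 2 and released in step 11, any failure in between must not leave it held. The sketch below illustrates that guarantee with `try`/`finally`. `run_locked` and the callables passed to it are hypothetical stand-ins, not the Redis locking code used by the event handler.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_locked(
    acquire: Callable[[], Awaitable[bool]],
    release: Callable[[], Awaitable[None]],
    work: Callable[[], Awaitable[str]],
) -> str:
    """Run `work` only while the lock is held and always release it."""
    if not await acquire():
        return "skipped"  # another worker is already syncing this document
    try:
        return await work()  # steps 3-10 of the flow would run here
    finally:
        await release()  # step 11 runs even if a step above raised


async def demo() -> List[str]:
    events: List[str] = []

    async def acquire() -> bool:
        events.append("acquire")
        return True

    async def release() -> None:
        events.append("release")

    async def failing_work() -> str:
        events.append("work")
        raise RuntimeError("simulated upload failure")

    try:
        await run_locked(acquire, release, failing_work)
    except RuntimeError:
        events.append("error")
    return events
```

Calling `asyncio.run(demo())` records the release before the error reaches the caller, which is the ordering the flow depends on.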
 ### File statuses in EspoCRM:
 - **"Neu"** (new): completely new file → xAI upload + collection add
 - **"Geändert"** (changed): file content changed → xAI re-upload + collection update
 - **"Gesynct"** (synced): successfully synced, no changes
 - **"Fehler"** (error): sync failed (with error message)
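The status values form a small state machine. A minimal sketch of it follows, using the enum values exactly as stored in EspoCRM. `requires_sync` and `status_after_sync` are hypothetical helpers. Whether "Fehler" should trigger an automatic retry is not stated here, so the sketch treats it as not requiring a sync.

```python
from typing import Optional

# Statuses that the list above marks as needing an xAI upload
SYNC_REQUIRED_STATUSES = {"Neu", "Geändert"}


def requires_sync(datei_status: Optional[str]) -> bool:
    """True when the file status alone demands a sync."""
    return datei_status in SYNC_REQUIRED_STATUSES


def status_after_sync(success: bool) -> str:
    """Status to write back to EspoCRM once a sync attempt has finished."""
    return "Gesynct" if success else "Fehler"
```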
 ### EspoCRM Custom Fields:
 **Required on the Document entity:**
 - `dateiStatus` (Enum): "Neu", "Geändert", "Gesynct", "Fehler"
 - `md5` (String): MD5 hash of the file
 - `sha` (String): SHA hash of the file
 - `xaiFileId` (String): xAI File ID
 - `xaiCollections` (Array): JSON array of collection IDs
 - `xaiSyncedHash` (String): hash at the last successful sync
 - `xaiSyncStatus` (Enum): "syncing", "synced", "failed"
 - `xaiSyncError` (Text): error message when a sync fails
 - `previewImage` (Attachment?): preview image
 ## 🚀 Next Steps
 **Priority 1: xAI service**
 - Extract the code from `test_xai_collections_api.py`
 - Move it into `services/xai_service.py`
 - Implement the EspoCRM download function
 **Priority 2: Thumbnail generator**
 - Install dependencies
 - Implement PDF thumbnails
 - Extend the EspoCRM client with an upload method
 **Priority 3: Integration test**
 - Create a document in EspoCRM
 - Set the file status to "Neu"
 - Trigger the webhook
 - Analyze the logs
 ## 📚 References
 - **xAI API Tests**: `/opt/motia-iii/bitbylaw/test_xai_collections_api.py`
 - **EspoCRM API**: `services/espocrm.py`
 - **Beteiligte Sync** (reference implementation): `steps/vmh/beteiligte_sync_event_step.py`
--- a/services/document_sync_utils.py
+++ b/services/document_sync_utils.py
@@ -162,6 +162,11 @@ class DocumentSync:
        """
        Decides whether a document has to be synchronized to xAI
        Checks:
        1. File status field ("Neu", "Geändert")
        2. Hash values for change detection
        3. Related entities with xAI collections
        Args:
            document: Full Document entity from EspoCRM
@@ -178,9 +183,38 @@ class DocumentSync:
        xai_file_id = document.get('xaiFileId')
        xai_collections = document.get('xaiCollections') or []
        # File status and hash fields
        datei_status = document.get('dateiStatus') or document.get('fileStatus')
        file_md5 = document.get('md5') or document.get('fileMd5')
        file_sha = document.get('sha') or document.get('fileSha')
        xai_synced_hash = document.get('xaiSyncedHash')  # hash at the last xAI sync
        self._log(f"📋 Document Analysis: {doc_name} (ID: {doc_id})")
        self._log(f"   xaiFileId: {xai_file_id or 'N/A'}")
        self._log(f"   xaiCollections: {xai_collections}")
        self._log(f"   Datei-Status: {datei_status or 'N/A'}")
        self._log(f"   MD5: {file_md5[:16] if file_md5 else 'N/A'}...")
        self._log(f"   SHA: {file_sha[:16] if file_sha else 'N/A'}...")
        self._log(f"   xaiSyncedHash: {xai_synced_hash[:16] if xai_synced_hash else 'N/A'}...")
        # ═══════════════════════════════════════════════════════════════
        # PRIORITY CHECK: file status "Neu" or "Geändert"
        # ═══════════════════════════════════════════════════════════════
        if datei_status in ['Neu', 'Geändert', 'neu', 'geändert', 'New', 'Changed']:
            self._log(f"🆕 Datei-Status: '{datei_status}' → xAI-Sync ERFORDERLICH")
            # Get collections (either the existing ones or those from related entities)
            if xai_collections:
                target_collections = xai_collections
            else:
                target_collections = await self._get_required_collections_from_relations(doc_id)
            if target_collections:
                return (True, target_collections, f"Datei-Status: {datei_status}")
            else:
                # File is new/changed but no collections were found
                self._log(f"⚠️  Datei-Status '{datei_status}' aber keine Collections gefunden - überspringe Sync")
                return (False, [], f"Datei-Status: {datei_status}, aber keine Collections")
        # ═══════════════════════════════════════════════════════════════
        # CASE 1: document is already in xAI AND collections are set
@@ -188,8 +222,19 @@ class DocumentSync:
        if xai_file_id and xai_collections:
            self._log(f"✅ Document bereits in xAI gesynct mit {len(xai_collections)} Collection(s)")
-            # Check whether an update is needed (e.g. if the file itself changed)
+            # Check whether the file content changed (hash comparison)
-            # TODO: Implement file hash comparison for update detection
+            current_hash = file_md5 or file_sha
            if current_hash and xai_synced_hash:
                if current_hash != xai_synced_hash:
                    self._log(f"🔄 Hash-Änderung erkannt! RESYNC erforderlich")
                    self._log(f"   Alt: {xai_synced_hash[:16]}...")
                    self._log(f"   Neu: {current_hash[:16]}...")
                    return (True, xai_collections, "File-Inhalt geändert (Hash-Mismatch)")
                else:
                    self._log(f"✅ Hash identisch - keine Änderung")
            else:
                self._log(f"⚠️  Keine Hash-Werte verfügbar für Vergleich")
            return (False, xai_collections, "Bereits gesynct, keine Änderung erkannt")
@@ -316,3 +361,97 @@ class DocumentSync:
        except Exception as e:
            self._log(f"❌ Fehler beim Laden von Download-Info: {e}", level='error')
            return None
    async def generate_thumbnail(self, file_path: str, mime_type: str) -> Optional[bytes]:
        """
        Generates a preview image (thumbnail) for a document
        Supports:
        - PDF: first page as an image
        - DOCX/DOC: conversion to PDF, then first page
        - Images: resize to thumbnail size
        - Other: placeholder icon based on MIME type
        Args:
            file_path: Path to the file (local or download URL)
            mime_type: MIME type of the document
        Returns:
            Thumbnail as bytes (PNG/JPEG) or None on error
        """
        self._log(f"🖼️  Thumbnail-Generierung für {mime_type}")
        # TODO: Implementation
        # 
        # Required libraries:
        # - pdf2image (PDF → image)
        # - python-docx + docx2pdf (DOCX → PDF → image)
        # - Pillow (PIL) for image processing
        # - poppler-utils (system dependency of pdf2image)
        #
        # Implementation steps:
        #
        # 1. PDF handling:
        #    from pdf2image import convert_from_path
        #    images = convert_from_path(file_path, first_page=1, last_page=1)
        #    thumbnail = images[0]
        #    thumbnail.thumbnail((200, 280))  # in place, keeps the aspect ratio
        #    return thumbnail_to_bytes(thumbnail)
        #
        # 2. DOCX handling:
        #    - Convert to a temporary PDF
        #    - Then handle it like a PDF
        #
        # 3. Image handling:
        #    from PIL import Image
        #    img = Image.open(file_path)
        #    img.thumbnail((200, 280))
        #    return image_to_bytes(img)
        #
        # 4. Fallback:
        #    - Generic file-type icon based on MIME type
        self._log(f"⚠️  Thumbnail-Generierung noch nicht implementiert", level='warn')
        return None
    async def update_sync_metadata(
        self,
        document_id: str,
        xai_file_id: str,
        collection_ids: List[str],
        file_hash: Optional[str] = None,
        thumbnail_data: Optional[bytes] = None
    ) -> None:
        """
        Updates document metadata after a successful xAI sync
        Args:
            document_id: EspoCRM Document ID
            xai_file_id: xAI File ID
            collection_ids: List of xAI collection IDs
            file_hash: MD5/SHA hash of the synced file
            thumbnail_data: Preview image as bytes
        """
        try:
            update_data = {
                'xaiFileId': xai_file_id,
                'xaiCollections': collection_ids,
                'dateiStatus': 'Gesynct',  # reset status
            }
            # Store the hash for future change detection
            if file_hash:
                update_data['xaiSyncedHash'] = file_hash
            # Upload thumbnail as an attachment (if present)
            if thumbnail_data:
                # TODO: Implement thumbnail upload to EspoCRM
                # EspoCRM supports preview images for Documents
                # Must be uploaded as a separate attachment
                self._log(f"⚠️  Thumbnail-Upload noch nicht implementiert", level='warn')
            await self.espocrm.update_entity('Document', document_id, update_data)
            self._log(f"✅ Sync-Metadaten aktualisiert für Document {document_id}")
        except Exception as e:
            self._log(f"❌ Fehler beim Update von Sync-Metadaten: {e}", level='error')
            raise