The HP Q3 2023 Threat Report [2] highlights that 80% of malware is delivered via email, with 12% bypassing detection technologies to reach endpoints. The 2023 Verizon Data Breach Investigations Report also indicates that 35% of ransomware infections originated from email. Two primary factors contribute to evasion. First, the volume and cost of sandbox scanning lead to selective scanning and inadvertent bypasses. Second, detection technologies such as signature-based methods, sandboxing [1], and machine learning rely on the final malicious payload for decision-making, yet evasive multi-stage malware and phishing URLs often lack a malicious payload at the time these technologies analyze them. Additionally, generative AI tools like FraudGPT and WormGPT facilitate the creation of new malicious payloads and phishing pages, further enabling malware to evade defenses and reach endpoints.
To address the challenge of detecting evasive malware and malicious URLs without requiring the final malicious payload, we will share the detailed design of a Neural Analysis and Correlation Engine (NACE). NACE is designed to detect malicious attachments by understanding the semantics of the email and using them as features, instead of relying on the final malicious payload for its decision-making. It takes a layered approach, employing supervised and unsupervised AI models built on transformer-based architectures to derive the deeper meaning embedded within the email's body, its subject, and the text in the attachment.
We will first dive into the details of the semantics commonly used by threat actors to deliver malicious attachments, which lays the foundation of our approach. These details were derived from the analysis of a dataset of malicious emails. The text from the body of the email was extracted to create embeddings. UMAP aided in dimensionality reduction, and clusters were generated based on their density in the high-dimensional embedding space. These clusters represent different types of semantics employed by threat actors to deliver malicious attachments.
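As a rough illustration of the clustering step described above, the following minimal sketch groups email bodies by density. It makes several assumptions that are not in the abstract: a toy bag-of-words vector stands in for the transformer embeddings, the UMAP reduction step is omitted, and DBSCAN is used as the density-based algorithm (the abstract does not name one). The sample emails, `eps`, and `min_pts` are illustrative only.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a learned embedding: L2-normalised word counts."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine_distance(a, b):
    """1 - cosine similarity for two normalised sparse vectors."""
    return 1.0 - sum(v * b.get(word, 0.0) for word, v in a.items())

def dbscan(vectors, eps, min_pts):
    """Minimal DBSCAN. Returns one label per vector; -1 marks noise."""
    n = len(vectors)
    labels = [None] * n

    def neighbours(i):
        # Includes the point itself, as in the standard formulation.
        return [j for j in range(n)
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:
                queue.extend(nearby)  # j is a core point, keep expanding
    return labels

# Hypothetical sample bodies, invented for this sketch.
emails = [
    "Please find the attached invoice for payment",
    "Attached invoice, please process payment today",
    "Invoice attached for your payment",
    "Your voicemail message is attached",
    "New voicemail message attached, listen now",
    "Voicemail attached: new message",
    "Lunch on Friday?",
]
labels = dbscan([embed(e) for e in emails], eps=0.5, min_pts=3)
for label, body in zip(labels, emails):
    print(label, body)
```

In this toy data the invoice-themed and voicemail-themed messages fall into separate clusters and the unrelated message is left as noise, which mirrors the idea of each cluster representing one delivery theme.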
In the presentation we will share the details of our approach, in which every incoming email undergoes zero-shot semantic analysis and similarity analysis using an LLM to determine whether it contains semantics typically used by threat actors to deliver malicious attachments. The email's body is further analyzed for secondary semantics, including tone, sentiment, and other nuanced elements. Once the semantics are identified, hierarchical and phrase topic modeling is applied to uncover the relationships between the various topics.
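The similarity analysis mentioned above can be pictured as comparing an incoming email against one representative text per known delivery theme. The sketch below is only an approximation under stated assumptions: word-count vectors replace LLM embeddings, and the prototype phrases, theme names, and the 0.3 threshold are hypothetical values chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical prototypes, one per semantic theme (not taken from the talk).
PROTOTYPES = {
    "invoice_payment": "please find attached invoice for payment",
    "voicemail_notice": "new voicemail message attached listen now",
}

def embed(text):
    """Toy stand-in for an LLM embedding: L2-normalised word counts."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a, b):
    """Cosine similarity of two normalised sparse vectors."""
    return sum(v * b.get(word, 0.0) for word, v in a.items())

def match_semantics(body, threshold=0.3):
    """Return (theme, score) for the closest prototype.

    The theme is None when the best score is below the threshold,
    meaning no known delivery theme was recognised.
    """
    vec = embed(body)
    theme, score = max(
        ((name, cosine(vec, embed(proto))) for name, proto in PROTOTYPES.items()),
        key=lambda pair: pair[1],
    )
    return (theme if score >= threshold else None, score)

print(match_semantics("Kindly process the attached invoice, payment is overdue"))
print(match_semantics("Lunch on Friday?"))
```

A production system would replace `embed` with a learned model so that paraphrases with no shared words still score as similar; the thresholding logic would stay the same.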
Primary and secondary semantics from the email, along with the results of hierarchical and phrase topic modeling, deep file parsing of the attachments, and the email headers, are sent to the expert system. The contextual relationships between these features are used to reach a malicious or benign verdict for the attachment without needing the malicious payload. Identifying malicious content without depending on the final payload is crucial for any detection technology.
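To make the idea of a verdict derived from contextual relationships more concrete, here is a small rule-based sketch. It is not the NACE expert system: every field name, rule, and threshold below is a hypothetical placeholder, chosen only to show how the same attachment trait can lead to different verdicts depending on the email's semantics.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class EmailFeatures:
    # All fields are hypothetical; the abstract does not enumerate real features.
    primary_semantic: Optional[str]  # matched delivery theme, or None
    urgency: float                   # secondary semantic, 0.0 (calm) to 1.0 (urgent)
    attachment_has_script: bool      # from deep file parsing
    attachment_has_redirect: bool    # from deep file parsing
    reply_to_matches_sender: bool    # from email headers

def verdict(f: EmailFeatures) -> Tuple[str, List[str]]:
    """Combine features with simple illustrative rules; no payload is inspected."""
    reasons = []
    lure = f.primary_semantic is not None

    # A delivery lure plus active content in the attachment is suspicious,
    # while active content alone (e.g. an HTML newsletter) is not.
    if lure and f.attachment_has_script:
        reasons.append("delivery lure paired with script in attachment")
    if lure and f.attachment_has_redirect and f.urgency >= 0.7:
        reasons.append("urgent lure paired with redirecting attachment")
    if lure and not f.reply_to_matches_sender and f.urgency >= 0.7:
        reasons.append("urgent lure with mismatched Reply-To header")

    return ("malicious" if reasons else "benign", reasons)

smuggled = EmailFeatures("invoice_payment", 0.9, True, False, True)
newsletter = EmailFeatures(None, 0.1, True, False, True)
print(verdict(smuggled))
print(verdict(newsletter))
```

Both sample emails carry an attachment containing script, but only the one whose body matches a delivery theme is flagged, which is the kind of feature correlation the paragraph above describes.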
Our presentation will show how LLMs can effectively detect evasive malicious attachments without depending on analysis of the malicious payload, which typically occurs in the later stages of attachment analysis. Our approach is exemplified by our success in defending against real-world threats in actual production traffic, including HTML smuggling campaigns, obfuscated SVG files, phishing links behind CDNs and CAPTCHAs, downloaders, and redirectors.
The presentation will conclude with results observed in production traffic.
References: