🥷 [S2] Challenge 24

September 12, 2023 · 4 min read

Just KNIME It, Season 2 / Challenge 24 reference

Challenge question

Challenge 24: Fraudulent Email Address Detection

Level: Medium

Description: In this challenge you will take the role of a cybersecurity analyst, and see if you can identify emails that are trying to pass as legitimate when they are in fact malicious. You notice that bad-actor emails try to trick the receiver by mimicking major email domains. For instance, you notice that @gnail, @gmial, etc. are trying to pass as @gmail. You then decide to get a count of all the domains: those that have the lowest count have a higher probability of being fraudulent. You must also check whether those low-count email domains are trying to pose as the major email domains or not. Your answer should not mark @unique.com as fraudulent. Note: Try not to hard-code any variables in your workflow, but instead use mean or median for instance. Hint: Checking for string similarity might help.

Author: Victor Palacios

Dataset: Domains Data in the KNIME Hub

General idea

In this problem, our goal is to identify potentially malicious emails. This entails detecting emails with suspicious domain names or other characteristics. Here are a few potential solutions:

Use Classification Models: Train a supervised machine learning model to identify fraudulent emails based on example input and outputs. This model can consider characteristics such as the domain name and the local part of the email (the part before "@"). However, this method requires data annotation.
Domain Validation: Implement a system that validates domain names. Check for misspelled ones like "gmial.com," "gnail.com," and "somesiet.co," as these are likely to be malicious. Legitimate emails from major providers, such as gmail.com, should always have correct spelling. This validation method is suitable for our current dataset.
Check Public Blacklists: Utilize public blacklists of known malicious email domains and IP addresses available online. Verify incoming email addresses against these blacklists. If they appear, there is a high likelihood of a phishing attempt. This validation method seems effective, but I found that most lists available online are paid services, and I haven't come across any free ones.
Frequency Analysis: Malicious email accounts often send out a significantly higher volume of communications compared to normal users. By tracking the number of emails originating from a specific address, you can potentially identify malicious accounts. However, this method requires a large volume of emails to be effective.
URL Analysis: If these emails contain links, you can also analyze the URLs to determine if they lead to reputable sites. This method requires access to the email body for analysis.
AI and Machine Learning: In addition to supervised models, you could employ natural language processing (NLP) to analyze the body content of emails for signs of typical phishing language. Similar to the previous method, this approach also requires access to the email body for analysis.
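The "Domain Validation" idea, combined with the challenge's hint about counts and string similarity, can be sketched outside KNIME as follows. This is only a minimal illustration, not the workflow used in this post: the helper names `levenshtein` and `findLookalikes` are my own, the sample addresses are copied from the result table further down, and the edit-distance cutoff of 2 is an assumption (the challenge asks you to avoid hard-coded values, so treat it as a placeholder).

```javascript
// Sample addresses copied from the result table in this post.
const emails = [
  "jts@gmail.com", "xyz@gmail.com", "detection@gmail.com", "manyemails@gmail.com",
  "fraudster@gmali.com", "xyz@gmial.com", "deception@gnail.com",
  "fakeemail@somesiet.co",
  "knimer@somesite.co", "notfraud@somesite.co", "fakester@somesite.co",
  "notfraud@unique.com",
];

// Classic dynamic-programming Levenshtein (edit) distance between two strings.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: count domains, split them at the mean count, and flag
// low-count domains that sit within `maxDistance` edits of a high-count one.
// `maxDistance = 2` is an assumed cutoff, not something the challenge defines.
function findLookalikes(addresses, maxDistance = 2) {
  const counts = new Map();
  for (const address of addresses) {
    const domain = address.split("@")[1].toLowerCase();
    counts.set(domain, (counts.get(domain) || 0) + 1);
  }
  const mean = addresses.length / counts.size;
  const domains = [...counts.keys()];
  const major = domains.filter((d) => counts.get(d) >= mean);
  const rare = domains.filter((d) => counts.get(d) < mean);
  return rare
    .map((domain) => ({
      domain,
      mimics: major.find((m) => levenshtein(domain, m) <= maxDistance),
    }))
    .filter((entry) => entry.mimics !== undefined);
}

console.log(findLookalikes(emails));
```

On this sample, the sketch flags gmali.com, gmial.com, gnail.com and somesiet.co, and leaves unique.com alone because it is far from every high-count domain.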

but...

Here we are, using none of the methods above and relying on a large language model instead.

When employing these large language models, two points must be acknowledged. Firstly, it is important to recognize that they may occasionally produce inaccurate or misleading information, referred to as "hallucinations." Secondly, formatting the results generated by these models can be challenging.

The first issue can only be handled by human inspection, an approach known as "human in the loop".

To address the second concern, I used OpenAI's function calling API.

We already used the OpenAI API in a previous challenge. This time, instead of the earlier method, I read the API key from an environment variable. I built the request body in the "Variable Expressions" node, although it still feels somewhat cumbersome.

a = {
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are a cybersecurity expert. You can easily distinguish between suspected scam emails, and I'm going to give you some email domains to help me determine whether this may be a scam email domain and why you made this determination. Here are the domains of the email:"
    },
    {
      "role": "user",
      "content": variable("domain")
    }
  ],
  "functions": [
    {
      "name": "distinguish_suspected_scam_email_domain",
      "description": "distinguish it is a scam email domain or not, and why you made this determination",
      "parameters": {
        "type": "object",
        "properties": {
          "check_results": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "checked_domain": {
                  "type": "string",
                  "description": "current checked domain"
                },
                "scam_email_domain_or_not": {
                  "type": "string",
                  "description": "distinguish it is a scam email domain or not, or not sure",
                  "enum": [
                    "yes",
                    "no",
                    "not sure"
                  ]
                },
                "reason": {
                  "type": "string",
                  "description": "why you made this determination"
                }
              },
              "required": [
                "scam_email_domain_or_not",
                "reason"
              ]
            }
          }
        },
        "required": ["check_results"]
      }
    }
  ]
}
JSON.stringify(a)
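For readers who want to try the same request outside KNIME, here is a rough sketch of how such a body could be sent. It assumes the standard chat completions endpoint and Bearer-token header; `buildChatRequest`, the shortened `body`, and the `OPENAI_API_KEY` variable name are my own placeholders, not part of the workflow.

```javascript
// Shortened stand-in for the request object `a` built above (assumption:
// in practice you would pass the full object, including "functions").
const body = {
  model: "gpt-3.5-turbo",
  messages: [
    { role: "system", content: "You are a cybersecurity expert." },
    { role: "user", content: "gmail.com, gmial.com, unique.com" },
  ],
};

// Hypothetical helper: turn a request body and an API key into fetch() arguments.
function buildChatRequest(requestBody, apiKey) {
  if (!apiKey) {
    throw new Error("API key is missing (expected in an environment variable)");
  }
  return {
    url: "https://api.openai.com/v1/chat/completions",
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify(requestBody),
    },
  };
}

// Usage (needs Node 18+ for the global fetch, and a real key):
// const { url, options } = buildChatRequest(body, process.env.OPENAI_API_KEY);
// const response = await (await fetch(url, options)).json();
```

Reading the key at call time, as in the commented usage lines, keeps it out of anything that gets written to disk.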

The rest of the workflow is straightforward. Keep in mind that running this approach at a larger scale will incur API costs.
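One detail worth illustrating is how the structured answer comes back: with function calling, the model's reply carries the arguments as a JSON string under `choices[0].message.function_call.arguments`, which still has to be parsed. The sketch below uses a hand-written `sampleResponse` (not real API output, with rows borrowed from the result table) and a hypothetical `extractCheckResults` helper.

```javascript
// Hand-written sample shaped like a function-calling reply; not real API output.
const sampleResponse = {
  choices: [
    {
      message: {
        role: "assistant",
        content: null,
        function_call: {
          name: "distinguish_suspected_scam_email_domain",
          arguments: JSON.stringify({
            check_results: [
              {
                checked_domain: "gmail.com",
                scam_email_domain_or_not: "no",
                reason: "Gmail is a reputable email service provider with strict security measures.",
              },
              {
                checked_domain: "gnail.com",
                scam_email_domain_or_not: "yes",
                reason: "This domain is a misspelling of 'gmail.com', indicating a potential scam.",
              },
              {
                checked_domain: "somesiet.co",
                scam_email_domain_or_not: "not sure",
                reason: "This domain is not widely recognized and could potentially be suspicious.",
              },
            ],
          }),
        },
      },
    },
  ],
};

// Hypothetical helper: pull the check_results array out of a reply,
// returning an empty array when the reply is missing or malformed.
function extractCheckResults(response) {
  const call = response?.choices?.[0]?.message?.function_call;
  if (!call) return []; // the model answered in plain text instead
  try {
    const args = JSON.parse(call.arguments);
    return Array.isArray(args.check_results) ? args.check_results : [];
  } catch {
    return []; // the arguments string was not valid JSON
  }
}

const checkResults = extractCheckResults(sampleResponse);
// Anything the model did not clearly mark as "no" goes to a person for review.
const needsReview = checkResults.filter((r) => r.scam_email_domain_or_not !== "no");
console.log(needsReview.map((r) => r.checked_domain));
```

Falling back to an empty array on malformed output is a design choice for the sketch; in a real workflow you would probably want to log or retry such rows rather than silently drop them.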

OK, what remains is the "human in the loop" part:

`domain`	`Concatenate(Email)`	`scam_email_domain_or_nots`	`reasons`
gmail.com	jts@gmail.com, xyz@gmail.com, detection@gmail.com, manyemails@gmail.com	no	Gmail is a reputable email service provider with strict security measures.
gmali.com	fraudster@gmali.com	yes	This domain is a misspelling of 'gmail.com', indicating a potential scam.
gmial.com	xyz@gmial.com	yes	This domain is a misspelling of 'gmail.com', indicating a potential scam.
gnail.com	deception@gnail.com	yes	This domain is a misspelling of 'gmail.com', indicating a potential scam.
somesiet.co	fakeemail@somesiet.co	not sure	This domain is not widely recognized and could potentially be suspicious. Further investigation is needed.
somesite.co	knimer@somesite.co, notfraud@somesite.co, fakester@somesite.co	no	This domain appears to be a valid website domain and does not raise immediate suspicion.
unique.com	notfraud@unique.com	no	This domain appears to be a valid website domain and does not raise immediate suspicion.

Any thoughts?

If we retrieve the key from an environment variable, its value is saved together with the workflow, which can be a problem, for example when the workflow is shared.
Using the "Variable Expressions" node to build the body of an HTTP POST request directly works very well.
Constructing the "function call" definition is a bit cumbersome; there may be a simpler approach.
KNIME v5's OpenAI extension may be a better fit for this scenario.
