Hero Light

Key Focus Area:

Current features

Prompt Injection Attack - Jailbreaks

Jailbreak measures the extent to which a model can be manipulated to produce content divergent from its intended purpose. We employ two methods of prompt injection:

  • Advanced: This method entails structuring a goal within a standardized prompt format, which then adapts iteratively based on the model’s responses.
  • SinglePrompt: Here, our goal is embedded within a story prompt to execute the attack.

Bias Assessment

Bias evaluates the extent to which a model generates content that may be biased or unfair. Our evaluation involves testing the model’s performance on prompts designed to uncover both implicit and explicit biases. By prompting the model to associate specific attributes with various demographic groups, such as religion, race, gender, and health, we delve into multiple subcategories within each demographic.

Malware Generation

Malware examines the model’s capability to produce malware or known malware signatures. We assess this by prompting the model to create malicious software from different perspectives, including payload construction and top-level architecture, across a spectrum of programming languages. Furthermore, we scrutinize the model’s proficiency in replicating established malicious signatures.

Toxicity Analysis

Toxicity measures the model’s tendency to generate harmful or toxic content, encompassing hate speech, threats, and other forms of harmful language. We evaluate the model’s toxicity generation ability by transforming prompts into sentence completion tasks, observing how it responds to partial sentences or prompts related to potentially harmful content. This includes generating content classifiable into threats, insults, profanity, and sexually explicit material, among others.

Upcoming Features:

Hallucination Evaluation

Hallucination assessment involves evaluating the model’s performance across various tasks such as summarization, question answering, and text generation. Through these assessments, we gauge the model’s capability to produce coherent and contextually relevant responses, ensuring proficiency across a wide array of language understanding and generation tasks.

Robustness Testing

Robustness assessment comprises two primary methods: perturbation and explanation tasks. Perturbation involves introducing changes or disturbances to input data to observe the model’s responses, while explanation tasks focus on evaluating the model’s ability to provide coherent justifications or reasoning for its predictions. These approaches allow for comprehensive evaluation of the model’s stability and reliability across diverse conditions and scenarios.

Privacy Attacks Exploration

Privacy attacks involve probing the model to identify potential leaks of private or confidential information from the training data. This process entails systematically testing the model’s responses to various inputs to uncover any patterns or behaviors that may inadvertently reveal sensitive information.