What is Guardrails?

Guardrails is a powerful tool that help with faster adoption of Large Language Models (LLMs) in your organization. It provides an API to that detects and prevents security and privacy challenges such as Prompt Injection, Toxicity, NSFW, PII, and more.

What are detectors and what are they used for?

Detectors are a set of rules that Guardrails uses to detect and prevent security and privacy issues. They are used to identify potential issues in your code and provide guidance on how to fix them.

Prompt Injection Detector

Recognizes patterns that may indicate an prompt injection attack or jailbreak attempt protecting the system from malicious inputs. The detector returns a binary value (0 for benign, 1 for attack) for indicating whether the input is a potential injection attempt and also provides confidence levels for the detection.

Toxicity Detector

The detector provides a detailed analysis of the text’s toxicity, indicating the presence and levels of general toxicity, obscene language, insults, threats, and identity hate. The detector returns a list of dictionaries, each containing the following information:

  • score: The score of the text on a scale of 0 to 100, with higher scores indicating a higher level of toxicity.
  • type: The type of toxicity detected, such as “obscene”, “insult”, “threat”, or “identity_hate”.

NSFW Detector

Analyses text for content that may be inappropriate for workplace environments, such as explicit language or adult themes, ensuring professionalism and compliance with workplace standards. The detector returns a binary value (0 for sfw, 1 for nsfw) indicating whether the text is not safe for work.

PII Detector

Detects the presence of Personally Identifiable Information (PII) within the text, helping to prevent the unintentional disclosure of sensitive data. The detector returns secrets, PII, IP addresses, and URLs.

Topic Detector

Analyzes text to determine if it aligns with the given specific topic or field of interest. You can provide the topic keyword as a string or a list of strings. The detector returns a binary value (0 for off-topic, 1 for on-topic) indicating whether the text aligns with the given topic.

Keyword Detector

This Detector is designed to scan text for specific words or phrases that are predefined as significant, sensitive, or requiring special attention, such as banned or proprietary terms. In the provided output, you can expect to see which keywords were detected, the count of each keyword’s occurrence, and a version of the text with the detected keywords redacted to maintain confidentiality or compliance.

Policy Violation Detector

This detector checks if the generated text adheres to specified policies or guidelines. It helps ensure that the content complies with predefined rules and standards. The detector returns a binary value (0 for compliant, 1 for violation) indicating whether the text violates the specified policies.

Bias Detector

This detector identifies and mitigates potential biases in the generated text. It helps ensure that the content is fair and unbiased across various dimensions such as gender, race, or age. The detector returns a binary value (0 for unbiased, 1 for biased) indicating whether the text exhibits bias.

This detector identifies potential leaks of copyrighted material or intellectual property in the generated text. It helps protect against unintended use or distribution of protected content. The detector returns copyright_ip_similarity binary value (0 for not similar, 1 for similar) indicating whether the text contains copyrighted or proprietary information.

System Prompt Similarity Detector

This detector identifies potential leaks of system prompts or AI model information in the generated text. It helps maintain the integrity of the AI system by preventing unintended disclosures of internal prompts or model details. The detector returns system_prompt_similarity binary value (0 for not similar, 1 for similar) indicating whether the text contains system prompts or model information.

PII anonymization and de-anonymization

Anonymizes the request text and de-anonymizes the response text.

Hallucinations Detector

This detector takes a request text and the corresponding response text to check for potential Hallucination in the output of a LLM response. The detector outputs a boolean value indicating whether the response text contains Hallucination. The detector returns a boolean value (false for no hallucination, true for hallucination) indicating whether the text contains hallucination.

Adherence Detector

Analyses the adherence of a generated answer to a given context, ensuring that the response aligns with the provided information. The detector returns an adherence score indicating how well the answer adheres to the context. This detecor returns adherence_score binary value (0 for not adherent, 1 for adherent) indicating whether the text is adherent to the context.

Relevancy Detector

Analyses the relevance of a generated answer to a given question, ensuring that the response appropriately addresses the query. The detector returns a relevancy score indicating how well the answer relates to the question. This detector returns relevancy_score binary value (0 for not relevant, 1 for relevant) indicating whether the text is relevant to the question.