Building a PII Data Extraction and Scrubbing System with BAML

In this tutorial, you’ll learn how to create a PII (Personally Identifiable Information) data extraction and scrubbing system using BAML and OpenAI’s gpt-4o-mini model. By the end, you’ll have a working system that can identify, extract, and scrub various types of PII from text documents.

Prerequisites

  • Basic understanding of BAML syntax
  • Access to OpenAI API (you’ll need an API key)

Step 1: Define the Data Schema

First, let’s define what our PII data structure should look like. Create a new file called pii_extractor.baml and add the following schema:

pii_extractor.baml
class PIIData {
  index int
  dataType string
  value string
}

class PIIExtraction {
  privateData PIIData[]
  containsSensitivePII bool @description("E.g. SSN")
}

This schema defines:

  • PIIData: A class representing a single piece of PII with its index, type, and value
  • PIIExtraction: A container class that holds an array of PII data items and a sensitive data flag
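
For orientation, here is roughly how the two classes look from the Python side. This is only a sketch: it uses standard-library dataclasses as stand-ins for the real types, which come from the generated baml_client.types module, and the sample values are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PIIData:
    index: int      # number the model assigns to the item
    dataType: str   # free-form label such as "name" or "email"
    value: str      # the PII text as found in the document


@dataclass
class PIIExtraction:
    privateData: List[PIIData] = field(default_factory=list)
    containsSensitivePII: bool = False  # e.g. True when an SSN is present


# Hypothetical result for a document containing one email address
example = PIIExtraction(
    privateData=[PIIData(index=1, dataType="email", value="john.doe@email.com")],
    containsSensitivePII=False,
)
print(example.privateData[0].dataType, example.privateData[0].value)
```

Note that dataType is a plain string, so the labels the model returns (for example "email" versus "email address") are not constrained by the schema.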

Step 2: Create the Extraction Function

Next, let’s create the function that uses an OpenAI model (gpt-4o-mini) to extract PII. Add this to your pii_extractor.baml file:

pii_extractor.baml
function ExtractPII(document: string) -> PIIExtraction {
  client "openai/gpt-4o-mini"
  prompt #"
    Extract all personally identifiable information (PII) from the given document. Look for items like:
    - Names
    - Email addresses
    - Phone numbers
    - Addresses
    - Social security numbers
    - Dates of birth
    - Any other personal data

    {{ ctx.output_format }}

    {{ _.role("user") }}

    {{ document }}
  "#
}

Let’s break down what this function does:

  • Takes a document input as a string
  • Uses the gpt-4o-mini model
  • Provides clear guidelines for PII extraction in the prompt
  • Returns a PIIExtraction object containing all found PII data

Step 3: Test the Extractor

To ensure our PII extractor works correctly, let’s add some test cases:

pii_extractor.baml
test BasicPIIExtraction {
  functions [ExtractPII]
  args {
    document #"
      John Doe was born on 01/02/1980.
      His email is john.doe@email.com and phone is 555-123-4567.
      He lives at 123 Main St, Springfield, IL 62704.
    "#
  }
}

test EmptyDocument {
  functions [ExtractPII]
  args {
    document "This document contains no PII data."
  }
}

You can run these tests in the BAML playground, or try the functions and tests online at https://www.promptfiddle.com/Pii-data-O4PmJ
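
As written, the BAML tests run the function but contain no assertions about the result. One property worth checking from Python is that every extracted value occurs verbatim in the source text, since scrubbing by string replacement depends on it. The following is a minimal sketch, assuming each item exposes dataType and value attributes as in the schema. The find_ungrounded helper is hypothetical, and SimpleNamespace stands in for the generated types.

```python
from types import SimpleNamespace
from typing import Iterable, List


def find_ungrounded(document: str, items: Iterable) -> List[str]:
    """Return extracted values that do not occur verbatim in the document."""
    return [item.value for item in items if item.value not in document]


document = "John Doe was born on 01/02/1980. His email is john.doe@email.com."

# Stand-in for result.privateData; the second value is deliberately reformatted
items = [
    SimpleNamespace(dataType="name", value="John Doe"),
    SimpleNamespace(dataType="date of birth", value="January 2, 1980"),
]

missing = find_ungrounded(document, items)
print(missing)  # ['January 2, 1980']
```

A non-empty list means the model paraphrased or normalized a value, which a plain string replacement would then fail to scrub.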

Step 4: Implementing PII Extraction and Scrubbing

Now you can use the PII extractor to both identify and scrub sensitive information from your documents:

pii_scrubber.py
from baml_client import b
from baml_client.types import PIIExtraction
from typing import Dict, Tuple

def scrub_document(text: str) -> Tuple[str, Dict[str, str]]:
    # Extract PII from the document
    result = b.ExtractPII(text)

    # Create a mapping of real values to scrubbed placeholders
    scrubbed_text = text
    pii_mapping = {}

    # Process each PII item and replace with a placeholder
    for pii_item in result.privateData:
        pii_type = pii_item.dataType.upper()
        placeholder = f"[{pii_type}_{pii_item.index}]"

        # Store the mapping for reference
        pii_mapping[placeholder] = pii_item.value

        # Replace the PII with the placeholder
        scrubbed_text = scrubbed_text.replace(pii_item.value, placeholder)

    return scrubbed_text, pii_mapping

def restore_document(scrubbed_text: str, pii_mapping: Dict[str, str]) -> str:
    """Restore the original text using the PII mapping."""
    restored_text = scrubbed_text
    for placeholder, original_value in pii_mapping.items():
        restored_text = restored_text.replace(placeholder, original_value)
    return restored_text

# Example usage
document = """
John Smith works at Tech Corp.
You can reach him at john.smith@techcorp.com
or call 555-0123 during business hours.
His employee ID is TC-12345.
"""

# Scrub the document
scrubbed_text, pii_mapping = scrub_document(document)

print("Original Document:")
print(document)
print("\nScrubbed Document:")
print(scrubbed_text)
print("\nPII Mapping:")
for placeholder, original in pii_mapping.items():
    print(f"{placeholder}: {original}")

# If needed, restore the original document
restored_text = restore_document(scrubbed_text, pii_mapping)
print("\nRestored Document:")
print(restored_text)

This implementation provides several key features:

  1. PII Detection: Uses BAML’s ExtractPII function to identify PII
  2. Data Scrubbing: Replaces PII with descriptive placeholders
  3. Mapping Preservation: Maintains a mapping of placeholders to original values
  4. Restoration Capability: Allows restoration of the original text when needed

Example output:

Original Document:

John Smith works at Tech Corp.
You can reach him at john.smith@techcorp.com
or call 555-0123 during business hours.
His employee ID is TC-12345.


Scrubbed Document:

[NAME_1] works at Tech Corp.
You can reach him at [EMAIL_2]
or call [PHONE_3] during business hours.
His employee ID is [EMPLOYEE ID_4].


PII Mapping:
[NAME_1]: John Smith
[EMAIL_2]: john.smith@techcorp.com
[PHONE_3]: 555-0123
[EMPLOYEE ID_4]: TC-12345

Restored Document:

John Smith works at Tech Corp.
You can reach him at john.smith@techcorp.com
or call 555-0123 during business hours.
His employee ID is TC-12345.
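
One caveat with the str.replace loop in scrub_document: if the model returns overlapping values, such as both "John" and "John Smith", the order of replacement changes the result. The sketch below shows one way to handle this, by replacing longer values first and skipping empty ones. Here scrub_items is a hypothetical variant of that loop which takes the extracted items directly, and SimpleNamespace stands in for the generated PIIData type.

```python
from types import SimpleNamespace
from typing import Dict, Iterable, Tuple


def scrub_items(text: str, items: Iterable) -> Tuple[str, Dict[str, str]]:
    """Replace each item's value with a placeholder, longest values first."""
    mapping: Dict[str, str] = {}
    # Longest first, so "John Smith" is replaced before the shorter "John"
    for item in sorted(items, key=lambda i: len(i.value), reverse=True):
        if not item.value:
            continue  # replacing "" would insert the placeholder everywhere
        placeholder = f"[{item.dataType.upper()}_{item.index}]"
        mapping[placeholder] = item.value
        text = text.replace(item.value, placeholder)
    return text, mapping


items = [
    SimpleNamespace(index=1, dataType="first name", value="John"),
    SimpleNamespace(index=2, dataType="name", value="John Smith"),
]
scrubbed, mapping = scrub_items("John Smith called. John will call back.", items)
print(scrubbed)  # [NAME_2] called. [FIRST NAME_1] will call back.
```

Processing the items in their original order would instead turn "John Smith" into "[FIRST NAME_1] Smith" and leave the surname in the scrubbed text.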

Next Steps

Now that you have a working PII extractor, you can:

  • Add more specific PII types to look for
  • Implement validation for extracted PII (e.g., email format checking)
  • Create a more sophisticated prompt to handle edge cases
  • Add error handling for malformed documents
  • Integrate with your data privacy compliance system
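
As an example of the validation step, the sketch below flags extracted values that fail a simple email format check. The regular expression is deliberately loose and is an assumption for illustration, not a full RFC 5322 validator. The is_plausible_email helper is hypothetical.

```python
import re

# Loose pattern: something@something.tld with no whitespace. Illustrative only.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_plausible_email(value: str) -> bool:
    """Return True if the whole value looks like an email address."""
    return EMAIL_PATTERN.fullmatch(value) is not None


for candidate in ["john.smith@techcorp.com", "john.smith at techcorp"]:
    print(candidate, is_plausible_email(candidate))
```

A check like this could run over the items whose dataType indicates an email before they are added to the mapping, so that malformed values are logged or reviewed instead of silently replaced.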

Enhanced Security: Using Local Models

For organizations handling sensitive data, using cloud-based LLMs like OpenAI’s GPT models might not be suitable due to data privacy concerns. BAML supports using local models, which keeps all PII processing within your infrastructure.

In this example, we’re going to use an Ollama model. For more details on how to use Ollama with BAML, see the Ollama page in the BAML documentation.

  1. First, define your local model client in pii_extractor.baml:
// Please ensure you've got Ollama set up with llama3.1 installed
//
// ollama pull llama3.1
// ollama run llama3.1
client<llm> SecureLocalLLM {
  provider "openai-generic"
  options {
    base_url "http://localhost:11434/v1"
    model "llama3.1:latest"
    temperature 0
    default_role "user"
  }
}
  2. Update the ExtractPII function to use your local model:
function ExtractPII(document: string) -> PIIExtraction {
  // use a local model instead of openai
  client SecureLocalLLM
  prompt #"
    Extract all personally identifiable information (PII) from the given document. Look for items like:
    - Names
    - Email addresses
    - Phone numbers
    - Addresses
    - Social security numbers
    - Dates of birth
    - Any other personal data

    {{ ctx.output_format }}

    {{ _.role("user") }}

    {{ document }}
  "#
}