PII, PHI, and PCI data overlap in ways that multiply compliance risk. Learn the differences, 2026 penalty structures, and how tokenization protects all three.

Personally identifiable information (PII), protected health information (PHI), and Payment Card Industry (PCI) data are the three categories of sensitive data that drive the majority of enterprise compliance obligations.
These categories overlap – e.g., a patient paying a hospital co-pay with a credit card creates a record that is simultaneously PII, PHI, and PCI data.
The average healthcare data breach costs $7.42 million (IBM, 2025), HIPAA penalties reach $2.19 million per violation category (HHS, 2026), and PCI non-compliance fines escalate to $100,000 per month (PCI DSS Guide).
Understanding the differences – and where they converge – is the first step toward building a unified data protection strategy.
Personally identifiable information is any data that can be used to identify or trace an individual, either alone or when combined with other information.
The National Institute of Standards and Technology (NIST) SP 800-122 defines PII broadly, and that breadth makes it the foundational category for data classification across all regulatory frameworks.
PII falls into two groups. Direct identifiers (e.g., Social Security numbers, passport numbers, driver's license numbers, and biometric data) can identify someone on their own.
Quasi-identifiers (e.g., ZIP code, date of birth, gender, or job title) cannot identify a person on their own but can re-identify someone when combined. Research by Carnegie Mellon's Latanya Sweeney documented this risk, showing that 87% of the U.S. population can be uniquely identified by ZIP code, date of birth, and gender alone. Every data discovery program must account for both types.
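To make the quasi-identifier risk concrete, here is a minimal sketch that measures how many rows in a dataset are unique on ZIP code, date of birth, and gender. The sample records and field names are hypothetical; a real check would run against your own schema.

```python
from collections import Counter

# Hypothetical sample records; field names are illustrative assumptions.
records = [
    {"zip": "02139", "dob": "1985-03-14", "gender": "F"},
    {"zip": "02139", "dob": "1985-03-14", "gender": "F"},
    {"zip": "02139", "dob": "1990-07-02", "gender": "M"},
    {"zip": "10001", "dob": "1972-11-30", "gender": "F"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "gender")


def unique_fraction(rows, keys=QUASI_IDENTIFIERS):
    """Return the share of rows whose quasi-identifier combination is unique."""
    if not rows:
        return 0.0
    # Count how many rows share each combination of quasi-identifier values.
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique_rows = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique_rows / len(rows)


print(unique_fraction(records))  # 0.5: two of the four sample rows are unique
```

A high fraction suggests that the dataset is re-identifiable even after direct identifiers are removed.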
There is also a critical distinction between sensitive and non-sensitive PII. Sensitive PII (e.g., Social Security numbers, financial account numbers, medical records, biometric identifiers) requires encryption or equivalent protection because exposure causes direct harm.
Non-sensitive PII (e.g., name, email address, phone number) is often publicly available but still regulated under laws like GDPR and CCPA.
The table below breaks out the key examples by category, and a strong data security management program treats both tiers with appropriate controls.
PII is governed by a patchwork of regulations.
Across all frameworks, the obligation is the same: identify it, classify it, and protect it with controls proportional to its sensitivity – and automated data discovery and classification is the only reliable method at enterprise scale.
Protected health information is personally identifiable information that is linked to a healthcare service, diagnosis, treatment, or payment record and is held by a HIPAA-covered entity or business associate.
The HIPAA Privacy Rule (45 CFR § 164.501) defines PHI as individually identifiable health information relating to past, present, or future physical or mental health conditions, healthcare provision, or healthcare payment.
The key distinction: all PHI contains PII, but PII only becomes PHI when linked to a healthcare context under HIPAA jurisdiction. A thorough data classification process must flag this overlap explicitly.
HIPAA's Safe Harbor de-identification method (45 CFR § 164.514) lists 18 specific identifiers that must be removed or de-identified to strip PHI status from a record.
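Those 18 HIPAA identifiers are: names; geographic subdivisions smaller than a state; all elements of dates (except year) directly related to an individual; telephone numbers; fax numbers; email addresses; Social Security numbers; medical record numbers; health plan beneficiary numbers; account numbers; certificate and license numbers; vehicle identifiers and serial numbers, including license plates; device identifiers and serial numbers; web URLs; IP addresses; biometric identifiers, including finger and voice prints; full-face photographs and comparable images; and any other unique identifying number, characteristic, or code.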
Removing all 18 is the threshold for de-identification under Safe Harbor – anything short of that, and the record remains PHI.
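As a rough illustration of the "all or nothing" threshold, the sketch below strips identifier fields from a record and reports any that remain. The field names are hypothetical and the set is deliberately partial; a real implementation would need a mapping that covers all 18 categories, including identifiers buried in free text.

```python
# Hypothetical field names standing in for Safe Harbor identifier categories.
# This set is partial and for illustration only.
IDENTIFIER_FIELDS = {
    "name", "street_address", "birth_date", "phone", "email",
    "ssn", "medical_record_number", "health_plan_id", "ip_address",
}


def strip_identifiers(record, identifier_fields=IDENTIFIER_FIELDS):
    """Return a copy of the record without the listed identifier fields."""
    return {k: v for k, v in record.items() if k not in identifier_fields}


def remaining_identifiers(record, identifier_fields=IDENTIFIER_FIELDS):
    """List identifier fields still present; any hit means the record stays PHI."""
    return sorted(k for k in record if k in identifier_fields)


patient = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis_code": "J45"}
cleaned = strip_identifiers(patient)
print(cleaned)                         # {'diagnosis_code': 'J45'}
print(remaining_identifiers(cleaned))  # []
```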
There is a further distinction that matters for your technical controls.
Electronic protected health information (ePHI) is any PHI that is created, stored, transmitted, or received in electronic form.
While HIPAA's Privacy Rule covers all PHI formats (e.g., paper, oral, and electronic), the HIPAA Security Rule applies specifically to ePHI and mandates technical safeguards, including encryption, access controls, audit logging, and integrity verification.
If your organization stores health records digitally – and in 2026, virtually every covered entity does – the Security Rule's ePHI requirements are your operative compliance standard, and data access controls must be configured to enforce them.
PCI data is cardholder data (CHD) and sensitive authentication data (SAD) as defined by the PCI Security Standards Council (PCI SSC).
Any entity that stores, processes, or transmits cardholder data (e.g., merchants, payment processors, acquirers, issuers, and service providers) must comply with the PCI Data Security Standard (PCI DSS).
Understanding the distinction between CHD and SAD is critical because the storage rules differ, and PCI tokenization strategies must account for both.
Cardholder data includes the Primary Account Number (PAN), cardholder name, expiration date, and service code.
These elements can be stored post-authorization if they are protected in accordance with PCI DSS requirements, but the PAN is the defining element.
If a record includes a PAN, PCI DSS applies. Sensitive authentication data (e.g., full magnetic stripe data, CAV2/CVC2/CVV2/CID codes, and PINs or PIN blocks) must never be stored after authorization, regardless of the encryption method applied.
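The CHD and SAD storage rules can be sketched as a pre-storage filter. This is a minimal example under stated assumptions: the field names are hypothetical, and the display mask shows the first six and last four digits, which is a common convention rather than a rule quoted from this article. In practice the full PAN would be tokenized or otherwise protected rather than simply dropped.

```python
# Hypothetical field names for sensitive authentication data (SAD).
SAD_FIELDS = {"track_data", "cvv", "pin", "pin_block"}


def mask_pan(pan):
    """Mask a PAN, keeping only the first six and last four digits visible."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    if len(digits) < 13:
        raise ValueError("PAN too short to mask safely")
    return digits[:6] + "*" * (len(digits) - 10) + digits[-4:]


def prepare_for_storage(auth_record):
    """Drop SAD fields after authorization and keep only a masked PAN."""
    stored = {k: v for k, v in auth_record.items() if k not in SAD_FIELDS}
    stored["pan_masked"] = mask_pan(stored.pop("pan"))
    return stored


record = {"pan": "4111 1111 1111 1111", "cardholder_name": "J. Doe",
          "expiry": "12/30", "cvv": "123"}
print(prepare_for_storage(record))
```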
The table below maps these categories to their storage and protection rules.
PCI DSS 4.0 is now fully enforceable.
As of March 31, 2025, all 64 future-dated requirements that were previously listed as best practices became mandatory.
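The 12 PCI DSS requirements, the compliance backbone, are organized into six objectives: build and maintain a secure network and systems; protect account data; maintain a vulnerability management program; implement strong access control measures; regularly monitor and test networks; and maintain an information security policy.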
For Level 1 merchants processing over six million transactions annually, compliance requires a full Report on Compliance (RoC) – and understanding your SAQ type is the first step for everyone else.
The three data types share a common thread – i.e., they all contain information about people – but they diverge in scope, regulatory authority, and the consequences of mishandling them.
Your data protection and data classification strategy depends on understanding these distinctions.
The table below is the core reference for your data security and compliance teams.
The critical takeaway: PII is the umbrella. PHI and PCI data are specialized subsets that inherit all PII obligations and add their own regulatory layer on top.
If you protect PII properly, you have a data protection foundation – but you still need the additional controls specific to HIPAA and PCI DSS, including data loss prevention (DLP) capabilities that detect and block unauthorized movement of each data type.
The data security platforms guide breaks down how unified platforms address all three.
In theory, PII, PHI, and PCI data sit in distinct regulatory buckets. In practice, your databases, CRMs, and data lakes store them together – often in the same record.
That convergence is where most compliance failures originate, and it is why data discovery and classification must precede any protection strategy.
Consider healthcare billing.
A patient checks into a hospital, provides their name, date of birth, and insurance ID (PII), receives a diagnosis and treatment plan (PHI), and pays their co-pay with a credit card (PCI data).
That single transaction creates a record governed by HIPAA, PCI DSS, and state breach notification laws simultaneously. If that record is compromised, you face three parallel compliance response obligations – not one.
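The billing scenario above can be sketched as a lookup from record fields to the regimes they trigger. The field names and the mapping are hypothetical, chosen only to mirror the example; real classification depends on context, not just field names.

```python
# Hypothetical mapping from field name to data category.
FIELD_CATEGORY = {
    "name": "PII",
    "date_of_birth": "PII",
    "insurance_id": "PII",
    "diagnosis": "PHI",
    "treatment_plan": "PHI",
    "card_number": "PCI",
}

# Regimes named in the example above, keyed by data category.
CATEGORY_REGIME = {
    "PII": "state breach notification laws",
    "PHI": "HIPAA",
    "PCI": "PCI DSS",
}


def regimes_for(record):
    """Return the sorted list of regimes triggered by the fields in a record."""
    categories = {FIELD_CATEGORY[f] for f in record if f in FIELD_CATEGORY}
    return sorted(CATEGORY_REGIME[c] for c in categories)


visit = {"name": "Jane Doe", "diagnosis": "asthma", "card_number": "4111111111111111"}
print(regimes_for(visit))
```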
Insurance claims present the same convergence.
A health insurer processing a claim handles member PII (name, SSN, address), diagnosis and procedure codes (PHI), and payment routing information that may include cardholder data when reimbursements go to patient credit cards.
E-commerce adds another layer: an online pharmacy collecting customer PII and PCI data may also collect health-related purchase data that, depending on jurisdiction and business associate agreements, triggers PHI classification obligations.
The operational problem is that most enterprises treat these as separate compliance workstreams – i.e., separate tools, separate audits, separate budgets.
According to IBM's 2025 Cost of a Data Breach Report, the global average breach cost reached $4.44 million, with organizations using fragmented security tooling paying significantly more than those with unified platforms.
A single data security platform that classifies and protects all three data types through one control layer eliminates that fragmentation.
Every data type in this comparison carries distinct penalties, and the numbers have increased in 2026. Your compliance team needs current figures, not last year's, when calculating risk exposure and justifying data protection investments.
The U.S. Department of Health and Human Services (HHS) updated HIPAA penalty amounts effective January 28, 2026, applying a cost-of-living adjustment multiplier of 1.02598.
The four-tier penalty structure, as reported by HIPAA Journal, now stands at:
Before inflation adjustment, annual caps range from $25,000 (Tier 1) to $1.5 million (Tier 4); with adjustment, the top cap is the roughly $2.19 million cited above. De-identifying PHI through tokenization reduces your regulatory surface area because de-identified records fall outside HIPAA's definition of PHI.
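The cost-of-living adjustment is a single multiplication. The sketch below applies the 1.02598 multiplier to a prior-year amount; the input figure is hypothetical (not an actual penalty tier), and rounding to the nearest dollar is an assumption about how the adjusted figure is published.

```python
from decimal import Decimal, ROUND_HALF_UP

MULTIPLIER = Decimal("1.02598")  # 2026 multiplier cited in the text


def adjust_for_inflation(prior_amount):
    """Apply the multiplier and round to the nearest dollar (assumed rounding)."""
    adjusted = Decimal(prior_amount) * MULTIPLIER
    return adjusted.quantize(Decimal("1"), rounding=ROUND_HALF_UP)


# Hypothetical prior-year amount, used only to show the arithmetic.
print(adjust_for_inflation(100_000))  # 102598
```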
PCI DSS is an industry standard, not a government regulation, but the financial consequences are severe.
According to PCI DSS Guide, acquiring banks impose escalating monthly fines: $5,000–$10,000 per month for the first three months, $25,000–$50,000 for months four through six, and up to $100,000 per month beyond six months.
Post-breach costs add $50,000 to over $500,000 for forensic investigation, remediation, and card reissuance.
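For risk modeling, the monthly fine schedule quoted above can be turned into a cumulative upper bound. This sketch uses only the top of each quoted range and is an illustration, not a statement of what any acquiring bank will actually charge; post-breach costs are excluded.

```python
def max_cumulative_fine(months_non_compliant):
    """Sum the top of each quoted monthly range over the given number of months."""
    total = 0
    for month in range(1, months_non_compliant + 1):
        if month <= 3:
            total += 10_000    # months 1-3: up to $10,000 per month
        elif month <= 6:
            total += 50_000    # months 4-6: up to $50,000 per month
        else:
            total += 100_000   # month 7 onward: up to $100,000 per month
    return total


print(max_cumulative_fine(6))   # 180000
print(max_cumulative_fine(12))  # 780000
```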
With all 64 future-dated requirements now mandatory since March 31, 2025, the single most effective step to reduce your PCI audit scope is tokenizing cardholder data so PANs never enter your environment.
Article 83 of the GDPR allows supervisory authorities to impose fines of up to €20 million or 4% of global annual revenue, whichever is higher, for the most serious infringements.
GDPR defines "personal data" more broadly than most PII definitions, covering any information relating to an identified or identifiable natural person – including online identifiers, location data, and cookie IDs.
Pseudonymization, including tokenization, is explicitly recognized under GDPR Article 4(5) as a risk-reduction measure, though it does not entirely exempt data from the GDPR's scope.
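The "whichever is higher" rule for the most serious infringements reduces to a simple maximum. The sketch below computes that ceiling from a revenue figure; the function name and the sample revenues are hypothetical, and actual fines are set case by case below this ceiling.

```python
def gdpr_fine_ceiling_eur(global_annual_revenue_eur):
    """Return the higher of EUR 20 million or 4% of global annual revenue."""
    four_percent = global_annual_revenue_eur * 4 // 100
    return max(20_000_000, four_percent)


print(gdpr_fine_ceiling_eur(100_000_000))    # 20000000 (flat amount is higher)
print(gdpr_fine_ceiling_eur(1_000_000_000))  # 40000000 (4% is higher)
```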
The California Privacy Rights Act (CPRA), which amended the original CCPA, imposes penalties of $2,500 per unintentional violation and $7,500 per intentional violation. It also grants consumers a private right of action for data breaches: $100 to $750 per consumer per incident.
For organizations handling California residents' data alongside cardholder data and health records, the compliance obligation stacks across all applicable frameworks.
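The private right of action scales with the number of affected consumers, which is why breach size dominates the exposure estimate. A minimal sketch of the arithmetic, using the $100 to $750 range quoted above; the function name and sample count are hypothetical.

```python
def cpra_statutory_damages_range(affected_consumers, incidents=1):
    """Return (low, high) statutory damages in USD for a breach."""
    low = affected_consumers * incidents * 100
    high = affected_consumers * incidents * 750
    return low, high


print(cpra_statutory_damages_range(10_000))  # (1000000, 7500000)
```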
The Gramm-Leach-Bliley Act (GLBA) governs non-public personal information (NPI) held by financial institutions – a category that overlaps heavily with PII but carries its own penalties: up to $100,000 per violation for institutions and $10,000 per violation for individuals.
NPI includes account numbers, transaction history, and credit data.
Financial institutions subject to GLBA, CCPA, and PCI DSS simultaneously face the densest regulatory overlap in any sector, and a unified data security approach is the only way to manage it without multiplying audit costs.
Protecting all three data types requires a layered data protection strategy that encompasses data classification, tokenization vs. encryption decisions, data masking, and data loss prevention controls.
The five steps below apply across PII, PHI, and PCI data, and each step builds on the one before it. A data security best practices framework starts here.
You cannot protect data you have not identified.
Automated data discovery tools scan databases, file shares, cloud storage, SaaS applications, and mainframe environments to locate PII, PHI, and PCI data wherever it resides – including in systems your team may have forgotten about.
Classification engines then categorize each discovered record using pattern recognition, AI models, and custom rulesets.
The output is a map: here is where your PII lives, here is where your PHI lives, here is where your PCI data lives, and here is where they overlap.
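The pattern-recognition part of classification can be sketched with regular expressions plus a Luhn checksum to cut false positives on card numbers. This is a simplified illustration, not how any particular discovery product works: the SSN pattern is format-only, and the "MRN" pattern is a hypothetical stand-in for a site-specific medical record number format.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PAN_CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")
MRN_PATTERN = re.compile(r"\bMRN[:# ]?\d{6,10}\b")  # hypothetical format


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    checksum = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        checksum += digit
    return bool(digits) and checksum % 10 == 0


def classify_text(text):
    """Return the set of data categories detected in a text value."""
    labels = set()
    if SSN_PATTERN.search(text):
        labels.add("PII")
    if MRN_PATTERN.search(text):
        labels.add("PHI")
    if any(luhn_valid(m.group()) for m in PAN_CANDIDATE.finditer(text)):
        labels.add("PCI")
    return labels


sample = "SSN 123-45-6789, MRN 00123456, card 4111 1111 1111 1111"
print(sorted(classify_text(sample)))
```

A value that returns more than one label is exactly the kind of overlap the map needs to surface.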
According to IBM's 2025 report, breaches involving shadow data (data that organizations did not know existed) cost 16% more than average breaches.
Most enterprises also have dark data – information stored in legacy systems, archived databases, or decommissioned applications that were never inventoried.
A discovery tool that covers mainframe environments, cloud, and on-premises stores is the only way to achieve complete visibility.
Tokenization replaces sensitive data elements (e.g., PANs, Social Security numbers, medical record numbers) with tokens that retain format but carry no exploitable value and cannot be mathematically reversed to the original.
The original data is stored in an isolated token vault, and the tokens flowing through your systems are meaningless to an attacker.
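A vault-based scheme can be sketched in a few lines. This is an in-memory toy under stated assumptions: the class and method names are hypothetical, the "vault" is a dictionary rather than an isolated, access-controlled store, and only digits are replaced so separators and length are preserved.

```python
import secrets


class TokenVault:
    """In-memory stand-in for an isolated token vault (illustrative only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return a random, format-preserving token for the value."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        if not any(ch.isdigit() for ch in value):
            raise ValueError("value has no digits to tokenize")
        while True:
            # Replace each digit with a random digit; keep separators as-is.
            token = "".join(
                secrets.choice("0123456789") if ch.isdigit() else ch
                for ch in value
            )
            if token != value and token not in self._token_to_value:
                break
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Look up the original value; only the vault can do this."""
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("123-45-6789")
print(token)                    # random digits in the same NNN-NN-NNNN shape
print(vault.detokenize(token))  # 123-45-6789
```

Because the token is random rather than derived from the value, recovering the original requires access to the vault itself.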
For PCI data, the scope implications are decisive.
The PCI Security Standards Council treats an encrypted PAN as equivalent to cleartext for scoping purposes because encryption is reversible: if you hold the key, you can produce the original PAN.
Tokenized PANs, by contrast, exit PCI DSS scope entirely when the tokenization system meets PCI SSC requirements. That scope reduction translates directly to simpler SAQ questionnaires, smaller audit surface, and lower compliance costs.
For PHI, tokenization enables HIPAA Safe Harbor de-identification.
When the 18 HIPAA identifiers are replaced with tokens, the resulting record no longer qualifies as PHI under HIPAA, reducing your regulatory obligations for that data.
For PII broadly, tokenized records reduce the impact of a breach to near zero because exposed tokens cannot be reversed to recover the original data. A single tokenization platform can protect PII, PHI, and PCI data simultaneously across the same infrastructure.
Encryption is the baseline protection layer across every framework. AES-256 for data at rest and TLS 1.2+ for data in transit are the minimum standards expected by HIPAA, PCI DSS, and GDPR.
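For the in-transit half of that baseline, Python's standard `ssl` module can enforce a TLS 1.2 floor on client connections. This is a minimal sketch; the function name is hypothetical, and encryption at rest is out of scope here because it depends on your storage layer.

```python
import ssl


def make_client_context():
    """Build a client TLS context that refuses anything below TLS 1.2."""
    context = ssl.create_default_context()  # verifies certificates and hostnames
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


ctx = make_client_context()
print(ctx.minimum_version)
```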
There is a critical nuance: encryption protects data from unauthorized access, but it does not reduce PCI DSS compliance scope.
As the PCI SSC's tokenization guidance clarifies, encrypted cardholder data remains in scope because the encryption is reversible.
For HIPAA, the Security Rule lists encryption as an "addressable" safeguard for ePHI – not technically mandatory, but the HHS Office for Civil Rights expects it, and failing to encrypt without a documented alternative explanation is a finding in most audits.
The takeaway: encryption is necessary but not sufficient. The tokenization vs. encryption question is not either/or: pair tokenization with encryption and masking for maximum scope reduction and data protection.
Data access control enforces who can see and interact with sensitive data.
Role-based access control (RBAC) and least-privilege principles are required across all three frameworks: HIPAA's minimum necessary standard, PCI DSS Requirement 7 (restrict access by business need-to-know), and GDPR's data minimization principle.
Audit logging (who accessed what, when, and from where) is mandatory under HIPAA's Security Rule (audit controls), PCI DSS Requirement 10 (log and monitor all access to cardholder data), and GDPR's accountability principle.
Real-time monitoring and anomaly detection are the operational layer that turns static logs into actionable threat intelligence – the foundation of any data loss prevention program.
Organizations that deploy data masking alongside access controls add a further safeguard: even authorized users see only the data elements they need, with sensitive fields dynamically masked based on role and context.
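Role-based access, dynamic masking, and audit logging fit together naturally in a single read path. The sketch below is a simplified illustration: the roles, field names, and in-memory log are all hypothetical, and a production system would write to tamper-evident storage and record the request origin as well.

```python
from datetime import datetime, timezone

# Hypothetical role-to-field policy (least privilege: deny by default).
ROLE_VISIBLE_FIELDS = {
    "billing_clerk": {"name", "pan_last4", "balance"},
    "nurse": {"name", "diagnosis"},
}

audit_log = []  # stand-in for an append-only audit store


def read_record(user, role, record):
    """Return a role-masked view of the record and log the access."""
    allowed = ROLE_VISIBLE_FIELDS.get(role, set())
    view = {k: (v if k in allowed else "****") for k, v in record.items()}
    audit_log.append({
        "user": user,
        "role": role,
        "fields_returned": sorted(k for k in record if k in allowed),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return view


patient = {"name": "Jane Doe", "diagnosis": "asthma", "pan_last4": "1111"}
print(read_record("u42", "nurse", patient))
```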
A breach response plan is required by HIPAA (notification within 60 days to HHS for breaches affecting 500+ individuals), PCI DSS (immediate notification to the acquiring bank and card brands with forensic investigation), and GDPR (72-hour notification to the supervisory authority).
Without a tested plan, your organization's response will be reactive, slow, and expensive, and the leading data breach risks compound when multiple regulatory clocks start simultaneously.
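The parallel regulatory clocks can be tracked from a single discovery timestamp. This sketch covers only the two fixed windows quoted above; PCI DSS notification is described as immediate and so has no window to compute. The dictionary labels and function name are hypothetical, and real deadlines depend on legal advice about when a breach counts as "discovered".

```python
from datetime import datetime, timedelta, timezone

# Fixed notification windows as quoted in the text.
NOTIFICATION_WINDOWS = {
    "GDPR supervisory authority": timedelta(hours=72),
    "HIPAA HHS (500+ individuals)": timedelta(days=60),
}


def notification_deadlines(discovered_at):
    """Return the latest notification time for each regime."""
    return {
        regime: discovered_at + window
        for regime, window in NOTIFICATION_WINDOWS.items()
    }


discovered = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
for regime, deadline in notification_deadlines(discovered).items():
    print(regime, deadline.isoformat())
```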
According to IBM's 2025 Cost of a Data Breach Report, organizations that regularly conduct tabletop exercises reduce their average breach cost by $232,000.
Healthcare breaches take the longest to identify and contain: 279 days on average, five weeks longer than the global average.
A documented, rehearsed incident response plan is the difference between a contained event and a catastrophic data breach that triggers all three regulatory response tracks simultaneously.
Most organizations approach PII, PHI, and PCI compliance as separate workstreams.
They purchase separate tools, engage separate auditors, and maintain separate policy libraries for each regulatory domain. That fragmentation is the primary cost multiplier, and it is the pattern that a unified data security platform eliminates.
A single tokenization layer applied at the data level (across mainframes, cloud databases, SaaS applications, and hybrid environments) can satisfy PCI DSS scope reduction, HIPAA Safe Harbor de-identification, and GDPR pseudonymization requirements simultaneously.
According to IBM's 2025 Cost of a Data Breach Report, the global average breach cost is $4.44 million, and healthcare leads all industries at $7.42 million. Organizations that consolidate their security tooling spend less per breach and detect breaches faster than those running fragmented, siloed tools.
Healthcare organizations processing payments are the clearest example.
A single patient record containing PII (name, address, SSN), PHI (diagnosis codes, treatment history), and PCI data (payment card for co-pays) passes through a single tokenization engine rather than three separate compliance tools.
The result: fewer tools, fewer audits, faster compliance cycles, lower total cost – and a data protection architecture that scales without tripling your vendor roster every time a new regulation takes effect.
DataStealth is a data security platform purpose-built for enterprises managing overlapping sensitive data obligations across hybrid environments.
Bilal is the Content Strategist at DataStealth. He's a recognized defence and security analyst who's researching the growing importance of cybersecurity and data protection in enterprise-sized organizations.