Is it safe to put sensitive data into ChatGPT? Not by default: prompts can be stored, trained on, and exposed. See the enterprise risks and how to fix it.

Is it safe to put sensitive data into ChatGPT? In short, no, at least not by default.
On the consumer tiers (i.e., Free, Plus, and Go), anything typed into the prompt can be stored, reviewed by a person, and used to train future models, which means sensitive data can surface well beyond the control of whoever entered it. The takeaway is that consumer ChatGPT serves as a public channel for confidential information, and it is safest to treat it as such.
The business tiers (Team, Enterprise, and the API) change the training default, and one stronger control changes the picture outright, i.e., neutralizing the data before it reaches the model. This guide on AI data security will cover the following on how to achieve that stronger data security posture:
Most of the confusion about whether ChatGPT is safe and whether it keeps anything confidential stems from a gap between what users assume happens to a prompt and what actually happens. The assumption is that a chat is ephemeral, much like a search query that vanishes once the tab is closed.
The reality is closer to email sitting on a server: the text is retained, can be read, and, on the consumer side, can be reused.
The data privacy answer, therefore, depends on which tier is in use, and the defaults are not especially generous for consumers. Closing that gap is the first step in reducing the data sprawl that quietly pushes sensitive information into tools the security team never sees.
It does. ChatGPT saves the inputs that are typed into it, and on consumer accounts, those conversations are retained on OpenAI's servers, where they can be reviewed by authorized staff.
Deleting a chat helps, though it is not quite the clean break most users imagine, i.e., OpenAI's standard practice is to remove deleted conversations within roughly 30 days, unless a legal obligation requires it to hold them longer.
That caveat is easy to miss, and it is precisely the kind of detail that turns an ordinary prompt into a leading data breach risk the moment confidential information is involved.
Retention and confidentiality are easily conflated, and they are not the same thing.
A 2025 court order in the New York Times litigation forced OpenAI to preserve even deleted ChatGPT conversations for a period before the company returned to its standard 30-day deletion practice. The lesson is that what happens to the data is partly outside OpenAI's hands once litigation or law enforcement enters the picture.
In other words, a data security platform is wise to treat any external service as untrusted by default, since the service's own retention promises can be overridden.
On the consumer tiers, the inputs can be used to improve future models unless the user explicitly opts out. For business products, the default is the opposite: OpenAI states that it does not train on inputs or outputs from ChatGPT Business, Enterprise, or the API by default.
This single distinction does a lot of work, because it is the difference between data privacy being a contractual guarantee and being a setting the user has to remember to toggle.
That is why the blunt question of whether ChatGPT is safe yields different answers at each tier.
Consumer ChatGPT may use the data for training and is governed by personal settings, whereas ChatGPT Team, Enterprise, Edu, and the API are contractually excluded from training and add administrative controls.
For the broader version of that question, our companion guide on ChatGPT enterprise security works through the tier differences in more detail.
The prompts are not the only thing captured, even if they are the obvious part.
OpenAI also collects metadata alongside the conversation content – e.g., the IP address, device and browser details, and broad usage patterns.
For a team building a data classification program, the takeaway is that the exposure surface is wider than the text of any single message, and it grows a little further every time an employee reaches for the tool.
The risk here is neither hypothetical nor rare, which is the part that tends to surprise people. Cyberhaven's 2025 AI Adoption and Risk Report found that 34.8% of the corporate data employees put into AI tools is now sensitive, up from 10.7% two years earlier.
Put differently, once more than a third of what staff paste is confidential, the act of putting sensitive data into ChatGPT stops being an edge case and becomes the dominant shadow AI exposure pattern.
Some categories of sensitive data simply do not belong in a consumer chatbot.
The list is familiar enough, e.g., personally identifiable information (PII), protected health information (PHI), payment card data, login credentials, source code and intellectual property, financial records, and confidential client or company information.
The reason the list matters is that each item maps to a different regulator, which is also why a single data de-identification control has to be broad enough to cover all of them at once.
It helps to be precise about the three categories that drive most enterprise compliance, because they overlap in ways that multiply the exposure.
PII is any data that identifies a person; PHI is the regulated subset of PII tied to healthcare under HIPAA; and payment data is cardholder data governed by PCI DSS.
The awkward part is that a single record can be all three at once, e.g., a patient paying a co-pay by card, as our breakdown of PII vs. PHI vs. PCI lays out. One careless prompt can therefore trip three separate penalty regimes rather than one.
In 2023, Samsung engineers pasted proprietary source code and internal meeting notes into ChatGPT, and the company responded by restricting employee use of the tool shortly afterward. The detail that matters is who was involved: skilled staff using a productivity tool in good faith, not careless juniors ignoring a memo.
That is the reason employee-discipline strategies tend to fail at scale, and why managed data security has to operate at the data layer instead of at the level of individual judgment.
The exposure does not stop at the prompt, either. Infostealer malware now harvests saved ChatGPT credentials by the hundreds of thousands, and a clear majority of employees reach AI through personal accounts that bypass enterprise logging and retention controls entirely.
The trouble with a personal account is that it makes the shadow data problem invisible to the very people who are accountable for protecting it.
ChatGPT is also no longer a simple text box, which widens the problem in less obvious ways.
Connected apps and retrieval-augmented generation (RAG) integrations let the model reach into SaaS systems that may already be over-permissioned, so sensitive data can leak without anyone pasting anything at all.
This is the upstream version of the risk, i.e., the data moves on its own, and it is the part most teams underestimate when they picture a data breach risk as someone typing a secret into a box.
Prompt injection adds another vector, in which instructions hidden within a document or web page hijack the model into leaking the context it can see.
Stack malicious browser extensions and third-party breaches in the AI supply chain on top, and the underlying logic becomes clear, i.e., controls that protect the data itself outlast any control that merely watches the perimeter, which is the thinking behind data-centric enforcement.
Putting regulated data into ChatGPT is a compliance event in its own right, not only a security one. IBM's 2025 Cost of a Data Breach Report puts the global average breach at $4.44 million, with shadow-AI-related breaches averaging $4.63 million, i.e., roughly $670,000 more per incident than a conventional one.
The same report is fairly blunt about why this keeps happening. It found that 97% of organizations with an AI-related breach lacked proper AI access controls, and that 63% had no AI governance policy at all, which is the reason data access control belongs in the same conversation as data privacy rather than in a separate one.
Regulators have since caught up with the behaviour. High-risk obligations under the EU AI Act take effect on 2 August 2026, and the Act's headline penalties are steep: up to €35 million or 7% of global turnover for prohibited practices, with high-risk non-compliance carrying up to €15 million or 3% of global turnover.
Layer GDPR, HIPAA, and PCI DSS on top of that, and a single careless prompt can become a multi-regulator problem, which is the sort of outcome a data protection platform is built to head off before it starts.
Opting out feels like the obvious fix, and it is also the most common source of false comfort. The settings do help, but they address only one of the three ways the data is exposed, and not the two that matter most for confidential information.
In other words, the toggle solves the training problem while leaving retention and review untouched, which is why the durable answer to the data privacy problem still runs through acting on the data that flows into AI tools rather than adjusting a preference.
The mechanics are simple enough. To stop consumer ChatGPT from training on the inputs, open Settings, go to Data Controls, and turn off 'Improve the model for everyone.'
For one-off sessions, Temporary Chat starts a conversation that does not appear in history, does not create memories, and is not used to train the model.
Both are sensible hygiene practices, and both belong in any data security best-practice baseline, i.e., a starting point rather than the finish line.
Opting out of training does not stop retention or review; OpenAI still stores conversations for about 30 days for abuse monitoring, and those logs persist even when training is disabled.
Opt-out is not the same as deletion, and deletion is not the same as confidentiality, which is the chain of reasoning that makes de-identifying the data before input the only setting that protects it.
A training opt-out changes what OpenAI may do with the data going forward; it does nothing about data that has already been retained, reviewed as part of an abuse investigation, or frozen under a legal hold, like the one in the New York Times case.
If the sensitive value is sitting in a retained conversation, then the only question that still matters is the tokenization one, i.e., whether a breach of that store actually exposes anything real.
There is a spectrum of controls here, and the honest framing is that they are not equally effective. The popular options reduce risk at the margins, whereas one approach removes it entirely. Ranking them from table stakes up to the structural fix is the quickest way to see where most data security programs stop short of the goal.
The baseline is to move staff onto ChatGPT Team or Enterprise, where inputs are excluded from training by default, and to publish a clear AI-usage policy alongside it.
This is necessary, though it is not sufficient on its own, because a policy documents awareness rather than prevents exposure, and a tier does nothing the moment an employee opens a personal account.
That gap between policy and behaviour is why shadow AI keeps growing, even at companies that believe they have already banned it.
'Just don't paste it' rests on perfect employee judgment at scale, and the 34.8% figure is fairly direct evidence that the judgment does not hold across thousands of prompts a day.
Telling people to be careful is not really a control; it is a hope, and it leaves confidential, sensitive data exposed whenever someone is in a hurry, which is one of the leading data-breach risks for any enterprise.
Reactive data loss prevention (DLP) is the next rung up, and it still falls short of the mark. DLP monitors, alerts on, or blocks sensitive data in motion, but it assumes that real data is moving in the first place; it generates a great deal of alert fatigue, and it is routinely bypassed through browsers and personal accounts.
The difference is between watching the doors and emptying the vault, i.e., DLP detection versus data neutralization: the former tells you a leak happened, while the latter ensures there is nothing worth leaking.
The control that actually changes the outcome is to neutralize the data before the prompt leaves the network. Replace PII, PHI, and payment data with format-preserving tokens, and ChatGPT only ever receives valueless substitutes, while the real values never cross the perimeter at all.
This is the data-centric approach that no amount of employee discipline or after-the-fact detection can match, because it works on the data rather than on the people or the network around it.
The tokens themselves are deterministic and format-preserving, so the AI workflows still function normally on the substituted values.
The useful property is that a token holds no exploitable value and has no mathematical path back to the original, i.e., an exfiltrated token is worthless, and under PCI DSS, tokenized data leaves audit scope entirely.
These controls are easy to confuse, and the differences are exactly what decides the breach exposure. Encryption transforms data with a reversible key, so a stolen key (e.g., one lifted in a breach) exposes the data again, and encrypted card data remains within PCI scope.
Tokenization replaces the data with a valueless surrogate that has no key to steal and falls out of compliance scope, which is the structural reason it tends to beat encryption for this particular use case.
DataStealth sits at the network layer, inline between users and the destinations they send data to, intercepting traffic before it leaves the trust boundary.
It deploys without agents, code changes, or API integrations, so it protects data across mainframes, databases, cloud, and SaaS from a single data security platform – i.e., it slots in through a simple configuration change inside the trust boundary.
For the ChatGPT case specifically, the platform identifies the sensitive elements in a prompt in real time and replaces them with format-preserving tokens before the data leaves the network.
The model then receives neutralized substitutes, and the sensitive value never crosses the perimeter, which means that retention, training, and any future breach of OpenAI's systems have nothing real to expose.
That capability spans the compliance obligations of a single architecture rather than several.
Tokenized cardholder data falls outside PCI DSS scope, de-identified PHI reduces HIPAA exposure, and the same controls support GDPR and EU AI Act audit requirements, as our tokenization versus encryption analysis sets out.
The bigger question then shifts in a useful way, i.e., rather than asking whether to ban AI, the organization can ask how to make it safe to use by default.
What most security teams miss is a subtler point than 'can the tool be trusted.' The more useful question is whether the data needs to be real when it reaches ChatGPT in the first place, and with inline tokenization, the answer is that it does not.
Once the value is a token, one can see the whole debate about retention, training, and breach quietly losing its force, because there is no longer a real secret on the other side of the prompt.
The takeaway is that productivity and data security are not really opposed, provided the data is protected before it travels. DataStealth maps onto the specific risks this guide has worked through:
DataStealth is a data security platform (DSP) that allows organizations to discover, classify, and protect their most sensitive data and documents, ensuring that sensitive data and documents are secure and meet applicable regulatory requirements.